
If you’re interested in diving into Reinforcement Learning, OpenAI Gym stands out as a leading platform for creating environments to train your agents. It provides a plethora of environments that serve as benchmarks for testing any new research methodology right out of the box, and it offers a user-friendly API that lets you create your own environments smoothly.
This article will walk you through the fundamental components of OpenAI Gym. Below is a summary of the topics that will be discussed.
Prerequisites
- Python: Basic knowledge of Python is needed to follow along.
- OpenAI Gym: You should have access to the OpenAI Gym environment and its packages.
Topics Covered
- Installation
- Environments
- Spaces
- Wrappers
- Vectorized Environments
Now, let’s jump in.
Installation
The first step is to ensure you have the latest version of gym installed. You can install gym using either conda or pip. Here, we’ll use pip.
pip install -U gym
Environments
The core component of OpenAI Gym is the Env class. This class acts as a simulator for the environment you want your agent to train in. OpenAI Gym comes with a variety of environments, such as driving a car up a hill, balancing a pendulum, or playing Atari games. It also gives you the ability to create custom environments as needed.
We’ll start with an environment named MountainCar, where the goal is to drive a car up a mountain situated between two hills. The car must build up enough momentum to reach the flag on the peak to the right, but its engine lacks sufficient power to make it in a single attempt, so the strategy is to drive back and forth to gather momentum.
The objective of the Mountain Car Environment is to gain momentum until the flag is reached.
import gym
env = gym.make('MountainCar-v0')
The structural details of the environment are represented by the observation_space and action_space attributes of the Gym Env class.
The observation_space outlines the format and the acceptable values for observations of the environment’s state. The form of the observation can vary across environments: it is often a visual representation of the game, but other forms, such as vector representations of environment features, are also possible.
Conversely, the action_space defines the numerical format of the permissible actions that can be executed within the environment.
# Observation and action space
obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))
OUTPUT:
The observation space: Box(2,)
The action space: Discrete(3)
The observations in the Mountain Car environment consist of two values: the car’s position and its velocity. The midpoint between the hills is taken as the origin, with positions to the right being positive and positions to the left negative.
The observation and action spaces are represented by the Box and Discrete classes, respectively. These classes are among the various data structures provided by gym for implementing observation and action spaces tailored to different scenarios. We will delve deeper into them later in the article.
Interacting with the Environment
This section covers the functions of the Env class that facilitate the agent’s interaction with the environment. Two essential functions are:
- reset: Resets the environment to its initial state and returns the initial observation.
- step: Accepts an action as input, applies it to the environment, and transitions the environment to a new state. The step function returns four items:
  - observation: The observation of the new state.
  - reward: The reward obtained from executing the provided action.
  - done: Indicates whether the episode has ended. If true, you may need to either end the simulation or reset the environment to start a new episode.
  - info: Supplies additional information depending on the environment, such as remaining lives, or other details useful for debugging.
Let’s illustrate these principles through an example. We will start by resetting the environment, checking the observation, applying an action, and observing the result.
import matplotlib.pyplot as plt
# Reset the environment to check the initial state
obs = env.reset()
print("The initial observation is {}".format(obs))
# Sample a random action
random_action = env.action_space.sample()
# Take the action and receive updated observations
new_obs, reward, done, info = env.step(random_action)
print("The new observation is {}".format(new_obs))
OUTPUT:
The initial observation is [-0.48235664 0.]
The new observation is [-0.48366517 -0.00130853]
Unlike many environments, the observation here is not a screenshot of the activity being performed. However, if you want to visualize the current state of the environment, you can use the render method.
env.render(mode="human")
This command will open a pop-up window showing the environment’s current state. You can close the window by calling the close function.
env.close()
If you prefer to capture the game state as an image instead of viewing it in a pop-up window, set the mode argument of the render method to rgb_array.
env_screen = env.render(mode='rgb_array')
env.close()
import matplotlib.pyplot as plt
plt.imshow(env_screen)
OUTPUT: a rendered frame of the MountainCar environment.
Combining all the previous code snippets, a typical setup for running your agent in the MountainCar environment may look like the following. Currently, random actions are taken, but a more intelligent agent could use the observations for decision making.
import time
# Number of steps to run the agent
num_steps = 1500
obs = env.reset()
for step in range(num_steps):
    # Take a random action or implement a more intelligent decision
    action = env.action_space.sample()
    # Execute the action
    obs, reward, done, info = env.step(action)
    # Render the environment
    env.render()
    # Allow some time before the next frame
    time.sleep(0.001)
    # If the episode has ended, reset for a new one
    if done:
        env.reset()
# Close the environment
env.close()
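As a minimal illustration of using the observations for decision making (a hand-rolled heuristic, not something provided by gym), the sketch below simply pushes the car in the direction it is already moving, which is usually enough to reach the flag. The choose_action helper is hypothetical.
import gym

env = gym.make('MountainCar-v0')

def choose_action(observation):
    # Hypothetical helper: observation is [position, velocity];
    # action 2 pushes right, action 0 pushes left
    position, velocity = observation
    return 2 if velocity > 0 else 0

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(choose_action(obs))
    env.render()
env.close()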
Spaces
The observation_space for our environment was Box(2,), while the action_space was Discrete(3). Understanding what these mean is crucial: both Box and Discrete are types of data structures called “Spaces”, which define the acceptable values for observations and actions.
These structures derive from the gym.Space base class.
type(env.observation_space)
OUTPUT -> gym.spaces.box.Box
Box(n,) refers to an n-dimensional continuous space. In our case n=2, so the observation space is 2-D. The space is also confined by maximum and minimum limits that dictate the legitimate observation values. You can determine these limits via the high and low attributes of the observation space, which here bound the car’s position and velocity.
print("Upper Bound for Env Observation", env.observation_space.high)
print("Lower Bound for Env Observation", env.observation_space.low)
OUTPUT:
Upper Bound for Env Observation [0.6 0.07]
Lower Bound for Env Observation [-1.2 -0.07]
You can also specify these limits yourself when creating a space, for example while designing a custom environment.
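As a minimal sketch (not part of the MountainCar setup above), this is how you might construct a Box space with explicit bounds, sample from it, and check whether a value lies inside it:
from gym import spaces
import numpy as np

# A 2-D continuous space with the same bounds as MountainCar's observation space
custom_space = spaces.Box(low=np.array([-1.2, -0.07]), high=np.array([0.6, 0.07]), dtype=np.float32)
print("A random sample:", custom_space.sample())
print("Contains [0.0, 0.0]?", custom_space.contains(np.array([0.0, 0.0], dtype=np.float32)))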
The Discrete(n) space defines a set of discrete values from 0 to n-1. In our scenario n = 3, meaning our action values can be 0, 1, or 2. Unlike Box, Discrete doesn’t have high and low attributes, as its allowed values are clear by definition.
Submitting an invalid value to the step function (like 4 in our scenario) leads to an error.
# Valid
env.step(2)
print("It works!")
# Invalid
env.step(4)
print("It works!")
OUTPUT: an error is raised for the invalid action.
There are various other spaces available for diverse requirements, such as MultiDiscrete, which allows multiple discrete variables in your observation and action spaces.
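As a brief, illustrative sketch (not tied to any particular environment), a MultiDiscrete space is constructed by passing the number of values each variable can take:
from gym import spaces

# Three discrete variables with 5, 2, and 3 possible values respectively
multi_space = spaces.MultiDiscrete([5, 2, 3])
print("A random sample:", multi_space.sample())  # e.g. [3 0 2]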
Wrappers
The Wrapper class in OpenAI Gym gives you the ability to modify various aspects of an environment to suit your needs. Why would you need such changes? Perhaps you want to normalize the input pixels or clip the output rewards. Although similar modifications could be made by subclassing the environment’s Env class, the Wrapper class offers a more systematic approach.
Before we proceed, let’s explore a more complex environment where the utility of Wrapper will be evident: the Atari game Breakout.
To begin, we need to install the relevant Atari components of gym.
!pip install --upgrade pip setuptools wheel
!pip install opencv-python
!pip install gym[atari]
If you encounter an error like AttributeError: module 'enum' has no attribute 'IntFlag', uninstall the enum34 package and retry the installation.
pip uninstall -y enum34
Let’s see the gameplay of Atari Breakout.
env = gym.make("BreakoutNoFrameskip-v4")
print("Observation Space: ", env.observation_space)
print("Action Space      ", env.action_space)
obs = env.reset()
for i in range(1000):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()
    time.sleep(0.01)
env.close()
OUTPUT:
Observation Space: Box(210, 160, 3)
Action Space Discrete(4)
The observation space is a continuous array of dimensions (210, 160, 3), indicating an RGB pixel observation. Our action space lets us execute four separate actions: Left, Right, Do Nothing, and Fire.
Now that we have our environment up and running, let’s apply some modifications to the Atari environment. In Deep RL practice, it’s common to concatenate the past k frames to construct the observation. We need to adapt the Breakout environment so that the reset and step functions return concatenated observations.
We’ll create a wrapper class of type gym.Wrapper to override these functions of the Breakout Env. The Wrapper class provides a layer on top of an Env class, allowing us to modify its attributes and functions.
The __init__ function receives the environment to be wrapped and the number of past frames to concatenate. Note that the observation space must also be redefined to accommodate the concatenated frames as observations.
In the reset method, since we are just initializing the environment, there are no prior observations to concatenate yet, so we simply repeat the initial observation.
from collections import deque
from gym import spaces
import numpy as np
class ConcatObs(gym.Wrapper):
    def __init__(self, env, k):
        super().__init__(env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(k,) + shp, dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        return np.array(self.frames)
To use our modified environment, we simply wrap our Env in the wrapper we just created.
env = gym.make("BreakoutNoFrameskip-v4")
wrapped_env = ConcatObs(env, 4)
print("The new observation space is", wrapped_env.observation_space)
OUTPUT:
The new observation space is Box(4, 210, 160, 3)
Next, we can confirm if the observations are indeed concatenated.
# Reset the Env
obs = wrapped_env.reset()
print("Initial obs is of the shape", obs.shape)
# Take one step
obs, _, _, _ = wrapped_env.step(2)
print("Obs after taking a step is", obs.shape)
OUTPUT:
Initial obs is of the shape (4, 210, 160, 3)
Obs after taking a step is (4, 210, 160, 3)
There’s more to Wrappers than just the vanilla Wrapper class. Gym also provides specific wrappers that target particular elements of the environment, such as observations, rewards, and actions. These are described below.
- ObservationWrapper: Modify the observation via the observation method of the wrapper class.
- RewardWrapper: Adjust the reward via the reward method of the wrapper class.
- ActionWrapper: Alter the action via the action method of the wrapper class.
Now let’s explore a scenario where we will implement the following changes in our environment:
- Normalize pixel observations by 255.
- Clip rewards between 0 and 1.
- Restrict the slider from moving to the left (action 3).
import random
class ObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)

    def observation(self, obs):
        # Normalize observation by 255
        return obs / 255.0

class RewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)

    def reward(self, reward):
        # Clip reward between 0 and 1
        return np.clip(reward, 0, 1)

class ActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)

    def action(self, action):
        if action == 3:
            return random.choice([0, 1, 2])
        else:
            return action
Now we can apply all these wrappers to our environment in a single line of code and verify that all the intended modifications have taken effect.
env = gym.make("BreakoutNoFrameskip-v4")
wrapped_env = ObservationWrapper(RewardWrapper(ActionWrapper(env)))
obs = wrapped_env.reset()
for step in range(500):
    action = wrapped_env.action_space.sample()
    obs, reward, done, info = wrapped_env.step(action)
    # Check if values are correctly normalized
    if (obs > 1.0).any() or (obs < 0.0).any():
        print("Max and min value of observations out of range")
    # Ensure rewards are clipped between 0 and 1
    if reward < 0.0 or reward > 1.0:
        assert False, "Reward out of bounds"
    # Render to confirm the slider does not move left
    wrapped_env.render()
    time.sleep(0.001)
wrapped_env.close()
print("All checks passed")
OUTPUT: All checks passed
If you need to get back to the original Env after applying wrappers, you can use the unwrapped attribute of the Env class. While the Wrapper class may appear to be just another class extending Env, it does keep track of the list of wrappers applied to the base Env.
print("Wrapped Env:", wrapped_env)
print("Unwrapped Env", wrapped_env.unwrapped)
print("Getting the meaning of actions", wrapped_env.unwrapped.get_action_meanings())
OUTPUT:
Wrapped Env: <ObservationWrapper<RewardWrapper<ActionWrapper<TimeLimit<AtariEnv>>>>>
Unwrapped Env: <AtariEnv>
Getting the meaning of actions ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
Vectorized Environments
Many Deep RL algorithms, such as Asynchronous Actor-Critic Methods, make use of parallel threads where each thread runs an instance of the environment to expedite the training process and enhance efficiency.
For this, we will use another library from OpenAI called baselines. This library offers high-performance implementations of many standard Deep RL algorithms, which any new algorithm can be compared against. In addition, baselines provides features to prepare environments in line with the conventions used in OpenAI experiments.
One of those features is a set of wrappers that allow you to run multiple environments simultaneously with a single function call. To begin, we install baselines using the following terminal commands.
git clone https://github.com/openai/baselines
cd baselines
pip install .
Restart your Jupyter notebook, if necessary, for the installed package to be available.
The wrapper we focus on here is SubprocVecEnv, which runs all the environments asynchronously in separate processes. We first create a list of functions that each return the environment we want to run, using a lambda to produce an anonymous function that returns the gym environment.
# Import required packages
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
# Setting the number of environments
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]
# Create a vectorized environment
envs = SubprocVecEnv(envs)
This envs now acts like a single environment, on which we can call the reset and step functions. However, these functions now return an array of observations/actions rather than a single observation/action.
# Get initial state
init_obs = envs.reset()
# Display the number of environments
print("Number of Envs:", len(init_obs))
# Check the shape of one observation
one_obs = init_obs[0]
print("Shape of one Env:", one_obs.shape)
# Prepare action list and apply them to the environment
actions = [0, 1, 2]
obs = envs.step(actions)
OUTPUT:
Number of Envs: 3
Shape of one Env: (210, 160, 3)
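Note that step on a vectorized environment returns batched results, one entry per environment. As a small sketch (shapes assumed from the three Breakout environments above), the return values can be unpacked like this:
# step returns stacked results, one entry per environment
obs, rewards, dones, infos = envs.step([0, 1, 2])
print("Batched observation shape:", obs.shape)  # (3, 210, 160, 3)
print("Rewards for each env:", rewards)         # array of length 3
print("Done flags for each env:", dones)        # array of length 3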
Calling the render function on the vectorized envs displays screenshots of the games in a tiled format.
# Render the environments
import time
# Setting the number of environments
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]
# Create a vectorized environment
envs = SubprocVecEnv(envs)
init_obs = envs.reset()
for i in range(1000):
    actions = [envs.action_space.sample() for i in range(num_envs)]
    envs.step(actions)
    envs.render()
    time.sleep(0.001)
envs.close()
You’ll be greeted with the following visual display.
OUTPUT for render with SubprocVecEnv
You can explore more about vectorized environments in the baselines documentation.
Conclusion
This wraps up Part 1. With the topics covered, you should now be equipped to start training your reinforcement learning agents within the environments provided by OpenAI Gym. What if the specific environment you want to train your agent in isn’t available? If that’s the case, you’re in luck for two reasons!
First, OpenAI Gym allows you to implement your own custom environments. Second, this will be explored in detail in Part 2 of this series. Until then, enjoy your journey into the innovative realm of reinforcement learning using OpenAI Gym!