
If you’re interested in diving into Reinforcement Learning, OpenAI Gym stands out as a leading platform for creating environments to train your agents. It provides a plethora of environments that serve as benchmarks for testing any new research methodology right out of the box, and it offers a user-friendly API that lets you create your own environments smoothly.
This article will walk you through the fundamental components of OpenAI Gym. Below is a summary of the topics that will be discussed.
Prerequisites
- Python: Basic knowledge of Python is needed to follow along.
- OpenAI Gym: You should have access to the OpenAI Gym environment and its packages.
Topics Covered
- Installation
- Environments
- Spaces
- Wrappers
- Vectorized Environments
Now, let’s jump in.
Installation
The first step is to ensure you have the latest version of gym installed. You can install gym using either conda or pip. Here, we’ll use pip.
pip install -U gym
Environments
The core component of OpenAI Gym is the Env class. This class acts as a simulator for the environment you want your agent to train in. OpenAI Gym comes with a variety of environments, such as driving a car up a hill, balancing a pendulum, or playing Atari games. It also gives you the ability to create custom environments as needed.
We’ll start with an environment named MountainCar, where the goal is to drive a car up a mountain situated between two hills. The car must build up enough momentum to reach the flag on the peak to the right, but its engine lacks sufficient power to make it in a single attempt, so the strategy is to drive back and forth to gather momentum.
The objective of the Mountain Car Environment is to gain momentum until the flag is reached.
import gym
env = gym.make('MountainCar-v0')
The structural details of the environment are represented by the observation_space and action_space attributes of the Gym Env class.
The observation_space outlines the format and the acceptable values for observations of the environment’s state. The form of the observation can vary across environments: it is often a visual representation of the game, but other forms, such as vector representations of environment features, are also possible.
Conversely, the action_space defines the numerical format of the permissible actions that can be executed within the environment.
# Observation and action space
obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))
OUTPUT:
The observation space: Box(2,)
The action space: Discrete(3)
The observations in the Mountain Car environment consist of two values: the car’s position and its velocity. The midpoint between the hills is taken as the origin, with positions to the right being positive and positions to the left negative.
The observation and action spaces are represented by the Box and Discrete classes, respectively. These classes are among the various data structures provided by gym for implementing observation and action spaces tailored to different scenarios. We will delve deeper into them later in the article.
Interacting with the Environment
This section covers the functions of the Env class that facilitate the agent’s interaction with the environment. Two essential functions are:
- reset: Resets the environment to its initial state and returns the initial observation.
- step: Accepts an action as input, applies it to the environment, and transitions the environment to a new state. The step function returns four items:
  - observation: The observation of the new state.
  - reward: The reward obtained from executing the provided action.
  - done: Indicates whether the episode has ended. If true, you may need to either end the simulation or reset the environment to start a new episode.
  - info: Supplies additional information depending on the environment, such as remaining lives, or other details useful for debugging.
Let’s illustrate these principles through an example. We will start by resetting the environment, checking the observation, applying an action, and observing the result.
import matplotlib.pyplot as plt
# Reset the environment to check the initial state
obs = env.reset()
print("The initial observation is {}".format(obs))
# Sample a random action
random_action = env.action_space.sample()
# Take the action and receive updated observations
new_obs, reward, done, info = env.step(random_action)
print("The new observation is {}".format(new_obs))
OUTPUT:
The initial observation is [-0.48235664 0.]
The new observation is [-0.48366517 -0.00130853]
Unlike many environments, the observation here is not a screenshot of the activity being performed. However, if you want to visualize the current state of the environment, you can use the render method.
env.render(mode="human")
This command will open a pop-up window showing the environment’s current state. You can close the window by calling the close function.
env.close()
If you prefer to capture the game state as an image instead of viewing it in a pop-up window, set the mode argument of the render method to rgb_array.
env_screen = env.render(mode='rgb_array')
env.close()
import matplotlib.pyplot as plt
plt.imshow(env_screen)
OUTPUT: a rendered frame of the MountainCar environment.
Combining all the previous code snippets, a typical setup for running your agent in the MountainCar environment may look like the following. Currently, random actions are taken, but a more intelligent agent could use the observations for decision making.
import time
# Number of steps to run the agent
num_steps = 1500
obs = env.reset()
for step in range(num_steps):
    # Take a random action or implement a more intelligent decision
    action = env.action_space.sample()
    # Execute the action
    obs, reward, done, info = env.step(action)
    # Render the environment
    env.render()
    # Allow some time before the next frame
    time.sleep(0.001)
    # If the episode has ended, reset for a new one
    if done:
        env.reset()
# Close the environment
env.close()
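As a minimal illustration of using the observations for decision making (a hand-rolled heuristic, not something provided by gym), the sketch below simply pushes the car in the direction it is already moving, which is usually enough to reach the flag. The choose_action helper is hypothetical.
import gym

env = gym.make('MountainCar-v0')

def choose_action(observation):
    # Hypothetical helper: observation is [position, velocity];
    # action 2 pushes right, action 0 pushes left
    position, velocity = observation
    return 2 if velocity > 0 else 0

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(choose_action(obs))
    env.render()
env.close()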
Spaces
The observation_space for our environment was Box(2,), while the action_space was Discrete(3). Understanding what these mean is crucial: both Box and Discrete are types of data structures called “Spaces”, which define the acceptable values for observations and actions.
These structures derive from the gym.Space base class.
type(env.observation_space)
OUTPUT -> gym.spaces.box.Box
Box(n,) refers to an n-dimensional continuous space. In our case n=2, so the observation space is 2-D. The space is also confined by maximum and minimum limits that dictate the legitimate observation values. You can determine these limits via the high and low attributes of the observation space, which here bound the car’s position and velocity.
print("Upper Bound for Env Observation", env.observation_space.high)
print("Lower Bound for Env Observation", env.observation_space.low)
OUTPUT:
Upper Bound for Env Observation [0.6 0.07]
Lower Bound for Env Observation [-1.2 -0.07]
You can also specify these limits yourself when creating a space, for example while designing a custom environment.
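As a minimal sketch (not part of the MountainCar setup above), this is how you might construct a Box space with explicit bounds, sample from it, and check whether a value lies inside it:
from gym import spaces
import numpy as np

# A 2-D continuous space with the same bounds as MountainCar's observation space
custom_space = spaces.Box(low=np.array([-1.2, -0.07]), high=np.array([0.6, 0.07]), dtype=np.float32)
print("A random sample:", custom_space.sample())
print("Contains [0.0, 0.0]?", custom_space.contains(np.array([0.0, 0.0], dtype=np.float32)))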
The Discrete(n) space defines a set of discrete values from 0 to n-1. In our scenario n = 3, meaning our action values can be 0, 1, or 2. Unlike Box, Discrete doesn’t have high and low attributes, as its allowed values are clear by definition.
Submitting an invalid value to the step function (like 4 in our scenario) leads to an error.
# Valid
env.step(2)
print("It works!")
# Invalid
env.step(4)
print("It works!")
OUTPUT: an error is raised for the invalid action.
There are various other spaces available for diverse requirements, such as MultiDiscrete, which allows multiple discrete variables in your observation and action spaces.
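As a brief, illustrative sketch (not tied to any particular environment), a MultiDiscrete space is constructed by passing the number of values each variable can take:
from gym import spaces

# Three discrete variables with 5, 2, and 3 possible values respectively
multi_space = spaces.MultiDiscrete([5, 2, 3])
print("A random sample:", multi_space.sample())  # e.g. [3 0 2]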
Wrappers
The Wrapper class in OpenAI Gym gives you the ability to modify various aspects of an environment to suit your needs. Why would you need such changes? Perhaps you want to normalize the input pixels or clip the output rewards. Although similar modifications could be made by subclassing the environment’s Env class, the Wrapper class offers a more systematic approach.
Before we proceed, let’s explore a more complex environment where the utility of Wrapper will be evident: the Atari game Breakout.
To begin, we need to install the relevant Atari components of gym.
!pip install --upgrade pip setuptools wheel
!pip install opencv-python
!pip install gym[atari]
If you encounter an error like AttributeError: module 'enum' has no attribute 'IntFlag', uninstall the enum34 package and retry the installation.
pip uninstall -y enum34
Let’s see the gameplay of Atari Breakout.
env = gym.make("BreakoutNoFrameskip-v4")
print("Observation Space: ", env.observation_space)
print("Action Space      ", env.action_space)
obs = env.reset()
for i in range(1000):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()
    time.sleep(0.01)
env.close()
OUTPUT:
Observation Space: Box(210, 160, 3)
Action Space Discrete(4)
The observation space is a continuous array of dimensions (210, 160, 3), indicating an RGB pixel observation. Our action space lets us execute four separate actions: Left, Right, Do Nothing, and Fire.
Now that we have our environment up and running, let’s apply some modifications to the Atari environment. In Deep RL practice, it’s common to concatenate the past k frames to construct the observation. We need to adapt the Breakout environment so that the reset and step functions return concatenated observations.
We’ll create a wrapper class of type gym.Wrapper to override these functions of the Breakout Env. The Wrapper class provides a layer on top of an Env class, allowing us to modify its attributes and functions.
The __init__ function receives the environment to be wrapped and the number of past frames to concatenate. Note that the observation space must also be redefined to accommodate the concatenated frames as observations.
In the reset method, since we are just initializing the environment, there are no prior observations to concatenate yet, so we simply repeat the initial observation.
from collections import deque
from gym import spaces
import numpy as np
class ConcatObs(gym.Wrapper):
    def __init__(self, env, k):
        super().__init__(env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(k,) + shp, dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        return np.array(self.frames)
To use our modified environment, we simply wrap our Env in the wrapper we just created.
env = gym.make("BreakoutNoFrameskip-v4")
wrapped_env = ConcatObs(env, 4)
print("The new observation space is", wrapped_env.observation_space)
OUTPUT:
The new observation space is Box(4, 210, 160, 3)
Next, we can confirm if the observations are indeed concatenated.
# Reset the Env
obs = wrapped_env.reset()
print("Initial obs is of the shape", obs.shape)
# Take one step
obs, _, _, _ = wrapped_env.step(2)
print("Obs after taking a step is", obs.shape)
OUTPUT:
Initial obs is of the shape (4, 210, 160, 3)
Obs after taking a step is (4, 210, 160, 3)
There’s more to Wrappers than just the vanilla Wrapper class. Gym also provides specific wrappers that target particular elements of the environment, such as observations, rewards, and actions. These are described below.
- ObservationWrapper: Modify the observation via the observation method of the wrapper class.
- RewardWrapper: Adjust the reward via the reward method of the wrapper class.
- ActionWrapper: Alter the action via the action method of the wrapper class.
Now let’s explore a scenario where we will implement the following changes in our environment:
- Normalize pixel observations by 255.
- Clip rewards between 0 and 1.
- Restrict the slider from moving to the left (action 3).
import random
class ObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)

    def observation(self, obs):
        # Normalize observation by 255
        return obs / 255.0

class RewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)

    def reward(self, reward):
        # Clip reward between 0 and 1
        return np.clip(reward, 0, 1)

class ActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)

    def action(self, action):
        if action == 3:
            return random.choice([0, 1, 2])
        else:
            return action
Now we can apply all these wrappers to our environment in a single line of code and verify that all the intended modifications have taken effect.
env = gym.make("BreakoutNoFrameskip-v4")
wrapped_env = ObservationWrapper(RewardWrapper(ActionWrapper(env)))
obs = wrapped_env.reset()
for step in range(500):
    action = wrapped_env.action_space.sample()
    obs, reward, done, info = wrapped_env.step(action)
    # Check if values are correctly normalized
    if (obs > 1.0).any() or (obs < 0.0).any():
        print("Max and min value of observations out of range")
    # Ensure rewards are clipped between 0 and 1
    if reward < 0.0 or reward > 1.0:
        assert False, "Reward out of bounds"
    # Render to confirm the slider does not move left
    wrapped_env.render()
    time.sleep(0.001)
wrapped_env.close()
print("All checks passed")
OUTPUT: All checks passed
If you need to get back to the original Env after applying wrappers, you can use the unwrapped attribute of the Env class. While the Wrapper class may appear to be just another class extending Env, it does keep track of the list of wrappers applied to the base Env.
print("Wrapped Env:", wrapped_env)
print("Unwrapped Env", wrapped_env.unwrapped)
print("Getting the meaning of actions", wrapped_env.unwrapped.get_action_meanings())
OUTPUT:
Wrapped Env: <ObservationWrapper<RewardWrapper<ActionWrapper<TimeLimit<AtariEnv>>>>>
Unwrapped Env: <AtariEnv>
Getting the meaning of actions ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
Vectorized Environments
Many Deep RL algorithms, such as Asynchronous Actor-Critic Methods, make use of parallel threads where each thread runs an instance of the environment to expedite the training process and enhance efficiency.
For this, we will use another library from OpenAI called baselines. This library offers high-performance implementations of many standard Deep RL algorithms, which any new algorithm can be compared against. In addition, baselines provides features to prepare environments in line with the conventions used in OpenAI experiments.
One of those features is a set of wrappers that allow you to run multiple environments simultaneously with a single function call. To begin, we install baselines using the following terminal commands.
git clone https://github.com/openai/baselines
cd baselines
pip install .
Restart your Jupyter notebook, if necessary, for the installed package to be available.
The wrapper we focus on here is SubprocVecEnv, which runs all the environments asynchronously in separate processes. We first create a list of functions that each return the environment we want to run, using a lambda to produce an anonymous function that returns the gym environment.
# Import required packages
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
# Setting the number of environments
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]
# Create a vectorized environment
envs = SubprocVecEnv(envs)
This envs now acts like a single environment, on which we can call the reset and step functions. However, these functions now return an array of observations/actions rather than a single observation/action.
# Get initial state
init_obs = envs.reset()
# Display the number of environments
print("Number of Envs:", len(init_obs))
# Check the shape of one observation
one_obs = init_obs[0]
print("Shape of one Env:", one_obs.shape)
# Prepare action list and apply them to the environment
actions = [0, 1, 2]
obs = envs.step(actions)
OUTPUT:
Number of Envs: 3
Shape of one Env: (210, 160, 3)
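Note that step on a vectorized environment returns batched results, one entry per environment. As a small sketch (shapes assumed from the three Breakout environments above), the return values can be unpacked like this:
# step returns stacked results, one entry per environment
obs, rewards, dones, infos = envs.step([0, 1, 2])
print("Batched observation shape:", obs.shape)  # (3, 210, 160, 3)
print("Rewards for each env:", rewards)         # array of length 3
print("Done flags for each env:", dones)        # array of length 3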
Calling the render function on the vectorized envs displays screenshots of the games in a tiled format.
# Render the environments
import time
# Setting the number of environments
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]
# Create a vectorized environment
envs = SubprocVecEnv(envs)
init_obs = envs.reset()
for i in range(1000):
    actions = [envs.action_space.sample() for i in range(num_envs)]
    envs.step(actions)
    envs.render()
    time.sleep(0.001)
envs.close()
You’ll be greeted with the following visual display.
OUTPUT for render with SubprocVecEnv
You can explore more about vectorized environments in the baselines documentation.
Conclusion
This wraps up Part 1. With the topics covered, you should now be equipped to start training your reinforcement learning agents within the environments provided by OpenAI Gym. What if the specific environment you want to train your agent in isn’t available? If that’s the case, you’re in luck for two reasons!
First, OpenAI Gym allows you to implement your own custom environments. Second, this will be explored in detail in Part 2 of this series. Until then, enjoy your journey into the innovative realm of reinforcement learning using OpenAI Gym!