Reward Wrappers

class gymnasium.RewardWrapper(env: Env[ObsType, ActType])[source]

Superclass of wrappers that can modify the reward returned by a step.

If you would like to apply a function to the reward returned by the base environment before passing it to learning code, you can simply inherit from RewardWrapper and override the method reward() to implement that transformation; a minimal example is shown below.

Parameters:

env – Environment to be wrapped.

reward(reward: SupportsFloat) → SupportsFloat[source]

Returns a modified environment reward.

Parameters:

reward – The env step() reward

Returns:

The modified `reward`
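
As a minimal sketch of the subclassing pattern described above (the SquashReward wrapper below is a hypothetical example, not part of the library), only reward() needs to be overridden:

>>> import numpy as np
>>> import gymnasium as gym
>>> class SquashReward(gym.RewardWrapper):
...     """Squash every reward into (-1, 1) with tanh."""
...     def reward(self, reward):
...         return float(np.tanh(reward))
...
>>> env = SquashReward(gym.make("CartPole-v1"))
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(0)
>>> rew
0.7615941559557649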

Implemented Wrappers

class gymnasium.wrappers.TransformReward(env: Env[ObsType, ActType], func: Callable[[SupportsFloat], SupportsFloat])[source]

Applies a function to the reward received from the environment’s step.

A vector version of the wrapper is available as gymnasium.wrappers.vector.TransformReward.

Example

>>> import gymnasium as gym
>>> from gymnasium.wrappers import TransformReward
>>> env = gym.make("CartPole-v1")
>>> env = TransformReward(env, lambda r: 2 * r + 1)
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(0)
>>> rew
3.0
Change logs:
  • v0.15.0 - Initially added

Parameters:
  • env (Env) – The environment to wrap

  • func (Callable) – The function to apply to the reward

class gymnasium.wrappers.NormalizeReward(env: Env[ObsType, ActType], gamma: float = 0.99, epsilon: float = 1e-8)[source]

This wrapper scales rewards such that the discounted returns have approximately unit variance.

In a nutshell, each reward is divided by the standard deviation of a rolling discounted sum of rewards. The exponential moving average will have variance \((1 - \gamma)^2\).
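
In formula form (a sketch inferred from the description above, not taken verbatim from the implementation), with \(G_t\) the rolling discounted sum of rewards, \(\operatorname{Var}[G]\) its running variance estimate, and \(\epsilon\) the stability constant, each reward \(r_t\) is rescaled as

\[
G_t = \gamma G_{t-1} + r_t \quad (\text{reset at episode end}), \qquad
\tilde{r}_t = \frac{r_t}{\sqrt{\operatorname{Var}[G] + \epsilon}}.
\]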

The property update_running_mean allows freezing or continuing the running-mean calculation of the reward statistics. If True (default), the RunningMeanStd will be updated every time self.normalize() is called. If False, the calculated statistics are used but no longer updated; this may be used during evaluation.

A vector version of the wrapper is available as gymnasium.wrappers.vector.NormalizeReward.

Important note:

Contrary to what the name suggests, this wrapper does not normalize the rewards to have a mean of 0 and a standard deviation of 1. Instead, it scales the rewards such that discounted returns have approximately unit variance. See [Engstrom et al.](https://openreview.net/forum?id=r1etN1rtPB) on “reward scaling” for more information.

Note

In v0.27, NormalizeReward was updated as the forward discounted reward estimate was incorrectly computed in Gym v0.25+. For more detail, read [#3152](https://github.com/openai/gym/pull/3152).

Note

The scaling depends on past trajectories and rewards will not be scaled correctly if the wrapper was newly instantiated or the policy was changed recently.

Example without the normalize reward wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> env = gym.make("MountainCarContinuous-v0")
>>> _ = env.reset(seed=123)
>>> _ = env.action_space.seed(123)
>>> episode_rewards = []
>>> terminated, truncated = False, False
>>> while not (terminated or truncated):
...     observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
...     episode_rewards.append(reward)
...
>>> env.close()
>>> np.var(episode_rewards)
np.float64(0.0008876301247721108)
Example with the normalize reward wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.wrappers import NormalizeReward
>>> env = gym.make("MountainCarContinuous-v0")
>>> env = NormalizeReward(env, gamma=0.99, epsilon=1e-8)
>>> _ = env.reset(seed=123)
>>> _ = env.action_space.seed(123)
>>> episode_rewards = []
>>> terminated, truncated = False, False
>>> while not (terminated or truncated):
...     observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
...     episode_rewards.append(reward)
...
>>> env.close()
>>> # will approach 0.99 with more episodes
>>> np.var(episode_rewards)
np.float64(0.010162116476634746)
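
As a brief sketch of the update_running_mean property described above (usage pattern only, with an arbitrary number of warm-up steps), the reward statistics can be frozen for evaluation and re-enabled afterwards:

>>> import gymnasium as gym
>>> from gymnasium.wrappers import NormalizeReward
>>> env = NormalizeReward(gym.make("MountainCarContinuous-v0"))
>>> _ = env.reset(seed=123)
>>> _ = env.action_space.seed(123)
>>> for _ in range(10):
...     _ = env.step(env.action_space.sample())
...
>>> env.update_running_mean = False  # statistics are still used, but no longer updated
>>> _, rew, _, _, _ = env.step(env.action_space.sample())  # reward scaled with frozen stats
>>> env.update_running_mean = True   # resume updating the statistics for further training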
Change logs:
  • v0.21.0 - Initially added

  • v1.0.0 - Add update_running_mean attribute to allow disabling of updating the running mean / standard deviation

Parameters:
  • env (Env) – The environment to apply the wrapper to

  • epsilon (float) – A stability parameter

  • gamma (float) – The discount factor that is used in the exponential moving average.

class gymnasium.wrappers.ClipReward(env: gym.Env[ObsType, ActType], min_reward: float | np.ndarray | None = None, max_reward: float | np.ndarray | None = None)[source]

Clips the rewards for an environment between an upper and lower bound.

A vector version of the wrapper is available as gymnasium.wrappers.vector.ClipReward.

Example

>>> import gymnasium as gym
>>> from gymnasium.wrappers import ClipReward
>>> env = gym.make("CartPole-v1")
>>> env = ClipReward(env, 0, 0.5)
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(1)
>>> rew
np.float64(0.5)
Change logs:
  • v1.0.0 - Initially added

Parameters:
  • env (Env) – The environment to wrap

  • min_reward (Union[float, np.ndarray]) – Lower bound to apply

  • max_reward (Union[float, np.ndarray]) – Upper bound to apply