Reward Wrappers
- class gymnasium.RewardWrapper(env: Env[ObsType, ActType])
Superclass of wrappers that can modify the reward returned by a step.
If you would like to apply a function to the reward that is returned by the base environment before passing it to learning code, you can simply inherit from RewardWrapper and override the reward() method to implement that transformation, as in the example below.
- Parameters:
env – Environment to be wrapped.
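For example, a minimal subclass that rescales every reward might look like the following sketch (the ScaledReward name and the scale factor are hypothetical, chosen only for illustration):
>>> import gymnasium as gym
>>> class ScaledReward(gym.RewardWrapper):
...     """Hypothetical wrapper that multiplies every reward by a constant factor."""
...     def __init__(self, env, scale: float):
...         super().__init__(env)
...         self.scale = scale
...     def reward(self, reward):
...         return reward * self.scale
...
>>> env = ScaledReward(gym.make("CartPole-v1"), scale=10.0)
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(0)
>>> rew
10.0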
Implemented Wrappers
- class gymnasium.wrappers.TransformReward(env: Env[ObsType, ActType], func: Callable[[SupportsFloat], SupportsFloat])
Applies a function to the reward received from the environment's step().
A vector version of the wrapper exists: gymnasium.wrappers.vector.TransformReward (see the sketch at the end of this entry).
Example
>>> import gymnasium as gym
>>> from gymnasium.wrappers import TransformReward
>>> env = gym.make("CartPole-v1")
>>> env = TransformReward(env, lambda r: 2 * r + 1)
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(0)
>>> rew
3.0
- Change logs:
v0.15.0 - Initially added
- Parameters:
env (Env) – The environment to wrap
func (Callable) – The function to apply to the reward
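A sketch of the vector variant mentioned above, assuming it takes the same (env, func) arguments with func applied to the batched reward array (gym.make_vec is used here to build the vector environment):
>>> import gymnasium as gym
>>> from gymnasium.wrappers.vector import TransformReward
>>> envs = gym.make_vec("CartPole-v1", num_envs=3)
>>> envs = TransformReward(envs, lambda r: 2 * r + 1)  # r is the array of per-env rewards
>>> _ = envs.reset()
>>> _, rews, _, _, _ = envs.step(envs.action_space.sample())
>>> envs.close()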
- class gymnasium.wrappers.NormalizeReward(env: Env[ObsType, ActType], gamma: float = 0.99, epsilon: float = 1e-8)
This wrapper scales rewards so that the discounted returns have approximately unit variance.
In a nutshell, each reward is divided by the standard deviation of a rolling discounted sum of the rewards. The exponential moving average will have variance \((1 - \gamma)^2\).
The property _update_running_mean allows freezing or resuming the running calculation of the reward statistics. If True (default), the RunningMeanStd is updated every time self.normalize() is called. If False, the calculated statistics are used but no longer updated; this may be used during evaluation.
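As a rough, standalone illustration of that scaling (a hypothetical RewardScaler helper written for this documentation; not the wrapper's actual implementation):
import numpy as np

class RewardScaler:
    """Hypothetical sketch: divide each reward by the running std of the discounted return."""

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma             # discount factor for the rolling return
        self.epsilon = epsilon         # stability constant added under the square root
        self.ret = 0.0                 # rolling discounted sum of rewards
        self.mean = 0.0                # running mean of the rolling return
        self.var = 1.0                 # running variance of the rolling return
        self.count = 1e-4              # running sample count
        self.update_statistics = True  # set to False to freeze the statistics (e.g. evaluation)

    def scale(self, reward: float) -> float:
        self.ret = self.gamma * self.ret + reward
        if self.update_statistics:
            # incremental (Welford-style) update of the return's mean and variance
            self.count += 1
            delta = self.ret - self.mean
            self.mean += delta / self.count
            self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        return reward / np.sqrt(self.var + self.epsilon)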
A vector version of the wrapper exists: gymnasium.wrappers.vector.NormalizeReward.
- Important note:
Contrary to what the name suggests, this wrapper does not normalize the rewards to have a mean of 0 and a standard deviation of 1. Instead, it scales the rewards such that discounted returns have approximately unit variance. See [Engstrom et al.](https://openreview.net/forum?id=r1etN1rtPB) on “reward scaling” for more information.
Note
In v0.27, NormalizeReward was updated as the forward discounted reward estimate was incorrectly computed in Gym v0.25+. For more detail, read [#3154](https://github.com/openai/gym/pull/3152).
Note
The scaling depends on past trajectories and rewards will not be scaled correctly if the wrapper was newly instantiated or the policy was changed recently.
- Example without the normalize reward wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> env = gym.make("MountainCarContinuous-v0")
>>> _ = env.reset(seed=123)
>>> _ = env.action_space.seed(123)
>>> episode_rewards = []
>>> terminated, truncated = False, False
>>> while not (terminated or truncated):
...     observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
...     episode_rewards.append(reward)
...
>>> env.close()
>>> np.var(episode_rewards)
np.float64(0.0008876301247721108)
- Example with the normalize reward wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.wrappers import NormalizeReward
>>> env = gym.make("MountainCarContinuous-v0")
>>> env = NormalizeReward(env, gamma=0.99, epsilon=1e-8)
>>> _ = env.reset(seed=123)
>>> _ = env.action_space.seed(123)
>>> episode_rewards = []
>>> terminated, truncated = False, False
>>> while not (terminated or truncated):
...     observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
...     episode_rewards.append(reward)
...
>>> env.close()
>>> # will approach 0.99 with more episodes
>>> np.var(episode_rewards)
np.float64(0.010162116476634746)
- Change logs:
v0.21.0 - Initially added
v1.0.0 - Add update_running_mean attribute to allow disabling of updating the running mean / standard deviation
- Parameters:
env (Env) – The environment to apply the wrapper to
epsilon (float) – A stability parameter
gamma (float) – The discount factor that is used in the exponential moving average.
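To freeze the reward statistics during evaluation, the update_running_mean attribute added in v1.0.0 (see the change log above) can be switched off; a brief sketch:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import NormalizeReward
>>> env = NormalizeReward(gym.make("MountainCarContinuous-v0"))
>>> env.update_running_mean = False  # existing statistics are still applied but no longer updated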
- class gymnasium.wrappers.ClipReward(env: gym.Env[ObsType, ActType], min_reward: float | np.ndarray | None = None, max_reward: float | np.ndarray | None = None)
Clips the rewards for an environment between a lower and an upper bound.
A vector version of the wrapper exists: gymnasium.wrappers.vector.ClipReward.
Example
>>> import gymnasium as gym
>>> from gymnasium.wrappers import ClipReward
>>> env = gym.make("CartPole-v1")
>>> env = ClipReward(env, 0, 0.5)
>>> _ = env.reset()
>>> _, rew, _, _, _ = env.step(1)
>>> rew
np.float64(0.5)
- Change logs:
v1.0.0 - Initially added
- Parameters:
env (Env) – The environment to wrap
min_reward (Union[float, np.ndarray]) – lower bound to apply
max_reward (Union[float, np.ndarray]) – upper bound to apply
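Because both bounds default to None, the wrapper can also clip on one side only; a brief sketch using the keyword arguments from the signature above:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import ClipReward
>>> env = ClipReward(gym.make("CartPole-v1"), max_reward=0.5)  # clip only from above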