{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Training A2C with Vector Envs and Domain Randomization\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n\nIn this tutorial, you'll learn how to use vectorized environments to train an Advantage Actor-Critic agent.\nWe are going to use A2C, which is the synchronous version of the A3C algorithm [1].\n\nVectorized environments [3] can help to achieve quicker and more robust training by allowing multiple instances\nof the same environment to run in parallel (on multiple CPUs). This can significantly reduce the variance and thus speeds up the training.\n\nWe will implement an Advantage Actor-Critic from scratch to look at how you can feed batched states into your networks to get a vector of actions\n(one action per environment) and calculate the losses for actor and critic on minibatches of transitions.\nEach minibatch contains the transitions of one sampling phase: `n_steps_per_update` steps are executed in `n_envs` environments in parallel\n(multiply the two to get the number of transitions in a minibatch). After each sampling phase, the losses are calculated and one gradient step is executed.\nTo calculate the advantages, we are going to use the Generalized Advantage Estimation (GAE) method [2], which balances the tradeoff\nbetween variance and bias of the advantage estimates.\n\nThe A2C agent class is initialized with the number of features of the input state, the number of actions the agent can take,\nthe learning rates and the number of environments that run in parallel to collect experiences. The actor and critic networks are defined\nand their respective optimizers are initialized. The forward pass of the networks takes in a batched vector of states and returns a tensor of state values\nand a tensor of action logits. The select_action method returns a tuple of the chosen actions, the log-probs of those actions, and the state values for each action.\nIn addition, it also returns the entropy of the policy distribution, which is subtracted from the loss later (with a weighting factor `ent_coef`) to encourage exploration.\n\nThe get_losses function calculates the losses for the actor and critic networks (using GAE), which are then updated using the update_parameters function.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Author: Till Zemann\n# License: MIT License\n\nfrom __future__ import annotations\n\nimport os\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch import optim\nfrom tqdm import tqdm\n\nimport gymnasium as gym"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advantage Actor-Critic (A2C)\n\nThe Actor-Critic combines elements of value-based and policy-based methods. In A2C, the agent has two separate neural networks:\na critic network that estimates the state-value function, and an actor network that outputs logits for a categorical probability distribution over all actions.\nThe critic network is trained to minimize the mean squared error between the predicted state values and the actual returns received by the agent\n(this is equivalent to minimizing the squared advantages, because the advantage of an action is as the difference between the return and the state-value: A(s,a) = Q(s,a) - V(s).\nThe actor network is trained to maximize the expected return by selecting actions that have high expected values according to the critic network.\n\nThe focus of this tutorial will not be on the details of A2C itself. Instead, the tutorial will focus on how to use vectorized environments\nand domain randomization to accelerate the training process for A2C (and other reinforcement learning algorithms).\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class A2C(nn.Module):\n \"\"\"\n (Synchronous) Advantage Actor-Critic agent class\n\n Args:\n n_features: The number of features of the input state.\n n_actions: The number of actions the agent can take.\n device: The device to run the computations on (running on a GPU might be quicker for larger Neural Nets,\n for this code CPU is totally fine).\n critic_lr: The learning rate for the critic network (should usually be larger than the actor_lr).\n actor_lr: The learning rate for the actor network.\n n_envs: The number of environments that run in parallel (on multiple CPUs) to collect experiences.\n \"\"\"\n\n def __init__(\n self,\n n_features: int,\n n_actions: int,\n device: torch.device,\n critic_lr: float,\n actor_lr: float,\n n_envs: int,\n ) -> None:\n \"\"\"Initializes the actor and critic networks and their respective optimizers.\"\"\"\n super().__init__()\n self.device = device\n self.n_envs = n_envs\n\n critic_layers = [\n nn.Linear(n_features, 32),\n nn.ReLU(),\n nn.Linear(32, 32),\n nn.ReLU(),\n nn.Linear(32, 1), # estimate V(s)\n ]\n\n actor_layers = [\n nn.Linear(n_features, 32),\n nn.ReLU(),\n nn.Linear(32, 32),\n nn.ReLU(),\n nn.Linear(\n 32, n_actions\n ), # estimate action logits (will be fed into a softmax later)\n ]\n\n # define actor and critic networks\n self.critic = nn.Sequential(*critic_layers).to(self.device)\n self.actor = nn.Sequential(*actor_layers).to(self.device)\n\n # define optimizers for actor and critic\n self.critic_optim = optim.RMSprop(self.critic.parameters(), lr=critic_lr)\n self.actor_optim = optim.RMSprop(self.actor.parameters(), lr=actor_lr)\n\n def forward(self, x: np.ndarray) -> tuple[torch.Tensor, torch.Tensor]:\n \"\"\"\n Forward pass of the networks.\n\n Args:\n x: A batched vector of states.\n\n Returns:\n state_values: A tensor with the state values, with shape [n_envs,].\n action_logits_vec: A tensor with the action logits, with shape [n_envs, n_actions].\n \"\"\"\n x = torch.Tensor(x).to(self.device)\n state_values = self.critic(x) # shape: [n_envs,]\n action_logits_vec = self.actor(x) # shape: [n_envs, n_actions]\n return (state_values, action_logits_vec)\n\n def select_action(\n self, x: np.ndarray\n ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:\n \"\"\"\n Returns a tuple of the chosen actions and the log-probs of those actions.\n\n Args:\n x: A batched vector of states.\n\n Returns:\n actions: A tensor with the actions, with shape [n_steps_per_update, n_envs].\n action_log_probs: A tensor with the log-probs of the actions, with shape [n_steps_per_update, n_envs].\n state_values: A tensor with the state values, with shape [n_steps_per_update, n_envs].\n \"\"\"\n state_values, action_logits = self.forward(x)\n action_pd = torch.distributions.Categorical(\n logits=action_logits\n ) # implicitly uses softmax\n actions = action_pd.sample()\n action_log_probs = action_pd.log_prob(actions)\n entropy = action_pd.entropy()\n return (actions, action_log_probs, state_values, entropy)\n\n def get_losses(\n self,\n rewards: torch.Tensor,\n action_log_probs: torch.Tensor,\n value_preds: torch.Tensor,\n entropy: torch.Tensor,\n masks: torch.Tensor,\n gamma: float,\n lam: float,\n ent_coef: float,\n device: torch.device,\n ) -> tuple[torch.Tensor, torch.Tensor]:\n \"\"\"\n Computes the loss of a minibatch (transitions collected in one sampling phase) for actor and critic\n using Generalized Advantage Estimation (GAE) to compute the advantages (https://arxiv.org/abs/1506.02438).\n\n Args:\n rewards: A tensor with the rewards for each time step in the episode, with shape [n_steps_per_update, n_envs].\n action_log_probs: A tensor with the log-probs of the actions taken at each time step in the episode, with shape [n_steps_per_update, n_envs].\n value_preds: A tensor with the state value predictions for each time step in the episode, with shape [n_steps_per_update, n_envs].\n masks: A tensor with the masks for each time step in the episode, with shape [n_steps_per_update, n_envs].\n gamma: The discount factor.\n lam: The GAE hyperparameter. (lam=1 corresponds to Monte-Carlo sampling with high variance and no bias,\n and lam=0 corresponds to normal TD-Learning that has a low variance but is biased\n because the estimates are generated by a Neural Net).\n device: The device to run the computations on (e.g. CPU or GPU).\n\n Returns:\n critic_loss: The critic loss for the minibatch.\n actor_loss: The actor loss for the minibatch.\n \"\"\"\n T = len(rewards)\n advantages = torch.zeros(T, self.n_envs, device=device)\n\n # compute the advantages using GAE\n gae = 0.0\n for t in reversed(range(T - 1)):\n td_error = (\n rewards[t] + gamma * masks[t] * value_preds[t + 1] - value_preds[t]\n )\n gae = td_error + gamma * lam * masks[t] * gae\n advantages[t] = gae\n\n # calculate the loss of the minibatch for actor and critic\n critic_loss = advantages.pow(2).mean()\n\n # give a bonus for higher entropy to encourage exploration\n actor_loss = (\n -(advantages.detach() * action_log_probs).mean() - ent_coef * entropy.mean()\n )\n return (critic_loss, actor_loss)\n\n def update_parameters(\n self, critic_loss: torch.Tensor, actor_loss: torch.Tensor\n ) -> None:\n \"\"\"\n Updates the parameters of the actor and critic networks.\n\n Args:\n critic_loss: The critic loss.\n actor_loss: The actor loss.\n \"\"\"\n self.critic_optim.zero_grad()\n critic_loss.backward()\n self.critic_optim.step()\n\n self.actor_optim.zero_grad()\n actor_loss.backward()\n self.actor_optim.step()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Vectorized Environments\n\nWhen you calculate the losses for the two Neural Networks over only one epoch, it might have a high variance. With vectorized environments,\nwe can play with `n_envs` in parallel and thus get up to a linear speedup (meaning that in theory, we collect samples `n_envs` times quicker)\nthat we can use to calculate the loss for the current policy and critic network. When we are using more samples to calculate the loss,\nit will have a lower variance and theirfore leads to quicker learning.\n\nA2C is a synchronous method, meaning that the parameter updates to Networks take place deterministically (after each sampling phase),\nbut we can still make use of asynchronous vector envs to spawn multiple processes for parallel environment execution.\n\nThe simplest way to create vector environments is by calling `gym.vector.make`, which creates multiple instances of the same environment:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"envs = gym.vector.make(\"LunarLander-v2\", num_envs=3, max_episode_steps=600)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Domain Randomization\n\nIf we want to randomize the environment for training to get more robust agents (that can deal with different parameterizations of an environment\nand theirfore might have a higher degree of generalization), we can set the desired parameters manually or use a pseudo-random number generator to generate them.\n\nManually setting up 3 parallel 'LunarLander-v2' envs with different parameters:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"envs = gym.vector.AsyncVectorEnv(\n [\n lambda: gym.make(\n \"LunarLander-v2\",\n gravity=-10.0,\n enable_wind=True,\n wind_power=15.0,\n turbulence_power=1.5,\n max_episode_steps=600,\n ),\n lambda: gym.make(\n \"LunarLander-v2\",\n gravity=-9.8,\n enable_wind=True,\n wind_power=10.0,\n turbulence_power=1.3,\n max_episode_steps=600,\n ),\n lambda: gym.make(\n \"LunarLander-v2\", gravity=-7.0, enable_wind=False, max_episode_steps=600\n ),\n ]\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\nRandomly generating the parameters for 3 parallel 'LunarLander-v2' envs, using `np.clip` to stay in the recommended parameter space:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"envs = gym.vector.AsyncVectorEnv(\n [\n lambda: gym.make(\n \"LunarLander-v2\",\n gravity=np.clip(\n np.random.normal(loc=-10.0, scale=1.0), a_min=-11.99, a_max=-0.01\n ),\n enable_wind=np.random.choice([True, False]),\n wind_power=np.clip(\n np.random.normal(loc=15.0, scale=1.0), a_min=0.01, a_max=19.99\n ),\n turbulence_power=np.clip(\n np.random.normal(loc=1.5, scale=0.5), a_min=0.01, a_max=1.99\n ),\n max_episode_steps=600,\n )\n for i in range(3)\n ]\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\nHere we are using normal distributions with the standard parameterization of the environment as the mean and an arbitrary standard deviation (scale).\nDepending on the problem, you can experiment with higher variance and use different distributions as well.\n\nIf you are training on the same `n_envs` environments for the entire training time, and `n_envs` is a relatively low number\n(in proportion to how complex the environment is), you might still get some overfitting to the specific parameterizations that you picked.\nTo mitigate this, you can either pick a high number of randomly parameterized environments or remake your environments every couple of sampling phases\nto generate a new set of pseudo-random parameters.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# environment hyperparams\nn_envs = 10\nn_updates = 1000\nn_steps_per_update = 128\nrandomize_domain = False\n\n# agent hyperparams\ngamma = 0.999\nlam = 0.95 # hyperparameter for GAE\nent_coef = 0.01 # coefficient for the entropy bonus (to encourage exploration)\nactor_lr = 0.001\ncritic_lr = 0.005\n\n# Note: the actor has a slower learning rate so that the value targets become\n# more stationary and are theirfore easier to estimate for the critic\n\n# environment setup\nif randomize_domain:\n envs = gym.vector.AsyncVectorEnv(\n [\n lambda: gym.make(\n \"LunarLander-v2\",\n gravity=np.clip(\n np.random.normal(loc=-10.0, scale=1.0), a_min=-11.99, a_max=-0.01\n ),\n enable_wind=np.random.choice([True, False]),\n wind_power=np.clip(\n np.random.normal(loc=15.0, scale=1.0), a_min=0.01, a_max=19.99\n ),\n turbulence_power=np.clip(\n np.random.normal(loc=1.5, scale=0.5), a_min=0.01, a_max=1.99\n ),\n max_episode_steps=600,\n )\n for i in range(n_envs)\n ]\n )\n\nelse:\n envs = gym.vector.make(\"LunarLander-v2\", num_envs=n_envs, max_episode_steps=600)\n\n\nobs_shape = envs.single_observation_space.shape[0]\naction_shape = envs.single_action_space.n\n\n# set the device\nuse_cuda = False\nif use_cuda:\n device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nelse:\n device = torch.device(\"cpu\")\n\n# init the agent\nagent = A2C(obs_shape, action_shape, device, critic_lr, actor_lr, n_envs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the A2C Agent\n\nFor our training loop, we are using the `RecordEpisodeStatistics` wrapper to record the episode lengths and returns and we are also saving\nthe losses and entropies to plot them after the agent finished training.\n\nYou may notice that the don't reset the vectorized envs at the start of each episode like we would usually do.\nThis is because each environment resets automatically once the episode finishes (each environment takes a different number of timesteps to finish\nan episode because of the random seeds). As a result, we are also not collecting data in `episodes`, but rather just play a certain number of steps\n(`n_steps_per_update`) in each environment (as an example, this could mean that we play 20 timesteps to finish an episode and then\nuse the rest of the timesteps to begin a new one).\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# create a wrapper environment to save episode returns and episode lengths\nenvs_wrapper = gym.wrappers.RecordEpisodeStatistics(envs, deque_size=n_envs * n_updates)\n\ncritic_losses = []\nactor_losses = []\nentropies = []\n\n# use tqdm to get a progress bar for training\nfor sample_phase in tqdm(range(n_updates)):\n\n # we don't have to reset the envs, they just continue playing\n # until the episode is over and then reset automatically\n\n # reset lists that collect experiences of an episode (sample phase)\n ep_value_preds = torch.zeros(n_steps_per_update, n_envs, device=device)\n ep_rewards = torch.zeros(n_steps_per_update, n_envs, device=device)\n ep_action_log_probs = torch.zeros(n_steps_per_update, n_envs, device=device)\n masks = torch.zeros(n_steps_per_update, n_envs, device=device)\n\n # at the start of training reset all envs to get an initial state\n if sample_phase == 0:\n states, info = envs_wrapper.reset(seed=42)\n\n # play n steps in our parallel environments to collect data\n for step in range(n_steps_per_update):\n\n # select an action A_{t} using S_{t} as input for the agent\n actions, action_log_probs, state_value_preds, entropy = agent.select_action(\n states\n )\n\n # perform the action A_{t} in the environment to get S_{t+1} and R_{t+1}\n states, rewards, terminated, truncated, infos = envs_wrapper.step(\n actions.numpy()\n )\n\n ep_value_preds[step] = torch.squeeze(state_value_preds)\n ep_rewards[step] = torch.tensor(rewards, device=device)\n ep_action_log_probs[step] = action_log_probs\n\n # add a mask (for the return calculation later);\n # for each env the mask is 1 if the episode is ongoing and 0 if it is terminated (not by truncation!)\n masks[step] = torch.tensor([not term for term in terminated])\n\n # calculate the losses for actor and critic\n critic_loss, actor_loss = agent.get_losses(\n ep_rewards,\n ep_action_log_probs,\n ep_value_preds,\n entropy,\n masks,\n gamma,\n lam,\n ent_coef,\n device,\n )\n\n # update the actor and critic networks\n agent.update_parameters(critic_loss, actor_loss)\n\n # log the losses and entropy\n critic_losses.append(critic_loss.detach().cpu().numpy())\n actor_losses.append(actor_loss.detach().cpu().numpy())\n entropies.append(entropy.detach().mean().cpu().numpy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"\"\"\" plot the results \"\"\"\n\n# %matplotlib inline\n\nrolling_length = 20\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 5))\nfig.suptitle(\n f\"Training plots for {agent.__class__.__name__} in the LunarLander-v2 environment \\n \\\n (n_envs={n_envs}, n_steps_per_update={n_steps_per_update}, randomize_domain={randomize_domain})\"\n)\n\n# episode return\naxs[0][0].set_title(\"Episode Returns\")\nepisode_returns_moving_average = (\n np.convolve(\n np.array(envs_wrapper.return_queue).flatten(),\n np.ones(rolling_length),\n mode=\"valid\",\n )\n / rolling_length\n)\naxs[0][0].plot(\n np.arange(len(episode_returns_moving_average)) / n_envs,\n episode_returns_moving_average,\n)\naxs[0][0].set_xlabel(\"Number of episodes\")\n\n# entropy\naxs[1][0].set_title(\"Entropy\")\nentropy_moving_average = (\n np.convolve(np.array(entropies), np.ones(rolling_length), mode=\"valid\")\n / rolling_length\n)\naxs[1][0].plot(entropy_moving_average)\naxs[1][0].set_xlabel(\"Number of updates\")\n\n\n# critic loss\naxs[0][1].set_title(\"Critic Loss\")\ncritic_losses_moving_average = (\n np.convolve(\n np.array(critic_losses).flatten(), np.ones(rolling_length), mode=\"valid\"\n )\n / rolling_length\n)\naxs[0][1].plot(critic_losses_moving_average)\naxs[0][1].set_xlabel(\"Number of updates\")\n\n\n# actor loss\naxs[1][1].set_title(\"Actor Loss\")\nactor_losses_moving_average = (\n np.convolve(np.array(actor_losses).flatten(), np.ones(rolling_length), mode=\"valid\")\n / rolling_length\n)\naxs[1][1].plot(actor_losses_moving_average)\naxs[1][1].set_xlabel(\"Number of updates\")\n\nplt.tight_layout()\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performance Analysis of Synchronous and Asynchronous Vectorized Environments\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\nAsynchronous environments can lead to quicker training times and a higher speedup\nfor data collection compared to synchronous environments. This is because asynchronous environments\nallow multiple agents to interact with their environments in parallel,\nwhile synchronous environments run multiple environments serially.\nThis results in better efficiency and faster training times for asynchronous environments.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\nAccording to the Karp-Flatt metric (a metric used in parallel computing to estimate the limit for the\nspeedup when scaling up the number of parallel processes, here the number of environments),\nthe estimated max. speedup for asynchronous environments is 57, while the estimated maximum speedup\nfor synchronous environments is 21. This suggests that asynchronous environments have significantly\nfaster training times compared to synchronous environments (see graphs).\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------------------------\n\nHowever, it is important to note that increasing the number of parallel vector environments\ncan lead to slower training times after a certain number of environments (see plot below, where the\nagent was trained until the mean training returns were above -120). The slower training times might occur\nbecause the gradients of the environments are good enough after a relatively low number of environments\n(especially if the environment is not very complex). In this case, increasing the number of environments\ndoes not increase the learning speed, and actually increases the runtime, possibly due to the additional time\nneeded to calculate the gradients. For LunarLander-v2, the best performing configuration used a AsyncVectorEnv\nwith 10 parallel environments, but environments with a higher complexity may require more\nparallel environments to achieve optimal performance.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving/ Loading Weights\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"save_weights = False\nload_weights = False\n\nactor_weights_path = \"weights/actor_weights.h5\"\ncritic_weights_path = \"weights/critic_weights.h5\"\n\nif not os.path.exists(\"weights\"):\n os.mkdir(\"weights\")\n\n\"\"\" save network weights \"\"\"\nif save_weights:\n torch.save(agent.actor.state_dict(), actor_weights_path)\n torch.save(agent.critic.state_dict(), critic_weights_path)\n\n\n\"\"\" load network weights \"\"\"\nif load_weights:\n agent = A2C(obs_shape, action_shape, device, critic_lr, actor_lr)\n\n agent.actor.load_state_dict(torch.load(actor_weights_path))\n agent.critic.load_state_dict(torch.load(critic_weights_path))\n agent.actor.eval()\n agent.critic.eval()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Showcase the Agent\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"\"\"\" play a couple of showcase episodes \"\"\"\n\nn_showcase_episodes = 3\n\nfor episode in range(n_showcase_episodes):\n print(f\"starting episode {episode}...\")\n\n # create a new sample environment to get new random parameters\n if randomize_domain:\n env = gym.make(\n \"LunarLander-v2\",\n render_mode=\"human\",\n gravity=np.clip(\n np.random.normal(loc=-10.0, scale=2.0), a_min=-11.99, a_max=-0.01\n ),\n enable_wind=np.random.choice([True, False]),\n wind_power=np.clip(\n np.random.normal(loc=15.0, scale=2.0), a_min=0.01, a_max=19.99\n ),\n turbulence_power=np.clip(\n np.random.normal(loc=1.5, scale=1.0), a_min=0.01, a_max=1.99\n ),\n max_episode_steps=500,\n )\n else:\n env = gym.make(\"LunarLander-v2\", render_mode=\"human\", max_episode_steps=500)\n\n # get an initial state\n state, info = env.reset()\n\n # play one episode\n done = False\n while not done:\n\n # select an action A_{t} using S_{t} as input for the agent\n with torch.no_grad():\n action, _, _, _ = agent.select_action(state[None, :])\n\n # perform the action A_{t} in the environment to get S_{t+1} and R_{t+1}\n state, reward, terminated, truncated, info = env.step(action.item())\n\n # update if the environment is done\n done = terminated or truncated\n\nenv.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Try playing the environment yourself\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# from gymnasium.utils.play import play\n#\n# play(gym.make('LunarLander-v2', render_mode='rgb_array'),\n# keys_to_action={'w': 2, 'a': 1, 'd': 3}, noop=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n\n[1] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu. \"Asynchronous Methods for Deep Reinforcement Learning\" ICML (2016).\n\n[2] J. Schulman, P. Moritz, S. Levine, M. Jordan and P. Abbeel. \"High-dimensional continuous control using generalized advantage estimation.\" ICLR (2016).\n\n[3] Gymnasium Documentation: Vector environments. (URL: https://gymnasium.farama.org/api/vector/)\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 0
}