{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Training using REINFORCE for Mujoco\n\n<img src=\"file://_static/img/tutorials/reinforce_invpend_gym_v26_fig1.gif\" width=\"400\" alt=\"agent-environment-diagram\">\n\nThis tutorial serves 2 purposes:\n 1. To understand how to implement REINFORCE [1] from scratch to solve Mujoco's InvertedPendulum-v4\n 2. Implementation a deep reinforcement learning algorithm with Gymnasium's v0.26+ `step()` function\n\nWe will be using **REINFORCE**, one of the earliest policy gradient methods. Unlike going under the burden of learning a value function first and then deriving a policy out of it,\nREINFORCE optimizes the policy directly. In other words, it is trained to maximize the probability of Monte-Carlo returns. More on that later.\n\n**Inverted Pendulum** is Mujoco's cartpole but now powered by the Mujoco physics simulator -\nwhich allows more complex experiments (such as varying the effects of gravity).\nThis environment involves a cart that can moved linearly, with a pole fixed on it at one end and having another end free.\nThe cart can be pushed left or right, and the goal is to balance the pole on the top of the cart by applying forces on the cart.\nMore information on the environment could be found at https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/\n\n**Training Objectives**: To balance the pole (inverted pendulum) on top of the cart\n\n**Actions**: The agent takes a 1D vector for actions. The action space is a continuous ``(action)`` in ``[-3, 3]``,\nwhere action represents the numerical force applied to the cart\n(with magnitude representing the amount of force and sign representing the direction)\n\n**Approach**: We use PyTorch to code REINFORCE from scratch to train a Neural Network policy to master Inverted Pendulum.\n\nAn explanation of the Gymnasium v0.26+ `Env.step()` function\n\n``env.step(A)`` allows us to take an action 'A' in the current environment 'env'. The environment then executes the action\nand returns five variables:\n\n - ``next_obs``: This is the observation that the agent will receive after taking the action.\n - ``reward``: This is the reward that the agent will receive after taking the action.\n - ``terminated``: This is a boolean variable that indicates whether or not the environment has terminated.\n - ``truncated``: This is a boolean variable that also indicates whether the episode ended by early truncation, i.e., a time limit is reached.\n - ``info``: This is a dictionary that might contain additional information about the environment.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import annotations\n\nimport random\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nimport torch\nimport torch.nn as nn\nfrom torch.distributions.normal import Normal\n\nimport gymnasium as gym\n\n\nplt.rcParams[\"figure.figsize\"] = (10, 5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Policy Network\n\n<img src=\"file://_static/img/tutorials/reinforce_invpend_gym_v26_fig2.png\">\n\nWe start by building a policy that the agent will learn using REINFORCE.\nA policy is a mapping from the current environment observation to a probability distribution of the actions to be taken.\nThe policy used in the tutorial is parameterized by a neural network. It consists of 2 linear layers that are shared between both the predicted mean and standard deviation.\nFurther, the single individual linear layers are used to estimate the mean and the standard deviation. ``nn.Tanh`` is used as a non-linearity between the hidden layers.\nThe following function estimates a mean and standard deviation of a normal distribution from which an action is sampled. Hence it is expected for the policy to learn\nappropriate weights to output means and standard deviation based on the current observation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class Policy_Network(nn.Module):\n    \"\"\"Parametrized Policy Network.\"\"\"\n\n    def __init__(self, obs_space_dims: int, action_space_dims: int):\n        \"\"\"Initializes a neural network that estimates the mean and standard deviation\n         of a normal distribution from which an action is sampled from.\n\n        Args:\n            obs_space_dims: Dimension of the observation space\n            action_space_dims: Dimension of the action space\n        \"\"\"\n        super().__init__()\n\n        hidden_space1 = 16  # Nothing special with 16, feel free to change\n        hidden_space2 = 32  # Nothing special with 32, feel free to change\n\n        # Shared Network\n        self.shared_net = nn.Sequential(\n            nn.Linear(obs_space_dims, hidden_space1),\n            nn.Tanh(),\n            nn.Linear(hidden_space1, hidden_space2),\n            nn.Tanh(),\n        )\n\n        # Policy Mean specific Linear Layer\n        self.policy_mean_net = nn.Sequential(\n            nn.Linear(hidden_space2, action_space_dims)\n        )\n\n        # Policy Std Dev specific Linear Layer\n        self.policy_stddev_net = nn.Sequential(\n            nn.Linear(hidden_space2, action_space_dims)\n        )\n\n    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"Conditioned on the observation, returns the mean and standard deviation\n         of a normal distribution from which an action is sampled from.\n\n        Args:\n            x: Observation from the environment\n\n        Returns:\n            action_means: predicted mean of the normal distribution\n            action_stddevs: predicted standard deviation of the normal distribution\n        \"\"\"\n        shared_features = self.shared_net(x.float())\n\n        action_means = self.policy_mean_net(shared_features)\n        action_stddevs = torch.log(\n            1 + torch.exp(self.policy_stddev_net(shared_features))\n        )\n\n        return action_means, action_stddevs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Building an agent\n\n<img src=\"file://_static/img/tutorials/reinforce_invpend_gym_v26_fig3.jpeg\">\n\nNow that we are done building the policy, let us develop **REINFORCE** which gives life to the policy network.\nThe algorithm of REINFORCE could be found above. As mentioned before, REINFORCE aims to maximize the Monte-Carlo returns.\n\nFun Fact: REINFORCE is an acronym for \" 'RE'ward 'I'ncrement 'N'on-negative 'F'actor times 'O'ffset 'R'einforcement times 'C'haracteristic 'E'ligibility\n\nNote: The choice of hyperparameters is to train a decently performing agent. No extensive hyperparameter\ntuning was done.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class REINFORCE:\n    \"\"\"REINFORCE algorithm.\"\"\"\n\n    def __init__(self, obs_space_dims: int, action_space_dims: int):\n        \"\"\"Initializes an agent that learns a policy via REINFORCE algorithm [1]\n        to solve the task at hand (Inverted Pendulum v4).\n\n        Args:\n            obs_space_dims: Dimension of the observation space\n            action_space_dims: Dimension of the action space\n        \"\"\"\n\n        # Hyperparameters\n        self.learning_rate = 1e-4  # Learning rate for policy optimization\n        self.gamma = 0.99  # Discount factor\n        self.eps = 1e-6  # small number for mathematical stability\n\n        self.probs = []  # Stores probability values of the sampled action\n        self.rewards = []  # Stores the corresponding rewards\n\n        self.net = Policy_Network(obs_space_dims, action_space_dims)\n        self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=self.learning_rate)\n\n    def sample_action(self, state: np.ndarray) -> float:\n        \"\"\"Returns an action, conditioned on the policy and observation.\n\n        Args:\n            state: Observation from the environment\n\n        Returns:\n            action: Action to be performed\n        \"\"\"\n        state = torch.tensor(np.array([state]))\n        action_means, action_stddevs = self.net(state)\n\n        # create a normal distribution from the predicted\n        #   mean and standard deviation and sample an action\n        distrib = Normal(action_means[0] + self.eps, action_stddevs[0] + self.eps)\n        action = distrib.sample()\n        prob = distrib.log_prob(action)\n\n        action = action.numpy()\n\n        self.probs.append(prob)\n\n        return action\n\n    def update(self):\n        \"\"\"Updates the policy network's weights.\"\"\"\n        running_g = 0\n        gs = []\n\n        # Discounted return (backwards) - [::-1] will return an array in reverse\n        for R in self.rewards[::-1]:\n            running_g = R + self.gamma * running_g\n            gs.insert(0, running_g)\n\n        deltas = torch.tensor(gs)\n\n        log_probs = torch.stack(self.probs)\n\n        # Calculate the mean of log probabilities for all actions in the episode\n        log_prob_mean = log_probs.mean()\n\n        # Update the loss with the mean log probability and deltas\n        # Now, we compute the correct total loss by taking the sum of the element-wise products.\n        loss = -torch.sum(log_prob_mean * deltas)\n\n        # Update the policy network\n        self.optimizer.zero_grad()\n        loss.backward()\n        self.optimizer.step()\n\n        # Empty / zero out all episode-centric/related variables\n        self.probs = []\n        self.rewards = []"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now lets train the policy using REINFORCE to master the task of Inverted Pendulum.\n\nFollowing is the overview of the training procedure\n\n   for seed in random seeds\n       reinitialize agent\n\n       for episode in range of max number of episodes\n           until episode is done\n               sample action based on current observation\n\n               take action and receive reward and next observation\n\n               store action take, its probability, and the observed reward\n           update the policy\n\nNote: Deep RL is fairly brittle concerning random seed in a lot of common use cases (https://spinningup.openai.com/en/latest/spinningup/spinningup.html).\nHence it is important to test out various seeds, which we will be doing.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Create and wrap the environment\nenv = gym.make(\"InvertedPendulum-v4\")\nwrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50)  # Records episode-reward\n\ntotal_num_episodes = int(5e3)  # Total number of episodes\n# Observation-space of InvertedPendulum-v4 (4)\nobs_space_dims = env.observation_space.shape[0]\n# Action-space of InvertedPendulum-v4 (1)\naction_space_dims = env.action_space.shape[0]\nrewards_over_seeds = []\n\nfor seed in [1, 2, 3, 5, 8]:  # Fibonacci seeds\n    # set seed\n    torch.manual_seed(seed)\n    random.seed(seed)\n    np.random.seed(seed)\n\n    # Reinitialize agent every seed\n    agent = REINFORCE(obs_space_dims, action_space_dims)\n    reward_over_episodes = []\n\n    for episode in range(total_num_episodes):\n        # gymnasium v26 requires users to set seed while resetting the environment\n        obs, info = wrapped_env.reset(seed=seed)\n\n        done = False\n        while not done:\n            action = agent.sample_action(obs)\n\n            # Step return type - `tuple[ObsType, SupportsFloat, bool, bool, dict[str, Any]]`\n            # These represent the next observation, the reward from the step,\n            # if the episode is terminated, if the episode is truncated and\n            # additional info from the step\n            obs, reward, terminated, truncated, info = wrapped_env.step(action)\n            agent.rewards.append(reward)\n\n            # End the episode when either truncated or terminated is true\n            #  - truncated: The episode duration reaches max number of timesteps\n            #  - terminated: Any of the state space values is no longer finite.\n            done = terminated or truncated\n\n        reward_over_episodes.append(wrapped_env.return_queue[-1])\n        agent.update()\n\n        if episode % 1000 == 0:\n            avg_reward = int(np.mean(wrapped_env.return_queue))\n            print(\"Episode:\", episode, \"Average Reward:\", avg_reward)\n\n    rewards_over_seeds.append(reward_over_episodes)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Plot learning curve\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rewards_to_plot = [[reward[0] for reward in rewards] for rewards in rewards_over_seeds]\ndf1 = pd.DataFrame(rewards_to_plot).melt()\ndf1.rename(columns={\"variable\": \"episodes\", \"value\": \"reward\"}, inplace=True)\nsns.set(style=\"darkgrid\", context=\"talk\", palette=\"rainbow\")\nsns.lineplot(x=\"episodes\", y=\"reward\", data=df1).set(\n    title=\"REINFORCE for InvertedPendulum-v4\"\n)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<img src=\"file://_static/img/tutorials/reinforce_invpend_gym_v26_fig4.png\">\n\nAuthor: Siddarth Chandrasekar\n\nLicense: MIT License\n\n## References\n\n[1] Williams, Ronald J.. \u201cSimple statistical gradient-following\nalgorithms for connectionist reinforcement learning.\u201d Machine Learning 8\n(2004): 229-256.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.20"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}