RL.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploitation-Exploration tradeoff\n",
    "\n",
    "## Multi-armed Bandits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Read Chapter 2 from \"[Reinforcement Learing - An intruction](http://incompleteideas.net/book/RLbook2020trimmed.pdf)\" (Sutton and Barto, 2018), and focus on sections 2.1, 2.2, 2.3, 2.6, 2.7 and 2.10. [This is a legal copy from one of the authors.]\n",
    "\n",
    "  - A high level overview of the chapter can be gained by watching this video: https://www.youtube-nocookie.com/embed/9LhNHK1ULxs?start=5\n",
    "    \n",
    "- Study the Chapter 2 code, reproduced below, from \"[Re-implementations in Python by Shangtong Zhang](http://incompleteideas.net/book/code/code2nd.html)\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tasks for assessment are:\n",
    "1. Set the **random-numbe-generator seed** to be your student ID. (See `np.random.seed(.....)` below.)\n",
    "2. Choose a value for $k$ from the set $\\{7, 8, 9, 10, 11, 12\\}$ (i.e. `k_arm=....`).\n",
    "3. Devise and run 5 computational experiments to study the effect of the parameters `epsilon`, `initial`, `step_size`, `sample_averages`, `UCB_param`, `gradient`.\n",
    "\n",
    "    For each experiment:\n",
    "    - Explain what its aim is.\n",
    "    - Explain what parameters are being used, and what they are meant to control.\n",
    "    - Use diagrams to show your results, then discuss them.\n",
    "    \n",
    "    Your experiments must be sufficiently distinct from those presented below (which reproduce experiments in the book). For example, you may try a wider range of $\\varepsilon$ values to try and find an optimal value. You should also look into the initial distributions of the bandits' values.\n",
    "    \n",
    "**NB** Note that the values for `MAX_RUNS = 100` `MAX_TIME = 300` are low to make the code faster. You will need to increase these at the final run to get smoother results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Marking scheme\n",
    "\n",
    "|Item|Mark|\n",
    "|:----|---:|\n",
    "|Experimet 1|/4|\n",
    "|Experimet 2|/4|\n",
    "|Experimet 3|/4|\n",
    "|Experimet 4|/4|\n",
    "|Experimet 5|/4|\n",
    "|||\n",
    "|**Total**:     |/20|\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:17.835969Z",
     "start_time": "2022-10-25T11:31:17.045793Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numpy.random import rand, randn, choice\n",
    "\n",
    "np.random.seed(123456)   ##   ---   SET THIS TO YOUR SID   ---   ##"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:17.851583Z",
     "start_time": "2022-10-25T11:31:17.835969Z"
    }
   },
   "outputs": [],
   "source": [
    "##   ---   Increase these values at the last run to get smoother statistics   ---   ##\n",
    "MAX_RUNS = 100\n",
    "MAX_TIME = 300"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:19.782382Z",
     "start_time": "2022-10-25T11:31:17.851583Z"
    }
   },
   "outputs": [],
   "source": [
    "# Adapted by Kamal Bentahar (2022) from:\n",
    "# https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter02/ten_armed_testbed.py\n",
    "\n",
    "#######################################################################\n",
    "# Copyright (C)                                                       #\n",
    "# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n",
    "# 2016 Tian Jun(tianjun.cpp@gmail.com)                                #\n",
    "# 2016 Artem Oboturov(oboturov@gmail.com)                             #\n",
    "# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n",
    "# Permission given to modify the code as long as you keep this        #\n",
    "# declaration at the top                                              #\n",
    "#######################################################################\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from tqdm import trange"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:19.828263Z",
     "start_time": "2022-10-25T11:31:19.782382Z"
    }
   },
   "outputs": [],
   "source": [
    "class Bandit:\n",
    "    # @k_arm:             number of arms\n",
    "    # @epsilon:           probability for exploration in epsilon-greedy algorithm\n",
    "    # @initial:           initial estimation for each action\n",
    "    # @step_size:         constant step size for updating estimations\n",
    "    # @sample_averages:   if True, use sample averages to update estimations instead of constant step size\n",
    "    # @UCB_param:         if not None, use UCB algorithm to select action\n",
    "    # @gradient:          if True, use gradient based bandit algorithm\n",
    "    # @gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm\n",
    "\n",
    "    def __init__(self, k_arm=7, epsilon=0.0, initial=0.0, step_size=0.1, sample_averages=False,\n",
    "                 UCB_param=None, gradient=False, gradient_baseline=False, true_reward=0.0):\n",
    "        self.k = k_arm\n",
    "        self.step_size = step_size\n",
    "        self.sample_averages = sample_averages\n",
    "        self.indices = np.arange(self.k)\n",
    "        self.time = 0\n",
    "        self.UCB_param = UCB_param\n",
    "        self.gradient = gradient\n",
    "        self.gradient_baseline = gradient_baseline\n",
    "        self.average_reward = 0\n",
    "        self.true_reward = true_reward\n",
    "        self.epsilon = epsilon\n",
    "        self.initial = initial\n",
    "\n",
    "    def reset(self):        \n",
    "        self.q_true = randn(self.k) + self.true_reward  # real reward for each action\n",
    "        self.q_estimation = np.zeros(self.k) + self.initial  # estimation for each action\n",
    "        self.action_count = np.zeros(self.k)  # number of chosen times for each action\n",
    "        self.best_action = np.argmax(self.q_true)\n",
    "        self.time = 0\n",
    "\n",
    "    def act(self):\n",
    "        ''' Get an action for this bandit '''\n",
    "        if rand() < self.epsilon:\n",
    "            return choice(self.indices)\n",
    "\n",
    "        if self.UCB_param is not None:\n",
    "            UCB_estimation = self.q_estimation\n",
    "            UCB_estimation += self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5))\n",
    "            q_best = np.max(UCB_estimation)\n",
    "            return choice(np.where(UCB_estimation == q_best)[0])\n",
    "\n",
    "        if self.gradient:\n",
    "            exp_est = np.exp(self.q_estimation)\n",
    "            self.action_prob = exp_est / np.sum(exp_est)\n",
    "            return choice(self.indices, p=self.action_prob)\n",
    "\n",
    "        q_best = np.max(self.q_estimation)\n",
    "        return choice(np.where(self.q_estimation == q_best)[0])\n",
    "\n",
    "    def step(self, action):\n",
    "        ''' Take an action, update estimation for this action '''\n",
    "        # generate the reward under N(real reward, 1)\n",
    "        reward = randn() + self.q_true[action]\n",
    "        self.time += 1\n",
    "        self.action_count[action] += 1\n",
    "        self.average_reward += (reward - self.average_reward) / self.time\n",
    "        if self.sample_averages:  # update estimation using sample averages\n",
    "            self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]\n",
    "        elif self.gradient:\n",
    "            one_hot = np.zeros(self.k)\n",
    "            one_hot[action] = 1\n",
    "            baseline = self.average_reward if self.gradient_baseline else 0\n",
    "            self.q_estimation += self.step_size * (reward - baseline) * (one_hot - self.action_prob)\n",
    "        else:  # update estimation with constant step size\n",
    "            self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])\n",
    "        return reward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:19.859949Z",
     "start_time": "2022-10-25T11:31:19.833259Z"
    }
   },
   "outputs": [],
   "source": [
    "def simulate(runs, time, bandits):\n",
    "    ''' Returns: mean_best_action_counts, mean_rewards '''\n",
    "    rewards = np.zeros((len(bandits), runs, time))\n",
    "    best_action_counts = np.zeros(rewards.shape)\n",
    "    for i, bandit in enumerate(bandits):\n",
    "        for r in trange(runs):\n",
    "            bandit.reset()\n",
    "            for t in range(time):\n",
    "                action = bandit.act()\n",
    "                reward = bandit.step(action)\n",
    "                rewards[i, r, t] = reward\n",
    "                if action == bandit.best_action:\n",
    "                    best_action_counts[i, r, t] = 1\n",
    "    mean_best_action_counts = best_action_counts.mean(axis=1)\n",
    "    mean_rewards = rewards.mean(axis=1)\n",
    "    return mean_best_action_counts, mean_rewards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_1.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 1 \n",
    "\n",
    "$$\n",
    "\n",
    "For K-armed bandit, this is the 7-armed test bed\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:31:20.525782Z",
     "start_time": "2022-10-25T11:31:19.859949Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_1(k=7):\n",
    "    plt.figure(figsize=(12, 3))\n",
    "    plt.violinplot(dataset=randn(200, k) + randn(k))\n",
    "    plt.xlabel(\"Action\")\n",
    "    plt.ylabel(\"Reward distribution\")\n",
    "    plt.show()\n",
    "figure_2_1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 1\n",
    "\n",
    "The Aim of this experiment is to understand the K-Arm bandit problem and results of various Re-enforcement Learning Algorithms, in this experiment we study various greedy methods\n",
    "\n",
    "For K_arm = 7\n",
    "\n",
    "Max Time = 1000 and Max Runs = 300, Please see figure below. \n",
    "\n",
    "In the figure we see Average performance of ε-greedy action-value methods on the 10-armed testbed.\n",
    "These data are averages over 300 runs with different bandit problems. We see comparisons of greedy method with the ε-greedy methods for ε = 0.10 and ε = 0.01, we see for optimum action the greedy approach took less time to find the optimum strategy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_2.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 2\n",
    "\n",
    "$$\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 200/200 [01:01<00:00,  3.25it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 200/200 [00:59<00:00,  3.39it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 200/200 [01:01<00:00,  3.24it/s]\n",
    "\n",
    "$$\n",
    "\n",
    "Terminal \\space Output\n",
    "\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:35:51.969695Z",
     "start_time": "2022-10-25T11:31:20.525782Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_2(runs=MAX_RUNS, time=MAX_TIME):\n",
    "    epsilons = [0, 0.1, 0.01]\n",
    "    bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]\n",
    "    best_action_counts, rewards = simulate(runs, time, bandits)\n",
    "\n",
    "    plt.figure(figsize=(12, 6))\n",
    "\n",
    "    plt.subplot(2, 1, 1)\n",
    "    for eps, rewards in zip(epsilons, rewards):\n",
    "        plt.plot(rewards, label=f'$\\epsilon = {eps:.02f}$')\n",
    "    plt.xlabel('steps')\n",
    "    plt.ylabel('average reward')\n",
    "    plt.legend()\n",
    "\n",
    "    plt.subplot(2, 1, 2)\n",
    "    for eps, counts in zip(epsilons, best_action_counts):\n",
    "        plt.plot(counts, label=f'$\\epsilon = {eps:.02f}$')\n",
    "    plt.xlabel('steps')\n",
    "    plt.ylabel('% optimal action')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "figure_2_2(200, 10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 2\n",
    "\n",
    "In this experiment, we see The effect of optimistic initial action-value estimates on the 10-armed testbed.\n",
    "Both methods used a constant step-size parameter = 0.1. we provide the optimistic initial values as q=5 to the greedy method and for ε=0.1 for ε-greedy method the initial value stays at 0. \n",
    "\n",
    "As seen in the figure below we see initially greedy method is better but over time of 1000, ε=0.1 method gives a lower % optimal action "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_3.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 3\n",
    "\n",
    "$$\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:11<00:00, 29.22it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:11<00:00, 29.77it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:15<00:00, 21.50it/s]\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:35:54.919508Z",
     "start_time": "2022-10-25T11:35:51.969695Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_3(runs=MAX_RUNS, time=MAX_TIME):\n",
    "    bandits = []\n",
    "    bandits.append(Bandit(epsilon=0, initial=5, step_size=0.1))\n",
    "    bandits.append(Bandit(epsilon=0.1, initial=0, step_size=0.1))\n",
    "    best_action_counts, _ = simulate(runs, time, bandits)\n",
    "\n",
    "    plt.figure(figsize=(12, 3))\n",
    "    plt.plot(best_action_counts[0], label='$\\epsilon = 0, q = 5$')\n",
    "    plt.plot(best_action_counts[1], label='$\\epsilon = 0.1, q = 0$')\n",
    "    plt.xlabel('Steps')\n",
    "    plt.ylabel('% optimal action')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "figure_2_3()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 3\n",
    "\n",
    "In this experiment we repeat the same method as above, however we keep the step size as 0.1 for both methods with greedy and ε-greedy, on the bandit problem for a 10-armed bed\n",
    "\n",
    "As seen in the result below, ε-greedy approach reaches an optimum result quicker that the greedy method "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_4.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 4\n",
    "\n",
    "$$\n",
    "\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:10<00:00, 31.91it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:10<00:00, 32.83it/s]\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:35:58.305039Z",
     "start_time": "2022-10-25T11:35:54.919508Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_4(runs=MAX_RUNS, time=MAX_TIME):\n",
    "    bandits = []\n",
    "    bandits.append(Bandit(epsilon=0, UCB_param=2, sample_averages=True))\n",
    "    bandits.append(Bandit(epsilon=0.1, sample_averages=True))\n",
    "    _, average_rewards = simulate(runs, time, bandits)\n",
    "\n",
    "    plt.figure(figsize=(12, 3))\n",
    "    plt.plot(average_rewards[0], label='UCB $c = 2$')\n",
    "    plt.plot(average_rewards[1], label='epsilon greedy $\\epsilon = 0.1$')\n",
    "    plt.xlabel('Steps')\n",
    "    plt.ylabel('Average reward')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "figure_2_4()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 4\n",
    "\n",
    "In this experiment, we compare the average performance of UCB action selection on the 10-armed testbed. \n",
    "\n",
    "As seen in the results below, UCB generally performs better than ε-greedy action selection, except in the first k steps, when it selects randomly among the actions that are not tried before"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_5.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 5\n",
    "\n",
    "$$\n",
    "\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:21<00:00, 15.41it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:19<00:00, 16.69it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:20<00:00, 16.49it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:18<00:00, 17.78it/s]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:36:02.554756Z",
     "start_time": "2022-10-25T11:35:58.305039Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_5(runs=MAX_RUNS, time=MAX_TIME):\n",
    "    bandits = []\n",
    "    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4))\n",
    "    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4))\n",
    "    best_action_counts, _ = simulate(runs, time, bandits)\n",
    "    labels = [r'$\\alpha = 0.1$, with baseline',\n",
    "              r'$\\alpha = 0.4$, with baseline'\n",
    "             ]\n",
    "\n",
    "    plt.figure(figsize=(12, 3))\n",
    "    for i in range(len(bandits)):\n",
    "        plt.plot(best_action_counts[i], label=labels[i])\n",
    "    plt.xlabel('Steps')\n",
    "    plt.ylabel('% Optimal action')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "figure_2_5()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 5\n",
    "\n",
    "In this experiment, we see average performance of the gradient bandit algorithm with and without a reward\n",
    "baseline on the 10-armed testbed when the q(a) are chosen to be near 4 rather than near zero.\n",
    "\n",
    "As seen in the figure below , with a gradient baseline present , the optimal action is reached quicker\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_6.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 6\n",
    "\n",
    "$$\n",
    "\n",
    "\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:18<00:00, 17.78it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:14<00:00, 22.15it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:25<00:00, 12.77it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:28<00:00, 11.64it/s]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-10-25T11:36:44.335468Z",
     "start_time": "2022-10-25T11:36:02.554756Z"
    }
   },
   "outputs": [],
   "source": [
    "def figure_2_6(runs=MAX_RUNS, time=MAX_TIME):\n",
    "    labels = ['epsilon-greedy', 'gradient bandit',\n",
    "              'UCB', 'optimistic initialization']\n",
    "    generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),\n",
    "                  lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),\n",
    "                  lambda coef: Bandit(epsilon=0, UCB_param=coef, sample_averages=True),\n",
    "                  lambda initial: Bandit(epsilon=0, initial=initial, step_size=0.1)]\n",
    "    parameters = [np.arange(-7, -1, dtype=float),\n",
    "                  np.arange(-5, 2, dtype=float),\n",
    "                  np.arange(-4, 3, dtype=float),\n",
    "                  np.arange(-2, 3, dtype=float)]\n",
    "\n",
    "    bandits = []\n",
    "    for generator, parameter in zip(generators, parameters):\n",
    "        for param in parameter:\n",
    "            bandits.append(generator(2**param))\n",
    "\n",
    "    _, average_rewards = simulate(runs, time, bandits)\n",
    "    rewards = np.mean(average_rewards, axis=1)\n",
    "\n",
    "    plt.figure(figsize=(12, 3))\n",
    "    i = 0\n",
    "    for label, parameter in zip(labels, parameters):\n",
    "        l = len(parameter)\n",
    "        plt.plot(parameter, rewards[i:i+l], label=label)\n",
    "        i += l\n",
    "    plt.xlabel('Parameter ($2^x$)')\n",
    "    plt.ylabel('Average reward')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "figure_2_6()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary \n",
    "\n",
    "In the below figure we see a summary of the different types bandit algorithms which we tested in the the experiments (K_arm = 7, Max Run = 330, Max time = 1000) above.\n",
    "Each point is the average reward obtained over 330 steps with a particular algorithm at a\n",
    "particular setting of its parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](RLImages/Figure_7.png)\n",
    "\n",
    "\n",
    "$$\n",
    "\n",
    "Figure \\space 7\n",
    "\n",
    "$$\n",
    "\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:21<00:00, 15.53it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:19<00:00, 16.65it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:15<00:00, 21.93it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:11<00:00, 28.16it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:11<00:00, 29.43it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:11<00:00, 29.15it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:13<00:00, 25.17it/s]\n",
    "100%|███████████████████████████████████████████████████████████████████████| 330/330 [00:14<00:00, 22.75it/s\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "The various reinforcement learning application was studied using the bandit algorithms with various experiments.  \n",
    "\n",
    "Please refer to RL.py for python code. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# List of references\n",
    "\n",
    "Reinforcement learning : an introduction / Richard S. Sutton and Andrew G. Barto.\n",
    "Description: Second edition. | Cambridge, MA : The MIT Press, [2018] http://incompleteideas.net/book/RLbook2020trimmed.pdf\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.11.0 ('venv': venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "9646fcfabfca22912ce5fe7fa2239f453c97b6dafcc6a8d175371d4d5afbb8ca"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}