Reinforcement Learning

Introduction

Monte Carlo Methods

  • What if we don’t know the transition probabilities $p(s,a,s')$?
  • We can still estimate the long-term reward

What does Monte Carlo mean?

  1. simulate a system for a period of time
  2. note down the result
  3. repeat from step 1 until enough results have been obtained
  4. plot the distribution of results, as in the sketch below
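
A minimal sketch of these four steps in Python, using the total of two dice as the toy system rather than an RL environment:

```python
import random
from collections import Counter

def simulate_once():
    # step 1: simulate the system (here, roll two six-sided dice)
    return random.randint(1, 6) + random.randint(1, 6)

def monte_carlo(n_runs=100_000):
    # steps 2-3: note down each result, repeat until enough results are obtained
    results = Counter(simulate_once() for _ in range(n_runs))
    # step 4: report the empirical distribution of results
    for total in sorted(results):
        print(f"{total:2d}: {results[total] / n_runs:.3f}")

monte_carlo()
```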

What does Monte Carlo mean in the context of RL?

  • We want to estimate the state value $v_π(s)$ or the state-action value $q_π(s, a)$
  • BUT our environment may be stochastic
  • So we repeatedly start at state $s$, (and for $q$, perform $a$)
    • then act according to $π$ from then on
    • Note: $π(s)$ may not equal $a$
    • we need to terminate the episode somehow
  • What should we do if we revisit the same state? (First-visit MC uses only the first occurrence per episode; every-visit MC uses them all. The sketch below uses the first-visit rule.)
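
A sketch of first-visit MC prediction of $v_π(s)$, assuming a gym-style interface where `env.reset()` returns a state, `env.step(action)` returns `(next_state, reward, done)`, and `policy(state)` returns an action; these names are illustrative, not a specific library's API:

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, n_episodes, gamma=1.0):
    # Running averages of first-visit returns per state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    v = {}

    for _ in range(n_episodes):
        # Generate one episode by following pi; the episode must terminate.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Index of the first occurrence of each state in this episode.
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)

        # Walk backwards accumulating the return G, updating only first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                v[s] = returns_sum[s] / returns_count[s]
    return v
```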

Exercise: simulate blackjack

Write Python code to simulate blackjack

  • Assume an infinite deck and no discount ($γ=1$)
  • State: current card sum (with or without usable ace), and dealer’s showing card
  • Actions: hit, or stick
  • Reward: +1 for beating the dealer, 0 for a draw, -1 for a loss or bust
  • Use first-visit MC to evaluate the policy where the player sticks only on 20 or 21 (one possible sketch follows)
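
One possible sketch of this exercise. It additionally assumes the common rule that the dealer hits until reaching 17, and does not treat naturals (an ace plus a ten on the first two cards) specially:

```python
import random
from collections import defaultdict

def draw_card():
    # Infinite deck: 1..9, with 10/J/Q/K all counted as 10.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable_ace else (total, False)

def play_episode(policy):
    """Play one hand; return a list of (state, reward) pairs for the player."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    showing = dealer[0]
    states = []

    # Player's turn.
    while True:
        total, usable = hand_value(player)
        if total > 21:                              # bust
            return [(s, -1) for s in states]
        state = (total, usable, showing)
        states.append(state)
        if policy(state) == "stick":
            break
        player.append(draw_card())

    # Dealer's turn (assumed rule: hit until 17 or more).
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total, _ = hand_value(player)
    dealer_total, _ = hand_value(dealer)
    if dealer_total > 21 or player_total > dealer_total:
        reward = 1
    elif player_total == dealer_total:
        reward = 0
    else:
        reward = -1
    return [(s, reward) for s in states]

def stick_on_20_or_21(state):
    total, _, _ = state
    return "stick" if total >= 20 else "hit"

def first_visit_mc_evaluation(n_episodes=500_000):
    # gamma = 1 and the only reward comes at the end, so every state's
    # return within an episode equals the final reward.
    returns = defaultdict(lambda: [0.0, 0])         # state -> [sum of returns, count]
    for _ in range(n_episodes):
        seen = set()
        for state, g in play_episode(stick_on_20_or_21):
            if state not in seen:                   # first visit only
                seen.add(state)
                returns[state][0] += g
                returns[state][1] += 1
    return {s: total / n for s, (total, n) in returns.items()}

v = first_visit_mc_evaluation()
print(v.get((20, False, 10)))   # value of holding 20 against a dealer 10, no usable ace
```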

Monte Carlo Control

  • If you can find the state-action value $q_π(s,a)$, then logically, you can adjust the policy to maximise it: $$ π(s) ← \arg\max_a q_π(s,a) $$
  • Exploration can be achieved with
    • Exploring starts (randomly select the starting state and first action)
    • ε-soft $$ π(a | s) = \begin{cases} 1 - ε + ε / |A| & \text{if } a = a_* \\ ε / |A| & \text{if } a ≠ a_* \end{cases} $$
  • Note that we again need to take care of repeated state-action pairs within an episode, as in the sketch below
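
A sketch of on-policy first-visit MC control with an ε-soft (ε-greedy) policy, again assuming the same illustrative gym-style `env.reset()`/`env.step()` interface:

```python
import random
from collections import defaultdict

def mc_control_epsilon_soft(env, actions, n_episodes, epsilon=0.1, gamma=1.0):
    Q = defaultdict(float)                     # (state, action) -> estimated value
    n = defaultdict(int)                       # visit counts for incremental means

    def policy(state):
        # epsilon-soft: the greedy action gets probability 1 - eps + eps/|A|,
        # every other action gets eps/|A|.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First occurrence of each (state, action) pair in this episode.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)

        # Evaluation and improvement: update Q towards first-visit returns;
        # the policy improves implicitly because it is greedy w.r.t. Q.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:
                n[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n[(s, a)]
    return Q, policy
```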

Off-policy versus on-policy

  • When we learn about a policy and improve that same policy from what we learn, this is on-policy learning
  • If we instead try to find the optimal policy without updating the (behaviour) policy that generates the data we learn from, this is off-policy learning
    • An example is Facebook recommender systems
    • ReAgent specialises in off-policy learning

Importance sampling

  • When we use MC to estimate the state-action value $q_π(s,a)$, we are sampling
  • Sampling is biased if the samples are drawn from a distribution other than the one we actually care about
  • e.g., if we want to know what proportion of people in the world have beards and we only sample men, we get a biased sample; importance weights correct for this, as sketched below
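
A toy illustration of importance sampling (outside the RL setting): we want the mean face of a fair die (the target distribution) but can only sample from a biased die (the behaviour distribution); weighting each sample by $p(x)/q(x)$ corrects the bias:

```python
import random

faces = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in faces}                          # target: fair die
q = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # behaviour: favours sixes

n_samples = 100_000
naive = weighted = 0.0
for _ in range(n_samples):
    x = random.choices(faces, weights=[q[f] for f in faces])[0]
    w = p[x] / q[x]                                    # importance weight
    naive += x
    weighted += w * x

print("naive estimate from biased samples:", naive / n_samples)     # about 4.5
print("importance-sampled estimate:       ", weighted / n_samples)  # about 3.5
```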