Reinforcement Learning

Introduction

Monte Carlo Methods

  • What if we don’t know the transition probabilities $p(s,a,s')$?
  • We can still estimate the long-term reward

What does Monte Carlo mean?

  1. simulate a system for a period of time
  2. note down the result
  3. repeat from step 1 until enough results have been obtained
  4. plot the distribution of results, as in the sketch below
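
A minimal sketch of these four steps in Python, using the total of two dice as the toy system rather than an RL environment:

```python
import random
from collections import Counter

def simulate_once():
    # step 1: simulate the system (here, roll two six-sided dice)
    return random.randint(1, 6) + random.randint(1, 6)

def monte_carlo(n_runs=100_000):
    # steps 2-3: note down each result, repeat until enough results are obtained
    results = Counter(simulate_once() for _ in range(n_runs))
    # step 4: report the empirical distribution of results
    for total in sorted(results):
        print(f"{total:2d}: {results[total] / n_runs:.3f}")

monte_carlo()
```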

What does Monte Carlo mean in the context of RL?

  • We want to estimate the state value $v_π(s)$ or the state-action value $q_π(s, a)$
  • BUT our environment may be stochastic
  • So we repeatedly start at state $s$, (and for $q$, perform $a$)
    • then act according to $π$ from then on
    • Note: $π(s)$ may not equal $a$
    • we need to terminate the episode somehow
  • What should we do if we revisit the same state? (First-visit MC uses only the first occurrence per episode; every-visit MC uses them all. The sketch below uses the first-visit rule.)
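
A sketch of first-visit MC prediction of $v_π(s)$, assuming a gym-style interface where `env.reset()` returns a state, `env.step(action)` returns `(next_state, reward, done)`, and `policy(state)` returns an action; these names are illustrative, not a specific library's API:

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, n_episodes, gamma=1.0):
    # Running averages of first-visit returns per state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    v = {}

    for _ in range(n_episodes):
        # Generate one episode by following pi; the episode must terminate.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Index of the first occurrence of each state in this episode.
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)

        # Walk backwards accumulating the return G, updating only first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                v[s] = returns_sum[s] / returns_count[s]
    return v
```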

Exercise: simulate blackjack

Write Python code to simulate blackjack

  • Assume an infinite deck and no discount ($γ=1$)
  • State: current card sum (with or without usable ace), and dealer’s showing card
  • Actions: hit, or stick
  • Reward: +1 for beating the dealer, 0 for a draw, -1 for a loss or bust
  • Use first-visit MC to evaluate the policy where the player sticks only on 20 or 21 (one possible sketch follows)
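
One possible sketch of this exercise. It additionally assumes the common rule that the dealer hits until reaching 17, and does not treat naturals (an ace plus a ten on the first two cards) specially:

```python
import random
from collections import defaultdict

def draw_card():
    # Infinite deck: 1..9, with 10/J/Q/K all counted as 10.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable_ace else (total, False)

def play_episode(policy):
    """Play one hand; return a list of (state, reward) pairs for the player."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    showing = dealer[0]
    states = []

    # Player's turn.
    while True:
        total, usable = hand_value(player)
        if total > 21:                              # bust
            return [(s, -1) for s in states]
        state = (total, usable, showing)
        states.append(state)
        if policy(state) == "stick":
            break
        player.append(draw_card())

    # Dealer's turn (assumed rule: hit until 17 or more).
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total, _ = hand_value(player)
    dealer_total, _ = hand_value(dealer)
    if dealer_total > 21 or player_total > dealer_total:
        reward = 1
    elif player_total == dealer_total:
        reward = 0
    else:
        reward = -1
    return [(s, reward) for s in states]

def stick_on_20_or_21(state):
    total, _, _ = state
    return "stick" if total >= 20 else "hit"

def first_visit_mc_evaluation(n_episodes=500_000):
    # gamma = 1 and the only reward comes at the end, so every state's
    # return within an episode equals the final reward.
    returns = defaultdict(lambda: [0.0, 0])         # state -> [sum of returns, count]
    for _ in range(n_episodes):
        seen = set()
        for state, g in play_episode(stick_on_20_or_21):
            if state not in seen:                   # first visit only
                seen.add(state)
                returns[state][0] += g
                returns[state][1] += 1
    return {s: total / n for s, (total, n) in returns.items()}

v = first_visit_mc_evaluation()
print(v.get((20, False, 10)))   # value of holding 20 against a dealer 10, no usable ace
```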

Monte Carlo Control

  • If you can find the state-action value $q_π(s,a)$, then logically, you can adjust the policy to maximise it: $$ π(s) ← \arg\max_a q_π(s,a) $$
  • Exploration can be achieved with
    • Exploring starts (randomly select the starting state and first action)
    • ε-soft $$ π(a | s) = \begin{cases} 1 - ε + ε / |A| & \text{if } a = a_* \\ ε / |A| & \text{if } a ≠ a_* \end{cases} $$
  • Note that we again need to take care of repeated state-action pairs within an episode, as in the sketch below
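
A sketch of on-policy first-visit MC control with an ε-soft (ε-greedy) policy, again assuming the same illustrative gym-style `env.reset()`/`env.step()` interface:

```python
import random
from collections import defaultdict

def mc_control_epsilon_soft(env, actions, n_episodes, epsilon=0.1, gamma=1.0):
    Q = defaultdict(float)                     # (state, action) -> estimated value
    n = defaultdict(int)                       # visit counts for incremental means

    def policy(state):
        # epsilon-soft: the greedy action gets probability 1 - eps + eps/|A|,
        # every other action gets eps/|A|.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First occurrence of each (state, action) pair in this episode.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)

        # Evaluation and improvement: update Q towards first-visit returns;
        # the policy improves implicitly because it is greedy w.r.t. Q.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:
                n[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n[(s, a)]
    return Q, policy
```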

Off-policy versus on-policy

  • When we learn about a policy and improve that same policy from what we learn, this is on-policy learning
  • If we instead try to find the optimal policy without updating the (behaviour) policy that generates the data we learn from, this is off-policy learning
    • An example is Facebook recommender systems
    • ReAgent specialises in off-policy learning

Importance sampling

  • When we use MC to estimate the state-action value $q_π(s,a)$, we are sampling
  • Sampling is biased if the samples are drawn from a distribution other than the one we actually care about
  • e.g., if we want to know what proportion of people in the world have beards and we only sample men, we get a biased sample; importance weights correct for this, as sketched below
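
A toy illustration of importance sampling (outside the RL setting): we want the mean face of a fair die (the target distribution) but can only sample from a biased die (the behaviour distribution); weighting each sample by $p(x)/q(x)$ corrects the bias:

```python
import random

faces = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in faces}                          # target: fair die
q = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # behaviour: favours sixes

n_samples = 100_000
naive = weighted = 0.0
for _ in range(n_samples):
    x = random.choices(faces, weights=[q[f] for f in faces])[0]
    w = p[x] / q[x]                                    # importance weight
    naive += x
    weighted += w * x

print("naive estimate from biased samples:", naive / n_samples)     # about 4.5
print("importance-sampled estimate:       ", weighted / n_samples)  # about 3.5
```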