Reinforcement Learning
Introduction
Monte Carlo Methods
- What if we don’t know the transition probabilities $p(s,a,s')$?
- We can still estimate the long-term reward
What does Monte Carlo mean?
- simulate a system for a period of time
- note down the result
- repeat from step 1 until enough results obtained
- plot the distribution of results
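The four steps above can be sketched in a few lines of Python; the "system" here is a toy stand-in (the sum of two fair dice, not from the original text):

```python
import random

def simulate_once():
    # Steps 1-2: simulate the system once and note the result.
    # Toy system: the sum of two fair dice.
    return random.randint(1, 6) + random.randint(1, 6)

def monte_carlo(n_runs=100_000):
    # Step 3: repeat until enough results are obtained.
    results = [simulate_once() for _ in range(n_runs)]
    # Step 4: summarise the distribution (here, just its mean).
    return sum(results) / n_runs, results
```

With enough runs the estimated mean converges towards the true value (7 for two dice), which is the same convergence property MC relies on in RL.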
What does Monte Carlo mean in the context of RL?
- We want to estimate the state value $v_π(s)$ or the state-action value $q_π(s, a)$
- BUT our environment may be stochastic
- So we repeatedly start at state $s$ (and for $q$, perform $a$)
  - then act according to $π$ from then on
  - Note: $π(s)$ may not equal $a$
- We need to terminate the episode somehow
- What should we do if we revisit the same state?
Exercise: simulate blackjack
Write Python code to simulate blackjack
- Assume an infinite deck and no discount ($γ=1$)
- State: current card sum (with or without usable ace), and dealer’s showing card
- Actions: hit, or stick
- Reward: +1 for beating the dealer, 0 for a draw, -1 for a loss or bust
- Use first visit MC to evaluate policy where player sticks only on 20 or 21
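One possible sketch of the exercise, under the stated assumptions (infinite deck, $γ=1$, player sticks only on 20 or 21); the dealer's hit-until-17 rule and the card probabilities are standard blackjack conventions assumed here:

```python
import random
from collections import defaultdict

CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]  # ace=1; J/Q/K count as 10

def draw():
    return random.choice(CARDS)  # infinite deck: draws are i.i.d.

def hand_value(cards):
    """Return (total, usable_ace), counting one ace as 11 if it doesn't bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode():
    """One game under the policy 'stick only on 20 or 21'.
    Returns the visited states and the final reward."""
    player = [draw(), draw()]
    dealer_showing = draw()
    dealer = [dealer_showing, draw()]
    states = []
    while True:  # player's turn
        total, usable = hand_value(player)
        states.append((total, usable, dealer_showing))
        if total >= 20:
            break
        player.append(draw())
        if hand_value(player)[0] > 21:
            return states, -1  # bust
    while hand_value(dealer)[0] < 17:  # dealer hits until 17+
        dealer.append(draw())
    d_total = hand_value(dealer)[0]
    p_total = hand_value(player)[0]
    if d_total > 21 or p_total > d_total:
        return states, 1
    return states, 0 if p_total == d_total else -1

def first_visit_mc(n_episodes=50_000):
    """First-visit MC evaluation: with gamma = 1, the return of every
    state in an episode is simply the episode's final reward."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        states, reward = play_episode()
        for s in set(states):  # first visit: count each state once per episode
            returns[s].append(reward)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

values = first_visit_mc(10_000)
```

The returned dictionary maps each state `(player_sum, usable_ace, dealer_showing)` to its estimated value; states where the player holds 21 should come out strongly positive, since sticking on 21 can at worst draw.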
Monte Carlo Control
- If you can find the state-action value $q_π(s,a)$, then logically, you can adjust the policy to maximise it $$ π(s) ← \arg\max_a q_π(s,a) $$
- Exploration can be achieved by:
  - Exploring starts (randomly select the first action)
  - ε-soft policies
$$
π(a | s) =\begin{cases}
1 - ε + ε / |A| & \text{if } a = a_* \\
ε / |A| & \text{if } a ≠ a_* \end{cases}
$$
- Note that we still need to take care of repeated visits to a state (first-visit vs. every-visit)
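Sampling from the ε-soft policy above can be done without computing the probabilities explicitly; a minimal sketch, where `q` is assumed to be a dict keyed by `(state, action)`:

```python
import random

def epsilon_soft_action(q, state, actions, epsilon):
    """Sample an action from the epsilon-soft policy:
    the greedy action a* gets probability 1 - eps + eps/|A|,
    every other action gets eps/|A|."""
    a_star = max(actions, key=lambda a: q.get((state, a), 0.0))
    if random.random() < epsilon:
        return random.choice(actions)  # eps/|A| mass to each action, incl. a*
    return a_star                      # the remaining 1 - eps mass on a*
```

Because `random.choice` may also pick the greedy action, the greedy action's total probability is $1 - ε + ε/|A|$, matching the cases above.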
Off-policy versus on-policy
- When we learn about a policy and update that same policy as we learn, this is on-policy learning
- If we instead try to find the optimal action without updating the policy that we are learning from, this is off-policy learning
- An example is Facebook's recommender systems
  - Facebook's ReAgent library specialises in off-policy learning
Importance sampling
- When we use MC to estimate the state-action value $q_π(s,a)$, that is sampling
- Sampling is biased if it is taken from a different distribution than intended
  - e.g., if we want to know how many people (in the world) wear beards and we only sample men, then we will get a biased sample
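Importance sampling corrects for exactly this mismatch: re-weight each sample by the ratio of the target distribution's density to the behaviour distribution's density. A minimal illustration with toy Gaussians (the distributions are assumptions for the example, not from the original text):

```python
import math
import random

def importance_sampling_mean(n=100_000, seed=0):
    """Estimate the mean of a target Normal(0.5, 1) while only drawing
    samples from a behaviour Normal(0, 1), re-weighting each sample by
    target_pdf / behaviour_pdf so the estimate stays unbiased."""
    rng = random.Random(seed)

    def normal_pdf(x, mu):
        return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # sampled from the "wrong" distribution
        w = normal_pdf(x, 0.5) / normal_pdf(x, 0.0)  # importance weight
        total += w * x
    return total / n
```

Without the weight `w`, the estimate would converge to 0 (the behaviour mean); with it, it converges to the target mean 0.5. In off-policy RL the same ratio is taken over action probabilities of the target and behaviour policies along an episode.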