Reinforcement Learning
Function approximation
- What if we don’t have a finite (or at least small) number of states?
- We’ve also noticed that nearby states often have close values
- We could make a fine mesh over possible states
- We could linearly interpolate
- We could use a generic function approximator
Function approximator
- For some set of weights $\vec{w} \in \mathbb{R}^d$, we write
$$
\hat{v}(s, \vec{w}) \approx v_\pi(s)
$$
- Our objective is to minimise the error
$$
\overline{\mathrm{VE}}(\vec{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_\pi(s) - \hat{v}(s,\vec{w})\right]^2
$$
- Note that this is a weighted mean with weights $\mu(s)$
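The objective can be sketched numerically. In this minimal example the tiny state space, the true values `v_pi`, the weighting `mu`, and the tabular `v_hat` are all illustrative assumptions, not part of any particular environment:

```python
import numpy as np

def value_error(w, states, v_pi, v_hat, mu):
    # VE(w) = sum over s of mu(s) * [v_pi(s) - v_hat(s, w)]^2
    return sum(mu[s] * (v_pi[s] - v_hat(s, w)) ** 2 for s in states)

# Illustrative three-state problem with one weight per state,
# so v_hat(s, w) = w[s] (a tabular special case of an approximator)
states = [0, 1, 2]
v_pi = {0: 1.0, 1: 2.0, 2: 3.0}   # hypothetical true values
mu = {0: 0.5, 1: 0.3, 2: 0.2}    # state-visitation weighting
v_hat = lambda s, w: w[s]

w_exact = np.array([1.0, 2.0, 3.0])
ve = value_error(w_exact, states, v_pi, v_hat, mu)  # 0.0 for exact weights
```

With exact weights the error is zero; perturbing any `w[s]` increases VE in proportion to `mu[s]`, which is the sense in which frequently visited states matter more.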
Stochastic Gradient Descent
- SGD moves the weights a small amount in the direction of the negative of the gradient
$$
\vec{w}_{t+1} \doteq \vec{w}_t - \frac{1}{2}\alpha \nabla \left[ v_\pi(S_t) - \hat{v}(S_t, \vec{w}_t)\right]^2
$$
$$
= \vec{w}_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t, \vec{w}_t)\right] \nabla \hat{v}(S_t, \vec{w}_t)
$$
- Note that $\nabla f(\vec{w})$ means the vector of partial derivatives
$$
\nabla f(\vec{w}) \doteq \left( \frac{\partial f(\vec{w})}{\partial w_1}, \frac{\partial f(\vec{w})}{\partial w_2}, \cdots, \frac{\partial f(\vec{w})}{\partial w_d} \right)^\top
$$
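One SGD step can be sketched as follows, assuming a linear approximator (so that the gradient of v-hat is just the feature vector) and using `v_target` as a stand-in for the unknown $v_\pi(S_t)$ (in practice, e.g. a Monte Carlo return):

```python
import numpy as np

def sgd_update(w, x_s, v_target, alpha):
    # w_{t+1} = w_t + alpha * [v_target - v_hat(S_t, w_t)] * grad v_hat
    # For a linear v_hat, grad v_hat(S_t, w_t) = x(S_t)
    v_hat = w @ x_s
    return w + alpha * (v_target - v_hat) * x_s

# Illustrative step: zero weights, a two-feature state, target value 2.0
w = np.zeros(3)
x_s = np.array([1.0, 0.0, 1.0])
w = sgd_update(w, x_s, v_target=2.0, alpha=0.5)
```

Each update only nudges the weights, so noisy targets average out over many steps rather than being fitted exactly.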
Linear Function Approximator
- One simple approximator is
$$
\hat{v}(s,\vec{w}) \doteq \vec{w}^\top \vec{x}(s),
$$
where $\vec{x}(s)$ is $s$ expressed as a feature vector
- The gradient is then simply
$$
\nabla \hat{v}(s, \vec{w}) = \vec{x}(s).
$$
- When we use a linear function approximator, we refer to the algorithm as linear
- e.g., Linear Sarsa, Linear TD
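As a sketch of Linear TD, here is semi-gradient TD(0) for prediction with a linear approximator; the one-hot feature map, step size, discount, and the stubbed transition list are all illustrative assumptions:

```python
import numpy as np

def linear_td0(transitions, x, d, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0) with linear v_hat(s, w) = w @ x(s)."""
    w = np.zeros(d)
    for s, r, s_next, done in transitions:
        # TD target bootstraps from the current estimate of s_next
        target = r if done else r + gamma * (w @ x(s_next))
        # Linear case: grad v_hat(s, w) = x(s)
        w += alpha * (target - w @ x(s)) * x(s)
    return w

# Illustrative two-state episode with one-hot features
x = lambda s: np.eye(2)[s]
transitions = [(0, 1.0, 1, False), (1, 0.0, 0, True)]
w = linear_td0(transitions, x, 2, alpha=0.5, gamma=1.0)
```

With one-hot features this reduces to tabular TD(0); richer feature vectors let nearby states share value estimates.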
The problem of discontinuities
- Consider the game of chess
- Two positions may be very similar (e.g., only one piece different)
- However, one position may lead to checkmate (a win) whereas the other may be a loss
- This is a sharp discontinuity
- For this reason, we want a complex function approximator
The problem of sparseness
- In principle, we can converge on the value of a state-action pair if our search is guaranteed to visit that pair an infinite number of times
- However, as the state-action space grows, it becomes hard to visit every pair even once
- Thus we need a smooth and simple function approximator
Non-linear function approximators
- Linear approximators are simple and convergence proofs are possible
- Non-linear approximators might seem better, especially when designing $\vec{x}(s)$ is difficult
- Artificial Neural Networks might be used
- A problem is that successive samples are not independent of prior samples
- A solution is to use Experience Replay (used by Deep Q-Network (DQN))
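A minimal sketch of an experience-replay buffer in the spirit of DQN; the capacity and batch size here are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples them uniformly at random,
    breaking the temporal correlation between consecutive samples."""

    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest transitions when full
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Illustrative usage: store ten transitions, sample a minibatch of four
buf = ReplayBuffer(capacity=100)
for t in range(10):
    buf.add(t, 0, 1.0, t + 1, False)
batch = buf.sample(4)
```

The learner then performs gradient updates on these randomly drawn minibatches instead of on the most recent transition, which is one of the ingredients that made DQN's non-linear approximation stable.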