Reinforcement Learning

Aim

The aim of this lab session is to code a basic algorithm, such as Sarsa or Q-learning, to learn a policy for FrozenLake-v1.

Note that you might need to consult the previous lab sheet to find out how to get started with FrozenLake. We assume that the 4x4 map is used.

Sarsa (on-policy TD control)

[Figure: Sarsa (on-policy TD control) pseudocode (figures/sarsa.png)]
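
For reference, the core tabular Sarsa update, which the code below mirrors, is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma\, Q(s', a') - Q(s, a) \bigr]$$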

Implement in Python

Try to keep a close correspondence between the algorithm and your code. Note: the policy function is not implemented here; you will need to write that yourself (one possible sketch is shown after the code below). Also, this code is untested and may not work, so check it carefully against the pseudocode.

import gymnasium as gym   # assuming the Gymnasium API from the previous lab sheet
import numpy as np

env = gym.make("FrozenLake-v1")        # default 4x4 map
N_STATES = env.observation_space.n     # 16 states on the 4x4 map
N_ACTIONS = env.action_space.n         # 4 actions: left, down, right, up

alpha = 0.1    # learning rate (example value)
gamma = 0.99   # discount factor (example value)

Q = np.zeros((N_STATES, N_ACTIONS))

for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)   # policy() is not provided; write it yourself

    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: move Q(s, a) towards r + gamma * Q(s', a')
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break
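
One possible way to fill in policy() is an ε-greedy rule over the current Q estimates. The sketch below is only an illustration; the epsilon value and the argmax tie-breaking are choices you may well want to change.

import numpy as np

epsilon = 0.1   # exploration rate (example value)

def policy(s, Q):
    """Epsilon-greedy action selection over the current Q estimates."""
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: uniformly random action
    return int(np.argmax(Q[s]))                # exploit: greedy action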

Evaluation

  • A simple way to evaluate the policy learned so far is to keep track of the reward per episode.
  • For some problems, it makes more sense to use the average reward per step.
  • Generally, learning should be turned off and $ε$ should be set to zero during evaluation.
  • Given that this signal may be noisy, it is recommended to apply some form of smoothing (a sketch is given below this list).

See also https://github.com/google-research/rliable
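
A minimal sketch of one way to do this, assuming the env and Q from the code above; the function names evaluate and moving_average are introduced here for illustration:

import numpy as np

def evaluate(env, Q, n_episodes=100):
    """Run greedy episodes with learning off and epsilon = 0,
    returning the total reward of each episode."""
    returns = []
    for _ in range(n_episodes):
        s, info = env.reset()
        total = 0.0
        while True:
            a = int(np.argmax(Q[s]))   # greedy action, no exploration
            s, reward, terminated, truncated, info = env.step(a)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return returns

def moving_average(x, window=20):
    """Simple sliding-window smoothing of a noisy reward signal."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")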

Experiments to try

  • Try varying $α$ and $ε$.
  • Can you graph episode reward (after, say, 100 episodes) versus $α$? One possible way to organise such a sweep is sketched below.
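
The sketch below assumes you have wrapped the Sarsa loop above in a function train_sarsa(alpha, epsilon, n_episodes) that returns the learned Q table (the name and signature are hypothetical), and reuses the evaluate function from the Evaluation section:

import numpy as np
import matplotlib.pyplot as plt

alphas = [0.05, 0.1, 0.2, 0.4, 0.8]
scores = []
for alpha in alphas:
    # train_sarsa is a hypothetical wrapper around the Sarsa loop above
    Q = train_sarsa(alpha=alpha, epsilon=0.1, n_episodes=100)
    # evaluate with learning off and epsilon = 0 (see the Evaluation section)
    scores.append(np.mean(evaluate(env, Q, n_episodes=20)))

plt.plot(alphas, scores, marker="o")
plt.xlabel("alpha")
plt.ylabel("Mean evaluation reward")
plt.show()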