diff --git a/.gitignore b/.gitignore
index 956e3a6..699c1b7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,3 +28,4 @@
 /.DS_Store
 *.aux
 *.log
+.DS_Store
diff --git a/2023-02-rl/why-rl-exciting.html b/2023-02-rl/why-rl-exciting.html
index 9eea1c1..266f460 100644
--- a/2023-02-rl/why-rl-exciting.html
+++ b/2023-02-rl/why-rl-exciting.html
@@ -23,8 +23,8 @@

James Brusey

Overview

  • What is Reinforcement Learning?
@@ -38,10 +38,10 @@
What is Reinforcement Learning?

helicopter_tail_rotor_thrust_antitorque_compensation.jpeg

@@ -57,9 +57,9 @@
What is Reinforcement Learning?

RLvsML.jpeg

@@ -90,8 +90,8 @@
Some definitions

  • policy—how an agent behaves
@@ -138,9 +138,9 @@ So let's summarise the key aspects of RL
Example: maze with pitfalls
Example problem: Balance a pole
  • State: pole angle, angular momentum, cart position, velocity
  • Actions: force on cart to left or right
  • Reward: +1 for each time step that the pole is upright
@@ -180,9 +180,9 @@ So let's summarise the key aspects of RL
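To make the state/action/reward framing above concrete, here is a minimal sketch of an agent acting at random in a cart-pole environment. It assumes the third-party gymnasium package and its CartPole-v1 task, neither of which is named in the talk.

import gymnasium as gym

# Assumed environment: gymnasium's CartPole-v1 (an illustration, not from the talk)
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)  # obs = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # 0 = push cart left, 1 = push cart right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 for every step the pole stays upright
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()

A learned policy would replace the random action choice; everything else in the loop stays the same.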
Example problem: Playing football
  • States: where am I? other players? ball?
  • Actions: turn, run, pass, shoot, tackle
  • Reward: 1 for win, 0 for draw, -1 for loss
@@ -200,14 +200,14 @@ So let's summarise the key aspects of RL
A Brief History of RL
Where does the term "reinforcement" come from?
TOBY (1951) - W. Grey Walter
Bellman equation (1957) and dynamic programming
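For reference, the recursive form referred to here is the Bellman optimality equation; this is the textbook statement rather than anything taken from the slide itself: \[ V^{*}(s) = \max_{a} \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \right] \]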
Barto, Sutton and Anderson: Actor Critic (1983)

figtmp34.png

sutton-head5.jpg

barto_andrew_crop.jpeg

Charles-Anderson.jpg

@@ -296,10 +296,10 @@ So let's summarise the key aspects of RL
Watkins Q-learning (1989)

cw090311.jpg
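The formula truncated in the hunk header below is the start of the standard tabular Q-learning update, \[ Q(s_{t},a_{t}) \leftarrow Q(s_{t},a_{t}) + \alpha \left[ r_{t} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_{t},a_{t}) \right] \]. A minimal Python sketch follows; the environment interface (reset, step, actions) is an assumption for illustration, not code from the talk.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """env is assumed to expose reset() -> state, step(action) -> (state, reward, done), and a list env.actions."""
    Q = defaultdict(float)  # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q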

@@ -322,10 +322,10 @@ Q^{new}(s_{t},a_{t}) \leftarrow \underbrace{Q(s_{t},a_{t})}_{\text{old}} + \unde
Tesauro's TD Gammon (1992)

td-gammon.png

@@ -341,10 +341,10 @@ Q^{new}(s_{t},a_{t}) \leftarrow \underbrace{Q(s_{t},a_{t})}_{\text{old}} + \unde
RL parallels in Neuroscience (1994-)

dopamine.png

@@ -363,10 +363,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
My PhD work - RoboCup

socbot1.png

@@ -378,10 +378,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Move to point (hand coded)

phys-hc-1.png

@@ -395,10 +395,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Move to point (RL)

phys-mcsoft-1.png

@@ -412,10 +412,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Ball dribbling

sym2-0.png

@@ -428,10 +428,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Ball dribbling (hand coded)

t61.2.png

@@ -443,10 +443,10 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Ball dribbling (RL)

t62.12.png

@@ -459,9 +459,9 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Andrew Ng and Pieter Abbeel's Helicopter (2004)
Atari DQN Google DeepMind (2016) - Start of DeepRL
AlphaGo and AlphaZero (Google DeepMind 2016)

alphago.png

@@ -517,9 +517,9 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Sim to real: Quadruped robots
OpenAI Rubik's cube robot
Learning to walk in 1 hour (Dreamer v3)

Champion level drone racing using Deep RL (Oct 23)

Key challenges for RL for real-world problems

  • Common framework
  • Resolve the environment problem
  • Collect open data
  • Consider the human element
@@ -581,8 +581,8 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Common framework

  • RL is based on a well-structured problem formulation
@@ -606,8 +606,8 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Resolve the environment problem

  • Simple environments are easy - results are fast
  • Bugs in the simulator can lead to poor control behaviour
@@ -629,8 +629,8 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
Collect open data

  • Simulating environments from first principles tends to miss key characteristics
@@ -649,9 +649,9 @@ There is a lot of data being collected already but it is not always openly acces
Consider the human element
RL applied to electric vehicle comfort control

car-air-conditioning-service.jpeg

@@ -687,10 +687,10 @@ There is a lot of data being collected already but it is not always openly acces
EV range issue

46-51_Cabin-Conditioning_atrApr19_1.jpeg

@@ -704,10 +704,10 @@ There is a lot of data being collected already but it is not always openly acces
Seat heating

heated-seats-button.jpeg

@@ -723,10 +723,10 @@ There is a lot of data being collected already but it is not always openly acces
Natural ventilation

Coventry_University_Lanchester_Library_6933825422.jpeg

@@ -741,10 +741,10 @@ There is a lot of data being collected already but it is not always openly acces
I've been working on it a while

DSCF0052.jpg

@@ -756,10 +756,10 @@ There is a lot of data being collected already but it is not always openly acces
H2020 EU Project - DOMUS

domus-partners.jpg

@@ -771,10 +771,10 @@ There is a lot of data being collected already but it is not always openly acces
Climate control as an RL problem

comfort-problem.png

@@ -789,8 +789,8 @@ There is a lot of data being collected already but it is not always openly acces
Producing a fast thermal cabin model

  • Let's focus on one aspect - the thermal cabin model
  • Past work suggests that learning a comfort controller requires about 8 years of simulated experience
@@ -806,10 +806,10 @@ There is a lot of data being collected already but it is not always openly acces
Gathering data from the Climatic Wind Tunnel

cwt.png

@@ -823,8 +823,8 @@ There is a lot of data being collected already but it is not always openly acces
Accelerating the cabin model

  • Key idea: it's possible to learn the cabin model from data \[ \mathbf{x}_{t+1} \approx \mathbf{f}_\theta \left( \mathbf{x}_t, \mathbf{u}_t, \mathbf{x}_{t-1},\ldots \right) \]
@@ -847,8 +847,8 @@ where
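One way to read the relation above is as a supervised regression problem over logged trajectories. The sketch below fits a linear one-step model with NumPy least squares; the array names, shapes and the choice of a linear model are illustrative assumptions, not the project's actual code.

import numpy as np

def fit_linear_dynamics(X, U):
    """X: (T, n_states) logged states; U: (T, n_inputs) logged control inputs."""
    inputs = np.hstack([X[:-1], U[:-1]])  # [x_t, u_t] for t = 0..T-2
    targets = X[1:]                       # x_{t+1}
    theta, *_ = np.linalg.lstsq(inputs, targets, rcond=None)  # least-squares fit
    return theta

def predict_next(theta, x_t, u_t):
    return np.hstack([x_t, u_t]) @ theta

# Placeholder data standing in for climatic wind tunnel logs (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # e.g. head, torso, foot and mean cabin temperatures
U = rng.normal(size=(1000, 2))   # e.g. blower level and heater power
theta = fit_linear_dynamics(X, U)
x_next = predict_next(theta, X[-1], U[-1])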
Intuition for cabin model

  • Lumped thermal model is based on Newton's law of cooling \[ \frac{dy}{dt} = -k(y-y_0) \]
@@ -871,8 +871,8 @@ where
Intuition for cabin model

  • Therefore, discretising with a forward Euler step: \[ y(t+\Delta t) \approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t = y(t) - k \left( y(t) - y_{0} \right) \Delta t \]
@@ -902,8 +902,8 @@ y(t+\Delta t) &\approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t \\
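A quick numerical check of that Euler step, with k, y_0, the step size and the initial temperature all being assumed values for illustration (none come from the slides):

# Forward Euler integration of Newton's law of cooling: dy/dt = -k * (y - y0)
k, y0 = 0.05, 21.0   # assumed cooling rate (1/min) and ambient temperature (deg C)
dt, steps = 1.0, 60  # 1-minute steps over one hour
y = 40.0             # assumed initial cabin temperature (deg C)

for step in range(steps):
    y = y + dt * (-k * (y - y0))  # y(t+dt) ~ y(t) - k*(y(t) - y0)*dt
    if step % 10 == 9:
        print(f"t = {step + 1:2d} min, y = {y:.2f} C")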

Simulator results - driver foot, torso, head

cwt-driver-head-foot.png

@@ -918,8 +918,8 @@ y(t+\Delta t) &\approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t \\
Results from this simulator

  • Linear Regression-based model NRMSE 1.8% overall
@@ -943,10 +943,10 @@ y(t+\Delta t) &\approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t \\
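For clarity on the metric, a small sketch of NRMSE with the RMSE normalised by the range of the observed values; the slide does not say which normalisation was used, so treat this as one common convention rather than the project's definition.

import numpy as np

def nrmse(y_true, y_pred):
    """Root-mean-square error normalised by the observed range."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

# e.g. nrmse(measured_temps, predicted_temps) == 0.018 corresponds to the quoted 1.8%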
Preliminary results using RL

energyweight.png

@@ -962,8 +962,8 @@ y(t+\Delta t) &\approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t \\
Conclusions

  • RL is a very active and exciting domain
  • Surprisingly, it has made few inroads into real-world systems
@@ -982,8 +982,8 @@ Focus on optimality
Thank you

Questions?

diff --git a/2023-02-rl/why-rl-exciting.org b/2023-02-rl/why-rl-exciting.org
index 1504316..f084c96 100644
--- a/2023-02-rl/why-rl-exciting.org
+++ b/2023-02-rl/why-rl-exciting.org
@@ -22,7 +22,7 @@
 ** What is Reinforcement Learning?
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + How would you control a helicopter to perform this stunt?
 #+END_NOTES
@@ -58,7 +58,7 @@
 ** Example: maze with pitfalls
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + example problem - simple enough to derive a solution
@@ -72,7 +72,7 @@
 ** Example problem: Balance a pole
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 - State: pole angle, angular momentum, cart position, velocity
 - Actions: force on cart to left or right
 - Reward: +1 for each time step that the pole is upright
@@ -81,7 +81,7 @@
 #+END_NOTES
 ** Example problem: Playing football
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 - States: where am I? other players? ball?
 - Actions: turn, run, pass, shoot, tackle
 - Reward: 1 for win, 0 for draw, -1 for loss
@@ -92,13 +92,13 @@
 ** A Brief History of RL
 *** Where does the term "reinforcement" come from?
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + Pavlov introduced conditioning which says that experience of rewards /reinforces/ that action happening in the same situation next time
 #+END_NOTES
 *** TOBY (1951) - W. Grey Walter
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + By 1950, cybernetics theorised that behaviour was driven by /simple/ rules
 #+END_NOTES
 *** Bellman equation (1957) and dynamic programming
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + Recursive form became theoretical basis for RL
 #+END_NOTES
@@ -219,7 +219,7 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
 #+END_NOTES
 *** Andrew Ng and Pieter Abbeel's Helicopter (2004)
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + Key for this talk is how they learnt each stunt in /simulation/
@@ -235,7 +235,7 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
 #+END_NOTES
 *** Atari DQN Google DeepMind (2016) - Start of DeepRL
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + prior to this - full access to internal state
 + this RL agent just sees pixel values
@@ -254,13 +254,13 @@ The dopamine response coding an error in the prediction of reward (Eq. 1) closel
 #+END_NOTES
 *** Sim to real: Quadruped robots
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + small problems in simulator lead to problems with real world performance
 + however potential for simulator issues to be overcome
 #+END_NOTES
 *** OpenAI Rubik's cube robot
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + training in simulation starts with deterministic simulation
@@ -318,7 +318,7 @@
 There is a lot of data being collected already but it is not always openly accessible
 #+END_NOTES
 *** Consider the human element
-#+REVEAL_HTML:
+#+REVEAL_HTML:
 #+BEGIN_NOTES
 + simple rules yield complex behaviour
 + we shouldn't ignore this problem just because it is hard