
James Brusey

Overview

  • What is Reinforcement Learning?

What is Reinforcement Learning?

helicopter_tail_rotor_thrust_antitorque_compensation.jpeg


What is Reinforcement Learning?

RLvsML.jpeg

Some definitions

  • policy—how an agent behaves
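
An illustrative aside, not from the original slide: in code, a policy is simply a mapping from states to actions. A minimal Python sketch, where the type names are hypothetical:

from typing import Callable

State = tuple   # hypothetical state representation, e.g. sensor readings
Action = int    # hypothetical discrete action index

# A policy maps the current state to an action.
Policy = Callable[[State], Action]

# A trivial constant policy: always choose action 0.
always_act_0: Policy = lambda s: 0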

So let's summarise the key aspects of RL

Example: maze with pitfalls


Example problem: Balance a pole

  • State: pole angle, angular velocity, cart position, cart velocity (sketched in code below)
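
A hedged illustration, not part of the original slide: the pole-balancing task is available as CartPole in the Gymnasium library. A minimal sketch of the agent-environment loop, with a random policy standing in for a learned one:

import gymnasium as gym

# Minimal interaction loop for the cart-pole task.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs: cart position/velocity, pole angle/angular velocity
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()   # random policy: 0 = push left, 1 = push right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # +1 for every step the pole stays up
    done = terminated or truncated
env.close()
print("episode return:", total_reward)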

Example problem: Playing football

  • States: where am I? other players? ball?

A Brief History of RL


Where does the term "reinforcement" come from?


TOBY (1951) - W. Grey Walter


Bellman equation (1957) and dynamic programming

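A short aside not on the original slide: for a discounted Markov decision process, the Bellman optimality equation that dynamic programming solves can be written as
\[
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ r(s,a,s') + \gamma V^{*}(s') \right]
\]
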

Barto, Sutton and Anderson: Actor Critic (1983)

figtmp34.png

sutton-head5.jpg

barto_andrew_crop.jpeg

Charles-Anderson.jpg

Watkins Q-learning (1989)

cw090311.jpg

\[
Q^{new}(s_{t},a_{t}) \leftarrow \underbrace{Q(s_{t},a_{t})}_{\text{old}} + \underbrace{\alpha}_{\text{learning rate}} \left( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma \max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}} - Q(s_{t},a_{t}) \right)
\]
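
A minimal tabular sketch of this update rule (illustrative, not Watkins's original code), assuming a small discrete environment with a Gymnasium-style reset/step interface:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            # Watkins update: nudge the old estimate toward the TD target
            target = r if terminated else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            done = terminated or truncated
    return Q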

Tesauro's TD Gammon (1992)

td-gammon.png

RL parallels in Neuroscience (1994-)

dopamine.png

The dopamine response coding an error in the prediction of reward (Eq. 1) closely …

My PhD work - RoboCup

socbot1.png

Move to point (hand coded)

phys-hc-1.png

Move to point (RL)

phys-mcsoft-1.png

Ball dribbling

sym2-0.png

Ball dribbling (hand coded)

t61.2.png

Ball dribbling (RL)

t62.12.png

Andrew Ng and Pieter Abbeel's Helicopter (2004)

Atari DQN, Google DeepMind (2015) - the start of Deep RL

AlphaGo and AlphaZero (Google DeepMind, 2016 and 2017)

alphago.png

Sim to real: Quadruped robots

OpenAI Rubik's cube robot

Learning to walk in 1 hour (Dreamer v3)

Champion-level drone racing using Deep RL (Oct 23)

Key challenges for RL for real-world problems

  • Common framework
  • Resolve the environment problem
  • Collect open data
  • Consider the human element

Common framework

  • RL is based on a well-structured problem formulation (formalised below)
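
A short formal aside in standard notation, not reproduced on the slide: the shared formulation is the Markov decision process
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),
\]
with states \(\mathcal{S}\), actions \(\mathcal{A}\), transition kernel \(P(s' \mid s, a)\), reward \(r(s,a)\), and discount \(\gamma \in [0,1)\).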

Resolve the environment problem

  • Simple environments are easy - results are fast
  • Bugs in the simulator can lead to poor control behaviour

Collect open data

  • Simulating environments from first principles tends to miss key characteristics
There is a lot of data being collected already, but it is not always openly accessible.

Consider the human element


RL applied to electric vehicle comfort control

car-air-conditioning-service.jpeg

EV range issue

46-51_Cabin-Conditioning_atrApr19_1.jpeg

Seat heating

heated-seats-button.jpeg

Natural ventilation

Coventry_University_Lanchester_Library_6933825422.jpeg

I've been working on it a while

DSCF0052.jpg

H2020 EU Project - DOMUS

domus-partners.jpg

Climate control as an RL problem

comfort-problem.png

Producing a fast thermal cabin model

  • Let's focus on one aspect: the thermal cabin model
  • Past work suggests that learning a comfort controller requires about 8 years of simulated experience

Gathering data from the Climatic Wind Tunnel

cwt.png

Accelerating the cabin model

  • Key idea: it's possible to learn the cabin model from data
    \[ \mathbf{x}_{t+1} \approx \mathbf{f}_\theta \left( \mathbf{x}_t, \mathbf{u}_t, \mathbf{x}_{t-1}, \ldots \right) \]
    where \(\mathbf{x}_t\) is the cabin state and \(\mathbf{u}_t\) the control input at time \(t\) (see the sketch below)
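
A minimal sketch of fitting such a one-step model by least squares; the array names and shapes are illustrative assumptions, not the DOMUS code, though linear regression matches the model reported later:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_one_step_model(X_log, U_log):
    """Fit x_{t+1} ~ f_theta(x_t, u_t) from logged data.

    X_log: (T, n_x) logged cabin states; U_log: (T, n_u) control inputs.
    """
    inputs = np.hstack([X_log[:-1], U_log[:-1]])   # (x_t, u_t) pairs
    targets = X_log[1:]                            # next states x_{t+1}
    return LinearRegression().fit(inputs, targets)

def rollout(model, x0, U_seq):
    """Roll the learned model forward as a fast surrogate simulator."""
    xs = [np.asarray(x0)]
    for u in U_seq:
        step_in = np.hstack([xs[-1], u])[None, :]
        xs.append(model.predict(step_in)[0])
    return np.array(xs)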

Intuition for cabin model

  • Lumped thermal model is based on Newton's law of cooling
    \[ \frac{dy}{dt} = -k(y-y_0) \]

Intuition for cabin model

  • Therefore
    \begin{align*}
    \frac{\Delta y}{\Delta t} &\approx -k\left(y(t)-y_0\right) \\
    y(t+\Delta t) &\approx y(t) + \frac{\Delta y}{\Delta t}\cdot \Delta t
    \end{align*}
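
A small numerical sketch of this discretisation; the parameter values are illustrative only, not DOMUS parameters:

import numpy as np

def simulate_cooling(y_init=40.0, y0=20.0, k=0.1, dt=1.0, steps=100):
    """Forward-Euler simulation of Newton's law of cooling dy/dt = -k(y - y0)."""
    ys = [y_init]
    for _ in range(steps):
        dy_dt = -k * (ys[-1] - y0)       # Newton's law of cooling
        ys.append(ys[-1] + dy_dt * dt)   # Euler step: y(t+dt) ~ y(t) + dy/dt * dt
    return np.array(ys)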

Simulator results - driver foot, torso, head

cwt-driver-head-foot.png

Results from this simulator

  • Linear Regression-based model NRMSE 1.8% overall
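
For reference, a sketch of the metric; normalising the RMSE by the observed range is an assumption here, since other conventions divide by the mean or standard deviation:

import numpy as np

def nrmse(y_true, y_pred):
    """Root-mean-square error normalised by the range of the measurements."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (np.max(y_true) - np.min(y_true))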

Preliminary results using RL

energyweight.png

Conclusions

  • RL is a very active and exciting domain
  • Surprisingly, it has made few inroads into real-world systems
  • Focus on optimality

Thank you

Questions?