COGSCI 200 Lecture Notes - Lecture 15: Temporal Difference Learning, Reinforcement Learning, B. F. Skinner
Reinforcement Learning
Reinforcement
I. B. F. Skinner: instrumental conditioning → all of psychology is based on rewards for actions
A. But, should a rat be rewarded for accidentally finding the cheese?
II. We need a way to “intelligently” learn sequences of actions
A. Temporal difference (TD) learning accomplishes this!
The Calendar Problem
Goal: For each month, predict the cumulative expected reward you will get in all subsequent months of the year (ending in Dec)
These predictions are called “state values”, abbreviated “V”
The “true” state values for this version of the problem were shown in lecture.
Your goal is to use experience to learn the true state values (or get reasonably close to them)
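What “true state values” means can be sketched directly from the definition: V for a month is the sum of rewards over all subsequent months. The reward schedule below is a hypothetical placeholder (a reward of 1 for entering each month); the lecture's actual rewards are not in these notes.

```python
# Hypothetical calendar problem: assume a reward of 1 for entering each month.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
R = {m: 1.0 for m in months}  # assumed rewards (not from the lecture)

# True value of each month = total reward over all subsequent months.
true_V = {}
for i, m in enumerate(months):
    true_V[m] = sum(R[later] for later in months[i + 1:])

print(true_V["Jan"])  # 11.0: rewards for Feb through Dec
print(true_V["Nov"])  # 1.0: only Dec's reward remains
print(true_V["Dec"])  # 0.0: no subsequent months
```

In practice you do not know these values in advance; TD learning, described next, estimates them from experience.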
Temporal Difference Learning
1. Initialize all state values to 0
2. Do the following each time you transition from state s to state s’
a. Calculate the prediction error: [R(s’) + V(s’)] - V(s)
b. Update the value of V(s): V(s) ← V(s) + (α × prediction error)
*If you follow this rule repeatedly, the state values converge toward the true state values
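The two steps above can be sketched as a short TD(0) loop on the calendar problem. The reward schedule is an assumption (reward of 1 for entering each month), as the lecture's actual rewards are not in these notes.

```python
# A minimal TD(0) sketch of the calendar problem (assumed rewards).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
R = {m: 1.0 for m in months}   # assumed reward for entering month m
V = {m: 0.0 for m in months}   # step 1: initialize all state values to 0
alpha = 0.1                    # learning rate

for episode in range(2000):    # live through many years of experience
    # step 2: on each transition from state s to state s' ...
    for s, s_next in zip(months, months[1:]):
        # step 2a: prediction error = [R(s') + V(s')] - V(s)
        delta = R[s_next] + V[s_next] - V[s]
        # step 2b: nudge V(s) toward the better prediction
        V[s] += alpha * delta

# V("Nov") approaches 1.0 (only Dec's reward remains);
# V("Jan") approaches 11.0 (rewards for Feb through Dec);
# V("Dec") stays 0.0 (no subsequent months, so it is never updated).
```

Each update moves a state's value a small step toward the one-step-lookahead target, which is why the estimates settle on the true values after enough experience.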
REVIEW
Review Temporal Difference Learning
I. Components of Reinforcement Learning Problem:
States, rewards, actions, Q values
- Q values are predictions of the future cumulative reward you will get if you start in state s, take action a, and behave “optimally” thereafter
II. Calculate Prediction Error: prediction error = [R(s’) + Q(s’, a’)] - Q(s,a)
Update the value of Q(s,a): Q(s,a) ← Q(s,a) + (α × prediction error)
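The Q-value version of the update can be written as one small function. The two-state, two-action environment below ("jan"/"feb", "work"/"rest") and the observed reward are hypothetical, invented only to show the arithmetic of a single update.

```python
# Minimal sketch of the Q-value update (on-policy, SARSA-style).
def update_q(Q, s, a, r_next, s_next, a_next, alpha=0.1):
    """Apply: Q(s,a) <- Q(s,a) + alpha * ([R(s') + Q(s',a')] - Q(s,a))."""
    prediction_error = (r_next + Q[(s_next, a_next)]) - Q[(s, a)]
    Q[(s, a)] += alpha * prediction_error
    return prediction_error

# Hypothetical Q table: two states, two actions, all values initialized to 0.
Q = {("jan", "work"): 0.0, ("jan", "rest"): 0.0,
     ("feb", "work"): 0.0, ("feb", "rest"): 0.0}

# One observed transition: took "work" in jan, received reward 2 on entering
# feb, then chose "work" in feb.
delta = update_q(Q, "jan", "work", 2.0, "feb", "work")
print(delta)               # 2.0: predicted 0, actually got 2
print(Q[("jan", "work")])  # 0.2 = 0 + 0.1 * 2.0
```

Note that the learning rate α and the action a are different symbols: α scales how far each prediction error moves the estimate.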
III. Directions: