COGSCI 200 Lecture Notes - Lecture 15: Temporal Difference Learning, Reinforcement Learning, B. F. Skinner

Reinforcement Learning
Reinforcement
I. B. F. Skinner: instrumental conditioning → all of psychology is based on rewards for actions
A. But, should a rat be rewarded for accidentally finding the cheese?
II. We need a way to “intelligently” learn sequences of actions
A. Temporal difference (TD) learning accomplishes this!
The Calendar Problem
Goal: For each month, predict the cumulative expected reward you will get in all subsequent months of the year (ending in Dec)
These predictions are called “state values”, abbreviated V
The “true” state values for this version of the problem are shown above.
Your goal is to use experience to learn the true state values (or get reasonably close to them)
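Since each month’s value is just the sum of the rewards received in all later months, the true state values can be computed by working backward from December. The sketch below assumes a hypothetical reward of 1 for entering each month; the actual rewards (and therefore the actual true values) are the ones from the lecture figure.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
R = {m: 1 for m in months}   # hypothetical: reward of 1 on entering each month

# V(s) is the total reward collected in all months after s, so it can be
# computed by working backward from December (which has nothing after it).
V_true = {"Dec": 0}
for s, s_next in zip(reversed(months[:-1]), reversed(months[1:])):
    V_true[s] = R[s_next] + V_true[s_next]

print(V_true)   # under these rewards: V(Nov) = 1, V(Oct) = 2, ..., V(Jan) = 11
```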
Temporal Difference Learning
1. Initialize all state values to 0
2. Do the following each time you transition from state s to state s’
a. Calculate the prediction error: [R(s’) + V(s’)] - V(s)
b. Update the value of V(s): V(s) ← V(s) + (α * prediction error)
*If you follow this rule over many transitions, your state value estimates move toward the true state values (see the sketch below)
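Here is a minimal sketch of the procedure above in Python. The month sequence, the reward of 1 per month, and the learning rate alpha are assumed placeholder values rather than the numbers from lecture; the point is only to show the two steps of the update rule.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
R = {m: 1 for m in months}      # hypothetical reward for entering each month
alpha = 0.1                     # assumed learning rate

V = {m: 0.0 for m in months}    # step 1: initialize all state values to 0

for sweep in range(500):        # experience many passes through the year
    for s, s_next in zip(months[:-1], months[1:]):   # transition from s to s'
        # step 2a: prediction error = [R(s') + V(s')] - V(s)
        prediction_error = (R[s_next] + V[s_next]) - V[s]
        # step 2b: V(s) <- V(s) + (alpha * prediction error)
        V[s] += alpha * prediction_error

print({m: round(v, 1) for m, v in V.items()})   # close to the true values above
```

After enough passes through the year, the learned values settle near the true values computed in the earlier sketch.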
REVIEW: Temporal Difference Learning
I. Components of Reinforcement Learning Problem:
States, rewards, actions, Q values
- Q values are predictions of the future cumulative reward you will get if you start in state
s and take action a and behave “optimally” thereafter
II. Calculate Prediction Error: prediction error = [R(s’) + Q(s’, a’)] - Q(s,a)
Update the value of Q(s,a): Q(s,a) ← Q(s,a) + (α * prediction error) (a sketch of this update follows the list below)
III. Directions:
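A minimal sketch of the Q-value update from item II, using a generic table of Q values. The example states ("Jan"), action ("study"), reward, and alpha are hypothetical placeholders, not values from lecture.

```python
from collections import defaultdict

alpha = 0.1                     # assumed learning rate
Q = defaultdict(float)          # Q(s, a) table, every entry starts at 0

def td_update(s, a, reward, s_next, a_next):
    """One update after taking action a in state s, receiving reward,
    landing in state s_next, and picking action a_next there."""
    # prediction error = [R(s') + Q(s', a')] - Q(s, a)
    prediction_error = (reward + Q[(s_next, a_next)]) - Q[(s, a)]
    # Q(s, a) <- Q(s, a) + (alpha * prediction error)
    Q[(s, a)] += alpha * prediction_error
    return prediction_error

# One hypothetical transition: state, action, and reward values are placeholders.
td_update("Jan", "study", 1, "Feb", "study")
print(Q[("Jan", "study")])      # 0.1 after a single update with these numbers
```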