Reinforcement learning is a domain of machine learning that introduces the concept of an agent learning optimal strategies in complex environments. The agent learns from its actions, which yield rewards that depend on the environment’s state. Reinforcement learning is a challenging topic and differs significantly from other areas of machine learning.
What is remarkable about reinforcement learning is that the same algorithms can be used to enable the agent to adapt to completely different, unknown, and complex conditions.
Note. To fully understand the concepts included in this article, it is highly recommended that you be familiar with the dynamic programming and Monte Carlo methods discussed in previous articles.
- In part 2, we explored the dynamic programming (DP) approach, where the agent iteratively updates V- / Q-functions and its policy based on previous calculations, replacing them with new estimates.
- In parts 3 and 4, we introduced Monte Carlo (MC) methods, where the agent learns from experience acquired by sampling episodes.
Temporal-difference (TD) learning algorithms, on which we will focus in this article, combine principles from both of these approaches:
- Similar to DP, TD algorithms update estimates based on previous estimates. As seen in part 2, a state can be updated using the current estimates of other states rather than waiting for their final values, a technique known as bootstrapping, which is a key feature of DP.
- Similar to MC, TD algorithms do not require knowledge of the environment’s dynamics because they learn from experience as well.
This article is based on Chapter 6 of the book “Reinforcement Learning: An Introduction” written by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.
As we already know, Monte Carlo algorithms learn from experience by generating an episode and observing rewards for every visited state. State updates are performed only after the episode ends.
Temporal-difference algorithms operate similarly, with one key difference: they do not wait until the end of an episode to update states. Instead, each state is updated n time steps after it was visited (n is a parameter of the algorithm). During these n observed time steps, the algorithm accumulates the received rewards and uses that information to update the previously visited state.
A temporal-difference algorithm that performs state updates after n time steps is denoted as TD(n).
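Following Sutton and Barto's notation, the update performed after those n steps moves the state's value toward the n-step target, which combines the discounted rewards collected along the way with the bootstrapped value of the state reached at the end:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$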
The simplest version of TD performs its update at the very next time step (n = 1) and is known as one-step TD.
At the end of the previous part, we introduced the constant-α MC algorithm. It turns out that the pseudocode for one-step TD is almost identical, except for the state update, as shown below:
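In standard notation, the only difference is the target toward which the state value is moved: constant-α MC uses the full episode return Gₜ, while one-step TD uses a bootstrapped target built from the next reward and the next state's current estimate:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \qquad \text{(constant-}\alpha\text{ MC)}$$

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \qquad \text{(one-step TD)}$$

As an illustration, here is a minimal Python sketch of the one-step TD prediction loop. The environment interface (`env.reset()`, and `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are hypothetical placeholders chosen for this sketch.

```python
from collections import defaultdict

def one_step_td_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular one-step TD estimation of the state-value function V for a
    fixed policy. `env` and `policy` are assumed placeholder objects."""
    V = defaultdict(float)  # value estimates, initialized to 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped TD target: immediate reward plus the discounted
            # current estimate of the next state (0 if the episode has ended).
            target = reward + gamma * V[next_state] * (not done)
            # Move the estimate by a step of size alpha toward the target.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```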
Since TD methods do not wait until the end of the episode and make updates using current estimates, they are said to use bootstrapping, like DP algorithms.
The expression in brackets in the update formula is called the TD error:
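$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$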
In this equation, γ is the discount factor, which takes values between 0 and 1 and determines how strongly future rewards are weighted relative to the immediate reward. For example, with γ = 0.9, a reward received k steps in the future is weighted by 0.9^k.
The TD error plays an important role. As we will see later, TD algorithms can be adapted depending on the form of the TD error.
At first sight, it might seem unclear how using only the reward from the current transition and the value estimates of the current and next states can be beneficial for finding an optimal strategy. It will be easier to understand if we take a look at an example.
Let us imagine a simplified version of the famous “Copa America” soccer tournament, which regularly takes place in South America. In our version, our team faces 6 opponents in the same order in every Copa America tournament. Though the system is not real, we will omit complex details to better understand the example.
We would like to create an algorithm that will predict our team’s total goal difference after a sequence of matches. The table below shows the team’s results obtained in a recent edition of the Copa America.
To get a better sense of the data, let us visualize the results. The algorithm’s initial estimates are shown by the yellow line in the diagram below. The obtained cumulative goal difference (the last column of the table) is depicted in black.
Roughly speaking, our objective is to update the yellow line so that it better adapts to changes based on the most recent match results. For that, we will compare how the constant-α Monte Carlo and one-step TD algorithms cope with this task.
Constant-α Monte Carlo
The Monte Carlo method calculates the cumulative reward G of the episode, which in our case is the total goal difference after all matches (+3). Then, every state is updated in proportion to the difference between the episode’s total reward and the state’s current value.
For instance, let us take the state after the third match against Peru (we will use the learning rate α = 0.5):
- The initial state’s value is v = 1.2 (yellow point corresponding to Chile).
- The cumulative reward is G = 3 (black dashed line).
- The difference between the two values, G - v = 3 - 1.2 = 1.8, is then multiplied by α = 0.5, which gives an update step of Δ = 0.9 (red arrow corresponding to Chile).
- The state’s new value becomes v = v + Δ = 1.2 + 0.9 = 2.1 (red point corresponding to Chile).
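As a quick check, here is the same arithmetic written as a constant-α MC update in Python (the numbers are taken directly from the example above):

```python
# Constant-alpha MC update for the worked example above.
alpha = 0.5   # learning rate
v = 1.2       # current value estimate of the state
G = 3.0       # episode return: total goal difference after all matches

delta = alpha * (G - v)   # update step
v_new = v + delta         # updated state value
print(round(delta, 2), round(v_new, 2))  # 0.9 2.1
```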
One-step TD
For this demonstration, we will take the total goal difference after the fourth match against Chile.
- The initial state’s value is v