Reinforcement learning is a domain of machine learning that introduces the concept of an agent learning optimal strategies in complex environments. The agent learns from its actions, which yield rewards that depend on the environment’s state. Reinforcement learning is a challenging topic and differs significantly from other areas of machine learning.
What is remarkable about reinforcement learning is that the same algorithms can be used to enable the agent to adapt to completely different, unknown, and complex conditions.
Note. To fully understand the concepts included in this article, it is highly recommended that you be familiar with the dynamic programming and Monte Carlo methods discussed in previous articles.
- In part 2, we explored the dynamic programming (DP) approach, where the agent iteratively updates V- / Q-functions and its policy based on previous calculations, replacing them with new estimates.
- In parts 3 and 4, we introduced Monte Carlo (MC) methods, where the agent learns from experience acquired by sampling episodes.
Temporal-difference (TD) learning algorithms, on which we will focus in this article, combine principles from both of these approaches:
- Similar to DP, TD algorithms update estimates based on previous estimates. As seen in part 2, a state can be updated using the current estimates of other states rather than waiting for their final values, a technique known as bootstrapping, which is a key feature of DP.
- Similar to MC, TD algorithms do not require knowledge of the environment’s dynamics because they learn from experience as well.
This article is based on Chapter 6 of the book “Reinforcement Learning: An Introduction” written by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.
As we already know, Monte Carlo algorithms learn from experience by generating an episode and observing rewards for every visited state. State updates are performed only after the episode ends.
Temporal-difference algorithms operate similarly, with one key difference: they do not wait until the end of an episode to update states. Instead, each state is updated n time steps after it was visited (n is a parameter of the algorithm). During these n observed time steps, the algorithm accumulates the received rewards and uses that information to update the previously visited state.
A temporal-difference algorithm that performs state updates after n time steps is denoted as TD(n).
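Following Sutton and Barto's notation, the update performed after those n steps moves the state's value toward the n-step target, which combines the discounted rewards collected along the way with the bootstrapped value of the state reached at the end:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$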
The simplest version of TD performs its update at the very next time step (n = 1) and is known as one-step TD.
At the end of the previous part, we introduced the constant-α MC algorithm. It turns out that the pseudocode for one-step TD is almost identical, except for the state update, as shown below:
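In standard notation, the only difference is the target toward which the state value is moved: constant-α MC uses the full episode return Gₜ, while one-step TD uses a bootstrapped target built from the next reward and the next state's current estimate:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \qquad \text{(constant-}\alpha\text{ MC)}$$

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \qquad \text{(one-step TD)}$$

As an illustration, here is a minimal Python sketch of the one-step TD prediction loop. The environment interface (`env.reset()`, and `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are hypothetical placeholders chosen for this sketch.

```python
from collections import defaultdict

def one_step_td_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular one-step TD estimation of the state-value function V for a
    fixed policy. `env` and `policy` are assumed placeholder objects."""
    V = defaultdict(float)  # value estimates, initialized to 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped TD target: immediate reward plus the discounted
            # current estimate of the next state (0 if the episode has ended).
            target = reward + gamma * V[next_state] * (not done)
            # Move the estimate by a step of size alpha toward the target.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```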
Since TD methods do not wait until the end of the episode and make updates using current estimates, they are said to use bootstrapping, like DP algorithms.
The expression in brackets in the update formula is called the TD error:
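$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$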
In this equation, γ is the discount factor, which takes values between 0 and 1 and determines how strongly future rewards are weighted relative to the immediate reward. For example, with γ = 0.9, a reward received k steps in the future is weighted by 0.9^k.
The TD error plays an important role. As we will see later, TD algorithms can be adapted depending on the form of the TD error.
At first sight, it might seem unclear how using only the reward from the current transition and the value estimates of the current and next states can be beneficial for finding an optimal strategy. It will be easier to understand if we take a look at an example.
Let us imagine a simplified version of the famous “Copa America” soccer tournament, which regularly takes place in South America. In our version, our team faces 6 opponents in the same order in every Copa America tournament. Though the system is not real, we will omit complex details to better understand the example.
We would like to create an algorithm that will predict our team’s total goal difference after a sequence of matches. The table below shows the team’s results obtained in a recent edition of the Copa America.
To get a better sense of the data, let us visualize the results. The algorithm’s initial estimates are shown by the yellow line in the diagram below. The obtained cumulative goal difference (the last column of the table) is depicted in black.
Roughly speaking, our objective is to update the yellow line so that it better adapts to changes based on the most recent match results. For that, we will compare how the constant-α Monte Carlo and one-step TD algorithms cope with this task.
Constant-α Monte Carlo
The Monte Carlo method calculates the cumulative reward G of the episode, which in our case is the total goal difference after all matches (+3). Then, every state is updated in proportion to the difference between the episode’s total reward and the state’s current value.
For instance, let us take the state after the third match against Peru (we will use the learning rate α = 0.5):
- The initial state’s value is v = 1.2 (yellow point corresponding to Chile).
- The cumulative reward is G = 3 (black dashed line).
- The difference between the two values, G - v = 3 - 1.2 = 1.8, is then multiplied by α = 0.5, which gives an update step of Δ = 0.9 (red arrow corresponding to Chile).
- The state’s new value becomes v = v + Δ = 1.2 + 0.9 = 2.1 (red point corresponding to Chile).
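As a quick check, here is the same arithmetic written as a constant-α MC update in Python (the numbers are taken directly from the example above):

```python
# Constant-alpha MC update for the worked example above.
alpha = 0.5   # learning rate
v = 1.2       # current value estimate of the state
G = 3.0       # episode return: total goal difference after all matches

delta = alpha * (G - v)   # update step
v_new = v + delta         # updated state value
print(round(delta, 2), round(v_new, 2))  # 0.9 2.1
```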
One-step TD
For this demonstration, we will take the total goal difference after the fourth match against Chile.
- The initial state’s value is v