Dissecting Richard S. Sutton's “Reinforcement Learning” with Custom Python Implementations, Episode V
In our previous post, we concluded the introductory series on fundamental reinforcement learning (RL) techniques by exploring temporal difference (TD) learning. TD methods combine the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, forming the basis of some of the most important RL algorithms, such as Q-learning.
Building on that foundation, this post delves into n-step TD learning, a versatile approach presented in Chapter 7 of Sutton's book (1). This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they also incorporate the next n rewards, offering a unique combination of short- and long-term learning. In a future post, we will generalize this concept even further with eligibility traces.
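To make the idea concrete, here is a minimal sketch (not the book's pseudocode) of how an n-step return could be computed from a stored trajectory. The names `rewards`, `states`, `V`, and the indexing convention are illustrative assumptions for this snippet, not part of the original post:

```python
def n_step_return(rewards, states, t, n, gamma, V, T):
    """Sketch of the n-step return G_{t:t+n}: the next n discounted rewards
    plus a bootstrapped value estimate of the state reached after n steps.

    Assumed (illustrative) convention: rewards[k] is the reward received when
    entering states[k + 1]; T is the terminal time step; V maps states to
    current value estimates.
    """
    G = 0.0
    # Accumulate up to n discounted rewards, stopping early if the episode ends.
    for k in range(t, min(t + n, T)):
        G += gamma ** (k - t) * rewards[k]
    # Bootstrap from the estimated value of the state n steps ahead,
    # unless the episode has already terminated by then.
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G
```

With n = 1 this reduces to the familiar one-step TD target, while letting n grow toward the episode length recovers the Monte Carlo return, which is exactly the spectrum this chapter explores.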
We will follow a structured approach, starting with the prediction problem before moving on to control. Along the way, we will:
- introduce n-step Sarsa,
- extend it to off-policy learning,
- explore the n-step tree-backup algorithm, and
- present a unifying perspective with n-step Q(σ).
As always, you can find all the code on GitHub. Let's dive in!