Gaussian processes (GPs) are an amazing class of models. Few machine learning algorithms give you an accurate measure of uncertainty for free while remaining so flexible. The problem is that GPs are conceptually difficult to understand: most explanations lean on dense algebra and probability theory, which rarely helps build intuition for how these models actually work.
There are also many excellent guides that skip the math and build intuition instead, but my personal belief is that superficial knowledge is not enough when it comes to using GPs yourself, in the right context. That's why I wanted to walk through a basic implementation from scratch, so you have a clearer idea of what's going on under the hood of all the libraries that implement these models for you.
I also link my GitHub repository, where you will find an implementation of GPs using only NumPy. I've tried to keep the math to a minimum, but some of it is unavoidable…
The first step is always to take a look at the data. We will use the monthly atmospheric CO2 concentration over time, measured at the Mauna Loa observatory, a common benchmark data set for GPs (1). This is intentionally the same data set that scikit-learn uses in its GP tutorial, which teaches you how to use their API rather than what happens under the hood of the model.
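If you want to follow along, here is one way to pull the data yourself. This is just a loading sketch, separate from the from-scratch implementation; it assumes the OpenML copy of the data set (data_id 41187, the one scikit-learn's tutorial fetches) and averages duplicate readings into monthly values.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

# Fetch the Mauna Loa CO2 measurements from OpenML.
# data_id 41187 is assumed to be the copy used in scikit-learn's GP tutorial.
co2 = fetch_openml(data_id=41187, as_frame=True).frame

# Build a proper date column from the year/month/day columns,
# then aggregate the raw readings into one average value per month.
co2["date"] = pd.to_datetime(co2[["year", "month", "day"]])
monthly = (
    co2.set_index("date")["co2"]
    .resample("MS")  # month-start frequency
    .mean()
    .dropna()
)
print(monthly.head())
```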
This is a very simple data set, which will make the mathematics that follows easier to explain. Its notable characteristics are a linear upward trend and a seasonal component with a period of one year.
What we will do is separate the seasonal and linear components of the data. To do this, we first fit a linear model to the data, as sketched below.
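A minimal sketch of that detrending step, using only NumPy: fit a degree-1 polynomial by least squares and keep the residuals, which carry the seasonal component. The arrays `t` and `y` here are synthetic placeholders shaped like the Mauna Loa series; in practice you would use the time axis and CO2 values loaded above.

```python
import numpy as np

# Placeholder data resembling the Mauna Loa series: a linear trend
# plus a yearly oscillation and a little noise (for illustration only).
rng = np.random.default_rng(0)
t = np.arange(1958, 2000, 1 / 12)          # time in fractional years
y = 315 + 1.5 * (t - t[0]) + 3 * np.sin(2 * np.pi * t) + rng.normal(0, 0.3, t.size)

# Least-squares fit of a straight line: y ≈ slope * t + intercept.
slope, intercept = np.polyfit(t, y, deg=1)
trend = slope * t + intercept

# The residual is what is left once the linear component is removed:
# the seasonal part of the signal.
residual = y - trend
print(f"slope = {slope:.3f} ppm/year, intercept = {intercept:.1f} ppm")
```

The idea is that the GP only has to model the (much simpler) residual; the linear trend can be added back to its predictions afterwards.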