Cross validation is an important part of training and evaluating an ML model. It gives you an estimate of how a trained model will perform on new, unseen data.
Most people who learn how to perform cross validation first learn about the K-fold approach. I know I did. In K-fold cross validation, the data set is randomly divided into K folds (usually 5). Over the course of 5 iterations, the model is trained on 4 of the 5 folds, while the remaining fold acts as a test set to evaluate performance. This is repeated until each of the 5 folds has served as the test set exactly once. In the end, you have 5 error scores that, averaged together, give you your cross-validation score.
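As a quick sketch of that procedure in scikit-learn (the dataset and model here are illustrative, not from the article):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# A synthetic regression dataset just for demonstration.
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# 5 folds: each iteration trains on 4 folds and scores on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

# 5 scores, one per fold; their mean is the cross-validation score.
print(scores)
print(scores.mean())
```

`cross_val_score` handles the train/score loop for you; you only supply the splitter.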
However, here's the problem: this method really only works for data that is neither time series nor sequential. If the order of the data matters in any way, or if any data point depends on previous values, you cannot use K-fold cross validation.
The reason is quite simple. If you split the data into 4 training folds and 1 test fold using KFold with shuffling enabled (as is common), the order of the data is randomized. Data points that originally preceded other data points can end up in the test set, which means the model ends up using future data to predict the past.
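You can see the leakage directly. In this sketch, the row index stands in for time order; with a shuffled KFold, the training set routinely contains indices that come after the test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# Row index stands in for time order: row 0 is the oldest observation.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    # If the largest training index exceeds the smallest test index,
    # the model is trained on the "future" and evaluated on the "past".
    leaked = train_idx.max() > test_idx.min()
    print("test fold:", test_idx, "-> future in training:", leaked)
```

Run it and nearly every fold leaks, which is exactly the problem described above.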
This is a big no-no.
The way you test your model in development should mimic the way it will run in the production environment.
If you are going to use past data to predict future data when the model goes into production (as you would with time series), you should test your model in development in the same way.
This is where TimeSeriesSplit comes into play. TimeSeriesSplit, a scikit-learn class, is described in the documentation as a “variation of KFold.”
In the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set.
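A minimal sketch of that behavior (the 12-point dataset is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations; row index stands in for time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index,
    # and the training window grows with each split.
    print("train:", train_idx, "test:", test_idx)
```

Unlike KFold, no split ever places future observations in the training set.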
The main differences between TimeSeriesSplit and KFold are:
- In TimeSeriesSplit, the training data set gradually increases in size, while in…