Cross validation is an important part of training and evaluating an ML model. It gives you an estimate of how a trained model will perform on new, unseen data.
Most people who learn how to perform cross validation first learn about the K-fold approach. I know I did. In K-fold cross validation, the data set is randomly divided into K folds (usually 5). Over the course of 5 iterations, the model is trained on 4 of the 5 folds, while the remaining fold acts as a test set to evaluate performance. This is repeated until each of the 5 folds has served as the test set exactly once. In the end, you have 5 error scores that, averaged together, give you your cross-validation score.
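As a quick sketch of that procedure in scikit-learn (the dataset and model here are illustrative, not from the article):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# A synthetic regression dataset just for demonstration.
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# 5 folds: each iteration trains on 4 folds and scores on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

# 5 scores, one per fold; their mean is the cross-validation score.
print(scores)
print(scores.mean())
```

`cross_val_score` handles the train/score loop for you; you only supply the splitter.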
However, here's the problem: this method really only works for data that is neither time series nor sequential. If the order of the data matters in any way, or if any data point depends on previous values, you cannot use K-fold cross validation.
The reason is quite simple. If you split the data into 4 training folds and 1 test fold using KFold with shuffling enabled (as is common), the order of the data is randomized. Data points that originally preceded other data points can end up in the test set, which means the model ends up using future data to predict the past.
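You can see the leakage directly. In this sketch, the row index stands in for time order; with a shuffled KFold, the training set routinely contains indices that come after the test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# Row index stands in for time order: row 0 is the oldest observation.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    # If the largest training index exceeds the smallest test index,
    # the model is trained on the "future" and evaluated on the "past".
    leaked = train_idx.max() > test_idx.min()
    print("test fold:", test_idx, "-> future in training:", leaked)
```

Run it and nearly every fold leaks, which is exactly the problem described above.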
This is a big no-no.
The way you test your model in development should mimic the way it will run in the production environment.
If you are going to use past data to predict future data when the model goes into production (as you would with time series), you should test your model in development in the same way.
This is where TimeSeriesSplit comes into play. TimeSeriesSplit, a scikit-learn class, is described in the documentation as a “variation of KFold.”
In the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set.
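A minimal sketch of that behavior (the 12-point dataset is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations; row index stands in for time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index,
    # and the training window grows with each split.
    print("train:", train_idx, "test:", test_idx)
```

Unlike KFold, no split ever places future observations in the training set.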
The main differences between TimeSeriesSplit and KFold are:
- In TimeSeriesSplit, the training data set gradually increases in size, while in…