Modeling and forecasting time series data are complex topics. There are many techniques that can be applied to improve the performance of a forecasting model. Here we will discuss one technique that can improve the way forecasting models absorb time features and generalize from them. The main focus will be creating the seasonal features that feed the time series forecasting model in training; easy gains can be made here by including the Fourier transform in the feature creation process.
This article assumes you are familiar with the basics of time series forecasting; we will not discuss the topic in general, but only a refinement of one aspect of it. For an introduction to time series forecasting, see the Kaggle course on this topic; the technique discussed here builds on its lesson on seasonality.
Consider the Australian quarterly Portland cement production dataset, showing the total quarterly production of Portland cement in Australia (in millions of tonnes) from 1956Q1 to 2014Q1.
import pandas as pd

df = pd.read_csv('Quarterly_Australian_Portland_Cement_production_776_10.csv', usecols=['time', 'value'])
# convert time from a year float (e.g. 1956.25) to a proper datetime (e.g. 1956-04)
df['time'] = df['time'].apply(lambda x: str(int(x)) + '-' + str(int(1 + 12 * (x % 1))).rjust(2, '0'))
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time').to_period()
df.rename(columns={'value': 'production'}, inplace=True)
        production
time
1956Q1       0.465
1956Q2       0.532
1956Q3       0.561
1956Q4       0.570
1957Q1       0.529
...            ...
2013Q1       2.049
2013Q2       2.528
2013Q3       2.637
2013Q4       2.565
2014Q1       2.229

[233 rows x 1 columns]
This is time series data with a trend, seasonal components, and other attributes. Observations are made quarterly and span several decades. Let’s first take a look at the seasonal components, using the periodogram graph (all the code is included in the companion notebook, linked at the end).
The periodogram shows the power density of the spectral components (the seasonal components). The strongest component is the one with a period of 4 quarters, or 1 year. This confirms the visual impression that the strongest up-and-down variations in the chart occur about 10 times per decade. There is also a component with a period of 2 quarters; it is part of the same seasonal phenomenon, and simply means that the 4-quarter periodicity is not a simple sine wave but has a more complex shape. For simplicity, we will ignore components with periods longer than 4 quarters.
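The exact plotting code is in the companion notebook; as a minimal sketch, the periodogram of this series could be computed with scipy.signal.periodogram (the detrend choice here is an illustrative assumption):

import matplotlib.pyplot as plt
from scipy.signal import periodogram

# fs=4 samples per year, so frequencies come out in cycles per year;
# a peak at frequency 1 corresponds to the 4-quarter (annual) component
freq, power = periodogram(df['production'], fs=4, detrend='linear')
plt.semilogy(freq, power)
plt.xlabel('frequency [cycles/year]')
plt.ylabel('power spectral density')
plt.show()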
If you are trying to fit a model to this data, perhaps to forecast cement production for times after the end of the dataset, it would be a good idea to capture this annual seasonality in a feature or set of features, and provide those to the model in training. Let’s look at an example.
Seasonal components can be modeled in different ways. They are often represented as time dummies or as sine-cosine pairs. These are synthetic features with a period equal to the seasonality you are trying to model, or a submultiple of it.
Annual time dummies:
from statsmodels.tsa.deterministic import DeterministicProcess

seasonal_year = DeterministicProcess(index=df.index, constant=False, seasonal=True).in_sample()
print(seasonal_year)
        s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1     1.0     0.0     0.0     0.0
1956Q2     0.0     1.0     0.0     0.0
1956Q3     0.0     0.0     1.0     0.0
1956Q4     0.0     0.0     0.0     1.0
1957Q1     1.0     0.0     0.0     0.0
...        ...     ...     ...     ...
2013Q1     1.0     0.0     0.0     0.0
2013Q2     0.0     1.0     0.0     0.0
2013Q3     0.0     0.0     1.0     0.0
2013Q4     0.0     0.0     0.0     1.0
2014Q1     1.0     0.0     0.0     0.0

[233 rows x 4 columns]
Annual sine-cosine pairs:
from statsmodels.tsa.deterministic import CalendarFourier

cfr = CalendarFourier(freq='Y', order=2)
seasonal_year_trig = DeterministicProcess(index=df.index, seasonal=False, additional_terms=[cfr]).in_sample()
with pd.option_context('display.max_columns', None, 'display.expand_frame_repr', False):
    print(seasonal_year_trig)
        sin(1,freq=A-DEC)  cos(1,freq=A-DEC)  sin(2,freq=A-DEC)  cos(2,freq=A-DEC)
time
1956Q1           0.000000           1.000000           0.000000           1.000000
1956Q2           0.999963           0.008583           0.017166          -0.999853
1956Q3           0.017166          -0.999853          -0.034328           0.999411
1956Q4          -0.999963          -0.008583           0.017166          -0.999853
1957Q1           0.000000           1.000000           0.000000           1.000000
...                   ...                ...                ...                ...
2013Q1           0.000000           1.000000           0.000000           1.000000
2013Q2           0.999769           0.021516           0.043022          -0.999074
2013Q3           0.025818          -0.999667          -0.051620           0.998667
2013Q4          -0.999917          -0.012910           0.025818          -0.999667
2014Q1           0.000000           1.000000           0.000000           1.000000

[233 rows x 4 columns]
Time dummies can capture very complex seasonal waveforms. On the other hand, if the period is N, then you need N time dummies; if the period is very long, you will need many dummy columns, which may not be desirable. For non-penalized models, only N-1 dummies are often used, one being dropped to avoid collinearity problems (we will ignore that here).
Sine/cosine pairs can model arbitrarily long periods of time. Each pair models a pure sine waveform with some initial phase. As the waveform of the seasonal feature becomes more complex, you will need to add more pairs (increase the order of the CalendarFourier output).
(Side note: we use a sine/cosine pair for each period we want to model. What we would really like is a single column of A*sin(2*pi*t/T + phi), where A is the weight the model assigns to the column, t is the time index of the series, and T is the period of the component. But the model cannot adjust the initial phase phi while fitting the data; the sine values are computed in advance. Fortunately, the formula above is equivalent to A1*sin(2*pi*t/T) + A2*cos(2*pi*t/T), and the model only needs to find the weights A1 and A2.)
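To see why the two forms are equivalent, recall the identity sin(x + phi) = sin(x)*cos(phi) + cos(x)*sin(phi), which gives A1 = A*cos(phi) and A2 = A*sin(phi). A quick numerical check (the values of A, phi and T below are arbitrary, chosen only for illustration):

import numpy as np

# verify: A*sin(2*pi*t/T + phi) == A1*sin(2*pi*t/T) + A2*cos(2*pi*t/T)
# with A1 = A*cos(phi) and A2 = A*sin(phi)
A, phi, T = 1.7, 0.6, 4.0
t = np.arange(12)
lhs = A * np.sin(2 * np.pi * t / T + phi)
rhs = A * np.cos(phi) * np.sin(2 * np.pi * t / T) + A * np.sin(phi) * np.cos(2 * np.pi * t / T)
print(np.allclose(lhs, rhs))  # True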
TLDR: What these two techniques have in common is that they represent seasonality through a set of repeating features, with values that alternate either once per the time period being modeled (the time dummies, and the base sine/cosine pair), or several times per period (the higher-order sine/cosine pairs). And each of these features has values that vary within a fixed interval: from 0 to 1, or from -1 to 1. These are all the features we have to model seasonality here.
Let’s see what happens when we fit a linear model to such seasonal features. But first, we need to add trend features to the collection of features used to train the model.
Let’s create trend features and then join them with the seasonal_year time dummies generated earlier:
trend_order = 2
trend_year = DeterministicProcess(index=df.index, constant=True, order=trend_order).in_sample()
X = trend_year.copy()
X = X.join(seasonal_year)
print(X)
        const  trend  trend_squared  s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1    1.0    1.0            1.0     1.0     0.0     0.0     0.0
1956Q2    1.0    2.0            4.0     0.0     1.0     0.0     0.0
1956Q3    1.0    3.0            9.0     0.0     0.0     1.0     0.0
1956Q4    1.0    4.0           16.0     0.0     0.0     0.0     1.0
1957Q1    1.0    5.0           25.0     1.0     0.0     0.0     0.0
...       ...    ...            ...     ...     ...     ...     ...
2013Q1    1.0  229.0        52441.0     1.0     0.0     0.0     0.0
2013Q2    1.0  230.0        52900.0     0.0     1.0     0.0     0.0
2013Q3    1.0  231.0        53361.0     0.0     0.0     1.0     0.0
2013Q4    1.0  232.0        53824.0     0.0     0.0     0.0     1.0
2014Q1    1.0  233.0        54289.0     1.0     0.0     0.0     0.0

[233 rows x 7 columns]
This is the X dataframe (the features) that will be used to train/validate the model. We are modeling the data with quadratic trend features, plus the four dummy variables needed for the annual seasonality. The y dataframe (the target) will simply be the cement production figures.
Let’s carve a validation set out of the data, containing the observations from 2010 onward. We will fit a linear model to the X features shown above, with the target y being the cement production, on the train part, and then evaluate the model’s performance on validation. We will also do all of the above with a trend-only model that fits only the trend features and ignores seasonality.
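The split indices are not shown in the article itself (they come from the companion notebook); a plausible reconstruction, assuming the validation set holds everything from 2010 onward, would be:

# hypothetical reconstruction of the split (the actual definition is in the companion notebook):
# train on everything before 2010, validate on 2010 and later
index_train = df.index[df.index.year < 2010]
index_test = df.index[df.index.year >= 2010]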
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

def do_forecast(X, index_train, index_test, trend_order):
    X_train = X.loc[index_train]
    X_test = X.loc[index_test]
    y_train = df['production'].loc[index_train]
    y_test = df['production'].loc[index_test]

    # full model: trend plus seasonal features
    model = LinearRegression(fit_intercept=False)
    _ = model.fit(X_train, y_train)
    y_fore = pd.Series(model.predict(X_test), index=index_test)
    y_past = pd.Series(model.predict(X_train), index=index_train)

    # trend-only model: const and trend columns, seasonality ignored
    trend_columns = X_train.columns.to_list()[0 : trend_order + 1]
    model_trend = LinearRegression(fit_intercept=False)
    _ = model_trend.fit(X_train[trend_columns], y_train)
    y_trend_fore = pd.Series(model_trend.predict(X_test[trend_columns]), index=index_test)
    y_trend_past = pd.Series(model_trend.predict(X_train[trend_columns]), index=index_train)

    RMSLE = mean_squared_log_error(y_test, y_fore, squared=False)
    print(f'RMSLE: {RMSLE}')

    # plot_params is defined in the companion notebook
    ax = df.plot(**plot_params, title='AUS Cement Production - Forecast')
    ax = y_past.plot(color='C0', label='Backcast')
    ax = y_fore.plot(color='C3', label='Forecast')
    ax = y_trend_past.plot(ax=ax, color='C0', linewidth=3, alpha=0.333, label='Trend Past')
    ax = y_trend_fore.plot(ax=ax, color='C3', linewidth=3, alpha=0.333, label='Trend Future')
    _ = ax.legend()

do_forecast(X, index_train, index_test, trend_order)
RMSLE: 0.03846449744356434
Blue is train, red is validation. The full model is the thin, sharp line; the trend-only model is the wide, fuzzy line.
This is not bad, but there is an obvious problem: the model has learned an annual seasonality of constant amplitude. Even though the annual variation actually increases in time, the model can only stick to fixed-amplitude variations. Obviously, this is because we gave the model only fixed-amplitude seasonal features, and there is nothing else in the features (target lags, etc.) to help it overcome this issue.
Let’s dig deeper into the signal (the data) to see if there is anything that can help with the fixed amplitude issue.
The periodogram highlights all the spectral components of the signal (all the seasonal components of the data) and provides an overview of their overall strength, but it is an aggregate, a summary taken over the whole time interval. It says nothing about how the amplitude of each seasonal component may vary in time across the dataset.
To capture that variation, you have to use the Fourier spectrogram instead. It is like the periodogram, but computed repeatedly over many time windows across the whole dataset. Both the periodogram and the spectrogram are available as functions in the scipy.signal library.
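The compute_spectrum and plot_spectrogram helpers used below are defined in the companion notebook. As a minimal sketch, a spectrogram of this series could be computed directly with scipy.signal.spectrogram (the window length and overlap here are illustrative assumptions):

from scipy.signal import spectrogram

# fs=4 samples per year; 16-quarter windows with 50% overlap (illustrative choices)
freq, t, Sxx = spectrogram(df['production'].to_numpy(), fs=4, nperseg=16, noverlap=8, detrend='linear')
# Sxx[i, j] is the power of frequency freq[i] in the window centered at time t[j]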
Let’s plot the spectrogram of the seasonal components with periods of 2 and 4 quarters, mentioned above. As always, the full code is in the companion notebook linked at the end.
spectrum = compute_spectrum(df['production'], 4, 0.1)
plot_spectrogram(spectrum, figsize_x=10)
What this diagram shows is the amplitude, the “strength” of the 2-quarter and 4-quarter components over time. They are initially quite weak, but become very strong around 2010, which matches the variations that can be seen in the data set graph at the beginning of this article.
What if, instead of feeding the model with fixed-amplitude seasonal features, we adjust the amplitude of these features over time, matching what the spectrogram tells us? Would the final model perform better in validation?
Let’s pick the 4-quarter seasonal component. We could fit a linear model (call it the envelope model) on the order=2 trend of this component in the train data (model.fit()), and then extend that trend over validation/test (model.predict()):
envelope_features = DeterministicProcess(index=X.index, constant=True, order=2).in_sample()

spec4_train = compute_spectrum(df['production'].loc[index_train], max_period=4)
spec4_model = LinearRegression()
spec4_model.fit(envelope_features.loc[spec4_train.index], spec4_train['4.0'])
spec4_regress = pd.Series(spec4_model.predict(envelope_features), index=X.index)

ax = spec4_train['4.0'].plot(label='component envelope', color='gray')
spec4_regress.loc[spec4_train.index].plot(ax=ax, color='C0', label='envelope regression: past')
spec4_regress.loc[index_test].plot(ax=ax, color='C3', label='envelope regression: future')
_ = ax.legend()
The gray line is the raw envelope: the strength of the variation of the 4-quarter component in the train data. The blue line is that envelope fitted as a quadratic trend (order=2), and the red line is the same trend extended (predicted) over the validation data.
We have now modeled the time variation of the 4-quarter seasonal component. We can take the output of the envelope model and multiply the time dummies corresponding to the 4-quarter seasonal component by it:
spec4_regress = spec4_regress / spec4_regress.mean()

season_columns = ['s(1,4)', 's(2,4)', 's(3,4)', 's(4,4)']
for c in season_columns:
    X[c] = X[c] * spec4_regress
print(X)
        const  trend  trend_squared    s(1,4)    s(2,4)    s(3,4)    s(4,4)
time
1956Q1    1.0    1.0            1.0  0.179989  0.000000  0.000000  0.000000
1956Q2    1.0    2.0            4.0  0.000000  0.181109  0.000000  0.000000
1956Q3    1.0    3.0            9.0  0.000000  0.000000  0.182306  0.000000
1956Q4    1.0    4.0           16.0  0.000000  0.000000  0.000000  0.183581
1957Q1    1.0    5.0           25.0  0.184932  0.000000  0.000000  0.000000
...       ...    ...            ...       ...       ...       ...       ...
2013Q1    1.0  229.0        52441.0  2.434701  0.000000  0.000000  0.000000
2013Q2    1.0  230.0        52900.0  0.000000  2.453436  0.000000  0.000000
2013Q3    1.0  231.0        53361.0  0.000000  0.000000  2.472249  0.000000
2013Q4    1.0  232.0        53824.0  0.000000  0.000000  0.000000  2.491139
2014Q1    1.0  233.0        54289.0  2.510106  0.000000  0.000000  0.000000

[233 rows x 7 columns]
The four time dummies are no longer either 0 or 1. They have been multiplied by the component envelope, and now their amplitude varies in time, just like the envelope does.
Let’s train the main model again, now using the modified time dummies.
do_forecast(X, index_train, index_test, trend_order)
RMSLE: 0.02546321729737165
We’re not looking for a perfect fit here, since this is just a toy model, but it’s obvious that the model performs better, following the 4-quarter variations in the target more closely. The performance metric has also improved: the RMSLE dropped from 0.0385 to 0.0255 (the fixed-amplitude model’s error was about 51% higher), which is not bad at all.
Improving model performance is a vast topic. The behavior of any model does not depend on a single feature or a single technique. However, if you’re looking to extract everything you can from a given model, you should probably feed it meaningful features. Time dummies, or sine/cosine Fourier pairs, are most meaningful when they reflect the time variation of the seasonality they are modeling.
Adjusting the envelope of the seasonal components via the Fourier transform is particularly effective for linear models. Boosted trees do not benefit as much, but you can still see improvements almost no matter what model you use. If you are using a hybrid model, where a linear model handles the deterministic features (calendar, etc.) while a boosted-trees model handles the more complex features, it is important that the linear model “gets it right”, so that it leaves less work for the other model.
It is also important that the envelope models you use to adjust the seasonal features are trained only on the train data, and do not see any test data during training, just like any other model. Once the envelope is adjusted, the time dummies contain target information; they are no longer purely deterministic features that can be computed in advance over any arbitrary forecast horizon. So at this point, information could leak from validation/test into the training data if you are not careful.
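As a sketch of the distinction, reusing the names from the earlier snippets (the leaky variant is deliberately commented out):

# safe: the envelope model is fitted on the training span only,
# then merely extrapolated over the full index, including the test period
spec4_model.fit(envelope_features.loc[spec4_train.index], spec4_train['4.0'])
spec4_regress = pd.Series(spec4_model.predict(envelope_features), index=X.index)

# leaky (do NOT do this): computing the envelope from the full series would
# bake test-period amplitude information into the training features
# spec4_leak = compute_spectrum(df['production'], max_period=4)
# spec4_model.fit(envelope_features.loc[spec4_leak.index], spec4_leak['4.0'])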
The dataset used in this article is available here under the public domain license (CC0):
The code used in this article can be found here:
A notebook submitted to the Store Sales: Time Series Forecasting contest on Kaggle, using the ideas described in this article:
A GitHub repository containing the development version of the notebook submitted to Kaggle is here:
All images and code used in this article are created by the author.