Regression with ARIMA errors combines two powerful statistical models namely, Linear Regression, and ARIMA (or Seasonal ARIMA), into a single super-powerful regression model for forecasting time series data.
The following schematic illustrates how Linear Regression, ARIMA and Seasonal ARIMA models are combined to produce the Regression with ARIMA errors model:
In this section, we’ll look at the most general of these models, called as Regression with Seasonal ARIMA errors or SARIMAX for short.
SARIMAX — the concept
Suppose your time series data set consists of a response variable and some regression variables. Suppose also that the regression variables are contained in a matrix X, and the response variable a.k.a. dependent variable is contained in a vector y. At each time step i, y takes some value y_i and there is a corresponding row vector x_i in X containing the values of all the regression variables at time step i. This situation is illustrated by the following figure:
A naive approach to model this data using a linear regression model as follows:
In the above model specification, β(cap) is an (m x 1) size vector storing the fitted model’s regression coefficients.
ε, the residual errors of regression is the difference between the actual y and the value y(cap) predicted by the model. So at each time step i:
ε_i = y_i — y(cap)_i.
ε is a vector of size (n x 1), assuming a data set spanning n time steps.
But alas, our elegant Linear Regression model will not work for time series data for a very simple reason: time series data are auto i.e. self correlated. A given value of y_i in y is influenced by previous values of y i.e. y_(i-1), y_(i-2) etc. Linear regression models are unable to ‘explain’ such auto-correlations.
Therefore, if you fit a straight-up linear regression model to the (y, X) data set, these auto-correlations will leak into the the residual errors of regression (ε), making the ε auto-correlated!
We have seen in the section on the Assumptions of Linear Regression that Linear Regression models assume that the residual errors of regression are independent random variables with identical normal distributions. But if the residual errors are auto-correlated, they cannot be independent, causing many problems. A major problem caused by auto-correlated residual errors is that one cannot use statistical tests of significance such as the F-test or the Student’s t-test to determine whether the regression coefficients are significant. Neither can one rely on the standard errors of the regression coefficients. This in turn makes the confidence intervals of the regression coefficients, and those of the model’s forecasts, unreliable. Just this one problem of auto-correlation in the residual errors, causes a cascading litany of problems with the Linear Regression model, rendering it practically useless for modeling time series data.
But instead of throwing away the powerful Linear Regression model, one can fix the problem of auto-correlated residuals by bringing in another powerful model, namely ARIMA (or Seasonal ARIMA). (S)ARIMA models are perfectly suited for dealing with auto-correlated data. We harness this ability of SARIMA, by modeling the residual errors of linear regression using the SARIMA model.
The SARIMA model
A SARIMA model consists of the following 7 components:
AR: The Auto-Regressive (AR) component is a linear combination of past values of the time series up to some number of lags p. i.e. y_i is a linear combination of y_(i-1), y_(i-2),…y_(i-p) as follows:
y_i is the actual value observed at the ith time step. The phi(cap)_i are the fitted model’s regression coefficients. The ε_i is the residual error of regression for the ith time step. The order ‘p’ of the AR(p) model is determined using a combination of well-known rules and modeler’s judgement.
MA: The Moving Average (MA) component of SARIMA is a linear combination of the model’s past errors up to some number of lags q. The model’s past errors are calculated by subtracting the past predictions from past actual values. The MA(q) model is expressed as follows:
y_i is the actual value observed at the ith time step. The theta(cap)_i are the fitted model’s regression coefficients. The ε_i is the residual error of regression for the ith time step. The negative signs are as per the convention for specifying MA models. The order ‘q’ of the MA(q) model is determined using a combination of well-known rules and the modeler’s judgement.
The combined ARMA (p,q) model is simply the combination of the AR(p) and MA(q) models:
Order of differencing (d): The ARMA model cannot be used if the time series has a trend. Common examples of trend are linear trend, quadratic trend and exponential or logarithmic trends. If the time series demonstrates a trend, one applies one or more orders of differencing to the time series so as to remove the trend. A first order difference will remove a linear trend of the type y = m*x + x. Second and higher order differences will remove polynomial trends of the kind y = m*x² + c, y = x*x³ + c etc.
The difference operation in ARIMA models is denoted by the I letter. In ARIMA, I stands for Integrated. Differencing is applied by ARIMA models before the AR and the MA terms are brought into play. The order of differencing is denoted by the d parameter in the ARIMA(p,d,q) model specification.
SAR, SMA, D and m: The Seasonal ARIMA or SARIMA model simply extends the above concepts concepts of AR, MA and differencing to the seasonal realm by introducing a Seasonal AR (SAR) term of order P, Seasonal MA (SMA) term of order Q, and a Seasonal Difference of order D. The final parameter in SARIMA models is ‘m’ which is the seasonal period. For e.g. m=12 months for a time series that exhibits yearly seasonality. Just as with p,d and q, there are well established rules for estimating the values of P, D, Q and m.
The complete SARIMA model specification is ARIMA(p,d,q)(P,D,Q)m.
Now let’s return to our little conundrum about auto-correlated residuals.
As mentioned earlier, (S)ARIMA models are perfectly suited for forecasting time series data, and particularly for dealing with auto-correlated data. We apply this property of SARIMA models to model the auto-correlated residuals of the Linear Regression model after it is fitted to the time series data set.
The resulting model is known as Regression with Seasonal ARIMA Errors or SARIMAX for short. It can be loosely specified as follows:
If (p,d,q), (P,D,Q) and m are chosen correctly, one would expect the residual errors of the SARIMAX model to be not auto-correlated. In fact, they would be expected to be independent, identically distributed random variables with zero mean and some constant variance.
In this section, I will not go into the details of the rules for estimating p,d,q,P,D,Q and m. Instead, we’ll focus on building a SARIMAX model for a real world data set using Python and Statsmodels. As we go along, we’ll apply various rules for fixing SARIMAX’s parameters.
To get a good understanding of what the full set of rules is, please refer to the following excellent link on the topic:
Be sure to also review the rest of Prof. Nau’s lecture notes on ARIMA for an in-depth understanding of ARIMA and SARIMA models:
Using Python and statsmodels to build a Regression model with Seasonal ARIMA errors
We’ll put to use what we’ve learned so far. We’ll design a SARIMAX model for a real world data set. The data set we’ll use contains hourly readings of various air pollutants measured at a busy intersection in an Italian city from 2004 to 2005.
Here’s the link to the original data set.
I have adapted this data set for our SARIMAX experiment. The modified data set can be downloaded from here.
Let’s begin by importing all the Python packages we will be using:
import pandas as pd from statsmodels.regression import linear_model from patsy import dmatrices import statsmodels.graphics.tsaplots as tsa from matplotlib import pyplot as plt from statsmodels.tsa.seasonal import seasonal_decompose from statsmodels.tsa.arima.model import ARIMA as ARIMA import numpy as np
We’ll use pandas to load the data into a DataFrame:
df = pd.read_csv('air_quality_uci.csv', header=0)
Let’s print out the first 10 rows:
We’ll build a regression model to predict the hourly value of the PT08_S4_NO2 variable.
Our model’s variables will be as follows:
Dependent variable y is PT08_S4_NO2
The matrix of regression variables X will contain two variables:
Absolute Humidity AH
We will use the Regression with (Seasonal) ARIMA errors i.e. the (S)ARIMAX model to predict hourly values of PT08_S4_NO2 using hourly measurements of Temperature and Absolute Humidity.
Since we are using (S)ARIMAX, we may also implicitly use past values of the dependent variable PT08_S4_NO2 and past errors of the model as additional regression variables.
I have put (Seasonal) or (S) in brackets for now, since we don’t yet know whether there is any seasonality present in the residual errors of the regression. We’ll find that out soon enough.
We’ll use the following step-by-step procedure to build the (S)ARIMAX model:
STEP 1: Prepare the data
Convert the dateTime column in the data frame into a pandas DateTime column and set it as the index of the DataFrame.
df['DateTimeIndex']= pd.to_datetime(df['DateTime']) df = df.set_index(keys=['DateTimeIndex'])
Set the frequency attribute of the index to Hourly. This will create several empty rows corresponding to the missing hourly measurements in the original data set. Fill up all the empty data cells with the mean of the corresponding column.
df = df.asfreq('H') df = df.fillna(df.mean())
Verify that there are no empty cells in any column. The output should be all zeroes.
df.isin([np.nan, np.inf, -np.inf]).sum()
Create the training and the test data sets. Set the test data set length to be 10% of the training data set. We won’t actually need it to be that big, but let’s go with 10% for now.
dataset_len = len(df) split_index = round(dataset_len*0.9) train_set_end_date = df.index[split_index] df_train = df.loc[df.index <= train_set_end_date].copy() df_test = df.loc[df.index > train_set_end_date].copy()
STEP 2: Create a Linear Regression model
We’ll now fit an Ordinary Least Squares Linear Regression model on the training data set and fetch it’s vector of residual errors.
Let’s create the model expression in Patsy syntax. In the expression below, we are saying that PT08_S4_NO2 is the dependent variable and T and AH are the regression variables. An intercept term is assumed by default.
expr = 'PT08_S4_NO2 ~ T + AH'
Let’s carve out the y and X matrices. Patsy makes this really simple:
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Fit an OLSR model on the training dataset:
olsr_results = linear_model.OLS(y_train, X_train).fit()
Print out the OLSR model’s training results:
We see the following output. I have highlighted a couple of significant areas in the output:
We see that the regression coefficients of both regression variables T and AH are significant at a 99.99% confidence level as indicated by their P values (P > |t| column) which are essentially 0.
The second thing to note in these results is the output of the Durbin-Watson test which measures the degree of LAG-1 auto-correlation in the residual errors of regression. A value of 2 implies no LAG-1 auto-correlation. A value closer to 0 implies strong positive auto-correlation while a value close to 4 implies a strong negative auto-correlation at LAG-1 among the residuals errors ε.
In the above output, we see that the DW test statistic is 0.348 indicating a strong positive auto-correlation among the residual errors of regression at LAG-1. This was completely expected since the underlying data is a time series and the linear regression model has failed to explain the auto-correlation in the dependent variable. The DW test statistic just confirms it.
STEP 3: Estimate (S)ARIMA parameters (p,d,q), (P,D,Q) and m
We now begin the process of estimating the parameters of the SARIMA model on the OLSR model’s residual errors of regression ε. The regression errors are stored in the variable
Let’s start with plotting the Auto-correlation plot of the residual errors:
tsa.plot_acf(olsr_results.resid, alpha=0.05) plt.show()
We get the following plot:
The ACF tells us three things:
- There are strong auto-correlations extending out to multiple lags indicating that the residual errors time series has a trend. We’ll need to de-trend this time series by using one or possibly 2 orders of differencing. Thus, the parameter d is likely to be 1, or possibly 2.
- The wavelike pattern in the ACF evidences a seasonal variation in the data.
- The peak at LAG = 24 indicates that the seasonal period is likely to be 24 hours. i.e. m is likely to be 24. This seems reasonable for data containing vehicular pollution measurements. We’ll soon verify this guess using the time series decomposition plot.
Quick note: the LAG-0 autocorrelation will always be a perfect 1.0 and can be ignored as a value is perfectly correlated with itself.
Before we estimate the rest of the (S)ARIMA parameters, let’s difference the time series once i.e. d=1:
olsr_resid_diff_1 = olsr_results.resid.diff() olsr_resid_diff_1 = olsr_resid_diff_1.dropna()
Let’s replot the ACF of the differenced time series of residual errors:
tsa.plot_acf(olsr_resid_diff_1, alpha=0.05) plt.show()
We now see a very different picture in the ACF. The auto-correlations are significantly reduced at all lags. The wavelike pattern still exists but that’s because we did nothing to remove the possible seasonal variation. The LAG-24 auto-correlation is once again especially prominent.
We see that there is still a significant auto-correlation at LAG-1 in the differenced time series. We could try extinguishing it by taking one more difference, i.e. d=2 and plotting the resulting time series’ ACF:
olsr_resid_diff_2 = olsr_resid_diff_1.diff() olsr_resid_diff_2 = olsr_resid_diff_2.dropna() tsa.plot_acf(olsr_resid_diff_2, alpha=0.05) plt.show()
We get the following plot:
Unfortunately, diff-ing the time series a second time has produced a heavy negative auto-correlation at LAG-1. This is bad sign. We seem to have over-done the differencing. We should stick with d=1.
Let’s revisit the ACF of the DIFF(1) time series:
The single positive auto-correlation at LAG-1 indicates that we may want to fix the AR order p to 1. i.e. an AR(1) model.
Since we have fixed p to 1, for now, we’ll leave out the MA portion of the model. i.e. we fix q to 0. i.e. our SARIMA model will not have an MA component.
So far we have the following: p=1, d=1, q=0
Let’s verify that the seasonal period m is 24 hours. To do that, we’ll decompose the residual errors of regression into trend, seasonality and noise by using the
seasonal_decompose() function provided by statsmodels:
components = seasonal_decompose(olsr_results.resid) components.plot()
We get the following plot:
Let’s zoom into the seasonal component:
The seasonal component confirms that m=24.
The strong seasonal component warrants a single seasonal order of differencing, i.e. we set D=1.
Let’s apply a single seasonal difference to our already differenced time series of residual errors:
olsr_resid_diff_1_24 = olsr_resid_diff_1.diff(periods=24)
The strong negative correlation at LAG-24 indicates a Seasonal MA (SMA) signature with order 1. i.e. Q=1. Moreover, an absence of positive correlation at LAG-1, indicates an absence of a Seasonal AR component. i.e. P=0.
We have fixed P=0, D=1 and Q=1, and m=24 hours
We have managed to estimate all 7 params of the SARIMA model as follows:
p=1, d=1, q=0, P=0, D=1, Q=1 and m=24 i.e. SARIMAX(1,1,0)(0,1,1)24
STEP 4: Build and fit the Regression Model with Seasonal ARIMA errors
Let’s fit the SARIMAX model on the training data set (y_train, X_train) using the above parameters. Before we do that, we need to remove the intercept that Patsy had auto-added to X_train and reset the time series index frequency to Hourly on both X_train and y_train.
X_train_minus_intercept = X_train.drop('Intercept', axis=1) X_train_minus_intercept = X_train_minus_intercept.asfreq('H') y_train = y_train.asfreq('H')
Let’s build and test the SARIMAX model:
sarimax_model = ARIMA(endog=y_train, exog=X_train_minus_intercept,order=(1,1,0), seasonal_order=(0,1,1,24)) sarimax_results = sarimax_model.fit() sarimax_results.summary()
Here’s the training summary:
The number to look at first in the training summary is the Ljung-Box test’s statistic and its p-value. The Ljung-Box helps us determine if the residual errors of regression are auto-correlated in a statistically significant way. In this case, the p value is 0.4 which is significantly higher than 0.05 (95% confidence threshold). So we accept the Null hypothesis of the Ljung-Box test that the residual errors are not auto-correlated.
Two other things to note in the result: The Jarque-Bera test of normality has yielded a vanishingly small p-value implying a rejection of the Null hypothesis at a > 99.99% confidence level. The Null hypothesis being that the regression errors are normally distributed. This is probably because the errors are highly kurtotic (note that the Kurtosis=7.49 as against the 3.0 that it should have been for a normal distribution). Note also that the errors are not at all skewed (skewness=0.04 as against 0.0 for normally distributed errors).
The regression errors are also heteroskedastic i.e. they have non-constant variance. This is evidenced by the vanishingly small p-value of the H-test, but there are some things you can do to fix that.
We could also experiment with a different combination of parameters for p,d,q,P,D,Q and m. It turns out, this sort of an iteration with different SARIMAX models is not unusual in the SARIMAX world.
What we’ll do is to simplify our model by setting Q to 0. i.e. we’ll try a SARIMAX(1,1,0)(0,1,0)24 model:
sarimax_model = ARIMA(endog=y_train, exog=X_train_minus_intercept,order=(1,1,0), seasonal_order=(0,1,0,24)) sarimax_results = sarimax_model.fit() sarimax_results.summary()
Here’s the training summary:
In this revised model, we see that the p-value of Ljung-Box test statistic has reduced from the earlier value of 0.4 to 0.1 implying that the regression errors of this model are much less likely to be non-correlated than in the earlier model. The new model does not seem to have made the errors any less heteroskedastic or more normally distributed than the previous model.
We’ll accept earlier version of the model, namely SARIMAX(1,1,0)(0,1,1)24 model.
(Thanks to Thiago Ribas for identifying the errors in an earlier version of this section)
STEP 5: Prediction
The final step in our SARIMAX saga is using the chosen model to generate some predictions. We’ll ask the model to generate 24 out-of-sample predictions i.e. predict the value of the y (pollutant PT08_S4_NO2) for the next 24 hours beyond the end of the training data set. We’ll use the testing data set (y_test, X_test) which we had carved out in STEP-1. Recollect that the model has never seen the test data set during training.
Let’s prepare the test data set for prediction:
X_test_minus_intercept = X_test.drop('Intercept', axis=1) X_test_minus_intercept = X_test_minus_intercept.asfreq('H') y_test = y_test.asfreq('H')
get_forecast method to get the out of sample forecasts:
predictions = sarimax_results.get_forecast(steps=24, exog=X_test_minus_intercept[:24]) predictions.summary_frame()
We get the following output:
Let’s plot the actual value y_test from the test data set alongside the forecast value mentioned in the ‘mean’ column of the summary _frame. We’ll plot the lower and upper confidence intervals for each forecast value:
predicted, = plt.plot(X_test_minus_intercept[:24].index, predictions.summary_frame()['mean'], 'go-', label='Predicted') actual, = plt.plot(X_test_minus_intercept[:24].index, y_test[:24], 'ro-', label='Actual') lower, = plt.plot(X_test_minus_intercept[:24].index, predictions.summary_frame()['mean_ci_lower'], color='#990099', marker='.', linestyle=':', label='Lower 95%') upper, = plt.plot(X_test_minus_intercept[:24].index, predictions.summary_frame()['mean_ci_upper'], color='#0000cc', marker='.', linestyle=':', label='Upper 95%') plt.fill_between(X_test_minus_intercept[:24].index, predictions.summary_frame()['mean_ci_lower'], predictions.summary_frame()['mean_ci_upper'], color = 'b', alpha = 0.2) plt.legend(handles=[predicted, actual, lower, upper]) plt.show()
We get the following plot:
- Regression with (Seasonal) ARIMA errors (SARIMAX) is a time series regression model that brings together two powerful regression models namely, Linear Regression, and ARIMA (or Seasonal ARIMA).
- The Python Statsmodels library provides powerful support for building (S)ARIMAX models via the
statsmodels.tsa.arima.model.ARIMAclass in v0.12.0 of statsmodels, or via
- While configuring the (S)ARIMA portion of the (S)ARIMAX model, it helps to use a set of well-known rules (combined with personal judgement) for fixing the values of the p,d,q,P,D,Q and m parameters of the (S)ARIMAX model.
- A well designed (S)ARIMAX model’s residual errors of regression will have very little auto-correlation. This is indicated by the p value of the Ljung-Box test.
- Additionally, you would want the residual errors to be homoscedastic, and (preferably) normally distributed. So you may have to experiment with different combinations of p,d,q,P,D,Q until you get a model with the best goodness-of-fit characteristics.
Citations and Copyrights
Data set citation
S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750–757, ISSN 0925–4005, [Web Link]. ([Web Link])
Makridakis, S., Wheelwright, S. C., Hyndman, R. J. Forecasting Methods and Applications. Third Ed. John Wiley & Sons.
PREVIOUS: Holt-Winters Exponential Smoothing