Building Robust Linear Models For Nonlinear, Heteroscedastic Data

A step-by-step tutorial in Python

in this section, we’ll cover the following topics:

A brief overview of assumptions of Linear Regression models which include among other things, linearity of relationships, and homoscedastic (i.e. constant variance) residual errors.
A step-by-step guide to fitting regression linear models to real-world data which is often nonlinear and not homoscedastic.
Introduction to the Duan’s smearing estimator: a necessary tool for reducing prediction errors.
A Python based tutorial: All along, we’ll work with a real world data set of the Export Price Index of Gold from the US FRED, and we’ll illustrate each step in detail using Python, Pandas, Numpy, Patsy and Statsmodels.

This section has a hands-on ‘learn-as-you-code’ style.

I’ll start you off with a little bit of theory on point #1, then we’ll dive straight into the hands-on part in points 2 thru 4.

Assumptions of Linear Regression Models

Linear regression models such the Ordinary Least Squares Regression (OLSR) model are incredibly powerful for modeling linear relationships. Their operating characteristics are well-understood and they are backed by decades of research, leading to results that are explainable, defensible and highly usable.

Linear regression models come with several assumptions, namely:

Linearity: The relationship between the dependent and explanatory variables is assumed to be linear i.e. it can be represented using the following equation:
y = β*X + ϵ
Where y is the dependent variable vector, X is the matrix of explanatory variables which includes the intercept, β is the vector of (constant-value) regression coefficients and ϵ is the vector of error terms i.e. the part of y that X is unable to explain.
i.i.d. residuals: The residual errors ϵ of the regression model are assumed to be independent of each other (i.e. no correlation among the residuals) and identically distributed.
Normally distributed residuals: We prefer (but do not strictly require) the residuals to be normally distributed.
Homoscedastic residuals: The variance in the residuals is assumed to be constant, in particular the variance should not be a function of the dependent variable y (or the explanatory variables X), or a function of time (in case of time series data).

In real world data sets, data is often nonlinear and heteroscedastic (i.e. non-homoscedastic). The model’s residual errors may also not be perfectly i.i.d. or normally distributed.

In this section, we’ll see how to fit linear models to data, in spite of these practical hurdles.

Let’s dive in.

A real world data set

We’ll use the following Gold Price Index data set (available here):

Export Price Index (End Use): Non-monetary gold (Source: US FRED)

STEP 1: LOAD THE DATA SET

Let’s load the data into a pandas DataFrame:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('monthly_gold_price_index_fred.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
print(df.head(10))

Here are the first 10 rows:

We’ll add a new column called Time_Period which takes values from 1 to 132.

df['Time_Period'] = range(1, len(df)+1)

Print the first 10 rows:

print(df.head(10))

STEP 2: INSPECT THE DATA

Plot the dependent variable Export_Price_Index_of_Gold against the Time_Period:

#Create a new pyplot figure to plot into
fig = plt.figure()
#Set the title of the plot
fig.suptitle('Export Price Index of Gold')
#Set the X and Y axis labels
plt.xlabel('Time Period')
plt.ylabel('Price Index')
#plot the time series and store the plot in the actual variable. We'll need that later for the legend.
actual, = plt.plot(df['Time_Period'], df['Export_Price_Index_of_Gold'], 'bo-', label='Gold Price Index')
#Set up the legend. There is only one time series in the legend.
plt.legend(handles=[actual])
#Show everything
plt.show()

Here is the plot:

Export price index of gold plotted for 132 consecutive monthly time periods from Jan 2001 to Dec 2011 (*Source:* *US FRED*) (Image by Author)

The price data displays heteroscedastic variance i.e non-constant variance, and nonlinear growth.

To confirm the heteroscedasticity, let’s plot the first difference of Export_Price_Index_of_Gold against Time_Period:

#create a time lagged column
df['LAGGED_Export_Price_Index_of_Gold'] = df['Export_Price_Index_of_Gold'].shift(1)
#Do a diff between the Export_Price_Index_of_Gold column and the time lagged version
df['DIFF_Export_Price_Index_of_Gold'] = df['Export_Price_Index_of_Gold']-df['LAGGED_Export_Price_Index_of_Gold']
#Plot the diff column using Series.plot()
df['DIFF_Export_Price_Index_of_Gold'].plot()
#Display the plot
plt.show()

Here’s the first-difference plot showing the variance increasing with time:

First difference of the Export Price Index showing the heteroscedastic variance (Image by Author)

STEP 3: REMOVE THE NONLINEARLITY

We’ll tackle the nonlinearity in the data first.

Taking the logarithm or the square root of the dependent variable has the effect of making the data linear and at the same time dampening down the heteroscedasticity in it.

We’ll use the natural log of the Export Price Index as the log function grows more slowly than the square root function and so its transformation effect is stronger than the square root.

While using these two transforms — the log and the square root — we should also keep in mind their shortcomings.

Drawbacks of the log and square-root transforms:

Negative values: Both transforms yield undefined values for negative y values. The negative portion of the data-set needs to be dealt with differently such as by means of a two-part model.
Zero values: The log transform is undefined for y=0. This can be dealt with by adding a tiny positive value to each y value at the expense of introducing a bias. A better approach is to use a two-part model in which a Logistic Regression model learns to differentiate between zero and non-zero data, and an OLSR model is fitted to a log-transformed y for the positive valued data. Another approach is to use a Generalized Linear Model (GLM) which can tolerate zero values.

The log and square root transforms work best for positive-valued data.

There are also other transforms to choose from such as the log-log transform for double-exponential data.

So which is the right transform (a.k.a. scale) to choose for your data?

The underlying assumption of the log-transform is that the raw data exhibits an exponential trend. With the square-root transform, you are assuming that the raw data exhibits a power trend.

In their book, “Generalized Linear Models”, Messrs. McCullagh and Nelder offer a succinct piece of advice on choosing the correct transform, which I will paraphrase as follows:

A good transform function for your data should be one which achieves the following 3 goals:

It makes the variance more or less constant i.e. it dampens down the heteroscedasticity in the data.

It makes the residual errors of the model almost normally distributed.

It ensures a linear, additive relationship between the explanatory variables and the dependent variable of the linear model.

Let’s continue with our transformation exercise with the log-transform.

Let’s add a new column to our Data Frame called LOG_Export_Price_Index_of_Gold containing. We’ll use numpy.log() to do this.

df['LOG_Export_Price_Index_of_Gold'] = np.log(df['Export_Price_Index_of_Gold'])

A side-by-side comparison of the raw and the logged variables reveals that the log-transform has linearized the time series:

#Create a (2 x 1) grid of subplots
ax = plt.subplot(1, 2, 1)
#Set the title of the first sub-plot
ax.set_title('Export_Price_Index_of_Gold versus Time_Period', fontdict={'fontsize': 12})
#Plot the RAW scale plot
plt.scatter(x=df['Time_Period'].values, y=df['Export_Price_Index_of_Gold'].values, color='r', marker='.')
#Setup the second subplot
ax = plt.subplot(1, 2, 2)
ax.set_title('LOG(Export_Price_Index_of_Gold) versus Time_Period', fontdict={'fontsize': 12})
#Plot the LOG scale plot
plt.scatter(x=df['Time_Period'].values, y=df['LOG_Export_Price_Index_of_Gold'].values, color='b', marker='.')
#Display both subplots
plt.show()

Comparison between the raw scale and log scale plots of gold price index (Image by Author)

STEP 4: FIT A LINEAR REGRESSION MODEL TO THE LOG-TRANSFORMED DATASET

Since log(y) is linear with respect to Time_Period, we’ll fit an OLS regression model to log(y) using following linear regression equation:

log(Export_Price_Index_of_Gold) = β_0 + β_1*Time_Period

Where, β_0 & β_1 are respectively the regression intercept and the regression coefficient.

Import the regression packages:

import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

Form the model expression in patsy syntax. We are telling Patsy that LOG_Export_Price_Index_of_Gold depends on Time_Period. Patsy will auto-include the intercept β_0:

expr = 'LOG_Export_Price_Index_of_Gold ~ Time_Period'

Create the training and test sets:

#We'll train the model on the first 120 time periods i.e. 120 months of data and we'll test its predictions on the last 12 time periods i.e. 12 months
split_index = 119
#Get the indexed date at the split position
split_date = df.index[split_index]
#time periods 0 to 119 is the training set
df_train = df.loc[df.index <= split_date].copy()
#time periods 120 to 131 is the testing set
df_test = df.loc[df.index > split_date].copy()
print('Model will train on the first ' + str(len(df_train)) + ' months and make predictions for the final ' + str(len(df_test)) + ' months.')

I have purposely chosen the last 12 time periods as the holdout (test) set. If you look closely at the Gold Price Index plot, you’ll see that the data becomes slightly more nonlinear in the last 10–20 time periods making this zone especially challenging for a linear model from a forecasting perspective.

Let’s build and train an Ordinary Least Squares Regression (OLSR) model on the training set:

olsr_results = smf.ols(expr, df_train).fit()

Print the training summary:

print(olsr_results.summary())

Let’s inspect the model’s performance:

R-squared: The model has been able to explain 97.7% of the variance in log(y) which is a remarkably good fit.

F-statistic: The p-value of 6.42e-99 is incredibly tiny — much smaller than a critical value of even 0.1% (0.001). So we reject the null hypothesis of the F-test that model is no better than an intercept only model, and we accept the alternate hypothesis of the F-test that the model’s coefficients are jointly significant.

Significance of model coefficients: The p-values of the model’s coefficients suggest that they are individually statistically significant.

Overall, the model seems to have achieved a high goodness-of-fit on the training data set.

Let’s fetch the model’s predictions on the training and test sets.

Note that we are predicting log(y), not the raw y.

#Predict log(y) on the training data set
olsr_predictions_training_set = olsr_results.predict(df_train['Time_Period'])
#Predict log(y) on the testing data set
olsr_predictions_testing_set = olsr_results.predict(df_test['Time_Period'])

Plot the predicted and the actual values on the log-scale:

#Create the pyplot figure for plotting
fig = plt.figure()
fig.suptitle('Predicted versus actual values of LOG(price index)')
#Plot the log-scale PREDICTIONS for the training data set
predicted_training_set, = plt.plot(df_train.index, olsr_predictions_training_set, 'go-', label='Predicted (Training data set)')
#Plot the log-scale ACTUALS fpr the training data set
actual_training_set, = plt.plot(df_train.index, df_train['LOG_Export_Price_Index_of_Gold'], 'ro-', label='Actuals (Training data set)')
#Plot the log-scale PREDICTIONS for the testing data set
predicted_testing_set, = plt.plot(df_test.index, olsr_predictions_testing_set, 'bo-', label='Predicted (Testing data set)')
#Plot the log-scale ACTUALS for the testing data set
actual_testing_set, = plt.plot(df_test.index, df_test['LOG_Export_Price_Index_of_Gold'], 'mo-', label='Actuals (Testing data set)')
#Set up the legends
plt.legend(handles=[predicted_training_set, actual_training_set, predicted_testing_set, actual_testing_set])
#Display everything
plt.show()

Predicted versus actual values of log(y) on training and testing data sets — Predicted versus actual values of ***log(y)*** on training and testing data sets (Image by Author)

STEP 5: TRANSFORM THE MODEL’S PREDICTIONS FROM LOG SCALE BACK TO RAW SCALE

This is where one needs to be careful. Our instinct might be to simply exponentiate the log-scale predictions back to raw-scale y.

But our instinct would be wrong. Let’s see why that is.

If you like, you can skip the little bit of math that follows and scroll down to the section called Duan’s smearing estimator.

The conditional expectation of log(y) on X is related to the fitted model’s predictions via the following linear equation:

E(log(y)|X) = β_fitted*X + ϵ … (1)

Where:

E(log(y)|X) is the conditional expectation of log(y) i.e. the value of y predicted by the model for a given x.
β_fitted is the vector of the trained model’s coefficients, including the placeholder for the intercept.
X is the matrix of regression variables + one column for the regression intercept.
ϵ is the vector of residual errors of the fitted model. ϵ=Predictions minus Actuals

To transform the prediction back to the raw scale, we exponentiate both sides of equation (1):

exp(E(log(y)|X) = exp(β_fitted*X + ϵ) … (2)

What we want is not exp(E(log(y)|X). Instead we want E(exp(log(y)|X)). Notice the subtle but important difference between the two.

So we apply the E() operator to both sides of eq (2):

E(exp(E(log(y)|X))) = E(exp(β_fitted*X + ϵ))

Which, after some simplification becomes:

E(y|X) = E(exp(β_fitted*X)) * E(exp(ϵ))

Which finally simplifies to:

E(y)|X) = exp(β_fitted*X) * E(exp(ϵ))

Where:

E(X)|y) is the conditional expectation in the raw scale. Exactly what we want.
exp(β_fitted*X) are the model’s log-scale predictions, after exponentiation so as to convert them to the raw scale.
E(exp(ϵ)) is the expected value of the log-scale model residual errors of the model, after exponentiation so as to convert them to the raw scale.

Duan’s smearing estimator

In summary, in order to correctly re-transform the model’s predictions from the log-scale back to the raw scale, we need to do the following:

Exponentiate the regression model’s predictions, and,
Multiply the exponentiated predictions by the expected value of the exponentiated log-scale errors i.e. E(exp(ϵ)).

The expectation of the exponentiated residual errors — E(exp(ϵ)) — is called the Duan’s smearing estimator a.k.a. the smearing factor.

The value of E(exp(ϵ)) depends on the distribution of the log-scale residual errors ϵ, as follows:

CASE 1: The log-scale residual errors ϵ are normally distributed:
If ϵ ~ N(0, σ²) i.e. ϵ is normally distributed with mean µ =0 and variance σ², then the value of the smearing factor E(exp(ϵ)) is the expected value of the exponentiated random variable N(0, σ²). This is simply the first moment of N(0, σ²) viz. exp(0 + 0.5σ²) = exp(0.5σ²)

CASE 2: The log-scale residual errors ϵ are i.i.d. but not normally distributed:
If the log-scale residuals are independent, identically distributed random variables but they are not N(µ, σ²), one needs to find out their distribution and the corresponding Moment Generating Function (MGF). Then we take the first moment of that distribution to get the value of the smearing factor E(exp(ϵ)).

CASE 3: Distribution of log-scale residual error is unknown:
In this case:

where ϵ_i = error in the ith sample, and N=number of samples.

From the three cases described above, we can see that:

The smearing factor is large when the errors have a large spread, i.e. a large variance σ², or in case of case #3, they are large valued.

Let’s calculate the smearing factor for our example.

STEP 5A: CALCULATE THE DUAN’S SMEARING FACTOR

Let’s see if the residuals are normally distributed (CASE 1 above).

The residuals of our log-scale model are stored in olsr_results.resid. Here are the top few rows:

Residual errors of the OLSR (Image by Author)

To know if the errors are normally distributed, we’ll focus attention on 4 pieces of evidence in the bottom part of the OLSR model’s output:

Properties of the OLSR model’s residual errors (Image by Author)

Skew: The skewness of the residuals (-0.025) is almost zero. In comparison, a normal distribution has zero skewness.

Kurtosis: The kurtosis of the residual errors (2.984) is almost equal to 3.0. A normal distribution’s kurtosis is 3.0.

The Omnibus K² test of normality: The Omnibus test’s extermely high p-value (0.96) leads us to accept the test’s null hypothesis, namely that the residual errors are normally distributed.

The Jarque-Bera test of normality: The JB test’s extremely high p-value (0.993) again validates the JB test’s null hypothesis that the residuals are normally distributed.

All evidence says the residual errors are normally distributed.

This conclusion also bears out visually:

plt.hist(olsr_results.resid, bins=20)
plt.show()

Histogram plot of residual errors (Image by Author)

Let’s print out the mean µ and variance σ² of the residual errors using mean() and var() functions of the pandas.core.series.Series object:

print('Mean of residual errors='+str(olsr_results.resid.mean()))
print('Variance of residual errors='+str(olsr_results.resid.var()))

Mean of residual errors=-3.841371665203042e-15
Variance of residual errors=0.005319971501418866

We can see that that the mean is for all practical purposes, zero. This is entirely to be expected from an OLSR model. The variance is 0.00532.

Since we have proven that residual errors are N(0, σ²) distributed, as per CASE #1, the Duan’s smearing factor is:

E(exp(ϵ)) = exp(0.5σ²) = exp(0.5*0.00532) = 1.00266

Our smearing factor is in place. Are we ready to use it? Not yet. There is one last thing we need to check:

Are the residual errors homoscedastic?

STEP 5B: CHECK IF RESIDUAL ERRORS ARE HOMOSCEDASTIC

The Duan’s smearing factor yields biased results if the residual errors ϵ are not homoscedastic, i.e. they are heteroscedastic.

When ϵ is heteroscedastic, Variance(ϵ) is not constant. In fact Variance(ϵ) can be a function of the model’s explanatory variables X. In our example, X=Time_Period

For example, if ϵ is N( 0, σ²(X) ) distributed, i.e., the residuals are normally distributed with zero mean and a variance function: σ²(X), then as per CASE #1 mentioned earlier:

Smearing factor = E( exp(ϵ) ) = exp( 0.5*σ²(X) )

When the residual errors of the regression model are heteroscedastic on the log-scale, in order to calculate the smearing factor, we need to know how to calculate the variance function σ²(X).

In practice though, calculating the variance function σ²(X) may not be an easy task.

When the residuals are heteroscedastic on the log-scale, it’s often better to do the following:

Approaches for dealing with heteroscedastic residual errors

Verify that the effect of inflation and seasonality have been neutralized via inflation adjustment and seasonality adjustment. This is especially relevant for monetary data.
Check if any important explanatory variables are missing from the model and add them in.
Instead of using the raw residual errors ϵ, use the heteroscedasticity adjusted residual errors (a.k.a. the ‘whitened’ residuals) for computing the Duan’s smearing estimator. Statsmodels makes the whitened residuals available to you in the regression model’s training output via the variable RegressionResults.wresid.
Switch to a Generalized Linear Model (GLM). A GLM assumes that the variance is a function of the mean, with mean itself being a function of the explanatory variables X.
Switch to a Weighted Least Squares (WSS) or a Generalized Least Squares (GLS) model which does not assume that the variance is homoscedastic.
Finally, if the amount of heteroscedasticity in the residuals is small, and your OLSR model is otherwise performing well, just accept your OLSR model as is!

Let’s test if the residual errors of our model are homoscedastic. If they are, we are in the clear, otherwise we should consider one of the above 4 remedies.

Statsmodels contains an implementation of the White’s test of heteroscedasticity which can be easily applied to our residual errors as follows:

#Using patsy, pull out the X matrix containing the Time_Period and the intercept columns from the pandas Data Frame.
expr = 'LOG_Export_Price_Index_of_Gold ~ Time_Period'
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')

Import the test package and run the White’s test:

from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
keys = ['Lagrange Multiplier statistic:', 'LM test\'s p-value:', 'F-statistic:', 'F-test\'s p-value:']
test = het_white(olsr_results.resid, X_train)
lzip(keys, test)

This prints out the following output:

[('Lagrange Multiplier statistic:', 3.2148975951052883), 
 ("LM test's p-value:", 0.20039821889655918), 
 ('F-statistic:', 1.610406682366166), 
 ("F-test's p-value:", 0.2042032693339592)]

LM test: The LM test’s statistic follows the Chi-squared distribution with degrees of freedom = the DF of model-1=(3–1)=2. The p-value of 0.2 leads us to accept the null hypothesis of the test that there is no heteroscedastisticity in the residual errors.

F-test: The F-test’s statistic follows the F-distribution. Again, the high p-value of 0.204 confirms the null hypothesis of the test that there is no heteroscedastisticity in the residual errors.

Overall, we conclude that the residual errors are homoscedastic.

Let’s take the final step in our regresion jounery, which is to convert the predictions back to the raw scale and correct them using the Duan’s smearing factor.

STEP 5C: CONVERT PREDICTIONS FROM LOG-SCALE TO RAW-SCALE

Recollect that the log-scale predictions and the log-scale actuals on the training and the test sets are stored in the following variables:

olsr_predictions_training_set and df_train['LOG_Export_Price_Index_of_Gold']
olsr_predictions_testing_set and df_test['LOG_Export_Price_Index_of_Gold'].

Let’s exponentiate all four variables to convert them back to the raw scale. We’ll use numpy.exp() for this:

olsr_predictions_raw_scale_training_set = np.exp(olsr_predictions_training_set)
actuals_raw_scale_training_set = np.exp(df_train['LOG_Export_Price_Index_of_Gold'])
olsr_predictions_raw_scale_testing_set = np.exp(olsr_predictions_testing_set)
actuals_raw_scale_testing_set = np.exp(df_test['LOG_Export_Price_Index_of_Gold'])

Multiply the raw scale predictions with the Duan’s smearing factor of 1.00266 that we had calculated earlier.

adjusted_olsr_predictions_raw_scale_training_set = olsr_predictions_raw_scale_training_set*1.00266
adjusted_olsr_predictions_raw_scale_testing_set = olsr_predictions_raw_scale_testing_set*1.00266

Finally, plot all four (adjusted) raw scale values:

#Create the pyplot figure for plotting
fig = plt.figure()
fig.suptitle('Predicted versus actual values of Gold Price Index')
#Plot the raw scale predictions made on the training data set
predicted_training_set, = plt.plot(df_train.index, adjusted_olsr_predictions_raw_scale_training_set, 'go-', label='Predicted (Training data set)')
#Plot the raw scale predictions made on the testing data set
predicted_testing_set, = plt.plot(df_test.index, adjusted_olsr_predictions_raw_scale_testing_set, 'bo-', label='Predicted (Testing data set)')
#Plot the raw scale actual values in the testing data set
actual_testing_set, = plt.plot(df_test.index, df_test['Export_Price_Index_of_Gold'], 'mo-', label='Actuals (Testing data set)')
#Set up the legends
plt.legend(handles=[predicted_training_set, actual_training_set, predicted_testing_set, actual_testing_set])
#Display everything
plt.show()

We get the following plot:

Predicted versus actual values of raw scale y on training and testing data sets — Predicted versus actual values of ***raw scale y*** on training and testing data sets (Image by Author)

One should not feel too disappointed with the somewhat large deviation between the actual (magenta) and the predicted (blue) values on the test data set, for two reasons:

Notice that the raw data becomes appreciably nonlinear during the holdout time period and so our holdout set has been especially challenging for the linear model, and,
We are asking our model to predict the price index for a full 12 months into the future! In reality, we would do rolling n-step ahead forecasts where n could be 1 to 6 months at the most.

Summary

We covered several topics in this section. Let’s summarize them:

Overview of assumptions of linear regression models.
Techniques for linearizing nonlinear data, drawbacks of those techniques and how to deal with the drawbacks.
How to fit a linear model to the linearized data and how to evaluate its performance.
Overview of the Duan’s smearing factor and how to use it to improve the accuracy of the forecasts of the fitted model.
How to detect heteroscedasticity and your options for dealing with it in the model’s residual errors.

Citations and Copyrights

Data set

U.S. Bureau of Labor Statistics, Export Price Index (End Use): Non-monetary Gold [IQ12260], retrieved from FRED, Federal Reserve Bank of St. Louis;, June 19, 2021. Curated version for download

Book

McCullagh P., Nelder J. A., Generalized Linear Models”, Chapman and Hall/CRC; 2nd edition (August 1, 1989), ISBN-13 : 978-0412317606

Papers

Duan, Naihua. “Smearing Estimate: A Nonparametric Retransformation Method.” Journal of the American Statistical Association, vol. 78, no. 383, 1983, pp. 605–610. JSTOR, www.jstor.org/stable/2288126. Accessed 6 June 2021.

Willard G Manning, John Mullahy, Estimating log models: to transform or not to transform?, Journal of Health Economics, Volume 20, Issue 4, 2001, Pages 461-494, ISSN 0167-6296, https://doi.org/10.1016/S0167-6296(01)00086-8. (https://www.sciencedirect.com/science/article/pii/S0167629601000868)

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: Introduction to Heteroscedasticity

NEXT: What Are Dummy Variables And How To Use Them In A Regression Model

UP: Table of Contents