Introduction to Heteroscedasticity - Statistical Modeling and Forecasting

Causes, effects, tests, and solutions using Python

Heteroscedasticity in the context of regression modeling, is what you have in your data when the conditional variance in your data is not constant. Conditional variance is the variability that you see in the dependent variable y for each value of the explanatory variables X, or each value of time period t (in case of time series data).

When your data is heteroscedastic:

Variance(y|X) = f(X)

Where f is some function of X. The opposite of heteroscedatiscity is homoscedasticity where the variance is constant i.e.:

Variance(y|X) = σ²…a constant value.

The following figure illustrates a heteroscedastic data set:

Heteroscedastic variance (Image by Author)

The following figure illustrates a homoscedastic data set:

Homoscedastic variance (Image by Author)

A heteroscedastic time series

The following gold price index plot illustrates a heteroscedastic time series. Notice how the variation is larger in later time periods when the index’s value is higher.

Export Price Index (End Use): Non-monetary gold (Source: US FRED) (Image by Author)

The first difference of Gold Price Index brings out the heteroscedastic variance:

Differenced time series (Image by Author)

Causes and forms of Heteroscedasticity

A common form of heteroscedasticity is when the amount of fluctuation is a fraction of the value. This happens commonly with monetary data such as:

Prices (stock prices, prices of goods, medical procedure costs),
Expenditures (household expenditure, employee wages, rents),
Price indexes (the Gold Price Index example illustrated above).

Heteroscedasticity introduced by the measurement process

It’s possible to inadvertently introduce heteroscedasticity while creating a data set. Let’s illustrate this with an example:

Say you are measuring bacterial growth in an antibiotic infused sample. We’ll assume that the bacteria in the sample are growing linearly, like so:

Actual_Bacterial_Count = 100 + 5*Time_Period

Suppose your cell counter has a counting error of ≤ 10%. One can model its operating characteristics as follows:

Observed_Bacterial_Count = 100 + 5*Time_Period + 
ROUND( (-1.0+2*RAND(0,1))*0.1*True_Bacterial_Count,0)

The error introduced by the counter (in MS Excel syntax) is:


ROUND( (-1.0+2*RAND(0,1))*0.1*True_Bacterial_Count,0)

The term ( -1.0+2*RAND(0,1) ) represents a random variable having a Uniform(-1, 1) distribution, which is multiplied by 10% of the true count.

Running the counter for 150 time periods produces the following plot showing the heteroscedasticity introduced by the counter’s error characteristic:

Heteroscedasticity introduced by measurement error (Image by Author)

How to fix the problem:

Log-transform the y variable to ‘dampen down’ some of the heteroscedasticity, then build an OLSR model for log(y).
Use a Generalized Linear Model (GLM) such as the Negative Binomial regression model which does not assume that the data set is homoscedastic. An NB regression model can work especially well if your data is discrete and non-negative.
Use a Weighted Least Squares (WLS) or a Generalized Least Squares (GLS) model — two models that do not assume a homoscedastic variance. The Python statsmodels package supports both models in the statsmodels.api package.

Heteroscedasticity introduced seasonal outliers and by inflation

The following time series plot of liquor sales illustrates a slowly growing variance introduced by outliers in December and January, and also by the effect of monthly price inflation:

Retail sales: Beer, wine and liquor stores. (Source: US FRED) (Image by Author)

How to fix the problem:

The seasonal effect can be neutralized by doing a seasonal adjustment on the data set.
The effect of inflation can be extinguished by performing inflation adjustment a.k.a. deflation of the time series.

Related: How To Adjust For Inflation In Monetary Data Sets

One should always consider these two data transformation steps while dealing with monetary data such as sales volume, costs, price indexes, etc.

Heteroscedasticity introduced by a miss-specified model

Even after performing seasonal adjustment, deflation and log-transformation, if your model is unable to adequately explain the variance in the transformed data set, the unexplained variance will leak into the residual errors of the model, potentially making them heteroscedastic.

How to fix the problem:

Check if important explanatory variables are missing in your model and add them in.
Switch to a GLM, WSS or GLS model
Accept your current model as is. A small amount of heteroscedasticity in the model’s residuals can be tolerated if your model is otherwise performing well.

Practical consequences of heteroscedasticity

If the residual errors of a linear regression model such as the Ordinary Least Square Regression model are heteroscedastic, the OLSR model is no longer efficient, i.e. it is not guaranteed to be the best unbiased linear estimator for your data. It may be possible to construct a different estimator with a better goodness-of-fit.

If your data contains heteroscedastic variance, an OLSR model is likely to either underestimate or overestimate the variance in the population depending on what sort of variance it has seen in the training sample.

This leads to a cascading series of problems: The standard errors of the model’s parameters become incorrect, causing their p-values to go wrong and the confidence intervals to be too narrow or too wide. This could mislead you into believing that certain parameter values are significant while they are actually not significant, and vice versa. The entire model becomes unreliable.

This problem is not restricted to OLSR models. Any model which assumes homoscedastic data or homoscedastic residual errors is vulnerable to these issues.

How to detect heteroscedasticity

Start by plotting the dependent variable against the independent variable(s) or against time and look for signs that the variability in the dependent variable follows some sort of a pattern.

Another approach is to train a suitable model on the data and plot its residuals against the dependent variable, again looking for patterns in the variability.

A third, much better approach is to use one of the following statistical tests for heteroscedasticity:

We’ll soon see how to run the the White test for heteroscedasticity in Python on the gold prices data set.

Meanwhile, let’s look at how these tests work. Most of them use the following standard recipe for detecting heteroscedasticity:

Train a suitable primary regression model on the data.
Next, fit an auxiliary regression model on the square of the residuals of the primary model, with the explanatory variables being those of the primary model, or some combination thereof of these variables. The tests mentioned above use one or more of the following regression expressions for the auxiliary model:
ϵ² = β_0 + β_1*X + γ
ϵ² = β_0 + β_1*X + β_2*X² + γ
ϵ² = β_0 + β_1*√X + γ
ϵ² = β_0 + β_1 / X + γ
ϵ² is the vector of square of the residuals of the primary model,
β_0 is the vector of regression intercepts,
β_1 is the vector of regression coefficients,
X is primary model’s explanatory variables matrix,
γ is the vector of error terms.
For the fitted auxiliary model, calculate a suitable goodness-of-fit statistic such as R² or the F-statistic for regression analysis to see how well the residuals fit the explanatory variables of the primary model.
If the test statistic does not reveal a significant goodness-of-fit, accept the null hypothesis that the residuals are homoscedastic. Otherwise, accept the alternate hypothesis that the residuals are heteroscedastic, which in turn means that either 1) the conditional variance of y of the primary model is heteroscedastic, or 2) our primary model was miss-specified, or 3) both (1) and (2) hold true.

Testing for heteroscedasticity using Python and statsmodels

Let’s run the White test for heteroscedasticity using Python on the gold price index data set (found over here).

Import all the required packages.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Load the data set and plot the dependent variable

Load the data set into a pandas Data Frame and print the first 10 rows:

df = pd.read_csv('monthly_gold_price_index_fred.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
print(df.head(10))

We’ll add a new column to the data frame called Time_Period containing integers from 1 to 132.

df['Time_Period'] = range(1, len(df)+1)
print(df.head(10))

Plot the data:

#Create a new mpyplot figure to plot into
fig = plt.figure()
#Set the title of the plot
fig.suptitle('Export Price Index of Gold')
#Set the X and Y axis labels
plt.xlabel('Time Period')
plt.ylabel('Price Index')
#plot the time series and store the plot in the actual variable. We'll need that later for the legend.
actual, = plt.plot(df['Time_Period'], df['Export_Price_Index_of_Gold'], 'bo-', label='Gold Price Index')
#Set up the legend. There is only one time series in the legend.
plt.legend(handles=[actual])
#Show everything
plt.show()

Here is the plot we get:

Export price index of gold plotted for 132 consecutive monthly time periods from Jan 2001 to Dec 2011 (Image by Author)

The price data appears to be both heteroscedastic and nonlinear.

Take the log-transform of the dependent variable

Taking the log transform of the dependent variable is one of the most commonly used techniques for not only linearizing the dependent variable y but for also dampening down the heteroscedastic variance (if it exists) in y.

Let’s add a new column to the Data Frame called LOG_Export_Price_Index_of_Gold containing. We’ll use numpy.log() to do this.

df['LOG_Export_Price_Index_of_Gold'] = np.log(df['Export_Price_Index_of_Gold'])

A side-by-side comparison of the raw data and the log-transformed data reveals that the log transform has reduced the nonlinearity in the time series:

Comparison of raw and log transformed price index data (Image by Author)

Fit an OLS linear regression model to the log-transformed data set

If you recollect the testing recipe we had outlined earlier, we had called this model our primary model.

The regression expression for our primary model is:

LOG_Export_Price_Index_of_Gold = β_0 + β_1*Time_Period + ϵ

i.e. we are seeking to predict log(Export_Price_Index_of_Gold) using Time_Period.

Import the regression packages:

import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

Form the model expression in patsy syntax. We are telling Patsy that LOG_Export_Price_Index_of_Gold depends on Time_Period. Patsy will auto-include the intercept β_0:

expr = 'LOG_Export_Price_Index_of_Gold ~ Time_Period'

Build and train the OLSR model:

olsr_results = smf.ols(expr, df).fit()

Plot the residual errors (stored in the olsr_results.resid field) against the Time_Period:

#Create a new pyplot figure to plot into
fig = plt.figure()
#Set the title of the plot
fig.suptitle('Residual errors against Time_Period')
#Set the X and Y axis labels
plt.xlabel('Time_Period')
#plot the time series and store the plot in the actual variable.
actual, = plt.plot(df['Time_Period'], olsr_results.resid, 'go-', label='Residual Errors')
plt.ylabel('Residual Errors')
#Set up the legend. There is only one time series in the legend.
plt.legend(handles=[actual])
#Show the plot
plt.show()

Here’s the plot:

Run the White test of heteroscedasticity on the residual errors

The White test uses an auxiliary OLSR model in which the dependent variable is the square of the residuals from the primary model and the explanatory variables are the primary model’s explanatory variables, their squares and their cross products.

In our example, we have only one explanatory variable: Time_Period.

Let’s add two columns to the pandas Data Frame, one for the square of the residuals from the primary model, and the second one for the square of Time_Period. The numpy.power() method is a quick way of doing this.

df['SQ_RESID'] = np.power(olsr_results.resid, 2.0)
df['SQ_Time_Period'] = np.power(df['Time_Period'], 2.0)

Construct the model expression (in patsy syntax) for our auxiliary model:

df['SQ_Time_Period'] = np.power(df['Time_Period'], 2.0)

Build the X and y matrices. Pandas makes this very easy:

y, X = dmatrices(aux_expr, df, return_type='dataframe')

Add a column to X for saving the regression intercept:

X = sm.add_constant(X)

Build and train an OLSR model on the (y, X) data set:

aux_olsr_results = sm.OLS(y, X).fit()

Print the results:

print(aux_olsr_results.summary())

This prints the following output:

Output of the auxiliary OLSR model (Image by Author)

Analyzing the results of the auxiliary model

R-squared: The model has been able to explain only 0.8% of the variance in the squared residuals, indicating a rather poor fit.

F-statistic: The very high p-value of 0.593 makes us accept the null hypothesis of the F-test that the model’s parameter values are not jointly significant. This model is no better than a mean model.

Significance of regression coefficients: The p-value of Time_Period (0.812) and SQ_Time_Period (0.634) are quite high, causing us to accept the null hypothesis of the t-test that both coefficients are not significant i.e. they are basically zero.

All available evidence indicates that the residuals are homoscedastic.

What has happened is that the log-transform has also dampened down the heteroscedasticity in the raw gold price index to a level that is undetectable by the White test.

Running the White test using statsmodels

The Python statsmodels library contains an implementation of the White’s test. Let’s see how it works:

STEP 1: Import the test package.

from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip

The het_white(resid, exog) test in statsmodels takes two parameters:

resid: An array of residuals from your primary regression model. In our case, resid is olsr_results.resid

exog: A matrix (e.g. a numpy array) of explanatory variables X from the primary model. In our case exog is Time_Period + Intercept

STEP 2: Build and train the primary regression model on the data. Recollect that we have already done that and residual errors are available in olsr_results.resid.

STEP 3: Using patsy, pull out the X matrix containing the Time_Period and the intercept columns from the pandas Data Frame:

expr = 'LOG_Export_Price_Index_of_Gold ~ Time_Period'
y, X = dmatrices(expr, df, return_type='dataframe')

STEP 4: Execute the White test:

keys = ['Lagrange Multiplier statistic:', 'LM test\'s p-value:', 'F-statistic:', 'F-test\'s p-value:']
results = het_white(olsr_results.resid, X)
lzip(keys, results)

Here’s the output:

[
('Lagrange Multiplier statistic:', 1.0637754647238826),
 ("LM test's p-value:", 0.5874948891258364),
 ('F-statistic:', 0.5240224217417021),
 ("F-test's p-value:", 0.5933889438779911)
]

LM test: The LM test’s statistic follows the Chi-squared distribution with degrees of freedom = DF of model minus one = (3–1) = 2. It’s p-value (0.587) is very high. So we accept the null hypothesis of the test that there is no heteroscedastisticity in the residual errors.

F-test: The F-test’s statistic follows the F-distribution. Again, the high p-value of 0.593 confirms the null hypothesis of the test that there is no heteroscedastisticity in the residual errors.

Overall, we conclude that the residual errors are homoscedastic.

This agrees with our earlier analysis that the residuals are homoscedastic.

The F-test for Regression Analysis

How to Adjust for Inflation in Monetary Data Sets

Building Robust Linear Models For Nonlinear, Heteroscedastic Data