The Assumptions Of Linear Regression, And How To Test Them - Statistical Modeling and Forecasting

Linear Regression is the bicycle of regression models. It’s simple yet incredibly useful. It can be used in a variety of domains. It has a nice closed formed solution, which makes model training a super-fast non-iterative process.

A Linear Regression model’s performance characteristics are well understood and backed by decades of rigorous research. The model’s predictions are easy to understand, easy to explain and easy to defend.

If there is only one regression model that you have time to learn inside-out, it should be the Linear Regression model.

If your data satisfies the assumptions that the Linear Regression model, specifically the Ordinary Least Squares Regression (OLSR) model makes, in most cases you need look no further.

Which brings us to the following four assumptions that the OLSR model makes:

Linear functional form: The response variable y should be a linearly related to the explanatory variables X.
Residual errors should be i.i.d.: After fitting the model on the training data set, the residual errors of the model should be independent and identically distributed random variables.
Residual errors should be normally distributed: The residual errors should be normally distributed.
Residual errors should be homoscedastic: The residual errors should have constant variance.

Let’s look at the four assumptions in detail and how to test them.

Assumption 1: Linear functional form

Linearity requires little explanation. After all, if you have chosen to do Linear Regression, you are assuming that the underlying data exhibits linear relationships, specifically the following linear relationship:

y = β*X + ϵ

Where y is the dependent variable vector, X is the matrix of explanatory variables which includes the intercept, β is the vector of regression coefficients and ϵ is the vector of error terms i.e. the portion of y that X is unable to explain.

How to test the linearity assumption using Python

This can be done in two ways:

An easy way is to plot y against each explanatory variable x_j and visually inspect the scatter plot for signs of non-linearity.
One could also use the DataFrame.corr() method in Pandas to get the Pearson’s correlation coefficient ‘r’ between the response variable y and each explanatory variable x_j to get a quantitative feel for the degree of linear correlation.

Note that Pearson’s ‘r’ should be used only when the the relation between y and X is known to be linear.

Let’s test the linearity assumption on the following data set of 9568 observations of 4 operating parameters of a combined cycle power plant taken over 6 years:

Performance data from a combined cycle power plant. Data source: UCI Machine Learning Repository

The explanatory variables x_j are as the following 4 power plant parameters:

Ambient_Temp in Celsius
Exhaust_Volume in column height of Mercury in centimeters
Ambient_Pressure in millibars of Mercury
Relative_Humidity expressed as a percentage

The response variable y is Power_Output of the power plant in MW.

Let’s load the data set into a Pandas DataFrame.

import pandas as pd
from patsy import dmatrices
from matplotlib import pyplot as plt
import numpy as np
df = pd.read_csv('power_plant_output.csv', header=0)

Plot the scatter plots of each explanatory variable against the response variable Power_Output.

df.plot.scatter(x='Ambient_Temp', y='Power_Output')
plt.xlabel('Ambient_Temp', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()
df.plot.scatter(x='Exhaust_Volume', y='Power_Output')
plt.xlabel('Exhaust_Volume', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()
df.plot.scatter(x='Ambient_Pressure', y='Power_Output')
plt.xlabel('Ambient_Pressure', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()
df.plot.scatter(x='Relative_Humidity', y='Power_Output')
plt.xlabel('Relative_Humidity', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()

Here is a collage of the four plots:

Scatter plots of Power_Output against each explanatory variable (Image by Author)

You can see that Ambient_Temp and Exhaust_Volume seem to be most linearly related to the power plant’s Power_Output, followed by Ambient_Pressure and Relative_Humidity in that order.

Let’s also print out the Pearson’s ‘r’:

df.corr()['Power_Output']

We get the following output, which backs up our visual intuition:

Ambient_Temp        -0.948128
Exhaust_Volume      -0.869780
Ambient_Pressure     0.518429
Relative_Humidity    0.389794
Power_Output         1.000000
Name: Power_Output, dtype: float64

Related read: The Intuition Behind Correlation, for an in-depth explanation of the Pearson’s correlation coefficient.

Assumption 2: i.i.d. residual errors

The second assumption that one makes while fitting OLSR models is that the residual errors left over from fitting the model to the data are independent, identically distributed random variables.

We break this assumption into three parts:

The residual errors are random variables,
They are independent random variables, and
Their probability distributions are identical.

Why are residual errors random variables?

After we train a Linear Regression model on a data set, if we run the training data through the same model, the model will generate predictions. Let’s call them y_pred. For each predicted value y_pred in the vector y_pred, there is a corresponding actual value y from the response variable vector y. The difference (y — y_pred) is the residual error ‘ε’. There are as many of these ε as the number of rows in the training set and together they form the residual errors vector ε.

Each residual error ε is a random variable. To understand why, recollect that our training set (y_train, X_train) is just a sample of n values drawn from some very large population of values.

If we had drawn a different sample (y_train’, X_train’) from the same population, the model would have fitted somewhat differently on this second sample, thereby producing a different set of predictions y_pred’, and therefore a different set of residual errors ε = (y’ — y_pred’).

A third training sample drawn from the population would have, after training the model on it, generated a third set of residual errors ε = (y’’ — y_pred’’), and so on.

One can now see how each residual error in the vector ε can take a random value from as many set of values as the number of sample training data sets one is willing to train the model on, thereby making each residual error ε a random variable.

Why do residual errors need to be independent?

Two random variables are independent if the probability of one of them taking up some value doesn’t depend on what value the other variable has taken. When you roll a die twice, the probability of its coming up as one, two,…,six in the second throw does not depend on the value it came up on the first throw. So the two throws are independent random variables that can each take a value of 1 thru 6 independent of the other throw.

In the context of regression, we have seen why the residual errors of the regression model are random variables. If the residual errors are not independent, they will likely demonstrate some sort of a pattern (which is not always obvious to the naked eye). There is information in this pattern that the regression model wasn’t able to capture during its training on the training set, thereby making the model sub-optimal.

If the residual errors aren’t independent, it may mean a number of things:

One or more important explanatory variables are missing from your model. The effect of the missing variables is showing through as a pattern in the residual errors.
The linear model you have built is just the wrong kind of model for the data set. For e.g. if the data set shows obvious non-linearity and you try to fit a linear regression model on such a data set, the nonlinear relationships between y and X will show through in the residual errors of regression in the form of a distinct pattern.
A third interesting cause of non-independence of residual errors is what’s known as multicolinearity which means that the explanatory variables are themselves linearly related to each other. Multicolinearity causes the model’s coefficients to become unstable, i.e. they will swing wildly from one training run to next when trained on different training sets. This can make the model’s overall goodness-of-fit statistics questionable. Another serious effect of multicoliearity, especially extreme multicolinearity, is that the model’s least squares solver may throw up infinities during the model fitting process thereby making it impossible to fit the model on the training data.

How to test for independence of residual errors?

It’s not easy to verify independence. But sometimes one can detect patterns in the plot of residual errors versus the predicted values or the plot of residual errors versus actual values.

Another common technique is to use the Dubin-Watson test which measures the degree of correlation of each residual error with the ‘previous’ residual error. This is known as lag-1 auto-correlation and it is a useful technique to find out if residual errors of a time series regression model are independent.

Let’s fit a linear regression model to the Power Plant data and inspect the residual errors of regression.

We’ll start by creating the model expression using the Patsy library as follows:

model_expr = 'Power_Output ~ Ambient_Temp + Exhaust_Volume + Ambient_Pressure + Relative_Humidity'

In the above model expression, we are telling Patsy that Power_Output is the response variable while Ambient_Temp, Exhaust_Volume, Ambient_Pressure and Relative_Humidity are the explanatory variables. Patsy will add the regression intercept by default.

We’ll use patsy to carve out the y and X matrices as follows:

y, X = dmatrices(model_expr, df, return_type='dataframe')

Let’s also carve out the train and test data sets. The training data set will be 80% of the size of the overall (y, X) and the rest will be the testing data set:

mask = np.random.rand(len(X)) < 0.8
X_train = X[mask]
y_train = y[mask]
X_test = X[~mask]
y_test = y[~mask]

Finally, build and train an Ordinary Least Squares Regression Model on the training data and print the model summary:

olsr_results = linear_model.OLS(y_train, X_train).fit()
print('Training completed')
print(olsr_results.summary())

We get the following output:

OLS Regression output printed out by statsmodels (Image by Author)

Next, let’s get the predictions of the model on test data set and get its predictions:

olsr_predictions = olsr_results.get_prediction(X_test)

olsr_predictions is of type statsmodels.regression._prediction.PredictionResult and the predictions can obtained from the PredictionResult.summary_frame() method:

prediction_summary_frame = olsr_predictions.summary_frame()
print(prediction_summary_frame)

The model’s predictions and the 95% lower and upper confidence levels (Image by Author)

Let’s calculate the residual errors of regression ε = (y_test — y_pred):

resid = y_test['Power_Output'] - prediction_summary_frame['mean']

Finally, let’s plot resid against the predicted value y_pred=prediction_summary_frame[‘mean’]:

plt.xlabel('Predicted Power Output', fontsize=18)
plt.ylabel('Residual Error of Regression', fontsize=18)
plt.scatter(y_test['Power_Output'], resid)
plt.show()

We get the following plot:

Residual errors plotted against the model’s predictions (Image by Author)

One can see that the residuals are more or less pattern-less for smaller values of Power Output, but they seem to be showing a linear pattern at the higher end of the Power Output scale. It indicates that the model’s predictions at the higher end of the power output scale are less reliable than at the lower end of the scale.

Why should residual errors have Identical distributions?

What identically distributed means is that residual error ε_i corresponding to the prediction for each data row, has the same probability distribution. If the distribution of errors is not identical, one cannot reliably use tests of significance such as the F-test for regression analysis or perform confidence interval testing on the predictions. Many of these tests depend on the residual errors being identically, and normally distributed. This brings us to the next assumption.

Assumption 3: Residual errors should be normally distributed

In the previous section, we saw how and why the residual errors of the regression are assumed to be independent, identically distributed (i.i.d.) random variables. Assumption 3 imposes an additional constraint. The errors should all have a normal distribution with a mean of zero. In statistical language:

∀ i ∈ n, ε_i ~ N(0, σ²)

This notation is read as follows:

For all i in the data set of length n rows, the ith residual error of regression is a random variable that is normally distributed (that’s why the N() notation). This distribution has a mean of zero and a variance of σ². Furthermore, all ε_i have the same variance σ², i.e. they are identically distributed.

It is a common misconception that linear regression models require the explanatory variables and the response variable to be normally distributed.

More often than not, x_j and y will not even be identically distributed, leave alone normally distributed.

In Linear Regression, Normality is required only from the residual errors of the regression.

In fact, normality of residual errors is not even strictly required. Nothing will go horribly wrong with your regression model if the residual errors ate not normally distributed. Normality is only a desirable property.

What’s normally is telling you is that most of the prediction errors from your model are zero or close to zero and large errors are much less frequent than the small errors.

What happens if the residual errors are not N(0, σ²) distributed?

If the residual errors of regression are not N(0, σ²), then statistical tests of significance that depend on the errors having an N(0, σ²) distribution, simply stop working.

For example,

The F-statistic used by the F-test for regression analysis has the required Chi-squared distribution only if the regression errors are N(0, σ²) distributed. If regression errors are not normally distributed, the F-test cannot be used to determine if the model’s regression coefficients are jointly significant. You will then have to use some other test to figure out if your regression model did a better job than a straight line through the data set mean.
Similarly, the computation of t-values and confidence intervals assumes that regression errors are N(0, σ²) distributed. If the regression errors are not normally distributed, t-values for the model’s coefficients and the model’s predictions become inaccurate and you should not put too much faith into the confidence intervals for the coefficients or the predictions.

A special case of non-normality: bi-modally distributed residual errors

Sometimes, one finds that the model’s residual errors have a bimodal distribution i.e. they have two peaks. This may point to a badly specified model or a crucial explanatory variable that is missing from the model.

For example, consider the following situation:

Your dependent variable is a binary variable such as Won (encoded as 1.0) or Lost (encoded as 0.0). But your linear regression model is going to generate predictions on the continuous real number scale. If the model generates most of its predictions along a narrow range of this scale around 0.5, for e.g. 0.55, 0.58, 0.6, 0.61, etc. the regression errors will peak either on one side of zero (when the true value is 0), or on the other side of zero (when the true value is 1). This is a sign that your model is not able to decide whether the output should be 1 or 0, so it’s predicting a value that is around the average of 1 and 0.

This can happen if you are missing a key binary variable, known as an indicator variable, which influences the output value in the following way:

When the variable’s value is 0, the output ranges within a certain range, say close to 0.

When the variable’s value is 1, the output takes on a whole new range of values that are not there in the earlier range, say around 1.0.

If this variable is missing in your model, the predicted value will average out between the two ranges, leading to two peaks in the regression errors. Once this variable is added, the model is well specified, and it will correctly differentiate between the two possible ranges of the explanatory variable.

Related read: Dealing with Multi-modality of Residual Errors

How to test for normality of residual errors?

There are number of tests of normality available. The easiest way to check for normality is to measure the Skewness and the Kurtosis of the distribution of residual errors.

The Skewness of a perfectly normal distribution is 0 and its kurtosis is 3.0.

Any departures, positive or negative from these values indicates a departure from normality. It is of course impossible to get a perfectly normal distribution. Some departure from normality is expected. But how much is a ‘little’ departure? How to judge if the departure is significant?

Whether the departure is significant is answered by statistical tests of normality such as the Jarque Bera Test and the Omnibus Test. A p-value of ≤ 0.05 on these tests indicates that the distribution is normal at a confidence level of ≥ 95%.

Let’s run the Jarque-Bera normality test on the linear regression model that we have trained on the Power Plant data set. Recollect that the residual errors were stored in the variable resid and they were obtained by running the model on the test data and by subtracting the predicted value y_pred from the observed value y_test.

from statsmodels.compat import lzip
import statsmodels.stats.api as sms
name = ['Jarque-Bera test', 'Chi-squared(2) p-value', 'Skewness', 'Kurtosis']
#run the Jarque-Bera test for Normality on the residuals vector
test = sms.jarque_bera(resid)
#print out the test results. This will also print the Skewness and Kurtosis of the resid vector
lzip(name, test)

This prints out the following:

[('Jarque-Bera test', 1863.1641805048084), ('Chi-squared(2) p-value', 0.0), ('Skewness', -0.22883430693578996), ('Kurtosis', 5.37590904238288)]

The skewness of the residual errors is -0.23 and their Kurtosis is 5.38. The Jarque-Bera test has yielded a p-value that is < 0.01 and thus it has judged them to be respectively different than 0.0 and 3.0 at a greater than 99% confidence level thereby implying that the residuals of the linear regression model are for all practical purposes not normally distributed.

Let’s plot the frequency distribution of the residual errors:

resid.hist(bins=50)
plt.show()

We get the following histogram showing us that the residual errors do seem to be normally distributed (but the JB has shown that they are in fact not so):

Frequency distribution of residual errors (Image by Author)

Related read: Testing For Normality of Residual Errors Using Skewness And Kurtosis Measures, for an in-depth explanation of Normality and statistical tests of normality.

Related read: Dealing with Multi-modality of Residual Errors

Assumption 4: Residual errors should be homoscedastic

In the previous section we saw why the residual errors should be N(0, σ²) distributed, i.e. normally distributed with mean zero and variance σ². In this section we impose an additional constraint on them: the variance σ² should be constant. Particularly, σ² should not be a function of the response variable y, and thereby indirectly the explanatory variables X.

The property of a data set to have constant variance is called homoscedasticity. And it’s opposite, where the variance is a function of explanatory variables X is called heteroscedasticity.

Here is an illustration of a data set showing homoscedastic variance:

Example of homoscedastic variance (Image by Author)

And here’s one that displays a heteroscedastic variance:

Example of heteroscedastic variance (Image by Author)

While talking about homoscedastistic or heteroscedastic variances, we always consider the conditional variance: Var(y|X=x_i), or Var(ε|X=x_i). This is read as variance of y or variance of residual errors ε for a certain value of X=x_i.

Related read: Conditional Probability, Conditional Expectation and Conditional Variance: Conditional expectation, conditional probability & conditional variance: practical insights for regression modelers

Why do we want the residual errors to be homoscedastic?

The immediate consequence of residual errors having a variance that is a function of y (and so X) is that the residual errors are no longer identically distributed. The variance of ε for each X=x_i will be different, thereby leading to non-identical probability distributions for each ε_i in ε.

We have seen that if the residual errors are not identically distributed, we cannot use tests of significance such as the F-test for regression analysis or perform confidence interval checking on the regression model’s coefficients or the model’s predictions. Many of these tests depend on the residual errors being independent, identically distributed random variables.

What can cause residual errors to be heteroscedastic?

Heteroscedastic errors frequently occur when a linear model is fitted to data in which the fluctuation in the response variable y is some function of the current value y, for e.g. it is a percentage of the current value of y. Such data sets commonly occur in the monetary domain. An example is where the absolute amount of variation in a company’s stock price is proportional to the current stock price. Another example is of seasonal variations in the sales of some product being proportional to the sales level.

Heteroscedasticity can also be introduced by errors in the data gathering process. For example, if the measuring instrument introduces a noise in the measured value that is proportional to the measured value, the measurements will contain heteroscedastic variance.

Another reason heteroscedasticity is introduced in the model’s errors is by simply using the wrong kind of model for the data set or by leaving out important explanatory variables.

How to fix heteroscedasticity in the model’s residual errors?

There are three main approaches to dealing with heteroscedastic errors:

Transform the dependent variable so as to linearize it and dampen down the heteroscedastic variance. Commonly used transforms are log(y) and square-root(y).
Identify important variables that may be missing from the model, and which are causing the variance in the errors to develop a pattern, and add those variables into the model. Alternately, stop using the linear model and switch to a completely different model such as a Generalized Linear Model, or a neural net model.
Simply accept the heteroscedasticity present in the residual errors.

How to detect heteroscedasticity in the residual errors?

There are several tests of homoscedasticity available. Here are a few:

Testing for heteroscedastic variance using Python

Let’s test the model’s residual errors for heteroscedastic variance by using the White test. We’ll use the errors from the linear model we built earlier for predicting the power plant’s output.

The White test for heteroscedasticity uses the following line of reasoning to detect heteroscedatsicity:

If the residual errors ε are heteroscedastic, their variance can be ‘explained’ by y (and therefore by a combination of the model’s explanatory variables X and their squares (X²) and cross-products (X X X).
Therefore, when an auxiliary linear model is fitted on the errors ε and (X, X², X x X), it is expected that the aux linear model will be able to explain at least some of the relationship that is assumed to be present between errors ε and X.
If we run the F-test for regression on the aux-model, and the F-test returns a p-value that is ≤ 0.05, it will lead us to accept the F-test’s alternate hypothesis that the auxiliary model’s coefficients are jointly significant. Hence the fitted aux model is indeed able to capture a meaningful relationship between the residual errors ε of the primary model and the model’s explanatory variables X. This leads us to conclude that the residual errors of the primary model ε are heteroscedastic.
On the other hand, if the F-test returns a p-value that is ≥ 0.05, then we accept the F-test’s null hypothesis that there is no meaningful relationship between the residual errors ε of the primary model and the model’s explanatory variables X. Thus, the residual errors of the primary model ε are homoscedastic.

Let’s run the White test on the residual errors that we got earlier from running the fitted Power Plant Output model on the test data set. These residual errors are stored in the variable resid.

from statsmodels.stats.diagnostic import het_white
keys = ['Lagrange Multiplier statistic:', 'LM test\'s p-value:', 'F-statistic:', 'F-test\'s p-value:']
#run the White test
results = het_white(resid, X_test)
#print the results. We will get to see the values of two test-statistics and the corresponding p-values
lzip(keys, results)

We see the following out:

[('Lagrange Multiplier statistic:', 33.898672268600926), ("LM test's p-value:", 2.4941917488321856e-06), ('F-statistic:', 6.879489454587562), ("F-test's p-value:", 2.2534296887344e-06)]

You can see that the F-test for regression has returned a p-value of 2.25e-06 which is much smaller than even 0.01.

So with 99% confidence, we can say that the auxiliary model used by the White test was able to explain a meaningful relationship between the residual errors residof the primary model and the primary model’s explanatory variables (in this case X_test).

So we reject the null hypothesis of the F-test that the residuals errors of the Power Plant Output model are homoscedastic and accept the alternate hypothesis that the residual errors of the model are heteroscedastic.

Recollect that we had seen the following linear pattern of sorts in the plot of residual errors versus the predicted value y_pred:

Residual errors versus predicted values (Image by Author)

From this plot, we should have expected the residual errors of our linear model to be heteroscedastic. The White test just confirmed this expectation!

Related Read:Introduction to Heteroscedasticity

Further reading:Robust Linear Regression Models for Nonlinear, Heteroscedastic Data: A step-by-step tutorial in Python

Summary

The Ordinary Least Squares regression model (a.k.a. the linear regression model) is a simple and powerful model that can be used on many real world data sets.

The OLSR model is based on strong theoretical foundations. It’s predictions are explainable and defensible.

To get the most out of an OLSR model, we need to make and verify the following four assumptions:

The response variable y should be linearly related to the explanatory variables X.
The residual errors of regression should be independent, identically distributed random variables.
The residual errors should be normally distributed.
The residual errors should have constant variance, i.e. they should be homoscedastic.

The Intuition behind Correlation

The F-Test for Regres sion Analysis

Dealing with Multi-modality of Residual Errors

Introduction to Heteroscedasticity

Building Robust Linear Models For Nonlinear, Heteroscedastic Data

Citations and Copyrights

Data set

Combined Cycle Power Plant Data Set: downloaded from UCI Machine Learning Repository is used under the following citation requests:

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126–140, ISSN 0142–0615, [Web Link],
([Web Link])
Heysem Kaya, Pınar Tüfekci , Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13–18 (Mar. 2012, Dubai

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: A Guide To Exogenous And Endogenous Variables

NEXT: An Overview Of The Variance-Covariance Matrices Used In Linear Regression

UP: Table of Contents