Testing For Normality of Residual Errors Using Skewness And Kurtosis Measures

And a guide to using the Omnibus K-squared and Jarque–Bera normality tests

We’ll cover the following four topics in this section:

What is normality and why should you care if the residual errors from your trained regression model are normally distributed?
What are Skewness and Kurtosis measures and how to use them for testing for normality of residual errors?
How to use two very commonly used tests of normality, namely the Omnibus K-squared and Jarque–Bera tests that are based on Skewness and Kurtosis.
How to apply these tests to a real-world data set to decide if Ordinary Least Squares regression is the appropriate model for this data set.

What is normality?

Normality means that your data follows the normal distribution. Specifically, each value y_i in Y is a ‘realization’ of some normally distributed random variable N(µ_i, σ_i) as follows:

A normally distributed response variable Y (Image by Author)

Normality in the context of linear regression

While building a linear regression model, one assumes that Y depends on a matrix of regression variables X. This makes Y conditionally normal on X. If X = [x_1, x_2, …, x_n] are jointly normal and f is a linear function, then µ = f(X) is a normally distributed vector, and so is Y, as follows:

In linear regression, Y is conditionally normally distributed on the matrix of regressors X (Image by Author)

Why test for normality?

Several statistical techniques and models assume that the underlying data is normally distributed.

I’ll give below three such situations where normality rears its head:

As seen above, in Ordinary Least Squares (OLS) regression, Y is conditionally normal on the regression variables X in the following manner: Y is normal, if X =[x_1, x_2, …, x_n] are jointly normal. But nothing bad happens to your OLS model, even if Y isn’t normally distributed.
A non-strict requirement of classical linear regression models is that the residual errors of regression ‘ϵ=(y_obs-y_predicted)’ should be normally distributed with an expected value of zero i.e. E(ϵ) = 0. If the residual errors ϵ are not normally distributed, one cannot reliably calculate confidence intervals for the model’s forecasts using the t-distribution, especially for small sample sizes (n ≤ 20).

Bear in mind that even if the errors are not normally distributed, the OLS estimator is still the BLUE i.e. Best Linear Unbiased Estimator for the model as long as E(ϵ)=0, and all other requirements of OLSR are satisfied.

Finally, certain goodness-of-fit techniques such as the F-test for regression analysis assume that the residual errors of the competing regression models are all normally distributed. If the residual errors are not normally distributed, the F-test cannot be reliably used to compare the goodness-of-fit of two competing regression models.

How can I tell if my data is (not) normally distributed?

Several statistical tests are available to test the degree to which your data deviates from normality, and if the deviation is statistically significant.

We’ll look at moment based measures, namely Skewness and Kurtosis, and the statistical tests of significance, namely Omnibus K² and Jarque–Bera, that are based on these measures.

What is ‘Skewness’ and how to use it?

Skewness lets you test by how much the overall shape of a distribution deviates from the shape of the normal distribution.

The following figures illustrate skewed distributions.

Positive and negative skewness (Source: Wikimedia Commons under CC BY-SA 3.0)

The moment based definition of Skewness is as follows:

Skewness is defined as the third standardized central moment of the random variable of the probability distribution.

The formula for skewness of the population is shown below:

Formula for population skewness (Image by Author)
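In text form, the standard expression for the third standardized central moment can be written as follows. The symbol γ_1 is a naming assumption made here; the figure may use different notation. µ and σ are the mean and standard deviation of Y.

```latex
\gamma_1
  = \mathrm{E}\!\left[\left(\frac{Y-\mu}{\sigma}\right)^{3}\right]
  = \frac{\mathrm{E}\!\left[(Y-\mu)^{3}\right]}
         {\left(\mathrm{E}\!\left[(Y-\mu)^{2}\right]\right)^{3/2}}
```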

Skewness has the following properties:

Skewness is a moment based measure (specifically, it’s the third moment), since it uses the expected value of the third power of a random variable.
Skewness is a central moment, because the random variable’s value is centralized by subtracting the mean from it.
Skewness is a standardized moment, as its value is standardized by dividing it by (a power of) the standard deviation.
Because it is the third moment, a probability distribution that is perfectly symmetric around the mean will have zero skewness. This is because, for each y_i that is greater than the mean µ, there will be a corresponding y_j smaller than the mean µ by the same amount. Since the distribution is symmetric around the mean, both values will have the same probability. So the corresponding pairs of (y_i − µ)³ and (y_j − µ)³ terms will cancel out, yielding a total skewness of zero.
Skewness of the normal distribution is zero.
While a symmetric distribution will have a zero skewness, a distribution having zero skewness is not necessarily symmetric.
Certain ratio based distributions — most famously the Cauchy distribution — have an undefined skewness as they have an undefined mean µ.

In practice, we can estimate the skewness in the population by calculating skewness for a sample. For the sample, we cheat a little by assuming that every observed value is equally likely, so the probability of each y_i in the sample is 1/n and the third central sample moment becomes 1/n times a simple summation over all (y_i − y_bar)³.

Formula for sample skewness (Image by Author)
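The sample calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the author’s code: the function name sample_skewness and the synthetic samples are assumptions made for the example, and the result is compared against scipy.stats.skew, which by default uses the same 1/n (biased) moments.

```python
import numpy as np
from scipy import stats

def sample_skewness(y):
    """Biased sample skewness: third central sample moment divided by
    the 3/2 power of the second central sample moment (both use 1/n)."""
    y = np.asarray(y, dtype=float)
    deviations = y - y.mean()
    m2 = np.mean(deviations ** 2)  # second central sample moment
    m3 = np.mean(deviations ** 3)  # third central sample moment
    return m3 / m2 ** 1.5

# Synthetic samples (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)          # population skewness is 0
right_skewed = rng.exponential(size=5000)  # population skewness is 2

print('normal sample     :', sample_skewness(symmetric))
print('exponential sample:', sample_skewness(right_skewed))
print('scipy.stats.skew  :', stats.skew(right_skewed))
```

The symmetric sample should give a value close to zero, while the exponential sample should give a clearly positive value.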

Skewness is very sensitive to the parameters of the probability distribution.

The following figure illustrates the skewness of the Poisson distribution’s Probability Mass Function for various values of the event rate parameter λ:

Skewness of the Poisson(λ) distribution for various event rates (λ) (Image by Author)

Why does skewness of Poisson’s PMF reduce for large event rates? For large values of λ, the shape of the Poisson distribution’s PMF approaches that of the Normal distribution’s PDF with mean and variance = λ. That is, Poisson(λ) → N(λ, λ), as λ → ∞. Therefore, what we are seeing in the above figure is no coincidence.

As λ → ∞, skewness of the Poisson distribution tends to the skewness of the normal distribution, namely 0.
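This limiting behavior can be checked numerically. The sketch below relies on the known closed form for the skewness of Poisson(λ), namely 1/√λ, and on scipy.stats.poisson.stats to obtain the same quantity. The chosen λ values and the name poisson_skewness are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

# Skewness of Poisson(lam) for increasingly large event rates
poisson_skewness = {}
for lam in [1, 4, 25, 100, 10000]:
    # moments='s' asks scipy for the skewness only
    poisson_skewness[lam] = float(stats.poisson.stats(lam, moments='s'))
    print(f'lambda={lam:>6}  skewness={poisson_skewness[lam]:.4f}  '
          f'1/sqrt(lambda)={1 / np.sqrt(lam):.4f}')
```

The two columns should agree, and both shrink toward 0 as λ grows.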

There are other measures of Skewness also, for example:

Pearson’s mode skewness
Pearson’s median skewness
Skewness calculated in terms of the Quartile values
…and a few others.

What is ‘Kurtosis’ and how to use it?

Kurtosis is a measure of how differently the tails of a distribution are shaped as compared to the tails of the normal distribution. While skewness focuses on the overall shape, Kurtosis focuses on the tail shape.

Kurtosis is defined as follows:

Kurtosis is the fourth standardized central moment of the random variable of the probability distribution.

The formula for Kurtosis is as follows:

Formula for population Kurtosis (Image by Author)
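In text form, the standard expression for the fourth standardized central moment can be written as follows. The name Kurt[Y] is a notational assumption made here; the figure may use a different symbol. µ and σ are the mean and standard deviation of Y.

```latex
\mathrm{Kurt}[Y]
  = \mathrm{E}\!\left[\left(\frac{Y-\mu}{\sigma}\right)^{4}\right]
  = \frac{\mathrm{E}\!\left[(Y-\mu)^{4}\right]}
         {\left(\mathrm{E}\!\left[(Y-\mu)^{2}\right]\right)^{2}}
```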

Kurtosis has the following properties:

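Just like Skewness, Kurtosis is a moment based measure, and it is a central, standardized moment.
Because it is the fourth moment, i.e. it is built from an even power that is never negative, Kurtosis is always positive. Excess Kurtosis, described in the last item of this list, can be negative.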
Kurtosis is sensitive to departures from normality on the tails. Because of the 4th power, smaller values of centralized values (y_i-µ) in the above equation are greatly de-emphasized. In other words, values in Y that lie near the center of the distribution are de-emphasized. Conversely, larger values of (y_i-µ), i.e. the ones lying on the two tails of the distribution are greatly emphasized by the 4th power. This property makes Kurtosis largely ignorant about the values lying toward the center of the distribution, and it makes Kurtosis sensitive toward values lying on the distribution’s tails.
Kurtosis of the normal distribution is 3.0. While measuring the departure from normality, Kurtosis is sometimes expressed as excess Kurtosis, which is what remains after subtracting 3.0 from the Kurtosis.

For a sample, excess Kurtosis is estimated by dividing the fourth central sample moment by the fourth power of the sample standard deviation, and subtracting 3.0, as follows:

Formula for sample excess Kurtosis (Image by Author)
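The sample calculation described above can be sketched as follows. The function name sample_excess_kurtosis and the synthetic samples are assumptions made for this example. The result is compared against scipy.stats.kurtosis, which by default also returns the excess (Fisher) value computed from 1/n moments.

```python
import numpy as np
from scipy import stats

def sample_excess_kurtosis(y):
    """Fourth central sample moment divided by the fourth power of the
    (1/n) sample standard deviation, minus 3.0."""
    y = np.asarray(y, dtype=float)
    deviations = y - y.mean()
    m2 = np.mean(deviations ** 2)  # second central sample moment
    m4 = np.mean(deviations ** 4)  # fourth central sample moment
    return m4 / m2 ** 2 - 3.0

# Synthetic samples (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
thin_tailed = rng.uniform(size=5000)    # population excess Kurtosis is -1.2
normal_like = rng.normal(size=5000)     # population excess Kurtosis is 0
heavy_tailed = rng.laplace(size=5000)   # population excess Kurtosis is 3

for label, sample in [('uniform', thin_tailed),
                      ('normal', normal_like),
                      ('laplace', heavy_tailed)]:
    print(label, sample_excess_kurtosis(sample), stats.kurtosis(sample))
```

A negative value indicates thinner tails than the normal distribution, and a positive value indicates thicker tails.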

Here is an excellent image from Wikimedia Commons that shows the Excess Kurtosis of various distributions. I have superimposed a magnified version of the tails in the top left side of the image:

Excess Kurtosis of various distributions (Source: Wikimedia Commons under CC0)

Normality tests based on Skewness and Kurtosis

While Skewness and Kurtosis quantify the amount of departure from normality, one would want to know if the departure is statistically significant. The following two tests let us do just that:

The Omnibus K-squared test
The Jarque–Bera test

In both tests, we start with the following hypotheses:

Null hypothesis (H_0): The data is normally distributed.
Alternate hypothesis (H_1): The data is not normally distributed, in other words, the departure from normality, as measured by the test statistic, is statistically significant.

Omnibus K-squared normality test

The Omnibus test combines the random variables for Skewness and Kurtosis into a single test statistic as follows:

Formula for the Omnibus K-squared test statistic (Image by Author)

Probability distribution of the test statistic:
In the above formula, the functions Z1() and Z2() are meant to transform the random variables g1 and g2 so that the transformed values are approximately normally distributed. This in turn makes their sum of squares approximately Chi-squared(2) distributed, thereby making the Omnibus K-squared statistic approximately Chi-squared(2) distributed under the assumption that the null hypothesis is true, i.e. the data is normally distributed.
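The sum-of-squares construction can be illustrated with SciPy, whose skewtest and kurtosistest functions return the transformed z-scores for skewness and Kurtosis, and whose normaltest function combines them. This is a sketch on synthetic data; the variable names and the simulated residuals are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for regression residuals (hypothetical data)
rng = np.random.default_rng(42)
residuals = rng.normal(size=200)

# z-scores of the transformed skewness and Kurtosis
z1, _ = stats.skewtest(residuals)
z2, _ = stats.kurtosistest(residuals)

# Omnibus statistic: sum of squares of the two z-scores
k_squared = z1 ** 2 + z2 ** 2
# Upper-tail probability under Chi-squared with 2 degrees of freedom
p_value = stats.chi2.sf(k_squared, df=2)

# SciPy's combined test should give the same numbers
omnibus_stat, omnibus_p = stats.normaltest(residuals)
print(k_squared, p_value)
print(omnibus_stat, omnibus_p)
```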

Jarque–Bera normality test

The test statistic for this test is as follows:

Formula for the Jarque-Bera test statistic (Image by Author)

Probability distribution of the test statistic:
The test statistic is the scaled sum of squares of random variables g1 and g2 that are each approximately normally distributed, thereby making the JB test statistic approximately Chi-squared(2) distributed, under the assumption that the null hypothesis is true.
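As a rough sketch, the JB statistic can be computed by hand in its commonly stated form, JB = (n/6)·(S² + EK²/4), where S is the sample skewness and EK is the sample excess Kurtosis. The simulated residuals and variable names below are assumptions for illustration, and the result is checked against scipy.stats.jarque_bera.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for regression residuals (hypothetical data)
rng = np.random.default_rng(7)
residuals = rng.normal(size=200)

n = residuals.size
skewness = stats.skew(residuals)             # biased (1/n) sample skewness
excess_kurtosis = stats.kurtosis(residuals)  # biased sample Kurtosis minus 3

# Scaled sum of squares of skewness and excess Kurtosis
jb_stat = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
# Upper-tail probability under Chi-squared with 2 degrees of freedom
jb_p = stats.chi2.sf(jb_stat, df=2)

# Reference values from SciPy
jb_ref, p_ref = stats.jarque_bera(residuals)
print(jb_stat, jb_p)
print(jb_ref, p_ref)
```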

Example

We’ll use the following data set from the U.S. Bureau of Labor Statistics, to illustrate the application of normality tests:

Source: Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics under US BLS Copyright Terms (Image by Author)

Here are the first few rows of the data set:

You can download the data from this link.

Let’s fit the following OLS regression model to this data set:

Wages = β_0 + β_1*Year+ ϵ

Where:

Wages is the response a.k.a. dependent variable,
Year is the regression a.k.a. explanatory variable,
β_0 is the intercept of regression,
β_1 is the coefficient of regression, and
ϵ is the unexplained regression error

We’ll use Python libraries pandas and statsmodels to read the data, and to build and train our OLS model for this data.

Let’s start with importing the required packages:

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

Read the data into the pandas data frame:

df = pd.read_csv('wages_and_salaries_1984_2019_us_bls_CXU900000LB1203M.csv', header=0)

Plot Wages against Year:

fig = plt.figure()
plt.xlabel('Year')
plt.ylabel('Wages and Salaries (USD)')
fig.suptitle('Wages and Salaries before taxes. All US workers')
wages, = plt.plot(df['Year'], df['Wages'], 'go-', label='Wages and Salaries')
plt.legend(handles=[wages])
plt.show()

Create the regression expression in Patsy syntax. In the following expression, we are telling statsmodels that Wages is the response variable and Year is the regression variable. statsmodels will automatically add an intercept to the regression equation.

expr = 'Wages ~ Year'

Configure the OLS regression model by passing the model expression, and train the model on the data set, all in one step:

olsr_results = smf.ols(expr, df).fit()

Print the model summary:

print(olsr_results.summary())

In the following output, I have called out the areas that bode well and bode badly for our OLS model’s suitability for the data:

Summary of OLS regression model (Image by Author)

Interpreting the results

Following are a few things to note from the results:

The residual errors are positively skewed with a skewness of 0.268. Their Kurtosis is 2.312, which is below the normal distribution’s value of 3.0. That corresponds to an excess Kurtosis of about −0.688, i.e. slightly thinner tails than the normal distribution.
The Omnibus test and the JB test have both produced test-statistics (1.219 and 1.109 respectively), which lie within the H_0 acceptance zone of the Chi-squared(2) PDF (see figure below). Thus we fail to reject the hypothesis H_0, i.e. the data gives us no evidence that the residuals depart from normality.

Acceptance and rejection zones for the Null hypothesis in the Chi-squared(2) PDF for two-tailed α=0.05 (Image by Author)

You can also get the values of Skewness, Kurtosis (not excess Kurtosis), and the test statistics for Omnibus and JB tests as follows:

name = ['Omnibus K-squared test', 'Chi-squared(2) p-value']
#Pass the residual errors of the regression into the test
test = sms.omni_normtest(olsr_results.resid)
print(lzip(name, test))

This prints out the following:

[('Omnibus K-squared test', 1.2194658631806088), ('Chi-squared(2) p-value', 0.5434960003061313)]

name = ['Jarque-Bera test', 'Chi-squared(2) p-value', 'Skewness', 'Kurtosis']
test = sms.jarque_bera(olsr_results.resid)
print(lzip(name, test))

This prints out the following:

[('Jarque-Bera test', 1.109353094606092), ('Chi-squared(2) p-value', 0.5742579764509973), ('Skewness', 0.26780140709870015), ('Kurtosis', 2.3116476989966737)]

Since the residuals seem to be normally distributed, we can also trust the 95% confidence intervals reported by the model for the two model params.
We can also trust the p-value of the F-test. It’s exceedingly tiny, indicating that both model params are also jointly significant.
Finally, the R-squared reported by the model is quite high indicating that the model has fitted the data well.
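The two p-values reported for the normality tests can be reproduced from the test statistics alone, since both statistics are compared against the Chi-squared(2) distribution. The sketch below assumes an upper-tail rejection region at α = 0.05, which is the usual convention for these tests and may differ from how the figure above partitions α. The statistics are copied from the printed output.

```python
from scipy import stats

alpha = 0.05
# Upper-tail critical value of Chi-squared(2), roughly 5.99
critical_value = stats.chi2.ppf(1 - alpha, df=2)

# Test statistics copied from the printed output
omnibus_stat = 1.2194658631806088
jb_stat = 1.109353094606092

# Upper-tail probabilities, roughly 0.5435 and 0.5743
omnibus_p = stats.chi2.sf(omnibus_stat, df=2)
jb_p = stats.chi2.sf(jb_stat, df=2)

print('critical value:', critical_value)
print('Omnibus p-value:', omnibus_p, 'reject H_0:', omnibus_stat > critical_value)
print('JB p-value:', jb_p, 'reject H_0:', jb_stat > critical_value)
```

Both statistics fall well below the critical value, which is consistent with not rejecting H_0.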

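Now for the bad part: the Durbin-Watson statistic reported in the summary indicates auto-correlation in the residuals, particularly at lag 1. The Condition number is a separate warning sign: it relates to multicollinearity or poor numerical scaling of the regression variables, not to auto-correlation of the residuals.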

We can easily confirm this via the ACF plot of the residuals:

plot_acf(olsr_results.resid, title='ACF of residual errors')
plt.show()

ACF plot of residual errors (Image by Author)
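The Durbin-Watson statistic can also be computed directly using the durbin_watson function in statsmodels.stats.stattools. The sketch below uses synthetic series rather than the wages data: white noise, and an AR(1) series with an assumed coefficient of 0.8, to show how the statistic behaves. Values near 2 suggest no lag-1 auto-correlation, while values toward 0 suggest positive lag-1 auto-correlation.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Synthetic series (hypothetical data, for illustration only)
rng = np.random.default_rng(1)
n = 500
white_noise = rng.normal(size=n)

# AR(1) series: each value carries over 80% of the previous value
rho = 0.8
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + white_noise[t]

dw_white = durbin_watson(white_noise)
dw_ar1 = durbin_watson(ar1)

# The statistic is the sum of squared successive differences
# divided by the sum of squared values
dw_manual = np.sum(np.diff(ar1) ** 2) / np.sum(ar1 ** 2)

print('white noise:', dw_white)
print('AR(1) series:', dw_ar1, dw_manual)
```

On the fitted model from this section, the same call would be durbin_watson(olsr_results.resid).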

This presents a problem for us: One of the fundamental requirements of Classical Linear Regression Models is that the residual errors should not be auto-correlated. In this case they most certainly are. This means that the OLS estimator may have under-estimated the variance of the errors, which in turn means that its standard errors, and the confidence intervals built on them, may be off by a large amount.

Simply put, the OLS estimator is no longer BLUE (Best Linear Unbiased Estimator) for the model: it is still unbiased, but it is no longer the most efficient linear estimator. Bummer!

The auto-correlation of residual errors points to a possibility that our model was incorrectly chosen, or incorrectly configured. Particularly,

We may have left out some key explanatory variables which is causing some signal to leak into the residuals in the form of auto-correlations, OR,
The choice of the OLS model itself may be entirely wrong for this data set. We may need to look at alternate models such as the Regression with ARIMA Errors model which we had covered in an earlier section.

In such cases, your choice is between accepting the sub-optimality of the chosen model, and addressing the above two causes of sub-optimality.

Summary

Several statistical procedures assume that the underlying data follows the normal distribution.
Skewness and Kurtosis are two moment based measures that will help you to quickly calculate the degree of departure from normality.
In addition to using Skewness and Kurtosis, you should use the Omnibus K-squared and Jarque-Bera tests to determine whether the amount of departure from normality is statistically significant.
In some cases, if the data (or the residuals) are not normally distributed, your model will be sub-optimal.

References, Citations and Copyrights

Data links

Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics under US BLS Copyright Terms. Curated data set link for download

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
