When the frequency distribution of residual errors has multiple peaks, especially two peaks, what does it mean, and how to deal with that situation.
A raw residual is the difference between the actual value and the value predicted by a trained regression model.
The frequency distribution plot of residuals can provide a good feel for whether the model is correctly specified, i.e. whether it is the right kind of model for the data set, and whether all the important regression variables have been considered, and whether the model has fitted the data in an unbiased manner.
What you usually want to see is a normally distributed plot of residuals that is centered around zero like the one shown below.
A normally distributed frequency plot of residuals is one sign of a well-chosen, well-fitted model.
But residual plots are often skewed, or they have fat tails or thin tails, and sometimes they are not centered at zero. There are ways to address these problems. And sometimes one has to simply accept some degree of non-normality.
We’ll wee what to do when your model’s residuals turn out to be bimodal, i.e. they have two peaks instead of one, like so:
The data set
We’ll use a data set of daily usage of rental bicycles spanning two years. Here are the first 10 rows of the dataset:
The variables in the data set are as follows:
- Instant: The row index
- dteday: The day on which the measurement was taken in dd-MM-yy format
- season: the prevailing weather season
- yr: the prevailing year: 0=2011, 1=2012
- mnth: the prevailing month: 1 thru 12
- holiday: Whether the measurement was taken on a holiday (yes=1, no=0)
- weekday: day of the week (0 thru 6)
- workingday: Whether the measurement was taken on a working day (yes=1, no=0)
- weathersit: The weather situation on the day: 1=Clear, Few clouds, Partly cloudy, Partly cloudy. 2=Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist. 3=Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds. 4=Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog.
- temp: Temperature, normalized to 39C
- atemp: Real feel, normalized to 50C
- hum: Humidity, normalized to 100
- windspeed: Wind speed, normalized to 67
- casual_user_count: count of casual bicycle renters
- registered_user_count: count of registered (member) bicycle renters
- total_user_count: count of total bicycle renters
You can download the data set from here.
The regression model
We’ll build a regression model in which the dependent variable is registered_user_count, and explanatory variables or the covariates as they are called, are the following:
season, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed.
Since we are modeling counts, we will use the Poisson regression model from the Python statsmodels library.
Let’s begin by importing all the required packages:
import pandas as pd from patsy import dmatrices import numpy as np import statsmodels.api as sm import statsmodels.stats.stattools as st import matplotlib.pyplot as plt
Load the dataset into a Pandas data frame:
df = pd.read_csv('bike_sharing_dataset_daywise.csv', header=0, parse_dates=['dteday'], infer_datetime_format=True)
Create the training and test data sets:
mask = np.random.rand(len(df)) < 0.8 df_train = df[mask] df_test = df[~mask] print('Training data set length='+str(len(df_train))) print('Testing data set length='+str(len(df_test)))
Create the regression expression in Patsy syntax. We are saying that registered_user_count is the dependent variable and it depends on all the variables mentioned on the right side of the tile (~) sign:
expr = 'registered_user_count ~ season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed'
Set up the X, y matrices:
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Build and train the Poisson regression model:
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
poisson_training_results is an object of type GLMResults. Print the model summary:
This prints out the following:
It’s good to see that all model coefficients are statistically significant at a p-value of < 0.001 i.e. at 99.999% confidence level.
As against linear regression models, models in which the dependent variable is a count, rarely produce normally distributed residual error distributions.
So we have to normalize the raw-residuals using other means. Three popular transformations are:
- The Pearson residual
- The Anscombe residual
- The Deviance residual
Statsmodels makes all three kinds of residual errors available to us via GLMResults.resid_pearson, GLMResults.resid_anscombe, and GLMResults.resid_deviance variables
The raw residuals are available in GLMResults.resid_response.
We’ll print out the Skewness and the Kurtosis of all 4 kinds of residuals to see which kind is the most normally distributed. A perfectly normal distribution has a Skewness of zero, and a Kurtosis of 3.0.
#Skewness of four different types of residual errors raw_residual_skewness = st.robust_skewness(poisson_training_results.resid_response) pearson_residual_skewness = st.robust_skewness(poisson_training_results.resid_pearson) anscobe_residual_skewness = st.robust_skewness(poisson_training_results.resid_anscombe) deviance_residual_skewness = st.robust_skewness(poisson_training_results.resid_deviance) #Kurtosisof four different types of residual errors raw_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_response) pearson_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_pearson) anscobe_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_anscombe) deviance_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_deviance) #build the stats data table residual_stats = [ ['Raw residual', raw_residual_skewness, raw_residual_kurtosis], ['Pearson\'s residual', pearson_residual_skewness, pearson_residual_kurtosis], ['Anscombe residual', anscobe_residual_skewness, anscobe_residual_kurtosis], ['Deviance residual', deviance_residual_skewness, deviance_residual_kurtosis] ] #Organize it into a Pandas Dataframe residual_stats_df = pd.DataFrame(residual_stats, columns=['Residual', 'Skewness', 'Kurtosis']) #Print the Dataframe print(residual_stats_df)
We get the following output:
We are looking for the residual type with Skewness that is closest to zero, and Kurtosis that is closest to 3.0. Given how close all the values are in this table, it’s hard to make a good choice of residual type from this table. We’ll choose the Pearson’s residual as it’s Skewness is closest to zero.
Let’s plot the frequency distribution of the Pearson’s residuals:
We see the following bimodal distribution!
What could be going on here that caused the regression errors to be bimodal?
Factors leading to bi-modality of Residual Errors
When regression errors are bimodal, there can be a couple of things going on:
The dependent variable is a binary variable such as Won/Lost, Dead/Alive, Up/Down etc. But your regression model may be generating as predictions, a continuously varying real valued values. So if you have encoded (Won=1, Lost=0), or (Dead=0, Alive=1) etc. and your regression model generates predicted values in a narrow range around 0.5, for e.g. 0.55, 0.58, 0.6, 0.61, etc, then the regression errors will peak either on one side of zero (when the true value is 0), or on the other side of zero (when the true value is 1).
This in turn can happen if you have not chosen the right kind of regression model for the data set, and/or you are missing key explanatory variables without which most of the predictions are hovering around 0.5.
In our example, the dependent variable is a count, and we have used a model that is appropriate for counts, namely the Poisson model, so the above situation can be ruled out.
Another reason for bimodal residuals is that one may have left out a binary regression variable which influences the output value in the following way:
When the variable’s value is 0, the output ranges within a certain range.
When the variable’s value is 1, the output takes on a whole new range of values that are not there in the earlier range.
It turns out, that is indeed the case in our example! We have left out the binary variable yr (year of observation) which takes on two values: 0=year 2011, and 1=year 2012.
Let’s see how yr influences the dependent variable registered_user_count. We’ll plot registered_user_count against yr:
df.plot.scatter('yr', 'registered_user_count') plt.show()
All the values of registered_user_count lying in the red circle occur only when year=1 i.e. 2012. Since we left out yr in our model, our model couldn’t explain the presence of higher valued counts. It made a systematic error each time the actual value was in the 5000–7000 range, leading to the second peak developing in the residuals plot.
If this theory is correct, adding yr into the model should resolve the bimodality of residuals. So let’s add yr into the regression expression and build and train the model again:
#Set up the regression expression. This time, include yr. expr = 'registered_user_count ~ yr + season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed' #Set up the X and y matrices for the training and testing data sets y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe') #Using the statsmodels GLM class, train the Poisson regression model on the training data set poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit() #print out the training summary print(poisson_training_results.summary())
This yields the following output:
This time, notice that yr is one of the regression variables and its coefficient is statistically significant at 99.999% confidence level.
Also, the Log-likelihood of this model is -52683, which is considerably greater than the Log-likelihood (-114530) of the previous model without yr. Log-Likelihood is a measure of goodness-of-fit. In this case, it indicates that adding yr has considerably improved the model’s goodness of fit. This is consistent with our theory that the absence of yr was preventing our model from explaining all those high-value counts.
So far so good.
Like before, we’ll once again calculate the Skewness and Kurtosis of the four different kinds of residuals from this revised Poisson model, and print the statistics out in a Pandas Dataframe:
Once again, the Pearson’s residual shines among the lot with a Skewness that is closest to zero. Although, the Kurtosis of the raw residual is closest to 3.0.
Let’s plot the frequency distribution of the Pearson’s residual of the revised model:
df.plot.scatter('yr', 'registered_user_count') plt.show()
This time, the bi-modality has been extinguished to a large extent with the addition of the yr variable.
Let’s look at the two plots side by side, without yr on the left, and with yr on the right:
We have successfully fixed most of the bi-modality in the residuals by including an important binary regression variable. Adding the missing variable has also increased the model’s goodness-of-fit score by a considerable amount.
Let’s fetch the model’s predictions on the test data set and we’ll also plot the predicted versus actual counts:
#fetch the predictions poisson_predictions = poisson_training_results.get_prediction(X_test) #.summary_frame() returns a pandas DataFrame predictions_summary_frame = poisson_predictions.summary_frame() #print the predictions print(predictions_summary_frame) #The mean column contains the predicted count predicted_counts=predictions_summary_frame['mean'] #get the actual count from y_test actual_counts = y_test['registered_user_count'] #Plot the predicted counts versus the actual counts for the test data. fig = plt.figure() fig.suptitle('Predicted versus actual user counts') predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts') actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts') plt.legend(handles=[predicted, actual]) plt.show()
We get the following plot:
Here is the complete source code:
Citations and Copyrights
The Bike sharing data set has been downloaded from UCI Machine Learning Repository. Curated data set download link.
Data set cited in paper: Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1–15, Springer Berlin Heidelberg, doi:10.1007/s13748–013–0040–3.
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
PREVIOUS: Testing for Normality Of Residuals Using Skewness and Kurtosis
NEXT: R-squared, Adjusted-R-Squared And Pseudo R-Squared