# A Tutorial on White’s Heteroskedasticity Consistent Estimator Using Python and Statsmodels

How to use White’s heteroskedasticity consistent estimator with Python and statsmodels

In this chapter, we shall learn how to employ the HC estimator to perform statistical inference that is robust to heteroskedasticity.

This chapter is PART 2 of the following two part series:

PART 1: Introducing White’s Heteroskedasticity Consistent Estimator
PART 2: A tutorial on White’s Heteroskedasticity Consistent Estimator using Python and Statsmodels

In PART 1, we drilled into the theory of White’s heteroskedasticity consistent estimator. Let’s quickly recall what we learned in PART 1.

Consider the following linear model:

y_i = β_1 + β_2·x_i2 + β_3·x_i3 + … + β_k·x_ik + ϵ_i …(1)

y is the response variable. For a data set of size n, it’s a column vector of size [n x 1]. Suppose the model has k regression variables including the intercept. β is a column vector of regression coefficients [β_1, β_2, …, β_k], where β_1 is the intercept. X is the matrix of regression variables, with column 1 holding the placeholder (a column of 1s) for the intercept. X is of size [n x k]. ϵ is the [n x 1] column vector of error terms.

Here’s the matrix form of Eq (1):

y = Xβ + ϵ …(2)

A least squares estimation of Eq (2) yields the following estimator for β:

β_cap = (X’X)⁻¹X’y …(3)

Since the estimated coefficients β_cap are random variables, they have a variance-covariance structure, which is given by their [k x k] covariance matrix Cov(β_cap): the (i, j)th element of this matrix is the covariance between the ith and jth coefficient estimates.

The diagonal elements of this matrix are the variances of the coefficient estimates, and their square roots are the standard errors. These standard errors are used to derive the p values and confidence intervals of the coefficients, in other words, to test whether the coefficients are statistically significant and to quantify their precision.
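As a quick sketch of how a standard error turns into a t-statistic, a p value and a 95% confidence interval, here is a minimal calculation; the coefficient value, standard error and degrees of freedom below are made-up numbers for illustration, not values from this chapter's model:

```python
from scipy import stats

# Hypothetical values, for illustration only
beta_hat = 0.92      # a coefficient estimate
se = 0.16            # its standard error
dof = 996            # residual degrees of freedom: n - k

# t-statistic and two-sided p value for H0: coefficient = 0
t_stat = beta_hat / se
p_value = 2 * stats.t.sf(abs(t_stat), df=dof)

# 95% confidence interval: estimate +/- t_critical * standard error
t_crit = stats.t.ppf(0.975, df=dof)
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(t_stat, p_value, ci)
```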

The covariance matrix of β_cap is estimated using the following matrix equation (an explanation and the derivation of which is covered in PART 1):

Cov(β_cap) = (X’X)⁻¹X’(σ²Ω)X(X’X)⁻¹ …(4)

In the above equation, X’ is the transpose of X and the ⁻¹ operator inverts the matrix in the parentheses. At the heart of this equation is σ²Ω, the [n x n] covariance matrix of the model’s errors when the errors are heteroskedastic and non-auto-correlated:

σ²Ω = σ²·diag(ω_1, ω_2, …, ω_n) = diag(σ²_1, σ²_2, …, σ²_n) …(5)

σ² is just a common scaling factor such that ω_i = σ²_i/σ². When the model’s errors are homoskedastic (constant variance) and non-auto-correlated (quite common in cross-sectional data sets), σ²_i = σ² for all i, so ω_i = 1 for all i and Ω = I, the identity matrix. In this case, it can be shown that the formula for the covariance matrix of the coefficient estimates reduces to the following simple formula:

Cov(β_cap) = σ²(X’X)⁻¹ …(6)

Since σ² is not directly observable, we use the following formula to estimate the covariance matrix, where s² = e’e/(n−k) is the variance of the fitted model’s residual errors e:

Cov_cap(β_cap) = s²(X’X)⁻¹ …(6a)

Unfortunately, when the model’s errors are heteroskedastic, we cannot use Eq (6) or (6a) and must revert to working with Eq (4). In that equation, σ²Ω, the covariance matrix of the model’s errors, is inherently unobservable, so it must be estimated.

To do so, we use the following estimator proposed by Halbert White (also covered in detail in PART 1), or one of the improvements proposed subsequently by MacKinnon and White:

Cov_cap(β_cap) = (X’X)⁻¹(Σ_i e²_i·x_i’x_i)(X’X)⁻¹ …(7)

In Eq (7), e_i is the residual error of the fitted model corresponding to the ith predicted value: e_i = (y_i − x_i·β_cap), where x_i is the ith row of X.

In the rest of the chapter, we’ll put these results into practice. We’ll see how to use White’s HC estimator to perform statistical inference about a model’s coefficients in a situation where the model’s errors are heteroskedastic.

The following model attempts to explain poverty in a US county in terms of several socioeconomic factors, specifically, the median age of the county’s inhabitants, the vacancy rate of houses for sale in the county and the percentage of the county’s population with at least one college degree.

To train this model, we’ll use data about several socioeconomic indicators collected by the US Census Bureau and aggregated at the county level. The data is a subset of the 2015–2019 American Community Survey (ACS) 5-Year Estimates conducted by the US Census Bureau. The following table contains the data that we will use:

A subset of the American Community Survey data set pulled from the US Census Bureau under Terms Of Use

The data set used in this example can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs, or directly from the Census Bureau’s Community Explorer website.

We’ll use Python and Pandas to load the ACS data file into memory, and we’ll use the Python based statsmodels package to build and fit the linear model.

Let’s start by importing the required packages and loading the data file into a Pandas `DataFrame`:

```
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrices
from matplotlib import pyplot as plt

#Load the US Census Bureau data into a DataFrame.
#The file name below is a placeholder; point it at your copy of the data set.
df = pd.read_csv('us_census_bureau_acs_data.csv')
```

Construct the model’s equation in Patsy syntax. Statsmodels will automatically add the intercept of regression to the model, so we don’t have to explicitly specify it in the model’s equation:

```
reg_expr = 'Percent_Households_Below_Poverty_Level ~ Median_Age + Homeowner_Vacancy_Rate + Percent_Pop_25_And_Over_With_College_Or_Higher_Educ'
```
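As a quick sanity check that the intercept really is added automatically, we can run a formula through Patsy’s `dmatrices` on a toy DataFrame; the column names below are shortened stand-ins for the ACS columns, not the real ones:

```python
import pandas as pd
from patsy import dmatrices

# Toy DataFrame standing in for the ACS data (hypothetical column names)
toy = pd.DataFrame({'poverty': [10.0, 12.5, 9.1],
                    'median_age': [34.0, 41.2, 38.5]})

y, X = dmatrices('poverty ~ median_age', toy, return_type='dataframe')

# Patsy has added an 'Intercept' column of 1s automatically
print(X.columns.tolist())
```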

Build and train the model and print the training summary:

```
olsr_model = smf.ols(formula=reg_expr, data=df)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
```

We see the following summary:

Training summary of the Linear Model (Image by Author)

We will ignore the value of R-squared (or adjusted R-squared of 0.191) reported by statsmodels, as our interest lies in estimating the main effects (namely the coefficient estimates), their standard errors and 95% confidence intervals, and whether the coefficients are statistically significant. Let’s zoom into the relevant portion of the output:

Coefficient estimates, their standard errors, p values and 95% Confidence Intervals (Image by Author)

We see that all coefficient estimates are significant at p < .001. By default, statsmodels has assumed homoskedastic errors and accordingly used Eq (6a), reproduced below, to calculate the covariance matrix of the coefficient estimates, which is in turn used to calculate the standard errors, p values and confidence intervals:

Cov_cap(β_cap) = s²(X’X)⁻¹ …(6a)

We have seen that if the errors are not homoskedastic, then the above formula will yield misleading results. Thus, we need to know if the model’s errors are homoskedastic or heteroskedastic.

A simple visual test of heteroskedasticity consists of plotting the fitted model’s residuals (which are unbiased estimates of the model’s errors) against the fitted values of the response variable, as follows:

```
#Carve out the y and X matrices
y_train, X_train = dmatrices(reg_expr, df, return_type='dataframe')

#Get the predicted values of y from the fitted model
y_cap = olsr_model_results.predict(X_train)

#Plot the model's residuals against the predicted values of y
plt.xlabel('Predicted value of Percent_Households_Below_Poverty_Level')
plt.ylabel('Residual error')
plt.scatter(y_cap, olsr_model_results.resid)
plt.show()
```

We get the following scatter plot (the red lines are drawn just to illustrate the heteroskedastic variance):

Plot of the fitted model’s residual errors against the estimated value of the response variable (Image by Author)

The model’s errors are clearly heteroskedastic which should make us suspect the standard errors, p values and confidence intervals reported by statsmodels.

Fortunately, statsmodels lets you use one of several kinds of Heteroskedasticity Consistent estimators while fitting the model. These are denoted as HC0, HC1, HC2 and HC3.

The first one, HC0, is White’s original heteroskedasticity consistent estimator which we saw earlier, reproduced below:

Cov_cap(β_cap) = (X’X)⁻¹(Σ_i e²_i·x_i’x_i)(X’X)⁻¹ …(7)

HC1, HC2 and HC3 improve upon HC0 to make it less biased for small data sets. These latter three estimators are described by MacKinnon and White in their 1985 paper (see link at the bottom of this chapter).

We’ll try all four of them sequentially. In each case, we’ll fit the model and print out the training summary as follows:

```
#Ask statsmodels to use the HC estimators
olsr_model_results = olsr_model.fit(cov_type='HC0')
print(olsr_model_results.summary())

olsr_model_results = olsr_model.fit(cov_type='HC1')
print(olsr_model_results.summary())

olsr_model_results = olsr_model.fit(cov_type='HC2')
print(olsr_model_results.summary())

olsr_model_results = olsr_model.fit(cov_type='HC3')
print(olsr_model_results.summary())
```

Let’s stack the coefficient estimates from all four trainings and compare them with the original set of estimates:

Comparison of coefficient estimates, standard errors and C.I.s after using different heteroskedasticity consistent estimators (Image by Author)

The first set of four rows contains the standard errors and confidence intervals assuming homoskedastic errors, i.e. Eq (6a) is used. The next sets of rows contain the estimates from using the different HC estimators: HC0 through HC3. We see that even after using HC0 through HC3, all coefficients stay statistically significant at p < .001. This is good news. However, as expected, the standard errors of the coefficient estimates become larger, and the corresponding C.I.s become wider, after accounting for heteroskedasticity.

Let’s zoom into a portion of the above figure to see what we mean:

Let’s look at the data for the coefficient estimate of Homeowner_Vacancy_Rate. The standard error of the coefficient estimate without accounting for heteroskedastic errors is 0.088, but when we account for heteroskedasticity, the standard error of this coefficient almost doubles to 0.158.

Without accounting for heteroskedasticity, the 95% C.I.s for the coefficient estimate of Homeowner_Vacancy_Rate are [0.748, 1.091]. After accounting for heteroskedasticity, they become wider: [0.609, 1.230].

We also see that using HC1 through HC3 nudges the estimated standard errors upward relative to HC0, albeit by a very small amount. For instance, HC3 estimates the standard error of Homeowner_Vacancy_Rate’s coefficient as 0.163, only about a 3% increase over the 0.158 reported by HC0.

At any rate, the main takeaway is as follows:

The coefficient estimates of the regression model lose precision in the face of heteroskedasticity.

Hence, if the model’s errors are found to be heteroskedastic, which is a frequently occurring situation in cross-sectional data sets, one should always use a heteroskedasticity consistent estimator for performing sound statistical inference about coefficient estimates.

Here’s the complete source code used in the chapter: