An introduction to the HC estimator, and its importance in building effective regression models in the face of heteroskedasticity
In this chapter, we’ll bring together two fundamental topics in statistical modeling, namely the covariance matrix and heteroskedasticity.
Covariance matrices are the workhorses of statistical inference. They are used for determining whether regression coefficients are statistically significant (i.e., different from zero), and for constructing confidence intervals for each coefficient. To do this work, they make a few crucial assumptions. Chief among these is that the model's errors are homoskedastic, i.e., they have constant variance, and that the errors are not auto-correlated.
In practice, it is often reasonable to assume that the errors are not auto-correlated, especially in cross-sectional data (though rarely in time-series settings).
Unfortunately, the assumption of homoskedastic errors is, more often than not, violated. Fortunately, there are ways to build a covariance matrix of regression coefficients (and thereby perform sound statistical inference) in the face of non-constant variance, i.e., heteroskedastic regression errors.
In this chapter, we'll study one such technique, known as White's heteroskedasticity-consistent (HC) estimator (named after its creator, Halbert White), in which we build a covariance matrix of regression coefficients that is robust to heteroskedastic regression errors.
This chapter is part 1 of the following two part series:
PART 1: Introducing White’s Heteroskedasticity Consistent Estimator
PART 2: A tutorial on White’s Heteroskedasticity Consistent Estimator using Python and Statsmodels
In PART 1, we will get into the theory of the HC estimator, while in PART 2, we walk through a Python-based tutorial on how to use it for statistical inference that is robust to heteroskedasticity.
Consider the following general form of the regression model:
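In symbols (writing the unspecified function as f, a notational choice of mine), the general form reads:

```latex
y = f(X, \beta, \epsilon)
```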
In this model, we express the response variable y as some unspecified function of the regression variables matrix X, the coefficients β and the regression errors ϵ. If this model has k regression coefficients including the intercept, then for a data set of size n, y is a vector of size [n x 1], X is a matrix of size [n x k], β is a vector of size [k x 1], and ϵ is a vector of size [n x 1].
A linear form of this model that consists of two regression variables and an intercept would be expressed as follows:
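In symbols:

```latex
y = \beta_1 \mathbf{1} + \beta_2 x_2 + \beta_3 x_3 + \epsilon
```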
Here, 1, x_2, and x_3 are column vectors of size [n x 1] with 1 being just a column vector of 1s. For any given row i in the data set, the above model can be expressed as follows:
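In equation form (writing x_{2i} for the ith element of the column vector x_2, and similarly for x_3):

```latex
y_i = \beta_1 \cdot 1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i
```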
In practice, it is useful to express Eq (1) or (1a) compactly as follows:
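That compact matrix form is:

```latex
y = X\beta + \epsilon
```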
Or in its full matrix glory as follows (in our example, k=3):
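Written out element by element for our two-variable-plus-intercept example:

```latex
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & x_{21} & x_{31} \\
1 & x_{22} & x_{32} \\
\vdots & \vdots & \vdots \\
1 & x_{2n} & x_{3n}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
```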
β_1, β_2,…β_k represent the true, population level values of the coefficients corresponding to the situation when the sample is the entire population.
But for all practical situations the sample data set is some random subset of size n from the population. When a linear model is trained a.k.a. “fitted” on this sample data set, we have the following equation of the fitted model:
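With β_cap written as β with a hat, the fitted model is:

```latex
y = X\hat{\beta} + e
```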
Here, y and X have the same meaning as before. But β_cap is a vector of the fitted values of the coefficients β_1_cap, β_2_cap,…β_k_cap, and e is a vector of the residual errors of the fitted model. Note the subtle but crucial difference between the error term ϵ = (y − Xβ) of the regression model, and the residual error term e = (y − Xβ_cap) of the same model when it is fitted on the data set.
Since fitting the model on different samples, each of size n, will yield a different set of coefficient values each time, the fitted coefficients β_1_cap, β_2_cap,…β_k_cap can be considered as random variables. The fitted values β_1_cap, β_2_cap,…β_k_cap each have a mean value which can be shown to be the corresponding true population value β_1, β_2,…β_k, and they have a variance around that mean.
The covariance matrix of the regression model's fitted coefficients contains the variances of the fitted coefficients, and the covariances of the fitted coefficients with each other. Here's what this matrix looks like for a model containing k regression variables (including the intercept):
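```latex
\mathrm{Cov}(\hat{\beta} \mid X) =
\begin{bmatrix}
\mathrm{Var}(\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_1,\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_1,\hat{\beta}_k) \\
\mathrm{Cov}(\hat{\beta}_2,\hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_2,\hat{\beta}_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(\hat{\beta}_k,\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_k,\hat{\beta}_2) & \cdots & \mathrm{Var}(\hat{\beta}_k)
\end{bmatrix}
```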
The covariance matrix is a square matrix of size [k x k]. The element at position (i,j) in this matrix contains the covariance of the ith and the jth fitted coefficients. The values along the main diagonal are the variances of the k fitted coefficients, while the off-diagonal elements contain the covariances between fitted coefficients. The square roots of the main diagonal elements are the standard errors of the fitted coefficients. The matrix is symmetric about the main diagonal.
The covariance matrix of regression coefficients is used to determine whether the model’s coefficients are statistically significant, and to calculate the confidence interval for each coefficient.
For linear models of the form y = Xβ + ϵ (which form the backbone of statistical modeling), the formula for the covariance matrix contains within it another covariance matrix, namely the covariance matrix of the model’s error term ϵ. The covariance matrix of errors is also a square matrix. For a data set of size n, the size of this matrix is [n x n] and its structure is as follows:
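In symbols (assuming, as is standard, that the errors have zero conditional mean so that the covariance matrix equals E[ϵϵ'|X]):

```latex
\mathrm{Cov}(\epsilon \mid X) = E[\epsilon \epsilon' \mid X] =
\begin{bmatrix}
\mathrm{Var}(\epsilon_1) & \mathrm{Cov}(\epsilon_1,\epsilon_2) & \cdots & \mathrm{Cov}(\epsilon_1,\epsilon_n) \\
\mathrm{Cov}(\epsilon_2,\epsilon_1) & \mathrm{Var}(\epsilon_2) & \cdots & \mathrm{Cov}(\epsilon_2,\epsilon_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(\epsilon_n,\epsilon_1) & \mathrm{Cov}(\epsilon_n,\epsilon_2) & \cdots & \mathrm{Var}(\epsilon_n)
\end{bmatrix}
```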
Incidentally, note that both matrices contain conditional variances and covariances, conditioned as they are upon the regression matrix X.
Just as with the fitted coefficients, each error ϵ_i is a random variable with a mean and a variance. If the model’s errors are homoskedastic (constant variance), then for each row i in the data set, Var(ϵ_i) is some constant σ². Additionally, if the errors are not auto-correlated, then for each pair of errors ϵ_i and ϵ_j where i != j, Cov(ϵ_i, ϵ_j)=0. Hence, if the model’s errors are homoskedastic and non-correlated, the above matrix reduces to a rather simple form as follows:
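That is:

```latex
\mathrm{Cov}(\epsilon \mid X) = \sigma^2 I =
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
```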
Where I is an identity matrix of size [n x n]. An identity matrix is the matrix equivalent of the scalar number 1. I’s main-diagonal elements are all ones and the rest of the elements are zero. And just as with the number 1, (I)^n = I and the inverse of I is also I.
In a linear model, when the model’s errors are homoskedastic and non-auto-correlated, the covariance matrix of fitted regression coefficients has the following nice and simple form, the derivation of which is covered in the chapter on covariance matrices:
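This is Eq (6):

```latex
\mathrm{Cov}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1} \tag{6}
```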
In the above equation, X' is the transpose of X. Transposing is like turning X on its side by interchanging the rows with the columns. Since X has dimension [n x k], X' has dimension [k x n]. The product X'X is the product of two matrices of size [k x n] and [n x k] respectively, and hence it's a square matrix of size [k x k]. Its inverse, denoted by a −1 superscript, is also of size [k x k]. Finally, we scale the inverse by the factor σ², which is the constant variance of all error terms.
Eq (6) suggests that to calculate the covariance matrix of fitted coefficients, one must have access to X and σ². While X is fully accessible to the experimenter, the error term ϵ of the regression model is an intrinsically unobservable variable, and so is its variance σ². The best that the experimenter can do is to calculate the residual errors e of the fitted model and use them as a proxy for ϵ. Fortunately, it can be proven that the variance s² of residual errors forms an unbiased estimate of the variance σ² of the model’s errors, and with the residual errors known, s² is easily computable. Hence, in practice, we use the following formula to estimate the variance of the fitted coefficients:
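In symbols, with s² computed from the residuals using the usual degrees-of-freedom correction:

```latex
\widehat{\mathrm{Cov}}(\hat{\beta} \mid X) = s^2 (X'X)^{-1},
\qquad s^2 = \frac{e'e}{n-k} \tag{6a}
```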
Unfortunately, reality is seldom simple. While we can often assume non-auto-correlatedness of error terms, especially in cross-sectional data sets, heteroskedasticity of errors is quite commonly encountered in both cross-sectional and time series data sets.
When the error term is heteroskedastic, the covariance matrix of errors looks like this:
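With the errors assumed non-auto-correlated, the matrix is diagonal, each error having its own variance σ²_i:

```latex
\mathrm{Cov}(\epsilon \mid X) = \sigma^2 \Omega =
\sigma^2
\begin{bmatrix}
\omega_1 & 0 & \cdots & 0 \\
0 & \omega_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \omega_n
\end{bmatrix},
\qquad \omega_i = \frac{\sigma_i^2}{\sigma^2} \tag{7}
```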
In this matrix, σ² is just a common scaling factor which we have extracted out of the matrix, so that each ω_i = σ²_i/σ². It is easy to see that when the errors are homoskedastic, σ²_i = σ² for all i, hence ω_i = 1 for all i, and Ω = I, the identity matrix.
Incidentally, some statistical texts use Ω to represent a covariance matrix of errors that are both heteroskedastic and auto-correlated.
Either way, when the covariance matrix of errors is not σ²I but σ²Ω, the covariance matrix of fitted regression coefficients loses the simple form given by Eq (6) or (6a), and instead takes on the ominous form shown below:
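This is the so-called "sandwich" form:

```latex
\mathrm{Cov}(\hat{\beta} \mid X)
= (X'X)^{-1}\, X'(\sigma^2 \Omega)X \,(X'X)^{-1} \tag{8}
```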
If you are curious about how Eq (8) was derived, I have mentioned its derivation at the end of the chapter.
For now, we need to see how best to estimate the variance of the fitted coefficients in the face of heteroskedastic errors using equation (8).
Equation (8) consists of three segments: two outer pieces built entirely from X, and a middle piece X'(σ²Ω)X. Since the X matrix is completely accessible to the experimenter, the terms X' and (X'X)^(-1) are easily computable, and thus the outer pieces pose no difficulty. The problem lies with Ω and the variance factor σ², both of which are unobservable since they relate to the error term ϵ of the regression model. Thus, the middle piece at the heart of the equation cannot be computed.
Recollect that when the errors were homoskedastic and non-autocorrelated, we could estimate σ²I using the variance s²I of the residual errors. Similarly, in this case we need a way to estimate (σ²Ω), which will make the middle piece X'(σ²Ω)X computable.
And this is where the estimator proposed by Halbert White in 1980 comes into play. White proposed a way to estimate the middle term X'(σ²Ω)X at the heart of equation (8). Specifically, he proved the following:
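In symbols, with x_i denoting the ith row of X (a [1 x k] row vector) and e_i the ith residual of the fitted model:

```latex
\operatorname*{plim}_{n \to \infty} \; \frac{1}{n} \sum_{i=1}^{n} e_i^2 \, x_i' x_i
\;=\;
\operatorname*{plim}_{n \to \infty} \; \frac{1}{n}\, X'(\sigma^2 \Omega)X \tag{9}
```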
Equation (9) deserves some explanation. The plim operator on the L.H.S. of Eq (9) stands for probability limit. It is a shorthand way of saying that the random variable computed by the summation on the L.H.S. of Eq (9) converges in probability to the random variable on the R.H.S. as the size of the data set n tends to infinity. A simple way to understand "convergence in probability" is to imagine two random variables A and B. If A converges in probability to B, the probability distribution of A becomes more and more like that of B as n (the size of the data sample) increases, becoming (almost) identical to the probability distribution of B as n tends to infinity. Thus, the properties of A become indistinguishable from those of B as n becomes arbitrarily large.
The fact that the summation on the L.H.S. of Eq (9) and the term on the R.H.S are random variables can be deduced as follows:
Let’s look at the R.H.S. of Eq (9). Each time one randomly selects a sample of size n, X, the regression matrix, is likely to take on a different set of values. That makes the X matrix a random variable with some (unknown) probability distribution. The term on the R.H.S. of Eq (9) namely, (X’σ²ΩX)/n, is a function of X which makes it a random variable having some probability distribution.
Now let's look at the L.H.S. On the L.H.S., we have the terms x'_i and x_i, which are, respectively, the transpose of the ith row of X and the ith row itself. Both are random variables for the same reasons that X and X' are random variables. Hence the entire scaled summation on the L.H.S. is also a random variable with some probability distribution.
Let’s inspect the summation on the L.H.S.:
x_i is the ith row of X, so it's a row vector of size [1 x k], and x'_i, its transpose, is of size [k x 1]. Hence their matrix product x'_i x_i is a [k x k] square matrix: the outer product of the ith row with itself. We scale this matrix by e_i², the squared residual error for the ith row. The summation sums up n such [k x k] matrices, each scaled by the corresponding squared residual error from the fitted model. The outcome of the summation is a square matrix of size [k x k].
Now let’s look at the R.H.S. of Eq (9):
X is the regression matrix of size [n x k], which makes its transpose X' of size [k x n]. σ² is a scalar. Ω is of size [n x n]. Working left to right, X'Ω is of size [k x n], and (X'Ω)X is of size [k x k]. Thus the R.H.S. is also a square matrix of size [k x k], scaled by the scalar σ²/n.
With this estimate in hand, let’s revisit the formula for the covariance of the fitted coefficients in the face of heteroskedastic errors:
Thanks to Dr. White, we now have a way to estimate the middle term X'(σ²Ω)X at the center of Eq (8) as follows:
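Replacing X'(σ²Ω)X in Eq (8) with White's estimate from Eq (9) gives:

```latex
\widehat{\mathrm{Cov}}(\hat{\beta} \mid X)
= (X'X)^{-1} \left( \sum_{i=1}^{n} e_i^2 \, x_i' x_i \right) (X'X)^{-1} \tag{10}
```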
Equation (10) is known as White’s Heteroskedasticity Consistent (HC) Estimator. It gives the regression modeler a way to estimate the asymptotic covariance matrix of the fitted regression coefficients in the face of heteroskedastic errors. The word ‘asymptotic’ implies that the estimator is valid, strictly speaking, only for infinitely large data sets. More on this fact below.
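To make Eq (10) concrete, here is a short numerical sketch of computing White's estimator by hand. The simulated data set (with error variance growing with |x_2|) and all variable names are purely illustrative:

```python
import numpy as np

# Simulated data set with heteroskedastic errors (illustrative only).
rng = np.random.default_rng(42)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # [n x k]; first column is the intercept
beta = np.array([1.0, 2.0, -0.5])                           # "true" population coefficients
eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))          # error variance grows with |x_2|
y = X @ beta + eps

# Ordinary least squares: beta_cap = (X'X)^-1 X'y, residuals e = y - X beta_cap
XtX_inv = np.linalg.inv(X.T @ X)
beta_cap = XtX_inv @ X.T @ y
e = y - X @ beta_cap

# Eq (10): (X'X)^-1 (sum_i e_i^2 x_i' x_i) (X'X)^-1
# The sum is computed in one shot, since X' diag(e^2) X == sum_i e_i^2 x_i' x_i
middle = (X * (e**2)[:, None]).T @ X
hc0 = XtX_inv @ middle @ XtX_inv

se_hc0 = np.sqrt(np.diag(hc0))  # White (HC0) standard errors of the fitted coefficients
print(se_hc0)
```

The one-liner `(X * (e**2)[:, None]).T @ X` avoids an explicit loop over rows: scaling each row of X by e_i² and multiplying by X' reproduces the summation in Eq (10) exactly.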
While using it, we should keep in mind its following two limitations:
Recall that this estimator is based on an identity that is valid only as the data set becomes arbitrarily large in size, i.e. technically as n → ∞. In practical settings, this limitation makes the estimator strictly valid only for very large data sets. For smaller data sets (known as small samples), say when n is no more than a couple of hundred data points, White's HC estimator tends to underestimate the variance of the fitted coefficients, making them look more significant than they actually are. This underestimation in small samples can usually be corrected by scaling Eq (10) by the factor n/(n − k), as shown by MacKinnon and White in their 1985 paper (see paper link at the end of the chapter).
The second potential issue with White’s HC estimator is that it assumes that there is little to no auto-correlation in the errors of the regression model. This assumption makes it suitable only for cross-sectional and panel data sets and makes it especially unsuitable for time series data sets which typically contain auto-correlations extending to several periods into the past.
All things said, White's heteroskedasticity-consistent estimator provides a powerful means to estimate the covariance matrix of fitted coefficients, and thereby perform consistent statistical inference in the face of heteroskedasticity.
Derivation of the covariance matrix of coefficient estimates
We start with the linear model:
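```latex
y = X\beta + \epsilon \tag{a}
```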
A least-squares estimation of β yields the following estimator:
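```latex
\hat{\beta} = (X'X)^{-1} X' y \tag{b}
```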
Substituting y in (b) with Xβ + ϵ and rearranging the terms yields the following derivation:
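Step by step:

```latex
\hat{\beta} = (X'X)^{-1} X'(X\beta + \epsilon)
= (X'X)^{-1}(X'X)\beta + (X'X)^{-1}X'\epsilon
= \beta + (X'X)^{-1}X'\epsilon \tag{c}
```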
The variance of a random variable is the expected value of the square of its mean-subtracted value: Var(X) = E[(X − X_mean)²]; for a random vector, the "square" is the outer product of the mean-subtracted vector with its own transpose. The mean of the coefficient estimates β_cap is simply the population value β. Thus we continue with the derivation as follows:
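```latex
\mathrm{Var}(\hat{\beta} \mid X)
= E\big[(\hat{\beta} - \beta)(\hat{\beta} - \beta)' \mid X\big]
= E\big[(X'X)^{-1}X'\,\epsilon \epsilon'\, X (X'X)^{-1} \mid X\big]
= (X'X)^{-1}X'\, E[\epsilon\epsilon' \mid X]\, X(X'X)^{-1}
```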
The expectation term E[ϵϵ'|X] at the heart of the above derivation is the covariance matrix of the regression model's errors conditioned upon X. We know that this covariance matrix can be represented as σ²Ω. Thus, substituting the expectation with σ²Ω yields the formula for the covariance matrix of estimated coefficients when the model's errors are heteroskedastic:
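```latex
\mathrm{Var}(\hat{\beta} \mid X)
= (X'X)^{-1}X'(\sigma^2 \Omega)X(X'X)^{-1}
= \sigma^2 (X'X)^{-1}X'\Omega X(X'X)^{-1}
```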
In PART 2, we'll walk through a tutorial on how to use White's Heteroskedasticity Consistent Estimator using Python and Statsmodels.
Citations and Copyrights
White, Halbert. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica, vol. 48, no. 4, 1980, pp. 817–838. JSTOR, https://doi.org/10.2307/1912934. Accessed 25 Sep. 2022.
MacKinnon, James G., and Halbert White. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics, vol. 29, no. 3, 1985, pp. 305–325. https://doi.org/10.1016/0304-4076(85)90158-7. (https://www.sciencedirect.com/science/article/pii/0304407685901587)
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.