A tutorial on how the variance-covariance matrices are constructed and when to be skeptical about the data they report

This chapter covers a topic that I hope will be a joy to read for those of you who are mathematically inclined, and also for those of us who have wondered what standard errors are and where they come from.

As a quick refresher of concepts: the **variance** is a measure of a random variable's "spread" or variation around its mean (a.k.a. its expected value), while the **covariance** measures how correlated the variations of two random variables are with each other.

The **standard error** is simply the standard deviation of the error, where "error" has different meanings depending on whether it is used in the context of a regression coefficient or in the context of the prediction of the model as a whole. We'll get into this soon.

The **variance-covariance matrix** is a square matrix, i.e. it has the same number of rows and columns. The elements that lie along its main diagonal (the one that goes from top-left to bottom-right) contain the variances, while all other elements contain the covariances. Thus, the **variance-covariance matrix of the fitted coefficients** of a regression model contains the variances of the fitted model's coefficient estimates along its main diagonal, and it contains the pair-wise covariances between coefficient estimates in the non-diagonal elements.

Similarly, the **variance-covariance matrix of the error terms** of the regression model contains the variance of each error term along its main diagonal and the covariances between all pairs of error terms everywhere else.
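These definitions are easy to check numerically. The following minimal sketch (using simulated data, not the data set from this series) builds the variance-covariance matrix of two correlated random variables with NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two correlated random variables, 1000 draws each
a = rng.normal(size=1000)
b = 0.8 * a + rng.normal(scale=0.5, size=1000)

# np.cov returns the (2 x 2) variance-covariance matrix:
# variances on the main diagonal, covariances everywhere else
V = np.cov(a, b)
print(V.shape)                        # (2, 2)
print(np.isclose(V[0, 1], V[1, 0]))  # True: the matrix is symmetric
```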

In the previous chapter, we got an intuitive feel for why the fitted model’s coefficients or the error terms exhibit variances, and what role these matrices play in regression modeling.

In this article, we'll use the **C**lassical **L**inear **R**egression **M**odel (CLRM) as the vehicle for showing how the formulas for the variance-covariance matrix of coefficient estimates and the variance-covariance matrix of errors emerge naturally from the fundamental assumptions of the CLRM and its solution. Even though the material presented here is for the linear model, the overall conceptual approach carries through to many other popularly used regression models.

This article is PART 2 of the following two-part series on variance-covariance matrices:

PART 1: An Overview Of The Variance-Covariance Matrices Used in Linear Regression

**PART 2: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression**

There is some amount of very simple linear algebra (read: matrices and operations on matrices) that we will encounter in this article.

To begin with, we’ll quickly (re)acquaint ourselves with a few basic linear algebra concepts.

## A short primer on Linear Algebra

**Matrix multiplication:** Two matrices **A** and **B** can be multiplied if the number of columns of **A** equals the number of rows of **B**. The resulting matrix's dimensions are the number of rows of **A** times the number of columns of **B**. The following figure illustrates this product operation. For simplicity, we have dropped the column subscript of the matrix **β**.
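As a quick sanity check on the dimension rule, here is a minimal NumPy sketch (the matrices are arbitrary examples, not taken from the data set):

```python
import numpy as np

# A is (2 x 3) and B is (3 x 2): the number of columns of A equals
# the number of rows of B, so the product AB exists and is (2 x 2)
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])
print((A @ B).shape)  # (2, 2)

# The reverse product BA also exists here and is (3 x 3)
print((B @ A).shape)  # (3, 3)
```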

**The transpose of a matrix:** The transpose of a matrix results from interchanging the rows and columns of the matrix. It's as if the matrix is turned on its side. Here is the regression matrix **X** and its transpose **X'** (pronounced *X prime*):

**Multiplication of X' and X:** The multiplication of **X'** of size *(k x n)* with **X** of size *(n x k)* yields a matrix of size *(k x k)*. As we shall soon see, the **X'X** matrix appears everywhere in regression analysis.

Let's revisit the automobiles data set that we saw in my previous article on variance-covariance matrices:

We’ll carve out a regression matrix ** X** containing the following 3 regression variables:

City_MPG

Engine_Size

Curb_Weight

Here's what **X'** looks like:

And here is the **X'X** matrix for the sample of size 205 from our automobile data set. As you can see, **X'X** is a *(3 x 3)* matrix:

Here are a couple of useful identities associated with the transpose operation:

- The **transpose of the product of two matrices** is the product of the transposes of the individual matrices in *reverse order*: (**AB**)' = **B'A'**
- The **transpose of a transposed matrix** gives you back the original matrix: (**X'**)' = **X**

**Identity matrix:** The identity matrix contains 1s along the main diagonal (the one going from top-left to bottom-right) and zeros in all other elements. It is usually denoted as *I_n* (**I** subscript *n*) or simply **I**, and is a square matrix of dimension *(n x n)*. The identity matrix is conceptually a multi-dimensional version of the number *1*.

*I_n* has the following properties:

- *I_n* raised to any power equals *I_n*. For e.g. **I**³ = **I**.
- Any matrix **Z** of size *(m x n)* multiplied by *I_n* yields back **Z**, i.e. **ZI** = **Z**.
- The inverse of *I_n* is *I_n*. This is conceptually akin to knowing that *1* divided by *1* is, well, *1*.
- Any invertible square matrix **Z** of size *(n x n)* times its inverse yields *I_n*. This last property brings us to the next concept: the inverse of a matrix.

**Inverse of a matrix:** The inverse of a matrix is conceptually the multi-dimensional equivalent of the inverse of a scalar number *N*. The inverse of a matrix is calculated using a somewhat complex formula that we'll skip for now. In regression analysis, the things to remember about a matrix's inverse are the following:

- The inverse of **X** is denoted as **X**^(-1), i.e. **X** with a superscript -1.
- Only 'invertible' matrices have an inverse. In the scalar world, zero does not have a meaningful inverse: 1/0 = ??. It's the same concept with matrices.
- The inverse of a square matrix **X**^(-1) times **X** equates to *I_n*, i.e. **X**^(-1)**X** = **I**. Just as in the scalar world, *1/n* times *n* is *1*.

Now let's apply these concepts to the topic at hand, which is the derivation of the covariance matrix of the linear regression model.

## Formulation of the variance-covariance matrix of a regression model

We’ll start with the equation of the linear regression model as follows:

In the above equation, **y** is the dependent variable, **X** is the matrix of regression variables, **β** is the vector of regression coefficients *β_1, β_2, β_3, …, β_k* containing the *population level* values of each coefficient (including the intercept *β_1*), and **ϵ** is the vector of error terms. **ϵ** is the difference between the observed value of **y** and the modeled value of **y**. The error terms reflect the effect of all that which is not or cannot be modeled.

We assume that the first column of **X** is a column of 1s holding the place for the intercept of regression *β_1*.

Here’s the matrix form:

Solving (a.k.a. "fitting") this model on a sample of size *n* yields estimated values of **β**, which we will denote as **β_cap**. Thus, the fitted model looks like this:

In the above equation, **e** is the vector of **residual errors** (a.k.a. **residuals**). The residual **e** is the difference between the observed value of **y** and the value **y_cap** that is predicted by the fitted model.

*y_cap*If the model is fitted using the least squares minimization technique, known as **O**rdinary **L**east **S**quares (OLS), we are seeking to minimize the following sum of squares of residual errors, also called **R**esidual **S**um of **S**quares (RSS):

In the above equation, *x_i* is the *ith* row of the regression matrix **X**.
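Before deriving the matrix form, it may help to confirm numerically that the summation form of RSS and the matrix product **e'e** agree. A tiny sketch with a made-up residual vector:

```python
import numpy as np

# A made-up (3 x 1) vector of residuals
e = np.array([[0.5], [-1.2], [0.3]])

rss_sum = np.sum(e**2)     # RSS as a summation: sum of e_i squared
rss_mat = (e.T @ e)[0, 0]  # RSS as the matrix product e'e

print(np.isclose(rss_sum, rss_mat))  # True
```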

We’ll now derive a matrix based formulation for the **R**esidual **S**um of **S**quares.

Let's reexamine the vector of error terms **ϵ** in equation (1). For a sample of size *n*, **ϵ** is a column vector of size *(n x 1)*:

Each *ϵ_i* is a random variable that has a certain mean (a.k.a. **expectation**) that is conditional upon **X**, and a certain variance around that mean that is also conditional upon **X**.

*X*The column vector of conditional expectations of ** ϵ **is represented as

*E(*

*ϵ**|*

*X**)*:

Subtracting *E(**ϵ**|**X**)* from ** ϵ **yields the following:

Transposing this mean-centered column vector yields the following row vector:

Next, let's multiply (**ϵ** - E(**ϵ**|**X**))' with (**ϵ** - E(**ϵ**|**X**)). Notice that (**ϵ** - E(**ϵ**|**X**))' is a row vector of size *(1 x n)* and (**ϵ** - E(**ϵ**|**X**)) is a column vector of size *(n x 1)*. Hence their product is a *(1 x 1)* matrix, i.e. for our purposes, a single number. The following diagram illustrates this multiplication:

The third step in the above procedure is where we take the product of the two matrices shown in the second step.

Here, we get an important insight: the summation in the final step is *(n-k)* times the conditional variance of the random variable **ϵ**. Thus, we have reached the following important result:

Now, one of the chief assumptions in linear regression is the following:

The error terms are mean-independent of the regression variables.

This assumption is known as the **mean independence** property of the regression model. It implies that conditioned upon **X**, the mean of **ϵ** is zero, which means that in the above formula, the conditional mean of errors *E(**ϵ**|**X**) = **0***, where **0** is a column matrix of size *(n x 1)* containing only zeroes. Hence, the above formula for the error variance of **ϵ** reduces to the following:

*ϵ*Let’s revisit the derivation shown in equation (4). In the first step, we will apply our assumption of mean independence by substituting *E(**ϵ|X**) *with ** 0**. Similarly, in the last step, we will substitute

*E(ϵ_i*

*|X**)*with

*0*

**for all**

*i*in the summation. Doing these two substitutions gets us another important

**formula for the sum of squares of error terms of a regression model**:

Notice that nowhere in these derivations have we assumed that the model is linear. We have only made one assumption so far, namely that of **mean independence**.

We will now make a second assumption:

The residual errors **e** of the fitted model are unbiased estimators of the error terms **ϵ**.

This assumption allows us to **estimate the sum of squares of errors** using the **sum of squares of residuals**. In other words:

Let’s once again revisit the linear regression model. In order to fit the model on a sample of size *n* using the **O**rdinary **L**east **S**quares (OLS) estimation technique, we need to minimize the residual sum of squares given by equation (3). We have also shown that the summation on the L.H.S. of equation (3) can be estimated using the matrix product of transpose of residual errors times itself, i.e. *e’e.*

We also know that the fitted model's equation can be written as **y** = **Xβ_cap** + **e**, or, in terms of **e**: **e** = (**y** - **Xβ_cap**). Thus, we need to minimize the following so as to fit the linear model using the least squares technique:

To do so, we will take the partial derivative of the above equation, set it to the zero matrix **0**, and solve the system of *k* equations so obtained, corresponding to the *k* regression variables, so as to get **β_cap** = *[β_1_cap, β_2_cap, …, β_k_cap]*.

We will not actually solve (8) but instead we will directly state the end result, which is the value of ** β_cap** that will minimize the residual sum of squares:

This elegant, closed-form equation gives us, in a single line, the least squares estimator for the linear regression model fitted using the OLS technique.
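The closed-form estimator is straightforward to apply. Below is a minimal sketch with simulated data and an assumed true **β** (not the automobile data set), cross-checked against NumPy's own least squares solver:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate y = X*beta + noise for an assumed true beta
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([2.0, -1.5, 0.7])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS estimator: beta_cap = (X'X)^(-1) X'y  -- equation (9)
beta_cap = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Cross-check against NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_cap, beta_lstsq))  # True
```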

Let's recollect that our goal is to arrive at the formula for the variance-covariance matrix of the regression coefficients **β**. To that end, we'll follow the approach suggested in Greene (2018) and substitute **y** with (**Xβ** + **ϵ**) in equation (9) as follows:

*ϵ*Let’s distribute out the terms inside the blue colored bracket on the R.H.S.:

Let’s recolor the two terms as follows:

The gray bit in the first term of the above equation is the multiplication of a matrix with its inverse, which equates to the identity matrix **I**. And we know that **Iβ** = **β**. We'll replace the magenta bit in the second term with the placeholder matrix **A**.

*A*Let’s remember this result.

Let's recollect that the variance of the least squares estimate of **β**, namely **β_cap**, is what we are after.

Earlier in this article, we saw that **β_cap** is a random variable. The variance of any random variable **Z** can be expressed as the expectation of the square of its mean-subtracted, a.k.a. "de-meaned", value. Thus, *Var(**Z**) = E[(**Z** - **Z_bar**)²]*.

We know that the mean of **β_cap** is the population level value of the coefficients **β**. Since **β** and **β_cap** are *(k x 1)* column vectors, we can concisely express the formula for the variance by taking the expectation of the product of (**β** - **β_cap**) and its transpose (**β** - **β_cap**)'.

*(β and β_cap)’*The matrix ** (β — β_cap)** is of size

*(k x 1)*and its transpose

**is of size**

*(β — β_cap)’**(1 x k)*and therefore their product is a square matrix of size

*(k x k)*which is exactly what we want in a covariance matrix.

In the above equation for the variance, let's replace **β_cap** with (**β** + **Aϵ**) and simplify the resulting expression as follows:

Let's inspect the last line. The center term *E[**ϵϵ'**|**X**]* deserves some explanation.

*E[**ϵϵ'**|**X**]* is the conditional expectation of **ϵϵ'** given **X**. We know that **ϵ** is a column vector of size *(n x 1)*, and so its transpose is a row vector of size *(1 x n)*. Therefore, their product **ϵϵ'** is a square matrix of size *(n x n)* as follows:

Since the mean of the error terms is assumed to be zero, the diagonal elements containing *E[ϵ_i·ϵ_i|**X**]* are actually the **conditional variances of the error terms** *ϵ_i*, while the non-diagonal elements *E[ϵ_i·ϵ_j|**X**]* are the **conditional covariances of the error terms** *ϵ_i* and *ϵ_j*.

We will now make two important assumptions:

- **Homoskedastic errors**: The error terms of the linear regression model are homoskedastic, i.e. they have constant conditional variance. Thus, *E[ϵ_i·ϵ_j|**X**]* is some constant *σ²* when *i = j*. This assumption causes all the expectation terms along the main diagonal of the variance-covariance matrix of errors to have the same constant value *σ²*.
- **Non-correlated errors**: The error terms of the linear regression model are not correlated with each other. Thus, *E[ϵ_i·ϵ_j|**X**] = 0* when *i* is not equal to *j*. This assumption causes all the expectation terms that are not along the main diagonal of the variance-covariance matrix of errors to have a value of zero.

With these two assumptions, the variance-covariance matrix of errors becomes the following:

We will now substitute *E[**ϵϵ'**|**X**]* with *σ²***I** in equation (10) to get the following equation for the variance of the least squares estimator:

To simplify the above equation, we'll simplify the gray bit on the R.H.S. using the identity (**AB**)' = **B'A'**. We'll also use the identities **A**^(-1)**A** = **I**, **IA** = **AI** = **A**, and (**A'**)' = **A** while simplifying as follows:

*A*Equation (11) is one of the fundamental results in the field of statistical modeling.

As we can see, the variance-covariance matrix of the fitted regression coefficients is a function of the values of all the regression variables in the data sample and of the variance of the error terms of the regression, which is assumed to be constant.
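Equation (11) can also be checked by simulation. The sketch below (simulated data with an assumed σ and **β**, not the automobile sample) refits the model on many fresh error draws for one fixed **X** and compares the empirical covariance of the estimates with σ²(**X'X**)^(-1):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Theoretical variance-covariance of beta_cap: sigma^2 (X'X)^(-1)
theory = sigma**2 * np.linalg.inv(X.T @ X)

# Monte Carlo: the fixed map H sends each fresh y to its OLS estimate
H = np.linalg.inv(X.T @ X) @ X.T
estimates = np.array([H @ (X @ beta + rng.normal(scale=sigma, size=n))
                      for _ in range(20000)])
empirical = np.cov(estimates.T)

# The empirical covariance approximates the theoretical one
print(np.allclose(theory, empirical, rtol=0.1, atol=1e-3))  # True
```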

In equation (11), while **X** is known to us, the population level variance *σ²* is not a known quantity. However, the sample variance *s²*, which is the **variance of the residual errors of the fitted model**, forms an unbiased estimate of *σ²*. And from our earlier work on computing the variance of residual errors, particularly equations (5) and (7), we have:

Where, as usual, **e'e** is the product of the transpose of the vector of residual errors of the fitted model with itself, *n* is the sample size, and *k* is the number of regression variables including the intercept. You may want to verify that the equation on the R.H.S. yields a scalar quantity: the residual errors vector **e** is a column vector of size *(n x 1)* and **e'** is of size *(1 x n)*; hence **e'e** is of size *(1 x 1)* and can be considered a scalar for the purposes of computing *s²*.

Thus, we have arrived at the following unbiased *estimate* of the variance-covariance matrix of the coefficients of the fitted OLS regression model:

If the model has *k* regression variables including the intercept, the variance-covariance matrix is a square matrix of size *(k x k)*. This can be confirmed by noting that **X** is a matrix of size *(n x k)*, so **X'** is of size *(k x n)*, and therefore the product **X'X** is a square matrix of size *(k x k)*, and so is its inverse. *s²* acts as a scaling factor for this matrix.

As with any variance-covariance matrix, the elements lying on the main diagonal of this matrix contain the variances of the *k* coefficient estimates in **β_cap** around their respective population values in **β**, while the non-diagonal elements contain the covariances of *β_cap_i* with *β_cap_j*.

### Confidence intervals of the regression coefficients' estimates

The square root of each diagonal element of the variance-covariance matrix is the standard deviation a.k.a. the **standard error** of the respective regression coefficient’s estimate in ** β_cap**.

These standard errors can be directly plugged into the formula for the (1 - α)100% confidence interval of each regression coefficient around its estimated value as follows:

In the above formula, *β_cap_i* is the estimated value (a.k.a. the mean) of the *ith* regression coefficient of the fitted model. The *t* value inside the square bracket is the critical value returned from the 2-sided *Student's t*-distribution with *(n-k)* degrees of freedom, where *n* is the sample size and *k* is the number of regression coefficients including the intercept. *se_i_i* is the square root of the *ith* diagonal element in the variance-covariance matrix of **β_cap** from equation (12).
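As a sketch, the interval can be computed with scipy.stats. The coefficient estimate and standard error below are placeholder values chosen for illustration, not taken from the fitted model:

```python
import scipy.stats

# Placeholder values for one coefficient's estimate and standard error
beta_cap_i = -0.057
se_i = 0.0024
n, k, alpha = 50, 3, 0.05

# Two-sided critical value from Student's t with (n - k) degrees of freedom
t_crit = scipy.stats.t.ppf(1 - alpha / 2, df=n - k)

# The (1 - alpha)100% confidence interval around the estimate
ci = (beta_cap_i - t_crit * se_i, beta_cap_i + t_crit * se_i)
print(ci)
```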

### The assumptions underlying the formulation of the variance-covariance matrix of regression coefficients

The formulation for the variance-covariance matrix presented in equation (12) has emerged from the following set of 5 assumptions that we progressively made during the derivation process:

1. **Mean-independence of errors**: The error terms of the model are mean independent of the regression variables: *E(**ϵ**|**X**) = **0***
2. **Linear form**: The regression model has a linear form: **y** = **Xβ** + **ϵ**
3. **Unbiased residuals**: The residual errors **e** of the fitted model are unbiased estimates of the error terms **ϵ**
4. **Homoskedastic errors**: The model's errors are homoskedastic: *E[ϵ_i·ϵ_j|**X**] = σ²* (a constant) when *i = j*
5. **Uncorrelated errors**: The model's errors are not correlated with each other: *E[ϵ_i·ϵ_j|**X**] = 0* when *i ≠ j*

#### When to be skeptical about the variance-covariance matrix

If the above set of assumptions is violated, the preceding derivations no longer hold, and equation (12) cannot be used to compute the variance-covariance matrix of the model. If we choose to still compute the variance-covariance matrix using equation (12), the variance and covariance values in the matrix will be inaccurate. Consequently, the **standard errors of the coefficients computed from the matrix will be incorrect**, and **we will end up reporting wrong confidence intervals for the regression coefficients**.

To be sure, for a given data set, the first three assumptions (mean independence, linear form, and residuals as unbiased estimates of the error terms) determine the viability of the linear regression model. Assumptions (4) and (5) only describe the behavior of the variances of the model's errors. Although assumptions (1) through (5) together lead up to the formulation of the linear OLS regression model's variance-covariance matrix, if (1) through (3) are satisfied but (4) and/or (5) are not, the linear regression model is still a viable model for the data set at hand. But in this situation, one should not report standard errors and confidence intervals of the coefficient estimates using equation (12).

When you build and fit a linear OLS model using a statistical modeling package such as statsmodels, the software will usually report standard errors computed using the formula in equation (12) in the model training summary, and it will also report the (1 - α)100% confidence intervals of the coefficient estimates using these standard errors. Assumptions (2) and (3), linear form and residuals as unbiased estimates of the error terms, are baked into the OLS technique. It is therefore important to verify that assumptions (1), (4) and (5), i.e. mean independence, homoskedasticity, and uncorrelatedness of the error terms, are satisfied by the fitted model. If they are not satisfied, one should not rely on the standard errors and confidence intervals reported by the statistical package.

As an illustration of equation (12), let’s use it to compute the variance-covariance matrix for a linear OLS model fitted on the motor vehicles data set.

First, we will fit the model on a randomly selected sample of size 50 from the overall data set. We’ll use the **statsmodels** package to build and fit the model as follows:

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_3vars.csv', header=0)
SAMPLE_SIZE = 50
#Select a random sample of size SAMPLE_SIZE
df_sample = df.sample(n=SAMPLE_SIZE)
#Setup the model expression in Patsy syntax
model_expr = 'City_MPG ~ Curb_Weight + Engine_Size'
#Carve out the X and y matrices using Patsy
y_train, X_train = dmatrices(model_expr, df_sample, return_type='dataframe')
#Build an OLS regression model using Statsmodels
olsr_model = sm.OLS(endog=y_train, exog=X_train)
#Fit the model on (y, X)
olsr_results = olsr_model.fit()
#Print the training summary of the fitted model
print(olsr_results.summary())
```

We see the following training output:

One can see that the standard errors of coefficients reported by statsmodels are the following:

And their (1 - 0.05)100% = 95% confidence intervals are the following:

Let’s manually calculate the variance-covariance matrix for the fitted coefficients using equation (12).

To use equation (12), we’ll need to calculate *s²*, which is the variance of the residuals. It is calculated using the formula *(**e**’**e**)/(n-k)* as follows:

```
e = np.array(olsr_results.resid)
#Reshape it to a (50 x 1) column vector
e = e.reshape(e.shape[0], 1)
e_prime = np.transpose(e)
#Calculate s² = e'e/(n-k) with n=50 and k=3
resid_variance = np.matmul(e_prime, e)/(50-3)
print('Variance of residuals='+str(resid_variance[0][0]))
```

We get the following output:

Variance of residuals=25.074639949635987

The variance-covariance matrix is calculated using equation (12) as follows:

```
X=np.array(X_train)
X_prime = np.array(np.transpose(X))
coeff_covs_matrix = np.linalg.inv(np.matmul(X_prime, X))*resid_variance
print('Variance-Covariance Matrix of coefficient estimates=\n'+str(coeff_covs_matrix))
```

We get the following output:

Variance-Covariance Matrix of coefficient estimates=

[[ 1.39215336e+01 -7.68038461e-03 5.11395679e-02]

[-7.68038461e-03 5.77116102e-06 -5.62371161e-05]

[ 5.11395679e-02 -5.62371161e-05 7.23625275e-04]]

The square roots of the main diagonal elements are the standard errors of the fitted coefficients:

```
#Taking the square root of the whole matrix yields NaNs for the negative
#covariance elements; only the main diagonal is of interest here
coeff_std_errors = np.sqrt(coeff_covs_matrix)
print('Standard errors of coefficient estimates=\n'+str(coeff_std_errors))
```

We see the following output:

Standard errors of coefficient estimates=

[[3.73115714e+00            nan 2.26140593e-01]

[           nan 2.40232409e-03            nan]

[2.26140593e-01            nan 2.69002839e-02]]

The main diagonal elements are:

[**3.731**15714, **0.002**40232409, **0.0269**002839]

These values match the standard errors reported by statsmodels in its training summary.

## References, Citations and Copyrights

### Data set

**The Automobile Data Set citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. **Download link**

### Paper and Book Links

William H. Greene, *Econometric Analysis*, 8th Edition, 2018, Pearson

**PREVIOUS:** An Overview of the Variance-Covariance Matrices Used in Linear Regression

**NEXT:** Introduction to Heteroskedasticity