An Overview Of Variance-Covariance Matrices Used In Linear Regression

An illustration of how an artifact that is fundamental to regression modeling is constructed, using a real-world data set.

The variance-covariance matrix forms the keystone artifact of regression models. The variance-covariance matrix of the regression model’s errors is used to determine whether the model’s error terms are homoskedastic (constant variance) and uncorrelated. The variance-covariance matrix of the fitted regression model’s coefficients is used to derive the standard errors and confidence intervals of the fitted model’s coefficient estimates. Both matrices are used in forming the prediction intervals of the model’s forecasts.

The variance-covariance matrix is a square matrix i.e. it has the same number of rows and columns. The elements of the matrix that lie along its main diagonal i.e. the one that goes from top-left to bottom-right contain the variances while all other elements contain the co-variances. Thus, the variance-covariance matrix of the fitted coefficients of a regression model contains the variances of the fitted model’s coefficient estimates and the pair-wise covariances between coefficient estimates.

Similarly, the variance-covariance matrix of the error terms of the regression model contain the variance of each error term along its main diagonal and the covariances between all pairs of error terms.

Having said that, why the fitted model’s coefficients or the error terms have variances in the first place, and what role these matrices play in regression modeling are topics that we will delve into in this chapter.

We will use the Classical Linear Regression model as our exemplar model. The concepts we will learn are equally applicable to a large variety of commonly used regression models.

This chapter is Part 1 of the following two-part series:

PART 1: An Overview Of Variance-Covariance Matrices Used In Linear Regression
PART 2: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression

The automobile data set as our sample data set

The following data contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.

The automobiles data set (Source: UC Irvine)

We’ll consider a small subset of this data set consisting of only three variables:
City_MPG
Engine_Size
Curb_Weight

The 3-variables version is available for download from here.

Here are the first few rows of the three variable data set:

A subset of the Automobiles data set (Source: UC Irvine)

Regression goal

Our regression goal is to regress City_MPG on Engine_Size and Curb_Weight using a linear regression model. The model equation is:

City_MPG = β_1 + β_2*Engine_Size + β_3*Curb_Weight + ϵ

Where ϵ is the error term of the model. The error term ϵ of the regression model represents the effects of all the factors that the modeler has not or cannot measure.

The matrix version of the above equation is written as follows:

y = Xβ + ϵ

Where,

y is an [n x 1] size column vector containing the observed values of City_MPG. We assume
β is a [3 x 1] size column vector of regression model coefficients β_1, β_2, β_3 corresponding to the intercept, the Engine_Size and Curb_Weight.
X is a [n x 3] size matrix containing the values of the regression variables. The first column of this matrix is a column of 1s and it acts as the placeholder for the intercept β_1.
ϵ is an [n x 1] size column vector of the model’s regression errors.

When the model is fitted on a sample of size n, the fitted model’s equation can be written as:

y = Xβ_cap + e

Where,

y and X have the same meaning as before.
β_cap is a [3 x 1] size column vector contained the estimates of regression model’s coefficients β_1, β_2, β_3. β_cap are sometimes called the fitted coefficients.
e is an [n x 1] size column vector of the fitted model’s residual errors. The residual error is the difference between the observed value and the predicted value of y. e= y — Xβ_cap

Regression strategy

Normally, we would train this model on 75–85% of the data set and test it on the remaining 15–25% of data set. Thus, the given data set would be the sample that has been presumably drawn from a much larger (theoretically infinite sized) population.

To illustrate how the variance-covariance matrices are constructed, we will follow a somewhat different strategy, called bootstrapping. We will assume that the data set of 205 vehicles is the population, and we will follow the following regression strategy:

We will draw 100 random samples of size 50 vehicles each, with replacement, from this population. “With replacement” means that after using the sample, we will put it back into the population so that those data are available while making the next draw. The with-replacement strategy may not be realistic in many real-world settings, but it makes the math a whole lot simpler, and it does not lead to any practical difficulties in case of the autos data set.
We will train (a.k.a. fit) a linear regression model on each one of these 50 vehicle samples using the OLS technique. Such a model is called as an Ordinary Least Squares Regression (OLSR) model.
After training on each sample, we’ll note down the values of the fitted model’s coefficients β_0_cap, β_1_cap, and β_2_cap. The ‘cap’ indicates that these are the estimates of the population level value of the corresponding coefficients.
We will also note down for each sample, the residual errors of regression ‘e’. The residual error of regression is the difference between the observed value of the dependent variable (City_MPG) and the value predicted by the fitted regression model.

Here is the Python code for implementing the above bootstrapping algorithm:

import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt


#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_3vars.csv', header=0)

#Set the sample size and number of random draws (with replacement)
SAMPLE_SIZE = 50
NUM_DRAWS = 100

#Set up the Dataframe to store the coefficient estimates from each training run
param_col_names = ['Intercept']
param_col_names = param_col_names + list(df.columns[1:-1])
df_sample_beta = pd.DataFrame(columns=param_col_names)

#Setup the model expression in Patsy syntax
model_expr = 'City_MPG ~ Curb_Weight + Engine_Size'

#Run the loop NUM_DRAWS times
for i in range(NUM_DRAWS):
    #Select a random sample of size SAMPLE_SIZE
    df_sample = df.sample(n=SAMPLE_SIZE)
    #carve out the X and y matrices using Patsy
    y_train, X_train = dmatrices(model_expr, df_sample, return_type='dataframe')
    #Build an OLS regression model using Statsmodels
    olsr_model = sm.OLS(endog=y_train, exog=X_train)
    #Fit the model on (y, X)
    olsr_results = olsr_model.fit()
    #Push the row of fitted coefficients into the Dataframe
    df_sample_beta.loc[len(df_sample_beta.index)] = list(olsr_results.params)

Let’s view the results of our bootstrapping experiment.

Training results

Let’s print out the the table of regression coefficients (including the Intercept) from running the OLSR model on the 100 data samples using the procedure outlined above.

print(df_sample_beta)

Table of the fitted OLSR model’s regression coefficients from fitted the model 100 times, each time on a randomly selected sample of 50 vehicles (Image by Author)

Each row in this table corresponds to the fitted values of the regression coefficients obtained from fitting the OLSR model on a randomly selected sample of 50 vehicles.

It can be seen from this table that the fitted model’s regression coefficients behave like random variables. Indeed, they are random variables. Each estimated coefficient follows some probability distribution and it has a mean and variance.

Using the data from the above table, let us plot the Probability Density Functions of the three regression coefficients. Here’s the Python code for it:

def draw_pdf(data, min_X, max_X, var_name):
    hist = np.histogram(data)
    hist_dist = scipy.stats.rv_histogram(hist)
    X = np.linspace(min_X, max_X, 20)
    fig = plt.figure()
    fig.suptitle('PDF of ' + var_name)
    plt.plot(X, hist_dist.pdf(X), label='PDF')
    plt.xlabel(var_name)
    plt.ylabel('Density')
    plt.show()

#Plot the PDFs of the three coefficient estimates
draw_pdf(df_sample_beta['Intercept'], 0, 100, 'estimated Intercept ')

data=df_sample_beta['Curb_Weight']
draw_pdf(data, min(data)*0.9, max(data)*1.1, 'estimated coefficient for Curb_Weight')

data=df_sample_beta['Engine_Size']
draw_pdf(data, min(data)*0.9, max(data)*1.1, 'estimated coefficient for Engine_Size')

We get the following three plots:

PDFs of estimates of regression coefficients (Image by Author)

Constructing the variance-covariance matrix of regression coefficients

We can use the table of regression coefficient values to calculate the variance of each coefficient as well as the pair-wise covariance of the three coefficients.

Let’s recollect the formulas for variance and covariance.

Given a random variable x that is realized over a sample of size n, the following sample variance of x forms an unbiased estimate of the population variance of a x:

An unbiased estimator of the population variance of a random variable x (Image by Author)

Here, x_bar is the mean of x, and the -1 in the denominator represents the single degree of freedom lost due to the inclusion of the mean.

A similar unbiased estimator of the population level covariance between two random variables x and z is as follows:

An unbiased estimator of the population covariance of random variables x and z — An unbiased estimator of the population **covariance** of random variables x and z (Image by Author)

Using these formulae, we will calculate the variances and pair-wise covariances for the regression coefficient values.

The coefficient means are as follows:

The mean estimates of regression model coefficients (Image by Author)

And the variances and covariances are as follows:

Variances and covariances of regression coefficients (Image by Author)

The following Python code can be used to compute the means of the coefficient estimates and the variance-covariance matrix of regression coefficients:

#Calculate the mean estimate for each coefficient
coeff_means = df_sample_beta.mean()

#Calculate the variance-covariance matrix for each coefficient
coeff_covs = df_sample_beta.cov()

We’ll print them out:

print(coeff_means)

The mean estimate of coefficients (Image by Author)

print(coeff_covs)

The variance-covariance matrix of the regression coefficients (Image by Author)

In the above matrix, the elements along the main diagonal indicated by the red boxes contain the variances of the respective coefficient estimates while the non-diagonal elements contain the pair-wise covariances.

The negative coefficient between Engine_Size and Curb_Weight may seem counter-intuitive but the value is so small that one should ignore the sign. Indeed, given the nature of the sampling technique, each time we conduct the 1000 sample experiment, we will get slightly different values for the variances and covariances.

The above process of deriving the covariance matrix for the model’s coefficients is called the bootstrap technique. It is important to remember that the covariances of regression coefficients are finite sample estimates of the respective true, population level covariances (which are usually assumed to be unknown).

While using the bootstrap technique, when the number of draws is small, one needs to adjust up the estimated covariances using the factor (D+1)/D, where D is the number of draws. In our experiment, D=100 and this adjustment factor is only 1.01 and can be ignored.

Standard errors of the model’s coefficients

The standard error of a coefficient’s estimate is simply the standard deviation of the random variable that represents the coefficient’s estimate.

Notation wise,

SE(β_cap|X) = STDEV(β_cap|X) = SQRT(Var(β_cap|X))

Recollect that the diagonal elements of the variance-covariance matrix contain the variances of coefficients. Hence, the standard error for each coefficient can be calculated by taking the square root of the respective diagonal element of the covariance matrix.

Let’s calculate and print out the standard deviations:

coeff_std_errors = np.sqrt(coeff_covs)
print(coeff_std_errors)

Here are the standard errors of three coefficients shown in red boxes:

Standard errors of coefficient estimates calculated using the bootstrapping technique (Image by Author)

The above technique of calculating standard errors of coefficients is called bootstrapping of standard errors, and the standard errors so obtained are called bootstrapped standard errors of coefficient estimates.

Confidence intervals of the model’s coefficients

The (1-α)*100% confidence interval for each coefficient is calculated using the following formula:

Formula for confidence interval of regression coefficients (Image by Author)

In the above formula:

β_cap_i is the fitted value of the ith coefficient reported by the model after it is fitted on the data sample.
The t value inside the square bracket is the critical value returned from the 2-sided t-distribution with (n-k) degrees of freedom where n is the sample size and k is the number of regression coefficients including the intercept.
se_i_i is square root of the ith diagonal element in the variance-covariance matrix of β_cap.

To calculate the confidence intervals, we’ll need to fit the OLSR model on a randomly drawn sample of size 50:

# Select a random sample of size SAMPLE_SIZE
df_sample = df.sample(n=SAMPLE_SIZE)

# carve out the X and y matrices using Patsy
y_train, X_train = dmatrices(model_expr, df_sample, return_type='dataframe')

# Build an OLS regression model using Statsmodels
olsr_model = sm.OLS(endog=y_train, exog=X_train)

# Fit the model on (y, X)
olsr_results = olsr_model.fit()

#Print the training summary of the fitted model
print(olsr_results.summary())

We get the following training summary output:

Training summary of the OLSR model (Image by Author)

The fitted coefficients β_cap are as follows:

print(olsr_results.params)

Intercept      49.987734
Curb_Weight    -0.009330
Engine_Size    -0.001189

The sample size n is 50 and there are 3 regression variables including the intercept. So, (n-k)= 50–3=47. The 2-sided t-value at α=0.05 is 2.012.

The 95% confidence interval for the estimate for Curb_Weight’s coefficient would be calculated as follows:

95% CI for Curb_Weight
= (-0.009330) +/- (2.012 * 0.002586)
= (-0.009330) +/- (0.005203032)
= [-0.014533032, -0.004126968]

This closely matches the 95% CI for Curb_Weight reported by statsmodels in the training summary:

95% CI reported by Statsmodels (Image by Author)

Let’s also compare the standard errors reported by Statsmodels and the corresponding ones we have calculated using the Bootstrapping technique:

Here are the ones reported by Statsmodels:

Here are the ones we have bootstrapped:

We can see that while the standard errors for Curb_Weight and Engine_Size closely match those reported by Statsmodels, the one for Intercept is off by a small margin (3.731 versus 3.915). So should we use the bootstrapped standard errors or should we use the ones reported by the statistical package? After all, if the variance-covariance matrix is miss-specified, the standard errors of the coefficient estimates will be incorrect, and so will be the confidence intervals.

We’ll address this important question in the next chapter: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression.

References, Citations and Copyrights

Data set

The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download link

PREVIOUS: The Assumptions Of Linear Regression

NEXT: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression

UP: Table of Contents