An illustration of how an artifact that is fundamental to regression modeling is constructed, using a real-world data set.
The variance-covariance matrix forms the keystone artifact of regression models. The variance-covariance matrix of the regression model’s errors is used to determine whether the model’s error terms are homoskedastic (constant variance) and uncorrelated. The variance-covariance matrix of the fitted regression model’s coefficients is used to derive the standard errors and confidence intervals of the fitted model’s coefficient estimates. Both matrices are used in forming the prediction intervals of the model’s forecasts.
The variance-covariance matrix is a square matrix i.e. it has the same number of rows and columns. The elements of the matrix that lie along its main diagonal i.e. the one that goes from top-left to bottom-right contain the variances while all other elements contain the co-variances. Thus, the variance-covariance matrix of the fitted coefficients of a regression model contains the variances of the fitted model’s coefficient estimates and the pair-wise covariances between coefficient estimates.
Similarly, the variance-covariance matrix of the error terms of the regression model contain the variance of each error term along its main diagonal and the covariances between all pairs of error terms.
Having said that, why the fitted model’s coefficients or the error terms have variances in the first place, and what role these matrices play in regression modeling are topics that we will delve into in this chapter.
We will use the Classical Linear Regression model as our exemplar model. The concepts we will learn are equally applicable to a large variety of commonly used regression models.
This chapter is Part 1 of the following two-part series:
PART 1: An Overview Of Variance-Covariance Matrices Used In Linear Regression
PART 2: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression
The automobile data set as our sample data set
The following data contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.
We’ll consider a small subset of this data set consisting of only three variables:
The 3-variables version is available for download from here.
Here are the first few rows of the three variable data set:
Our regression goal is to regress City_MPG on Engine_Size and Curb_Weight using a linear regression model. The model equation is:
City_MPG = β_1 + β_2*Engine_Size + β_3*Curb_Weight + ϵ
Where ϵ is the error term of the model. The error term ϵ of the regression model represents the effects of all the factors that the modeler has not or cannot measure.
The matrix version of the above equation is written as follows:
y = Xβ + ϵ
- y is an [n x 1] size column vector containing the observed values of City_MPG. We assume
- β is a [3 x 1] size column vector of regression model coefficients β_1, β_2, β_3 corresponding to the intercept, the Engine_Size and Curb_Weight.
- X is a [n x 3] size matrix containing the values of the regression variables. The first column of this matrix is a column of 1s and it acts as the placeholder for the intercept β_1.
- ϵ is an [n x 1] size column vector of the model’s regression errors.
When the model is fitted on a sample of size n, the fitted model’s equation can be written as:
y = Xβ_cap + e
- y and X have the same meaning as before.
- β_cap is a [3 x 1] size column vector contained the estimates of regression model’s coefficients β_1, β_2, β_3. β_cap are sometimes called the fitted coefficients.
- e is an [n x 1] size column vector of the fitted model’s residual errors. The residual error is the difference between the observed value and the predicted value of y. e= y — Xβ_cap
Normally, we would train this model on 75–85% of the data set and test it on the remaining 15–25% of data set. Thus, the given data set would be the sample that has been presumably drawn from a much larger (theoretically infinite sized) population.
To illustrate how the variance-covariance matrices are constructed, we will follow a somewhat different strategy, called bootstrapping. We will assume that the data set of 205 vehicles is the population, and we will follow the following regression strategy:
- We will draw 100 random samples of size 50 vehicles each, with replacement, from this population. “With replacement” means that after using the sample, we will put it back into the population so that those data are available while making the next draw. The with-replacement strategy may not be realistic in many real-world settings, but it makes the math a whole lot simpler, and it does not lead to any practical difficulties in case of the autos data set.
- We will train (a.k.a. fit) a linear regression model on each one of these 50 vehicle samples using the OLS technique. Such a model is called as an Ordinary Least Squares Regression (OLSR) model.
- After training on each sample, we’ll note down the values of the fitted model’s coefficients β_0_cap, β_1_cap, and β_2_cap. The ‘cap’ indicates that these are the estimates of the population level value of the corresponding coefficients.
- We will also note down for each sample, the residual errors of regression ‘e’. The residual error of regression is the difference between the observed value of the dependent variable (City_MPG) and the value predicted by the fitted regression model.
Here is the Python code for implementing the above bootstrapping algorithm:
Let’s view the results of our bootstrapping experiment.
Let’s print out the the table of regression coefficients (including the Intercept) from running the OLSR model on the 100 data samples using the procedure outlined above.
Each row in this table corresponds to the fitted values of the regression coefficients obtained from fitting the OLSR model on a randomly selected sample of 50 vehicles.
It can be seen from this table that the fitted model’s regression coefficients behave like random variables. Indeed, they are random variables. Each estimated coefficient follows some probability distribution and it has a mean and variance.
Using the data from the above table, let us plot the Probability Density Functions of the three regression coefficients. Here’s the Python code for it:
def draw_pdf(data, min_X, max_X, var_name): hist = np.histogram(data) hist_dist = scipy.stats.rv_histogram(hist) X = np.linspace(min_X, max_X, 20) fig = plt.figure() fig.suptitle('PDF of ' + var_name) plt.plot(X, hist_dist.pdf(X), label='PDF') plt.xlabel(var_name) plt.ylabel('Density') plt.show() #Plot the PDFs of the three coefficient estimates draw_pdf(df_sample_beta['Intercept'], 0, 100, 'estimated Intercept ') data=df_sample_beta['Curb_Weight'] draw_pdf(data, min(data)*0.9, max(data)*1.1, 'estimated coefficient for Curb_Weight') data=df_sample_beta['Engine_Size'] draw_pdf(data, min(data)*0.9, max(data)*1.1, 'estimated coefficient for Engine_Size')
We get the following three plots:
Constructing the variance-covariance matrix of regression coefficients
We can use the table of regression coefficient values to calculate the variance of each coefficient as well as the pair-wise covariance of the three coefficients.
Let’s recollect the formulas for variance and covariance.
Given a random variable x that is realized over a sample of size n, the following sample variance of x forms an unbiased estimate of the population variance of a x:
Here, x_bar is the mean of x, and the -1 in the denominator represents the single degree of freedom lost due to the inclusion of the mean.
A similar unbiased estimator of the population level covariance between two random variables x and z is as follows:
Using these formulae, we will calculate the variances and pair-wise covariances for the regression coefficient values.
The coefficient means are as follows:
And the variances and covariances are as follows:
The following Python code can be used to compute the means of the coefficient estimates and the variance-covariance matrix of regression coefficients:
#Calculate the mean estimate for each coefficient coeff_means = df_sample_beta.mean() #Calculate the variance-covariance matrix for each coefficient coeff_covs = df_sample_beta.cov()
We’ll print them out:
In the above matrix, the elements along the main diagonal indicated by the red boxes contain the variances of the respective coefficient estimates while the non-diagonal elements contain the pair-wise covariances.
The negative coefficient between Engine_Size and Curb_Weight may seem counter-intuitive but the value is so small that one should ignore the sign. Indeed, given the nature of the sampling technique, each time we conduct the 1000 sample experiment, we will get slightly different values for the variances and covariances.
The above process of deriving the covariance matrix for the model’s coefficients is called the bootstrap technique. It is important to remember that the covariances of regression coefficients are finite sample estimates of the respective true, population level covariances (which are usually assumed to be unknown).
While using the bootstrap technique, when the number of draws is small, one needs to adjust up the estimated covariances using the factor (D+1)/D, where D is the number of draws. In our experiment, D=100 and this adjustment factor is only 1.01 and can be ignored.
Standard errors of the model’s coefficients
The standard error of a coefficient’s estimate is simply the standard deviation of the random variable that represents the coefficient’s estimate.
SE(β_cap|X) = STDEV(β_cap|X) = SQRT(Var(β_cap|X))
Recollect that the diagonal elements of the variance-covariance matrix contain the variances of coefficients. Hence, the standard error for each coefficient can be calculated by taking the square root of the respective diagonal element of the covariance matrix.
Let’s calculate and print out the standard deviations:
coeff_std_errors = np.sqrt(coeff_covs) print(coeff_std_errors)
Here are the standard errors of three coefficients shown in red boxes:
The above technique of calculating standard errors of coefficients is called bootstrapping of standard errors, and the standard errors so obtained are called bootstrapped standard errors of coefficient estimates.
Confidence intervals of the model’s coefficients
The (1-α)*100% confidence interval for each coefficient is calculated using the following formula:
In the above formula:
- β_cap_i is the fitted value of the ith coefficient reported by the model after it is fitted on the data sample.
- The t value inside the square bracket is the critical value returned from the 2-sided t-distribution with (n-k) degrees of freedom where n is the sample size and k is the number of regression coefficients including the intercept.
- se_i_i is square root of the ith diagonal element in the variance-covariance matrix of β_cap.
To calculate the confidence intervals, we’ll need to fit the OLSR model on a randomly drawn sample of size 50:
# Select a random sample of size SAMPLE_SIZE df_sample = df.sample(n=SAMPLE_SIZE) # carve out the X and y matrices using Patsy y_train, X_train = dmatrices(model_expr, df_sample, return_type='dataframe') # Build an OLS regression model using Statsmodels olsr_model = sm.OLS(endog=y_train, exog=X_train) # Fit the model on (y, X) olsr_results = olsr_model.fit() #Print the training summary of the fitted model print(olsr_results.summary())
We get the following training summary output:
The fitted coefficients β_cap are as follows:
The sample size n is 50 and there are 3 regression variables including the intercept. So, (n-k)= 50–3=47. The 2-sided t-value at α=0.05 is 2.012.
The 95% confidence interval for the estimate for Curb_Weight’s coefficient would be calculated as follows:
95% CI for Curb_Weight
= (-0.009330) +/- (2.012 * 0.002586)
= (-0.009330) +/- (0.005203032)
= [-0.014533032, -0.004126968]
This closely matches the 95% CI for Curb_Weight reported by statsmodels in the training summary:
Let’s also compare the standard errors reported by Statsmodels and the corresponding ones we have calculated using the Bootstrapping technique:
Here are the ones reported by Statsmodels:
Here are the ones we have bootstrapped:
We can see that while the standard errors for Curb_Weight and Engine_Size closely match those reported by Statsmodels, the one for Intercept is off by a small margin (3.731 versus 3.915). So should we use the bootstrapped standard errors or should we use the ones reported by the statistical package? After all, if the variance-covariance matrix is miss-specified, the standard errors of the coefficient estimates will be incorrect, and so will be the confidence intervals.
We’ll address this important question in the next chapter: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression
References, Citations and Copyrights
The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download link
PREVIOUS: The Assumptions Of Linear Regression