We’ll understand how conditional variance and covariance matrices are calculated and how they are used in regression modelling.
Conditional Variance and Conditional Covariance are concepts that are central to statistical modeling. In this chapter, we’ll learn what they are, and we’ll illustrate how to calculate them using a real-world data set.
First, a quick refresher on what variance and covariance are.
Variance of a random variable measures its variation around its mean. The covariance between two random variables is a measure of how correlated their variations around their respective means are.
The conditional variance of a random variable X is a measure of how much variation is left behind after some of it is ‘explained away’ via X’s association with other random variables Y, Z, W, etc.
It is expressed in notation form as Var(X|Y,Z,W) and is read as the variance of X conditioned upon Y, Z and W.
First, let’s state the formula for the unconditional (total) variance:
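Written out for a sample of n observations x_1, ..., x_n, and assuming the n-1 (sample) denominator that the Pandas and NumPy calculations later in this chapter use, the formula can be stated as:

```latex
\mathrm{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(x_i - E(X)\bigr)^2 \tag{1}
```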
In the above formula, E(X) is the “unconditional” expectation (mean) of X.
The formula for conditional variance is obtained by simply replacing the unconditional expectation with the conditional expectation as follows (Note that in equation (2) below, we are now calculating the variance of Y):
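Assuming the same n-1 denominator, and with x_i and y_i denoting the values of X and Y in row i of the data set, the conditional variance can be written as:

```latex
\mathrm{Var}(Y \mid X) = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(y_i - E(Y \mid X = x_i)\bigr)^2 \tag{2}
```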
E(Y|X) is the value of Y that is predicted by a regression model that is fitted on a data set in which the dependent variable is Y and the explanatory variable is X. The index i is implicit in the conditional expectation, i.e. for each row i in the data set, we use E(Y|X=x_i).
Here, our choice of regression model is important. A well-chosen model will explain a substantial share of the variance in Y, so the conditional variance of Y on X will be correspondingly small. A poorly chosen model, on the other hand, will leave most of the variance in Y unexplained, resulting in a large conditional variance.
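To make the effect of model choice concrete, here is a minimal sketch on made-up data (not the automobile data set). It assumes a quadratic relationship between x and y, and uses NumPy's polyfit as a stand-in for a regression library. The helper name conditional_variance is hypothetical.

```python
import numpy as np

# Hypothetical synthetic data: y depends on x quadratically, plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2.0 * x**2 + rng.normal(scale=0.5, size=x.size)

def conditional_variance(y, y_pred):
    """Sample variance of y around the fitted conditional expectations."""
    return np.sum(np.square(y - y_pred)) / (len(y) - 1)

# Poorly chosen model: a straight line in x
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)

# Well-chosen model: a quadratic in x
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

total_var = np.var(y, ddof=1)
cond_var_linear = conditional_variance(y, linear_pred)
cond_var_quadratic = conditional_variance(y, quadratic_pred)

print('Total variance =', total_var)
print('Conditional variance (linear fit) =', cond_var_linear)
print('Conditional variance (quadratic fit) =', cond_var_quadratic)
```

On data like this, the straight line should leave most of the variance unexplained, while the quadratic fit should bring the conditional variance down to roughly the noise level.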
The above formula for conditional variance can be extended to more than one conditioning variable by using a regression model in which the X matrix contains more than one regression variable.
Let’s illustrate the procedure for calculating conditional variance using some real world data. The following data set contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.
We’ll consider a small subset of this data set consisting of the following six variables:
This 6-variable data set can be downloaded from here.
Let’s plot Engine_Size versus Num_Cylinders. We’ll use Python and the Pandas and Matplotlib packages to load the data into a DataFrame and display the plot:
Let’s import all the required packages, including ones that we will use later.
import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
Now let’s load the data file into a Pandas DataFrame and plot Engine_Size versus Num_Cylinders.
#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_6vars.csv', header=0)

#Drop all rows that contain missing values
df = df.dropna()

#Plot Engine_Size versus Num_Cylinders
fig = plt.figure()
fig.suptitle('Engine_Size versus Num_Cylinders')
plt.xlabel('Num_Cylinders')
plt.ylabel('Engine_Size')
plt.scatter(df['Num_Cylinders'], df['Engine_Size'])

#Plot a horizontal line at the unconditional mean of Engine_Size
plt.plot([0, df['Num_Cylinders'].max()], [df['Engine_Size'].mean(), df['Engine_Size'].mean()], color='red', linestyle='dashed')

#Group the DataFrame by Num_Cylinders and calculate the mean for each group
df_grouped_means = df.groupby(['Num_Cylinders']).mean()

#Print out all the grouped means
print(df_grouped_means)

#Plot the group-specific means of Engine_Size
for i in df_grouped_means.index:
    mean = df_grouped_means['Engine_Size'].loc[i]
    plt.plot(i, mean, color='red', marker='o')

plt.show()
Here is the table of grouped means i.e. the means conditioned upon various values of Num_Cylinders.
And we also see the following plot showing the variation in Engine_Size across different values of Num_Cylinders:
The red horizontal line indicates the unconditional mean value of 126.91. The red dots indicate the mean Engine_Size for different values of Num_Cylinders. These are the conditional means a.k.a. conditional expectations of Engine_Size for different values of Num_Cylinders and they are denoted as E(Engine_Size|Num_Cylinders=x).
Unconditional (Total) variance in Engine_Size
Let’s revisit the formula for the total variance of X:
In the above formula, if X=Engine_Size, the mean, denoted by E(X), is 126.88. Using this formula, we calculate the sample variance of Engine_Size as 1726.14. This is a measure of the variation of Engine_Size around its unconditional expectation.
In Pandas, we can get the value of the total variance as follows:
unconditional_variance_engine_size = df['Engine_Size'].var()
print('Unconditional variance in Engine_Size='+str(unconditional_variance_engine_size))
We see the following output:
Unconditional variance in Engine_Size=1726.1394527363163
Conditional variance in Engine_Size
The variance of Engine_Size conditioned upon Num_Cylinders is the variance left over in Engine_Size after some of it has been ‘explained’ by the regression of Engine_Size on Num_Cylinders. We can use Equation (2) to calculate it as follows:
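As a minimal sketch of this calculation, the snippet below uses the group mean of Engine_Size for each value of Num_Cylinders (the red dots in the plot) as the conditional expectation. That is one simple choice, and the predictions of a fitted regression model could be used instead. The numbers are a made-up toy sample, not the automobile data set.

```python
import numpy as np

# Hypothetical toy sample (NOT the automobile data set)
num_cylinders = np.array([4, 4, 4, 6, 6, 6, 8, 8])
engine_size = np.array([97.0, 108.0, 110.0, 152.0, 164.0, 173.0, 234.0, 258.0])

# Conditional expectation for each row: the mean Engine_Size of its own group
cond_mean = np.empty_like(engine_size)
for c in np.unique(num_cylinders):
    mask = num_cylinders == c
    cond_mean[mask] = engine_size[mask].mean()

# Equation (2): squared deviations from the conditional means, divided by n-1
n = len(engine_size)
conditional_variance = np.sum(np.square(engine_size - cond_mean)) / (n - 1)

# Equation (1): squared deviations from the overall mean, divided by n-1
unconditional_variance = np.var(engine_size, ddof=1)

print('Conditional variance =', conditional_variance)
print('Unconditional variance =', unconditional_variance)
```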
We can extend this technique to multiple explanatory variables.
Suppose we wish to calculate the variance of Engine_Size conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders.
To do so, we will use the following procedure:
- Construct a regression model in which the response variable is Engine_Size and the regression variables are Curb_Weight, Vehicle_Volume, Num_Cylinders and an intercept.
- Train the model on a data set.
- Run the trained model on the data set to get the predicted (expected) values of Engine_Size for each combination of Curb_Weight, Vehicle_Volume and Num_Cylinders. These are the conditional expectations E(Engine_Size|Curb_Weight, Vehicle_Volume, Num_Cylinders) corresponding to the observed values of Engine_Size.
- Plug the observed values of Engine_Size and the predicted values from the previous step into equation (2) to get the conditional variance.
Let’s calculate it!
#Construct the regression expression. A regression intercept is included by default
olsr_expr = 'Engine_Size ~ Curb_Weight + Vehicle_Volume + Num_Cylinders'

#Carve out the y and X matrices based on the regression expression
y, X = dmatrices(olsr_expr, df, return_type='dataframe')

#Build the OLS linear regression model
olsr_model = sm.OLS(endog=y, exog=X)

#Train the model
olsr_model_results = olsr_model.fit()

#Make the predictions on the training data set. These are the conditional expectations of y
y_pred = olsr_model_results.predict(X)
y_pred = np.array(y_pred)

#Convert y from a Pandas DataFrame into an array
y = np.array(y['Engine_Size'])

#Calculate the conditional variance in Engine_Size using equation (2)
conditional_variance_engine_size = np.sum(np.square(y-y_pred))/(len(y)-1)
print('Conditional variance in Engine_Size='+str(conditional_variance_engine_size))
We get the following output:
Conditional variance in Engine_Size=167.42578329039935
As expected, this variance of 167.43 is considerably less than the total variance in Engine_Size (1726.14).
Relationship of conditional variance to R-squared
R-squared for a linear regression model is the fraction of the total variance in the dependent variable that the explanatory variables are able to ‘explain’.
We now know that the variance in y that X was not able to explain is the conditional variance Var(y|X). And the total variance in y is simply the unconditional variance Var(y). Hence R-squared can be expressed in terms of conditional and unconditional variance as follows:
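In symbols, with both variances computed using the same denominator (assumed here to be n-1, as elsewhere in this chapter), the relationship can be written as:

```latex
R^2 = 1 - \frac{\mathrm{Var}(y \mid X)}{\mathrm{Var}(y)} \tag{4}
```

Because the two denominators cancel, this is the same quantity as one minus the ratio of the residual sum of squares to the total sum of squares.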
Let’s calculate R-squared for the linear regression model that we had constructed earlier. Recollect that the dependent variable y was Engine_Size while the explanatory variables X were Curb_Weight, Vehicle_Volume and Num_Cylinders.
The total variance in y was found to be 1726.1394527363163.
The conditional variance in y, i.e. variance in y conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders was found to be 167.42578329039935.
Using equation (4), R-squared of this linear model is:
R-squared = 1 - 167.43/1726.14 = 0.903
This value matches perfectly with the value reported by statsmodels:
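In statsmodels, this value is available as the rsquared attribute of the fitted results object. The sketch below shows why the two numbers should agree. It uses made-up data, and np.linalg.lstsq stands in for statsmodels' OLS so that the snippet depends only on NumPy. All variable names and coefficients are illustrative assumptions.

```python
import numpy as np

# Hypothetical synthetic data: an intercept column plus two regressors
rng = np.random.default_rng(42)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=n)

# Ordinary least squares fit (a stand-in for sm.OLS(...).fit())
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ beta

# R-squared from the unconditional and conditional variances
var_y = np.sum(np.square(y - y.mean())) / (n - 1)
cond_var_y = np.sum(np.square(y - y_pred)) / (n - 1)
r_squared_from_variances = 1.0 - cond_var_y / var_y

# R-squared from the usual sums-of-squares definition
ss_res = np.sum(np.square(y - y_pred))
ss_tot = np.sum(np.square(y - y.mean()))
r_squared_from_sums = 1.0 - ss_res / ss_tot

print('R-squared from variances =', r_squared_from_variances)
print('R-squared from sums of squares =', r_squared_from_sums)
```

The two values agree up to floating-point error because the n-1 denominators cancel in the ratio.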
Recollect that covariance between two random variables X and Z is a measure of how correlated the variations in X and Z are with each other. Its formula is as follows:
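For a sample of n paired observations (x_i, z_i), and again assuming the n-1 denominator that Pandas uses by default, the sample covariance can be written as:

```latex
\mathrm{Cov}(X, Z) = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(x_i - E(X)\bigr)\bigl(z_i - E(Z)\bigr) \tag{5}
```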
In this formula, E(X) and E(Z) are the unconditional means (a.k.a. unconditional expectations) of X and Z.
The covariance of X and Z, conditional upon some random variable(s) W, is a measure of how correlated the variations in X and Z are around the conditional expectations of X on W, and Z on W, respectively.
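By analogy with the conditional variance, and assuming the same n-1 denominator, the conditional covariance can be written by replacing each unconditional mean with the corresponding conditional expectation:

```latex
\mathrm{Cov}(X, Z \mid W) = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(x_i - E(X \mid W = w_i)\bigr)\bigl(z_i - E(Z \mid W = w_i)\bigr) \tag{6}
```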
E(X|W) and E(Z|W) are the conditional expectations of X and Z on W. Hence (x_i - E(X|W)) is the variation in X after some of it has been explained by W. Ditto for (z_i - E(Z|W)). The index i is implicit in the two conditional expectations, i.e. for each row i in the data set, we use E(X|W=w_i) and E(Z|W=w_i).
Thus, the conditional covariance is a measure of how correlated the variations in X and Z are after some of the respective variances have been explained by the presence of W.
As with the procedure for calculating conditional variance, we can estimate the conditional expectations E(X|W) and E(Z|W) by regressing X on W, and Z on W. The respective regression model’s predictions on the training data set are the corresponding conditional expectations E(X|W) and E(Z|W) that we are seeking.
We’ll calculate the covariance between Engine_Size and Curb_Weight, conditional upon Vehicle_Volume.
First, we’ll establish a baseline by calculating the unconditional (total) covariance between Engine_Size and Curb_Weight. This can be easily done using equation (5).
Using Pandas, we can calculate this covariance as follows:
covariance = df['Curb_Weight'].cov(df['Engine_Size'])
print('Covariance between Curb_Weight and Engine_Size='+str(covariance))
We see the following output:
Covariance between Curb_Weight and Engine_Size=18248.28333333333
Let’s also view the scatter plot of mean-centered Engine_Size and mean-centered Curb_Weight to get a visual feel for this covariance:
#Plot mean-centered Curb_Weight versus Engine_Size
fig = plt.figure()
fig.suptitle('Mean centered Curb_Weight versus Engine_Size')
plt.xlabel('Mean centered Engine_Size')
plt.ylabel('Mean centered Curb_Weight')
plt.scatter(df['Engine_Size']-df['Engine_Size'].mean(), df['Curb_Weight']-df['Curb_Weight'].mean())
plt.show()
We see the following plot:
One thing we immediately notice in this plot is that there appears to be a wide variation in curb weights for vehicles with similar engine size.
There are other factors involved that could explain some of this variance in Curb Weight within a particular Engine Size.
Let’s look at Vehicle Volume as one such factor. Specifically, let’s calculate the covariance between Curb_Weight and Engine_Size conditional upon Vehicle Volume, i.e. after netting out the effect of Vehicle Volume.
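Substituting X=Engine_Size, Z=Curb_Weight and W=Vehicle_Volume into the general conditional covariance formula, and assuming the n-1 denominator used in the code, the quantity we want can be written as:

```latex
\mathrm{Cov}(\text{Engine\_Size}, \text{Curb\_Weight} \mid \text{Vehicle\_Volume})
  = \frac{1}{n-1} \sum_{i=1}^{n}
    \bigl(\text{Engine\_Size}_i - E(\text{Engine\_Size} \mid \text{Vehicle\_Volume}_i)\bigr)
    \bigl(\text{Curb\_Weight}_i - E(\text{Curb\_Weight} \mid \text{Vehicle\_Volume}_i)\bigr)
```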
In the above formula, the two conditional expectations, E(Engine_Size|Vehicle_Volume) and E(Curb_Weight|Vehicle_Volume), can be obtained by regressing Engine_Size on Vehicle_Volume and Curb_Weight on Vehicle_Volume. As before, the index i is implicit in the two expectations.
Using Pandas and statsmodels, let’s calculate this conditional covariance as follows. In the code below, X=Engine_Size, Z=Curb_Weight and W=Vehicle_Volume.
#Carve out the X and W matrices. An intercept is automatically added to W.
X, W = dmatrices('Engine_Size ~ Vehicle_Volume', df, return_type='dataframe')

#Regress X on W
olsr_model_XW = sm.OLS(endog=X, exog=W)
olsr_model_XW_results = olsr_model_XW.fit()

#Get the conditional expectations E(X|W)
X_pred = olsr_model_XW_results.predict(W)
X_pred = np.array(X_pred)
X = np.array(df['Engine_Size'])

#Carve out the Z and W matrices
Z, W = dmatrices('Curb_Weight ~ Vehicle_Volume', df, return_type='dataframe')

#Regress Z on W
olsr_model_ZW = sm.OLS(endog=Z, exog=W)
olsr_model_ZW_results = olsr_model_ZW.fit()

#Get the conditional expectations E(Z|W)
Z_pred = olsr_model_ZW_results.predict(W)
Z_pred = np.array(Z_pred)
Z = np.array(df['Curb_Weight'])

#Construct the delta vectors (observed values minus conditional expectations)
Z_delta = Z-Z_pred
X_delta = X-X_pred

#Calculate the conditional covariance
conditional_covariance = np.sum(Z_delta*X_delta)/(len(Z)-1)
print('Conditional Covariance between Curb_Weight and Engine_Size='+str(conditional_covariance))
We see the following output:
Conditional Covariance between Curb_Weight and Engine_Size=7789.498082862661
If we compare this value of 7789.5 with the total covariance of 18248.28 calculated earlier, we see that the covariance between Engine_Size and Curb_Weight net of the effect of Vehicle_Volume is indeed much smaller than without the effect of Vehicle_Volume.
Here is the complete source code used in the chapter:
References, Citations and Copyrights
The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.