###### We’ll understand how conditional variance and covariance matrices are calculated and how they are used in regression modelling

Conditional Variance and Conditional Covariance are concepts that are central to statistical modeling. In this chapter, we’ll learn what they are, and we’ll illustrate how to calculate them using a real-world data set.

First, a quick refresher on what variance and covariance are.

The **variance** of a random variable measures its variation around its mean. The **covariance** between two random variables is a measure of how correlated their variations are around their respective means.
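To make the two definitions concrete, here is a minimal sketch that computes a sample variance and a sample covariance by hand. The arrays `x` and `z` are made-up toy values (they are not taken from the automobile data set), and the `n-1` denominator is the usual sample convention, which is also what Pandas uses by default.

```python
import numpy as np

# Hypothetical toy samples, not taken from the automobile data set
x = np.array([2.0, 4.0, 6.0, 8.0])
z = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)
# Sample variance: sum of squared deviations from the mean, divided by n-1
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
# Sample covariance: sum of products of paired deviations, divided by n-1
cov_xz = np.sum((x - x.mean()) * (z - z.mean())) / (n - 1)

print(var_x, cov_xz)
```

The hand-computed values should agree with `np.var(x, ddof=1)` and `np.cov(x, z)[0, 1]`.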

## Conditional variance

The **conditional variance** of a random variable **X** is a measure of how much variation is left behind after some of it is ‘explained away’ via **X**’s association with other random variables **Y**, **Z**, **W**…etc.

It is expressed in notation form as *Var(**X**|**Y**,**Z**,**W**)* and read off as the variance of **X** conditioned upon **Y**, **Z** and **W**.

First, let’s state the formula for the unconditional (total) variance:

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} \left(x_i - E(X)\right)^2}{n-1} \tag{1}$$

In the above formula, *E(**X**)* is the “unconditional” expectation (mean) of **X** and *n* is the number of rows in the data set.

The formula for conditional variance is obtained by simply replacing the unconditional expectation with the conditional expectation as follows (note that in equation (2) below, we are now calculating the variance of **Y**):

$$\mathrm{Var}(Y|X) = \frac{\sum_{i=1}^{n} \left(y_i - E(Y|X)\right)^2}{n-1} \tag{2}$$

*E(**Y**|**X**)* is the value of **Y** that is predicted by a regression model that is fitted on a data set in which the dependent variable is **Y** and the explanatory variable is **X**. The index *i* is implicit in the conditional expectation, i.e. for each row *i* in the data set, we use *E(**Y**|**X**=x_i)*.

Here, our choice of regression model is important. A correct choice of model will result in a substantial amount of the variance in **Y** being explained by the fitted model, and therefore the conditional variance of **Y** on **X** will be correspondingly small. On the other hand, an incorrect choice of model will result in a large conditional variance since the model is unable to explain most of the variance in **Y**.

The above formula for conditional variance can be extended to more than one variable on which the variance is conditioned by using a regression model in which the **X** matrix contains more than one regression variable.
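As a rough sketch of how the conditional variance formula works in practice, the snippet below fits a least-squares line with NumPy and compares the total variance of `y` with the variance left over after conditioning on `x`. The data is synthetic and the helper name `conditional_variance` is hypothetical, chosen only for this illustration.

```python
import numpy as np

def conditional_variance(y, X):
    """Estimate Var(y|X) as the sum of squared OLS residuals divided by n-1."""
    # Add an intercept column and fit by ordinary least squares
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta  # fitted values play the role of E(y | X = x_i)
    return np.sum((y - y_hat) ** 2) / (len(y) - 1)

# Synthetic data: a linear signal plus unit-variance noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 1, size=200)

total_var = np.var(y, ddof=1)
cond_var = conditional_variance(y, x)
print(total_var, cond_var)
```

Because a straight line is the right model for this synthetic data, the conditional variance should come out close to the noise variance and far below the total variance.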

### Illustration

Let’s illustrate the procedure for calculating conditional variance using some real-world data. The following data set contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.

We’ll consider a small subset of this data set consisting of the following six variables:

- City_MPG
- Curb_Weight
- Vehicle_Volume
- Num_Cylinders
- Vehicle_Price
- Engine_Size

This 6-variable data set can be **downloaded from here**.

Let’s plot Engine_Size versus Num_Cylinders. We’ll use Python and the Pandas and Matplotlib packages to load the data into a DataFrame and display the plot.

Let’s import all the required packages, including ones that we will use later.

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Now let’s load the data file into a Pandas DataFrame and plot Engine_Size versus Num_Cylinders.

```
#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_6vars.csv', header=0)
#Drop all rows that contain missing values
df = df.dropna()
#Plot Engine_Size versus Num_Cylinders
fig = plt.figure()
fig.suptitle('Engine_Size versus Num_Cylinders')
plt.xlabel('Num_Cylinders')
plt.ylabel('Engine_Size')
plt.scatter(df['Num_Cylinders'], df['Engine_Size'])
#Plot a horizontal line at the unconditional mean of Engine_Size
plt.plot([0, df['Num_Cylinders'].max()], [df['Engine_Size'].mean(), df['Engine_Size'].mean()],
         color='red', linestyle='dashed')
#Group the DataFrame by Num_Cylinders and calculate the mean for each group
df_grouped_means = df.groupby(['Num_Cylinders']).mean()
#Print out all the grouped means
print(df_grouped_means)
#Plot the group-specific means of Engine_Size
for i in df_grouped_means.index:
    mean = df_grouped_means['Engine_Size'].loc[i]
    plt.plot(i, mean, color='red', marker='o')
plt.show()
```

Here is the table of grouped means i.e. the means conditioned upon various values of Num_Cylinders.

And we also see the following plot showing the variation in Engine_Size across different values of Num_Cylinders:

The red horizontal line indicates the **unconditional mean** value of 126.88. The red dots indicate the mean Engine_Size for different values of Num_Cylinders. These are the **conditional means a.k.a. conditional expectations** of Engine_Size for different values of Num_Cylinders and they are denoted as *E(**Engine_Size**|**Num_Cylinders**=x)*.

### Unconditional (Total) variance in Engine_Size

Let’s revisit the formula for the total variance of **X**:

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} \left(x_i - E(X)\right)^2}{n-1}$$

In the above formula, if **X**=Engine_Size, the mean, denoted by *E(**X**)*, is *126.88*. Using this formula, we calculate the sample variance of Engine_Size as **1726.14**. This is a measure of the variation of Engine_Size around the unconditional expectation of *126.88*.

In Pandas, we can get the value of the total variance as follows:

```
unconditional_variance_engine_size = df['Engine_Size'].var()
print('Unconditional variance in Engine_Size='+str(unconditional_variance_engine_size))
```

We see the following output:

Unconditional variance in Engine_Size=1726.1394527363163

### Conditional variance in Engine_Size

The variance of Engine_Size conditioned upon Num_Cylinders is the variance left over in Engine_Size after some of it has been ‘explained’ by the regression of Engine_Size on Num_Cylinders. We can use Equation (2) to calculate it as follows:
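A minimal sketch of this calculation is shown below. It assumes that the conditional expectation for each row is simply the mean Engine_Size of that row's Num_Cylinders group (which is what a regression on Num_Cylinders treated as a categorical variable would predict). The `toy` DataFrame holds made-up stand-in values, not rows from the actual data set, so the printed numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in rows; the real values come from the automobile data set
toy = pd.DataFrame({
    'Num_Cylinders': [4, 4, 4, 6, 6, 8],
    'Engine_Size':   [97, 109, 121, 164, 181, 304],
})

# E(Engine_Size | Num_Cylinders = x_i) for every row: the mean of that row's group
cond_mean = toy.groupby('Num_Cylinders')['Engine_Size'].transform('mean')

# Equation (2): squared deviations from the conditional mean, divided by n-1
cond_var = ((toy['Engine_Size'] - cond_mean) ** 2).sum() / (len(toy) - 1)
total_var = toy['Engine_Size'].var()

print(cond_var, total_var)
```

Replacing `toy` with the DataFrame `df` loaded earlier would apply the same steps to the real data.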

We can extend this technique to multiple explanatory variables.

Suppose we wish to **calculate the variance of Engine_Size conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders**.

To do so, we will use the following procedure:

- Construct a regression model in which the response variable is Engine_Size and the regression variables are Curb_Weight, Vehicle_Volume, Num_Cylinders and an intercept.
- Train the model on a data set.
- Run the trained model on the data set to get the predicted (expected) values of Engine_Size for each combination of Curb_Weight, Vehicle_Volume and Num_Cylinders. These are the conditional expectations *E(Engine_Size|Curb_Weight, Vehicle_Volume, Num_Cylinders)* corresponding to the observed values of Engine_Size.
- Plug the observed values of Engine_Size and the predicted values calculated in the previous step into equation (2) to get the conditional variance.

Let’s calculate it!

```
#Construct the regression expression. A regression intercept is included by default
olsr_expr = 'Engine_Size ~ Curb_Weight + Vehicle_Volume + Num_Cylinders'
#Carve out the y and X matrices based on the regression expression
y, X = dmatrices(olsr_expr, df, return_type='dataframe')
#Build the OLS linear regression model
olsr_model = sm.OLS(endog=y, exog=X)
#Train the model
olsr_model_results = olsr_model.fit()
#Make the predictions on the training data set. These are the conditional expectations of y
y_pred=olsr_model_results.predict(X)
y_pred=np.array(y_pred)
#Convert y from a Pandas DataFrame into an array
y=np.array(y['Engine_Size'])
#Calculate the conditional variance in Engine_Size using equation (2)
conditional_variance_engine_size = np.sum(np.square(y-y_pred))/(len(y)-1)
print('Conditional variance in Engine_Size='+str(conditional_variance_engine_size))
```

We get the following output:

Conditional variance in Engine_Size=167.42578329039935

As expected, this variance of **167.43** is considerably less than the total variance in Engine_Size (**1726.14**).

### Relationship of conditional variance to R-squared

R-squared for a **linear** regression model is the fraction of the total variance in the dependent variable that the explanatory variables are able to ‘explain’.

We now know that the variance in **y** that **X** was *not* able to explain is the conditional variance *Var(**y**|**X**)*. And the total variance in **y** is simply the unconditional variance *Var(**y**)*. Hence R-squared can be expressed in terms of conditional and unconditional variance as follows:

$$R^2 = 1 - \frac{\mathrm{Var}(y|X)}{\mathrm{Var}(y)} \tag{4}$$

Let’s calculate R-squared for the linear regression model that we had constructed earlier. Recollect that the dependent variable **y** was Engine_Size while the explanatory variables **X** were Curb_Weight, Vehicle_Volume and Num_Cylinders.

The total variance in **y** was found to be **1726.1394527363163**.

The conditional variance in **y**, i.e. variance in **y** conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders was found to be **167.42578329039935**.

Using equation (4), R-squared of this linear model is:

*R-squared = 1 - 167.43/1726.14 = 0.903*

This value matches perfectly with the value reported by statsmodels:

## Conditional covariance

Recollect that covariance between two random variables **X** and **Z** is a measure of how correlated the *variations* in **X** and **Z** are with each other. Its formula is as follows:

$$\mathrm{Cov}(X, Z) = \frac{\sum_{i=1}^{n} \left(x_i - E(X)\right)\left(z_i - E(Z)\right)}{n-1} \tag{5}$$

In this formula, *E(**X**)* and *E(**Z**)* are the unconditional means (a.k.a. unconditional expectations) of **X** and **Z**.

The **covariance** of **X** and **Z**, **conditional upon** some random variable(s) **W**, is a measure of how correlated the variations in **X** and **Z** are around the conditional expectations of **X** on **W**, and **Z** on **W** respectively:

$$\mathrm{Cov}(X, Z|W) = \frac{\sum_{i=1}^{n} \left(x_i - E(X|W)\right)\left(z_i - E(Z|W)\right)}{n-1}$$

*E(**X**|**W**)* and *E(**Z**|**W**)* are the **conditional expectations** of **X** and **Z** on **W**. Hence *(x_i - E(**X**|**W**))* is the variation in **X** after some of it has been explained by **W**. Ditto for *(z_i - E(**Z**|**W**))*. The index *i* is implicit in the two conditional expectations, i.e. for each row *i* in the data set, we use *E(**X**|**W**=w_i)* and *E(**Z**|**W**=w_i)*.

Thus, the conditional covariance is a measure of how correlated the variations in **X** and **Z** are *after some of the respective variances have been explained by the* presence of **W**.

As with the procedure for calculating conditional variance, we can estimate the conditional expectations *E(**X**|**W**)* and *E(**Z**|**W**)* by regressing **X** on **W**, and **Z** on **W**. The respective regression model’s predictions on the training data set are the corresponding conditional expectations *E(**X**|**W**)* and *E(**Z**|**W**)* that we are seeking.

### Illustration

We’ll calculate the covariance between Engine_Size and Curb_Weight, conditional upon Vehicle_Volume.

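First, we’ll establish a baseline by calculating the unconditional (total) covariance between Engine_Size and Curb_Weight. This can be easily done using equation (5) as follows:

$$\mathrm{Cov}(\text{Engine\_Size}, \text{Curb\_Weight}) = \frac{\sum_{i=1}^{n} \left(\text{Engine\_Size}_i - E(\text{Engine\_Size})\right)\left(\text{Curb\_Weight}_i - E(\text{Curb\_Weight})\right)}{n-1}$$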

Using Pandas, we can calculate this covariance as follows:

```
covariance = df['Curb_Weight'].cov(df['Engine_Size'])
print('Covariance between Curb_Weight and Engine_Size='+str(covariance))
```

We see the following output:

Covariance between Curb_Weight and Engine_Size=18248.28333333333

Let’s also view the scatter plot of mean-centered Engine_Size and mean-centered Curb_Weight to get a visual feel for this covariance:

```
#Plot mean-centered Curb_Weight versus Engine_Size
fig = plt.figure()
fig.suptitle('Mean centered Curb_Weight versus Engine_Size')
plt.xlabel('Mean centered Engine_Size')
plt.ylabel('Mean centered Curb_Weight')
plt.scatter(df['Engine_Size']-df['Engine_Size'].mean(), df['Curb_Weight']-df['Curb_Weight'].mean())
plt.show()
```

We see the following plot:

One thing we immediately notice in this plot is that there appears to be a wide variation in curb weights for vehicles with similar engine size:

There are other factors involved that could explain some of this variance in Curb Weight *within* a particular Engine Size.

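Let’s look at Vehicle Volume as one such factor. Specifically, let’s **calculate the covariance between Curb_Weight and Engine_Size conditional upon Vehicle_Volume**, i.e. after netting out the effect of Vehicle_Volume:

$$\mathrm{Cov}(\text{Engine\_Size}, \text{Curb\_Weight}|\text{Vehicle\_Volume}) = \frac{\sum_{i=1}^{n} \left(\text{Engine\_Size}_i - E(\text{Engine\_Size}|\text{Vehicle\_Volume})\right)\left(\text{Curb\_Weight}_i - E(\text{Curb\_Weight}|\text{Vehicle\_Volume})\right)}{n-1}$$

In the above formula, the two conditional expectations can be obtained by regressing Engine_Size on Vehicle_Volume and Curb_Weight on Vehicle_Volume. As before, the index *i* is implicit in the two expectations.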

Using Pandas and statsmodels, let’s calculate this conditional covariance as follows. In the below piece of code, **X**=Engine_Size, **Z**=Curb_Weight and **W**=Vehicle_Volume.

```
#Carve out the X and W matrices. An intercept is automatically added to W.
X, W = dmatrices('Engine_Size ~ Vehicle_Volume', df, return_type='dataframe')
#Regress X on W
olsr_model_XW = sm.OLS(endog=X, exog=W)
olsr_model_XW_results = olsr_model_XW.fit()
#Get the conditional expectations E(X|W)
X_pred=olsr_model_XW_results.predict(W)
X_pred=np.array(X_pred)
X=np.array(df['Engine_Size'])
#Carve out the Z and W matrices
Z, W = dmatrices('Curb_Weight ~ Vehicle_Volume', df, return_type='dataframe')
#Regress Z on W
olsr_model_ZW = sm.OLS(endog=Z, exog=W)
olsr_model_ZW_results = olsr_model_ZW.fit()
#Get the conditional expectations E(Z|W)
Z_pred=olsr_model_ZW_results.predict(W)
Z_pred=np.array(Z_pred)
Z=np.array(df['Curb_Weight'])
#Construct the delta matrices
Z_delta=Z-Z_pred
X_delta=X-X_pred
#Calculate the conditional covariance
conditional_covariance = np.sum(Z_delta*X_delta)/(len(Z)-1)
print('Conditional Covariance between Curb_Weight and Engine_Size='+str(conditional_covariance))
```

We see the following output:

Conditional Covariance between Curb_Weight and Engine_Size=7789.498082862661

If we compare this value of **7789.5** with the total covariance of **18248.28** calculated earlier, we see that the covariance between Engine_Size and Curb_Weight net of the effect of Vehicle_Volume is indeed much smaller than the total covariance, which includes that effect.
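The same procedure can be wrapped into a small reusable helper, sketched here with plain NumPy on synthetic data. The function names `residuals` and `conditional_cov` are hypothetical, and the data is constructed so that a common driver `w` accounts for most of the covariance between `x` and `z`; none of it comes from the automobile data set.

```python
import numpy as np

def residuals(v, W):
    """Part of v left unexplained by an OLS regression of v on W (with intercept)."""
    A = np.column_stack([np.ones(len(v)), W])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

def conditional_cov(x, z, W):
    """Sample covariance of x and z after netting out W from both."""
    return np.sum(residuals(x, W) * residuals(z, W)) / (len(x) - 1)

# Synthetic data: w drives both x and z, so most of their covariance is due to w
rng = np.random.default_rng(1)
w = rng.normal(size=500)
x = 2.0 * w + rng.normal(size=500)
z = 3.0 * w + rng.normal(size=500)

total_cov = np.cov(x, z)[0, 1]
cond_cov = conditional_cov(x, z, w)
print(total_cov, cond_cov)
```

Note that calling `conditional_cov(x, x, w)` reduces to the conditional variance of `x` given `w`, which ties this helper back to the first half of the chapter.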


## References, Citations and Copyrights

### Data set

**The Automobile Data Set** citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. **Download link**
