We’ll understand what Omitted Variable Bias is and illustrate its calculation using a real-world data set

We’ll study the consequences of failing to include important variables in a linear regression model. For illustration, we’ll base our discussion on a real-world data set of automobile characteristics. Our goal will be to formulate a well-known result in statistical modeling called **Omitted Variable Bias** and to illustrate the calculation using the sample data set.

## The automobiles data set

The following data contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.

We’ll consider a subset of this data consisting of the following variables:

*City_MPG*

*Car_Volume*

*Curb_Weight*

*Engine_Size*

The Car_Volume variable is not present in the original data set. It is a new variable we have added as follows: Car_Volume = Length*Width*Height.
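For instance, assuming the raw data set carries columns named *Length*, *Width* and *Height* (the toy numbers below are illustrative, not the real data), *Car_Volume* could be derived in Pandas as follows:

```python
import pandas as pd

# A tiny frame standing in for the raw Ward's data; the values and the
# exact column names here are illustrative assumptions.
raw = pd.DataFrame({
    'Length': [168.8, 171.2],   # inches
    'Width':  [64.1, 65.5],
    'Height': [48.8, 52.4],
})

# Car_Volume is simply the product of the three exterior dimensions
raw['Car_Volume'] = raw['Length'] * raw['Width'] * raw['Height']
print(raw['Car_Volume'])
```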

The above 4-variable version of the data set is available for download **from here**.

### Regression goal

Our regression goal is to regress **City_MPG** on **Car_Volume**, **Curb_Weight** and **Engine_Size** using a **linear regression model**. The model equation is:

*City_MPG = β_1 + β_2·Car_Volume + β_3·Curb_Weight + β_4·Engine_Size + ϵ*

The error term *ϵ* of the regression model represents the effects of all the factors that the modeler has been unable to measure.

The matrix version of the above equation is written as follows:

*y = Xβ + ϵ*

Where,

- *y* is an *[n x 1]* size column vector containing the observed values of *City_MPG*, *n* being the number of data points.
- *β* is a *[4 x 1]* size column vector of regression model coefficients *β_1, β_2, β_3, β_4* corresponding to the *intercept, Car_Volume, Curb_Weight* and *Engine_Size*.
- *X* is an *[n x 4]* size matrix containing the values of the regression variables. The first column of this matrix is a column of 1s and it acts as the placeholder for the intercept *β_1*.
- *ϵ* is an *[n x 1]* size column vector of the model’s regression errors.
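As a small sketch of how the design matrix is laid out (the numbers below are synthetic, not drawn from the automobiles data), here is the *[n x 4]* matrix *X* in NumPy, with the leading column of 1s acting as the intercept placeholder:

```python
import numpy as np

# Synthetic stand-ins for the three regression variables (n = 5 rows)
n = 5
rng = np.random.default_rng(42)
car_volume  = rng.uniform(400_000, 600_000, n)
curb_weight = rng.uniform(1500, 4000, n)
engine_size = rng.uniform(60, 330, n)

# X: a column of 1s for the intercept, then one column per variable
X = np.column_stack([np.ones(n), car_volume, curb_weight, engine_size])
print(X.shape)   # (5, 4): n rows, one column per coefficient
```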

Let’s illustrate how the regression model’s equation looks using matrices:

The first column of the *X* matrix, represented by the column vector *x_1 = [x_11, …, x_n1]’*, is a column of 1s. Assuming a sample size of *n*, the above matrix representation is equivalent to writing out the following system of *n* regression equations:

Now, suppose we partition this system of equations into two parts as follows:

Here is the matrix representation of the above partitioning:

In general, we can express the above partition as follows:

*y = Xβ + zγ + ϵ* … (1)

We have substituted the partitioned-out regression variable *x_4* with the variable *z*, which is an *[n x 1]* column vector. *γ* (gamma) is a *[1 x 1]* “matrix” that takes the place of the regression coefficient *β_4*.
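The partition is easy to picture as a column split of the design matrix. The sketch below (synthetic numbers, not the automobiles data) keeps the first three columns as *X* and splits the last column out as *z*:

```python
import numpy as np

# A synthetic [n x 4] design matrix: intercept column plus 3 variables
n = 6
rng = np.random.default_rng(2)
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])

X = X_full[:, :3]        # intercept plus the first two variables
z = X_full[:, 3:]        # the partitioned-out column, an [n x 1] vector
print(X.shape, z.shape)  # (6, 3) (6, 1)
```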

When one trains (a.k.a. ‘fits’) the above mentioned linear model on a data set of *n* samples, the fitted model can be expressed as follows:

*y = Xβ_cap + zγ_cap + e* … (2)

Notice the cap or hat “^” symbol over *β* and *γ*, indicating that they are the fitted values, i.e. the estimates of the corresponding population-level values of *β* and *γ*. Also, in equation (2), the column vector of residual errors *e* takes the place of the column vector of regression errors *ϵ*. The *ith* residual error *e_i* is the difference between the *ith* observation *y_i* and the corresponding *ith* predicted value from the fitted model.
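A quick numerical sketch of fitted values and residuals (on synthetic data, for illustration only):

```python
import numpy as np

# Synthetic data from a known linear model with an intercept
rng = np.random.default_rng(5)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)

# Fit by least squares, then form predictions and residuals
b_cap = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b_cap          # ith predicted value
e = y - y_hat              # ith residual: e_i = y_i - y_hat_i

# With an intercept in the model, OLS residuals sum to (nearly) zero
print(abs(e.sum()) < 1e-9)   # True
```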

We have now prepared the ground for addressing the problem of what happens when you omit regression variables.

## The effect of omitting a regression variable

Let’s revisit the regression model for the automobiles data set:

*City_MPG = β_1 + β_2·Car_Volume + β_3·Curb_Weight + β_4·Engine_Size + ϵ*

Here’s the equation for the fitted model:

*City_MPG = β_1_cap + β_2_cap·Car_Volume + β_3_cap·Curb_Weight + β_4_cap·Engine_Size + e*

Suppose we fail to include the variable *Engine_Size* while building the model. This is akin to leaving out the term *zγ* from equation (1), or the term *zγ_cap* from equation (2).

If we solve the rest of equation (2), namely *y = Xβ_cap + e*, by minimizing the sum of squares of the residual errors *e*, it has a beautiful closed-form solution that can be expressed in matrix notation as follows:

*β_cap = (X’X)^(-1)X’y*

In the above equation:

- *β_cap* is a column vector of fitted regression coefficients of size *(k x 1)*, assuming there are *k* regression variables in the model including the intercept but excluding the variable that we have omitted.
- *X* is a matrix of regression variables of size *(n x k)*.
- *X’* is the transpose of *X*, i.e. *X* with its rows and columns interchanged. It’s as if *X* has been turned on its side. Hence *X’* is of size *(k x n)*. And therefore, *X’X* is of size *(k x k)*. Recollect that the product of two matrices of size *(k x n)* and *(n x k)* is a matrix of size *(k x k)*.
- *y* is a column vector of observed values of size *(n x 1)*.
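The closed-form least squares solution *β_cap = (X’X)^(-1)X’y* can be sketched in a few lines of NumPy. Below we check it against NumPy’s own least-squares solver on synthetic data (a sketch for illustration, not the automobiles data):

```python
import numpy as np

# Synthetic regression problem with a known coefficient vector
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal-equations estimate: beta_cap = (X'X)^(-1) X'y
beta_cap = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_cap, beta_lstsq))   # True
```

(In production code one would prefer `lstsq` or a QR decomposition over explicitly inverting *X’X*, for numerical stability.)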

*(X’X)^(-1)*, which is of size *(k x k)*, when multiplied with *X’* of size *(k x n)*, yields a matrix of size *(k x n)*, which when multiplied with *y* of size *(n x 1)* yields a matrix of size *(k x 1)*, which is exactly the dimensions of *β_cap*.

In the above equation, we will substitute *y* with *Xβ + zγ + ϵ* from equation (1) as follows:

*β_cap = (X’X)^(-1)X’(Xβ + zγ + ϵ)*

Next, we distribute out the terms inside the bracket as follows:

*β_cap = (X’X)^(-1)X’Xβ + (X’X)^(-1)X’zγ + (X’X)^(-1)X’ϵ* … (3)

The first term on the R.H.S. of equation (3) can be simplified to simply *β* as follows:

*(X’X)^(-1)X’Xβ = Iβ = β*

In the above simplification, *I* is an identity matrix of size *(k x k)*. *I* is the matrix equivalent of the number *1*. The multiplication of a matrix *A* with the inverse of *A* equates to *I*, in the same way that *(n)(1/n) = 1*.
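As a quick numerical check of this simplification, we can confirm in NumPy that *(X’X)^(-1)(X’X)* recovers the identity matrix (synthetic data, for illustration):

```python
import numpy as np

# A small synthetic design matrix with an intercept column (k = 3)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
XtX = X.T @ X

# inv(X'X) @ (X'X) recovers the (k x k) identity, up to float round-off
I = np.linalg.inv(XtX) @ XtX
print(np.allclose(I, np.eye(3)))   # True
```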

Let’s substitute the first term on the R.H.S. of equation (3) with *β* and restate the simplified equation (3) as follows:

*β_cap = β + (X’X)^(-1)X’zγ + (X’X)^(-1)X’ϵ*

The above equation gives us our first hint that the omission of the variable *z* may cause the fitted coefficient vector *β_cap* to be biased away from its true population value *β*, by an amount equal to the value of the remaining terms on the R.H.S.

Let’s recollect that the coefficient estimates in a fitted regression model are random variables that have a mean (a.k.a. expectation) and a variance around the mean.

Thus, it is not the point estimate of *β_cap* that we should be interested in. Instead, we ought to be calculating the following **conditional expectation**, a.k.a. the conditional mean of *β_cap*:

*E(β_cap|X)*

Accordingly, let’s take conditional expectations of both sides of the above equation as follows:

*E(β_cap|X) = E(β + (X’X)^(-1)X’zγ + (X’X)^(-1)X’ϵ | X)*

The expression on the R.H.S. of the above equation can be split out using the identity *E(A + B + C) = E(A) + E(B) + E(C)* as follows:

*E(β_cap|X) = E(β|X) + E((X’X)^(-1)X’zγ|X) + E((X’X)^(-1)X’ϵ|X)* … (4)

The first term on the right side, *E(β|X)*, is simply *β*, the true population value of the coefficients, which is constant.

Before we inspect the second term on the right, let’s simplify the third term using the identity *E(ABC) = E(A)E(B)E(C)*, which holds when the random variables *A*, *B* and *C* are independent of each other:

*E((X’X)^(-1)X’ϵ|X) = E((X’X)^(-1)|X)·E(X’|X)·E(ϵ|X)*

Now we come to an important observation.

One of the primary assumptions of the linear regression model is that the errors *ϵ*, conditioned upon the regression variables *X*, have a zero mean.

This property of the **errors being exogenous** implies that the expectation *E(ϵ|X) = 0*, where *0* is a column vector of size *(n x 1)* containing only zeroes. The expectation *E(X’|X)* is simply *X’* (the transpose of *X*), and it is of size *(k x n)*. Thus, these two pieces multiplied together yield the column vector *0* of size *(k x 1)*. Finally, the remaining factor is the inverse of the product of *X’* of size *(k x n)* and *X* of size *(n x k)*, so it equates to a matrix of size *(k x k)*. The product of this matrix with the *(k x 1)* column vector of zeroes is simply a column vector of zeroes of size *(k x 1)*.

Thus, the third term in equation (4) is effectively extinguished into a column vector of zeroes of size *(k x 1)*.

So far, we have shown that in equation (4), the first term on the right is the column vector *β* and the third term is the column vector *0*, both of size *(k x 1)*.

Now let’s look at the second term of equation (4). To simplify it, we’ll use the identity *E(AB) = E(A)E(B)*:

The factor *E(γ|X)* is simply *γ*, since *γ* is the population-level value of the coefficient of *z*, and therefore its expectation (mean) is the same as itself.

It would be instructive to compare the remaining expression *(X’X)^(-1)X’z* with the closed-form solution of the least squares regression of *y* on *X* (reproduced below):

*β_cap = (X’X)^(-1)X’y*

It’s easy to see that *(X’X)^(-1)X’z* is actually *the closed-form solution of the least squares regression of the omitted variable z on X*!

And therefore, we can express the second term of equation (4) as follows:

*E((X’X)^(-1)X’zγ|X) = γ·E(β_cap_zX|X)*

In the R.H.S. of the above equation:

- *γ* is the regression coefficient of the variable *z* when it is included in the *X* matrix while performing a regression of *y* on *X*. *γ* is a scalar and hence not bolded.
- *β_cap_zX* is the vector of fitted regression coefficients from regressing *z* on the rest of the variables in *X*.

We are now in a position to bring together all the pieces and state the formula for the expected value of the fitted coefficients *β_cap* of the regression of *y* on *X* when we omit a variable *z* from the regression:

*E(β_cap|X) = β + γ·E(β_cap_zX|X)* … (5)

In equation (5), *β_cap_zX* is a column vector of size *(k x 1)*, where *k* is the number of regression coefficients in the model (not including *z*), and *γ* is a scalar. Thus, when we omit a variable such as *z* from the model, the fitted coefficients of the resulting model are *off* from their true population values by an amount that is proportional to the covariance of *z* with the rest of the variables in *X*, as represented by the *E(β_cap_zX|X)* term.
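Equation (5) can be verified numerically with a small simulation. Everything below is synthetic (an assumed data-generating process, not the automobiles data): we generate a regressor *x*, an omitted variable *z* correlated with it, fit the model with and without *z*, and check that the biased slope equals the full-model slope plus *γ* times the coefficient from regressing *z* on *x*:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Assumed data-generating process: z is correlated with x
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(scale=0.5, size=n)
eps = rng.normal(size=n)
beta, gamma = 1.5, 2.0            # true population coefficients
y = beta * x + gamma * z + eps

# Full model: regress y on [1, x, z]
X_full = np.column_stack([np.ones(n), x, z])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Omitted-variable model: regress y on [1, x] only
X_omit = np.column_stack([np.ones(n), x])
b_omit = np.linalg.lstsq(X_omit, y, rcond=None)[0]

# Auxiliary regression of z on [1, x]: this plays the role of beta_cap_zX
delta = np.linalg.lstsq(X_omit, z, rcond=None)[0]

# Equation (5): the biased slope is the full-model slope plus gamma*delta
print(b_omit[1])                       # close to 1.5 + 2.0*0.8 = 3.1
print(b_full[1] + gamma * delta[1])    # matches the biased slope
```

The decomposition holds exactly in-sample: the coefficient from the short regression equals the long-regression coefficient plus *γ* times the auxiliary-regression coefficient.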

This analysis suggests the following two scenarios:

### The omitted variable z is correlated with the rest of the regression variables in X

In this case, *E(**β_cap_zX**|**X**)* is a non-zero vector. Thus, we reach the following important result: when the omitted variable *z* is correlated with the remaining regression variables, the expected value of the fitted coefficients *β_cap* differs from the true population value *β* by the amount *γ·E(β_cap_zX|X)*. This systematic deviation is the **Omitted Variable Bias**.

### The omitted variable z is uncorrelated with the rest of the regression variables X

In this situation, the column vector *E(**β_cap_zX**|**X**)* contains all zeroes. Consequently, the second term on the R.H.S. of Eq. (5) vanishes and the expected value of the fitted coefficients of the remaining model is equal to the population values *β*.

Even if the omitted variable is uncorrelated with the rest of the regression variables, there is a price to be paid for omitting it.

If the variance in the omitted variable *z* would have “explained” some of the variance in the response variable *y*, then leaving out *z* causes this unexplained variance to leak into the error term *ϵ* of the model, causing the variance of the errors to be larger and the **R-squared** to be smaller.
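A small synthetic check of this variance leakage (an assumed data-generating process, not the automobiles data): here *z* is generated independently of *x*, so omitting it adds no bias, yet the residual variance grows and R-squared shrinks:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# z is independent of x, so omitting it introduces no bias ...
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)

def fit_stats(X, y):
    """Return (R-squared, residual variance) of an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return 1 - resid.var() / y.var(), resid.var()

r2_full, var_full = fit_stats(np.column_stack([np.ones(n), x, z]), y)
r2_omit, var_omit = fit_stats(np.column_stack([np.ones(n), x]), y)

# ... but the unexplained variance of z leaks into the error term
print(r2_full > r2_omit, var_omit > var_full)   # True True
```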

There is an intuitive aspect to this result. If we go on removing relevant variables from the model, we will eventually be left with only the intercept of the regression, and that leads us to the **mean model**, namely *y_i = β_1 + ϵ_i*, in which *β_1* is the mean of *y*. All the variance in *y* that cannot be explained by the mean of *y* will spill over into the variance of the error term *ϵ*.

Let’s return to our data set of automobiles, and the regression model for the same:

*City_MPG = β_1 + β_2·Car_Volume + β_3·Curb_Weight + β_4·Engine_Size + ϵ*

Let’s examine the effect of omitting *Engine_Size*. As per equation (5), we would need to regress *Engine_Size* on *Car_Volume* and *Curb_Weight* (plus the intercept).

We’ll use the Python library Pandas to load the data set into memory:

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_4vars.csv', header=0)
```

Let’s print out the first few rows:

```
#Print the first few rows of the data set
print(df.head())
```

To judge the impact of omitting *Engine_Size*, let’s regress *Engine_Size* on *Car_Volume* and *Curb_Weight*.

```
# Here's the model expression in Patsy syntax. The intercept's presence is implied.
model_expr = 'Engine_Size ~ Car_Volume + Curb_Weight'
# carve out the X and y matrices using Patsy
y_train, X_train = dmatrices(model_expr,df, return_type='dataframe')
# Build an OLS regression model using Statsmodels
olsr_model = sm.OLS(endog=y_train, exog=X_train)
# Fit the model on (y, X)
olsr_results = olsr_model.fit()
#Print the training summary of the fitted model
print(olsr_results.summary())
```

Here’s the training summary:

The adjusted R-squared value of 0.753 and a significant F-statistic of 312.7 lead us to believe that *Engine_Size* is strongly correlated with *Car_Volume* and *Curb_Weight*.

Hence equation (5) suggests that if we omit the variable *Engine_Size *from the following regression model:

*City_MPG = β_1 + β_2·Car_Volume + β_3·Curb_Weight + β_4·Engine_Size + ϵ*

then the least squares linear regression of *City_MPG* on *Car_Volume* and *Curb_Weight* will yield fitted coefficients *β_cap = [β_1_cap, β_2_cap, β_3_cap]* that would be significantly biased from their true population values *β = [β_1, β_2, β_3]*.

Let us estimate this bias with equation (5), using a two-step procedure as follows:

**STEP 1:** We will first regress *City_MPG* on *Car_Volume*, *Curb_Weight* and *Engine_Size* (plus the *Intercept*):

```
model_expr = 'City_MPG ~ Car_Volume + Curb_Weight + Engine_Size'
y_train, X_train = dmatrices(model_expr, df, return_type='dataframe')
olsr_model = sm.OLS(endog=y_train, exog=X_train)
olsr_results = olsr_model.fit()
print(olsr_results.params)
```

We see the following output:

```
Intercept      44.218699
Car_Volume      0.000019
Curb_Weight    -0.012464
Engine_Size     0.008221
dtype: float64
```

In the above output, the estimated coefficient of *Engine_Size* is **0.008221**. This value takes the place of *γ* in equation (5). Note that in equation (5), *γ* is the true population value of this coefficient, while in practice we are using its estimated value **0.008221**.

**STEP 2:** We will now regress *Engine_Size* on *Car_Volume* and *Curb_Weight* (plus the *Intercept*):

```
model_expr = 'Engine_Size ~ Car_Volume + Curb_Weight'
y_train, X_train = dmatrices(model_expr, df, return_type='dataframe')
olsr_model = sm.OLS(endog=y_train, exog=X_train)
olsr_results = olsr_model.fit()
print(olsr_results.params)
```

We see the following output:

```
Intercept      2.256588
Car_Volume    -0.000165
Curb_Weight    0.088617
dtype: float64
```

This is the column vector *E(**β_cap_zX**|**X**):*

As per equation (5), if we scale this vector by *γ* (from step 1), we will get the estimate of the bias introduced in the regression model’s coefficient estimates if we omit *Engine_Size* from the model:
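Using the two sets of estimates printed above, the scaling can be done in a few lines of NumPy:

```python
import numpy as np

# gamma: the estimated coefficient of Engine_Size from the full model (step 1)
gamma = 0.008221

# Coefficients from regressing Engine_Size on the Intercept,
# Car_Volume and Curb_Weight (step 2)
beta_cap_zX = np.array([2.256588, -0.000165, 0.088617])

# Estimated omitted-variable bias for [Intercept, Car_Volume, Curb_Weight]
bias = gamma * beta_cap_zX
print(bias)
```

The bias falls mostly on the intercept and on the coefficient of *Curb_Weight*, which is what we would expect given how strongly *Engine_Size* co-varies with *Curb_Weight*.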

It is tempting to solve the problem of bias by not omitting the variable in question. But that can lead to another problem. If the omitted variable is correlated with other variables in the model (like *Engine_Size* is), then adding it back causes **multicollinearity**, a situation that makes the coefficients less precise. That is a topic for another chapter!

Stay tuned and happy modeling!

## References, Citations and Copyrights

### Data set

**The Automobile Data Set citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. **Download link**
