The Consequences of Omitting Important Variables From A Linear Regression Model

We’ll understand what Omitted Variable Bias is and illustrate its calculation using a real-world data set


We’ll study the consequences of failing to include important variables in a linear regression model. For illustration, we’ll base our discussion on a real-world data set of automobile characteristics. Our goal will be to formulate a well-known result in statistical modeling called Omitted Variable Bias and to illustrate the calculation using the sample data set.

The automobiles data set

The following data contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.

The automobiles data set (Source: UC Irvine)

We’ll consider a subset of this data consisting of the following variables:
City_MPG
Car_Volume
Curb_Weight
Engine_Size

The Car_Volume variable is not present in the original data set. It is a new variable we have added as follows: Car_Volume = Length*Width*Height.
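
If you are working from the original UC Irvine file, the new column can be added with one line of Pandas. The snippet below is a minimal sketch; the file name and the column names Length, Width and Height are assumptions made for illustration and may differ from the actual file.

import pandas as pd

# Load the original UCI automobiles data (file and column names assumed for illustration)
df_orig = pd.read_csv('automobile_uciml.csv', header=0)

# Derive the car's overall volume from its exterior dimensions
df_orig['Car_Volume'] = df_orig['Length'] * df_orig['Width'] * df_orig['Height']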

A subset of the Automobiles data set (Source: UC Irvine)

The above 4-variable version of the data set is available for download from here.

Regression goal

Our regression goal is to regress City_MPG on Car_Volume, Curb_Weight and Engine_Size using a linear regression model. The model equation is:

City_MPG = β_1 + β_2*Car_Volume + β_3*Curb_Weight + β_4*Engine_Size + ϵ

The error term ϵ of the regression model represents the effects of all the factors that the modeler has been unable to measure.

The matrix version of the above equation is written as follows:

Equation of a linear regression model (Image by Author)

Where,

  • y is an [n x 1] size column vector containing the observed values of City_MPG, where n is the number of data points.
  • β is a [4 x 1] size column vector of regression model coefficients β_1, β_2, β_3, β_4 corresponding to the intercept, Car_Volume, Curb_Weight and Engine_Size.
  • X is a [n x 4] size matrix containing the values of the regression variables. The first column of this matrix is a column of 1s and it acts as the placeholder for the intercept β_1.
  • ϵ is an [n x 1] size column vector of the model’s regression errors.

Let’s illustrate what the regression model’s equation looks like in matrix form:

A linear regression model containing three variables and an intercept (Image by Author)

The first column represented by the column vector x_1=[x_11,…x_n1]’ in the X matrix is a column of 1s. Assuming a sample size of n, the above matrix representation is equivalent to writing out the following system of n regression equations:

System of n regression equations in four variables plus the error term (Image by Author)

Now, suppose we partition this system of equations into two parts as follows:

A system of n equations partitioned into two parts (Image by Author)

Here is the matrix representation of the above partitioning:

A system of n equations partitioned into two parts (Image by Author)

In general, we can express the above partition as follows:

A partitioned linear regression model (Image by Author)

We have substituted the partitioned-out regression variable x_4 with the variable z, which is an [n x 1] column vector. γ (gamma) is a [1 x 1] “matrix” that takes the place of the regression coefficient β_4.
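
In terms of our data set, this partition amounts to splitting the design matrix column-wise: X keeps the intercept, Car_Volume and Curb_Weight, while z holds Engine_Size. Here is a minimal sketch of that split, assuming the 4-variable CSV file that we will load later in this chapter:

import pandas as pd
from patsy import dmatrices

df = pd.read_csv('automobile_uciml_4vars.csv', header=0)

# Build the full design matrix: Intercept, Car_Volume, Curb_Weight and Engine_Size
y, X_full = dmatrices('City_MPG ~ Car_Volume + Curb_Weight + Engine_Size', df, return_type='dataframe')

# Partition it: X keeps the intercept and two variables, z is the column we may omit
X = X_full[['Intercept', 'Car_Volume', 'Curb_Weight']]
z = X_full['Engine_Size']

print(X.shape, z.shape)  # (n, 3) and (n,)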

When one trains (a.k.a. ‘fits’) the above linear model on a data set of n samples, the fitted model can be expressed as follows:

The fitted partitioned linear regression model (Image by Author)

Notice the cap or hat “^” symbol over β and γ indicating that they are the fitted values i.e. the estimates of the corresponding population level values of β and γ. Also in equation (2), the column vector of residual errors e takes the place of the column vector of regression errors ϵ. The ith residual error e_i is the difference between the ith observation y_i and the corresponding ith predicted value from the fitted model.

We have now prepared the ground for addressing the problem of what happens when you omit regression variables.

The effect of omitting a regression variable

Let’s revisit the regression model for the automobiles data set:

City_MPG = β_1 + β_2*Car_Volume + β_3*Curb_Weight + β_4*Engine_Size + ϵ

Here’s the equation for the fitted model:

City_MPG = β_1_cap + β_2_cap*Car_Volume + β_3_cap*Curb_Weight + β_4_cap*Engine_Size + e

Suppose we fail to include the variable Engine_Size while building the model. This is akin to leaving out the zγ term from equation (1), or the zγ_cap term from equation (2).

If we solve the rest of equation (2), namely y = Xβ_cap + e, by minimizing the sum of squares of the residual errors e, we get a beautiful closed form solution that can be expressed in matrix notation as follows:

The closed form solution of y = Xβ_cap + e (Image by Author)

In the above equation:

  • β_cap is a column vector of fitted regression coefficients of size (k x 1) assuming there are k regression variables in the model including the intercept but excluding the variable that we have omitted.
  • X is a matrix of regression variables of size (n x k).
  • X’ is the transpose of X, i.e. X with its rows and columns interchanged. It’s as if X has been turned on its side. Hence X’ is of size (k x n). And therefore, X’X is of size (k x k). Recollect that the product of two matrices of size (k x n) and (n x k) is a matrix of size (k x k).
  • y is a column vector of observed values of size (n x 1).

The inverse (X’X)^-1, which is of size (k x k), when multiplied with X’ of size (k x n), yields a matrix of size (k x n), which when multiplied with y of size (n x 1) yields a matrix of size (k x 1), which is exactly the size of β_cap.
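
Here is a minimal NumPy sketch of this closed form solution applied to our data set (assuming the 4-variable CSV file introduced earlier), with Engine_Size left out of X:

import numpy as np
import pandas as pd

df = pd.read_csv('automobile_uciml_4vars.csv', header=0)

# Build X with a leading column of 1s as the placeholder for the intercept
X = np.column_stack([np.ones(len(df)), df['Car_Volume'], df['Curb_Weight']])
y = df['City_MPG'].to_numpy()

# beta_cap = (X'X)^-1 X'y, computed via a linear solve for numerical stability
beta_cap = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_cap)  # fitted intercept and coefficients of Car_Volume and Curb_Weight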

In the above equation, we will substitute y with Xβ + zγ + ϵ from equation (1) as follows:

Substituting for y with Xβ + zγ + ϵ (Image by Author)

Next, we distribute out the terms in the blue colored bracket as follows:

After distributing out the terms in the bracket (Image by Author)

The first term on the R.H.S. of equation (3) simplifies to β as follows:

Simplification of the first term (Image by Author)

In the above simplification, I is an Identity matrix of size (k x k). I is the matrix equivalent of the number 1. The multiplication of a matrix A with the inverse of A equates to I, in the same way that (n)*(1/n)=1.

Let’s substitute the first term on the R.H.S. of equation (3) with β and restate the simplified equation (3) as follows:

Simplified version of equation (3) (Image by Author)

The above equation gives us our first hint that the omission of variable z may cause the fitted coefficients vector β_cap to be biased away from the true population values β, by an amount equal to the value of the terms in the red colored box.

Let’s recollect that the coefficient estimates in a fitted regression model are random variables that have a mean (a.k.a. expectation) and a variance around the mean.

Thus, it is not the point estimate of β_cap that we should be interested in. Instead, we ought to be calculating the following conditional estimate a.k.a. conditional mean of β_cap:

E(β_cap|X)

Accordingly, let’s take conditional expectations of both sides of the above equation as follows:

After taking conditional expectations of both sides (Image by Author)

The blue colored expression on the R.H.S. of the above equation can be split out using the identity E(A + B + C) = E(A) + E(B) + E(C) as follows:

After splitting out the conditional expectation operator (Image by Author)

The first term on the right side E(β|X) is simply β, the true population value of the coefficients which are constant.

Before we inspect the second term on the right, let’s simplify the third term using the identity E(ABC)=E(A)E(B)E(C) assuming that random variables A, B and C are independent of each other:

The third term of equation (4) simplified using the product of expectations rule (Image by Author)

Now we come to an important observation.

One of the primary assumptions of the linear regression model is that the errors ϵ, conditioned upon the regression variables X, have a zero mean.

This property of the errors being exogenous implies that the gray colored expectation E(ϵ|X)=0, where 0 is a column vector of size (n x 1) containing only zeroes. The green colored expectation is simply X’ (the transpose of X) and it is of size (k x n). Thus, the green and gray bits multiplied together give the column vector 0 of size (k x 1). Finally, the yellow colored bit is the inverse of the product of X’ of size (k x n) and X of size (n x k), so it equates to a matrix of size (k x k). The product of this matrix with the (k x 1) column vector of zeroes is simply a column vector of zeroes of size (k x 1).

Thus, the third term in equation (4) is effectively extinguished into a column vector of zeroes of size (k x 1).

So far, we have shown that in equation (4), the first term on the right is the column vector β and the third term is the column vector 0, both of size (k x 1).

Now let’s look at the second term of equation (4). To simplify it, we’ll use the identity E(AB)=E(A)E(B):

The second term of equation (4) simplified using the product of expectations rule (Image by Author)

The gray colored term is simply γ since it’s the population level value of the coefficient of z, and therefore its expectation (mean) is the same as itself.

It would be instructive to compare the green bit inside the expectation on the R.H.S. with the closed form solution of the least squares regression of y on X (reproduced below):

The closed form solution of y = Xβ_cap + e (Image by Author)

It’s easy to see that the green bit is actually the closed form solution of the least squares regression of the omitted variable z on X!

And therefore, we can express the second term of equation (4) as follows:

The expected value of the fitted coefficients β_cap_zX of the regression of z on X (Image by Author)

In the R.H.S. of the above equation:

  • γ is the regression coefficient of the variable z when it is included in the X matrix while performing a regression of y on X. γ is a scalar and hence not bolded.
  • β_cap_zX is the vector of fitted regression coefficients from regressing z on the rest of the variables in X.

We are now in a position to bring together all the pieces and state the formula for the expected value of the fitted coefficients β_cap of the regression of y on X when we omit a variable z from the regression:

The expected value of the fitted coefficients when a variable is omitted from the regression (Image by Author)

In equation (5), β_cap_zX is a column vector of size (k x 1) where k is the number of regression coefficients in the model (not including z), and γ is a scalar. Thus, when we omit a variable such as z from the model, the fitted coefficients of the resulting model are off from their true population values by an amount that is proportional to the covariance of z with the rest of the variables in X, as represented by the E(β_cap_zX|X) term.

This analysis suggests the following two scenarios:

The omitted variable z is correlated with the rest of the regression variables in X.

In this case, E(β_cap_zX|X) is a non-zero vector. Thus, we reach the following important result:

When the omitted variable is correlated with the rest of the variables in the regression model, the least squares estimator for the remaining regression model is no longer unbiased. And thus, it is no longer BLUE (Best Linear Unbiased Estimator).

The omitted variable z is uncorrelated with the rest of the regression variables in X

In this situation, the column vector E(β_cap_zX|X) contains all zeroes. Consequently, the second term on the R.H.S. of Eq. (5) vanishes and the expected value of the fitted coefficients of the remaining model is equal to the population values β.

When the omitted variable is uncorrelated with the rest of the variables in the regression model, the least squares estimator for the remaining regression model continues to be unbiased, and thus, it stays BLUE.

Even if the omitted variable is uncorrelated with the rest of the regression variables, there is a price to be paid for omitting it.

If the variance in the omitted variable z would have “explained” some of the variance in the response variable y, then leaving out z causes this unexplained variance to leak into the error term ϵ of the model, causing the variance of the errors to be larger and R-squared to be smaller.

There is an intuitive aspect to this result. If we go on removing relevant variables from the model, we will be eventually left with only the intercept of regression and that leads us to the mean model, namely, y_i = β_1 + ϵ_i in which β_1 is the mean of y. All the variance in y that cannot be explained by the mean of y will spill over into the variance of the error term ϵ.
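
One way to see this effect on our data is to compare the estimated error variance and R-squared of the full model with those of a model that leaves out Engine_Size. The following is a minimal sketch, again assuming the 4-variable CSV file used in this chapter:

import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm

df = pd.read_csv('automobile_uciml_4vars.csv', header=0)

def fit_ols(model_expr):
    y, X = dmatrices(model_expr, df, return_type='dataframe')
    return sm.OLS(endog=y, exog=X).fit()

full_model = fit_ols('City_MPG ~ Car_Volume + Curb_Weight + Engine_Size')
reduced_model = fit_ols('City_MPG ~ Car_Volume + Curb_Weight')

# 'scale' is the estimated variance of the residual errors. We expect it to grow,
# and R-squared to shrink, when a relevant variable is dropped from the model.
print(full_model.scale, full_model.rsquared)
print(reduced_model.scale, reduced_model.rsquared)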


Let’s return to our data set of automobiles, and the regression model for the same:

City_MPG = β_1 + β_2*Car_Volume + β_3*Curb_Weight + β_4*Engine_Size + ϵ

Let’s examine the effect of omitting Engine_Size. As per equation (5), we would need to regress Engine_Size on Car_Volume and Curb_Weight (plus the intercept).

We’ll use the Python library Pandas to load the data set into memory:

import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt


#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_4vars.csv', header=0)

Let’s print out the first few rows:

#Print the first few rows of the data set
print(df.head())
The first few rows of the autos data set (Image by Author)

To judge the impact of omitting Engine_Size, let’s regress Engine_Size on Car_Volume and Curb_Weight.

# Here's the model expression in Patsy syntax. The intercept's presence is implied.
model_expr = 'Engine_Size ~ Car_Volume + Curb_Weight'

# carve out the X and y matrices using Patsy
y_train, X_train = dmatrices(model_expr,df, return_type='dataframe')

# Build an OLS regression model using Statsmodels
olsr_model = sm.OLS(endog=y_train, exog=X_train)

# Fit the model on (y, X)
olsr_results = olsr_model.fit()

#Print the training summary of the fitted model
print(olsr_results.summary())

Here’s the training summary:

Training summary from regressing Engine_Size on Car_Volume and Curb_Weight (Image by Author)

The Adjusted R-squared value of 0.753 and a significant F-statistic of 312.7 lead us to believe that Engine_Size is strongly correlated with Car_Volume and Curb_Weight.
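
Both statistics can also be read directly off the fitted results object, as in the short sketch below (reusing the olsr_results object fitted above):

# Pull the goodness-of-fit statistics out of the fitted results object
print('Adjusted R-squared:', olsr_results.rsquared_adj)
print('F-statistic:', olsr_results.fvalue, 'p-value:', olsr_results.f_pvalue)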

Hence equation (5) suggests that if we omit the variable Engine_Size from the following regression model:

City_MPG = β_1 + β_2*Car_Volume + β_3*Curb_Weight + β_4*Engine_Size + ϵ

then the least squares linear regression of City_MPG on Car_Volume and Curb_Weight will yield fitted coefficients β_cap = [β_1_cap, β_2_cap, β_3_cap] that would be significantly biased from their true population values β = [β_1, β_2, β_3].

Let us estimate this bias from equation (5) using a two-step procedure as follows:

STEP 1: We will first regress City_MPG on Car_Volume, Curb_Weight and Engine_Size (plus the Intercept):

model_expr = 'City_MPG ~ Car_Volume + Curb_Weight + Engine_Size'

y_train, X_train = dmatrices(model_expr, df, return_type='dataframe')

olsr_model = sm.OLS(endog=y_train, exog=X_train)

olsr_results = olsr_model.fit()

print(olsr_results.params)

We see the following output:

Intercept      44.218699
Car_Volume      0.000019
Curb_Weight    -0.012464
Engine_Size     0.008221
dtype: float64

In the above output, the estimated coefficient of Engine_Size is 0.008221. This value takes the place of γ in equation (5). Note that in equation (5), γ is the true population value of this coefficient, while in practice we are using its estimated value 0.008221.
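
The estimate of γ can be pulled out of the fitted results object by name; we’ll reuse it in the bias calculation further below:

# The fitted coefficient of Engine_Size plays the role of gamma in equation (5)
gamma_cap = olsr_results.params['Engine_Size']
print(gamma_cap)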

STEP 2: We will now regress Engine_Size on Car_Volume and Curb_Weight (plus the Intercept):

model_expr = 'Engine_Size ~ Car_Volume + Curb_Weight'

y_train, X_train = dmatrices(model_expr, df, return_type='dataframe')

olsr_model = sm.OLS(endog=y_train, exog=X_train)

olsr_results = olsr_model.fit()

print(olsr_results.params)

We see the following output:

Intercept      2.256588
Car_Volume    -0.000165
Curb_Weight    0.088617
dtype: float64

This is the column vector E(β_cap_zX|X):

The expected values of the fitted coefficients of Intercept, Car_Volume and Curb_Weight (in that order), from regressing Engine_Size on Car_Volume and Curb_Weight (plus the Intercept) (Image by Author)

As per equation (5), if we scale this vector by γ (from step 1), we will get the estimate of the bias introduced in the regression model’s coefficient estimates if we omit Engine_Size from the model:

The estimated bias introduced in the coefficient estimates of the Intercept, Car_Volume and Curb_Weight (in that order) from omitting Engine_Size from the regression model (Image by Author)
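
Here is a minimal sketch that puts the two steps together and cross-checks the result, reusing the df, dmatrices and sm objects from earlier. By equation (5), adding the estimated bias to the full model’s coefficients for the retained variables should approximately reproduce the coefficients obtained by directly regressing City_MPG on Car_Volume and Curb_Weight alone:

# STEP 1: full model of City_MPG; gamma_cap is the fitted coefficient of Engine_Size
y_full, X_full = dmatrices('City_MPG ~ Car_Volume + Curb_Weight + Engine_Size', df, return_type='dataframe')
full_results = sm.OLS(endog=y_full, exog=X_full).fit()
gamma_cap = full_results.params['Engine_Size']

# STEP 2: auxiliary model of Engine_Size on the retained variables
y_aux, X_aux = dmatrices('Engine_Size ~ Car_Volume + Curb_Weight', df, return_type='dataframe')
aux_results = sm.OLS(endog=y_aux, exog=X_aux).fit()

# Estimated omitted variable bias: gamma_cap * beta_cap_zX
bias = gamma_cap * aux_results.params

# Cross-check: fit the reduced model directly and compare its coefficients with
# (full-model coefficients of the retained variables) + bias
y_red, X_red = dmatrices('City_MPG ~ Car_Volume + Curb_Weight', df, return_type='dataframe')
reduced_results = sm.OLS(endog=y_red, exog=X_red).fit()

retained = full_results.params[['Intercept', 'Car_Volume', 'Curb_Weight']]
print(retained + bias)
print(reduced_results.params)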

It is tempting to solve the problem of bias by not omitting the variable in question. But that can lead to another problem. If the omitted variable is correlated with other variables in the model (like Engine_Size is), then adding it back causes multicollinearity, a situation that makes the coefficients less precise. That is a topic for another chapter!

Stay tuned and happy modeling!


References, Citations and Copyrights

Data set

The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download link


PREVIOUS: A Deep Dive Into The Variance-Covariance Matrices Used In Linear Regression

NEXT: The Consequences of Including Irrelevant Variables In A Linear Regression Model


UP: Table of Contents