Introduction To 2-Stage Least Squares (2SLS) Estimation - Statistical Modeling and Forecasting

We’ll learn how to use the 2SLS technique to estimate linear models containing Instrumental Variables

In this chapter, we’ll learn about two different ways to estimate a linear model using the Instrumental Variables technique.

In the previous chapter, we learnt about Instrumental Variables, what they are, and when and how to use them. Let’s recap what we learnt:

Consider the following linear model:

A linear model of two variables x_2 and x_3 (Image by Author)

In the above equation, y, 1, x_2, x_3, and ϵ are column vectors of size [n x 1]. In subsequent equations, we’ll drop the 1 (which is a vector of 1s) for brevity.

If one or more regression variables, say x_3, is endogenous, i.e., it is correlated with the error term ϵ, the Ordinary Least Squares (OLS) estimator is not consistent. The coefficient estimates it generates are biased away from the true values, putting into question the usefulness of the experiment.
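To make the inconsistency concrete, here is a minimal simulation sketch. The data-generating process is entirely made up for illustration: an unobserved factor u feeds into both x_3 and the error term, which is one way endogeneity can arise. All names and coefficient values are assumptions, not part of the chapter’s data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: u is an unobserved factor that
# drives both x_3 and the error term, which makes x_3 endogenous.
u = rng.normal(size=n)
x_2 = rng.normal(size=n)          # exogenous regressor
x_3 = rng.normal(size=n) + u      # endogenous regressor
eps = u + rng.normal(size=n)      # error term, correlated with x_3 through u

true_beta = np.array([1.0, 2.0, 3.0])   # intercept, x_2, x_3
X = np.column_stack([np.ones(n), x_2, x_3])
y = X @ true_beta + eps

# Ordinary least squares fit of y on X
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true:", true_beta)
print("OLS :", beta_ols.round(3))
```

In this setup Cov(x_3, ϵ)/Var(x_3) = 1/2, so the OLS coefficient of x_3 should settle near 3.5 rather than 3 no matter how large n gets, while the coefficient of the exogenous x_2 stays close to its true value.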

One way to rescue the situation is to devise a way to effectively “break” x_3 into two parts:

A chunk that is uncorrelated with ϵ which we will add back into the model in place of x_3. This is the part of x_3 that is in fact exogenous.
A second chunk that is correlated with ϵ which we will cut out of the model. This is the part that is endogenous.

And one way to accomplish this goal is to identify a variable z_3, “an instrument for x_3”, with the following properties:

It is correlated with x_3. That (to some extent) satisfies the first of the above two requirements, and
It is uncorrelated with the error term which takes care of the second requirement.

Replacing x_3 with z_3 yields the following model:

The linear model with the endogenous variable x_3 replaced with the variable z_3 (Image by Author)

All variables on the R.H.S of Eq (1a) are exogenous. This model can be consistently estimated using least-squares.

The above estimation technique can be easily extended to multiple endogenous variables and their corresponding instruments as long as each endogenous variable is paired one-to-one with a single unique instrumental variable.

The above example suggests a general framework for IV estimation which we present below.

A linear regression of y on X takes the following matrix form:

A linear model in which y is regressed on X (Image by Author)

Assuming a data set of size n, in Eq (2):

y is a vector of size [n x 1].
X is the matrix of regression variables of size [n x (k+1)], i.e. it has n rows and (k+1) columns of which the first column is a column of 1s and it acts as the placeholder for the intercept.
β is a column vector of regression coefficients of size [(k+1) x 1] where the first element β_1 is the intercept of regression.
ϵ is a column vector of regression errors of size [n x 1]. ϵ effectively holds the balance amount of variance in y that the model Xβ wasn’t able to explain.

Here’s how the above equation would look in matrix format:

The matrix version of the linear regression model (Image by Author)

Without loss of generality, and not counting the intercept, let’s assume that the first p regression variables in X are exogenous and the next q variables are endogenous such that p + q = k, i.e. X has 1 + p + q = k + 1 columns:

The X matrix composed of p exogenous variables and q endogenous variables (Image by Author)

Suppose we are able to identify q instrumental variables which would be the instruments for the corresponding q regression variables in X, namely x_(p+2) thru x_(k+1), that are suspected to be endogenous.

Let’s construct a matrix Z as follows:

The first column of Z will be a column of 1s.
The next p columns of Z, namely z_2 thru z_(p+1), will be identical to the p exogenous variables x_2 thru x_(p+1) in X.
The final set of q columns in Z, namely z_(p+2) thru z_(k+1), will hold the data for the q variables that would be the instruments for the corresponding q endogenous variables in X, namely x_(p+2) thru x_(k+1).

Thus, the size of Z is also [n x (k+1)] i.e. the same as that of X.

Next, we’ll take the transpose of Z, which interchanges the rows and columns. The transpose operation essentially turns Z on its side. The transpose of Z, denoted as Z’, is of size [(k+1) x n].

Now, let’s pre-multiply Eq (2) by Z’:

Pre-multiplying eq (2) with Z’ (Image by Author)

Eq (3) is dimensionally correct. On the L.H.S., Z’ is of size [(k+1) x n] and y is of size [n x 1]. Hence Z’y is of size [(k+1) x 1].

On the R.H.S., X is of size [n x (k+1)] and β is of size [(k+1) x 1]. Working left to right, Z’X is a square matrix of size [(k+1) x (k+1)] and (Z’X)β is of size [(k+1) x 1].

Similarly, ϵ is of size [n x 1]. So Z’ϵ is also of size [(k+1) x 1].
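The dimension bookkeeping above is easy to confirm numerically. The following is a small sketch with arbitrary sizes and random stand-in data (the values of n, p and q are assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 2, 1      # hypothetical sizes
k = p + q               # so that X and Z each have k+1 columns

ones = np.ones((n, 1))
X_exog = rng.normal(size=(n, p))    # stand-ins for the exogenous regressors
X_endog = rng.normal(size=(n, q))   # stand-ins for the endogenous regressors
instr = rng.normal(size=(n, q))     # stand-ins for the instruments

X = np.hstack([ones, X_exog, X_endog])   # [n x (k+1)]
Z = np.hstack([ones, X_exog, instr])     # [n x (k+1)]
y = rng.normal(size=(n, 1))
eps = rng.normal(size=(n, 1))

print((Z.T @ y).shape)     # (k+1, 1)
print((Z.T @ X).shape)     # (k+1, k+1)
print((Z.T @ eps).shape)   # (k+1, 1)
```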

Now, let’s apply the expectation operator E(.) on both sides of Eq. (3):

Applying the expectation operator to both sides of Eq (3) (Image by Author)

E(Z’y) and E(Z’Xβ) resolve respectively to Z’y and Z’Xβ.

Recollect that Z contains only exogenous variables. Therefore, Z and ϵ are not correlated and hence the mean value of (Z’ϵ) is a column vector of zeros, and Eq (3a) resolves to the following:

Z’y = Z’Xβ    (4)

Next, we’ll isolate the coefficients vector β on the R.H.S. of (4) by multiplying both sides of Eq (4) with the inverse of the square matrix (Z’X).

The inverse of a matrix is conceptually the multi-dimensional equivalent of the inverse of a scalar number N (assuming N is non-zero). The inverse of a matrix is calculated using a complex formula which we’ll skip getting into.

It is possible to show that (Z’X) is invertible (again something we won’t get into here). Pre-multiplying both sides of Eq. (4) by the inverse of (Z’X), namely (Z’X)^-1, gets us the following:

(Z’X)^-1 Z’y = (Z’X)^-1 (Z’X) β    (5)

On the R.H.S., (Z’X)^-1 and (Z’X) cancel each other out and yield an identity matrix in the same way as N*(1/N) equals 1, leaving us with the following equation for estimating the coefficients vector β of the instrumented model:

The formula for calculating the regression coefficients of the regression model containing instrumental variables (Image by Author)

Notice that Z, X and y are all observable quantities and so all regression coefficients can be estimated in one shot using Eq (6) provided there is a one-to-one correspondence between the endogenous variables in X and the chosen instruments in Z.

There is one final point that must be mentioned about Eq (6). Strictly speaking, Eq (6) holds only asymptotically, i.e. as the number of data samples n → ∞. But in practice, and for a set of mathematical reasons that probably deserve their own chapter, we can use it to calculate the coefficient estimates of a model estimated via IV on finite sized samples, in other words, on a real world data set.

Thus, the finite sample IV estimator β_cap_IV of β can be stated as follows:

The finite sample estimator of regression coefficients for the model containing IVs (Image by Author)
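As a rough illustration of the finite sample IV estimator, the sketch below applies (Z’X)^-1 Z’y to simulated data with one endogenous variable and one instrument. The data-generating process, variable names and coefficient values are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

u = rng.normal(size=n)                       # unobserved factor
z_3 = rng.normal(size=n)                     # instrument: drives x_3, unrelated to eps
x_2 = rng.normal(size=n)                     # exogenous regressor
x_3 = 0.8 * z_3 + u + rng.normal(size=n)     # endogenous regressor
eps = u + rng.normal(size=n)                 # error term, correlated with x_3

true_beta = np.array([1.0, 2.0, 3.0])        # intercept, x_2, x_3
X = np.column_stack([np.ones(n), x_2, x_3])
Z = np.column_stack([np.ones(n), x_2, z_3])  # x_3 swapped for its instrument
y = X @ true_beta + eps

# IV estimator (Z'X)^-1 Z'y, solved without forming the inverse explicitly
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
# OLS estimator (X'X)^-1 X'y for comparison
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("true:", true_beta)
print("IV  :", beta_iv.round(3))
print("OLS :", beta_ols.round(3))
```

Using np.linalg.solve instead of explicitly inverting (Z’X) is a numerical convenience; it computes the same quantity.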

Now, let’s look at the case where there is more than one Instrumental Variable defined for an endogenous variable.

Consider the following regression model of wages:

Log of wage regressed on a variety of variables (Image by Author)

In the above model, we regress the natural log of wage instead of the raw wage as wage data is often right-skewed and logging it can reduce the skew. Education is measured in terms of years of schooling. College and city are boolean variables indicating whether the person went to college and whether they live in a city. Unemp contains the percentage unemployment rate in the county of residence.

Our X matrix is [1, age, experience, college, city, unemp, education], where each variable is a column vector of size [n x 1] and the size of X is [n x 7].

We’ll argue that education is endogenous. As a measure, years of schooling captures only what is taught in school or college. It leaves out aspects such as how well the person has grasped the material, their knowledge of topics outside of the curriculum and so on, all of which are left unobserved and are therefore captured in the error term ϵ.

We’ll propose two variables, mother’s number of years of schooling (meducation) and father’s number of years of schooling (feducation) as the IVs for the person’s education.

The relevance and exogeneity conditions

Our chosen IVs need to pass the relevance condition. If a regression of education on the rest of the variables in X plus meducation and feducation reveals (via an F-test) that meducation and feducation are jointly significant, the two IVs pass the relevance condition.
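Here is a sketch of how such a joint F-test could be run with statsmodels. The data frame below is simulated purely for illustration (it is not the wage data, and it carries only one control, age); f_test with a string of constraints is the statsmodels call used for the joint test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500

# Simulated stand-in data; column names mirror the chapter's variables
df_sim = pd.DataFrame({
    "age": rng.integers(25, 60, size=n),
    "meducation": rng.integers(6, 17, size=n),
    "feducation": rng.integers(6, 17, size=n),
})
df_sim["education"] = (
    4 + 0.3 * df_sim["meducation"] + 0.3 * df_sim["feducation"]
    + rng.normal(scale=2.0, size=n)
)

# Regress education on the other exogenous variable(s) plus both instruments
first_stage = smf.ols("education ~ age + meducation + feducation", data=df_sim).fit()

# Joint null hypothesis: both instrument coefficients are zero
joint_test = first_stage.f_test("(meducation = 0), (feducation = 0)")
f_stat = float(np.squeeze(joint_test.fvalue))
p_value = float(np.squeeze(joint_test.pvalue))
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

A small p-value rejects the null that both instrument coefficients are zero, which is what the relevance condition asks for.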

The error term ϵ is inherently unobservable. So the exogeneity condition for the IVs cannot be directly tested. Instead, we take it upon faith that parents’ number of years of schooling is unlikely to be correlated with factors such as the child’s grasp of material, i.e. the factors that are hiding in the error term and which are making education endogenous. But we could be wrong about this. We’ll soon find out.

The regression model containing IVs

Our regression model with IVs is as follows:

Log of wage regressed on a variety of variables including two IVs: meducation and feducation (Image by Author)

Our Z matrix is [1, age, experience, college, city, unemp, meducation, feducation], where each variable is a column vector of size [n x 1] and the size of Z is [n x 8]. Notice how we have replaced education with its two IVs.

And the coefficient vector to be estimated is:

β_cap_IV=[β*_1_cap, β*_2_cap, β*_3_cap, β*_4_cap, β*_5_cap, β*_6_cap, β*_7_cap, β*_8_cap]

Where the caps indicate estimated values.

With X and Z defined, can we use Eq (6a) to perform a single-shot calculation of β_cap_IV?

Unfortunately, the answer is no.

Recollect that the size of Z is [n x 8]. So, the size of Z’ is [8 x n]. The size of X is [n x 7]. Hence Z’X has size [8 x 7] which is not a square matrix and therefore not invertible. Thus, Eq. (6a) cannot be used when multiple instrumental variables such as meducation and feducation are used to represent a single endogenous variable such as education.
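The shape problem is easy to reproduce. In this sketch, X and Z are filled with random numbers (placeholders, not the wage data) and only their column counts matter:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Random placeholders with the same column counts as in the wage example
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])   # [n x 7]
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 7))])   # [n x 8]

ZtX = Z.T @ X
print(ZtX.shape)   # 8 rows, 7 columns: not square

try:
    np.linalg.inv(ZtX)
    inverted = True
except np.linalg.LinAlgError as err:
    inverted = False
    print("Cannot invert:", err)
```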

This difficulty suggests that we explore a different approach for estimating β_cap_IV. This different approach is a two-stage OLS estimator.

The 2-stage OLS estimator

We begin by developing the first stage of this estimator.

The First Stage

In this stage, we’ll regress education on age, experience, college, city, unemp, meducation, and feducation.

Let’s suppose that we have determined via the F-test that education is indeed correlated with the IVs meducation and feducation.

We will now regress education not only on meducation and feducation but also the other variables which allows us to account for the effect of possible correlations between the non-IV variables and the IV variables. See my earlier chapter on Instrumental Variables for a detailed explanation of this effect.

Education regressed on the Z matrix (Image by Author)

ν is the error term. The above model can be consistently estimated using OLS as all regression variables are exogenous. The estimated model has the following form:

Estimated value of education after fitting the model that regresses education on the Z matrix (Image by Author)

In the above equation, education_cap is the estimated (a.k.a. predicted) value of education. The caps on the coefficients similarly indicate estimated values.

The above OLS based regression represents the first stage of a two-stage OLS (2SLS) estimation that we are about to do.

The second stage

The key insight to be had about the first stage is that education_cap contains only the portion of variance of education that is exogenous, i.e. not correlated with the error term.

Therefore, we can replace education in the original model of ln(wage) with education_cap to form a model that contains only exogenous regression variables, as follows:

Log of wage regressed on X with education replaced with its exogenous portion (Image by Author)

Since the above model contains only exogenous regression variables, it can be consistently estimated using OLS. This estimation forms the second stage of the 2-stage OLS estimator.
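The two stages can be sketched by hand with two least-squares fits. The data below is simulated: the variable names echo the wage example, but the data-generating process (including the unobserved "ability" term and every coefficient) is an assumption made only so that the true effect of education, 0.1, is known.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000

ability = rng.normal(size=n)       # unobserved, ends up in the error term
age = rng.normal(size=n)           # exogenous control (standardized)
meducation = rng.normal(size=n)    # instrument 1
feducation = rng.normal(size=n)    # instrument 2
education = (0.5 * meducation + 0.5 * feducation + 0.2 * age
             + ability + rng.normal(size=n))
ln_wage = (1.0 + 0.1 * education + 0.05 * age
           + 0.5 * ability + rng.normal(scale=0.5, size=n))

ones = np.ones(n)
Z = np.column_stack([ones, age, meducation, feducation])

# Stage 1: regress education on all exogenous variables and instruments
gamma_hat, *_ = np.linalg.lstsq(Z, education, rcond=None)
education_hat = Z @ gamma_hat

# Stage 2: regress ln_wage on the regressors, with education replaced by its fitted value
X_hat = np.column_stack([ones, age, education_hat])
beta_2sls, *_ = np.linalg.lstsq(X_hat, ln_wage, rcond=None)

# Plain OLS on the original regressors, for comparison
X = np.column_stack([ones, age, education])
beta_ols, *_ = np.linalg.lstsq(X, ln_wage, rcond=None)

print("2SLS education coefficient:", beta_2sls[2].round(3))
print("OLS  education coefficient:", beta_ols[2].round(3))
```

One caveat with doing the stages by hand: the coefficient estimates are the 2SLS estimates, but the standard errors reported by an ordinary OLS routine in the second stage are not the correct 2SLS standard errors, which is one reason to prefer a dedicated 2SLS implementation in practice.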

General Framework of 2SLS

For those of you with a flair for linear algebra, the general framework of 2-stage least squares is as follows (if you like, you may skip this section to go straight to the Python tutorial on 2SLS):

Let’s work on the first stage.

Stage 1

In stage 1, we estimate the following model. To keep things general, X contains not just the endogenous education but also the rest of the variables, γ is the matrix of regression coefficients (one column of coefficients for each column of X), and ν is the error term:

A linear regression of X on Z (Image by Author)

The least-squares estimator of γ is calculated using the standard formula for the least-squares estimator:

γ_cap = (Z’Z)^-1 Z’X    (6b)

Using γ_cap, the estimated value of X is given by:

Estimation of X using the estimated coefficients γ_cap (Image by Author)

This completes the first stage of 2-SLS.

Now, let’s work on the second stage.

Stage 2

Let’s recollect Eq (6a), which is the IV estimator we had constructed for the case where there is a one-to-one correspondence between the endogenous variables in X and the instruments in Z:

β_cap_IV = (Z’X)^-1 Z’y    (6a)

We’ll plug in X_cap from Eq (6c) in place of Z in Eq (6a) to get β_cap_2SLS as follows:

The coefficients of the instrumented model, estimated using 2-stage Least Squares (Image by Author)

This completes the formulation of the 2-SLS estimator. All matrices on the R.H.S. of Eq (6b) are entirely observable to the experimenter. The estimation of coefficients can be carried out by simply applying equations (6b), (6c) and (6d) in that sequence:

The 2-stage Least Squares (2SLS) estimator (Image by Author)
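The three steps translate almost line for line into NumPy. This is a sketch under stated assumptions: the function name two_stage_least_squares is made up for illustration, and the data is simulated with two instruments for a single endogenous variable, i.e. the case the one-shot IV formula could not handle.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """Return (gamma_hat, X_hat, beta_2sls) for regressors X and instruments Z."""
    gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)           # (Z'Z)^-1 Z'X
    X_hat = Z @ gamma_hat                                   # fitted values of X
    beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)   # (X_hat'X)^-1 X_hat'y
    return gamma_hat, X_hat, beta_2sls

rng = np.random.default_rng(5)
n = 50_000
u = rng.normal(size=n)                                      # unobserved factor
z_a, z_b = rng.normal(size=n), rng.normal(size=n)           # two instruments
x_exog = rng.normal(size=n)
x_endog = 0.6 * z_a + 0.6 * z_b + u + rng.normal(size=n)
y = 1.0 + 2.0 * x_exog + 3.0 * x_endog + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_exog, x_endog])          # [n x 3]
Z = np.column_stack([np.ones(n), x_exog, z_a, z_b])         # [n x 4]

gamma_hat, X_hat, beta_2sls = two_stage_least_squares(y, X, Z)
print(gamma_hat.shape, X_hat.shape)
print(beta_2sls.round(3))
```

Because the constant and x_exog appear in both X and Z, the first stage reproduces those two columns of X exactly; only the endogenous column is actually replaced by its fitted values.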

A tutorial on estimating a linear model using 2SLS using Python and statsmodels

We’ll use the following cross-sectional data from a 1976 Panel Study of Income Dynamics of married women based on data for the previous year, 1975.

PSID76 data set (Source: R datasets under GPL v3)

Each row contains hourly wage data and other variables about a married female participant. The data set contains several variables. The ones of interest to us are as follows:

wage: Average hourly wage in 1975 dollars
education: years of schooling of participant
meducation: years of schooling of mother of participant
feducation: years of schooling of father of participant
participation: Did the individual participate in the labor force in 1975? (yes/no). We consider only those individuals who participated in 1975.

Our goal is to estimate the effect of education as approximated by number of years of schooling on the hourly wage, specifically log of hourly wage, of married female respondents in 1975.

A model to estimate the effect of education on log of hourly wage (Image by Author)

As argued earlier, education is likely endogenous, hence a straight-up estimation using OLS will yield biased coefficient estimates. Specifically, OLS will likely overestimate the coefficient of education, i.e. it will overestimate the effect of education on hourly wages.

We’ll try to remediate this situation by using meducation and feducation as instruments for education.

We’ll use Python, Pandas and Statsmodels to load the data set and build and train the model. Let’s start by importing the required packages:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.api import add_constant
from statsmodels.sandbox.regression.gmm import IV2SLS

Let’s load the data set into a Pandas Dataframe:

df = pd.read_csv('PSID1976.csv', header=0)

Next, we’ll use a subset of the data set where participation=yes.

df_1975 = df.query("participation == 'yes'").copy()

We’ll need to verify that the instruments meducation and feducation satisfy the relevance condition. For that, we’ll regress education on meducation and feducation, and verify using the F-test that the coefficients of meducation and feducation in this regression are jointly significant.

reg_expr = 'education ~ meducation + feducation'

olsr_model = smf.ols(formula=reg_expr, data=df_1975)

olsr_model_results = olsr_model.fit()

print(olsr_model_results.summary())

We see the following output:

Training summary of the linear model that regresses education on meducation, feducation and a constant (Image by Author)

The coefficients of meducation and feducation are individually significant at a p of < 0.001 as indicated by their p-values which are basically zero. The coefficients are also jointly significant at a p of 2.96e-22 i.e. < .001. meducation and feducation clearly meet the relevance condition for IVs of education.

We’ll now build a linear model for the wage equation and using statsmodels, we’ll train the model using the 2SLS estimator.

We’ll start by building the design matrices. The dependent variable is ln(wage):

ln_wage = np.log(df_1975['wage'])

Statsmodels’ IV2SLS estimator is defined as follows:

statsmodels.sandbox.regression.gmm.IV2SLS(endog, exog, instrument=None)

Statsmodels needs the endog, exog and instrument matrices to be constructed in a specific way as follows:

endog is an [n x 1] matrix containing the dependent variable. In our example, it is the ln_wage variable.

exog is an [n x (k+1)] size matrix that must contain all the endogenous and exogenous variables, plus the constant. In our example, apart from the constant, we do not have any exogenous variables defined in our wage equation. So exog will contain two columns: the constant and education.

instrument is a matrix that contains the instrumental variables. Additionally, Statsmodels’ IV2SLS estimator requires instrument to also contain all variables from the exog matrix that are not being instrumented. In our example, the instrumental variables are meducation and feducation. The only variable in exog that is not being instrumented is the placeholder column for the intercept. Hence, our instrument matrix will contain three columns: the constant, meducation and feducation.

Let’s build out the three matrices:

df_1975['ln_wage'] = np.log(df_1975['wage'])

exog = df_1975[['education']]
exog = add_constant(exog)

instruments = df_1975[['meducation', 'feducation']]
instruments = add_constant(instruments)

Now let’s build and train the IV2SLS model:

iv2sls_model = IV2SLS(endog=df_1975['ln_wage'], exog=exog, instrument=instruments)

iv2sls_model_results = iv2sls_model.fit()

And let’s print the training summary:

print(iv2sls_model_results.summary())

Training summary of the 2SLS model (Image by Author)

Interpretation of results of the 2SLS model

Since our primary interest is in estimating the effect of education on hourly wages, we’ll focus our attention on the coefficient estimate of the education variable.

We see that the 2SLS model has estimated the coefficient of education as 0.0505 with a standard error of 0.032 and a 95% confidence interval of -0.013 to 0.114. The p-value of 0.117 means the estimate is not statistically significant at the conventional 5% or even 10% level; it is significant only at a (1 - 0.117) × 100% = 88.3% confidence level. Overall, and as expected for a 2SLS model, the estimate lacks precision.

Note that the dependent variable is log(wage). To calculate the proportional change in hourly wages for each unit change (i.e. one year) of education, we must exponentiate the coefficient of education.

e^(0.0505) ≈ 1.0518, implying that each additional year of education is estimated to multiply hourly wages by about 1.0518, i.e. to raise them by about 5.2%.
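When the dependent variable is in natural logs, a coefficient b corresponds to multiplying the outcome by e^b for a one-unit rise in the regressor, i.e. a percentage change of (e^b - 1) × 100. A small helper makes the conversion explicit; the function name is made up for illustration, and the inputs are the figures from the 2SLS summary discussed above.

```python
import math

def pct_change_per_unit(coef):
    """Percentage change in the outcome for a one-unit rise in the regressor
    when the dependent variable is in natural logs."""
    return (math.exp(coef) - 1.0) * 100.0

coef_2sls = 0.0505                 # education coefficient from the 2SLS summary
ci_low, ci_high = -0.013, 0.114    # its reported 95% confidence interval

print(f"point estimate: {pct_change_per_unit(coef_2sls):.2f}%")
print(f"95% CI: {pct_change_per_unit(ci_low):.2f}% to {pct_change_per_unit(ci_high):.2f}%")
```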

Comparison of the IV estimator with an OLS estimator

Let’s compare the performance of the 2SLS model with a straight-up OLS model that regresses log(wage) on education.

reg_expr = 'ln_wage ~ education'

olsr_model = smf.ols(formula=reg_expr, data=df_1975)

olsr_model_results = olsr_model.fit()

print(olsr_model_results.summary())

Training summary of the OLS model (Image by Author)

We’ll focus our attention on the estimated value of the coefficient of education. At 0.1086, it is roughly double the estimate reported by the 2SLS model.

e^(0.1086) ≈ 1.1147, implying that each additional year of education is estimated to multiply hourly wages by about 1.1147, i.e. to raise them by about 11.5%.

The higher estimate from OLS is expected due to the suspected endogeneity of education. In practice, depending on the situation we are modeling, we may want to accept the more conservative estimate of 0.0505 reported by the 2SLS model. However, (and against the 2SLS model), the coefficient estimate from the OLS model is highly significant with a p-value that is essentially zero. Recollect that the estimate from the 2SLS model was significant at only an 88% confidence level.

Also, (and again as expected from the OLS model), the coefficient estimate of education reported by the OLS model has a much smaller standard error (0.014) as compared to that from the 2SLS model (0.032). And therefore, the corresponding 95% CIs from the OLS model are much tighter than those estimated by the 2SLS model.
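The link between standard error and interval width can be checked with a few lines of arithmetic. The sketch below uses a normal approximation (critical value 1.96), which is an assumption on our part; the intervals printed in the training summaries may differ slightly in the third decimal place.

```python
# (coefficient, standard error) of education as reported by each model
estimates = {
    "2SLS": (0.0505, 0.032),
    "OLS": (0.1086, 0.014),
}
z_crit = 1.96  # normal approximation to the 95% critical value (assumption)

intervals = {}
for name, (coef, se) in estimates.items():
    low, high = coef - z_crit * se, coef + z_crit * se
    intervals[name] = (low, high)
    print(f"{name}: [{low:.3f}, {high:.3f}], width = {high - low:.3f}")
```

The 2SLS interval comes out more than twice as wide as the OLS one and straddles zero, while the OLS interval lies entirely above zero.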

For comparison, here are the coefficient estimates of education and corresponding 95% CIs from the two models:

Comparison of coefficient estimates for education reported by the 2SLS and the OLS models (Image by Author)

With the IV estimator, one trades precision of estimates for the removal of endogeneity and the consequent bias in the estimates.

And here’s a comparison of the main effect of education estimated by the two models on hourly wages:

A comparison of the main effect of education on hourly wages estimated by the 2SLS and OLS models (Image by Author)

Data set and tutorial code

The wages data set used in the chapter can be accessed from this link. The associated documentation can be found here.

Here is the complete source code shown in this chapter:

	import pandas as pd
	import numpy as np
	import statsmodels.formula.api as smf
	from statsmodels.api import add_constant
	from statsmodels.sandbox.regression.gmm import IV2SLS


	#Load the Panel Study of Income Dynamics (PSID) into a Dataframe
	df = pd.read_csv('PSID1976.csv', header=0)

	#Use a subset of the dataset where participation=yes. Take a copy so that a new
	# column can be added later without modifying a view of df
	df_1975 = df.query("participation == 'yes'").copy()
	print(df_1975)

	#Let's confirm that meducation and feducation satisfy the relevance condition for education
	reg_expr = 'education ~ meducation + feducation'

	#Build and train an OLS model that regresses education on meducation and feducation and verify
	# using the F-test that coefficients of meducation and feducation are jointly significant
	olsr_model = smf.ols(formula=reg_expr, data=df_1975)
	olsr_model_results = olsr_model.fit()
	print(olsr_model_results.summary())

	#Build the dependent variable column
	df_1975['ln_wage'] = np.log(df_1975['wage'])

	#Build out the exog matrix. Statsmodels requires this matrix to contain all the endogenous and
	# exogenous variables, plus the constant.
	exog = df_1975[['education']]
	exog = add_constant(exog)

	#Build out the instruments matrix. Statsmodels requires this matrix to contain not only all the
	# instruments but also the variables in exog that will NOT be instrumented
	instruments = df_1975[['meducation', 'feducation']]
	instruments = add_constant(instruments)

	#Build and train the IV2SLS model
	iv2sls_model = IV2SLS(endog=df_1975['ln_wage'], exog=exog, instrument=instruments)
	iv2sls_model_results = iv2sls_model.fit()

	#Print the training summary
	print(iv2sls_model_results.summary())

	#Compare the performance of 2SLS with OLS of ln(wage) on education
	reg_expr = 'ln_wage ~ education'
	olsr_model = smf.ols(formula=reg_expr, data=df_1975)
	olsr_model_results = olsr_model.fit()
	print(olsr_model_results.summary())


Citations and Copyrights

Data set

The Labor Force Participation data set is available as part of R data sets. It is made available by Vincent Arel-Bundock as part of the Rdatasets package under the GPL-3 license.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
