# Introduction To 2-Stage Least Squares (2SLS) Estimation

We’ll learn how to use the 2SLS technique to estimate linear models containing Instrumental Variables

In this chapter, we’ll learn about two different ways to estimate a linear model using the Instrumental Variables technique.

In the previous chapter, we learnt about Instrumental Variables, what they are, and when and how to use them. Let’s recap what we learnt:

Consider the following linear model: A linear model of two variables x_2 and x_3 (Image by Author)

In the above equation, y, 1, x_2, x_3, and ϵ are column vectors of size [n x 1]. From subsequent equations, we’ll drop the 1 (which is a vector of 1s) for brevity.

If one or more regression variables, say x_3, is endogenous, i.e., it is correlated with the error term ϵ, the Ordinary Least Squares (OLS) estimator is not consistent. The coefficient estimates it generates are biased away from the true values, putting into question the usefulness of the experiment.

One way to rescue the situation is to devise a way to effectively “break” x_3 into two parts:

1. A chunk that is uncorrelated with ϵ which we will add back into the model in place of x_3. This is the part of x_3 that is in fact exogenous.
2. A second chunk that is correlated with ϵ which we will cut out of the model. This is the part that is endogenous.

And one way to accomplish this goal is to identify a variable z_3, “an instrument for x_3”, with the following properties:

1. It is correlated with x_3. That (to some extent) satisfies the first of the above two requirements, and
2. It is uncorrelated with the error term which takes care of the second requirement.

Replacing x_3 with z_3 yields the following model: The linear model with the endogenous variable x_3 replaced with the variable z_3 (Image by Author)

All variables on the R.H.S of Eq (1a) are exogenous. This model can be consistently estimated using least-squares.

The above estimation technique can be easily extended to multiple endogenous variables and their corresponding instruments as long as each endogenous variable is paired one-on-one with a single unique instrumental variable.

The above example suggests a general framework for IV estimation which we present below.

A linear regression of y on X takes the following matrix form: A linear model in which y is regressed on X (Image by Author)

Assuming a data set of size n, in Eq (2):

• y is a vector of size [n x 1]
• X is the matrix of regression variables of size [n x (k+1)], i.e. it has n rows and (k+1) columns of which the first column is a column of 1s and it acts as the placeholder for the intercept.
• β is a column vector of regression coefficients of size [(k+1) x 1] where the first element β_1 is the intercept of regression.
• ϵ is a column vector of regression errors of size [n x1]. ϵ effectively holds the balance amount of variance in y that the model wasn’t able to explain.

Here’s how the above equation would look in matrix format: The matrix version of the linear regression model (Image by Author)

Without loss of generality, and not counting the intercept, let’s assume that the first p regression variables in X are exogenous and the next q variables are endogenous such that 1 + p + q = k: The X matrix composed of p exogenous variables and q endogenous variables (Image by Author)

Suppose we are able to identify q instrumental variables which would be the instruments for the corresponding q regression variables in X namely x_(p+1) thru x_k that are suspected to be endogenous.

Let’s construct a matrix Z as follows:

• The first column of Z will be a column of 1s.
• The next p columns of Z namely z_2 thru z_p will be identical to the p exogenous variables x_2 thru x_p in X.
• The final set of q columns in Z namely z_(p+1) thru z_k will hold the data for the q variables that would be the instruments for the corresponding q endogenous variables in X namely x_(p+1) thru x_k.

Thus, the size of Z is also [n x (k+1)] i.e. the same as that of X.

Next, we’ll take the transpose Z which interchanges the rows and columns. The transpose operation essentially turns Z on its side. The transpose of Z denoted as Z’ is of size [(k+1) x n].

Now, let’s pre-multiply Eq (2) by Z’:

Eq (3) is dimensionally correct. On the L.H.S., Z’ is of size [(k+1) x n] and y is of size [n x 1]. Hence Z’y is of size [(k+1) x 1]

On the R.H.S., X is of size [n x (k+1)] and β is of size [(k+1) x 1]. Working left to right, Z’X is a square matrix of size [(k+1) x (k+1)] and (Z’X)β is of size [(k+1) x 1].

Similarly, ϵ is of size [n x 1]. So Z’ϵ is also of size [(k+1) x 1].

Now, let’s apply the expectation operator E(.) on both sides of Eq. (3): Applying the expectation operator to both sides of Eq (3) (Image by Author)

E(Z’y) and E(Z’Xβ) resolve respectively to Z’y and Z’Xβ

Recollect that Z contains only exogenous variables. Therefore, Z and ϵ are not correlated and hence the mean value of (Z’ϵ) is a column vector of zeros, and Eq (3a) resolves to the following:

Next, we’ll isolate the coefficients vector β on the R.H.S. of (4) by multiplying both sides of Eq (4) with the inverse of the square matrix (Z’X).

The inverse of a matrix is conceptually the multi-dimensional equivalent of the inverse of a scalar number N (assuming N is non-zero). The inverse of a matrix is calculated using a complex formula which we’ll skip getting into.

It is possible to show that (Z’X) is invertible (again something we won’t get into here). Pre-multiplying both sides of Eq. (4) by the inverse of (Z’X) namely (Z’X)^-1, gets us the following:

The yellow and green bits on the R.H.S. cancel each other out and yield an identity matrix in the same way as N*(1/N) equals 1, leaving us with the following equation for estimating the coefficients vector β of the instrumented model: The formula for calculating the regression coefficients of the regression model containing instrumental variables (Image by Author)

Notice that Z, X and y are all observable quantities and so all regression coefficients can be estimated in one shot using Eq (6) provided there is a one-to-one correspondence between the endogenous variables in X and the chosen instruments in Z.

There is one final point that must be mentioned about Eq (6). Eq (6) is strictly speaking estimable only asymptotically, i.e. when the number of data samples n → ∞. But in practice, and for a set of mathematical reasons that probably deserve their own chapter, we can use it to calculate the coefficient estimates of a model estimated via IV on finite sized samples, in other words, on a real world data set.

Thus, the finite sample IV estimator β_cap_IV of β can be stated as follows: The finite sample estimator of regression coefficients for the model containing IVs (Image by Author)

Now, let’s look at the case where there is more than one Instrumental Variable defined for an endogenous variable.

Consider the following regression model of wages:

In the above model, we regress the natural log of wage instead of the raw wage as wage data is often right-skewed and logging it can reduce the skew. Education is measured in terms of years of schooling. College and city are boolean variables indicating whether the person went to college and whether they live in a city. Unemp contains the percentage unemployment rate in the county of residence.

Our X matrix is [1, age, experience, college, city, unemp, education], where the each variable is a column vector of size [n x 1] and the size of X is [n x 7].

We’ll argue that education is endogenous. As such, years of schooling captures only what is taught in school or college. And it also leaves out aspects such as how well the person has grasped the material, their knowledge of topics outside of the curriculum and so on, all of which are left unobserved and therefore captured in the error term ϵ.

We’ll propose two variables, mother’s number of years of schooling (meducation) and father’s number of years of schooling (feducation) as the IVs for the person’s education

## The relevance and exogeneity conditions

Our chosen IVs need to pass the relevance condition. If a regression of education on the rest of the variables in X plus meducation and feducation reveals (via an F-test) that meducation and feducation are jointly significant, the two IVs pass the relevance condition.

The error term ϵ is inherently unobservable. So the exogeneity condition for the IVs cannot be directly tested. Instead, we take it upon faith that parents’ number of years of schooling is unlikely to be correlated with factors such as the child’s grasp of material, i.e. the factors that are hiding in the error term and which are making education be endogenous. But we could be wrong about this. We’ll soon find out.

## The regression model containing IVs

Our regression model with IVs is as follows: Log of wage regressed on a variety of variables including two IVs: meducation and feducation (Image by Author)

Our Z matrix is [1, age, experience, college, city, unemp, meducation, feducation], where the each variable is a column vector of size [n x 1] and the size of Z is [n x 8]. Notice how we have replaced education with its two IVs.

And the coefficient vector to be estimated is:

β_cap_IV=[β*_1_cap, β*_2_cap, β*_3_cap, β*_4_cap, β*_5_cap, β*_6_cap, β*_7_cap, β*_8_cap]

Where the caps indicated estimated values.

With X and Z defined, can we use Eq (6a) to perform a single-shot calculation of β_cap_IV? The finite sample estimator of regression coefficients for the model containing IVs (Image by Author)

Unfortunately, the answer is , no.

Recollect that the size of Z is [n x 8]. So, the size of Z’ is [8 x n]. The size of X is [n x 7]. Hence Z’X has size [8 x 7] which is not a square matrix and therefore not invertible. Thus, Eq. (6a) cannot be used when multiple instrumental variables such as meducation and feducation are used to represent a single endogenous variable such as education.

This difficulty suggests that we explore a different approach for estimating β_cap_IV. This different approach is a two-stage OLS estimator.

## The 2-stage OLS estimator

We begin by developing the first stage of this estimator.

### The First Stage

In this stage, we’ll regress education on age, experience, college, city, unemp, meducation, and feducation

Let’s suppose that we have determined via the F-test that education is indeed correlated with the IVs meducation and feducation

We will now regress education not only on meducation and feducation but also the other variables which allows us to account for the effect of possible correlations between the non-IV variables and the IV variables. See my earlier chapter on Instrumental Variables for a detailed explanation of this effect.

ν is the error term. The above model can be consistently estimated using OLS as all regression variables are exogenous. The estimated model has the following form: Estimated value of education after fitting the model that regresses education on the Z matrix (Image by Author)

In the above equation, education_cap is the estimated (a.k.a. predicted) value of education. The caps on the coefficients similarly indicate estimated values.

The above OLS based regression represents the first stage of a two-stage OLS (2SLS) estimation that we are about to do.

### The second stage

The key insight to be had about the first stage is that education_cap contains only the portion of variance of education that is exogenous, i.e. not correlated with the error term.

Therefore, we can replace education in the original model of ln(wage) with education_cap to form a model that contains only exogenous regression variables, as follows: log of wage regressed on X with education replaced with its exogenous portion (Image by Author)

Since the above model contains only exogenous regression variables, it can be consistently estimated using OLS. This estimation forms the second stage of the 2-stage OLS estimator.

## General Framework of 2SLS

For those of you with a flair for linear algebra, the general framework of 2-stage least squares is as follows (if you like, you may skip this section to go straight to the Python tutorial on 2SLS):

Let’s work on the first stage.

### Stage 1

In stage 1, we estimate the following model. To keep things general, X contains not just the endogenous education but also the rest of the variables, γ is the vector of regression coefficients, and ν is the error term:

The least-squares estimator of γ can be shown to be calculated as follows using the standard formula for the least-squares based estimator:

Using γ_cap, the estimated value of X is given by: Estimation of X using the estimated coefficients γ_cap (Image by Author)

This completes the first stage of 2-SLS.

Now, let’s work on the second stage.

### Stage 2

Let’s recollect Eq 6(a) which is the IV estimator we had constructed for the case where there is a one-to-one correspondence between the endogenous variables in X and the instruments in Z: The finite sample estimator of regression coefficients for the model containing IVs (Image by Author)

We’ll plug in X_cap from Eq (6c) in place of Z in Eq (6a) to get β_cap_2SLS as follows: The coefficients of the instrumented model, estimated using 2-stage Least Squares (Image by Author)

This completes the formulation of the 2-SLS estimator. All matrices on the R.H.S. of Eq (6b) are entirely observable to the experimenter. The estimation of coefficients can be carried out by simply applying equations (6bb), (6c) and (6d) in that sequence:

## A tutorial on estimating a linear model using 2SLS using Python and statsmodels

We’ll use the following cross-sectional data from a 1976 Panel Study of Income Dynamics of married women based on data for the previous year, 1975.

Each row contains hourly wage data and other variables about a married female participant. The data set contains several variables. The ones of interest to us are as follows:

wage: Average hourly wage in 1975 dollars
education: years of schooling of participant
meducation: years of schooling of mother of participant
feducation: years of schooling of father of participant
participation: Did the individual participate in the labor force in 1975? (1/0). We consider only those individuals who participated in 1975.

Our goal is to estimate the effect of education as approximated by number of years of schooling on the hourly wage, specifically log of hourly wage, of married female respondents in 1975. A model to estimate the effect of education on log of hourly wage (Image by Author)

As we saw earlier, education is endogenous, hence a straight-up estimation using OLS will yield biased estimates of all coefficients. Specifically, an OLS estimation of β_1 and β_2 will likely overestimate their values i.e. it will overestimate the effect of education on hourly wages.

We’ll try to remediate this situation by using meducation and feducation as instruments for education.

We’ll use Python, Pandas and Statsmodels to load the data set and build and train the model. Let’s start by importing the required packages:

```import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.gmm import IV2SLS
```

Let’s load the data set into a Pandas `Dataframe`:

```df = pd.read_csv('PSID1976.csv', header=0)
```

Next, we’ll use a subset of the data set where participation=yes.

```df_1975 = df.query('participation == \'yes\'')
```

We’ll need to verify that the instruments meducation and feducation satisfy the relevance condition. For that, we’ll regress education on meducation and feducation, and verify using the F-test that the coefficients of meducation and feducation in this regression are jointly significant.

```reg_expr = 'education ~ meducation + feducation'

olsr_model = smf.ols(formula=reg_expr, data=df_1975)

olsr_model_results = olsr_model.fit()

print(olsr_model_results.summary())
```

We see the following output: Training summary of the linear model that regresses education on meducation, feducation and a constant (Image by Author)

The coefficients of meducation and feducation are individually significant at a p of < 0.001 as indicated by their p-values which are basically zero. The coeffcients are also jointly significant at a p of 2.96e-22 i.e. < .001. meducation and feducation clearly meet the relevance condition for IVs of education.

We’ll now build a linear model for the wage equation and using statsmodels, we’ll train the model using the 2SLS estimator.

We’ll start by building the design matrices. The dependent variable is ln(wage):

```ln_wage = np.log(df_1975['wage'])
```

Statsmodel’s IV2SLS estimator is defined as follows:

```statsmodels.sandbox.regression.gmm.IV2SLS(endog, exog, instrument=None)
```

Statsmodels needs the `endog`, `exog` and `instrument` matrices to be constructed in a specific way as follows:

`endog` is an [n x 1] matrix containing the dependent variable. In our example, it is the ln_wage variable.

`exog` is an [n x (k+1)] size matrix that must contain all the endogenous and exogenous variables, plus the constant. In our example, apart from the constant, we do not have any exogenous variables defined in our wage equation. So it will look like this:

`instrument` is a matrix that contains the instrumental variables. Additionally, the Statsmodels’ `IV2SLS` estimator requires `instrument` to also contain all variables from the `exog` matrix that are not being instrumented. In our example, the instrumental variables are meducation and feducation. The variables in `exog` that are not being instrumented is just the placeholder column for the intercept. Hence, our instrument matrix will look like this:

Let’s build out the three matrices:

```df_1975['ln_wage'] = np.log(df_1975['wage'])

exog = df_1975[['education']]

instruments = df_1975[['meducation', 'feducation']]
```

Now let’s build and train the `IV2SLS` model:

```iv2sls_model = IV2SLS(endog=df_1975['ln_wage'], exog=exog, instrument=instruments)

iv2sls_model_results = iv2sls_model.fit()
```

And let’s print the training summary:

```print(iv2sls_model_results.summary())
```

## Interpretation of results of the 2SLS model

Since our primary interest is in estimating the effect of education on hourly wages, we’ll focus our attention on the coefficient estimate of the education variable.

We see that the 2SLS model has estimated the coefficient of education as 0.0505 with a standard error of 0.032 and a 95% confidence interval of -0.013 to 0.114. The p value of 0.117 suggests a significance at (1–0.117)100%=88.3%. Overall, and as expected for a 2-SLS model, the model lacks precision.

Note that dependent variable is log(wage). To calculate the rate of change of hourly wages for each unit change (i.e. one year) of education, we must exponentiate the coefficient of education.

e^(0.0505)=1.05179 implying that a unit increase in number of years of education is estimated to yield an increase of \$1.05179 in hourly wages, and vice-versa.

## Comparison of the IV estimator with an OLS estimator

Let’s compare the performance of the 2SLS model with a straight-up OLS model that regresses log(wage) on education.

```reg_expr = 'ln_wage ~ education'

olsr_model = smf.ols(formula=reg_expr, data=df_1975)

olsr_model_results = olsr_model.fit()

print(olsr_model_results.summary())
```

We’ll focus our attention on the estimated value of the coefficient of education. At 0.1086, it is double the estimate reported by the 2SLS model.

e^(0.1086)=1.11472, implying a unit increase (decrease) in the number of years of education is estimated to translate into a \$1.11472 increase (decrease) in hourly wages.

The higher estimate from OLS is expected due to the suspected endogeniety of education. In practice, depending on the situation we are modeling, we may want to accept the more conservative estimate of 0.0505 reported by the 2SLS model. However, (and against the 2SLS model), the coefficient estimate from the OLS model is highly significant with a p-value that is essentially zero. Recollect that the estimate from the 2SLS model was significant at only a 88% confidence level.

Also, (and again as expected from the OLS model), the coefficient estimate of education reported by the OLS model has a much smaller standard error (0.014) as compared to that from the 2SLS model (0.032). And therefore, the corresponding 95% CIs from the OLS model are much tighter than those estimated by the 2SLS model.

For comparison, here are the coefficient estimates of education and corresponding 95% CIs from the two models: Comparison of coefficient estimates for education reported by the 2SLS and the OLS models (Image by Author)

With the IV estimator, one trades precision of estimates for the removal of endogeneity and the consequent bias in the estimates.

And here’s a comparison of the main effect of education estimated by the two models on hourly wages: A comparison of the main effect of education on hourly wages estimated by the 2SLS and OLS models (Image by Author)

## Data set and tutorial code

The wages data set used in the chapter can be accessed from this link. The associated documentation can be found here.

Here is the complete source code shown in this chapter: