We’ll learn how to use the 2SLS technique to estimate linear models containing Instrumental Variables

In this chapter, we’ll learn about two different ways to estimate a linear model using the **Instrumental Variables** technique.

In the previous chapter, we learnt about Instrumental Variables, what they are, and when and how to use them. Let’s recap what we learnt:

Consider the following linear model:

$$\mathbf{y} = \beta_1 \mathbf{1} + \beta_2 \mathbf{x}_2 + \beta_3 \mathbf{x}_3 + \boldsymbol{\epsilon} \tag{1}$$

In the above equation, ***y***, ***1***, ***x**_2*, ***x**_3*, and ***ϵ*** are column vectors of size *[n x 1]*. From subsequent equations, we’ll drop the ***1*** (which is a vector of 1s) for brevity.

If one or more regression variables, say ***x**_3*, is **endogenous**, i.e., it is correlated with the error term ***ϵ***, the Ordinary Least Squares (OLS) estimator is **not consistent**. The coefficient estimates it generates are biased away from the true values, putting into question the usefulness of the experiment.
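The inconsistency of OLS under endogeneity can be seen in a small simulation. The sketch below is illustrative only: the data-generating process, the variable names and the true coefficients (1, 2, 3) are assumptions made for this example and are not part of the chapter's data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical data-generating process (all names and values are illustrative)
eps = rng.normal(size=n)                 # error term
x2 = rng.normal(size=n)                  # exogenous regressor
x3 = rng.normal(size=n) + 0.8 * eps      # endogenous: correlated with eps
y = 1.0 + 2.0 * x2 + 3.0 * x3 + eps      # true coefficients are 1, 2 and 3

# OLS fit of y on [1, x2, x3]
X = np.column_stack([np.ones(n), x2, x3])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols)
```

Even with a very large sample, the estimated coefficient of `x3` settles well above its true value of 3, while the coefficient of the exogenous `x2` stays close to 2. Adding more data does not remove the gap, which is what "not consistent" means in practice.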

One way to rescue the situation is to devise a way to effectively “break” ***x**_3* into two parts:

- A chunk that is uncorrelated with ***ϵ***, which we will add back into the model in place of ***x**_3*. This is the part of ***x**_3* that is in fact exogenous.
- A second chunk that is correlated with ***ϵ***, which we will cut out of the model. This is the part that is endogenous.

And one way to accomplish this goal is to identify a variable ***z**_3*, “an instrument for ***x**_3*”, with the following properties:

- It is correlated with ***x**_3*. That (to some extent) satisfies the first of the above two requirements, and
- It is uncorrelated with the error term ***ϵ***, which takes care of the second requirement.

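Replacing ***x**_3* with ***z**_3* yields the following model:

$$\mathbf{y} = \beta_1 + \beta_2 \mathbf{x}_2 + \beta_3 \mathbf{z}_3 + \boldsymbol{\epsilon} \tag{1a}$$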

All variables on the R.H.S of Eq (1a) are exogenous. This model can be consistently estimated using least-squares.

The above estimation technique can be easily extended to multiple endogenous variables and their corresponding instruments *as long as each endogenous variable is paired one-on-one with a single unique instrumental variable*.

The above example suggests a general framework for IV estimation which we present below.

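A linear regression of ***y*** on ***X*** takes the following matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{2}$$

Assuming a data set of size *n*, in Eq (2):

- ***y*** is a vector of size *[n x 1]*.
- ***X*** is the matrix of regression variables of size *[n x (k+1)]*, i.e. it has *n* rows and *(k+1)* columns of which the first column is a column of 1s and it acts as the placeholder for the intercept.
- ***β*** is a column vector of regression coefficients of size *[(k+1) x 1]* where the first element *β_1* is the intercept of regression.
- ***ϵ*** is a column vector of regression errors of size *[n x 1]*. ***ϵ*** effectively holds the balance amount of variance in ***y*** that the model ***Xβ*** wasn’t able to explain.

Here’s how the above equation would look in matrix format:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{1,2} & \cdots & x_{1,k+1} \\ 1 & x_{2,2} & \cdots & x_{2,k+1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,2} & \cdots & x_{n,k+1} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k+1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$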

Without loss of generality, and not counting the intercept, let’s assume that the first *p* regression variables in ***X*** are exogenous and the next *q* variables are endogenous such that *1 + p + q = k + 1*, i.e. *p + q = k*:

$$\mathbf{X} = [\,\mathbf{1},\ \mathbf{x}_2, \ldots, \mathbf{x}_{p+1},\ \mathbf{x}_{p+2}, \ldots, \mathbf{x}_{k+1}\,]$$

Suppose we are able to identify *q* instrumental variables which would be the instruments for the corresponding *q* regression variables in ***X***, namely ***x**_(p+2)* thru ***x**_(k+1)*, that are suspected to be endogenous.

Let’s construct a matrix ***Z*** as follows:

- The first column of ***Z*** will be a column of 1s.
- The next *p* columns of ***Z***, namely ***z**_2* thru ***z**_(p+1)*, will be identical to the *p* exogenous variables ***x**_2* thru ***x**_(p+1)* in ***X***.
- The final set of *q* columns in ***Z***, namely ***z**_(p+2)* thru ***z**_(k+1)*, will hold the data for the *q* variables that would be the instruments for the corresponding *q* endogenous variables in ***X***, namely ***x**_(p+2)* thru ***x**_(k+1)*.

Thus, the size of ***Z*** is also *[n x (k+1)]*, i.e. the same as that of ***X***.

Next, we’ll take the transpose of ***Z***, which interchanges the rows and columns. The transpose operation essentially turns ***Z*** on its side. The transpose of ***Z***, denoted as ***Z’***, is of size *[(k+1) x n]*.

Now, let’s pre-multiply Eq (2) by ***Z’***:

$$\mathbf{Z}'\mathbf{y} = \mathbf{Z}'\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}'\boldsymbol{\epsilon} \tag{3}$$

Eq (3) is dimensionally correct. On the L.H.S., ***Z’*** is of size *[(k+1) x n]* and ***y*** is of size *[n x 1]*. Hence ***Z’y*** is of size *[(k+1) x 1]*.

On the R.H.S., ***X*** is of size *[n x (k+1)]* and ***β*** is of size *[(k+1) x 1]*. Working left to right, ***Z’X*** is a square matrix of size *[(k+1) x (k+1)]* and ***(Z’X)β*** is of size *[(k+1) x 1]*.

Similarly, ***ϵ*** is of size *[n x 1]*. So ***Z’ϵ*** is also of size *[(k+1) x 1]*.

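Now, let’s apply the expectation operator *E(.)* on both sides of Eq. (3):

$$E(\mathbf{Z}'\mathbf{y}) = E(\mathbf{Z}'\mathbf{X}\boldsymbol{\beta}) + E(\mathbf{Z}'\boldsymbol{\epsilon}) \tag{3a}$$

*E(**Z’y**)* and *E(**Z’Xβ**)* resolve respectively to ***Z’y*** and ***Z’Xβ***.

Recollect that ***Z*** contains only exogenous variables. Therefore, ***Z*** and ***ϵ*** are not correlated and hence the mean value of *(**Z’ϵ**)* is a column vector of zeros, and Eq (3a) resolves to the following:

$$\mathbf{Z}'\mathbf{y} = \mathbf{Z}'\mathbf{X}\boldsymbol{\beta} \tag{4}$$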

Next, we’ll isolate the coefficients vector ***β*** on the R.H.S. of Eq (4) by multiplying both sides of Eq (4) with the inverse of the square matrix *(**Z’X**)*.

The inverse of a matrix is conceptually the multi-dimensional equivalent of the inverse *1/N* of a scalar number *N* (assuming *N* is non-zero). The inverse of a matrix is calculated using a complex formula which we’ll skip getting into.

It is possible to show that *(**Z’X**)* is invertible (again something we won’t get into here). Pre-multiplying both sides of Eq. (4) by the inverse of *(**Z’X**)*, namely *(**Z’X**)^-1*, gets us the following:

$$(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y} = (\mathbf{Z}'\mathbf{X})^{-1}(\mathbf{Z}'\mathbf{X})\boldsymbol{\beta} \tag{5}$$

On the R.H.S., *(**Z’X**)^-1* and *(**Z’X**)* cancel each other out and yield an identity matrix in the same way as N × (1/N) equals 1, leaving us with the following equation for estimating the coefficients vector ***β*** of the instrumented model:

$$\boldsymbol{\beta} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y} \tag{6}$$

Notice that ***Z***, ***X*** and ***y*** are all observable quantities and so all regression coefficients can be estimated in one shot using Eq (6) provided there is a one-to-one correspondence between the endogenous variables in ***X*** and the chosen instruments in ***Z***.

There is one final point that must be mentioned about Eq (6). Eq (6) holds, strictly speaking, only asymptotically, i.e. as the number of data samples *n → ∞*. But in practice, and for a set of mathematical reasons that probably deserve their own chapter, we can use it to calculate the coefficient estimates of a model estimated via IV on finite sized samples, in other words, on a real world data set.

Thus, the finite sample IV estimator ***β**_cap_IV* of ***β*** can be stated as follows:

$$\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y} \tag{6a}$$
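As a rough illustration of the one-shot estimator *(**Z’X**)^-1 **Z’y***, here is a minimal NumPy sketch on simulated data. The data-generating process, the instrument `z3` and the true coefficients (1, 2, 3) are assumptions invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x3 is endogenous, z3 is its single instrument
eps = rng.normal(size=n)
x2 = rng.normal(size=n)                            # exogenous regressor
z3 = rng.normal(size=n)                            # correlated with x3, not with eps
x3 = 0.7 * z3 + 0.8 * eps + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0 * x2 + 3.0 * x3 + eps                # true coefficients: 1, 2, 3

X = np.column_stack([np.ones(n), x2, x3])          # [n x 3]
Z = np.column_stack([np.ones(n), x2, z3])          # [n x 3], x3 swapped for z3

# IV estimator (Z'X)^-1 Z'y, solved as a linear system instead of inverting Z'X
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# OLS estimator (X'X)^-1 X'y for comparison
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("IV :", beta_iv)
print("OLS:", beta_ols)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical-stability choice; the two are algebraically the same. In this simulation the IV estimate of the third coefficient lands close to 3, while the OLS estimate is pulled upward by the correlation between `x3` and `eps`.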

Now, let’s look at the case where there is more than one Instrumental Variable defined for an endogenous variable.

Consider the following regression model of wages:

$$\ln(\mathbf{wage}) = \beta_1 + \beta_2\,\mathbf{age} + \beta_3\,\mathbf{experience} + \beta_4\,\mathbf{college} + \beta_5\,\mathbf{city} + \beta_6\,\mathbf{unemp} + \beta_7\,\mathbf{education} + \boldsymbol{\epsilon}$$

In the above model, we regress the natural log of ***wage*** instead of the raw wage as wage data is often right-skewed and logging it can reduce the skew. ***Education*** is measured in terms of years of schooling. ***College*** and ***city*** are boolean variables indicating whether the person went to college and whether they live in a city. ***Unemp*** contains the percentage unemployment rate in the county of residence.

Our ***X*** matrix is [***1***, ***age***, ***experience***, ***college***, ***city***, ***unemp***, ***education***], where each variable is a column vector of size *[n x 1]* and the size of ***X*** is *[n x 7]*.

We’ll argue that ***education*** is endogenous. As such, years of schooling captures only what is taught in school or college. And it also leaves out aspects such as how well the person has grasped the material, their knowledge of topics outside of the curriculum and so on, all of which are left unobserved and therefore captured in the error term ***ϵ***.

We’ll propose two variables, mother’s number of years of schooling (***meducation***) and father’s number of years of schooling (***feducation***) as the IVs for the person’s ***education***.

## The relevance and exogeneity conditions

Our chosen IVs need to pass the **relevance condition**. If a regression of ***education*** on the rest of the variables in ***X*** plus ***meducation*** and ***feducation*** reveals (via an **F-test**) that ***meducation*** and ***feducation*** are *jointly* significant, the two IVs pass the relevance condition.

The error term ***ϵ*** is inherently unobservable. So the **exogeneity condition** for the IVs cannot be directly tested. Instead, we take it upon faith that parents’ number of years of schooling is unlikely to be correlated with factors such as the child’s grasp of material, i.e. the factors that are hiding in the error term and which are making ***education*** endogenous. But we could be wrong about this. We’ll soon find out.

## The regression model containing IVs

Our regression model with IVs is as follows:

Our ***Z*** matrix is [***1***, ***age***, ***experience***, ***college***, ***city***, ***unemp***, ***meducation***, ***feducation***], where each variable is a column vector of size *[n x 1]* and the size of ***Z*** is *[n x 8]*. Notice how we have replaced ***education*** with its two IVs.

And the coefficient vector to be estimated is:

*β_cap_IV = [β_1_cap, β_2_cap, β_3_cap, β_4_cap, β_5_cap, β_6_cap, β_7_cap, β_8_cap]*

where the caps indicate estimated values.

With ***X*** and ***Z*** defined, can we use Eq (6a) to perform a single-shot calculation of ***β**_cap_IV*?

Unfortunately, the answer is no.

Recollect that the size of ***Z*** is *[n x 8]*. So, the size of ***Z’*** is *[8 x n]*. The size of ***X*** is *[n x 7]*. Hence ***Z’X*** has size *[8 x 7]*, which is not a square matrix and therefore ***not invertible***. Thus, Eq. (6a) cannot be used when multiple instrumental variables such as ***meducation*** and ***feducation*** are used to represent a single endogenous variable such as ***education***.

This difficulty suggests that we explore a different approach for estimating ***β**_cap_IV*. This different approach is a two-stage OLS estimator.
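The dimensional argument is easy to check numerically. The following sketch uses random placeholder matrices with the same shapes as ***X*** (*[n x 7]*) and ***Z*** (*[n x 8]*); the values themselves are meaningless and only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Placeholder matrices with the shapes discussed in the text
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])   # [n x 7]
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 7))])   # [n x 8]

ZtX = Z.T @ X
print(ZtX.shape)   # shape of Z'X

# NumPy refuses to invert a non-square matrix
try:
    np.linalg.inv(ZtX)
    invertible = True
except np.linalg.LinAlgError as err:
    invertible = False
    print("Cannot invert Z'X:", err)
```

`np.linalg.inv` raises a `LinAlgError` here because the product has 8 rows and 7 columns, which mirrors why Eq (6a) has no direct use when there are more instruments than endogenous variables.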

## The 2-stage OLS estimator

We begin by developing the first stage of this estimator.

### The First Stage

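In this stage, we’ll regress ***education*** on ***age***, ***experience***, ***college***, ***city***, ***unemp***, ***meducation***, and ***feducation***.

Let’s suppose that we have determined via the F-test that ***education*** is indeed correlated with the IVs ***meducation*** and ***feducation***.

We will now regress ***education*** not only on ***meducation*** and ***feducation*** but also the other variables, which allows us to account for the effect of possible correlations between the non-IV variables and the IV variables. See my earlier chapter on Instrumental Variables for a detailed explanation of this effect.

$$\mathbf{education} = \gamma_1 + \gamma_2\,\mathbf{age} + \gamma_3\,\mathbf{experience} + \gamma_4\,\mathbf{college} + \gamma_5\,\mathbf{city} + \gamma_6\,\mathbf{unemp} + \gamma_7\,\mathbf{meducation} + \gamma_8\,\mathbf{feducation} + \boldsymbol{\nu}$$

***ν*** is the error term. The above model can be consistently estimated using OLS as all regression variables are exogenous. The estimated model has the following form:

$$\widehat{\mathbf{education}} = \hat{\gamma}_1 + \hat{\gamma}_2\,\mathbf{age} + \hat{\gamma}_3\,\mathbf{experience} + \hat{\gamma}_4\,\mathbf{college} + \hat{\gamma}_5\,\mathbf{city} + \hat{\gamma}_6\,\mathbf{unemp} + \hat{\gamma}_7\,\mathbf{meducation} + \hat{\gamma}_8\,\mathbf{feducation}$$

In the above equation, ***education_cap*** is the estimated (a.k.a. predicted) value of ***education***. The caps on the coefficients similarly indicate estimated values.

The above OLS based regression represents the **first stage** of the two-stage least squares (2SLS) estimation that we are about to do.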

### The second stage

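*The key insight to be had about the first stage is that **education_cap** contains only the portion of variance of **education** that is exogenous, i.e. not correlated with the error term.*

Therefore, we can replace ***education*** in the original model of *ln(**wage**)* with ***education_cap*** to form a model that contains only exogenous regression variables, as follows:

$$\ln(\mathbf{wage}) = \beta_1 + \beta_2\,\mathbf{age} + \beta_3\,\mathbf{experience} + \beta_4\,\mathbf{college} + \beta_5\,\mathbf{city} + \beta_6\,\mathbf{unemp} + \beta_7\,\widehat{\mathbf{education}} + \boldsymbol{\epsilon}$$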

Since the above model contains only exogenous regression variables, it can be consistently estimated using OLS. This estimation forms the **second stage** of the 2-stage OLS estimator.
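The two stages can be sketched with two ordinary OLS fits. The example below is a minimal illustration on simulated data, not on the wage data set used later in the chapter: the `ability` variable, the reduced set of regressors and all coefficient values (including the true education effect of 0.08) are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 50_000

# Simulated stand-in for a wage data set (all values are illustrative)
ability = rng.normal(size=n)                      # unobserved, so it ends up in the error term
age = rng.uniform(25, 60, size=n)
meducation = rng.normal(10, 3, size=n)
feducation = rng.normal(10, 3, size=n)
education = (4 + 0.3 * meducation + 0.3 * feducation
             + 1.5 * ability + rng.normal(size=n))
ln_wage = (0.5 + 0.01 * age + 0.08 * education
           + 0.3 * ability + rng.normal(scale=0.5, size=n))

sim = pd.DataFrame({'age': age, 'meducation': meducation, 'feducation': feducation,
                    'education': education, 'ln_wage': ln_wage})

# Stage 1: regress the endogenous variable on the exogenous variable and both IVs
stage1 = smf.ols('education ~ age + meducation + feducation', data=sim).fit()
sim['education_cap'] = stage1.fittedvalues

# Stage 2: replace education with its first-stage prediction
stage2 = smf.ols('ln_wage ~ age + education_cap', data=sim).fit()

# Plain OLS on the endogenous variable, for comparison
naive = smf.ols('ln_wage ~ age + education', data=sim).fit()

print("2SLS by hand:", stage2.params['education_cap'])
print("Plain OLS   :", naive.params['education'])
```

One caveat about this hand-rolled approach: the coefficient from the second fit is the 2SLS estimate, but the standard errors printed by that second OLS fit are not the correct 2SLS standard errors, because they are computed from residuals based on `education_cap` rather than `education`. A dedicated 2SLS routine handles that correction.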

## General Framework of 2SLS

For those of you with a flair for linear algebra, the general framework of 2-stage least squares is as follows (if you like, you may skip this section to go straight to the Python tutorial on 2SLS):

Let’s work on the first stage.

### Stage 1

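In stage 1, we estimate the following model. To keep things general, ***X*** contains not just the endogenous ***education*** but also the rest of the variables, ***γ*** is the matrix of regression coefficients (one column of coefficients for each column of ***X***), and ***ν*** is the error term:

$$\mathbf{X} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\nu} \tag{6b}$$

The least-squares estimator of ***γ*** is given by the standard formula for the least-squares estimator:

$$\hat{\boldsymbol{\gamma}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X} \tag{6bb}$$

Using ***γ**_cap*, the estimated value of ***X*** is given by:

$$\hat{\mathbf{X}} = \mathbf{Z}\hat{\boldsymbol{\gamma}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X} \tag{6c}$$

This completes the first stage of 2-SLS.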

Now, let’s work on the second stage.

### Stage 2

Let’s recollect Eq (6a), which is the IV estimator we had constructed for the case where there is a one-to-one correspondence between the endogenous variables in ***X*** and the instruments in ***Z***:

$$\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y} \tag{6a}$$

We’ll plug in ***X**_cap* from Eq (6c) in place of ***Z*** in Eq (6a) to get ***β**_cap_2SLS* as follows:

$$\hat{\boldsymbol{\beta}}_{2SLS} = (\hat{\mathbf{X}}'\mathbf{X})^{-1}\hat{\mathbf{X}}'\mathbf{y} \tag{6d}$$

This completes the formulation of the 2-SLS estimator. All matrices on the R.H.S. of Eq (6d) can be computed from data that is observable to the experimenter. The estimation of coefficients can be carried out by simply applying equations (6bb), (6c) and (6d) in that sequence.
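Applying equations (6bb), (6c) and (6d) in sequence takes only a few lines of NumPy. This is a sketch under stated assumptions: the simulated data, the two instruments `z_a` and `z_b`, and the true coefficients (1, 2, 3) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical data: one endogenous regressor (x3) with two instruments
eps = rng.normal(size=n)
x2 = rng.normal(size=n)
z_a = rng.normal(size=n)
z_b = rng.normal(size=n)
x3 = 0.5 * z_a + 0.5 * z_b + 0.8 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x2 + 3.0 * x3 + eps               # true coefficients: 1, 2, 3

X = np.column_stack([np.ones(n), x2, x3])         # [n x 3]
Z = np.column_stack([np.ones(n), x2, z_a, z_b])   # [n x 4], more columns than X

# (6bb): gamma_cap = (Z'Z)^-1 Z'X, one column of coefficients per column of X
gamma_cap = np.linalg.solve(Z.T @ Z, Z.T @ X)     # [4 x 3]

# (6c): X_cap = Z gamma_cap
X_cap = Z @ gamma_cap                             # [n x 3]

# (6d): beta_cap_2SLS = (X_cap'X)^-1 X_cap'y
beta_2sls = np.linalg.solve(X_cap.T @ X, X_cap.T @ y)

# Cross-check: plain OLS of y on X_cap gives the same coefficients
beta_stage2, *_ = np.linalg.lstsq(X_cap, y, rcond=None)

print(beta_2sls)
print(beta_stage2)
```

Note that *X_cap’X* is square (*[3 x 3]*) even though ***Z*** has more columns than ***X***, which is exactly what makes this route workable when instruments outnumber endogenous variables. The exogenous columns of ***X*** are reproduced exactly in `X_cap` because they also appear in ***Z***.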

## A tutorial on estimating a linear model using 2SLS using Python and statsmodels

We’ll use the following cross-sectional data from a 1976 Panel Study of Income Dynamics of married women based on data for the previous year, 1975.

Each row contains hourly wage data and other variables about a married female participant. The data set contains several variables. The ones of interest to us are as follows:

- *wage*: Average hourly wage in 1975 dollars
- *education*: years of schooling of participant
- *meducation*: years of schooling of mother of participant
- *feducation*: years of schooling of father of participant
- *participation*: Did the individual participate in the labor force in 1975? (1/0). We consider only those individuals who participated in 1975.

Our goal is to **estimate the effect of education as approximated by number of years of schooling on the hourly wage, specifically log of hourly wage**, of married female respondents in 1975.

As we saw earlier, education is endogenous, hence a straight-up estimation using OLS will yield biased estimates of all coefficients. Specifically, an OLS estimation of *β_1* and *β_2* will likely overestimate their values i.e. it will overestimate the effect of education on hourly wages.

We’ll try to remediate this situation by using *meducation* and *feducation* as **instruments** for education.

We’ll use Python, Pandas and Statsmodels to load the data set and build and train the model. Let’s start by importing the required packages:

```
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.api import add_constant
from statsmodels.sandbox.regression.gmm import IV2SLS
```

Let’s load the data set into a Pandas `DataFrame`:

```
df = pd.read_csv('PSID1976.csv', header=0)
```

Next, we’ll use a subset of the data set where participation=yes.

```
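# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
df_1975 = df.query("participation == 'yes'").copy()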
```

We’ll need to verify that the instruments *meducation* and *feducation* satisfy the **relevance condition**. For that, we’ll regress *education* on *meducation* and *feducation*, and verify using the F-test that the coefficients of *meducation* and *feducation* in this regression are *jointly* significant.

```
reg_expr = 'education ~ meducation + feducation'
olsr_model = smf.ols(formula=reg_expr, data=df_1975)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
```

We see the following output:

The coefficients of *meducation* and *feducation* are individually significant at p < 0.001, as indicated by their p-values which are basically zero. The coefficients are also jointly significant: the F-test reports a p-value of 2.96e-22, i.e. < .001. *meducation* and *feducation* clearly meet the **relevance condition** for IVs of *education*.
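The joint significance can also be requested explicitly through the `f_test` method of a fitted statsmodels OLS model. The sketch below runs on simulated data, since it is meant to be self-contained: the data frame `sim` and its coefficients are assumptions for illustration, not the PSID values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 1000

# Simulated first-stage data (illustrative values only)
sim = pd.DataFrame({
    'meducation': rng.normal(9, 3, size=n),
    'feducation': rng.normal(9, 3, size=n),
})
sim['education'] = (9 + 0.15 * sim['meducation'] + 0.2 * sim['feducation']
                    + rng.normal(0, 2, size=n))

first_stage = smf.ols('education ~ meducation + feducation', data=sim).fit()

# Joint null hypothesis: both instrument coefficients are zero
joint = first_stage.f_test('meducation = 0, feducation = 0')
f_stat = float(np.squeeze(joint.fvalue))
p_value = float(np.squeeze(joint.pvalue))

print("F-statistic:", f_stat)
print("p-value    :", p_value)
```

When the instruments are the only regressors, as here, this joint F-statistic coincides with the overall F-statistic shown in the regression summary. When other exogenous variables are also present in the first stage, the explicit `f_test` on just the instruments is the one to look at. A commonly cited rule of thumb treats a first-stage F-statistic below about 10 as a sign of weak instruments.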

We’ll now build a linear model for the wage equation and using statsmodels, we’ll train the model using the 2SLS estimator.

We’ll start by building the design matrices. The dependent variable is *ln(wage)*:

```
ln_wage = np.log(df_1975['wage'])
```

Statsmodels’ `IV2SLS` estimator is defined as follows:

```
statsmodels.sandbox.regression.gmm.IV2SLS(endog, exog, instrument=None)
```

Statsmodels needs the *endog*, *exog* and *instrument* matrices to be constructed in a specific way as follows:

*endog* is an *[n x 1]* matrix containing the dependent variable. In our example, it is the *ln_wage* variable.

*exog* is an *[n x (k+1)]* size matrix that must contain all the endogenous and exogenous variables, plus the constant. In our example, apart from the constant, we do not have any exogenous variables defined in our wage equation. So it will consist of the columns [***1***, *education*].

*instrument* is a matrix that contains the instrumental variables. Additionally, the Statsmodels’ `IV2SLS` estimator requires `instrument` to also contain all variables from the `exog` matrix that are *not* being instrumented. In our example, the instrumental variables are *meducation* and *feducation*. The only variable in `exog` that is *not* being instrumented is the placeholder column for the intercept. Hence, our instrument matrix will consist of the columns [***1***, *meducation*, *feducation*].

Let’s build out the three matrices:

```
df_1975['ln_wage'] = np.log(df_1975['wage'])
exog = df_1975[['education']]
exog = add_constant(exog)
instruments = df_1975[['meducation', 'feducation']]
instruments = add_constant(instruments)
```

Now let’s build and train the `IV2SLS` model:

```
iv2sls_model = IV2SLS(endog=df_1975['ln_wage'], exog=exog, instrument=instruments)
iv2sls_model_results = iv2sls_model.fit()
```

And let’s print the training summary:

```
print(iv2sls_model_results.summary())
```

## Interpretation of results of the 2SLS model

Since our primary interest is in estimating the effect of education on hourly wages, we’ll focus our attention on the coefficient estimate of the *education* variable.

We see that the 2SLS model has estimated the coefficient of *education* as 0.0505 with a standard error of 0.032 and a 95% confidence interval of -0.013 to 0.114. The p-value of 0.117 means the coefficient is significant only at a confidence level of (1 - 0.117) × 100% = 88.3%, which is short of the conventional 95%. Overall, and as expected for a 2-SLS model, the model lacks precision.

Note that the dependent variable is *log*(wage). To calculate the effect on hourly wages of each unit change (i.e. one year) of education, we must exponentiate the coefficient of *education*.

e^(0.0505)=1.05179, implying that a unit increase in the number of years of education is estimated to multiply the hourly wage by 1.05179, i.e. to raise it by about 5.2% (and a unit decrease is estimated to divide it by the same factor).
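Because the dependent variable is the log of wage, the exponentiated coefficient acts as a multiplicative factor on the wage rather than as a dollar amount. A quick check of the arithmetic, using the 2SLS coefficient quoted above (the variable names are just for this sketch):

```python
import math

beta_education = 0.0505                  # 2SLS coefficient of education quoted in the text

factor = math.exp(beta_education)        # multiplicative change in wage per extra year
pct_change = (factor - 1.0) * 100.0      # the same effect expressed as a percentage

print(round(factor, 4), round(pct_change, 2))
```

For small coefficients the percentage change is close to 100 times the coefficient itself, which is why a coefficient of 0.0505 is often read directly as "about 5%".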

## Comparison of the IV estimator with an OLS estimator

Let’s compare the performance of the 2SLS model with a straight-up OLS model that regresses *log(wage)* on *education*.

```
reg_expr = 'ln_wage ~ education'
olsr_model = smf.ols(formula=reg_expr, data=df_1975)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
```

We’ll focus our attention on the estimated value of the coefficient of *education*. At 0.1086, it is roughly double the estimate reported by the 2SLS model.

e^(0.1086)=1.11472, implying that each additional year of education is estimated to multiply the hourly wage by 1.11472, i.e. an increase of about 11.5% (and each year less is estimated to divide it by the same factor).

The higher estimate from OLS is expected due to the suspected endogeneity of *education*. In practice, depending on the situation we are modeling, we may want to accept the more conservative estimate of 0.0505 reported by the 2SLS model. However (and against the 2SLS model), the coefficient estimate from the OLS model is highly significant with a p-value that is essentially zero. Recollect that the estimate from the 2SLS model was significant at only an 88% confidence level.

Also, (and again as expected from the OLS model), the coefficient estimate of *education* reported by the OLS model has a much smaller standard error (0.014) as compared to that from the 2SLS model (0.032). And therefore, the corresponding 95% CIs from the OLS model are much tighter than those estimated by the 2SLS model.

For comparison, here are the coefficient estimates of *education* and corresponding 95% CIs from the two models:

With the IV estimator, one trades precision of estimates for the removal of endogeneity and the consequent bias in the estimates.

And here’s a comparison of the **main effect** of *education* estimated by the two models on hourly wages:

## Data set and tutorial code

The wages data set used in the chapter can be accessed from this link. The associated documentation can be found here.

Here is the complete source code shown in this chapter:

## Citations and Copyrights

### Data set

The Labor Force Participation data set is available as part of R data sets. It is made available by Vincent Arel-Bundock as part of the Rdatasets package under the GPL-3 license.

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
