# Introduction To The Difference-In-Differences Regression Model

We’ll show how to use the DID model to estimate the effect of hurricanes on house prices.

In this chapter, we will study the Difference-In-Differences regression model. The DID model is a powerful and flexible regression technique that can be used to estimate the differential impact of a ‘Treatment’ on the treated group of individuals or things.

We will also illustrate the use of the Difference-In-Differences regression model to estimate the effect of hurricanes on property prices in the United States.

## Defining the terms: Treatment, treated group, control group

The words ‘treatment’ and ‘treated group’ may evoke a picture of a randomized controlled trial to test the efficacy of a drug or medical treatment.

While the DID model can indeed be used very effectively in that setting, in statistics, it is customary to ascribe a much broader interpretation to the word ‘Treatment’. ‘Treatment’ is any event that selectively affects only some of the individuals or things in a study. Examples of Treatment include an increase in state-mandated minimum wage that affects only restaurants in one state (as analyzed in the well-cited study by Card and Krueger in 1994), or the opening of a new airline route connecting two regions of a large country, or a natural disaster that affects only some parts of a country, or an experimental drug or medical procedure that is administered to only some of the participants in a study. In all these examples, the unit of study is respectively a restaurant, a town or a county, a county or a state, or a volunteer.

A study comprises many units (individuals or things) divided into a treatment group and a control group, depending on whether or not they were subjected to the treatment.

## The response variable

In each such study, one wants to measure an outcome, or response, and know whether its mean value in the treatment group is statistically different from that in the control group. For example, the 1994 study by Card and Krueger analyzed whether New Jersey's 1992 increase in the minimum wage from \$4.25 to \$5.05 resulted in a statistically significant change in the employment level among fast food restaurant workers in New Jersey relative to that in neighboring Pennsylvania, which did not change its minimum wage. Other examples of a response variable are the SAT score of a participant, the pollution level in a county, and the tree cover in a country.

## The Effect of Time

In practice, the passage of time introduces a complication. Whatever the response variable being measured, be it SAT scores, employment level, house price inflation, or blood sugar level of participants, the natural flow of time will change its value, potentially significantly, as the study progresses from the pre-treatment to the post-treatment phase. The experimenter must discount the partial effect of time (and the numerous hidden factors that time acts as a proxy for) on the change in the mean value of the response variable in both the control group and the treatment group. In other words, the experimenter must determine whether the treatment itself caused a change in the mean value of the response variable within the treatment group over and above what was caused by the passage of time, and whether this additional treatment-induced effect was observed to a greater degree in the treated group than in the control group.

The Difference-in-Differences (DID) regression model can perform all of the above analysis easily and quite elegantly.

The fitted DID model will tell us whether there is evidence of a net additional, purely treatment-induced effect in the treated group, what the estimated size of that effect is, whether the estimate is statistically significant and, if so, what the 95% or 99% confidence interval around the estimated effect is.

## Structure of the Difference-In-Differences model

The following equation illustrates the structure of the DID model:
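Written out from the term definitions that follow, and labeling the three slope coefficients β_1, β_2 and β_3 (labels assumed here for convenience), the model takes this form:

$$
y_i = \beta_0 + \beta_1\,\mathit{Time\_Period}_i + \beta_2\,\mathit{Treated}_i + \beta_3\,(\mathit{Time\_Period}_i \times \mathit{Treated}_i) + \epsilon_i
$$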

The first thing to note about this equation is that it is a linear regression model.

y_i is the observed response for the ith observation. It is the value being measured in each group before and after treatment.

β_0 is the intercept of regression.

Time_Period_i is a dummy variable that takes the value 0 or 1 depending on whether the ith measurement refers to the pre or post treatment period respectively.

Treated_i is a dummy variable that takes the value 0 or 1 depending on whether the ith measurement refers to an individual in the control group or the treatment group respectively.

(Time_Period_i*Treated_i) is an interaction term. It stores the multiplication of the two dummy variable values for the ith observation.

ϵ_i is the error term associated with the ith observation and it captures the effect of all factors that the model was not able to adequately represent.

The two dummy variables in the model yield the following 2 X 2 matrix of regression equations:

The DID model is trained using the Ordinary Least Squares regression technique.
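As a sketch, with β_1, β_2 and β_3 assumed as labels for the coefficients of Time_Period_i, Treated_i and the interaction term respectively, substituting the four 0/1 combinations into the model gives:

$$
\begin{aligned}
\text{Control group, pre-treatment } (0, 0): \quad & y_i = \beta_0 + \epsilon_i \\
\text{Control group, post-treatment } (1, 0): \quad & y_i = \beta_0 + \beta_1 + \epsilon_i \\
\text{Treatment group, pre-treatment } (0, 1): \quad & y_i = \beta_0 + \beta_2 + \epsilon_i \\
\text{Treatment group, post-treatment } (1, 1): \quad & y_i = \beta_0 + \beta_1 + \beta_2 + \beta_3 + \epsilon_i
\end{aligned}
$$

Each pair in parentheses is (Time_Period_i, Treated_i).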

For the trained (a.k.a. fitted) model, the corresponding expectations are as follows. The caps (^) above the coefficients indicate that they are the estimated (fitted) values of the corresponding coefficients. Replacing y_i with the expected value of y_i also allows us to drop the error term ϵ_i since in a well-behaved OLS regression model, the expected value of the error term is zero:

We wish to calculate the difference in the expected value of y_i between the before (pre-) and after (post-) treatment phases of the study.

For the treatment group, the difference in expectations works out as follows:

Similarly, for the control group we have:

The difference between the two differences gives us the net effect of the treatment on the treatment group:
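The whole chain can be spelled out in a few lines. The labels β̂_1, β̂_2 and β̂_3 are assumed here for the fitted coefficients of Time_Period_i, Treated_i and the interaction term:

$$
\begin{aligned}
\text{Treatment group:} \quad & (\hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3) - (\hat\beta_0 + \hat\beta_2) = \hat\beta_1 + \hat\beta_3 \\
\text{Control group:} \quad & (\hat\beta_0 + \hat\beta_1) - \hat\beta_0 = \hat\beta_1 \\
\text{Difference-in-differences:} \quad & (\hat\beta_1 + \hat\beta_3) - \hat\beta_1 = \hat\beta_3
\end{aligned}
$$

In each of the first two lines, the first bracket is the post-treatment expectation of y_i and the second is the pre-treatment expectation.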

We see that this difference-in-differences effect is the coefficient of the interaction term (Time_Period_i*Treated_i).

It is this result that gives the DID model much of its usefulness.

After the DID model is trained, the fitted coefficient of the interaction term (Time_Period_i*Treated_i) will give us the estimated difference-in-differences effect that we are seeking. The coefficient’s t-score and corresponding p-value will tell us whether the effect is statistically significant, and we can construct the 95% or 99% confidence interval around the estimated coefficient using the coefficient’s standard error reported by the model.
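The claim that the interaction coefficient equals the difference-in-differences of the four group means can be checked numerically. The following is a minimal sketch on synthetic data: the "true" coefficients, sample size and noise level are invented for illustration, and plain NumPy least squares stands in for a full regression package.

```python
import numpy as np

# Synthetic, made-up data: four cells (control/treated x pre/post), 50 observations each.
rng = np.random.default_rng(0)
n = 50
time_period = np.repeat([0, 1, 0, 1], n)
treated = np.repeat([0, 0, 1, 1], n)

# Assumed "true" coefficients, chosen only for this illustration.
b0, b1, b2, b3 = 2.0, 0.5, -0.3, 0.8
y = (b0 + b1 * time_period + b2 * treated
     + b3 * time_period * treated
     + rng.normal(0, 0.1, size=4 * n))

# Design matrix columns: intercept, Time_Period, Treated, interaction term.
X = np.column_stack([np.ones_like(y), time_period, treated, time_period * treated])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def cell_mean(t, d):
    """Mean response in the cell with Time_Period == t and Treated == d."""
    return y[(time_period == t) & (treated == d)].mean()


# (post - pre) in the treated group minus (post - pre) in the control group.
did_from_means = (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

print("interaction coefficient:", coef[3])
print("difference-in-differences of cell means:", did_from_means)
```

Because the model has one parameter per cell, the two printed numbers agree up to floating-point error, and both land near the assumed value of 0.8.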

Let’s illustrate the procedure for building and training a Difference-In-Differences regression model using an interesting real world example.

## How to build a Difference-In-Differences model to estimate the effect of coastal weather events on house prices

We’ll use the DID model to estimate the effect of coastal weather events on house prices in the United States. Specifically, we’ll analyze the effect of the 2005 Atlantic hurricane season, which was the most active hurricane season in recorded history up until 2020.

Incidentally, this topic has been extensively researched using a variety of methods. Some researchers have focused on the effect of a single storm or many storms on the house prices in a single city or a single state, while others have zoomed out to a regional or national level. There are hyper-local studies of the effect of severe weather events on the house prices in a single US county, while others have studied the effect of several years' worth of severe weather events on the house prices of several coastal cities. There is also an interesting recent study on estimating the impact of a distant but approaching hurricane on property prices.

Several of these studies have used the Difference-In-Differences regression model (or some variation or enhancement thereof). Interestingly, although perhaps unsurprisingly, the findings from these studies are diverse and sometimes contradictory, depending on the methodology used by the researchers and the spatial and temporal scope of the study.

### Our approach to the problem

In the rest of this chapter, we will build a rather simple Difference-In-Differences regression model to study the effect of the 2005 hurricane season on the change in the House Price Index a.k.a. house price inflation in the coastal states that were heavily impacted by the hurricane season versus the ones that weren’t. Our model will be a simple one compared to the ones employed in the previous work in this area. Nevertheless, as we will soon see, we will arrive at the same sorts of results as obtained in the research literature in this area.

In our little experiment, the ‘treatment’ will mean being subjected to the full brunt of the 2005 hurricane season. The ‘unit’ being subjected to (or not subjected to) the treatment is a US state with a sea coastline. There are 24 such states in the United States:

### Defining the criteria for being included in the Treatment group

We’ll decide whether a state falls in the treatment group by examining the actions taken by the US Federal Emergency Management Agency (FEMA) in that state during the 2005 Atlantic hurricane season.

FEMA provides direct assistance to individuals in counties that have suffered widespread damage due to disasters. This type of assistance is called Individual Assistance and differs from the other main type of assistance that FEMA offers, called Public Assistance. We will count the number of counties in each coastal state that qualified for Individual Assistance from FEMA at any time during the 2005 Atlantic hurricane season. Here are the state-wise counts:

If a county qualified for Individual Assistance more than once, we will count it multiple times. The rationale behind the repeated counting is that during each disaster, some of the damaged property may have been different from the property damaged during the previous disaster. Similarly, some of the rebuilt or repaired property may have been damaged again in a subsequent incident. Both cases can impact the resale value of the property. Additionally, multiple disaster events in the same county may, in theory and at least temporarily, make properties in that county less attractive to potential home buyers, thereby depressing prices or reducing the growth in prices. On the other hand, a reduction in transaction-worthy housing inventory in the county may (temporarily) increase house price inflation. Our regression model should help us determine which of these effects is dominant.

The table shown above contains a wide variability in counts and we are faced with the question of how to determine if a state was affected ‘enough’ to be considered a Treatment state. Should we consider New Hampshire with 9 affected counties as a Treatment state? What about California with 8 affected counties, or New York state with 11 affected counties? At the other end of the counts scale are the Gulf states of Louisiana, Alabama and Mississippi, which were by all accounts greatly affected and are clearly ‘Treatment’ group states.

We’ll try to resolve this question by drawing the line at the median of counts. Any state with a count greater than or equal to the median (14) will fall into the treatment group. The rest will be part of the control group. Here is how the group-wise map looks:

As we can see from the map, we would be dealing with a highly unbalanced data set with the treatment group being far smaller than the control group. This will almost certainly influence the quality of the estimates produced by our DID model.

### Setting up the Treatment column

Using the treatment group selection criteria outlined above, we’ll add a column called Disaster_Affected and set its value to 1 for states with a count ≥14, and to 0 for the rest:
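One way to set up this column in Pandas is sketched below. Only the New Hampshire, California and New York counts and the threshold of 14 come from the text; the column name `IA_County_Count` and the `State_A` and `State_B` rows are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical table of county counts. Only the NH, CA and NY values are
# quoted in the text; State_A and State_B are made-up placeholder rows.
counts = pd.DataFrame({
    'State': ['New Hampshire', 'California', 'New York', 'State_A', 'State_B'],
    'IA_County_Count': [9, 8, 11, 14, 60],
})

# Median of the full 24-state table, as stated in the text.
THRESHOLD = 14

# 1 = treatment group (count >= threshold), 0 = control group.
counts['Disaster_Affected'] = (counts['IA_County_Count'] >= THRESHOLD).astype(int)

print(counts)
```

With the full table loaded, the threshold could instead be computed as `counts['IA_County_Count'].median()` rather than hard-coded.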

### Setting up the Time Period column

Next, we will add a Time_Period column which we will set to 0 to indicate the period before the start of the 2005 hurricane season, and to 1 to indicate the period after the end of the hurricane season. Notice below that we have duplicated the rows so that each state has a row with Time_Period=0 and a row with Time_Period=1.
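The row duplication can be sketched in Pandas as follows. The two-state frame is a hypothetical stand-in for the 24-state table.

```python
import pandas as pd

# Hypothetical stand-in for the 24-state table built in the previous step.
states = pd.DataFrame({
    'State': ['State_A', 'State_B'],
    'Disaster_Affected': [1, 0],
})

# Make one copy of every state row per period, then stack the two copies
# so that each state ends up with a Time_Period=0 row and a Time_Period=1 row.
panel = pd.concat(
    [states.assign(Time_Period=0), states.assign(Time_Period=1)],
    ignore_index=True,
).sort_values(['State', 'Time_Period'], ignore_index=True)

print(panel)
```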

### The methodology for calculating the value of the response variable

This section describes the procedure for calculating the values of the response variable y_i.

Our goal is to study the effect of the 2005 hurricane season on house prices in the coastal states. To that end, we’ll use the state-wise All-Transactions House Price Index published by the U.S. Federal Housing Finance Agency, which is in the public domain and available for download from FRED, the data service of the Federal Reserve Bank of St. Louis. Here’s how the index looks for the District of Columbia:

We will download these time series for the 24 states of interest and combine them into a 24-state data panel as follows:

For our study, the time periods of interest to us are the 4 quarters immediately prior to the 2005 hurricane season and the 4 quarters immediately following the season. The hurricane season itself ran from 8 June 2005 to 6 Jan 2006. Hence, we are interested in the house price index change across the quarters starting on 1 July 2004, 1 October 2004, 1 January 2005 and 1 April 2005, and then again across the 4 quarters following the 2005 season, namely those starting on 1 April 2006, 1 July 2006, 1 October 2006 and 1 January 2007. Let’s zoom into this region of interest to see what it looks like:

For each state, we will calculate the average quarter-over-quarter fractional change in the house price index over the two sets of quarters. Doing so will give us the value of the response variable, namely, the average Q-o-Q change in HPI in the pre-treatment and the post-treatment phases of the study for each state.

The Q-o-Q fractional change in house price index across any two consecutive quarters i and (i-1) can be calculated using the following formula:

HPI Fractional Change = [HPI_i - HPI_(i-1)] / HPI_(i-1)

Here are the Q-o-Q fractional change values for the 4 quarters of interest before and after the 2005 hurricane season. The highlighted cells illustrate the calculation for one of the quarters:

Next, we take the vertical average of each block of 4 quarters to arrive at the average fractional change in HPI across 4 quarters both before and after the 2005 hurricane season. We repeat this calculation for each state to get the value of the response variable HPI_CHG for the pre-treatment and post-treatment phases.
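The calculation for one state and one phase can be sketched in plain Python. The index values are hypothetical, and the sketch assumes that the first of the four changes is measured against the index value of the quarter just before the block, so five consecutive index values yield four Q-o-Q changes.

```python
def qoq_changes(hpi):
    """Fractional change between each pair of consecutive quarters."""
    return [(curr - prev) / prev for prev, curr in zip(hpi, hpi[1:])]


def average_qoq_change(hpi):
    """Average of the quarter-over-quarter fractional changes."""
    changes = qoq_changes(hpi)
    return sum(changes) / len(changes)


# Hypothetical index values for five consecutive quarters (made-up numbers
# that grow by exactly 2% per quarter, so the result is easy to check).
pre_hpi = [100.0, 102.0, 104.04, 106.1208, 108.243216]

hpi_chg_pre = average_qoq_change(pre_hpi)
print(round(hpi_chg_pre, 4))
```

Running the same function on the post-season block gives the second response value for that state.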

Note that for each state, we have calculated two response values: the top value is the pre-treatment value and the bottom one is the post-treatment value. Thus, there is one value corresponding to Time_Period=0 and another one corresponding to Time_Period=1. Let’s include these average values in the data set we will use to train the DID model:

The last column of the above data set, HPI_CHG, is our response variable y_i.

Now that our data set is built, we can get back to the task of building and training the DID model.

## Building the Difference-In-Differences model for house price inflation

Let’s start by stating the equation for our DID model:
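Using the column names of our data set, and with β_1, β_2 and β_3 assumed as labels for the three slope coefficients, the model can be written as:

$$
\mathit{HPI\_CHG}_i = \beta_0 + \beta_1\,\mathit{Time\_Period}_i + \beta_2\,\mathit{Disaster\_Affected}_i + \beta_3\,(\mathit{Time\_Period}_i \times \mathit{Disaster\_Affected}_i) + \epsilon_i
$$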

To build and train the model, we’ll use Python and the Python-based libraries Pandas, Patsy and statsmodels.

Let’s begin by importing all the required packages:

```python
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
```

Next, we’ll load the data set into a Pandas DataFrame as follows:

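```python
# Load the curated data set: one row per state per time period
df = pd.read_csv('us_fred_coastal_us_states_avg_hpi_before_after_2005.csv', header=0)
```

Form the regression expression in Patsy syntax. The intercept is assumed to be present and will be added to the design matrix automatically. Note that in Patsy, Time_Period*Disaster_Affected already expands to both main effects plus their interaction, so listing the main effects explicitly is harmless but redundant: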

`reg_exp = 'HPI_CHG ~ Time_Period + Disaster_Affected + Time_Period*Disaster_Affected'`

Using Patsy, carve out the training matrices:

```python
# y_train holds HPI_CHG; X_train holds the intercept, the two dummies and their interaction
y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
```

Build the DID model:

```python
did_model = sm.OLS(endog=y_train, exog=X_train)
```

Train the model:

```python
did_model_results = did_model.fit()
```

Print the training summary:

```python
# print() is needed when running as a script rather than in a notebook cell
print(did_model_results.summary())
```

We see the following output (I have highlighted the interesting parts):

### How to interpret the training output of the DID model

We see that the adjusted R-squared is 0.504, meaning the model explains roughly half of the variance in the response variable HPI_CHG. For a model that contains only two dummy variables and their interaction, that is a respectable fit. The p-value of the F-statistic is 1.88e-07, which is highly significant. It leads us to conclude that the model’s variables are jointly significant and that together they explain the variance in HPI_CHG much better than a mean-only model.

We also note that all coefficients are statistically significant, as indicated by their p-values, which are all smaller than 0.05.
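Rather than reading these numbers off the printed summary, one can pull them from the results object. The sketch below fits the same kind of model on a synthetic data set with made-up coefficients, noise level and 6/18 group split, so its numbers are not the chapter's results. It uses the statsmodels formula interface for brevity; `params`, `pvalues` and `conf_int()` are available in the same way on the results object obtained from `sm.OLS(...).fit()`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the chapter's data set: 24 units x 2 periods.
# The group split and all coefficients below are invented for illustration.
rng = np.random.default_rng(42)
n_states = 24
disaster = np.array([1] * 6 + [0] * 18)

demo = pd.DataFrame({
    'Time_Period': np.tile([0, 1], n_states),
    'Disaster_Affected': np.repeat(disaster, 2),
})
demo['HPI_CHG'] = (
    0.04
    - 0.03 * demo['Time_Period']
    - 0.01 * demo['Disaster_Affected']
    + 0.02 * demo['Time_Period'] * demo['Disaster_Affected']
    + rng.normal(0, 0.005, len(demo))
)

results = smf.ols('HPI_CHG ~ Time_Period * Disaster_Affected', data=demo).fit()

# Patsy names the interaction column with a colon.
term = 'Time_Period:Disaster_Affected'
did_estimate = results.params[term]
did_p_value = results.pvalues[term]
ci_low, ci_high = results.conf_int(alpha=0.05).loc[term]  # 95% interval

print(f"DiD estimate: {did_estimate:.4f}, p-value: {did_p_value:.4g}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

Passing `alpha=0.01` to `conf_int()` gives the 99% interval instead.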

The equation of the fitted model is as follows:
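Using the coefficient values that appear in the calculations later in this section (0.0371, -0.0278, -0.0139 and 0.0197), and writing e_i for the residual error, the fitted equation can be written as:

$$
\mathit{HPI\_CHG}_i = 0.0371 - 0.0278\,\mathit{Time\_Period}_i - 0.0139\,\mathit{Disaster\_Affected}_i + 0.0197\,(\mathit{Time\_Period}_i \times \mathit{Disaster\_Affected}_i) + e_i
$$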

Time_Period and Disaster_Affected are 0/1 dummy variables. The four possible combinations are:

Let’s see how to interpret each combination of the two dummy variables, Time_Period and Disaster_Affected. We’ll also switch to working with expected values of HPI_CHG, which lets us drop the subscript i as well as the residual error term e_i.

#### Time_Period_i=0 and Disaster_Affected_i=0

We get the following equation:

This equation gives us the estimated mean inflation in house prices in the control group during the four quarters immediately preceding the 2005 hurricane season. The value of the estimated mean inflation is simply the intercept of regression: 0.0371, or 3.71% per quarter.

#### Time_Period_i=1 and Disaster_Affected_i=0

This equation gives us the estimated mean inflation in house prices in the control group states in the post-treatment period, i.e. during the four quarters following the hurricane season. The value of the estimated mean inflation is 0.0371 - 0.0278 = 0.0093, or 0.93%.

#### Time_Period_i=0 and Disaster_Affected_i=1

This equation gives us the estimated mean house price inflation in the treatment group states during the four quarters prior to the start of the hurricane season. The value of this inflation is 0.0371 - 0.0139 = 0.0232, or 2.32%.

#### Time_Period_i=1 and Disaster_Affected_i=1

This equation gives us the estimated mean house price inflation in the treatment group during the four quarters following the end of the hurricane season. The value of this inflation is 0.0371 - 0.0278 - 0.0139 + 0.0197 = 0.0151, or 1.51%.

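Let’s tabulate our findings:

| | Control group (Disaster_Affected=0) | Treatment group (Disaster_Affected=1) |
|---|---|---|
| Pre-season (Time_Period=0) | 0.0371 | 0.0232 |
| Post-season (Time_Period=1) | 0.0093 | 0.0151 |
| Difference (post-season - pre-season) | -0.0278 | -0.0081 |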

The third row of the table shows the vertical differences (post-season - pre-season) in the estimated values.

We see that for states in the Disaster Affected group, the inflation in house prices in the four quarters following the hurricane season was lower by 0.81 percentage points than the house price inflation experienced in the four quarters prior to the start of the hurricane season.

For states in the non Disaster Affected group, the inflation in house prices in the four quarters following the hurricane season was lower by 2.78 percentage points than the house price inflation experienced in the four quarters prior to the start of the hurricane season.

The difference-in-difference effect between the two groups is:
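The arithmetic can be reproduced in a few lines of Python. The coefficient values are the ones quoted in the four cases above; the variable and function names are chosen here for illustration.

```python
# Fitted coefficients as quoted in the text above.
intercept = 0.0371
b_time = -0.0278          # Time_Period
b_disaster = -0.0139      # Disaster_Affected
b_interaction = 0.0197    # Time_Period * Disaster_Affected


def expected_hpi_chg(time_period, disaster_affected):
    """Estimated mean HPI_CHG for one combination of the two dummies."""
    return (intercept
            + b_time * time_period
            + b_disaster * disaster_affected
            + b_interaction * time_period * disaster_affected)


# Post-season minus pre-season, within each group.
treated_diff = expected_hpi_chg(1, 1) - expected_hpi_chg(0, 1)
control_diff = expected_hpi_chg(1, 0) - expected_hpi_chg(0, 0)

# Difference between the two differences.
did = treated_diff - control_diff

print(f"treated: {treated_diff:.4f}, control: {control_diff:.4f}, DiD: {did:.4f}")
```

That is, (-0.0081) - (-0.0278) = 0.0197.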

The following graphic may help in visualizing the various estimated values:

The value of 0.0197 (1.97 percentage points) is exactly the coefficient of the Time_Period*Disaster_Affected interaction term reported by the trained DID regression model:

The estimated difference-in-differences of 1.97 percentage points suggests that house price inflation in the states that were especially affected by the 2005 hurricane season cooled down less than in the rest of the coastal states after the season ended. One way to explain this effect is to note that prices tend to rise when supply shrinks. Due to the extensive property damage suffered by the treatment group states, the resulting reduction in housing inventory may have temporarily fed house price inflation in those states during the four quarters immediately following the end of the hurricane season.

Here’s the source code used in this chapter:

### Data set

All-Transactions House Price Index for various US states, courtesy of U.S. Federal Housing Finance Agency, retrieved from FRED, Federal Reserve Bank of St. Louis, June 12, 2022 (available in public domain). The curated version of the data set used in this chapter is available for download from here.

Card, David and Krueger, Alan, (1994), Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania, American Economic Review, 84, issue 4, p. 772–93.

Ortega, Francesc and Taspinar, Suleyman, Rising Sea Levels and Sinking Property Values: The Effects of Hurricane Sandy on New York’s Housing Market (March 29, 2018). Available at SSRN or http://dx.doi.org/10.2139/ssrn.3074762

Liao, Yanjun and Graff Zivin, Joshua and Panassie, Yann, How Hurricanes Sweep Up Housing Markets: Evidence from Florida. Available at SSRN or http://dx.doi.org/10.2139/ssrn.4103049

Fisher, J.D., Rutledge, S.R. The impact of Hurricanes on the value of commercial real estate. Bus Econ 56, 129–145 (2021). https://doi.org/10.1057/s11369-021-00212-9

Seung Kyum Kim, Richard B. Peiser, The implication of the increase in storm frequency and intensity to coastal housing markets, Journal of Flood Risk Management, 26 May 2020, https://doi.org/10.1111/jfr3.12626

Anthony Murphy & Eric Strobl, 2010. The impact of hurricanes on housing prices: evidence from U.S. coastal cities, Working Papers 1009, Federal Reserve Bank of Dallas.

Fang, L., Li, L. & Yavas, A. The Impact of Distant Hurricane on Local Housing Markets. J Real Estate Finan Econ (2021). https://doi.org/10.1007/s11146-021-09843-3

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.