Introduction To The Difference-In-Differences Regression Model

We’ll show how to use the DID model to estimate the effect of hurricanes on house prices


In this chapter, we will study the Difference-In-Differences regression model. The DID model is a powerful and flexible regression technique that can be used to estimate the differential impact of a ‘Treatment’ on the treated group of individuals or things.

We will also illustrate the use of the Difference-In-Differences regression model to estimate the effect of hurricanes on property prices in the United States.

Defining the terms: Treatment, treated group, control group

The words ‘treatment’ and ‘treated group’ may evoke a picture of a randomized controlled trial to test the efficacy of a drug or medical treatment.

While the DID model can indeed be used very effectively in that setting, in statistics, it is customary to ascribe a much broader interpretation to the word ‘Treatment’. ‘Treatment’ is any event that selectively affects only some of the individuals or things in a study. Examples of Treatment include an increase in state-mandated minimum wage that affects only restaurants in one state (as analyzed in the well-cited study by Card and Krueger in 1994), or the opening of a new airline route connecting two regions of a large country, or a natural disaster that affects only some parts of a country, or an experimental drug or medical procedure that is administered to only some of the participants in a study. In all these examples, the unit of study is respectively a restaurant, a town or a county, a county or a state, or a volunteer.

A study comprises many units (individuals or things) divided into a treatment group and a control group, depending on whether or not they were subjected to the treatment.

The response variable

In each such study, one wants to measure an outcome (a response) and determine whether its mean value in the treatment group is statistically different from that in the control group. For example, the 1994 study by Card and Krueger analyzed whether New Jersey’s 1992 increase in the minimum wage from $4.25 to $5.05 resulted in a statistically significant change in the employment level among fast food restaurant workers in New Jersey relative to neighboring Pennsylvania, which did not change its minimum wage. Other examples of a response variable are the SAT score of a participant, the pollution level in a county, and the tree cover in a country.

The Effect of Time

In practice, a complication is introduced by the passage of time. Whatever the response variable being measured, be it SAT scores, employment level, house price inflation, or the blood sugar level of participants, the natural flow of time will change its value in a potentially significant way as the study progresses from the pre-treatment to the post-treatment phase. The experimenter must discount the partial effect of time (and of the numerous hidden factors that time acts as a proxy for) on the change in the mean value of the response variable in both the control group and the treatment group. In other words, the experimenter must determine whether the treatment itself caused any change in the mean value of the response variable within the treatment group over and above what was caused by the passage of time, and whether this additional effect was significantly larger in the treated group than in the control group.

The Difference-in-Differences (DID) regression model can perform all of the above analysis simply and elegantly.

The fitted DID model will tell us whether there is evidence of a net additional effect in the treated group that is purely treatment-induced, what the estimated size of this effect is, whether the estimate is statistically significant, and if so, what the 95% or 99% confidence interval around the estimated effect is.
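To make the idea concrete before we look at the regression itself, here is a minimal arithmetic sketch. The four mean values are made up purely for illustration and do not come from any study.

```python
# Hypothetical mean responses (not real data) for the two groups
# in the pre-treatment and post-treatment phases of a study.
mean_response = {
    ("control", "pre"): 50.0,
    ("control", "post"): 54.0,
    ("treated", "pre"): 48.0,
    ("treated", "post"): 57.0,
}


def change_over_time(group):
    """Post-treatment mean minus pre-treatment mean for one group."""
    return mean_response[(group, "post")] - mean_response[(group, "pre")]


change_control = change_over_time("control")  # effect of time alone
change_treated = change_over_time("treated")  # effect of time plus treatment

# Subtracting the control group's change nets out the effect of time.
did_effect = change_treated - change_control

print("Change in control group:", change_control)
print("Change in treated group:", change_treated)
print("Difference-in-differences:", did_effect)
```

A naive before-and-after comparison within the treated group would credit the treatment with the whole change, including the part that the control group experienced too.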

Structure of the Difference-In-Differences model

The following equation illustrates the structure of the DID model:

y_i = β_0 + β_1*Time_Period_i + β_2*Treated_i + β_3*(Time_Period_i*Treated_i) + ϵ_i
The Difference-In-Differences regression model (Image by Author)

The first thing we note about this equation is that it is a linear regression model.

y_i is the observed response for the ith observation. It is the value being measured in each group before and after treatment.

β_0 is the regression intercept. β_1, β_2 and β_3 are the coefficients of the three regression variables described next.

Time_Period_i is a dummy variable that takes the value 0 or 1 depending on whether the ith measurement refers to the pre or post treatment period respectively.

Treated_i is a dummy variable that takes the value 0 or 1 depending on whether the ith measurement refers to an individual in the control group or the treatment group respectively.

(Time_Period_i*Treated_i) is an interaction term. It is the product of the two dummy variables for the ith observation, so it equals 1 only for a post-treatment measurement on a unit in the treatment group.

ϵ_i is the error term associated with the ith observation and it captures the effect of all factors that the model was not able to adequately represent.

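The two dummy variables in the model yield the following 2 × 2 matrix of regression equations (β_1, β_2 and β_3 denote the coefficients of Time_Period_i, Treated_i and their interaction term respectively):

Treated_i=0, Time_Period_i=0: y_i = β_0 + ϵ_i
Treated_i=0, Time_Period_i=1: y_i = β_0 + β_1 + ϵ_i
Treated_i=1, Time_Period_i=0: y_i = β_0 + β_2 + ϵ_i
Treated_i=1, Time_Period_i=1: y_i = β_0 + β_1 + β_2 + β_3 + ϵ_i
The matrix of possible regression equations produced by the two dummy variables (Image by Author)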
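The four scenarios can also be seen by writing out the row of regressors for each combination of the two dummies. The following is a minimal sketch; the helper name design_row is ours and is not part of any library.

```python
from itertools import product


def design_row(time_period, treated):
    """Regressors for one observation: [intercept, Time_Period, Treated, interaction]."""
    return [1, time_period, treated, time_period * treated]


# Enumerate the four (Treated, Time_Period) combinations.
for treated, time_period in product((0, 1), repeat=2):
    row = design_row(time_period, treated)
    print(f"Treated={treated}, Time_Period={time_period} -> {row}")
```

A 1 in a given position means that the corresponding coefficient is switched on for that scenario. The interaction column is 1 only when both dummies are 1.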

The DID model is trained using Ordinary Least Squares (OLS) regression.

For the trained (a.k.a. fitted) model, the corresponding expectations are as follows. The caps (^) above the coefficients indicate that they are the estimated (fitted) values of the corresponding coefficients. Replacing y_i with the expected value of y_i also allows us to drop the error term ϵ_i since in a well-behaved OLS regression model, the expected value of the error term is zero:

Treated_i=0, Time_Period_i=0: E(y_i) = β̂_0
Treated_i=0, Time_Period_i=1: E(y_i) = β̂_0 + β̂_1
Treated_i=1, Time_Period_i=0: E(y_i) = β̂_0 + β̂_2
Treated_i=1, Time_Period_i=1: E(y_i) = β̂_0 + β̂_1 + β̂_2 + β̂_3
The expected values (predictions) from the fitted regression model for each of the four scenarios yielded by the two dummy variables (Image by Author)

We wish to calculate the difference in the expected value of y_i between the before (pre-) and after (post-) treatment phases of the study.

For the treatment group, the difference in expectations works out as follows:

E(y_i | Treated_i=1, Time_Period_i=1) - E(y_i | Treated_i=1, Time_Period_i=0) = (β̂_0 + β̂_1 + β̂_2 + β̂_3) - (β̂_0 + β̂_2) = β̂_1 + β̂_3
The difference in estimated response within the treatment group between the after-treatment and before-treatment phases of the study (Image by Author)

Similarly, for the control group we have:

E(y_i | Treated_i=0, Time_Period_i=1) - E(y_i | Treated_i=0, Time_Period_i=0) = (β̂_0 + β̂_1) - β̂_0 = β̂_1
The difference in estimated response within the control group between the after-treatment and before-treatment phases of the study (Image by Author)

The difference between the two differences gives us the net effect of the treatment on the treatment group:

(β̂_1 + β̂_3) - β̂_1 = β̂_3
The expected value of the Difference-In-Difference effect between the treatment and control group (Image by Author)

We see that this difference-in-differences effect is the coefficient of the interaction term (Time_Period_i*Treated_i).

It is this result that gives the DID model much of its usefulness.

After the DID model is trained, the fitted coefficient of the interaction term (Time_Period_i*Treated_i) will give us the estimated difference-in-differences effect that we are seeking. The coefficient’s t-score and the corresponding p value will tell us whether the effect is significant, and if it is, we can construct the 95% or 99% confidence interval around the estimated coefficient using the coefficient’s standard error reported by the model.
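The claim that the interaction coefficient equals the difference of the two differences can be checked numerically. The sketch below uses synthetic data and arbitrarily chosen true coefficients (all of them assumptions made for illustration), fits the model by least squares with NumPy, and compares the fitted interaction coefficient with the difference-in-differences of the four sample means.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data (hypothetical): 200 observations, 50 in each of the four cells.
n = 200
time_period = np.tile([0, 1], n // 2)
treated = np.repeat([0, 1], n // 2)

# True coefficients chosen arbitrarily for this illustration.
beta_0, beta_1, beta_2, beta_3 = 10.0, 2.0, -1.0, 3.0
y = (beta_0 + beta_1 * time_period + beta_2 * treated
     + beta_3 * time_period * treated + rng.normal(0.0, 1.0, size=n))

# Design matrix: intercept, Time_Period, Treated, interaction term.
X = np.column_stack([np.ones(n), time_period, treated, time_period * treated])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def cell_mean(t, d):
    """Sample mean of y for Time_Period == t and Treated == d."""
    return y[(time_period == t) & (treated == d)].mean()


did_from_means = ((cell_mean(1, 1) - cell_mean(0, 1))
                  - (cell_mean(1, 0) - cell_mean(0, 0)))

print("Interaction coefficient from OLS:", beta_hat[3])
print("Difference-in-differences of cell means:", did_from_means)
```

The two printed numbers agree up to floating point error because the model has one parameter for each of the four cells, so its fitted values are exactly the four sample means.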

Let’s illustrate the procedure for building and training a Difference-In-Differences regression model using an interesting real world example.


How to build a Difference-In-Differences model to estimate the effect of coastal weather events on house prices

We’ll use the DID model to estimate the effect of coastal weather events on house prices in the United States. Specifically, we’ll analyze the effect of the 2005 Atlantic hurricane season, which was the most active hurricane season on record until 2020.

Incidentally, this topic has been extensively researched using a variety of methods. Some researchers have focused on the effect of a single storm or many storms on house prices in a single city or a single state, while others have zoomed out to a regional or national level. There are hyper-local studies of the effect of severe weather events on house prices in a single US county, while others have studied the effect of several years’ worth of severe weather events on house prices in several coastal cities. There is also an interesting recent study on estimating the impact of a distant but approaching hurricane on property prices.

Several of these studies have used the Difference-In-Differences regression model (or some variation or enhancement thereof). Interestingly, although perhaps unsurprisingly, the findings from these studies are diverse and sometimes contradictory, depending on the methodology used by the researchers and the spatial and temporal scope of the study.

Our approach to the problem

In the rest of this chapter, we will build a rather simple Difference-In-Differences regression model to study the effect of the 2005 hurricane season on the change in the House Price Index a.k.a. house price inflation in the coastal states that were heavily impacted by the hurricane season versus the ones that weren’t. Our model will be a simple one compared to the ones employed in the previous work in this area. Nevertheless, as we will soon see, we will arrive at the same sorts of results as obtained in the research literature in this area.

In our little experiment, the ‘treatment’ will mean being subjected to the full brunt of the 2005 hurricane season. The ‘unit’ being subjected to (or not subjected to) the treatment is a US state with a coastline on the sea. There are 24 such states in the United States:

States with a coastline to the sea (Source: MapChart under CC BY-SA 4.0)

Defining the criteria for being included in the Treatment group

We’ll decide whether a state falls in the treatment group by examining the actions taken by the US Federal Emergency Management Agency (FEMA) in that state during the 2005 Atlantic hurricane season.

FEMA provides direct assistance to individuals in counties that have suffered widespread damage due to disasters. This type of assistance is called Individual Assistance (IA) and differs from the other type of assistance that FEMA offers, called Public Assistance. We will count the number of counties in each coastal state that qualified for Individual Assistance from FEMA at any time during the 2005 Atlantic hurricane season. Here are those state-wise counts:

State-wise counts of counties qualifying for IA during the 2005 Atlantic hurricane season. Data source: List of disasters declared by FEMA in 2005 (Image by Author)

If a county qualified for IA more than once, we will count it each time. The rationale behind this repeated counting is that during each disaster, some of the damaged property may have been different from the property damaged during the previous disaster. Similarly, some of the rebuilt or repaired property may have been damaged again in a subsequent incident. Both cases can impact the resale value of the property. Additionally, multiple disaster events in the same county may, in theory and at least temporarily, make properties in that county less attractive to potential home buyers, thereby depressing prices or reducing the growth in prices. On the other hand, a reduction in transaction-worthy housing inventory in the county may (temporarily) increase house price inflation. Our regression model should help us determine which of these effects is dominant.

The table shown above contains a wide variability in counts and we are faced with the question of how to determine if a state was affected ‘enough’ to be considered a Treatment state. Should we consider New Hampshire with 9 affected counties as a Treatment state? What about California with 8 affected counties, or New York state with 11 affected counties? At the other end of the counts scale are the gulf states of Louisiana, Alabama and Mississippi which were by all accounts greatly affected and are clearly ‘Treatment’ group states.

We’ll try to resolve this question by drawing the line at the median of the counts. Any state with a count greater than or equal to the median (14) will fall into the treatment group. The rest will be part of the control group. Here is what the group-wise map looks like:
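The median rule is easy to express in code. The sketch below uses placeholder state names and made-up counts, not the FEMA figures from the table above, so the median it computes differs from the chapter’s value of 14.

```python
from statistics import median

# Hypothetical IA county counts per state (placeholders, not the FEMA figures).
ia_county_counts = {
    "State A": 0,
    "State B": 3,
    "State C": 8,
    "State D": 11,
    "State E": 20,
    "State F": 64,
    "State G": 90,
}

# Draw the line at the median of the counts.
threshold = median(ia_county_counts.values())

# 1 = treatment group (count >= median), 0 = control group.
disaster_affected = {
    state: int(count >= threshold)
    for state, count in ia_county_counts.items()
}

print("Median count:", threshold)
for state, flag in disaster_affected.items():
    print(state, flag)
```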

Treatment and control groups amongst the sea-facing states (Source: MapChart under CC BY-SA 4.0)

As we can see from the map, we will be dealing with a highly unbalanced data set, with the treatment group being far smaller than the control group. This will almost certainly influence the quality of the estimates produced by our DID model.

Setting up the Treatment column

Using the treatment group selection criteria outlined above, we’ll add a column called Disaster_Affected and set its value to 1 for states with a count ≥14, and to 0 for the rest:

(Image by Author)

Setting up the Time Period column

Next, we will add a Time_Period column which we will set to 0 to indicate the period before the start of the 2005 hurricane season, and to 1 to indicate the period after the end of the hurricane season. Notice below that we have duplicated the rows so that each state has a row with Time_Period=0 and a row with Time_Period=1.

(Image by Author)
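If the state-level table lives in a Pandas DataFrame, the row duplication can be sketched as follows. The three states and their Disaster_Affected flags are placeholders, and the column name State is our assumption about how the table is laid out.

```python
import pandas as pd

# Hypothetical state-level table; names and flags are placeholders.
states = pd.DataFrame({
    "State": ["State A", "State B", "State C"],
    "Disaster_Affected": [0, 1, 0],
})

# Make one copy of every row per period, then stack the two copies.
panel = pd.concat(
    [states.assign(Time_Period=0), states.assign(Time_Period=1)],
    ignore_index=True,
)

# Keep the two rows of each state next to each other.
panel = panel.sort_values(["State", "Time_Period"]).reset_index(drop=True)

print(panel)
```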

The methodology for calculating the value of the response variable

This section describes the procedure for calculating the values of the response variable y_i.

Our goal is to study the effect of the 2005 hurricane season on house prices in the coastal states. To that end, we’ll use the state-wise All-Transactions House Price Index published by the U.S. Federal Housing Finance Agency, which is available for download in the public domain from FRED, the data service of the Federal Reserve Bank of St. Louis. Here’s what the index looks like for the District of Columbia:

U.S. Federal Housing Finance Agency, All-Transactions House Price Index for the District of Columbia [DCSTHPI], retrieved from FRED, Federal Reserve Bank of St. Louis, June 12, 2022 (public domain)

We will access 24 of these time series data sets for the 24 states of interest and we’ll knock them together into a 24-state data panel as follows:

The House Price Index data for all seacoast states from Q1 1975 to Q1 2022 (Image by Author)

For our study, the time periods of interest are the 4 quarters immediately prior to the 2005 hurricane season and the 4 quarters immediately following it. The hurricane season itself ran from 8 June 2005 to 6 January 2006. Hence, we are interested in the house price index change across the quarters starting on 1 July 2004, 1 October 2004, 1 January 2005 and 1 April 2005, and then again across the 4 quarters following the 2005 season, namely those starting on 1 April 2006, 1 July 2006, 1 October 2006 and 1 January 2007. Let’s zoom into this region of interest to see what it looks like:

The four quarters of interest immediately preceding and immediately following the 2005 hurricane season. (Image by Author)

For each state, we will calculate the average quarter-over-quarter fractional change in the house price index over the two sets of quarters. Doing so will give us the value of the response variable, namely, the average Q-o-Q change in HPI in the pre-treatment and the post-treatment phases of the study for each state.

The Q-o-Q fractional change in the house price index between two consecutive quarters (i-1) and i can be calculated using the following formula:

HPI Fractional Change = [HPI_i - HPI_(i-1)] / HPI_(i-1)
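In code, the formula amounts to a single expression applied to each pair of consecutive index values. Here is a minimal sketch. The function name and the five index values are ours and are made up for illustration.

```python
def qoq_fractional_change(hpi):
    """Return (HPI_i - HPI_(i-1)) / HPI_(i-1) for each pair of consecutive quarters."""
    return [(curr - prev) / prev for prev, curr in zip(hpi, hpi[1:])]


# Hypothetical index values for five consecutive quarters.
hpi_values = [200.0, 210.0, 210.0, 231.0, 207.9]

changes = qoq_fractional_change(hpi_values)
print([round(change, 4) for change in changes])
```

Five index values give four Q-o-Q changes, which is why each block of 4 quarters of interest also needs the index value of the quarter just before it.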

Here are the Q-o-Q fractional change values for the 4 quarters of interest before and after the 2005 hurricane season. The highlighted cells illustrate the calculation for one of the quarters:

Calculation of the Q-o-Q fractional change in HPI for the quarters of interest (Image by Author)

Next, we take the vertical average of each block of 4 quarters to arrive at the average fractional change in HPI across 4 quarters both before and after the 2005 hurricane season. We repeat this calculation for each state to get the value of the response variable HPI_CHG for the pre-treatment and post-treatment phases.

Calculation of the average Q-o-Q fractional change in HPI across 4 quarters preceding and following the hurricane season (Image by Author)

Note that for each state, we have calculated two response values: the top value is the pre-treatment value and the bottom one is the post-treatment value. Thus, there is one value corresponding to Time_Period=0 and another one corresponding to Time_Period=1. Let’s include these average values in the data set we will use to train the DID model:
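For a single state, the two response values could be computed with Pandas roughly as follows. The quarterly index values are invented for this sketch and are not the FRED data; the pre- and post-season windows are the quarters listed earlier.

```python
import pandas as pd

# Hypothetical quarterly HPI values for one state (not real FRED data),
# indexed by the first day of each quarter from Q2 2004 to Q1 2007.
quarters = pd.date_range("2004-04-01", "2007-01-01", freq="QS")
hpi = pd.Series(
    [100.0, 102.0, 104.0, 106.0, 108.0, 110.0,
     111.0, 112.0, 113.0, 114.0, 115.0, 116.0],
    index=quarters,
)

# Q-o-Q fractional change: (HPI_i - HPI_(i-1)) / HPI_(i-1)
qoq_change = hpi.diff() / hpi.shift(1)

# Average over the 4 quarters before and the 4 quarters after the season.
hpi_chg_pre = qoq_change.loc["2004-07-01":"2005-04-01"].mean()
hpi_chg_post = qoq_change.loc["2006-04-01":"2007-01-01"].mean()

print("HPI_CHG for Time_Period=0:", hpi_chg_pre)
print("HPI_CHG for Time_Period=1:", hpi_chg_post)
```

Repeating this for every state and stacking the results would give two rows per state, one for each value of Time_Period.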

The data set to be used for training the Difference-In-Differences model (Image by Author)

The last column of the above data set, HPI_CHG, is our response variable y_i.

The data set is available for download from here.

Now that our data set is built, we can get back to the task of building and training the DID model.

Building the Difference-In-Differences model for house price inflation

Let’s start by stating the equation for our DID model:

HPI_CHG_i = β_0 + β_1*Time_Period_i + β_2*Disaster_Affected_i + β_3*(Time_Period_i*Disaster_Affected_i) + ϵ_i
The equation of the DID model used for estimating the effect of hurricane disasters on house price changes (Image by Author)

To build and train the model, we’ll use Python and the Python libraries Pandas, Patsy and statsmodels.

Let’s begin by importing all the required packages:

import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm

Next, we’ll load the data set into a Pandas DataFrame as follows:

df = pd.read_csv('us_fred_coastal_us_states_avg_hpi_before_after_2005.csv', header=0)

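Form the regression expression in Patsy syntax. Patsy adds the intercept column to the design matrix automatically, so we do not need to specify it. (In Patsy, Time_Period*Disaster_Affected already expands to the two main effects plus their interaction, so listing the main effects explicitly is redundant but harmless.)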

reg_exp = 'HPI_CHG ~ Time_Period + Disaster_Affected + Time_Period*Disaster_Affected'

Using Patsy, carve out the training matrices:

y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')

Build the DID model:

did_model = sm.OLS(endog=y_train, exog=X_train)

Train the model:

did_model_results = did_model.fit()

Print the training summary:

print(did_model_results.summary())

We see the following output (I have highlighted the interesting parts):

Training output of the Difference-In-Differences regression model (Image by Author)
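Besides reading the summary table, we can pull the numbers for the interaction term directly from the results object. The sketch below is self-contained, so it fits the same kind of model on a synthetic stand-in data set (the HPI_CHG values are made up) through the statsmodels formula interface rather than on the chapter’s CSV file. The same attributes (params, bse, tvalues, pvalues) and the conf_int() method are available on did_model_results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)

# Synthetic stand-in for the chapter's data set (all values are made up).
df = pd.DataFrame({
    "Time_Period": np.tile([0, 1], 24),
    "Disaster_Affected": np.repeat([0, 1], 24),
})
df["HPI_CHG"] = (
    0.04
    - 0.03 * df["Time_Period"]
    - 0.01 * df["Disaster_Affected"]
    + 0.02 * df["Time_Period"] * df["Disaster_Affected"]
    + rng.normal(0.0, 0.005, size=len(df))
)

results = smf.ols("HPI_CHG ~ Time_Period * Disaster_Affected", data=df).fit()

# Patsy names the interaction column with a colon.
term = "Time_Period:Disaster_Affected"
ci_low, ci_high = results.conf_int(alpha=0.05).loc[term]

print("coefficient:", results.params[term])
print("std error:  ", results.bse[term])
print("t-score:    ", results.tvalues[term])
print("p-value:    ", results.pvalues[term])
print("95% CI:     ", (ci_low, ci_high))
```

Passing alpha=0.01 to conf_int() would give the 99% confidence interval instead.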

How to interpret the training output of the DID model

We see that the adjusted R-squared is 0.504. The model has been able to explain about half of the variance in the response variable HPI_CHG, which is a respectable result for such a simple model. The p value of the F-statistic is 1.88e-07, which is highly significant, leading us to conclude that the model’s variables are jointly significant and that together they do a much better job of explaining the variance in HPI_CHG than a simple mean model.

We also note that all coefficients are statistically significant, as indicated by their p values, which are all smaller than 0.05.

The equation of the fitted model is as follows:

HPI_CHG_i = 0.0371 - 0.0278*Time_Period_i - 0.0139*Disaster_Affected_i + 0.0197*(Time_Period_i*Disaster_Affected_i) + e_i
The equation of the fitted Difference-In-Differences model (Image by Author)

Time_Period and Disaster_Affected are 0/1 dummy variables, which gives four possible combinations. Let’s see how to interpret each of them. We’ll also switch to working with expected values of HPI_CHG, which lets us drop the subscript i as well as the residual error term e_i.

Time_Period_i=0 and Disaster_Affected_i=0

We get the following equation:

E(HPI_CHG) = 0.0371 - 0.0278*(0) - 0.0139*(0) + 0.0197*(0*0) = 0.0371
Expected Q-o-Q change in house price index in the control group states during the pre-hurricane period (Image by Author)

This equation gives us the estimated mean inflation in house prices in the control group during the four quarters immediately preceding the 2005 hurricane season. The value of the estimated mean inflation is simply the intercept of regression: 0.0371, or 3.71%.

Time_Period_i=1 and Disaster_Affected_i=0

E(HPI_CHG) = 0.0371 - 0.0278*(1) - 0.0139*(0) + 0.0197*(1*0) = 0.0371 - 0.0278
Expected Q-o-Q change in house price index in the control group states during the post-hurricane period (Image by Author)

This equation gives us the estimated mean inflation in house prices in the control group states in the post-treatment period, i.e. during the four quarters following the hurricane season. The value of the estimated mean inflation is 0.0371 - 0.0278 = 0.0093, or 0.93%.

Time_Period_i=0 and Disaster_Affected_i=1

E(HPI_CHG) = 0.0371 - 0.0278*(0) - 0.0139*(1) + 0.0197*(0*1) = 0.0371 - 0.0139
Expected Q-o-Q change in house price index in the treatment group states during the pre-hurricane period (Image by Author)

This equation gives us the estimated mean house price inflation in the treatment group states during the four quarters prior to the start of the hurricane season. The value of this inflation is 0.0371 - 0.0139 = 0.0232, or 2.32%.

Time_Period_i=1 and Disaster_Affected_i=1

E(HPI_CHG) = 0.0371 - 0.0278*(1) - 0.0139*(1) + 0.0197*(1*1) = 0.0371 - 0.0278 - 0.0139 + 0.0197
Expected Q-o-Q change in house price index in the treatment group states during the post-hurricane period (Image by Author)

This equation gives us the estimated mean house price inflation in the treatment group during the four quarters following the end of the hurricane season. The value of this inflation is 0.0371 - 0.0278 - 0.0139 + 0.0197 = 0.0151, or 1.51%.

Let’s tabulate our findings:

Period | Control group | Treatment group
Pre-hurricane season | 3.71% | 2.32%
Post-hurricane season | 0.93% | 1.51%
Difference (post-season - pre-season) | -2.78% | -0.81%
Estimated change in House Price Index in the Treatment and Control groups before and after the Treatment (Image by Author)

The third row of the table shows the vertical differences (post-season - pre-season) in the estimated values.

We see that for the Disaster Affected group, house price inflation in the four quarters following the hurricane season was lower by 0.81 percentage points than in the four quarters prior to the start of the hurricane season.

For the non Disaster Affected group, house price inflation in the four quarters following the hurricane season was lower by 2.78 percentage points than in the four quarters prior to the start of the hurricane season.

The difference-in-differences effect between the two groups is:

(-0.0081) - (-0.0278) = 0.0197, or 1.97%
The estimated Difference-In-Differences effect (Image by Author)
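The arithmetic of the last few sections can be reproduced in a few lines of Python. The sketch below plugs the four fitted coefficients reported in this chapter into the model equation; the function and variable names are ours.

```python
# Fitted coefficients as reported in this chapter.
b_intercept = 0.0371
b_time_period = -0.0278
b_disaster_affected = -0.0139
b_interaction = 0.0197


def expected_hpi_chg(time_period, disaster_affected):
    """Expected HPI_CHG from the fitted model for one combination of dummies."""
    return (b_intercept
            + b_time_period * time_period
            + b_disaster_affected * disaster_affected
            + b_interaction * time_period * disaster_affected)


# Post-season minus pre-season difference within each group.
control_diff = expected_hpi_chg(1, 0) - expected_hpi_chg(0, 0)
treated_diff = expected_hpi_chg(1, 1) - expected_hpi_chg(0, 1)

# Difference of the two differences.
did_effect = treated_diff - control_diff

print("Control group difference:  ", round(control_diff, 4))
print("Treatment group difference:", round(treated_diff, 4))
print("Difference-in-differences: ", round(did_effect, 4))
```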

The following graphic may help in visualizing the various estimated values:

Estimated change in House Price Index in the Treatment and Control groups before and after the Hurricane season a.k.a. Treatment (Image by Author)

The value of 1.97% is exactly the value of the coefficient of Time_Period*Disaster_Affected interaction term reported by the trained DID regression model:

The fitted DID model (Image by Author)

The estimated difference-in-differences of 1.97% suggests that house price inflation in the states that were especially affected by the 2005 hurricane season cooled down less than in the rest of the coastal states after the season ended. One way to explain this effect is by noting that, for a given level of demand, prices tend to rise faster when supply shrinks. Due to the extensive property damage suffered by the treatment group states, the resulting reduction in house inventory may have temporarily fed house price inflation in those states during the four quarters immediately following the end of the hurricane season.


Here’s the source code used in this chapter:

import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
# Load the data set into a Pandas DataFrame
df = pd.read_csv('us_fred_coastal_us_states_avg_hpi_before_after_2005.csv', header=0)
# Print it
print(df)
# Form the regression expression in Patsy syntax. Patsy adds the intercept
# column to the design matrix automatically
reg_exp = 'HPI_CHG ~ Time_Period + Disaster_Affected + Time_Period*Disaster_Affected'
# Carve out the training matrices
y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
# Build the DID model
did_model = sm.OLS(endog=y_train, exog=X_train)
# Train the model
did_model_results = did_model.fit()
# Print out the training results
print(did_model_results.summary())

Citations and Copyrights

Data set

All-Transactions House Price Index for various US states, courtesy of U.S. Federal Housing Finance Agency, retrieved from FRED, Federal Reserve Bank of St. Louis, June 12, 2022 (available in public domain). The curated version of the data set used in this chapter is available for download from here.

Paper and Book Links

Card, David and Krueger, Alan, (1994), Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania, American Economic Review, 84, issue 4, p. 772–93.

Ortega, Francesc and Taspinar, Suleyman, Rising Sea Levels and Sinking Property Values: The Effects of Hurricane Sandy on New York’s Housing Market (March 29, 2018). Available at SSRN or http://dx.doi.org/10.2139/ssrn.3074762

Liao, Yanjun and Graff Zivin, Joshua and Panassie, Yann, How Hurricanes Sweep Up Housing Markets: Evidence from Florida. Available at SSRN or http://dx.doi.org/10.2139/ssrn.4103049

Fisher, J.D., Rutledge, S.R. The impact of Hurricanes on the value of commercial real estate. Bus Econ 56, 129–145 (2021). https://doi.org/10.1057/s11369-021-00212-9

Seung Kyum Kim, Richard B. Peiser, The implication of the increase in storm frequency and intensity to coastal housing markets, Journal of Flood Risk Management, 26 May 2020, https://doi.org/10.1111/jfr3.12626

Anthony Murphy & Eric Strobl, 2010. The impact of hurricanes on housing prices: evidence from U.S. coastal cities, Working Papers 1009, Federal Reserve Bank of Dallas.

Fang, L., Li, L. & Yavas, A. The Impact of Distant Hurricane on Local Housing Markets. J Real Estate Finan Econ (2021). https://doi.org/10.1007/s11146-021-09843-3

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

