A Guide To Building Linear Models For Discontinuous Data

We’ll study how to model one commonly occurring case of discontinuous data


Most real-world processes that generate data are at the mercy of real-world events. Stock prices plunge and soar in response to quarterly results announcements and news of natural and man-made disasters. Macroeconomic indicators such as unemployment and oil prices undergo sudden trend reversals in response to the onset of financial crises, recessions, and supply shocks.

Take any real-world time series collected over a long enough period of time and you would be hard-pressed not to find at least one rise or fall in trend sharp enough that it essentially breaks the series into a ‘before’ and an ‘after’ segment.

In such cases, the analyst’s job is not only to build a statistical model that is robust to such discontinuities but also to estimate the effect of the event that caused them.

Researchers have developed numerous techniques to achieve these twin goals. We’ll study one such technique in the context of a commonly occurring case of discontinuous (or nearly discontinuous) data.

Let’s take a look at the following chart that shows the monthly unemployment rate in the United States from January 1, 2002 to January 1, 2020. The data contains a major recession, the ‘Great Recession’, which extended from December 2007 to June 2009.

U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATE], retrieved from FRED, Federal Reserve Bank of St. Louis; June 18, 2022 (public domain data)

If we wish to construct a regression model for this data, we would have at least the following different ways of going about it:

  1. We could construct a piece-wise regression model for the three distinct sections of the above data set, namely the sections before, during, and after the recession.
  2. We could use a regression technique known as a Regression Kink Design.
  3. We could experiment with a Hidden Markov Model, specifically a 2-state Markov Switching Autoregression (MSAR) model. The two states would denote the recessionary and non-recessionary phases. The MSAR model should also be able to represent the auto-correlation that is bound to be present in such a time series (a minimal sketch of this approach is shown just after this list).
  4. Finally, we could simply choose not to model the data during the recessionary period, thereby essentially compressing the recessionary phase to a zero-width region. Contrary to the first three approaches, we would not be interested in the behavior of the time series during the recession, and we would have to be prepared to lose the information contained in the data points that lie within the recessionary period.
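For concreteness, here is a minimal, illustrative sketch of how the MSAR idea in option 3 could be set up using statsmodels’ MarkovAutoregression class. The raw file name ‘UNRATE.csv’, the order-1 autoregression, and the switching-variance setting are all assumptions made purely for illustration; this is not the path we pursue in the rest of the article.

import pandas as pd
import statsmodels.api as sm

# Hypothetical file name: the raw, uncut UNRATE series downloaded from FRED,
# including the recessionary observations that the rest of this article drops
full_df = pd.read_csv('UNRATE.csv', header=0)

msar_model = sm.tsa.MarkovAutoregression(
    full_df['UNRATE'],        # the complete unemployment rate series
    k_regimes=2,              # two states: recessionary and non-recessionary
    order=1,                  # an order-1 autoregression (an illustrative choice)
    switching_variance=True)  # let the error variance differ across regimes
msar_results = msar_model.fit()
print(msar_results.summary())

# Smoothed probability of being in each regime at each time step
print(msar_results.smoothed_marginal_probabilities)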

But if we make this simplifying and, some might argue, incorrect decision to drop the data points in the recessionary phase of the time series, it opens the door to the following interesting linear model:

UNRATE = β_0 + β_1*Time_Period + β_2*Epoch + ϵ

A regression discontinuity model for the unemployment rate (Image by Author)

In the above model:

  • β_0 is the intercept.
  • β_1 is the coefficient of the Time_Period variable. Time_Period is a positive integer that takes the values 1, 2, 3, and so on. We have introduced this variable to capture what appears to be a month-over-month time trend in both sections of the data, before and after the recession.
  • β_2 is the coefficient of the dummy variable Epoch. We set Epoch to 0 for the portion of the data set that lies prior to the start of the recession, and we set Epoch to 1 for the portion that lies after the end of the recessionary phase.
  • ϵ is the error term which captures the variance in UNRATE that our model cannot explain.

Here’s a link to the data set containing the unemployment rate (UNRATE), Time_Period and Epoch variables. It does not contain the data points that lie within the recessionary phase.
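If you prefer to prepare the data set yourself, the following sketch shows one way the Time_Period and Epoch columns could be constructed while dropping the recessionary observations. The raw file name ‘UNRATE.csv’ and its column names (DATE, UNRATE) are assumptions based on how FRED typically exports the series.

import pandas as pd

# Hypothetical raw export from FRED with DATE and UNRATE columns
raw_df = pd.read_csv('UNRATE.csv', header=0, parse_dates=['DATE'])

# Keep only the January 2002 to January 2020 window used in this article
raw_df = raw_df[(raw_df['DATE'] >= '2002-01-01') & (raw_df['DATE'] <= '2020-01-01')]

# Drop the observations that fall within the Great Recession
# (December 2007 through June 2009)
recession_start = pd.Timestamp('2007-12-01')
recession_end = pd.Timestamp('2009-06-30')
df = raw_df[(raw_df['DATE'] < recession_start) | (raw_df['DATE'] > recession_end)].copy()

# Time_Period: a positive integer index 1, 2, 3, ... over the retained rows
df['Time_Period'] = range(1, len(df) + 1)

# Epoch: 0 before the recession, 1 after it
df['Epoch'] = (df['DATE'] > recession_end).astype(int)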

Building and training the regression model

Let’s build and train this model using Python, Pandas and statsmodels.

We’ll begin by loading the data set into a Pandas DataFrame:

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.graphics.tsaplots as tsa
from matplotlib import pyplot as plt

# Load the curated data set (with the recession-era rows already removed)
df = pd.read_csv('unemployment_rate_us_fred.csv', header=0)

Next, let’s construct the regression expression in Patsy syntax.

reg_exp = 'UNRATE ~ Time_Period + Epoch'

In the above expression, notice that the regression intercept is not explicitly mentioned but statsmodels will automatically include it when we use it to build the model.
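As a side note, the intercept can also be written out, or removed, explicitly in Patsy syntax. The two expressions below are purely illustrative; the rest of the article sticks with the expression above.

# Equivalent expression with the intercept made explicit
reg_exp_explicit = 'UNRATE ~ 1 + Time_Period + Epoch'

# And if one wanted to fit the model without an intercept
reg_exp_no_intercept = 'UNRATE ~ Time_Period + Epoch - 1'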

Let’s build the OLS model using statsmodels. We will pass the regression expression and the data set as parameters:

regdis_model = smf.ols(formula=reg_exp, data=df)

Next, let’s train the model and print the training summary:

regdis_model_results = regdis_model.fit()
print(regdis_model_results.summary())

We see the following output. I have highlighted a few interesting areas in the output:

The training output of the regression model (Image by Author)

Analyzing model performance

At the top-right, we observe that the adjusted R-squared is .925. The model has been able to explain 92.5% of the variance in the unemployment rate.

Right below the adjusted R-squared, we see that statsmodels has reported the output of the F-test for regression analysis. The F-statistic has been calculated for us and its value is 1220 with a p value of 1.09E-111. The F-test’s output indicates that, at p < .001, the model’s coefficients are jointly significant, meaning the model is doing a better job (in this case, a much better job) at explaining the variance in UNRATE than a mean model.
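If you would rather pull these numbers out programmatically instead of reading them off the summary, the fitted results object exposes them directly:

# Adjusted R-squared, F-statistic, and the F-test's p-value
print(regdis_model_results.rsquared_adj)
print(regdis_model_results.fvalue)
print(regdis_model_results.f_pvalue)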

Let’s turn our attention to the central portion of the output, where we see the coefficient estimates, their standard errors, their p values and the confidence intervals around the estimates. We see that the estimates of all three coefficients are significant at a p value of < .001.

The coefficient of Time_Period (-0.0525) indicates that the unemployment rate is estimated to decrease by a mean of 0.0525 percentage points for each unit increase in Time_Period on either side of the recessionary phase. Recollect that a unit time period is one month in our data set.

Finally, the coefficient of Epoch offers an interesting insight. Its value is 6.3757. Our model has estimated that the Great Recession of 2008–2009 caused a mean increase in the U.S. unemployment rate of 6.3757 percentage points, with a fairly tight 95% confidence interval that ranges from 6.111 to 6.641 percentage points.
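As with the goodness-of-fit statistics, the coefficient estimates, their p-values and their 95% confidence intervals can be read off the results object directly:

# Coefficient estimates, p-values and 95% confidence intervals
print(regdis_model_results.params)
print(regdis_model_results.pvalues)
print(regdis_model_results.conf_int(alpha=0.05))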

Analysis of residuals

Let’s look at the results of the Jarque-Bera and Omnibus tests of normality of residual errors reported by statsmodels:

The p-values of both tests are comfortably above .05, so we fail to reject the tests’ default (null) hypothesis that the residual errors are normally distributed. That means we can rely upon the confidence interval estimates reported by the model. That’s good news.
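These diagnostics, along with the Durbin-Watson statistic discussed next, can also be computed directly on the model’s residuals using functions in statsmodels.stats.stattools:

from statsmodels.stats.stattools import jarque_bera, omni_normtest, durbin_watson

resid = regdis_model_results.resid

# Jarque-Bera test: returns (statistic, p-value, skewness, kurtosis)
print(jarque_bera(resid))

# Omnibus test of normality: returns (statistic, p-value)
print(omni_normtest(resid))

# Durbin-Watson statistic: values well below 2.0 suggest positive
# auto-correlation in the residual errors
print(durbin_watson(resid))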

However, the Durbin-Watson test’s statistic comes out at < 2.0 indicating a positive auto-correlation among the residual errors. The test’s finding is vindicated by the following auto-correlation plot:

Auto-correlation plot of the residual errors (Image by Author)

The above plot can be produced by executing the following two lines of code:

# Plot the auto-correlation of the model's residual errors
tsa.plot_acf(regdis_model_results.resid)
plt.show()

The plot shows a strong auto-correlation at lag 1, which means that each value in the time series of residual errors is correlated with the value that immediately precedes it. Since the values at lags 1 and 2 are correlated, and those at lags 2 and 3 are similarly correlated, the values at lags 1 and 3 also end up being correlated, but to a lesser extent. Thus we see a gently sloping curve of correlations at the other lags as well.
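To put numbers to the plot, the auto-correlation values at the first few lags can be computed directly (the choice of 10 lags below is arbitrary):

from statsmodels.tsa.stattools import acf

# Auto-correlation of the residual errors at lags 0 through 10
print(acf(regdis_model_results.resid, nlags=10))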

A strong auto-correlation of this type in the residuals implies that our model is missing one or more key explanatory variables, or that the entire functional form of the model may need to be revisited. Perhaps one of the other three kinds of models suggested at the beginning of the article would prove more effective at explaining the variance in UNRATE.

Nevertheless, our plucky little linear model seems to have held up well under scrutiny. It has allowed us to model the discontinuity that we introduced into the unemployment data set. It has not only estimated the trend in unemployment on both sides of the Great Recession but also given us a means to scientifically estimate the impact of the Great Recession on the unemployment rate.


Citations and Copyrights

Data set

U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATE], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/UNRATE. (available in public domain). The curated version of the data set used in this article is available for download from here.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.


PREVIOUS: An Introduction To The Difference-In-Differences Regression Model

NEXT: The Quantile Regression Model


UP: Table of Contents