What Are Dummy Variables And How To Use Them In A Regression Model

And how to interpret the regression coefficients of dummy variables


A dummy variable is a binary variable that takes a value of 0 or 1. Such variables are added to a regression model to represent factors that are binary in nature, i.e. they are either observed or not observed.

Within this broad definition lie several interesting use cases. Here are some of them:

  1. For representing a Yes/No property: To indicate whether a data point has a certain property. For example, a dummy variable can be used to indicate whether a car engine is of type ‘Standard’ or ‘Turbo’. Or if a participant in a drug trial belongs to the placebo group or the treatment group.
  2. For representing a categorical value: A related use of dummies is to indicate which one of a set of categorical values a data point belongs to. For example, a vehicle’s body style could be one of convertible, hatchback, coupe, sedan, or wagon. In this case, we would add five dummy variables to the data set, one for each of the 5 body styles and we would ‘one hot encode’ this five element vector of dummies. Thus, the vector [0, 1, 0, 0, 0] would represent all hatchbacks in the data set.
  3. For representing an ordered categorical value: An extension of the use of dummies to represent categorical data is one where the categories are ordered. Suppose our Automobiles data set contains cars with engines having 2, 3, 4, 5, 6, 8 or 12 cylinders. Here, we need to also capture the information contained in the ordering. We will soon see how to do this.
  4. For representing a seasonal period: A dummy variable can be added to represent each one of the possibly many seasonal periods contained in the data. For example, the flow of traffic through intersections often exhibits seasonality at an hourly level (it is highest during the morning and evening rush hours) and also a weekly period (lowest on Sundays). Adding dummy variables to the data for each of the two seasonal periods will allow you to explain away much of the variation in the traffic flow that is attributable to daily and weekly variations.
  5. For representing Fixed Effects: While building regression models for panel data sets, dummies can be used to represent ‘unit-specific’ and ‘time-specific’ effects, especially in a Fixed Effects regression model.
  6. For representing Treatment Effects: In a treatment effects model, dummy variables can be used to represent the effect of time (i.e. the effect before and after the treatment is applied), the effect of group membership (whether the participant received the treatment or the placebo), and the effect of the interaction between time and group membership.
  7. In regression discontinuity designs: This is best explained with an example. Imagine a data set of monthly unemployment rate numbers that contains a sudden, sharp increase in the unemployment rate caused by a brief and severe recession. For this data, a regression model used for modeling the unemployment rate can deploy a dummy variable to estimate the expected impact of the recession on the unemployment rate.
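As a quick, runnable sketch of use case 2 above, here is how one-hot encoding turns a body_style column into five dummy columns. The data frame below is a tiny made-up sample, not the actual Automobiles data:

```python
import pandas as pd

# A tiny made-up sample covering the five body styles
df = pd.DataFrame({'body_style': ['convertible', 'coupe', 'hatchback',
                                  'sedan', 'wagon', 'hatchback']})

# One dummy column per category; dtype=int keeps the dummies as 0/1 integers
dummies = pd.get_dummies(df['body_style'], prefix='body_style', dtype=int)

# Pandas orders the columns alphabetically: convertible, coupe, hatchback, sedan, wagon.
# Every hatchback row therefore becomes the one-hot vector [0, 0, 1, 0, 0]
print(dummies.loc[2].tolist())  # [0, 0, 1, 0, 0]
```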

In this chapter, we’ll explain how to use dummy variables in the first three situations, namely:

  1. For representing a Yes/No property
  2. For representing a categorical value
  3. For representing an ordered categorical value

The last four use cases, namely the use of dummies to deseasonalize data, to represent fixed effects and treatment effects, and to model regression discontinuities, all deserve their own separate chapters.

Incidentally, the use of dummies for representing Fixed Effects is covered here:

Understanding the Fixed Effects Regression Model

We will cover the use of dummies in building a Treatment Effects model and in modeling the effect of discontinuities in a different chapter.

Let’s dive into the first use case.

How to use a dummy variable for representing a Yes/No property

We’ll illustrate the procedure by using the following data set of vehicles containing specifications of 200+ automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle:

The automobiles data set (Source: UC Irvine)

We’ll consider a subset of this data consisting of the following seven variables:
make
aspiration
body_style
curb_weight
num_of_cylinders
engine_size
price

A 7-variable subset of the Automobiles data set. (Source: UC Irvine)

The above 7-variable version can be downloaded from here.

In the above data set, the aspiration variable is of type Standard or Turbo. Our regression goal is to estimate the effect of aspiration on vehicle price. To that end, we will introduce a dummy variable to encode aspiration as follows:

aspiration_std=1 when aspiration is standard, and 0 otherwise.
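Such a dummy can also be created by hand; a minimal sketch, assuming the raw column stores the strings 'std' and 'turbo' as in the UCI file:

```python
import pandas as pd

# Made-up three-row sample, just to show the encoding
df = pd.DataFrame({'aspiration': ['std', 'turbo', 'std']})

# aspiration_std = 1 when aspiration is standard, and 0 otherwise
df['aspiration_std'] = (df['aspiration'] == 'std').astype(int)
print(df['aspiration_std'].tolist())  # [1, 0, 1]
```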

We’ll use the Python-based Pandas library to load the data set into memory as a DataFrame. Then we’ll use the statsmodels library to build a simple linear regression model in which the response variable is price, and the regression variable is aspiration_std (plus the intercept of regression).

Let’s start by importing all the required packages.

import pandas as pd
import statsmodels.formula.api as smf

Let’s import the 7-variable subset of the automobiles data set into a DataFrame:

df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)

We’ll add dummy variable columns to represent the aspiration variable.

df_with_dummies = pd.get_dummies(data=df, columns=['aspiration'])
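One caveat: recent versions of pandas return True/False columns from get_dummies by default. If you want the 0/1 integer dummies shown in this chapter, pass dtype=int (a sketch on a made-up two-row frame):

```python
import pandas as pd

# Made-up two-row frame, just to show the dtype option
df = pd.DataFrame({'aspiration': ['std', 'turbo'], 'price': [12000, 16000]})

# dtype=int forces 0/1 integer dummies instead of True/False booleans
df_with_dummies = pd.get_dummies(data=df, columns=['aspiration'], dtype=int)
print(df_with_dummies['aspiration_std'].tolist())  # [1, 0]
```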

Print out the dummy-augmented data set.

print(df_with_dummies)

We see the following output. I have highlighted the two dummy variable columns added by Pandas:

The dummy-augmented data set (Image by Author)

Let’s construct the regression expression. The intercept of regression is added automatically later on by the model.

reg_exp = 'price ~ aspiration_std'

Notice that we have added only one dummy variable, aspiration_std, and not both aspiration_std and aspiration_turbo. We did this to avoid perfect collinearity: every vehicle engine in the data set is either of type turbo or of type standard; there is no third type. In this case, the regression intercept captures the effect of aspiration_turbo. Specifically, the estimated value of the regression intercept in the trained model is the estimated mean price of all turbo type automobiles.

Alternatively, we could have added both aspiration_std and aspiration_turbo and left out the regression intercept. In this latter case, because the model would not have the regression intercept, we would not be able to use the R-squared value to judge its goodness-of-fit.
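A third option: let Patsy do the dummy coding and the level dropping for us by writing the categorical variable as C(...) in the formula. The sketch below uses a tiny made-up data set, not the real automobiles data; Treatment(reference='turbo') makes turbo the dropped baseline level so that, as in this chapter, the intercept captures the mean price of turbos:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny made-up data set: two vehicles of each aspiration type
df = pd.DataFrame({
    'aspiration': ['std', 'std', 'turbo', 'turbo'],
    'price': [10000.0, 12000.0, 15000.0, 17000.0]})

# C(...) dummy-codes aspiration; Treatment(reference='turbo') drops the turbo level
results = smf.ols("price ~ C(aspiration, Treatment(reference='turbo'))", data=df).fit()

# The intercept is the mean turbo price (16000); the std coefficient is the
# deviation of the mean std price from it (11000 - 16000 = -5000)
print(results.params['Intercept'])
```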

Let’s build the Ordinary Least Squares Regression model on this dummies augmented dataset:

olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)

Even though we have passed the entire 7-variable data set into this model, internally, statsmodels will use the regression expression (reg_exp) to carve out only the columns of interest.

Let’s train the model:

olsr_model_results = olsr_model.fit()

Let’s print out the training summary.

print(olsr_model_results.summary())

We see the following output. I have highlighted the parts which we will be examining closely:

Training summary of the OLSR model (Image by Author)

How to interpret the model training summary

The first thing we notice is that the adjusted R-squared is 0.027. The aspiration variable has been able to explain just a little under 3% of the variance in the automobile price. That seems awfully small, but we do not need to read too much into the low value of adjusted R-squared. Recollect that our goal was to estimate the effect of aspiration on price. We never expected aspiration to, by itself, explain away much of the variance in price. Besides, notice that the F-statistic’s p value is significant at .0107, indicating that even this very simple linear model has been able to fit the data better than the mean model (which is basically a flat horizontal line passing through the mean value of price).

Next, we notice that the model’s regression intercept and the coefficient of aspiration_std are both statistically significant (i.e. non-zero) at p values of less than .001 and .011 respectively. That is great news. Let’s see how to interpret the values of these coefficients.

How to interpret the coefficient of the dummy variable in the regression model

Recollect that we had left out the dummy variable aspiration_turbo from the model to avoid perfect collinearity. By leaving out aspiration_turbo, we have given the job of storing the mean price of the turbos to the regression model’s intercept. The regression intercept is 16250, indicating that the estimated mean price of turbos is $16250.

We need to interpret the coefficients of all dummy variables in the model with reference to the value of the intercept.

In our case, there is only one dummy, aspiration_std. Its value is -3712.62. The negative sign indicates that automobiles having a ‘standard’ type aspiration are on average $3712.62 less expensive than those that have a ‘turbo’ type aspiration. The estimated mean price of turbos is $16250. Hence, the estimated mean price of non-turbos is $16250 - $3712.62 = $12,537.38.
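With a single dummy regressor, the fitted intercept and intercept-plus-coefficient are exactly the per-group sample means, which is easy to verify on a toy data set (made-up prices, not the real ones):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Two made-up vehicles per group; aspiration_std = 0 marks the turbos
df = pd.DataFrame({'aspiration_std': [1, 1, 0, 0],
                   'price': [9000.0, 11000.0, 15000.0, 17000.0]})
results = smf.ols('price ~ aspiration_std', data=df).fit()

group_means = df.groupby('aspiration_std')['price'].mean()

# Intercept == mean price of the dropped category (the turbos)
print(results.params['Intercept'], group_means[0])
# Intercept + coefficient == mean price of the standard-aspiration group
print(results.params['Intercept'] + results.params['aspiration_std'], group_means[1])
```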

Using statistical notation, we can represent the two means as follows:

E(price|aspiration=’standard’) = $12,537.38

This estimate has the following 95% confidence interval around the mean:

[$16250 - $6555.64 = $9,694.36, $16250 - $869.607 = $15,380.393].

We took the values -$6555.64 and -$869.607 from the CI portion of the model training output.
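The same bounds can be pulled out programmatically: conf_int() on the fitted results returns the lower and upper 95% limits for every coefficient. A sketch on made-up data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up six-row sample, three vehicles per aspiration group
df = pd.DataFrame({'aspiration_std': [1, 1, 1, 0, 0, 0],
                   'price': [9000.0, 10000.0, 11000.0, 15000.0, 16000.0, 17000.0]})
results = smf.ols('price ~ aspiration_std', data=df).fit()

# Each row of conf_int() holds [lower 95% bound, upper 95% bound] for a coefficient
ci = results.conf_int(alpha=0.05)
print(ci.loc['aspiration_std'].tolist())
```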

For turbos, the expectation and the CIs work out as follows:

E(price|aspiration=’turbo’) = $16250 with a 95% CI of [$13700, $18800].

The fitted model’s regression equation is as follows:

price = -3712.62*aspiration_std + 16250 + e

Where ‘e’ contains the residual error of regression.


Next, let’s look at the use of dummy variables to represent categorical data.

How to use dummy variables for representing a categorical regression variable

Suppose we wish to estimate the effect of body_style on price. The body_style variable is categorical, with the following set of values: [convertible, hardtop, hatchback, sedan, wagon]. Our overall strategy for representing body_style will be similar to that for aspiration. So let’s dive straight into the implementation. We’ll continue working with the Pandas DataFrame that contains the 7-variable automobiles data set.

Let’s augment the DataFrame with dummy variable columns to represent body_style:

df_with_dummies = pd.get_dummies(data=df, columns=['body_style'])

Print out the dummy-augmented data set:

print(df_with_dummies)

We see the following output:

The dummy-augmented data set (Image by Author)

Notice the newly added dummy variable columns, one for each body_style.

Next, we’ll construct the regression equation in Patsy syntax. As before, we’ll leave out one dummy variable (body_style_convertible) to avoid perfect collinearity. The regression model’s intercept will capture the effect of body_style_convertible.

reg_exp = 'price ~ body_style_hardtop + body_style_hatchback + body_style_sedan + body_style_wagon'

Let’s build the OLS regression model:

olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)

Let’s train the model:

olsr_model_results = olsr_model.fit()

And let’s print out the training summary:

print(olsr_model_results.summary())

We see the following output:

Training summary of the OLSR model (Image by Author)

As before, we won’t focus on the adjusted R-squared. Instead, let’s look at the F-statistic and note that it is significant at a p value of < .001. It indicates that irrespective of the value of R-squared, the variables we have included in the model have been able to do a better job of explaining the variance in price than a simple mean model. With this important piece of due-diligence done and out of the way, let’s dig into the coefficients of all the variables.

How to interpret the coefficients of dummy variables

Let’s turn our gaze toward the fitted model’s coefficients. The estimated intercept is 21890. The intercept is the estimated mean price of convertibles since that was the dummy that we dropped from the regression equation. This estimate is significant at a p < .001. The 95% CI for this estimate is [$16000, $27800].

The coefficients of the four style-specific dummies for hardtop, hatchback, sedan and wagon represent the extent to which the mean of the corresponding style deviates from the estimated mean price of convertibles.

The fitted model has estimated the mean deviation for hardtops as $318, but this estimate is not statistically significant. In fact, at a p of .936, it is highly insignificant. So how should we interpret this coefficient? The obvious way is to assume that it is, in reality, zero. That implies that the estimated mean price of hardtops is the same as the estimated mean price of convertibles, namely $21890. But that doesn’t quite paint the complete picture. The estimate of $318 comes with an enormous standard error of $3980.519. The corresponding 95% CI spans roughly four standard errors, stretching from -$7532.146 to $8168.146. Can you draw any practical use from a mean that comes with such a large variability around it? The answer is no. The mean of a distribution with a very wide variance only very poorly represents any specific value from that distribution. Thus, instead of saying that hardtops have the same mean price as convertibles (which is still technically correct), it would be more useful to state that in this data set, the hardtop property has no ability to explain any of the variance in the price of automobiles.

On the other hand, the estimated coefficients of the hatchback, sedan and wagon styles are all statistically significant, at p values of < .001, .018 and .005 respectively. The hatchback’s coefficient is -11930, indicating that the estimated mean price of hatchbacks is $11930 less than the estimated mean price of convertibles. Thus, we estimate the mean price of hatchbacks as $21890 - $11930 = $9,960. The 95% CI around this estimate is [$21890 - $18100 = $3,790, $21890 - $5742.639 = $16,147.361].

Similarly, sedans come in at an estimated mean price that is $7430.7447 lower than that of convertibles, and wagons come in at an estimated mean price that is $9518.54 lower than that of convertibles.

In summary, our model has shown that on average, convertibles are the most expensive vehicles, followed by sedans, wagons and hatchbacks in that order, and nothing useful can be said about the ability of the hardtop style to explain the variance in price.

The fitted model’s equation is as follows:

The fitted regression model’s equation (Image by Author)

How to use dummy variables to represent ordered categorical values

The final use case we will consider is one where the categorical variable imposes a certain order on its constituents. Once again, we’ll use the automobiles data for illustration. Specifically, we’ll turn our attention toward the variable num_of_cylinders.

A 7-variable subset of the Automobiles data set. (Source: UC Irvine)

At first glance, num_of_cylinders might appear to be an integer valued variable. A possible regression model that regresses price on num_of_cylinders is as follows:

A naive regression model that regresses price on num_of_cylinders (Image by Author)

This model has a fatal flaw which becomes apparent when we differentiate the expected value of price w.r.t. num_of_cylinders:

The change in the expected value of automobile price per unit change in number of cylinders (Image by Author)

We see that this model will estimate a constant expected change in price for each unit change in the number of cylinders. The model will estimate the difference in the mean price of 2 cylinder vehicles and 3 cylinder vehicles to be exactly the same as that between 3 and 4 cylinder vehicles and so on. In the real world we would not expect to see such a uniform variation in vehicle prices.

A more realistic model could be one where the num_of_cylinders is treated as a categorical variable with each value of num_of_cylinders being represented by a dummy variable.

Our data set has vehicles with 2, 3, 4, 5, 6, 8 and 12 cylinders. Hence we construct the model as follows:

A linear model in which num_of_cylinders is represented as a categorical dummy variable (Image by Author)

We have left out the dummy for num_of_cylinders_2. The intercept β_0 will capture the effect of num_of_cylinders_2. The coefficient of each remaining dummy variable will contain the estimated deviation of the mean price of the respective category of vehicles from the estimated mean price of 2-cylinder vehicles. The 95% CIs can be calculated as illustrated above.

Let’s build and fit this model on the automobiles data set and print out the training summary.

#Add dummy variable columns to represent num_of_cylinders
df_with_dummies = pd.get_dummies(data=df, columns=['num_of_cylinders'])

#Form the regression expression, leaving out the dummy for num_of_cylinders_2
reg_exp = 'price ~ num_of_cylinders_3 + num_of_cylinders_4 + num_of_cylinders_5 + num_of_cylinders_6 + num_of_cylinders_8 + num_of_cylinders_12'

olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)

olsr_model_results = olsr_model.fit()

print(olsr_model_results.summary())

We see the following output:

Training summary of the linear model (Image by Author)

How to interpret the training summary and coefficients of the dummy variables

The first thing that catches the eye in the summary is the large adjusted R-squared of 0.618. The num_of_cylinders variable appears, by itself, to explain a whopping 61.8% of the variance in automobile prices.

As always, we will do our due-diligence by examining the p value of the F-statistic (which at 2.87E-39 is obviously less than .001), indicating that the regression variables in the model are jointly highly significant.

As before, our focus remains on the estimated coefficients, their p values and the 95% CIs.

Let’s start with the regression intercept. Its estimate is $13020, which is the estimated mean price of 2-cylinder automobiles. This estimate is statistically significant at a p of .001, with a 95% CI of [$8176.803, $17900].

3-cylinder automobiles come in at an estimated mean price of $13020 - $7869.0 = $5,151, but this estimate is statistically significant only at a p of .153. It fails the 95%, 90% and 85% confidence tests but clears the 80% confidence level.

4-cylinder autos come in right behind the 3-cylinder ones at an estimated mean price of $13020 - $2716.8025 = $10,303.1975. Again, at a p of .273, this estimate is significant only at a confidence level of (1 - .273)*100% = 72.7%.

The estimated means of 5, 6, 8 and 12-cylinder automobiles are all highly significant. 8-cylinder automobiles seem to be on-average the most expensive ones of the lot with their estimated mean price coming in at a colossal $25,880 more than their 2-cylinder brethren.

The following figure shows the mean prices plotted against the number of cylinders along with the lower and upper 95% bounds around the mean.

Mean price of automobiles as a function of number of cylinders (Image by Author)

We see that the price does not change by a constant amount with each unit change in the number of cylinders. This vindicates the insight we had earlier that we ought not to represent num_of_cylinders as a simple integer-valued variable.
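A figure like the one above can be reproduced from the fitted results object using params and conf_int(). The sketch below does this on a made-up miniature data set with just three cylinder counts; the bounds for the non-baseline categories are shifted by the intercept, mirroring the approximation used in this chapter:

```python
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib
matplotlib.use('Agg')  # render off-screen
from matplotlib import pyplot as plt

# Made-up mini data set; the dummy for 2 cylinders will be left out of the model
df = pd.DataFrame({'num_of_cylinders': [2, 2, 4, 4, 6, 6],
                   'price': [5000.0, 7000.0, 10000.0, 12000.0, 20000.0, 24000.0]})
df = pd.get_dummies(df, columns=['num_of_cylinders'], dtype=int)
results = smf.ols('price ~ num_of_cylinders_4 + num_of_cylinders_6', data=df).fit()

intercept = results.params['Intercept']
ci = results.conf_int()

# Mean price per category: the intercept for the baseline, intercept + coefficient otherwise
means = [intercept,
         intercept + results.params['num_of_cylinders_4'],
         intercept + results.params['num_of_cylinders_6']]
lowers = [ci.loc['Intercept', 0],
          intercept + ci.loc['num_of_cylinders_4', 0],
          intercept + ci.loc['num_of_cylinders_6', 0]]
uppers = [ci.loc['Intercept', 1],
          intercept + ci.loc['num_of_cylinders_4', 1],
          intercept + ci.loc['num_of_cylinders_6', 1]]

cylinders = [2, 4, 6]
plt.errorbar(cylinders, means,
             yerr=[[m - l for m, l in zip(means, lowers)],
                   [u - m for u, m in zip(uppers, means)]], fmt='o-')
plt.xlabel('number of cylinders')
plt.ylabel('mean price')
plt.savefig('mean_price_vs_cylinders.png')
print([round(m) for m in means])  # [6000, 11000, 22000]
```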

Here’s the equation of the fitted model:

The equation of the fitted regression model (Image by Author)

Here is the complete source code used in this chapter:

import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrices
import scipy.stats as st
from matplotlib import pyplot as plt
#Import the 7-variable subset of the automobiles dataset into a DataFrame
df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)
#############################################################################################
# Dummy variables regression 1
#############################################################################################
#Add dummy variable columns to represent the aspiration variable
df_with_dummies = pd.get_dummies(data=df, columns=['aspiration'])
#Print out the dummy-augmented data set
print(df_with_dummies)
#Construct the regression expression. The intercept of regression is added automatically.
#We add only one dummy variable, aspiration_std, and not both _std and _turbo, so as to avoid
# perfect collinearity. In this case, the regression intercept captures the effect of
# aspiration_turbo. Specifically, the value of the intercept is the coefficient of aspiration_turbo.
# Alternately, we could have added both aspiration_std and aspiration_turbo and left out the
# regression intercept. In this later case, because the model would not have the regression
# intercept, we would not be able to use the R-squared value to judge its goodness-of-fit.
reg_exp = 'price ~ aspiration_std'
#Build the Ordinary Least Squares Regression model. Even though the entire 7-variable data set
# is passed into the model, internally, statsmodels uses the regression expression (reg_exp) to
# carve out the columns of interest
olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
#Train the model
olsr_model_results = olsr_model.fit()
#Print the training summary
print(olsr_model_results.summary())
#############################################################################################
# Dummy variables regression 2
#############################################################################################
#Add dummy variable columns to represent body_style
df_with_dummies = pd.get_dummies(data=df, columns=['body_style'])
#Print out the dummy-augmented data set
print(df_with_dummies)
#Construct the regression expression. As before we'll leave out one dummy variable (
# body_style_convertible) to avoid perfect collinearity. The regression model's intercept will
# capture the effect of body_style_convertible
reg_exp = 'price ~ body_style_hardtop + body_style_hatchback + body_style_sedan + \
body_style_wagon'
#Build the OLS Regression model.
olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
#Train the model
olsr_model_results = olsr_model.fit()
#Print the training summary
print(olsr_model_results.summary())
#############################################################################################
# Dummy variables regression 3
#############################################################################################
#Add dummy variable columns to represent num_of_cylinders
df_with_dummies = pd.get_dummies(data=df, columns=['num_of_cylinders'])
#Form the regression expression
reg_exp = 'price ~ num_of_cylinders_3 + num_of_cylinders_4 + ' \
'num_of_cylinders_5 + num_of_cylinders_6 + num_of_cylinders_8 + num_of_cylinders_12'
#Build and fit the model and print out the training summary
olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())

Citations and Copyrights

Data set

The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0) Download link

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.


PREVIOUS: Building Robust Linear Models For Nonlinear, Heteroscedastic Data

NEXT: The Poisson Regression Model


UP: Table of Contents