And how to interpret the regression coefficients of dummy variables
A dummy variable is a binary variable that takes a value of 0 or 1. One adds such variables to a regression model to represent factors that are binary in nature, i.e. they are either observed or not observed.
Within this broad definition lie several interesting use cases. Here are some of them:
- For representing a Yes/No property: To indicate whether a data point has a certain property. For example, a dummy variable can be used to indicate whether a car engine is of type ‘Standard’ or ‘Turbo’. Or if a participant in a drug trial belongs to the placebo group or the treatment group.
- For representing a categorical value: A related use of dummies is to indicate which one of a set of categorical values a data point belongs to. For example, a vehicle’s body style could be one of convertible, hatchback, hardtop, sedan, or wagon. In this case, we would add five dummy variables to the data set, one for each of the five body styles, and ‘one-hot encode’ this five-element vector of dummies so that exactly one dummy is 1 and the rest are 0. Thus, the vector [0, 1, 0, 0, 0] would represent all hatchbacks in the data set.
- For representing an ordered categorical value: An extension of the use of dummies to represent categorical data is one where the categories are ordered. Suppose our Automobiles data set contains cars with engines having 2, 3, 4, 5, 6, 8 or 12 cylinders. Here, we also need to capture the information contained in the ordering. We will soon see how to do this.
- For representing a seasonal period: A dummy variable can be added to represent each one of the possibly many seasonal periods contained in the data. For example, the flow of traffic through intersections often exhibits seasonality at an hourly level (flow is highest during the morning and evening rush hours) and also at a weekly level (flow is lowest on Sundays). Adding dummy variables for each of the two seasonal periods will allow you to explain away much of the variation in traffic flow that is attributable to daily and weekly cycles.
- For representing Fixed Effects: While building regression models for panel data sets, dummies can be used to represent ‘unit-specific’ and ‘time-specific’ effects, especially in a Fixed Effects regression model.
- For representing Treatment Effects: In a treatment effects model, dummy variables can be used to represent the effect of time (i.e. the effect before and after the treatment is applied), the effect of group membership (whether the participant received the treatment or the placebo), and the effect of the interaction between time and group membership.
- In regression discontinuity designs: This is best explained with an example. Imagine a data set of monthly unemployment rate numbers that contains a sudden, sharp increase in the unemployment rate caused by a brief and severe recession. For this data, a regression model of the unemployment rate can deploy a dummy variable to estimate the expected impact of the recession on the unemployment rate.
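The categorical encodings described above can be sketched with pandas in a couple of lines; a minimal example on made-up data (the column name body_style anticipates the automobiles data set used later in this chapter):

```python
import pandas as pd

# Made-up sample, not the chapter's data set
df = pd.DataFrame({'body_style': ['hatchback', 'sedan', 'hatchback']})

# One-hot encoding: one 0/1 dummy column per category,
# with exactly one dummy set to 1 per row
dummies = pd.get_dummies(df, columns=['body_style'], dtype=int)
print(dummies)
```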
In this chapter, we’ll explain how to use dummy variables in the first three situations, namely:
- For representing a Yes/No property
- For representing a categorical value
- For representing an ordered categorical value
The last four use cases, namely the use of dummies to deseasonalize data, to represent fixed effects and treatment effects, and to model regression discontinuities, all deserve their own separate chapters. The use of dummies for representing Fixed Effects is covered in its own chapter, as is their use in building Treatment Effects models and in modeling the effect of discontinuities.
Let’s dive into the first use case.
How to use a dummy variable for representing a Yes/No property
We’ll illustrate the procedure by using the following data set of vehicles containing specifications of 200+ automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle:
We’ll consider a subset of this data consisting of the following seven variables:
The above 7-variable version can be downloaded from here.
In the above data set, the aspiration variable is of type Standard or Turbo. Our regression goal is to estimate the effect of aspiration on vehicle price. To that end, we will introduce a dummy variable to encode aspiration as follows:
aspiration_std=1 when aspiration is standard, and 0 otherwise.
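This encoding rule can be written directly in pandas; a minimal sketch on made-up values (the real data set is loaded below):

```python
import pandas as pd

# Made-up aspiration values, just to illustrate the encoding rule
df = pd.DataFrame({'aspiration': ['std', 'turbo', 'std']})

# aspiration_std = 1 when aspiration is standard, and 0 otherwise
df['aspiration_std'] = (df['aspiration'] == 'std').astype(int)
print(df['aspiration_std'].tolist())
```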
We’ll use the Python-based Pandas library to load the data set into memory as a DataFrame. Then we’ll use the statsmodels library to build a simple linear regression model in which the response variable is price and the regression variable is aspiration_std (plus the regression intercept).
Let’s start by importing all the required packages.
import pandas as pd
import statsmodels.formula.api as smf
Let’s import the 7-variable subset of the automobiles data set into a DataFrame:
df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)
We’ll add dummy variable columns to represent the aspiration variable.
df_with_dummies = pd.get_dummies(data=df, columns=['aspiration'])
Print out the dummy-augmented data set:
print(df_with_dummies)
We see the following output. I have highlighted the two dummy variable columns added by Pandas:
Let’s construct the regression expression. The regression intercept is added automatically by the model.
reg_exp = 'price ~ aspiration_std'
Notice that we have added only one dummy variable, aspiration_std, and not both aspiration_std and aspiration_turbo. We did this to avoid perfect collinearity: every vehicle engine in the data set is either of type turbo or of type standard; there is no third type. In this case, the regression intercept captures the effect of aspiration_turbo. Specifically, the estimated value of the regression intercept in the trained model is the estimated mean price of all turbo-type automobiles.
Alternatively, we could have added both aspiration_std and aspiration_turbo and left out the regression intercept. In that case, because the model would have no intercept, we would not be able to use the R-squared value to judge its goodness-of-fit.
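Incidentally, pandas can drop one dummy for you: get_dummies accepts a drop_first argument that removes the dummy for the first category (alphabetically ‘std’ here). That differs from our choice of dropping aspiration_turbo, but it avoids the collinearity in exactly the same way. A sketch on made-up data:

```python
import pandas as pd

# Made-up aspiration values, not the chapter's data set
df = pd.DataFrame({'aspiration': ['std', 'turbo', 'std']})

# drop_first=True drops the first category's dummy ('std'),
# leaving only aspiration_turbo; the intercept then absorbs 'std'
dummies = pd.get_dummies(df, columns=['aspiration'], drop_first=True, dtype=int)
print(dummies.columns.tolist())
```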
Let’s build the Ordinary Least Squares Regression model on this dummies augmented dataset:
olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
Even though we have passed the entire 7-variable data set into this model, internally, statsmodels will use the regression expression parameter (reg_exp) to carve out only the columns of interest.
Let’s train the model:
olsr_model_results = olsr_model.fit()
Let’s print out the training summary:
print(olsr_model_results.summary())
We see the following output. I have highlighted the parts which we will be examining closely:
How to interpret the model training summary
The first thing we notice is that the adjusted R-squared is 0.027. The aspiration variable has been able to explain just under 3% of the variance in automobile price. That seems awfully small, but we need not read too much into the low value of adjusted R-squared. Recollect that our goal was to estimate the effect of aspiration on price; we never expected aspiration by itself to explain away much of the variance in price. Besides, notice that the F-statistic’s p value is significant at .0107, indicating that even this very simple linear model fits the data better than the mean model (which is basically a flat horizontal line passing through the mean value of price).
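The ‘mean model’ that the F-test compares against is just an intercept-only regression. A sketch on simulated data (all numbers made up) showing that its intercept is simply the sample mean of the response, and that the summary’s F-test pits the full model against it:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: a binary regressor with a real effect
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 100)
y = 10.0 + 3.0 * x + rng.normal(0.0, 1.0, 100)
df = pd.DataFrame({'y': y, 'x': x})

full = smf.ols('y ~ x', data=df).fit()
mean_model = smf.ols('y ~ 1', data=df).fit()  # flat line through mean(y)

# The F-statistic in the training summary tests the full model against
# the mean model; a small p value means the regressors add explanatory power
print(full.f_pvalue)
print(mean_model.params['Intercept'], df['y'].mean())
```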
Next, we notice that the model’s regression intercept and the coefficient of aspiration_std are both statistically significant, i.e. non-zero, at p values of less than .001 and of .011 respectively. That is great news. Let’s see how to interpret the values of these coefficients.
How to interpret the coefficient of the dummy variable in the regression model
Recollect that we had left out the dummy variable aspiration_turbo from the model to avoid perfect collinearity. By leaving out aspiration_turbo, we have given the job of storing the mean price of the turbos to the regression model’s intercept. The regression intercept is 16250 indicating that the mean price of turbos is $16250.
We need to interpret the coefficients of all dummy variables in the model with reference to the value of the intercept.
In our case, there is only one dummy, aspiration_std. Its value is 3712.62 with a negative sign, indicating that automobiles with a ‘standard’ type aspiration are on average $3712.62 less expensive than those with a ‘turbo’ type aspiration. The estimated mean price of turbos is $16250; hence, the estimated mean price of non-turbos is $16250 − $3712.62 = $12,537.38.
Using statistical notation, we can represent the two means as follows:
E(price|aspiration=’standard’) = $12,537.38
This estimate has the following 95% confidence interval around the mean:
[$16250 − $6555.64 = $9,694.36, $16250 − $869.607 = $15,380.393].
We took the values −$6555.64 and −$869.607 from the CI portion of the model training output shown below:
For turbos, the expectation and the CIs work out as follows:
E(price|aspiration=’turbo’) = $16250 with a 95% CI of [$13700, $18800].
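This arithmetic can be checked in code. A sketch on simulated data (the prices below are generated to mimic the chapter’s estimates, not taken from the data set): with a single dummy regressor, the fitted intercept is exactly the sample mean of the left-out group, and intercept plus coefficient is exactly the mean of the other group.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated prices mimicking the chapter's estimates (made-up noise)
rng = np.random.default_rng(42)
aspiration_std = rng.integers(0, 2, 200)
price = 16250.0 - 3712.62 * aspiration_std + rng.normal(0.0, 1000.0, 200)
df = pd.DataFrame({'price': price, 'aspiration_std': aspiration_std})

res = smf.ols('price ~ aspiration_std', data=df).fit()

# Intercept = mean price of the left-out ('turbo') group;
# intercept + dummy coefficient = mean price of the 'std' group
mean_turbo = res.params['Intercept']
mean_std = mean_turbo + res.params['aspiration_std']
print(res.conf_int())  # 95% CIs, as read off the training summary
```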
The fitted model’s regression equation is as follows:
price = −3712.62·aspiration_std + 16250 + e
Where ‘e’ contains the residual error of regression.
Next, let’s look at the use of dummy variables to represent categorical data.
How to use dummy variables for representing a categorical regression variable
Suppose we wish to estimate the effect of body_style on price. The body_style variable is categorical with the following set of values: [convertible, hardtop, hatchback, sedan, wagon]. Our overall strategy for representing body_style will be similar to that for aspiration, so let’s dive straight into the implementation. We’ll continue working with the Pandas DataFrame that contains the 7-variable automobiles data set.
Let’s augment the DataFrame with dummy variable columns to represent body_style:
df_with_dummies = pd.get_dummies(data=df, columns=['body_style'])
Print out the dummy-augmented data set:
print(df_with_dummies)
We see the following output:
Notice the newly added dummy variable columns, one for each body_style.
Next, we’ll construct the regression expression in Patsy syntax. As before, we’ll leave out one dummy variable (body_style_convertible) to avoid perfect collinearity. The regression model’s intercept will capture the effect of body_style_convertible.
reg_exp = 'price ~ body_style_hardtop + body_style_hatchback + body_style_sedan + body_style_wagon'
Let’s build the OLS regression model:
olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
Let’s train the model:
olsr_model_results = olsr_model.fit()
And let’s print out the training summary:
print(olsr_model_results.summary())
We see the following output:
As before, we won’t focus on the adjusted R-squared. Instead, let’s look at the F-statistic and note that it is significant at a p value of < .001. This indicates that, irrespective of the value of R-squared, the variables we have included in the model do a better job of explaining the variance in price than a simple mean model. With this important piece of due diligence out of the way, let’s dig into the coefficients of the variables.
How to interpret the coefficients of dummy variables
Let’s turn our gaze toward the fitted model’s coefficients. The estimated intercept is 21890. The intercept is the estimated mean price of convertibles since that was the dummy that we dropped from the regression equation. This estimate is significant at a p < .001. The 95% CI for this estimate is [$16000, $27800].
The coefficients of the four style-specific dummies for hardtop, hatchback, sedan and wagon represent the extent to which the mean of the corresponding style deviates from the estimated mean price of convertibles.
The fitted model has estimated the mean deviation for hardtops as $318, but this estimate is not statistically significant; at a p of .936, it is nowhere near significance. So how should we interpret this coefficient? The obvious way is to assume that it is, in reality, zero. That would imply that the estimated mean price of hardtops is the same as the estimated mean price of convertibles, namely $21890. But that doesn’t quite paint the complete picture. The estimate of $318 comes with an enormous standard error of $3980.519, and the corresponding 95% CI is nearly four standard errors wide, stretching from −$7532.146 to $8168.146. Can you draw any practical use from a mean that comes with such a large variability around it? The answer is no: the mean of a distribution with a very wide variance represents any specific value from that distribution only very poorly. Thus, instead of saying that hardtops have the same mean price as convertibles (which is still technically correct), it would be more useful to state that in this data set, the hardtop property has no ability to explain the variance in the price of automobiles.
On the other hand, the estimated coefficients of the hatchback, sedan and wagon styles are all statistically significant, at p values of < .001, .018 and .005 respectively. The hatchback’s coefficient is −11930, indicating that the estimated mean price of hatchbacks is $11930 less than the estimated mean price of convertibles. Thus, we estimate the mean price of hatchbacks as $21890 − $11930 = $9,960. The 95% CI around this estimate is [$21890 − $18100 = $3,790, $21890 − $5742.639 = $16,147.361].
Similarly, sedans come at an estimated mean price that is $7430.7447 lower than that of convertibles, and wagons come in at an estimated mean price of $9518.54 lower than that of convertibles.
In summary, our model has shown that on average, convertibles are the most expensive vehicles, followed by sedans, wagons and hatchbacks in that order, and that nothing useful can be said about the ability of the hardtop style to explain the variance in price.
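This deviation-from-baseline reading generalizes to any number of dummies. A sketch on simulated data (the category means below are made up) verifying that the intercept plus a dummy’s coefficient recovers that category’s sample mean exactly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated three-category data (made-up means and noise)
rng = np.random.default_rng(0)
styles = rng.choice(['convertible', 'hatchback', 'sedan'], size=300)
base = {'convertible': 21890.0, 'hatchback': 9960.0, 'sedan': 14459.0}
price = np.array([base[s] for s in styles]) + rng.normal(0.0, 2000.0, 300)

d = pd.get_dummies(pd.DataFrame({'price': price, 'body_style': styles}),
                   columns=['body_style'], dtype=int)

# Leave out body_style_convertible; the intercept becomes its estimated mean
res = smf.ols('price ~ body_style_hatchback + body_style_sedan', data=d).fit()

mean_convertible = res.params['Intercept']
mean_hatchback = mean_convertible + res.params['body_style_hatchback']
mean_sedan = mean_convertible + res.params['body_style_sedan']
```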
The fitted model’s equation is as follows:
price = 21890 + 318·body_style_hardtop − 11930·body_style_hatchback − 7430.7447·body_style_sedan − 9518.54·body_style_wagon + e
How to use dummy variables to represent ordered categorical values
The final use case we will consider is one where the categorical variable imposes a certain order on its constituents. Once again, we’ll use the automobiles data for illustration. Specifically, we’ll turn our attention toward the variable num_of_cylinders.
At first glance, num_of_cylinders might appear to be an integer-valued variable. A possible regression model that regresses price on num_of_cylinders is as follows:
price = β_0 + β_1·num_of_cylinders + e
This model has a fatal flaw which becomes apparent when we differentiate the expected value of price w.r.t. num_of_cylinders:
∂E(price)/∂num_of_cylinders = β_1
We see that this model will estimate a constant expected change in price for each unit change in the number of cylinders. The model will estimate the difference in the mean price of 2 cylinder vehicles and 3 cylinder vehicles to be exactly the same as that between 3 and 4 cylinder vehicles and so on. In the real world we would not expect to see such a uniform variation in vehicle prices.
A more realistic model could be one where the num_of_cylinders is treated as a categorical variable with each value of num_of_cylinders being represented by a dummy variable.
Our data set has vehicles with 2, 3, 4, 5, 6, 8 and 12 cylinders. Hence we construct the model as follows:
price = β_0 + β_3·num_of_cylinders_3 + β_4·num_of_cylinders_4 + β_5·num_of_cylinders_5 + β_6·num_of_cylinders_6 + β_8·num_of_cylinders_8 + β_12·num_of_cylinders_12 + e
We have left out the dummy for num_of_cylinders_2. The intercept β_0 will capture the coefficient for num_of_cylinders_2. The coefficients of all dummy variables will contain the estimated deviation in the mean price for the respective category of vehicles from the estimated mean price of 2-cylinder vehicles. The 95% CIs can be calculated as illustrated above.
Let’s build and fit this model on the automobiles data set and print out the training summary.
#Add dummy variable columns to represent num_of_cylinders
df_with_dummies = pd.get_dummies(data=df, columns=['num_of_cylinders'])

#Construct the regression expression, leaving out the dummy for 2-cylinder vehicles
reg_exp = 'price ~ num_of_cylinders_3 + num_of_cylinders_4 + num_of_cylinders_5 + num_of_cylinders_6 + num_of_cylinders_8 + num_of_cylinders_12'

olsr_model = smf.ols(formula=reg_exp, data=df_with_dummies)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
We see the following output:
How to interpret the training summary and coefficients of the dummy variables
The first thing that catches the eye in the summary is the large adjusted R-squared of 0.618. The num_of_cylinders appears to have the capacity to by itself explain a whopping 61.8% of the variance in automobile prices.
As always, we do our due diligence by examining the p value of the F-statistic, which at 2.87e-39 is obviously less than .001, indicating that the regression variables in the model are jointly highly significant.
As before, our focus remains on the estimated coefficients, their p values and the 95% CIs.
Let’s start with the regression intercept. Its estimate is $13020, which is the estimated mean price of 2-cylinder automobiles. This estimate is statistically significant at a p of .001, with a 95% CI of [$8176.803, $17900].
3-cylinder automobiles come in at an estimated mean price of $13020 − $7869.0 = $5,151, but this estimate is statistically significant only at a p of .153. It fails the 95%, the 90% and the 85% confidence tests but clears the 80% confidence level.
4-cylinder autos come in right behind the 3-cylinder ones at an estimated mean price of $13020 − $2716.8025 = $10,303.1975. Again, at a p of .273, this estimate is significant only at a confidence level of (1 − .273)·100% = 72.7%.
The estimated means of 5, 6, 8 and 12-cylinder automobiles are all highly significant. 8-cylinder automobiles seem to be on-average the most expensive ones of the lot with their estimated mean price coming in at a colossal $25,880 more than their 2-cylinder brethren.
The following figure shows the mean prices plotted against the number of cylinders along with the lower and upper 95% bounds around the mean.
We see that price does not change by a constant amount with each unit change in the number of cylinders. This vindicates our earlier insight that we ought not to represent num_of_cylinders as a simple integer-valued variable.
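A figure of this kind can be drawn with matplotlib. The numbers below are illustrative placeholders rather than the fitted estimates (only a few of those appear in the text):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Placeholder means and 95% bounds, NOT the chapter's fitted values
cylinders = [2, 3, 4, 5, 6, 8, 12]
mean_price = [13020, 5151, 10303, 16000, 19000, 38900, 33000]
lower = [8177, 1000, 6500, 12000, 15000, 30000, 24000]
upper = [17900, 9500, 14100, 20000, 23000, 47800, 42000]

fig, ax = plt.subplots()
# errorbar wants distances from the mean, not absolute bounds
ax.errorbar(cylinders, mean_price,
            yerr=[[m - l for m, l in zip(mean_price, lower)],
                  [u - m for u, m in zip(upper, mean_price)]],
            fmt='o-', capsize=4)
ax.set_xlabel('num_of_cylinders')
ax.set_ylabel('estimated mean price ($)')
fig.savefig('mean_price_vs_cylinders.png')
```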
Here’s the equation of the fitted model:
Here is the complete source code used in this chapter:
Citations and Copyrights
The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0) Download link