Introduction To The Quantile Regression Model

We’ll look at how to predict the median and other quantile points


In a regression model, one is normally interested in estimating the conditional mean of the response variable. For example, consider the following plot of vehicle prices versus number of cylinders:

Price versus number of cylinders. The gold dots represent the conditional mean: E(price|num_of_cylinders) (Image by Author) (Dataset: UCI ML Automobile used under CC BY 4.0)

The gold dots represent the observed mean price conditioned upon the number of cylinders.

If we wanted to estimate this conditional mean price using a regression model, we could employ the following linear model to do the job:

price_i = β_0 + β_1*num_of_cylinders_i + ε_i

A linear model for estimating the price of an automobile (Image by Author)

If we fit the above linear model to the data and plot its estimated values, we get the following plot:

The trend line of the trained OLS model plotted on top of the raw data. The red dots are the estimated conditional mean prices while the gold dots are the observed conditional mean prices (Image by Author)

For other kinds of data, a more complex model such as a Poisson or a Cox Proportional Hazards model may be employed. In all cases, we are asking the model to estimate the conditional mean.

But in some data sets, the mean is not a good exemplar of the data.

Here are three examples of data sets for which the mean may not be the appropriate statistic to estimate:

  • The data is highly skewed to the left or to the right. Such is often the case with insurance claims or healthcare costs, where most claims are small in value but the data has a long tail of increasingly large claims.
  • The data is multi-modal, i.e. it has more than one frequently occurring value. For example, if the data has two roughly identical peaks (a bimodal data set), a mean model will estimate a point inside the valley between the two peaks, which is not a good representative of the data.
  • The data contains influential outliers. Regression models such as the Ordinary Least Squares (OLS) model that rely on minimizing the sum of squares of residual errors are highly sensitive to outliers. In such cases, it may be beneficial to minimize the sum of absolute values of residual errors, in addition to not using the mean as the quantity to be estimated.

In all these cases, the mean does not adequately represent the nature of the data. One may be better served by estimating the conditional median.
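
To see the effect concretely, consider a toy example (illustrative numbers, not drawn from the automobiles data set) in which a single extreme insurance claim drags the mean far away from the bulk of the data while leaving the median untouched:

import numpy as np

#Ten typical claim amounts plus one extreme claim
claims = np.array([900, 950, 1000, 1050, 1100, 1000, 980, 1020, 990, 1010, 250000])

#The mean is dragged to roughly 23,636 by the single large claim
print(np.mean(claims))

#The median stays at 1000, unaffected by the outlier
print(np.median(claims))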

In the most general terms, our “median-seeking” regression model would be specified by the following equation:

Median(y|X=x_i) = f(β_cap, x_i)

A regression model that estimates the conditional median of y for a certain value of X (Image by Author)

In the above equation, X is the regression matrix and x_i is the ith row of the matrix. β_cap is the vector of fitted regression coefficients, and f(.) is some function of β_cap and x_i that estimates the median under the constraint that the probability of an observed value of y being less than or equal to the estimated value f(β_cap, x_i) is 50%.

For example, a model for estimating the median price of automobiles in which f(β_cap, x_i) has a linear form would be:

Median(price|num_of_cylinders_i) = β_cap_0 + β_cap_1*num_of_cylinders_i

A model for estimating the median price of automobiles (Image by Author)

We may even go a step further. In a data set, the median is the 0.5 quantile (or 50th percentile) point meaning that 50% of the data points are less than the value of the median. Similarly, there are other quantile points that can be defined. The 0.1 quantile point (10th percentile) is the value such that only 10% of the data set is smaller than this value. Similarly, the 0.25 quantile point is greater in value than 25% of the data set, and so on.
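
As an aside, NumPy can compute these sample quantile points directly; a small illustration on the integers 1 through 100 (the printed values assume NumPy's default linear interpolation):

import numpy as np

y = np.arange(1, 101)

#The 0.5 quantile point (the median): 50.5
print(np.quantile(y, 0.5))

#The 0.1 quantile point: roughly 10.9. About 10% of the values lie below it
print(np.quantile(y, 0.1))

#The 0.25 quantile point: roughly 25.75. About 25% of the values lie below it
print(np.quantile(y, 0.25))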

We may want to build a regression model that estimates any or all of these quantile points (or corresponding percentile values).

The general equation of such a model is as follows:

Q_q(y|X=x_i) = f(β_cap, x_i)

The general equation of the q-Quantile regression model (Image by Author)

In the above equation, Q(.) is the estimated quantile point for the q-quantile (or the (q*100)th percentile). As before, f(β_cap, x_i) is a function that yields the estimated value of the desired q-quantile point, subject to the constraint that the probability of an observed value of y being less than or equal to the estimated value f(β_cap, x_i) is exactly q. In practice, q lies strictly between 0 and 1.

For example, a model for estimating the 95th percentile price of automobiles in which f(β_cap, x_i) has a linear form would be:

Q_0.95(price|num_of_cylinders_i) = β_cap_0 + β_cap_1*num_of_cylinders_i

A model for estimating the 95th percentile auto price given the number of cylinders (Image by Author)
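
As an aside on how such models are fitted: just as the median model can be estimated by minimizing the sum of absolute residuals, the q-quantile model is typically estimated by minimizing a tilted version of the absolute error loss, often called the "pinball" or "check" loss; this is also what statsmodels minimizes under the hood. A minimal sketch of that loss function:

import numpy as np

def pinball_loss(y_true, y_pred, q):
    #Residuals above the prediction are weighted by q, residuals
    #below it by (1 - q). Minimizing the average of this quantity
    #pulls the fitted values toward the q-quantile. At q=0.5 it
    #reduces to half the mean absolute error.
    u = y_true - y_pred
    return np.mean(np.maximum(q * u, (q - 1) * u))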

What is a quantile regression model used for?

When the data is distributed in a different way in each quantile of the data set, it may be advantageous to fit a different regression model to meet the unique modeling needs of each quantile instead of trying to fit a one-size-fits-all model that predicts the conditional mean. In such cases, the coefficients of all such quantile models would differ from each other.

The converse situation is one where the data is identically distributed in each quantile. Specifically, the error term of a model that estimates the conditional mean has an identical distribution for each x_i; for example, the errors are normally distributed for all x_i, and all the distributions have the same mean and variance. In other words, the errors are homoskedastic. It can be shown that in this case, when different quantile regression models are fitted on a homoskedastic data set, the corresponding sets of coefficients of all the models will work out to be statistically identical, and the various quantile models will differ from each other only in the value of the regression intercept. In fact, this behavior can form the basis for a test of heteroskedasticity.
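
A quick simulation (synthetic data, not the automobiles data set) illustrates this behavior: when the noise distribution does not depend on x, the fitted slopes at different quantiles come out essentially the same and only the intercepts shift:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 1000)

#Homoskedastic data: the noise distribution is the same for every x
sim_df = pd.DataFrame({'x': x, 'y': 2.0 + 3.0*x + rng.normal(0, 1, 1000)})

for q in [0.1, 0.5, 0.9]:
    res = smf.quantreg('y ~ x', sim_df).fit(q=q)
    #The slopes are statistically indistinguishable across quantiles; the
    #intercepts differ by roughly the corresponding quantile of the noise
    print(q, res.params['Intercept'], res.params['x'])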

Another reason to build a quantile model is that we are actually interested in estimating a particular quantile point for, say, socioeconomic reasons. For instance, we may want to estimate the 95th percentile SAT score of test takers who share a set of demographic characteristics.

In the next section, we’ll go through a short tutorial on how to build a quantile model using Python.


How to build a quantile regression model using Python and statsmodels

We’ll illustrate the procedure of building a quantile regression model using the following data set of vehicles containing specifications of 200+ automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle:

The automobiles data set (Source: UC Irvine under CC BY 4.0)

A subset of this data can be downloaded from here.

Let’s load the data set into a Pandas DataFrame and plot the data set.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
import numpy as np
from matplotlib import pyplot as plt

#Import the 7-variable subset of the automobiles dataset into a DataFrame
df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)

#Plot price versus num_of_cylinders
fig = plt.figure()

fig.suptitle('Price versus Number of Cylinders')

plt.xlabel('number_of_cylinders')
plt.ylabel('price')

plt.scatter(x=df['num_of_cylinders'], y=df['price'])

Let’s also plot the conditional means E(price|num_of_cylinders):

num_of_cylinders = np.array(df.groupby('num_of_cylinders')['num_of_cylinders'].mean())

conditional_means = np.array(df.groupby('num_of_cylinders')['price'].mean())

plt.scatter(x=num_of_cylinders, y=conditional_means, color='gold', marker='o')

plt.show()

We get the following graph. The gold dots represent the conditional means:

Price versus number of cylinders. The gold dots represent the conditional mean: E(price|num_of_cylinders) (Image by Author)

Next, let’s define the equation of the regression model in Patsy syntax. Patsy includes the regression intercept by default:

reg_exp = 'price ~ num_of_cylinders'

We’ll use this expression to carve out the y and X matrices:

y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
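
Since the intercept is included by default, Patsy adds an explicit column of ones named Intercept to the X matrix, which is worth a quick check:

#X_train contains the Intercept column of ones plus num_of_cylinders;
#y_train contains the price column
print(X_train.columns.tolist())
print(y_train.columns.tolist())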

We’ll first build and train an Ordinary Least Squares (OLS) Regression Model which estimates the conditional mean price for a given value of num_of_cylinders and plot the regression line on the scatter plot of points:

#Build and train an OLS regression model
olsr_model = sm.OLS(endog=y_train, exog=X_train)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())

#Plot the OLS regression line on the scatter plot of Price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])

#Get the estimated conditional means from the trained OLS model
y_pred_ols = olsr_model_results.predict(X_train)

#Plot the estimated conditional means
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='red', marker='o', linestyle='solid', label='OLS Model')

#Also plot the observed conditional means i.e. E(price | num_of_cylinders)
conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)')

plt.legend(handles=[ols, conditional_mean_pts])

plt.show()

We see the following graph:

The trend line of the trained OLS model plotted on top of the raw data. The red dots are the estimated conditional mean prices while the gold dots are the observed conditional mean prices (Image by Author)

Let’s now start working with the median and other quantile points.

We’ll start with training a model that will estimate the conditional median, namely, median(price|num_of_cylinders).

Let’s create an instance of the quantile regression model as follows:

median_model = smf.quantreg(formula=reg_exp, data=df)

Next, we’ll train the model. We’ll tell statsmodels that we want to fit to the conditional median which is the 0.5 quantile point:

median_model_results = median_model.fit(q=0.5)
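
Before plotting, here's a quick sanity check (an addition to the walkthrough, not strictly necessary): by construction, roughly half of the observed prices should lie at or below the model's estimated conditional medians.

#Fraction of observed prices at or below the estimated conditional
#median; this should come out close to 0.5
frac_below = (y_train['price'] <= median_model_results.predict(X_train)).mean()
print(frac_below)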

Now, let’s plot the estimated conditional median points from this model against the backdrop of the raw price versus num_of_cylinders data. For comparison, we’ll also plot the estimated conditional means from the OLS model that we built earlier, along with the observed values of the conditional means and conditional medians.

The following piece of Python code does all of this:

fig = plt.figure()

fig.suptitle('Price versus Number of Cylinders')

plt.xlabel('number_of_cylinders')
plt.ylabel('price')

#Show the scatter plot of price versus num_of_cylinders
plt.scatter(x=df['num_of_cylinders'], y=df['price'], c='deepskyblue')

#Get the estimated conditional medians from the median model
y_pred_median = median_model_results.predict(X_train)

#Plot the estimated conditional medians
median, = plt.plot(X_train['num_of_cylinders'], y_pred_median,
    color='black', marker='o', linestyle='solid',  label='Median Model')

#For comparison, also plot the estimated conditional means from the OLS model we built earlier
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='red', marker='o', linestyle='solid',  label='OLS Model')

#Calculate the observed conditional medians
conditional_medians = np.array(df.groupby('num_of_cylinders')['price'].median())

#Plot the observed conditional medians
conditional_median_pts = plt.scatter(x=num_of_cylinders, y=conditional_medians, c='sienna', marker='^', label='Median(price | num_of_cylinders)')

#For comparison, plot the observed conditional means
conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)')

#Set up the legend and show the plot
plt.legend(handles=[ols, median, conditional_mean_pts, conditional_median_pts])

plt.show()

We see the following plot:

Predicted conditional means from the OLS model (red) and conditional medians from the quantile model (black), observed conditional means (gold dots), and observed conditional medians (sienna triangles), superimposed on the raw data (Image by Author)

One thing we notice from the plot is that the observed conditional median and conditional mean points almost overlap, implying that the price data is more or less symmetric around its conditional mean. This is borne out by the trend lines of the OLS and median regression models, which are very close to each other, especially toward the center of the data.
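
We can confirm this numerically by printing the fitted coefficients of the two models side by side; if the data is roughly symmetric around its conditional mean, the two sets of coefficients should be close:

#Compare the OLS and median model coefficients
print(olsr_model_results.params)
print(median_model_results.params)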

Let’s also plot the regression lines for a few other quantile points, say, [0.1, 0.25, 0.5, 0.75, 0.9].

fig = plt.figure()

fig.suptitle('Price versus Number of Cylinders')

plt.xlabel('number_of_cylinders')
plt.ylabel('price')

plt.scatter(x=df['num_of_cylinders'], y=df['price'])

coeff = []
colors = ['orange', 'lime', 'yellow', 'cyan', 'violet']
i=0
handles = []
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]

for q in quantiles:
    #Build the model 
    quantile_model = smf.quantreg(formula=reg_exp, data=df)

    #Fit the model
    quantile_model_results = quantile_model.fit(q=q)

    print(quantile_model_results.summary())

    coeff.append(quantile_model_results.params['num_of_cylinders'])

    #Get the estimated values from the quantile model 
    y_pred_quantile = quantile_model_results.predict(X_train)

    #Plot the estimated values
    quantile, = plt.plot(X_train['num_of_cylinders'], y_pred_quantile, color=colors[i], marker='o', linestyle='solid',  label=str(int(q*100))+'th percentile Model')

    i = i+1
    handles.append(quantile)

#Also plot the estimated values from the OLS model for comparison
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='black', marker='o', linestyle='solid', label='OLS Model')

handles.append(ols)
plt.legend(handles=handles)

plt.show()

Here’s what the plot looks like:

A plot of predictions from various quantile models superimposed on the raw data (sky blue). The OLS model’s predictions (black) are shown for reference (Image by Author)

If the errors from the OLS model are identically distributed, in other words, if the errors are homoskedastic, the trend lines of all the quantile models will differ only in their intercepts, i.e. they will be parallel to each other. That's clearly not what we see in this case, which leads us to believe that the errors from the OLS model are heteroskedastic.
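
One way to formalize this suspicion (this step is an addition, not part of the original walkthrough) is to run a Breusch-Pagan test on the residuals of the OLS model we trained earlier. The null hypothesis of the test is homoskedasticity, so a small p-value would corroborate what the fanning-out quantile lines suggest:

from statsmodels.stats.diagnostic import het_breuschpagan

#Regress the squared OLS residuals on the regression variables and
#test whether those variables explain any of the variation
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(olsr_model_results.resid, X_train)
print('LM statistic:', lm_stat, 'p-value:', lm_pvalue)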

A plot of the estimated coefficient of num_of_cylinders against the quantile number yields the following plot which only reinforces the above conclusion:

Plot of the estimated coefficient of num_of_cylinders from the various quantile models against the quantile number (Image by Author)

One more thing we notice is that the partial effect of num_of_cylinders on the vehicle price (i.e. the coefficient of num_of_cylinders) increases with the quantile number, indicating that the higher quantile prices are more sensitive than lower quantile prices to changes in number of cylinders. For instance, the 90th percentile vehicle price changes more with each unit change in the number of cylinders than the 80th percentile price.
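
To judge whether this upward drift in the coefficient is signal rather than noise, we can also extract a 95% confidence interval for the coefficient at each quantile. Here's a sketch (reusing reg_exp, df and quantiles from earlier):

#Refit the per-quantile models and collect the coefficient of
#num_of_cylinders together with its 95% confidence interval
for q in quantiles:
    res = smf.quantreg(formula=reg_exp, data=df).fit(q=q)
    ci_low, ci_high = res.conf_int().loc['num_of_cylinders']
    print(q, res.params['num_of_cylinders'], ci_low, ci_high)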


Here’s the complete source code used in the chapter:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from matplotlib import pyplot as plt
#Import the 7-variable subset of the automobiles dataset into a DataFrame
df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)
#Plot price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
plt.show()
#Construct an OLS model
#Start with defining the model's equation in Patsy syntax
reg_exp = 'price ~ num_of_cylinders'
#Carve out the y and X matrices
y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
#build and train an OLS model
olsr_model = sm.OLS(endog=y_train, exog=X_train)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
#Plot the OLS regression line on the scatter plot of Price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
y_pred_ols = olsr_model_results.predict(X_train)
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='red', marker='o', linestyle='solid', label='OLS Model')
plt.legend(handles=[ols])
plt.show()
#build and train a model that estimates the median
median_model = smf.quantreg(formula=reg_exp, data=df)
median_model_results = median_model.fit(q=0.5)
print(median_model_results.summary())
#Plot the OLS regression line on the scatter plot of Price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
y_pred_median = median_model_results.predict(X_train)
median, = plt.plot(X_train['num_of_cylinders'], y_pred_median,
    color='cyan', marker='o', linestyle='solid', label='Median Model')
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='red', marker='o', linestyle='dashed', label='OLS Model')
plt.legend(handles=[ols, median])
plt.show()
#Plot all regression lines for multiple quantiles on the scatter plot of price versus
# num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
coeff = []
colors = ['orange', 'lime', 'yellow', 'cyan', 'violet']
i=0
handles = []
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
for q in quantiles:
    quantile_model = smf.quantreg(formula=reg_exp, data=df)
    quantile_model_results = quantile_model.fit(q=q)
    print(quantile_model_results.summary())
    coeff.append(quantile_model_results.params['num_of_cylinders'])
    y_pred_quantile = quantile_model_results.predict(X_train)
    quantile, = plt.plot(X_train['num_of_cylinders'], y_pred_quantile,
        color=colors[i], marker='o', linestyle='solid', label=str(int(q*100))+'th percentile Model')
    i = i+1
    handles.append(quantile)
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
    color='red', marker='o', linestyle='dashed', label='OLS Model')
handles.append(ols)
plt.legend(handles=handles)
plt.show()
#Plot the coefficient of num_of_cylinders versus quantile number
fig = plt.figure()
fig.suptitle('Coefficient of num_of_cylinders versus Quantile number')
plt.xlabel('Quantile number')
plt.ylabel('Coefficient of num_of_cylinders')
plt.plot(quantiles, coeff)
plt.show()

Citations and Copyrights

Data set

The Automobile Data Set is sourced from UCI ML data sets repository under CC BY 4.0.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

