We’ll look at how to predict the median and other quantile points

In a regression model, one is normally interested in estimating the conditional mean of the response variable. For example, consider the following plot of vehicle prices versus number of cylinders:

The gold dots represent the observed mean price conditioned upon the number of cylinders.

If we wanted to estimate this conditional mean price using a regression model, we could employ the following linear model to do the job:

If we fit the above linear model to the data and plot its estimated values, we get the following plot:

For other kinds of data, a more complex model such as a Poisson or a Cox Proportional Hazards model may be employed. In all cases, we are asking the model to estimate the conditional *mean*.

**But in some data sets, the mean is not a good exemplar of the data**.

Here are three examples of data sets for which the mean may not be the appropriate statistic to estimate:

- The data are highly skewed to the left or to the right. Such is often the case in insurance claims or healthcare costs data where most claims are small valued but the data has a long tail of claims of increasing value.
- The data is multi-modal i.e. it has more than one highly frequently occurring value. For example, if the data has two roughly identical peaks, (a bimodal data set), the mean model will estimate a point inside the valley in between the two peaks which is not a good representative of the data.
- The data contains influential outliers. Regression models such as the Ordinary Least Squares Regression model that rely on minimizing the sum of squares of residual errors are highly sensitive to the presence of outliers. In such cases, it may be beneficial to minimize the sum of absolute values of residual errors, in addition to
*not*using the mean as the quantity to be estimated.

In all these cases, the mean does not adequately represent the nature of the data. One may be better served by estimating the **conditional median**.

In the most general terms, our “median-seeking” regression model would be specified by the following equation:

In the above equation, ** X** is the regression matrix and

*x**_i*is the

*ith*row of the matrix.

*β**_cap*is the vector of

*fitted*regression coefficients and

*f(.)*is some function of

*β**_cap*

**and**

*x**_i*for estimating the median under the constraint that the probability of the estimated value

*f(*

*β**_cap,*

*x**_i)*of

*y*being greater or equal to any observed value of

*y*is 50%.

For e.g., a model for estimating the median price of automobiles in which *f(**β**_cap, **x**_i) *has a **linear form** would be:

We may even go a step further. In a data set, the median is the 0.5 quantile (or 50th percentile) point meaning that 50% of the data points are less than the value of the median. Similarly, there are other quantile points that can be defined. The 0.1 quantile point (10th percentile) is the value such that only 10% of the data set is smaller than this value. Similarly, the 0.25 quantile point is greater in value than 25% of the data set, and so on.

We may want to build a regression model that estimates any or all of these quantile points (or corresponding percentile values).

The general equation of such a model is as follows:

In the above equation, *Q(.)* is the estimated quantile point for the q-quantile (or (q*100)th percentile). As before, *f(**β**_cap, **x**_i) *is a function that yields the estimated value of the desired q-quantile point subject to the constraint that the probability of any observed value of *y* being less than or equal to the estimated value *f(**β**_cap, **x**_i) *is exactly *q*. *q* ranges from 0 to 1, both inclusive.

For e.g., a model for estimating the 95th percentile price of automobiles in which *f(**β**_cap, **x**_i) *has a **linear form** would be:

## What is a quantile regression model used for?

When the data is distributed in a different way in each quantile of the data set, it may be advantageous to fit a different regression model to meet the unique modeling needs of each quantile instead of trying to fit a one-size-fits-all model that predicts the conditional mean. In such cases, the coefficients of all such quantile models would differ from each other.

The converse situation is one where the data is identically distributed in each quantile. Specifically, the error term of a model that estimates the conditional mean has an identical distribution for each *x**_i*. For example, the errors are normally distributed for all *x**_i* and all the distributions are identical in their mean and variance. In other words, the **errors are homoskedasti**c. It can be shown that this case, when different quantile regression models are fitted on a homoskedastic data set, the corresponding sets of coefficients of all models will work out to be statistically identical, and* the various quantile models will differ from each other only in the value the regression intercept. *In fact, this behavior can form the basis for a test of heteroskedasticity.

Another reason to build a quantile model is that we are actually interested in estimating a particular quantile point for say, socioeconomic reasons. For instance, we may want to estimate the the 95th percentile SAT score of test takers who share a set of demographic characteristics.

In the next section, we’ll go through a short tutorial on how to build a quantile model using Python.

## How to build a quantile regression model using Python and statsmodels

We’ll illustrate the procedure of building a quantile regression model using the following data set of vehicles containing specifications of 200+ automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle:

A subset of this data can be downloaded **from here**.

Let’s load the data set into a Pandas DataFrame and plot the data set.

```
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
import numpy as np
from matplotlib import pyplot as plt
#Import the 7-variable subset of the automobiles dataset into a DataFrame
df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0)
#Plot price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
```

Let’s also plot the conditional means *E(price|num_of_cylinders):*

```
num_of_cylinders = np.array(df.groupby('num_of_cylinders')['num_of_cylinders'].mean())
conditional_means = np.array(df.groupby('num_of_cylinders')['price'].mean())
plt.scatter(x=num_of_cylinders, y=conditional_means, color='gold', marker='o')
plt.show()
```

We get the following graph. The gold dots represent the conditional means:

Next, let’s define the equation of the regression model in Patsy syntax. The intercept is assumed:

```
reg_exp = 'price ~ num_of_cylinders'
```

We’ll use this expression to carve out the ** y** and

**matrices:**

*X*```
y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
```

We’ll first build and train an Ordinary Least Squares (OLS) Regression Model which estimates the conditional mean *price* for a given value of *num_of_cylinders *and plot the regression line on the scatter plot of points:

```
#Build and train an OLS regression model
olsr_model = sm.OLS(endog=y_train, exog=X_train)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
#Plot the OLS regression line on the scatter plot of Price versus num_of_cylinders
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
#Get the estimated conditional means from the trained OLS model
y_pred_ols = olsr_model_results.predict(X_train)
#Plot the estimated conditional means
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
color='red', marker='o', linestyle='solid', label='OLS Model')
#Also plot the observed conditional means i.e. E(price | num_of_cylinders)
conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)')
plt.legend(handles=[ols, conditional_mean_pts])
plt.show()
```

We see the following graph:

Let’s now start working with the median and other quantile points.

We’ll start with training a model that will estimate the conditional median, namely, *median(price|num_of_cylinders)*.

Let’s create an instance of the quantile regression model as follows:

```
median_model = smf.quantreg(formula=reg_exp, data=df)
```

Next, we’ll train the model. We’ll tell statsmodels that we want to fit to the conditional median which is the 0.5 quantile point:

```
median_model_results = median_model.fit(q=0.5)
```

Now, let’s plot the estimated conditional median points from this model against the backdrop of the raw *price* versus *num_of_cylinders* data. For comparison we’ll also plot the estimated conditional means from the OLS model that we built earlier, and we’ll also plot the observed values of the conditional means and conditional medians for comparison.

The following piece of Python code does all of this:

```
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
#Show the scatter plot of price versus num_of_cylinders
plt.scatter(x=df['num_of_cylinders'], y=df['price'], c='deepskyblue')
#Get the estimated conditional medians from the median model
y_pred_median = median_model_results.predict(X_train)
#Plot the estimated conditional medians
median, = plt.plot(X_train['num_of_cylinders'], y_pred_median,
color='black', marker='o', linestyle='solid', label='Median Model')
#For comparison, also plot the estimated conditional means from the OLS model we built earlier
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
color='red', marker='o', linestyle='solid', label='OLS Model')
#Calculate the observed conditional medians
conditional_medians = np.array(df.groupby('num_of_cylinders')['price'].median())
#Plot the observed conditional medians
conditional_median_pts = plt.scatter(x=num_of_cylinders, y=conditional_medians, c='sienna', marker='^', label='Median(price | num_of_cylinders)')
#For comparison, plot the observed conditional means
conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)')
#Set up the legend and show the plot
plt.legend(handles=[ols, median, conditional_mean_pts, conditional_median_pts])
plt.show()
```

We see the following plot:

One thing we notice from the plot is that the observed values of the conditional median and conditional mean points almost overlap each other implying that the price data is more or less balanced around the mean. This is borne out by the trend lines of OLS and median regression models which are very close to each other, especially toward the center portion of the data.

Let’s also plot the regression lines for a few other quantile points, say, [0.1, 0.25, 0.5, 0.75, 0.9].

```
fig = plt.figure()
fig.suptitle('Price versus Number of Cylinders')
plt.xlabel('number_of_cylinders')
plt.ylabel('price')
plt.scatter(x=df['num_of_cylinders'], y=df['price'])
coeff = []
colors = ['orange', 'lime', 'yellow', 'cyan', 'violet']
i=0
handles = []
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
for q in quantiles:
#Build the model
quantile_model = smf.quantreg(formula=reg_exp, data=df)
#Fit the model
quantile_model_results = quantile_model.fit(q=q)
print(quantile_model_results.summary())
coeff.append(quantile_model_results.params['num_of_cylinders'])
#Get the estimated values from the quantile model
y_pred_quantile = quantile_model_results.predict(X_train)
#Plot the estimated values
quantile, = plt.plot(X_train['num_of_cylinders'], y_pred_quantile, color=colors[i], marker='o', linestyle='solid', label=str(int(q*100))+'th percentile Model')
i = i+1
handles.append(quantile)
#Also plot the estimated values from the OLS model for comparison
ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols,
color='black', marker='o', linestyle='solid', label='OLS Model')
handles.append(ols)
plt.legend(handles=handles)
plt.show()
```

Here’s how the plot looks like:

If the errors from the OLS model are identically distributed, in other words, if the errors are homoskedastic, the trend lines for all quantile models will differ only in the intercept, i.e. they will be parallel to each other. That’s clearly not what we are seeing in this case leading us to believe that the errors from the OLS model are heteroskedastic.

A plot of the estimated coefficient of *num_of_cylinders* against the quantile number yields the following plot which only reinforces the above conclusion:

One more thing we notice is that the partial effect of *num_of_cylinders* on the vehicle *price* (i.e. the coefficient of *num_of_cylinders*) increases with the quantile number, indicating that the higher quantile prices are more sensitive than lower quantile prices to changes in number of cylinders. For instance, the 90th percentile vehicle price changes more with each unit change in the number of cylinders than the 80th percentile price.

Here’s the complete source code used in the chapter:

## Citations and Copyrights

### Data set

**The Automobile Data Set** is sourced from UCI ML data sets repository under CC BY 4.0.

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **Building Linear Models For Discontinuous Data

**NEXT: ** How To Use Proxy Variables In A Regression Model