We’ll look at how to predict the median and other quantile points
In a regression model, one is normally interested in estimating the conditional mean of the response variable. For example, consider the following plot of vehicle prices versus number of cylinders:
The gold dots represent the observed mean price conditioned upon the number of cylinders.
If we wanted to estimate this conditional mean price using a regression model, we could employ the following linear model to do the job:
If we fit the above linear model to the data and plot its estimated values, we get the following plot:
But in some data sets, the mean is not a good exemplar of the data.
Here are three examples of data sets for which the mean may not be the appropriate statistic to estimate:
- The data are highly skewed to the left or to the right. Such is often the case in insurance claims or healthcare costs data where most claims are small valued but the data has a long tail of claims of increasing value.
- The data is multi-modal i.e. it has more than one highly frequently occurring value. For example, if the data has two roughly identical peaks, (a bimodal data set), the mean model will estimate a point inside the valley in between the two peaks which is not a good representative of the data.
- The data contains influential outliers. Regression models such as the Ordinary Least Squares Regression model that rely on minimizing the sum of squares of residual errors are highly sensitive to the presence of outliers. In such cases, it may be beneficial to minimize the sum of absolute values of residual errors, in addition to not using the mean as the quantity to be estimated.
In all these cases, the mean does not adequately represent the nature of the data. One may be better served by estimating the conditional median.
In the most general terms, our “median-seeking” regression model would be specified by the following equation:
In the above equation, X is the regression matrix and x_i is the ith row of the matrix. β_cap is the vector of fitted regression coefficients and f(.) is some function of β_cap and x_i for estimating the median under the constraint that the probability of the estimated value f(β_cap, x_i) of y being greater or equal to any observed value of y is 50%.
For e.g., a model for estimating the median price of automobiles in which f(β_cap, x_i) has a linear form would be:
We may even go a step further. In a data set, the median is the 0.5 quantile (or 50th percentile) point meaning that 50% of the data points are less than the value of the median. Similarly, there are other quantile points that can be defined. The 0.1 quantile point (10th percentile) is the value such that only 10% of the data set is smaller than this value. Similarly, the 0.25 quantile point is greater in value than 25% of the data set, and so on.
We may want to build a regression model that estimates any or all of these quantile points (or corresponding percentile values).
The general equation of such a model is as follows:
In the above equation, Q(.) is the estimated quantile point for the q-quantile (or (q*100)th percentile). As before, f(β_cap, x_i) is a function that yields the estimated value of the desired q-quantile point subject to the constraint that the probability of any observed value of y being less than or equal to the estimated value f(β_cap, x_i) is exactly q. q ranges from 0 to 1, both inclusive.
For e.g., a model for estimating the 95th percentile price of automobiles in which f(β_cap, x_i) has a linear form would be:
What is a quantile regression model used for?
When the data is distributed in a different way in each quantile of the data set, it may be advantageous to fit a different regression model to meet the unique modeling needs of each quantile instead of trying to fit a one-size-fits-all model that predicts the conditional mean. In such cases, the coefficients of all such quantile models would differ from each other.
The converse situation is one where the data is identically distributed in each quantile. Specifically, the error term of a model that estimates the conditional mean has an identical distribution for each x_i. For example, the errors are normally distributed for all x_i and all the distributions are identical in their mean and variance. In other words, the errors are homoskedastic. It can be shown that this case, when different quantile regression models are fitted on a homoskedastic data set, the corresponding sets of coefficients of all models will work out to be statistically identical, and the various quantile models will differ from each other only in the value the regression intercept. In fact, this behavior can form the basis for a test of heteroskedasticity.
Another reason to build a quantile model is that we are actually interested in estimating a particular quantile point for say, socioeconomic reasons. For instance, we may want to estimate the the 95th percentile SAT score of test takers who share a set of demographic characteristics.
In the next section, we’ll go through a short tutorial on how to build a quantile model using Python.
How to build a quantile regression model using Python and statsmodels
We’ll illustrate the procedure of building a quantile regression model using the following data set of vehicles containing specifications of 200+ automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle:
A subset of this data can be downloaded from here.
Let’s load the data set into a Pandas DataFrame and plot the data set.
import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from patsy import dmatrices import numpy as np from matplotlib import pyplot as plt #Import the 7-variable subset of the automobiles dataset into a DataFrame df = pd.read_csv('automobiles_dataset_subset_uciml.csv', header=0) #Plot price versus num_of_cylinders fig = plt.figure() fig.suptitle('Price versus Number of Cylinders') plt.xlabel('number_of_cylinders') plt.ylabel('price') plt.scatter(x=df['num_of_cylinders'], y=df['price'])
Let’s also plot the conditional means E(price|num_of_cylinders):
num_of_cylinders = np.array(df.groupby('num_of_cylinders')['num_of_cylinders'].mean()) conditional_means = np.array(df.groupby('num_of_cylinders')['price'].mean()) plt.scatter(x=num_of_cylinders, y=conditional_means, color='gold', marker='o') plt.show()
We get the following graph. The gold dots represent the conditional means:
Next, let’s define the equation of the regression model in Patsy syntax. The intercept is assumed:
reg_exp = 'price ~ num_of_cylinders'
We’ll use this expression to carve out the y and X matrices:
y_train, X_train = dmatrices(reg_exp, df, return_type='dataframe')
We’ll first build and train an Ordinary Least Squares (OLS) Regression Model which estimates the conditional mean price for a given value of num_of_cylinders and plot the regression line on the scatter plot of points:
#Build and train an OLS regression model olsr_model = sm.OLS(endog=y_train, exog=X_train) olsr_model_results = olsr_model.fit() print(olsr_model_results.summary()) #Plot the OLS regression line on the scatter plot of Price versus num_of_cylinders fig = plt.figure() fig.suptitle('Price versus Number of Cylinders') plt.xlabel('number_of_cylinders') plt.ylabel('price') plt.scatter(x=df['num_of_cylinders'], y=df['price']) #Get the estimated conditional means from the trained OLS model y_pred_ols = olsr_model_results.predict(X_train) #Plot the estimated conditional means ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols, color='red', marker='o', linestyle='solid', label='OLS Model') #Also plot the observed conditional means i.e. E(price | num_of_cylinders) conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)') plt.legend(handles=[ols, conditional_mean_pts]) plt.show()
We see the following graph:
Let’s now start working with the median and other quantile points.
We’ll start with training a model that will estimate the conditional median, namely, median(price|num_of_cylinders).
Let’s create an instance of the quantile regression model as follows:
median_model = smf.quantreg(formula=reg_exp, data=df)
Next, we’ll train the model. We’ll tell statsmodels that we want to fit to the conditional median which is the 0.5 quantile point:
median_model_results = median_model.fit(q=0.5)
Now, let’s plot the estimated conditional median points from this model against the backdrop of the raw price versus num_of_cylinders data. For comparison we’ll also plot the estimated conditional means from the OLS model that we built earlier, and we’ll also plot the observed values of the conditional means and conditional medians for comparison.
The following piece of Python code does all of this:
fig = plt.figure() fig.suptitle('Price versus Number of Cylinders') plt.xlabel('number_of_cylinders') plt.ylabel('price') #Show the scatter plot of price versus num_of_cylinders plt.scatter(x=df['num_of_cylinders'], y=df['price'], c='deepskyblue') #Get the estimated conditional medians from the median model y_pred_median = median_model_results.predict(X_train) #Plot the estimated conditional medians median, = plt.plot(X_train['num_of_cylinders'], y_pred_median, color='black', marker='o', linestyle='solid', label='Median Model') #For comparison, also plot the estimated conditional means from the OLS model we built earlier ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols, color='red', marker='o', linestyle='solid', label='OLS Model') #Calculate the observed conditional medians conditional_medians = np.array(df.groupby('num_of_cylinders')['price'].median()) #Plot the observed conditional medians conditional_median_pts = plt.scatter(x=num_of_cylinders, y=conditional_medians, c='sienna', marker='^', label='Median(price | num_of_cylinders)') #For comparison, plot the observed conditional means conditional_mean_pts = plt.scatter(x=num_of_cylinders, y=conditional_means, c='gold', marker='o', label='E(price | num_of_cylinders)') #Set up the legend and show the plot plt.legend(handles=[ols, median, conditional_mean_pts, conditional_median_pts]) plt.show()
We see the following plot:
One thing we notice from the plot is that the observed values of the conditional median and conditional mean points almost overlap each other implying that the price data is more or less balanced around the mean. This is borne out by the trend lines of OLS and median regression models which are very close to each other, especially toward the center portion of the data.
Let’s also plot the regression lines for a few other quantile points, say, [0.1, 0.25, 0.5, 0.75, 0.9].
fig = plt.figure() fig.suptitle('Price versus Number of Cylinders') plt.xlabel('number_of_cylinders') plt.ylabel('price') plt.scatter(x=df['num_of_cylinders'], y=df['price']) coeff =  colors = ['orange', 'lime', 'yellow', 'cyan', 'violet'] i=0 handles =  quantiles = [0.1, 0.25, 0.5, 0.75, 0.9] for q in quantiles: #Build the model quantile_model = smf.quantreg(formula=reg_exp, data=df) #Fit the model quantile_model_results = quantile_model.fit(q=q) print(quantile_model_results.summary()) coeff.append(quantile_model_results.params['num_of_cylinders']) #Get the estimated values from the quantile model y_pred_quantile = quantile_model_results.predict(X_train) #Plot the estimated values quantile, = plt.plot(X_train['num_of_cylinders'], y_pred_quantile, color=colors[i], marker='o', linestyle='solid', label=str(int(q*100))+'th percentile Model') i = i+1 handles.append(quantile) #Also plot the estimated values from the OLS model for comparison ols, = plt.plot(X_train['num_of_cylinders'], y_pred_ols, color='black', marker='o', linestyle='solid', label='OLS Model') handles.append(ols) plt.legend(handles=handles) plt.show()
Here’s how the plot looks like:
If the errors from the OLS model are identically distributed, in other words, if the errors are homoskedastic, the trend lines for all quantile models will differ only in the intercept, i.e. they will be parallel to each other. That’s clearly not what we are seeing in this case leading us to believe that the errors from the OLS model are heteroskedastic.
A plot of the estimated coefficient of num_of_cylinders against the quantile number yields the following plot which only reinforces the above conclusion:
One more thing we notice is that the partial effect of num_of_cylinders on the vehicle price (i.e. the coefficient of num_of_cylinders) increases with the quantile number, indicating that the higher quantile prices are more sensitive than lower quantile prices to changes in number of cylinders. For instance, the 90th percentile vehicle price changes more with each unit change in the number of cylinders than the 80th percentile price.
Here’s the complete source code used in the chapter: