# A Tutorial On Generalized Least Squares Estimation Using Python And Statsmodels

We’ll see how to use the GLS estimator to fit a linear model on a real world data set

The Generalized Least Squares (GLS) estimator is an effective alternative to the Ordinary Least Squares (OLS) estimator for fitting linear models on data sets that exhibit heteroskedasticity (i.e., non-constant variance) and/or auto-correlation.

In the previous chapter, we had detailed out the motivation for the GLS estimator and described how it is built. In this chapter, we’ll use the GLS estimator to fit a linear model on a real world data set.

While it is possible to learn how to use GLS by simply following the tutorial shown in this chapter, I would strongly recommend reviewing the following chapter first:

A Deep-Dive Into Generalized Least Squares Estimation

## A brief introduction to GLS

When we use the Ordinary Least Squares estimator to fit a linear model on a data set, we implicitly assume that the model’s errors are homoskedastic (constant variance) and uncorrelated. But this assumption often breaks down for real world data sets. In the face of heteroskedastic, auto-correlated errors, the OLS estimator is still consistent and unbiased but it loses its efficiency. Specifically, OLS is no long produces coefficient estimates with the least possible variance.

The issue is that when using OLS, the formula for calculating the variance of the coefficient estimates assumes homoskedastic, uncorrelated errors.

For a linear model given by the following equation:

An Ordinary Least Squares estimation of β yields the following OLS estimator:

In the above formula, β_cap is the vector of coefficients estimated by OLS, X is the matrix of regression model’s coefficients (including the intercept), y is the response variable.

The variance of the coefficient estimates is calculated as follows:

σ² is the constant variance of the error term ϵ of the model.

But when the errors are heteroskedastic and/or correlated, the above formula outputs an incorrect variance leading to errors in the calculation of standard errors, z-scores, p-values and confidence intervals of the estimated coefficients. Coefficients that are not statistically significant may get incorrectly flagged as significant. The whole procedure for statistical inference yields incorrect values.

The Generalized Least Squares estimator remedies this situation. The GLS estimator of β for the linear model y = βX + ϵ is given by the following formula:

And the variance of the coefficient estimates is given by the following equation:

Here, Ω (capital Omega) matrix is an [n x n] size matrix which along with the σ² scaling factor forms the covariance matrix of errors as follows:

Here, σ² forms a scaling factor, a constant, that we extract out of the matrix such that ω_ij=ρ_ij/σ². Each diagonal element ρ_ii is the variance of the error e_ii corresponding to the ith row in the data set, while each non-diagonal element ρ_ij is the covariance between e_i and e_j.

To use equations (5) and (6), we must estimate σ² and Ω. There are several techniques available to do so. In this chapter, we’ll describe one such technique and use it to fit a GLS model on a data set drawn from the US census bureau.

## The regression model

The following linear model of county-level poverty rates in the United States uses socioeconomic factors such as the median age of the county’s inhabitants, the vacancy rate of houses for sale in the county and the percentage of the county’s population with at least one college degree.

## The data set

To train this model, we’ll use use data from the US Census Bureau aggregated at a county level. The data is a subset of the 2015–2019 American Community Survey (ACS) 5-Year Estimates conducted by the US Census Bureau. The following table contains the data that we’ll use (tap or click on the image to zoom it):

The data set used in this example can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs, or directly from the Census Bureau’s Community Explorer website.

## Estimation of the linear model using Generalized Least Squares using Python and Statsmodels

We’ll use Python and Pandas to load the ACS data file into memory, and we’ll use the Python based statsmodels package to build and fit the linear model.

Let’s start by importing the required packages and loading the data file into a Pandas `DataFrame`:

```import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy import dmatrices
from matplotlib import pyplot as plt
import numpy as np

#Load the US Census Bureau data into a Dataframe
```

Let’s construct the model’s equation in Patsy syntax. Statsmodels will automatically add the intercept of regression to the model, so we don’t have to explicitly specify it in the model’s equation:

```reg_expr = 'Percent_Households_Below_Poverty_Level ~ Median_Age + Homeowner_Vacancy_Rate + Percent_Pop_25_And_Over_With_College_Or_Higher_Educ'
```

Build and train the model and print the training summary:

```olsr_model = smf.ols(formula=reg_expr, data=df)olsr_model_results = olsr_model.fit()print(olsr_model_results.summary())
```

We see the following summary:

We’ll ignore the value of R-squared (or adjusted R-squared of 0.191) reported by statsmodels as our interest lies in estimating the main effects (namely the coefficient estimates), their standard errors and 95% confidence intervals, and whether the coefficients are statistically significant. Let’s zoom into the relevant portion of the output:

We see that all coefficient estimates are significant at a p of < .001. By default, Statsmodels has assumed homoskedastic errors and accordingly used Eq (3) reproduced below to calculate the covariance matrix of coefficient estimates:

If the errors are not homoskedastic, then the above formula will yield misleading results.

A visual test of heteroskedasticity should tell us if the variance of the errors (as estimated by the variance of the fitted model’s residuals) is constant. We’ll plot the fitted model’s residuals (which are unbiased estimates of the model’s errors) against the fitted (i.e., predicted) values of the response variable, as follows:

```#Carve out the y and X matrices
y_train, X_train = dmatrices(reg_expr, df, return_type='dataframe')

#Get the predicted values of y from the fitted model
y_cap = olsr_model_results.predict(X_train)

#Plot the model's residuals against the predicted values of y
plt.xlabel('Predicted value of Percent_Households_Below_Poverty_Level')
plt.ylabel('Residual error')plt.scatter(y_cap, olsr_model_results.resid)
plt.show()
```

We see the following scatter plot (the red lines are drawn just to illustrate the heteroskedastic variance):

The model’s errors are clearly heteroskedastic. The use of the GLS estimator is indicated for this data set.

To do so, we must first estimate the covariance matrix of errors, namely the Σ (Sigma) matrix.

We’ll use the following procedure for estimating the covariance matrix of errors Σ:

1. Chop up y_cap into intervals of size 1.
2. For each interval, collect the residual errors from the OLS fitted model and compute their sample variance.
3. Regress these sample variances on the centers of the intervals of y_cap. For e.g. the center of the interval (5, 6] is 5.5.
4. Use this fitted regression model to estimate the sample variance of the residuals of the OLSR model.
5. Use these estimated sample variances of residuals as the estimates of the corresponding error term’s population variances.

Let’s implement this procedure.

Create a range object for intervalizing (binning) y_cap:

```interval_range=range(round(min(y_cap)), round(max(y_cap)))
```

Convert it to a list:

```ycap_intervals=list(interval_range)
```

Intervalize y_cap:

```intervalized_ycap=pd.cut(x=y_cap, bins=ycap_intervals)
```

Print it out to see how it looks like:

```print(intervalized_ycap)
```

We see that the intervalizing has created 37 unique intervals across 3219 y values:

Next, we use a dictionary to extract the set of unique intervals of y_cap. The keys of the dict are the intervals while the values will be list of residuals corresponding to those intervals. We’ll fill in the values soon.

```distinct_intervals_of_ycap={}
for intr in list(intervalized_ycap):
distinct_intervals_of_ycap[intr] = []
```

Create a DataFrame consisting of fitted (predicted) y values from the OLSR model, the interval that each fitted value lies in, and the residual from the fitted OLSR model corresponding to each fitted y value:

```series_dict={'Y_CAP':y_cap, 'INTR':intervalized_ycap, 'RESID':olsr_model_results.resid}
df_intervalized_ycap = pd.DataFrame(data=series_dict)
```

Now iterate through the rows of this DataFrame, and for each row, collect the residual, and using the interval as the key, add the residual into the appropriate slot in distinct_intervals_of_ycap dict:

```for index, row in df_intervalized_ycap.iterrows():
resid = row['RESID']
interval = row['INTR']
distinct_intervals_of_ycap[interval].append(resid)
```

Next, construct two arrays of size=length of data set. The fist array will store the midpoints of the intervals of fitted y values. The second array will store the sample variance of the residuals corresponding to all the fitted y values that lie within the interval.

```midpoints_of_ycap_intervals = []

sample_variance_of_intervalized_residuals = []

for key in distinct_intervals_of_ycap.keys():
if pd.isnull(key):
continue
resids = distinct_intervals_of_ycap[key]
#Skip intervals with <= 1 residual
if len(resids) > 1:
sample_variance_of_intervalized_residuals.append(np.var(a=resids, ddof=1))
midpoints_of_ycap_intervals.append(key.mid)
```

Plot the sample variances of intervalized residuals against the midpoints of the intervalized fitted y to get a sense of the relationship between them:

```plt.xlabel('Midpoints of intervalized fitted y values')
plt.ylabel('Variance of intervalized residuals')
plt.scatter(x=midpoints_of_ycap_intervals, y=sample_variance_of_intervalized_residuals)
plt.show()
```

From looking at the plot, we suspect an exponential relationship between the variance and y_cap

An exponential relationship between the variance of residuals and the fitted y values is commonly seen when the errors are heteroskedastic.

Let’s build a log-linear model between the intervalized variances and the intervalized fitted y. Specifically, we’ll regress `log(sample_variance_of_intervalized_residuals)` on
`midpoints_of_ycap_intervals` and a constant:

```log_sample_variance_of_intervalized_residuals = np.log(sample_variance_of_intervalized_residuals)
```

Construct a DataFrame consisting of `log_sample_variance_of_intervalized_residuals` and `sample_variance_of_intervalized_residuals`:

```df_ycap_mids_resid_vars = pd.DataFrame(data={'LOG_INTERVALIZED_RESID_VARIANCE':log_sample_variance_of_intervalized_residuals,'MIDPOINT_INTERVALIZED_YCAP':midpoints_of_ycap_intervals})

ols_resid_variance_model = smf.ols(
'LOG_INTERVALIZED_RESID_VARIANCE ~ MIDPOINT_INTERVALIZED_YCAP',  data=df_ycap_mids_resid_vars)

ols_resid_variance_model_results = ols_resid_variance_model.fit()
```

Use this model to predict the sample variances of all residuals in the original OLSR model in which we regressed Percent_Households_Below_Poverty_Level on Median_Age, Homeowner_Vacancy_Rate, Percent_Pop_25_And_Over_With_College_Or_Higher_Educ and a constant:

```df_ycap_from_ols_model = pd.DataFrame(data={'MIDPOINT_INTERVALIZED_YCAP':y_cap})

log_resid_sample_variance_estimate = ols_resid_variance_model_results.predict(
df_ycap_from_ols_model['MIDPOINT_INTERVALIZED_YCAP'])

resid_sample_variance_estimate = np.exp(log_resid_sample_variance_estimate)
```

Let’s visually inspect the estimated sample variance of residuals by plotting them against the fitted values from the original OLSR model:

```plt.xlabel('Fitted y values')
plt.ylabel('Estimated sample variance of residuals')
plt.scatter(x=y_cap, y=resid_sample_variance_estimate)
plt.show()
```

The plot shows the exponential relationship which is entirely expected since we are using a log-linear model.

We now have all the ingredients to build the Σ matrix.

We’ll begin with creating an Identity matrix of size [n x n] and fill in its diagonal with the estimated sample variances of residuals:

```sigma_matrix = np.identity(len(y_cap))

np.fill_diagonal(sigma_matrix, np.exp(resid_sample_variance_estimate))
```

Build and train the GLS model using this Σ matrix:

```gls_model = sm.GLS(endog=y_train, exog=X_train, sigma=sigma_matrix)

gls_model_results = gls_model.fit()

print(gls_model_results.summary())
```

We see the following summary:

Let’s compare the parameter estimates from the OLS model, and the GLS model:

```data={'OLS_MODEL':olsr_model_results.params,
'GLS_MODEL':gls_model_results.params}

print(pd.DataFrame(data=data))
```

We see the following comparative output:

Let’s also compare the standard errors of the parameter estimates from the OLS model, and the GLS model:

```data={'OLS_MODEL':olsr_model_results.bse, 'GLS_MODEL':gls_model_results.bse}

print(pd.DataFrame(data=data))
```

We see the following comparative output showing what we would expect from the GLS estimated model, namely that its standard errors are much smaller than those from the OLS estimated model:

The GLS estimated model is seen to be more precise than the OLS estimated model for this data set.

Here’s the code used in the chapter: