# A Tutorial On Solving a System of Linear Regression Equations

How to fit a system of regression equations using Python and Statsmodels

In this chapter, we’ll walk through a tutorial on how to fit a system of linear regression models using the Generalized Least Squares (GLS) technique

We’ll use the term ‘regression model’ and ‘regression equation’ more or less interchangeably.

This tutorial assumes that you already know the concepts and theory behind fitting regression equation systems using GLS. And needless to say, you know what GLS is. If you don’t know either of these things or need a quick refresher on them, I’ll recommend reviewing the following chapters in the specified order before jumping into this tutorial:

A Deep-Dive into Generalized Least Squares Estimation

A Tutorial on Generalized Least Squares Regression Using Python and Statsmodels

An Illustrated Introduction to Systems of Regression Equations

## A system of regression equations

The system of regression models we’ll use in this tutorial is based on the following Capital Asset Model (CAPM):

In the above equation, the CAPM regresses the risk-free return on asset i, at time t, over some trailing time period (such as 3-months), on the risk-free return on the m market in which the asset i is traded, at the same time instant t, and over the same trailing time period.

We’ll consider the following basket of 8 stocks traded on the NYSE and NASDAQ in our equations system:

1. Chevron Corporation
2. Halliburton Company
3. Alcoa Corporation
4. United States Steel Corporation
5. Ford Motor Company
6. Tesla Inc
7. Alphabet Inc
8. Microsoft Corporation

The following table lists each stock, it’s ticker on NYSE or NASDAQ, the sector that it lies in, and the corresponding market as proxied by the S&P 500 industry/sector index.

We’ll construct a CAPM model (which is a linear model) for each of the above 8 stocks. Thus, these 8 stocks will form a system of 8 Capital Asset Pricing Models.

We’ll assume that the trailing time period is 3-months.

We’ll also assume that the risk-free asset r_f is the 3-Month Treasury Bill Secondary Market Rate series from US FRED.

## Regression goal

We have two regression goals:

1. To estimate the effect of the trailing 3-month risk-free returns on the respective market on the trailing 3-month risk-free returns on each stock in the system.
2. To estimate stock-specific, time-invariant fixed effects that influence the trailing 3-month risk-free returns on each stock in the system.

Our system of 8 models contains 2 coefficients α_i and β_i per model. The coefficient β_i let’s us estimate the effect of the market (goal #1), while α_i lets us estimate the stock-specific effect (goal #2).

Thus, we need to estimate a set of 16 coefficients: α_1 thru α_8, and β_1 thru β_8.

## Regression strategy

The error terms ϵ_it of CAPM models tend to be highly correlated across models. We’ll demonstrate this shortly in the context of a real world data set. Due to such cross-model correlations, we cannot use the Ordinary Least Squares (OLS) estimator to estimate our the individual models in the system even though they are all linear models. That’s because OLS produces inefficient (informally speaking: imprecise) estimates of coefficients in the face of such correlations amongst errors.

Therefore, our regression strategy will be the following:

1. We will estimate the matrix Ω of correlation errors of the 8-model system by individually estimating the 8 models using OLS and using the fitted models’ residual errors as the proxy for the corresponding error terms ϵ_i.
2. We’ll use the (Feasible) Generalized Least Squares (FGLS) estimator to estimate the entire 8-equation system in one-shot. The FGLS estimator is robust to heteroskedastic (non-constant variance) errors and errors that are correlated across models provided we supply the estimated matrix of correlation errors to the FGLS estimator. We’ll supply this matrix from point (1).

## The data set

We’ll train this 8-equation system on the historical stock price data downloaded from Yahoo Finance

As mentioned earlier, we’ll consider the 3-Month Treasury Bill Secondary Market Rate series from US FRED as the risk free asset.

After downloading the data from the respective sources, we process it so as to turn it into what is essentially a panel data set of trailing 3-month risk adjusted returns for each stock, and the same for the corresponding market index.

The processed data set is available for download over here. This is the data set that we will train the GLS model on.

Here’s how a slice of this data set looks like. The Risk Adjusted Returns (RAR) data panel for the 8 stocks and the corresponding market indices (Image by Author)

## Model Training

We begin by importing all the required packages:

```import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy import dmatrices
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sb
```

Now let’s create a list of the assets whose capital asset pricing models will make up the the equations system:

```asset_names = ['Chevron', 'Halliburton', 'Alcoa', 'US Steel', 'Ford', 'Tesla', 'Google',
'Microsoft']
```

M = number of equations in the system.

```M = len(asset_names)
```

Import the 90-day returns file into a Pandas `DataFrame`. The file has 90-day Risk-Adjusted-Returns data on 8 stock assets over the time period 02-Jan-2019 thru 30-Jan-2020, along with the 90-daya RAR data for the corresponding S & P market indices.

```df_90day_RAR = pd.read_csv('90day_RAR_on_assets.csv', header=0, parse_dates=['Date'], index_col=0)
```

N = number of data samples per equation:

```N = len(df_90day_RAR)
```

Create 8 independent linear models, one for each asset in the file and fit them using OLS. We’ll use the Patsy syntax for defining each model’s equation. Each model also contains an intercept which is assumed i.e. we don’t need to explicitly state it in the model’s equation:

```regression_equations = ['RAR_Chevron ~ RAR_Energy', 'RAR_Halliburton ~ RAR_Energy',
'RAR_Alcoa ~ RAR_Metals', 'RAR_USSteel ~ RAR_Metals', 'RAR_Ford ~ RAR_Auto',
'RAR_Tesla ~ RAR_Auto', 'RAR_Google ~ RAR_Technology', 'RAR_Microsoft ~ RAR_Technology']
```

Create a DataFrame to store the residual errors from the M models. Each column will contain the N residuals from each one of the M regression models:

```df_OLS_resid = pd.DataFrame()
```

Build and train the M linear models, and print out the training summary. Along the way, accumulate the residuals from each fitted model in a new column of the residuals data frame `df_OLS_resid`. Also along the way, accumulate the corresponding y and X design matrices for each model which we’ll soon use to build the stacked (block) model:

```y_matrices = []
X_matrices = []
i=0
for expr in regression_equations:
olsr_model = smf.ols(formula=expr, data=df_90day_RAR)
olsr_model_results = olsr_model.fit()
print(olsr_model_results.summary())
df_OLS_resid[asset_names[i]] = olsr_model_results.resid
y, X = dmatrices(expr, df_90day_RAR, return_type='dataframe')
y_matrices.append(y)
X_matrices.append(X)
i = i + 1
```

Here’s how the DataFrame of residuals `df_OLS_resid` looks like: The DataFrame of residual errors from the 8 individually fitted models (Image by Author)

As mentioned in the strategy section, if the residuals from the individual OLS models are heavily correlated across models, then the OLS estimator will not have fit the models efficiently.

Let’s display the heat map of correlations:

```sb.heatmap(df_OLS_resid.corr())
plt.show()
```

We see the following heat map showing a significant amount of correlation in several off diagonal positions indicating strong correlations between the residual errors of the different linear models in the system: Heat map of correlations between the residuals of the separately fitted linear models (Image by Author)

Let’s also look at the numerical correlation values by printing out the table of correlations: Table of correlations between the residuals of the separately fitted linear models (Image by Author)

Given the large degree of correlation of residuals across different models, we’ll have to fit the entire system of equations as one model.

We’ll start by building a Seemingly Unrelated Regression (SUR) model by stacking the y and X matrices from the individual models. And we’ll fit the SUR model using Feasible GLS. Again, I’ll encourage you to visit the links at the beginning of this chapter for the concepts and theory behind the SUR model and the (Feasible) GLS technique.

## The SUR model

The equation for our SUR model is simply y = + ϵ where y, X, β, and ϵ are block matrices created by stacking the y, X, β, and ϵ matrices from the individual CAPMs. The following figure illustrates the structure of this model: The Seemingly Unrelated Regression Model expressed using block matrices y, X, β, and ϵ (Image by Author)

We assume that there are M equations in the system (in our example, M=8) and we have fed N data samples into each equation (in our example, N=182).

The y matrix (vector) on the left is constructed by stacking the M number of y matrices (vectors) from the M equations. Each y matrix is of size [N x 1]. Hence the stacked y matrix is a column vector of size [MN x 1].

Similarly, the matrix of coefficients β is a stacked column matrix (vector) of size [(k_1+k_2+…+k_M) x 1] where k_1, k_2,…k_M are the number of regression variables in equation 1,2,…M respectively. In our example, k_1=k_2=…=k_M=2.

The regression errors matrix ϵ has the same dimensions as the stacked y matrix namely a column vector of size [MN x 1].

The stacked X matrix has the following structure:

We’ll now build the stacked X and y matrices. In the context of our 8-equation system, the stacked X has the following block structure:

```#X = [X0 Z Z Z Z Z Z Z
#     Z X1 Z Z Z Z Z Z
#     Z Z X2 Z Z Z Z Z
#     Z Z Z X3 Z Z Z Z
#     Z Z Z Z X4 Z Z Z
#     Z Z Z Z Z X5 Z Z
#     Z Z Z Z Z Z X6 Z
#     Z Z Z Z Z Z Z X7]
```

X0 thru X7 are the regression matrices from the individual data sets that we carved out earlier, and Z is a matrix of zeros of size [N x 2] i.e. N rows (for the N data samples fed into each CAPM) times 2 regression variables per model (the intercept and the the RAR on the respective market).

X0 thru X7 are each of size [N x 2]. Hence the block X matrix has a size [(8xN) x (8×2)]=[8N x 16].

```Z = np.zeros((N,2))

R1 = np.hstack((X_matrices, Z,Z,Z,Z,Z,Z,Z))
R2 = np.hstack((Z, X_matrices, Z,Z,Z,Z,Z,Z))
R3 = np.hstack((Z,Z, X_matrices, Z,Z,Z,Z,Z))
R4 = np.hstack((Z,Z,Z, X_matrices, Z,Z,Z,Z))
R5 = np.hstack((Z,Z,Z,Z, X_matrices, Z,Z,Z))
R6 = np.hstack((Z,Z,Z,Z,Z, X_matrices, Z,Z))
R7 = np.hstack((Z,Z,Z,Z,Z,Z, X_matrices, Z))
R8 = np.hstack((Z,Z,Z,Z,Z,Z,Z, X_matrices))

X = np.vstack((R1, R2, R3, R4, R5, R6, R7, R8))
```

The stacked y matrix has the following block structure:

```#y = [y0
#     y1
#     y2
#     y3
#     y4
#     y5
#     y6
#     y7]
```

Just like with the individual X matrices, y0 thru y7 are the y matrices from the individual CAPMs. y0 thru y7 are each of size [N x 1]. Hence the block y matrix has a size [8N x 1].

```y = np.vstack((y_matrices, y_matrices, y_matrices, y_matrices, y_matrices,
y_matrices, y_matrices, y_matrices))
```

### Estimation of the SUR model

We’ll estimate the SUR model using the Feasible Generalized Least Squares estimator. The FGLS estimator requires us to estimate the Ω (Omega) matrix which is the matrix of correlations between the regression errors ϵ of the SUR model.

Recollect that the SUR model’s regression errors ϵ form a column vector of size [MN x 1].

The correlation matrix of errors Ω is formed by multiplying the error vector ϵ of size [MN x 1] with its transpose ϵ’ of size [1 x MN]. Hence, the correlation matrix of errors will be a square matrix of size [MN x MN].

Each element of Ω is the correlation between e_ti — the error corresponding to the t-th data sample fed into the i-th equation in the system, and e_sj — the error corresponding to the s-th data sample fed into the j-th equation in the system.

Before we estimate Ω, we must make two simplifying assumptions:

1. We’ll assume that the regression errors from individual models in the system may be correlated across equations for the same time point t, but they are not correlated across different time points i.e. Corr(e_ti, e_sj) = 0 if t != s.
2. We’ll assume that the regression errors for any individual model (or pair of models) are homoskedastic i.e. they demonstrate constant variance. i.e. for any ϵ_ti and ϵ_tj in Ω, _ti)(ϵ_tj) = σ²_ij

With these two assumptions in place, it can be shown that the covariance matrix of errors Ω can be expressed as the Kronecker product (denoted by the symbol ⊗) of two matrices — A Σ (Sigma) and an I (Identity) matrix as illustrated in the figure below: The covariance matrix of errors Ω calculated as the Kronecker product Σ ⊗ I (Image by Author)

Thus, to estimate Ω, we must estimate the [M x M] size Σ matrix. We’ll denote S to be our estimate of Σ.

The S matrix has the following structure:

```# [e_11 e_12 e_13 ... e_1M
#  e_21 e_22 e_23 ... e_2M
#  ...  ...  ...  ... e_2M
#  e_M1 e_M2 e_M3 ... e_MM]
```

Recollect hat we had gathered together into the Pandas DataFrame `df_OLS_resid`, the residual errors from fitting the 8 models using OLS. Let’s examine for a minute how this DataFrame `df_OLS_resid` looks like: The DataFrame of residual errors from the 8 individually fitted models (Image by Author)

This DataFrame has M=8 columns corresponding to the 8 equations in our equations system. Each column contains the N=182 residuals corresponding to the 182 data samples fed into the corresponding linear model.

Using the convention used in Greene , we’ll denote this residual errors DataFrame by the letter E.

We’ll use E to calculate the [M x M] size (in our case, [8 x 8] size) matrix S.

It’s useful to think of the S matrix as an element-wise summation of N layers, each layer being of size [M x M]. Each [M x M] ‘layer’ is a matrix product of two vectors, e_t and e’_t (e_t_transpose) such that e’_t is the t-th row of the residuals in E. e’_t is a row vector of size [1 x M] thereby making e_t a column vector of size [1 x M]. e’_t is a row vector of size [1 x M] (Image by Author)

The matrix multiplication of e_t and e’_t yields a square matrix of size [M x M]. The formula for S is as follows: Calculation of S as the matrix product of transpose of the residuals Dataframe with itself (Image by Author)

Lets perform this calculation:

```E=np.array(df_OLS_resid)
S=(1/N)*np.matmul(np.transpose(E), E)
```

We’ll now calculate an estimate of Ω as the Kronecker product of S and I where I is an Identity matrix of size [N x N].

```I = np.identity(N)
estimated_omega = np.kron(S, I)
```

We can now use the estimated value of the Ω matrix to estimate the entire system of 8-equations in one go using the GLS model as follows. Incidentally, statsmodels calls the parameter into which we pass the Omega matrix as `sigma`:

```gls_model = sm.GLS(endog=y, exog=X, sigma=estimated_omega)
gls_model_results = gls_model.fit()
print(gls_model_results.summary())
```

We see the following output:

The training summary deserves some explanation.

### Understanding the training output of the SUR model

At the top of the training output, statsmodels shows some helpful summary information. For e.g. it tells us that it has used GLS as the estimator, the total number of data samples fed into the training were NM=1456 (=N x M = 182 per model x 8 models), the degrees of freedom (dof) of the model is 15 (= 16–1), and the dof of residuals is (NM — 16)=(1456–16)=1440:

Also notice the Covariance Type which is mentioned as nonrobust which means that statsmodels has not used an estimator for calculating the covariance matrix of coefficient estimates that is robust to heteroskedasticity (such as the White’s Heteroskedasticity Consistent estimator.

Next, we glance across to the right side of the panel where we see that the adjusted R-squared is 0.917 indicating that the model has been able to explain more than 91% of variance in y. That’s a pretty decent goodness-of-fit.

The F-statistic is 1068 with a p value < .001 indicating that the model’s variables are jointly significant.

The Log-Likelihood, AIC and BIC values indicate comparative goodness-of-fit scores so they cannot be interpreted without a reference model.

Now let’s move our eye down to the center portion of the output where we see the values of all the estimated coefficients, their standard errors, p values, and 95% confidence intervals: The coefficient estimates of the fitted SUR model, their standard errors, p values and 95% confidence intervals (Image by Author)

The x1 thru x16 are the coefficients of the corresponding regression variables in our SUR model. The SUR model consists of 8 equations with 2 variables per equation, namely the intercept and the respective industry index. Therefore there are 8 x 2 = 16 regression variables whose coefficients are estimated by GLS. In the above figure, we have illustrated the mapping of the model’s variables to the estimated coefficients. RAR stands for Risk Adjusted Return.

A look down the p values column tells us that except for 3 coefficients, the rest of the coefficients are statistically significant at a p of .05 (95%).

How should we interpret the estimated value of each coefficient?

Let’s look at x4. It is the estimate for RAR_Energy corresponding to the Halliburton portion of the SUR model. GLS has estimated its value to be 2.2377:

Remember that the SUR model is a linear model. Hence, the value of +2.2377 tells us two things:

1. The Risk Adjusted Return on Halliburton for any trailing 3-month period is directly proportional to the trailing 3-month RAR on the S&P Energy index.
2. For each unit percentage increase (decrease) in the trailing 3-month RAR on the S&P Energy Index, the trailing 3-month RAR on Halliburton increases (decreases) by 2.2377% with a 95% confidence interval of [2.059%, 2.417%].

### A comparison with the OLS model

Earlier in the chapter, we mentioned that fitting the 8 regression equations in the system separately using OLS will lead to imprecisely estimated models as OLS will not be able to efficiently fit these 8 models due to the correlation of errors across different models.

Let’s verify that claim by comparing the standard errors of some of the coefficients estimated earlier by OLS and those estimated by GLS.

Here’s the comparison of the coefficient estimates, their standard errors (outlined in red), p values and 95% confidence intervals. The blue rows contain the values estimated by GLS for the stacked SUR model while the orange rows contain the values estimated by individually estimating the 8 equations using OLS. A comparison between the coefficient estimates from GLS and OLS (Image by Author)

From looking at the table, one thing is starkly clear. The standard errors of the estimates from GLS are systematically smaller than those estimated by OLS.

It vindicates the theory that in such cases where there is heteroskedasticity and/or correlation between errors across models, GLS will produce a higher-precision (lower standard error) coefficient estimates than OLS.

Here’s the complete source code used in the tutorial:

The data set used by the tutorial is available for download over here.

### Data set

Board of Governors of the Federal Reserve System (US), 3-Month Treasury Bill Secondary Market Rate, Discount Basis [DTB3], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/DTB3, December 10, 2022.