How to fit a mesh of inter-dependent regression models using techniques such as GLS

In this article, we’ll study a very interesting topic in statistical modeling, namely, how to fit a system of inter-dependent regression models. At first glance, this topic may appear a bit esoteric, but systems of interconnected regression models pop up surprisingly often in real life.

In the rest of this article, we’ll use the term ‘equation’ and ‘model’ interchangeably.

The concept is best illustrated with an example.

## The Capital Asset Pricing Model

The following regression model is commonly used in Finance for estimating the price that the market is willing to pay for a capital asset. Examples of capital assets are a stock or a bond of a company, your house, a barrel of crude oil, or a bushel of corn.

Let’s look at what the above equation is telling us. *r_it* is the rate of return over some trailing period on asset *i* at time point *t*. An example of *r_it* is the return you will get at time point *t *after having invested $100 in shares of Ford Motor Company for the previous 3 months.

*r_ft* is the risk-free rate of return obtained at time point *t* over the *same *trailing period. Here, risk-free means that there is no risk whatsoever of losing your principal amount over the specific trailing period. As such, there is no such thing in this world as a 100% risk-free investment. But in practice, at least in the United States, US treasury bills that become due for payment in 3 months (DTB3) come closest to being considered a risk-free investment. That’s because, the chance of the US government running out of money in the next 3 months and not unable to pay back loans (which is what T-bills are) that are due in 3 months is basically zero.

Thus in equation (1), the term *(r_it — r_ft)* is the risk-free rate of return on asset *i *at time *t *over a specific trailing period. It is the excess return that one would get upon investing in asset *i* at time *t* instead of just having stashed away the money in the risk-free asset *f*. Notice that the risk free rate could in theory be negative. If Ford appreciates by only 0.25% over a 3 month period while the 3-month T-bill gives you 0.3% over the same 3-month period, your risk-free rate for investing in Ford over 3 months is a negative 0.05%! One way to look at this situation is that by investing in the 3-month T-bill, you are willing to pay a premium of .05% to the US government for protecting your capital during that particular 3-month period.

Getting back to Eq (1), the term *r_mt* is the return at time *t* on investing in the overall market in which asset *i* is being traded. For Ford, asset ‘*m*’ could be the S&P 500 index, or perhaps even better, the S&P 500 Automobiles Industry Index (SP500.251020) (although this particular choice for “ market” may be debatable). In any case, *(r_mt — r_ft)* is the risk adjusted return gotten at time *t* on investing in the market that Ford is traded in.

For NYSE:F, we can state the CAPM as follows:

Thus, the **Capital Asset Pricing Model** (**CAPM**) regresses the risk adjusted return over a certain trailing time period on asset *i* at time *t*, on the risk adjusted return on the corresponding market at *t*, over the same trailing time period. *β_i* is the regression coefficient for the regression variable that represents the risk adjusted return for the market.

The Capital Asset Pricing Model is a linear regression model.

There are undoubtedly several factors that influence the return on asset *i* that the return on the market is not able to capture. If *i* is a stock such as NYSE:F, there are stock specific factors such as its quarterly financial results, news about major investments announced by Ford during the quarter, and any adverse events such as large-scale vehicle recalls, all of which will impact the market price of NYSE:F.

In eq (2), the portion of the variance in the return on NYSE:F that the return on SP500.251020 is *not* able to capture, is the portion that leaks into the error term *ϵ_(NYSE:F)t*.

In general, the error term *ϵ_it *of the model in Eq (1) explains the variance in *(r_it — r_ft)* that the regression variable *(r_mt — r_ft)* is unable to capture.

How does all of this relate to the topic of this article?

## The CAPM as a system of regression equations

If we use the CAPM to estimate the risk adjusted returns on *M* assets, say all 500 stocks in the S&P index, we would end up a system of 500 regression equations as follows:

Notice that for each asset *i=1,2,…M*, the corresponding market may be different. For e.g. when i=NASDAQ:MSFT, the market could be S&P 500 Information Technology Sector Index.

The reason to lay out these equations is naturally so that we can estimate the corresponding regression models. To do so, we would need to estimate all 500 regression coefficients *β_1, β_2, β_3,…,β_500.*

Since each equation in (3) is a** linear regression model**, a naive approach would be to individually fit the 500 regression models on historical stock price data.

But this estimation strategy will yield imprecise estimated coefficients, for the following reason:

Over any given time period, say 3 months, the return on a stock is often correlated with the return on other related stocks and also the return on other markets that those related stocks trade in.

Consider the following 3-equation system consisting of the CAPMs for Ford, U.S. Steel and GlobalFoundries:

In (4a), (4b), (4c), the term *r_(DTB3)t* is the return on the 3-month T-bill.

Over any medium-to-long term period, the stock price of Ford is influenced by the profit that Ford makes from selling vehicles. This profit is influenced by the price at which Ford buys raw materials to make those vehicles. Steel and computer-chips are two raw materials that are used in large quantities for making motor vehicles. These sources of influence are not adequately captured in the regression variable *(r_(SP500.251020)t — r_(DTB3)t)*, namely the risk adjusted return of investing in the S&P 500 Automobiles (Industry) index in the regression equation for Ford. We know that in a regression model, what the regression variables cannot explain leaks into the error term. Thus, in the model in (4a), the unexplained variance in *(r_(NYSE:F)t — r_(DTB3)t) *is captured within the error term *ϵ_(NYSE:F)t*.

Now, the prices at which companies such the steel maker U. S. Steel (NYSE:X) and the chip maker GlobalFoundries (NASDAQ:GFS) are able to sell their steel and computer chips to companies such as Ford directly determine the profits of those steel and computer chips companies. And those profits in turn guide their stock prices and so also the risk adjusted returns on the investments in those stocks. Again, these sources of influence, namely the sales prices and the profit on steel and computer chips are not adequately captured in the corresponding market rates of return regression variables in the models (4b) and (4c). And therefore, these influences leak into the error terms *ϵ_(NYSE:X)t *and *ϵ_(NASDAQ:GFS)t *of the respective models.

What we have inferred is that for any given time frame *t* such as 3 months, the error term *ϵ_(NYSE:F)t* of the capital asset model for Ford is correlated with the the error terms *ϵ_(NYSE:X)t* and *ϵ_(NASDAQ:GFS)t *of the capital asset pricing models for U. S. Steel and GlobalFoundries respectively.

So the regression equations (4a), (4b), and (4c) aren’t really independent. In fact, they can be chained into each other.

### The effect of correlated error terms

What we showed in the previous section is an important observation. There is information contained within these correlations which we would not be able to harvest if we estimate the three equations (4a), (4b), and (4c) as three different linear regression models.

If we fit the three linear models individually using Ordinary Least Squares (OLS), this unharvested information will leak into the residual errors of the respective fitted models making the residual errors bigger. Bigger residual errors will mean a poor goodness-of-fit of the respective fitted model.

Secondly, the fitted models with exhibit a greater amount of variance in their estimated coefficients. Greater variance in coefficients will lead to larger standard errors and wider confidence intervals for each coefficient. These are all signs of a sub-optimally fitted model.

We have thus come upon the following result:

Ordinary Least Squares estimation is

inefficientwhen used to individually fit the linear regression models contained in systems of regression equations in which the error terms from different models are correlated.

In such cases, the correct strategy is to fit the entire system of regression equations as a single model using a technique such as

Generalized Least Squares (GLS)orFeasible Generalized Least Squares (FGLS).

Let’s formalize these observations.

## A framework for a system of regression equations

Consider the following set of *M* linear regression models:

Let’s assume that we have fed each model with *N* samples of data. That would mean that *y**_1, **y**_2, **y**_3,…,**y**_M* are all vectors of size *[N x 1].*

Let’s look at how the ** y** vector for the

*M-th*equation looks like:

In the above vector, *y_tM* is the *t-th* sample in the data set that is fed into the *Mth* equation in the system.

*X**_1, **X**_2, **X**_3, …,**X**_M* are matrices containing observed values of regression variables. Let *X**_1* contain *N *observations each for *k_1* variables including the regression intercept at column 1*. *Similarly**, X**

*_2,*

*X**_3, …,*

*X**_M*contain

*N*observations each for

*k_2, k_3,…k_M*regression variables. That would make

*X**_1*a matrix of size

*[N x k_1]*,

*X**_2*of size

*[N x k_2]*, and so on and

*X**_M*of size

*[N x k_M]*.

Here’s how the regression matrix for the *M-th *equation looks like:

In the above matrix, the first column is a column of *1s *and it acts as the placeholder for the regression intercept *β**_1M* of the *M-th* equation*.* The remaining columns contain regression variables. The *Mth *regression model contains *K_M* variables. Hence, *X_M *contains *k_M *columns.

Let’s look closely at the subscript-structure of each element in this matrix. For any element *x_tjm *in *X**_m*, the first subscript on *x_tjm *refers to the data sample number, the second subscript indicates the regression variable number, and the third subscript indicates the equation number.

Thus, element *x_tjm* in this matrix is the observed value of the *j-th* regression variable in the *t-th* sample in the data set that was fed into the *m-th* regression model in the system.

Even though the elements in *X**_M* (or in general *X**_m*) are triple-subscribed, we represent *X**_m* as a two-dimensional matrix. The third subscript (*m*), just serves the purpose of reminding us about the equation number (*m*) in the system that the variable lies in.

Some or all of the regression variables may be common amongst the different ** X** matrices.

Let’s look at the coefficients.

In the system of *M* equations, *β**_1, **β**_2, …, **β**_M* are the coefficient vectors of size *[k_1 x 1], [k_2 x 1], …, [k_M x 1]*. Here’s how the coefficients vector for the *M-th* equation looks like. The first subscript on *β*** **refers to the coefficient number, and the second subscript indicates the equation number:

*ϵ**_1, **ϵ**_2, …, **ϵ**_M* are vectors of regression errors, each of size *[N x 1]*. Here’s how the regression errors vector for the *M-th* equation looks like. The first and second subscripts on *ϵ*** **have the same meaning as for

*β*:

And here’s how these matrices come together to form the regression equation of the *M-th* model in the system:

Let’s once again circle back to the system of *M* equations:

We have seen earlier in the context of the CAPM, how, given two regression models for two different stocks *p* and *q *(e.g. NYSE:F and NYSE:GFS), and any given time *t*, the regression errors *ϵ**_tp* and *ϵ**_tq *from the respective models for *p* and *q* are correlated with each other. Such correlations make it impossible to efficiently estimate the two regression models individually using a technique such as OLS.

In general, even though the above set of *M *equations appear at first glance to be unrelated, they may be, in reality, intimately related to each other, and therefore *they need to solved as a single system*. Such a system of inter-connected regression equations is called the **Seemingly Unrelated Regression (SUR) model**.

## Setting the stage for the estimation of the SUR model

To estimate the SUR model, the first step is to “stack” the ** y**,

**,**

*X***, and**

*β***matrices from the**

*ϵ**M*equations one over the other as follows:

The ** y** matrix (vector) on the left is constructed by stacking the

*M*number of

**matrices (vectors) from the**

*y**M*equations. Each

**matrix is of size**

*y**[N x 1]*. Hence the stacked

**matrix is a column vector of size**

*y**[MN x 1]*.

Similarly, the matrix of coefficients ** β** is a stacked column matrix (vector) of size

*[(k_1+k_2+…+k_M) x 1].*

The regression errors matrix ** ϵ** has the same dimensions as the stacked

**matrix namely a column vector of size**

*y**[MN x 1]*.

Let’s look at the stacked ** X** matrix. It has the following interesting structure:

The stacked ** X** matrix illustrates a concept in linear algebra called block matrices. Notice how the individual

**matrices from the**

*X**M*equations are arranged along the main diagonal. Each one of these matrices

*X**_1,*

*X**_2, …,*

*X**_m, …,*

*X**_M*has

*N*rows corresponding to the

*N*data samples fed into each equation. Hence the the stacked

**matrix has**

*X**MN*rows. Each matrix

*X**_m*has potentially a different number of columns

*k_m*. The stacked

**contains**

*X**(k_1+k+2+…+k_m+…+k_M)*columns. All off-diagonal positions in

**are occupied by**

*X***which is a matrix containing only zeros. The**

*0***matrix lying at row**

*0**i*and column

*m*in

**is of size**

*X**[N x k_m]*.

The matrix ** X** is called a

**block matrix**as it is made up of “blocks”. Here, a block is used to refer to the matrices

*X**_1,*

*X**_2,…,*

*X**_M*and the

**matrices.**

*0*Similarly, matrices ** y**,

**and**

*β,***are also block matrices.**

*ϵ*## The SUR model for a system of three regression equations

Let’s revisit our example 3-equation system:

In the above system, there are three CAPM models, one each for the stocks of Ford, U.S. Steel, and GlobalFoundries. Thus, *M = 3*.

Each model contains 2 regression variables:

- a stock specific intercept
*α*, and - the risk adjusted return for the corresponding market asset such the S&P metals index.

Thus, there are two coefficients to be estimated for each model, i.e. *k_1=k_2=k_3=2*. And *K=(k_1+k_2+k_3) = 6* coefficients to be estimated in all.

Just for the purpose of illustration, we’ll consider a tiny subset of a made-up data set that contains the Risk Adjusted Returns (RAR) for the three stocks over five time points, i.e. , each model is fed a data set of 5 samples. Thus *N = 5*.

Here’s how this tiny slice of the data looks like:

Now let’s look at how the block matrices for this data will look like.

Here’s the block ** y** matrix of size

*[15 x 1]*:

Following is the block matrix of coefficients:

Here’s the block ** X** matrix:

Notice how the individual *X**_m* matrices from the three equations are fitted along the main diagonal while ** 0** matrices are fitted in the off-diagonal places. The individual

*X**_m*matrices are each of size

*[5 x 2]*. Similarly, all

**matrices are of size**

*0**[5 x 2]*since in our sample system, each equation contains 2 regression variables. The block

**matrix is of size**

*X**[15 x 6]*.

And here’s the block matrix of error terms. It has the same size as the block ** y** matrix, namely

*[15 x 1].*The first subscript on each error term refers to the sample number (1/2/3/4/5) in the respective data set while the second subscript refers to the equation number (1/2/3).

Let’s knock together all the block matrices into a single SUR model:

We may want to verify that the above matrix equation is dimensionally correct. ** X** is of size

*[15 x 6]*and

**is of size**

*β**[6 x 1]*. So

**is of size**

*Xβ**[15 x 1]*. Add to that

**of size**

*ϵ**[15 x 1]*and we get

**of size**

*y**[15 x 1]*.

### Is SUR a classical linear model?

From the looks of it, we may be tempted to identify the “block-matrixed” SUR model illustrated above as a **classical linear model** and fit it using straight up OLS. But that would be a grave mistake. And that’s because one of the fundamental assumptions of the classical linear model is that there is no auto-correlation amongst the errors of the model, and we have already seen how the Capital Asset Pricing equation for Ford contains an error term that is correlated with the error terms of the Capital Asset Pricing equations of U.S. Steel and GlobalFoundries.

Since the SUR model in our example is not a classical linear model, it cannot be efficiently estimated using Ordinary Least Squares. So how do we estimate it?

A promising approach happens to be to use the **Generalized Least Squares **(**GLS**) and **Feasible Generalized Least Squares** (**FGLS**) which I have covered in detail in this article (and a tutorial on GLS in this article).

GLS (or FGLS) requires us to specify the covariance matrix of the error terms of the model. So let’s inspect the fine-grained structure of the covariance matrix of regression errors of the SUR model. We’ll use our 3-equation regression system of asset pricing models to guide us through this process.

### The covariance matrix of errors of the SUR model

The covariance matrix of errors is obtained by multiplying the errors matrix ** ϵ **with its transpose

**and dividing the result by**

*ϵ’**N*(number of samples feed into each model). The transpose of

**is simply**

*ϵ***with its rows and columns interchanged. Here’s how the transpose of the block error matrix of the 3-equation SUR model looks like:**

*ϵ*The covariance of ** ϵ**, denoted as

*Cov (*

*ϵ**)*is

*ϵϵ’/**(N).*

Let’s take a closer look at this matrix multiplication. The block error matrix ** ϵ** is of size

*[15 x 1]*. Its transpose

**is of size**

*ϵ’**[1 x 15]*. Hence

**is square matrix of size**

*ϵϵ’**[15 x 15]*. In general case, if

**is of size**

*ϵ**[MN x 1]*, the covariance matrix of errors is a square matrix of size

*[MN x MN]*, and it contains

*(MN)²*covariance terms.

Following is an important property of the covariance matrix of errors to keep in mind:

The diagonal elements of the covariance matrix contain the variance of the individual error terms, while the off-diagonal elements contain the covariance between different error terms.

Here’s how the covariance matrix of errors looks like for the 3-equation SUR model (you may want to zoom into the figure to take in the detail):

The matrix multiplication of ** ϵ **with its transpose has caused the three blocks in the block

**matrix to multiplex into 9 different blocks (shown with thick block borders).**

*ϵ*For any *ϵ**_ti* in the covariance matrix, the first subscript on ** ϵ** represents the data sample number and second subscript represents the model number in the equations system.

Thus, *ϵ**_ti *is the error term for the *t-th *data sample of the *i-th* model in the system. Ditto for *ϵ**_sj. *For e.g. *ϵ**_12* is the error term for the first data sample of the CAPM for U.S. Steel stock which is equation #2 in the 3-equation system.

An element (*ϵ**_ti**)(ϵ**_sj**)**/(5)* in the matrix represents the covariance between the error terms *ϵ**_ti *and *ϵ**_sj.*

Let’s zoom into a portion of the main diagonal of this matrix:

The diagonal elements of Cov(** ϵ**) are

*ϵ²**_ti/5*

**where**

*t=1,2,…,5*for the 5 data samples supplied to each CAPM model, and

*i=1,2,3*for each of the three CAPM models in the system. We make the following observation:

The diagonal elements of the covariance matrix of errors of the SUR model are the variances of the error terms of the individual models within the equations system.

For e.g., *ϵ²**_32*** /5** is the variance of the error term corresponding to data sample #3 supplied into the second CAPM in the system, i.e. the model for the returns on U.S. Steel stock.

The off-diagonal elements of the the covariance matrix also have an interesting covariance structure. Let’s zoom into a single block of off-diagonal elements:

We’ll look at the blue bordered diagonal elements of this block first. Notice how for each (*ϵ**_ti**)(ϵ**_sj*** )** the first subscript on the two

**s is the same. The first subscript indicates the data sample number fed into each equation. For our 3-equation stock pricing model, the first subscript indicates a specific time point at which the trailing 3-month returns of all stocks and market indices were measured.**

*ϵ***Thus, all diagonal elements of this block contain the correlations between the error terms of different models in the system**

*for the same time point*(whatever that time point might be).Incidentally, the 15 elements along the main diagonal of the covariance matrix already satisfy this property as they have the form *(ϵ**_ti)(**ϵ**_ti).*

We’ll make a quick note of this property.

The diagonal elements of all blocks in the covariance matrix of errors in the SUR model contain the correlations between the error terms of the individual models within the equations system,

at the same time point.

Let’s highlight all such diagonally positioned blocks:

Let’s once again inspect the zoomed in block we inspected earlier and note the orange bordered cell*:*

This cell contains the covariance *(ϵ**_11)(**ϵ**_42).* Notice how all four subscripts are different. This is a characteristic of all cells in this block as also all blocks which are not along the main diagonal of the blocked covariance matrix. These covariances are of error terms from different equations in the system and *across different time points*. In time series settings in which the the *‘t’* and *‘s’* subscripts represent time points (as also in cross-sectional settings where they represent individuals), it’s uncommon for different equations in a system to be correlated at two different time points (or individuals). We’ll soon see the advantages resulting from setting all such covariances to zero.

Let’s once again zoom into one of the blocks along the main diagonal of the blocked covariance matrix:

All *off-diagonal* elements in this block contain variances of the type *(ϵ**_ti)(**ϵ**_sj) where i = j = 1*. That is, all these elements contain the covariance of errors from *a particular equation* in the system (in the above case, equation 1: the CAPM for Ford) *across different time points*. Depending on the modeling scenario, such **auto-correlations of errors** within individual equations in a system can be assumed to be zero thereby greatly simplifying the estimation of the SUR model.

## Estimation of the Seemingly Unrelated Regression model

As mentioned earlier, the **Generalized Least Squares** (**GLS**) technique (covered in depth in this article) provides a promising avenue for estimation of the SUR model.

The argument for the GLS technique goes along the following lines:

Given a linear regression model *y** = **Xβ** + *** ϵ**, we wish to estimate the coefficient vector

**.**

*β*But the estimation of** β** using Ordinary Least Squares (OLS) will be

**inefficient**(imprecise) if the errors are heteroskedastic (non-constant variance) and/or correlated.

In the previous section, we have seen that the error terms of the Seemingly Unrelated Regression model form a covariance structure that is both heteroskedastic and correlated.

Therefore, we must use a different kind of estimator for estimating ** β. **One such estimator that takes into account both heteroskedasticity and correlation between errors is the

**Generalized Least Squares**estimator.

The GLS estimator of ** β** is given by the following matrix equation:

Let’s interpret this equation in the context of the SUR model.

In the above equation, *β**_cap* is the estimated value of ** β.** In the context of the SUR model,

**is the block matrix we reviewed earlier. Here’s how it looks like for the 3-equation CAPM system:**

*β*In general, *β**_cap* is of size *[K x 1]*.

In equation (5), ** X**’ is the transpose of the block

**matrix we looked at earlier.**

*X***is of size**

*X**[MN x K]*. Let’s reproduce it below for reference:

** X**’ is of size

*[K x MN]*.

** y** is the block matrix of dependent variables of size

*[MN x 1]*. In our 3-equation CAPM system fed with 5 data samples per equation, it looks like this:

The newly seen matrix in Equation (5) is the ** Ω** (Omega) matrix.

**is the covariance matrix of regression errors. As we have seen,**

*Ω***is a square matrix of size**

*Ω**[MN x MN].*

For the 3-equation SUR model fed with 5 samples each, here is ** Ω** in its full glory:

Equation (5) gives us the following insight:

The matrices ** X**,

*X**’*and

**are all known quantities.**

*y*If

is known, the GLS estimator shown in equation (5) can be used to efficiently estimate all K coefficients in the system of equations contained within the SUR model.Ω

But now we run into a practical difficulty.

*Ω contains the correlations between population-level regression errors which are inherently unobservable to the experimenter.*

We will overcome this difficulty as follows:

- By making two key assumptions which we’ll soon describe, we will simplify the structure of
thereby making*Ω*tractable.*Ω* - We will fit the individual equations in the SUR model using Ordinary Least Squares and we’ll use the residual errors from those fitted models as unbiased estimates of the corresponding regression errors of the SUR model.

### Making *Ω *tractable

*Ω*

Let’s make the following assumption about the fine-grained structure of the covariances of errors in ** Ω**:

In a system of regression equations, the regression errors from individual equations are assumed to not be correlated across different time points (or across different individuals).

Recollect that for any regression error *ϵ**_ti* in the system, the first subscript refers to the data sample number, in other words a specific time point or individual in the system, and the second subscript refers to the equation number in the system.

The assumption we are making is that:

*For any **ϵ**_ti and **ϵ**_sj in **Ω**, **(ϵ**_ti)(**ϵ**_sj) = 0 if t != s.*

The effect of this assumption is to make all such elements in ** Ω** be zero. After applying this assumption to the 3-equation CAPM system, the simplified

**matrix looks like this (for simplicity, imagine its divided by 5 ):**

*Ω*The second assumption we’ll make is one of heteroskedasticity of errors within and across equations. Specifically, we’ll assume the following:

In a system of regression equations, the regression errors for any individual equation (or pair of equations) are homoskedastic i.e. they demonstrate constant variance.

This assumption yields the following result:

*For any **ϵ**_ti and **ϵ**_tj in **Ω**, **(ϵ**_ti)(**ϵ**_tj) = *σ*²_ij*

The effect of this assumption is that all shaded elements in ** Ω** attain a block-specific constant value. After applying this assumption to the 3-equation CAPM system, the simplified

**matrix looks like this:**

*Ω*Notice that in-spite of these two assumptions, the errors of the SUR model aren’t necessarily homoskedastic (constant variance) and uncorrelated. A glance down the main diagonal of the above matrix reveals that σ*²_11, *σ*²_22, and *σ*²_33 *may not all turn out to be of equal value. Several off-diagonal elements σ*²_ij* (where *i != j*) also may not all be zero. In the face of likely heteroskedasticity and correlated errors, the OLS estimator continues to be inappropriate for estimating this model.

Having said that, the ** Ω** matrix in its simplified format can now be stated in the following innovative manner:

In the above representation, ** I** is a

*[5 x 5]*Identity matrix, i.e. a matrix that has ones along its main diagonal and zeros at all other positions.

**is of size**

*I**[5 x 5]*because we have fed each equation in our 3-equation system with 5 data samples.

We can represent ** Ω **even more concisely by pulling out all nine σ

*²*elements into their own

*[3 x 3]*matrix that we’ll call

**(capital Sigma) and performing what’s known as a**

*Σ***Kronecker product**of

**with a**

*Σ**[5 x 5]*Identity matrix:

The Kronecker product of ** Σ** and

**is done as follows:**

*I*- We create a placeholder matrix of size
*[5 x 5]*. - Next, we multiply each element σ
*²_ij*inwith the entire*Σ**[5 x 5]*Identity matrix giving us a*[5 x 5]*diagonal matrix having*σ²_ij*along its main diagonal. - We replace the element at the
*(i, j)th*position of the*[5 x 5]*size placeholder matrix with this*[5 x 5]*size diagonal matrix. - We repeat steps 2 and 3 for each
*i,j=1, 2, 3*to create a matrix of size*[15 x 15]*.

The following figure illustrates how the Kronecker product is built block-by-block:

In general, if a SUR model is constructed over a system of *M* equations, the ** Σ **matrix is of size

*[M x M]*. If each equation in the system is fed

*N*data samples, the

*I*matrix is of size [

*N x N]*. And

**⊗**

*Σ***yields the covariance matrix of errors**

*I***for the entire system of size**

*Ω**[MN x MN]*, which is exactly as desired.

Let’s circle back to GLS.

Our goal is to estimate the coefficient vector ** β** of the SUR model. We saw that we could use

**G**eneralized

**L**east

**S**quares to efficiently estimate

**The GLS estimator mentioned earlier in Eq (5) is reproduced below.**

*β.*To estimate ** β**, we need to know

**,**

*X*

*X**’*,

**, and the inverse of the covariance matrix of errors**

*y*

*Ω.*** X**,

*X**’*,

**are observed quantities. We saw how we can derive a simplified version of**

*y***as the Kronecker product of**

*Ω***and**

*Σ*

*I.*It can be shown that the inverse of ** Ω** is simply the Kronecker product of the inverse of

**and**

*Σ***. Namely, the following:**

*I*Therefore, what remains to be estimated is *Σ.*

### Estimation of the *Σ *matrix

*Σ*

Let’s revisit the *[3 x 3]* ** Σ **matrix for the 3-equation SUR model consisting of stocks of Ford, U.S. Steel, and GlobalFoundries:

The values along the main diagonal of ** Σ **are the variances of the errors for the individual models in the 3-equation system. These can be easily estimated by fitting the individual models using OLS, collecting the residual errors from the 3 models, and calculating the variance of the residual errors.

For e.g., we would fit the CAPM for NYSE:Ford illustrated in Eq (4a) using OLS over the data set of 5 samples (in a real setting, the data set would undoubtedly be a lot larger!).

After fitting this model using OLS, we would calculate the residual errors of the fitted model by subtracting the value estimated by the fitted model from the observed value of the Risk Free Return on Ford. We’ll get 5 such residual errors: *e_11, e_21, e_31, e_41, e_51* corresponding to the five data samples. As before, the first subscript refers to the sample number and the second subscript indicates the equation number.

Next, we would calculate the variance *s²_11* of these five residuals using the standard formula for *sample* variance as follows:

Here, *e_bar_1* is the mean of the 5 residual errors. The mean of residual errors of linear model fitted using OLS is always zero. Hence, *e_bar_1 *can always be assumed to be zero. The demoninator contains a correction of 2 for the degrees of freedom as this model contains two variables: the intercept, and the RAR on the respective market.

Thus, we fit each of the three models (or in general *M* models) in the system using OLS, calculate the respective residual errors from the fitted models, and get their variance *e_ii* using the following general formula:

The sample variances (or covariances) from an OLS estimated linear model happen to be unbiased estimates of the corresponding population variances (or covariances). Hence, we can use *s²_ii *as an estimate for the corresponding diagonal element *σ²_ii *in *Σ.*

Now let’s look at the off-diagonal elements *σ²_ij* in ** Σ**. These are covariances between the regression errors from the

*i-th*and the

*j-th*regression models in the system with the constraint that the co-variances are calculated across the same time points.

Once again, we will appeal to the above result about sample and population covariances and use the following formula for sample covariance to estimate the corresponding population covariance *σ²_ij*:

Using the above techniques, we can construct the following *[3 x 3]* (or in general, an *M x M*) ** S** matrix that will be an unbiased estimate of the

**matrix:**

*Σ*With the ** S** matrix in place, we will proceed with the GLS (more appropriately, Feasible GLS) estimation of

**as follows (The**

*β**Est. (…)*indicates an estimated value):

Since the ‘*cap’* in *β**_cap* already indicates that it is the estimated value of ** β**, strictly speaking, the

*Est (…)*notation is not needed for

*β**_cap.*

Equation (6) is the estimator that we have been seeking for estimating the system of regression equations using the SUR model.

The beauty of this estimator is that once we calculate

Sas an estimate ofΣ, equation (6) lets us do an essentially single-shot estimation of all regression coefficients in the system of regression equations using an estimator that isunbiased, consistent, and efficient. This estimatorallows for non-constant variancesa.k.a. heteroskedasticity in the errors of the system, and ittakes into account the intricate relationships between the different regression equationsin the system without losing its precision.

## Citations and Copyrights

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **An Introduction To 2-Stage Least Squares (2SLS) Estimation

**NEXT: **A Tutorial on Solving A System Of Regression Equations Using Python And Statsmodels