Your model loses precision. We’ll explain why.

In the previous chapter, we saw how leaving out important variables causes the regression model’s coefficients to become biased. In this chapter, we’ll look at the converse of this situation: the damage caused to your regression model by stuffing it with variables that are entirely superfluous.

### What are irrelevant and superfluous variables?

There are several reasons a regression variable can be considered irrelevant or superfluous. Here are some ways to characterize such variables:

- A variable that is **unable to explain any of the variance** in the response variable (*y*) of the model.
- A variable whose **regression coefficient** (*β_m*) **is statistically insignificant** (i.e. zero) at some specified *α* level.
- A variable that is **highly correlated with the rest of the regression variables** in the model. Since the other variables are already included in the model, it is unnecessary to include a variable that is highly correlated with the existing variables.

Adding irrelevant variables to a regression model causes the coefficient estimates to become less precise, thereby causing the overall model to lose precision. In the rest of the chapter, we’ll explain this phenomenon in greater detail.

It can be tempting to stuff your model with many regression variables in the hope of achieving a better fit. After all, one may speculate that if a variable is judged to be irrelevant, the training algorithm (such as Ordinary Least Squares) will simply squeeze its coefficient down to near-zero. Additionally, it can be shown that R-squared for a linear model (or pseudo-R-squared for a nonlinear model) will only increase with every addition of a regression variable to the model.

Unfortunately in such situations, while R-squared (or pseudo-R-squared) keeps going up, the model keeps getting less precise.
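We can watch this monotonic behavior in a small illustrative simulation (synthetic data, numpy only; all variable names and numbers are made up): we fit the model by ordinary least squares, append pure-noise columns one at a time, and observe that R-squared never decreases.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # only x is truly relevant

X = np.column_stack([np.ones(n), x])
r2_values = [r_squared(X, y)]

# Add five pure-noise regressors one at a time and recompute R-squared
for _ in range(5):
    X = np.column_stack([X, rng.normal(size=n)])
    r2_values.append(r_squared(X, y))

# R-squared is weakly monotonically non-decreasing with every added column
print(all(b >= a - 1e-12 for a, b in zip(r2_values, r2_values[1:])))
```

The noise columns explain nothing in the population, yet each one lets the least-squares fit chase a little more in-sample variance, so R-squared can only creep upward.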

We’ll explain the reasons for this progressive fall in precision using the linear regression model as our workbench.

## The classical linear model as our workbench

The classical linear regression model’s equation can be expressed as follows:

*y_i = β_1 + β_2·x_i2 + β_3·x_i3 + … + β_k·x_ik + ϵ_i*

Here’s the matrix form of the above equation:

**y** = **Xβ** + **ϵ** … (1)

In equation (1), **y** is the **dependent variable**, **X** is the matrix of **regression variables**, **β** is the vector of *k* **regression coefficients** *β_1, β_2, β_3, …, β_k* containing the *population level* values of each coefficient, including the regression intercept *β_1*, and **ϵ** is the vector of **error terms**. **ϵ** is the difference between the observed value of *y* and the modeled value of *y*. The error terms **ϵ** of the regression model reflect the portion of the variance in the dependent variable *y* that the regression variables **X** were not able to explain.

We will assume that each one of the *n* error terms *ϵ_i* *[i=1 to n]* in the error terms vector **ϵ** varies around a certain mean value (which is assumed to be zero), and the variance of each error term around its mean averages out to some value *σ²*. Thus the errors are assumed to have a zero mean and a constant variance *σ²*.

If the correct set of regression variables is included in the model, the variables will be able to explain much of the variance in **y**, thereby making the variance *σ²* of the error term very small. On the other hand, if important variables are left out, the portion of the variance in *y* that they would otherwise have been able to explain will leak into the error term, causing the variance *σ²* to be large.

Solving (a.k.a. “fitting” or training) the linear model on a data set of size *n* yields *estimated* values of **β**, which we will denote as *β_cap*. Thus, the fitted linear model’s equation looks like this:

**y** = **X**·*β_cap* + **e** … (2)

In the above equation, **e** is a column vector of **residual errors** (a.k.a. **residuals**). For the *ith* observation, the residual *e_i* is the difference between the observed value *y_i* and the corresponding fitted (predicted) value *y_cap_i*:

*e_i = (y_i − y_cap_i)*
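As a concrete illustration of the fitted model in equation (2), here is a minimal OLS sketch on synthetic data (all names and coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 variables
beta = np.array([2.0, 1.5, -0.7])                           # population coefficients
y = X @ beta + rng.normal(size=n)                           # y = X*beta + epsilon

# Fit: beta_cap = (X'X)^-1 X'y, the OLS estimator
beta_cap = np.linalg.solve(X.T @ X, X.T @ y)

y_cap = X @ beta_cap   # fitted values
e = y - y_cap          # residuals: e_i = y_i - y_cap_i

# A defining property of OLS: the residuals are orthogonal to every column of X
print(np.allclose(X.T @ e, 0))
```

The orthogonality check at the end is exactly the first-order condition that the least-squares solution satisfies.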

Before we proceed further on our quest to find the effects of irrelevant variables on the model, we will state the following important observation:

### Estimated regression coefficients β_cap are random variables with a mean and a variance

Let us understand why this is so: each time we train the model on a different randomly selected data set of size *n*, we get a different set of estimates of the true values of the coefficients **β**. Thus, the vector of estimated coefficients *β_cap = [β_cap_1, β_cap_2, …, β_cap_k]* is a set of random variables having a certain unknown probability distribution. If the training algorithm does not produce biased estimates, the mean (a.k.a. the expectation) of this distribution is the set of true population level values of the coefficients **β**.

Specifically, the **conditional expectation** of the estimated coefficients *β_cap* is their true population value **β**, where the conditioning is on the regression matrix **X**. This can be denoted as follows:

*E(β_cap|X) = β* … (3)

**It can be shown that** the **conditional variance** of *β_cap* can be calculated by the following equation:

*Var(β_cap|X) = σ²·(X’X)⁻¹* … (4)

In the above equation:

- *β_cap* is a column vector of fitted regression coefficients of size *(k x 1)*, i.e. *k* rows and *1* column, assuming there are *k* regression variables in the model, including the intercept and also including any irrelevant variables.
- *X* is a matrix of regression variables of size *(n x k)*, where *n* is the size of the training data set.
- *X’* is the transpose of *X*, i.e. *X* with its rows and columns interchanged. It’s as if *X* has been turned on its side. Hence *X’* is of size *(k x n)*.
- *σ²* is the variance of the error term *ϵ* of the regression model. In practice, we use *s²*, the variance of the residual errors *e* of the fitted model, as an unbiased estimate of *σ²*. *σ²* and *s²* are scalar quantities (and hence not depicted in **bold** font).
- *X’X* is the matrix multiplication of *X* with its transpose. Since *X* is of size *(n x k)* and *X’* is of size *(k x n)*, *X’X* is of size *(k x k)*.
- The superscript *(-1)* indicates that we have taken the inverse of this *(k x k)* matrix, which is another matrix of size *(k x k)*.
- Finally, we have scaled each element of this inverse matrix by the variance *σ²* of the error term *ϵ*.

Equation (4) gives us what is known as the **variance-covariance matrix** of the regression model’s coefficients. As explained above, this is a *(k x k)* matrix whose *(m, n)th* element is *Cov(β_cap_m, β_cap_n|X)*.

The elements that run down the main diagonal, i.e. the one that goes from the top-left to the bottom-right of the variance-covariance matrix, contain the variances of the *estimated values* of the *k* regression coefficients *β_cap = [β_cap_1, β_cap_2, …, β_cap_k]*. Every other element *(m, n)* in this matrix contains the covariance between the estimated coefficients *β_cap_m* and *β_cap_n*.
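Equation (4) takes only a couple of lines of numpy. The sketch below (synthetic data; *s²* stands in for *σ²* as noted above) builds the variance-covariance matrix and reads the standard errors off its main diagonal:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_cap
s2 = e @ e / (n - k)                    # unbiased estimate of sigma^2

vcov = s2 * np.linalg.inv(X.T @ X)      # equation (4): s^2 * (X'X)^-1
std_errors = np.sqrt(np.diag(vcov))     # standard errors of the k coefficients

print(vcov.shape)  # a (k x k) matrix
```

The same matrix is what statistics packages report when you ask a fitted linear model for its coefficient covariance or standard errors.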

The square roots of the main-diagonal elements are the standard errors of the regression coefficient estimates. We know from **interval estimation theory** that the greater the standard error, the lower the precision of the estimate, and the wider the confidence intervals around it.

The greater the variance of the estimated coefficients, the lower the precision of the estimates, and therefore the lower the precision of the predictions generated by the trained model.

It is useful to inspect the two boundary cases that arise out of the above observation:

- *Var(β_cap_m|X) = 0:* In this case, the variance of the coefficient estimate is zero, and therefore the value of the coefficient estimate is equal to the population value of the coefficient *β_m*.
- *Var(β_cap_m|X) = ∞:* In this case, the estimate is infinitely imprecise, and therefore the corresponding regression variable is completely irrelevant.

Let’s examine the *mth* regression variable in the **X** matrix:

This variable can be represented by the column vector *x_m* of size *(n x 1)*. In the fitted model, its regression coefficient is *β_cap_m*.

The variance of this coefficient, i.e. *Var(β_cap_m|X)*, is the *mth* diagonal element of the variance-covariance matrix in equation (4). This variance can be expressed as follows:

*Var(β_cap_m|X) = σ² / [n·Var(x_m)·(1 − R²_m)]* … (5)

In the above equation:

- *σ²* is the variance of the error term of the model. In practice, we estimate *σ²* using the variance *s²* of the residual errors of the fitted model.
- *n* is the number of data samples.
- *R²_m* is the R-squared of a linear regression model in which the dependent variable is the *mth* regression variable *x_m* and the explanatory variables are the rest of the variables in the **X** matrix. Thus, *R²_m* is the R-squared of the regression of *x_m* on the rest of **X**.
- *Var(x_m)* is the variance of *x_m*, given by the usual formula for variance: *Var(x_m) = (1/n)·Σ_i (x_im − x̄_m)²*, where *x̄_m* is the mean of *x_m*.
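As a sanity check, we can verify numerically that equation (5) agrees exactly with the *mth* diagonal element of equation (4). The sketch below uses made-up synthetic data; note that *Var(x_m)* here is the population variance (division by *n*), which is numpy’s default:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.5, size=n)   # deliberately correlated with x2
X = np.column_stack([np.ones(n), x2, x3])
y = 2.0 + 1.5 * x2 - 0.7 * x3 + rng.normal(size=n)

# Fit by OLS; estimate sigma^2 with the residual variance s^2
beta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_cap
s2 = resid @ resid / (n - X.shape[1])

# Left-hand side: the diagonal element of equation (4) for x2
var_direct = (s2 * np.linalg.inv(X.T @ X))[1, 1]

# Right-hand side: equation (5), with R^2_m from regressing x2 on the rest of X
X_rest = np.column_stack([np.ones(n), x3])
g, *_ = np.linalg.lstsq(X_rest, x2, rcond=None)
r2_m = 1 - np.sum((x2 - X_rest @ g) ** 2) / np.sum((x2 - x2.mean()) ** 2)
var_formula = s2 / (n * x2.var() * (1 - r2_m))

print(np.isclose(var_direct, var_formula))  # the two expressions agree
```

This equality is not a coincidence of this data set; it is the standard variance-inflation-factor decomposition of the diagonal of *σ²(X’X)⁻¹*.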

Before we analyze equation (5), let’s recollect that for the *mth* regression variable, the greater the variance of *β_cap_m*, the lower the precision of the estimate, and vice versa.

Now let’s consider the following scenarios:

### Scenario 1

In this scenario, we will assume that the variable *x_m* happens to be highly correlated with the other variables in the model.

In this case, *R²_m*, which is the R-squared obtained from regressing *x_m* on the rest of **X**, will be close to *1.0*. In equation (5), this causes *(1 − R²_m)* in the denominator to be close to zero, thereby making the variance of *β_cap_m* extremely large and the estimate imprecise. Hence, we have the following result:

When you add a variable that is highly correlated with the other regression variables in the model, the coefficient estimate of this variable becomes imprecise. The greater the correlation, the greater the imprecision of the estimated coefficient.

Correlation between regression variables is called **multicollinearity**.

A well known consequence of having multicollinearity among regression variables is loss of precision in the coefficient estimates.
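A quick simulation (illustrative, synthetic data) shows the effect: as the correlation between two regressors approaches 1, the coefficient variance computed from equation (4) blows up.

```python
import numpy as np

def coef_variance(rho, n=1000, seed=3):
    """Variance of the coefficient of x2 when corr(x2, x3) is approximately rho."""
    rng = np.random.default_rng(seed)
    x2 = rng.normal(size=n)
    # x3 is constructed to have correlation ~rho with x2
    x3 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x2, x3])
    y = 1.0 + x2 + x3 + rng.normal(size=n)
    beta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_cap
    s2 = e @ e / (n - X.shape[1])
    return (s2 * np.linalg.inv(X.T @ X))[1, 1]

# Variance of beta_cap_2 grows as the correlation approaches 1
variances = [coef_variance(rho) for rho in (0.0, 0.9, 0.99, 0.999)]
print(all(b > a for a, b in zip(variances, variances[1:])))
```

Each step in correlation shrinks *(1 − R²_m)* by roughly an order of magnitude, so the variance of the coefficient estimate grows by roughly an order of magnitude in turn.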

### Scenario 2

Now consider a second regression variable *x_j* such that *x_m* is highly correlated with *x_j*. Equation (5) can also be used to calculate the variance of the coefficient *β_cap_j* as follows:

*Var(β_cap_j|X) = σ² / [n·Var(x_j)·(1 − R²_j)]*

*R²_j* is the R-squared value of the linear regression of *x_j* on the rest of **X** (including *x_m*). Since *x_m* is assumed to be highly correlated with *x_j*, leaving *x_m* out of the model would reduce *R²_j* by a significant amount; *(1 − R²_j)* in the denominator of the above equation would correspondingly increase, leading to a reduction in the variance of *β_cap_j*. Unfortunately, the converse of this finding is also true! Including the highly correlated variable *x_m* increases the variance (i.e. reduces the precision) of *β_cap_j*. This suggests another important consequence of including a highly correlated variable such as *x_m*:

When you add a variable that is highly correlated with other regression variables in the model, it reduces the precision of the coefficient estimates of all the regression variables in the model.

### Scenario 3

Consider a third scenario. Irrespective of whether or not *x_m* is particularly correlated with any other variable in the model, the very presence of *x_m* in the model will cause *R²_j*, the R-squared of the model in which we regress *x_j* on the rest of **X**, to be larger than when *x_m* is not included in the model. This behavior arises from the formula for R-squared. From equation (5), we know that when *R²_j* increases, the denominator of equation (5) becomes smaller, causing the variance of *β_cap_j* to increase. This loss of precision in *β_cap_j* is especially pronounced if *x_m* is also unable to explain any of the variance in the dependent variable **y**. In that case, the addition of *x_m* to the model does not reduce the variance *σ²* of the error term *ϵ* of the model. Recollect that the error term contains the portion of the variance in *y* that **X** is unable to explain. Thus, when *x_m* is an irrelevant variable, its addition to the model only decreases the denominator of equation (5) without causing a compensatory reduction in the numerator, thereby causing *Var(β_cap_j|X)* for every *j* in **X** to be larger. And thus, we have another important result:

The addition of irrelevant variables to a regression model makes the coefficient estimates of all the regression variables less precise.
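This result can be demonstrated directly. In the sketch below (synthetic data), we hold the error variance fixed at its true value so as to isolate the *(X’X)⁻¹* effect: stuffing the model with pure-noise columns can only increase the diagonal entry belonging to a genuinely relevant variable.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x2 = rng.normal(size=n)
X_small = np.column_stack([np.ones(n), x2])      # intercept + one relevant variable

# Stuff the model with 20 irrelevant (pure-noise) regressors
noise = rng.normal(size=(n, 20))
X_big = np.column_stack([X_small, noise])

sigma2 = 1.0  # treat the error variance as known, to isolate the (X'X)^-1 effect

var_small = sigma2 * np.linalg.inv(X_small.T @ X_small)[1, 1]
var_big = sigma2 * np.linalg.inv(X_big.T @ X_big)[1, 1]

print(var_big >= var_small)  # the relevant coefficient got less precise
```

The inequality holds by construction: adding columns to **X** can never increase the part of *x_2*’s variation that is orthogonal to the other regressors, which is exactly the denominator in equation (5).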

Finally, let’s review two more things that Equation (5) brings to light:

The *n* in the denominator is the size of the data set. The larger the data set over which the model is trained, the smaller the variance of the coefficient estimates, and therefore the greater their precision. This seems intuitive. The limiting case is when the model is trained on the entire population.

The precision of the estimated regression coefficients improves with the increase in the size of the training data set.
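A brief illustrative check of this (synthetic data, made-up coefficients): the estimated variance of a slope coefficient shrinks roughly in proportion to *1/n* as the training set grows.

```python
import numpy as np

def slope_variance(n, seed=5):
    """Equation (4) variance of the slope coefficient for a data set of size n."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    beta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_cap
    s2 = e @ e / (n - 2)
    return (s2 * np.linalg.inv(X.T @ X))[1, 1]

# 100x more data -> roughly 100x smaller coefficient variance
print(slope_variance(100) > slope_variance(10000))
```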

Furthermore, we see that the greater the variance of a regression variable such as *x_m*, the smaller the variance in the estimated value of its regression coefficient. This may not seem entirely intuitive at first reading. We can understand the effect by noting that variables that show little to no variability are unable to explain the variability in the dependent variable **y**, and vice versa. For such largely ‘rigid’ variables, the training algorithm is unable to properly estimate their contribution (as quantified by their regression coefficient) to the variability in the model’s output.

**PREVIOUS:** The Consequences Of Omitting Important Variables From A Linear Regression Model

**NEXT:** Introduction to Heteroskedasticity