Your model loses precision. We’ll explain why.
In the previous chapter, we saw how leaving out important variables causes the regression model’s coefficients to become biased. In this chapter, we’ll look at the converse of this situation, namely the damage caused to your regression model by stuffing it with variables that are entirely superfluous.
What are irrelevant and superfluous variables?
There are several reasons a regression variable can be considered irrelevant or superfluous. Here are some ways to characterize such variables:
- A variable that is unable to explain any of the variance in the response variable (y) of the model.
- A variable whose regression coefficient (β_m) is statistically insignificant at some specified α level, i.e. a coefficient that cannot be distinguished from zero.
- A variable that is highly correlated with the rest of the regression variables in the model. Since the other variables are already included in the model, it is unnecessary to include a variable that is highly correlated with the existing variables.
Adding irrelevant variables to a regression model causes the coefficient estimates to become less precise, thereby causing the overall model to lose precision. In the rest of the chapter, we’ll explain this phenomenon in greater detail.
It can be tempting to stuff your model with many regression variables in the hope of achieving a better fit. After all, one may speculate that if a variable is judged to be irrelevant, the training algorithm (such as Ordinary Least Squares) will simply squeeze its coefficient down to near-zero. Additionally, it can be shown that R-squared for a linear model (or pseudo-R-squared for a nonlinear model) never decreases with the addition of a regression variable to the model.
Unfortunately in such situations, while R-squared (or pseudo-R-squared) keeps going up, the model keeps getting less precise.
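The claim that R-squared never decreases as variables are added is easy to check with a quick simulation. The sketch below (plain NumPy, hypothetical data) fits OLS on a single relevant regressor, then keeps appending columns of pure noise and records R-squared at each step:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# One genuinely relevant regressor, plus noise in y
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (intercept prepended)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

X = x.reshape(-1, 1)
r2_values = [r_squared(X, y)]

# Append five purely random (irrelevant) columns, one at a time
for _ in range(5):
    X = np.column_stack([X, rng.normal(size=n)])
    r2_values.append(r_squared(X, y))

# R-squared never decreases as the junk columns are added
print(r2_values)
```

Each junk column nudges R-squared upward (or leaves it unchanged) even though the added variables carry no information about y.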
We’ll explain the reasons for this progressive fall in precision using the linear regression model as our workbench.
The classical linear model as our workbench
The classical linear regression model’s equation can be expressed as follows:

y_i = β_1 + β_2·x_2i + β_3·x_3i + … + β_k·x_ki + ϵ_i

Here’s the matrix form of the above equation:

y = Xβ + ϵ   …(1)
In equation (1), y is the dependent variable, X is the matrix of regression variables, β is the vector of k regression coefficients β_1, β_2, β_3, …, β_k containing the population-level value of each coefficient (with β_1 being the intercept of regression), and ϵ is the vector of error terms. Each error term is the difference between the observed value of y and the modeled value of y. The error terms ϵ of the regression model reflect the portion of the variance in the dependent variable y that the regression variables X were not able to explain.
We will assume that each of the n error terms ϵ_i [i=1 to n] in the error terms vector ϵ varies around a mean of zero, and that the variance of each error term around its mean is the same value σ². Thus the errors are assumed to have a zero mean and a constant variance σ².
If the correct set of regression variables are included in the model, they would be able to explain much of the variance in y thereby making the variance σ² of the error term very small. On the other hand, if important variables are left out, the portion of the variance in y that they would have otherwise been able to explain would now leak into the error term causing the variance σ² to be large.
Solving (a.k.a. “fitting” or training) the linear model on a data set of size n yields estimated values of β, which we will denote as β_cap. Thus, the fitted linear model’s equation looks like this:

y = Xβ_cap + e   …(2)
In the above equation, e is a column vector of residual errors (a.k.a. residuals). For the ith observation, the residual e_i is the difference between the observed value y_i and the corresponding fitted (predicted) value y_cap_i: e_i = (y_i − y_cap_i)
Before we proceed further on our quest to find the effects of irrelevant variables on the model, we will state the following important observation:
Estimated regression coefficients β_cap are random variables with a mean and a variance
Let us understand why this is so: Each time we train the model on a different randomly selected data set of size n, we will get a different set of estimates of the true values of coefficients β. Thus, the vector of estimated coefficients β_cap=[β_cap_1, β_cap_2, …, β_cap_k] are a set of random variables having a certain unknown probability distribution. If the training algorithm does not produce biased estimates, the mean (a.k.a. the expectation) of this distribution is the set of true population level values of the coefficients β.
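This “refit on a fresh sample, get a fresh estimate” experiment can be simulated directly. The sketch below (assuming a simple one-regressor model with hypothetical true coefficients) refits OLS on many independently drawn data sets and inspects the spread of the estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([1.0, 3.0])  # hypothetical population intercept and slope

def fit_once(n=100):
    """Draw a fresh data set of size n and return the OLS coefficient estimates."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta_true + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Refit the model on 2000 independently drawn data sets
estimates = np.array([fit_once() for _ in range(2000)])

# beta_cap scatters around beta: it is a random variable with a mean and a variance
print(estimates.mean(axis=0))  # close to [1.0, 3.0]
print(estimates.std(axis=0))   # a clearly nonzero spread
```

The average of the estimates sits close to the true β (OLS is unbiased here), while the nonzero standard deviation is exactly the sampling variance the rest of this chapter is concerned with.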
Specifically, the conditional expectation of the estimated coefficients β_cap is their true population value β, where the conditioning is on the regression matrix X. This can be denoted as follows:

E(β_cap|X) = β   …(3)

The variance of β_cap, conditioned on X, is given by:

Var(β_cap|X) = σ²(X’X)^(-1)   …(4)
In the above equation:
- β_cap is a column vector of fitted regression coefficients of size (k x 1), i.e. k rows and 1 column, assuming there are k regression variables in the model including the intercept and also including any irrelevant variables.
- X is a matrix of regression variables of size (n x k) where n is the size of the training data set.
- X’ is the transpose of X, i.e. X with its rows and columns interchanged. It’s as if X has been turned on its side. Hence X’ is of size (k x n).
- σ² is the variance of the error term ϵ of the regression model. In practice, we use s² which is the variance of the residual errors e of the fitted model as an unbiased estimate of σ². σ² and s² are scalar quantities (and hence not depicted in bold font).
- X’X is the matrix product of the transpose X’ with X. Since X’ is of size (k x n) and X is of size (n x k), X’X is of size (k x k).
- The superscript of (-1) indicates that we have taken the inverse of this (k x k) matrix which is another matrix of size (k x k).
- Finally, we have scaled each element of this inverse matrix with the variance σ² of the error term ϵ.
Equation (4) gives us what is known as the variance-covariance matrix of the regression model’s coefficients. As explained above, this is a (k x k) matrix that looks like this:
The elements that run down the main diagonal (the one that goes from the top-left to the bottom-right of the variance-covariance matrix) contain the variances of the estimated values of the k regression coefficients β_cap=[β_cap_1, β_cap_2,…,β_cap_k]. Every off-diagonal element (m, j) of this matrix contains the covariance between the estimated coefficients β_cap_m and β_cap_j.
The square roots of the main-diagonal elements are the standard errors of the regression coefficient estimates. We know from interval estimation theory that the greater the standard error, the lower the precision of the estimate and the wider the confidence interval around it.
The greater the variance of the estimated coefficients, the lower the precision of the estimates. And therefore, the lower the precision of the predictions generated by the trained model.
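Equation (4) can be computed directly from a fitted model. The following sketch (plain NumPy, hypothetical data) estimates σ² with s² = e’e/(n − k) and forms the (k x k) variance-covariance matrix and the standard errors:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 500, 3  # k regression variables, including the intercept

# Hypothetical design matrix: intercept plus two regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([2.0, 1.0, -0.5])
y = X @ beta + rng.normal(size=n)

# Fit by OLS and form the residuals e
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# s^2 = e'e / (n - k) is the unbiased estimate of sigma^2
s2 = e @ e / (n - k)

# The (k x k) variance-covariance matrix: s^2 (X'X)^{-1}
cov_beta = s2 * np.linalg.inv(X.T @ X)

# Square roots of the main diagonal: the standard errors of the estimates
std_errors = np.sqrt(np.diag(cov_beta))
print(std_errors)
```

A regression library would report these same standard errors alongside each fitted coefficient; computing them by hand makes the dependence on s² and (X’X)^(-1) explicit.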
It is useful to inspect the two boundary cases that arise out of the above observation:
Var(β_cap_m|X) = 0 : In this case, the variance of the coefficient estimate is zero, and therefore (given the unbiasedness assumed earlier) the coefficient estimate is always equal to the population value of the coefficient β_m.
Var(β_cap_m|X) = ∞ : In this case, the estimate is infinitely imprecise and therefore the corresponding regression variable is completely irrelevant.
Let’s examine the mth regression variable in the X matrix:
This variable can be represented by the column vector x_m of size (n x 1). In the fitted model, its regression coefficient is β_cap_m.
The variance of this coefficient, i.e. Var(β_cap_m|X), is the mth diagonal element of the variance-covariance matrix in equation (4). This variance can be expressed as follows:

Var(β_cap_m|X) = σ² / (n·Var(x_m)·(1 − R²_m))   …(5)
In the above equation,
- σ² is the variance of the error term of the model. In practice, we estimate σ² using the variance s² of the residual errors of the fitted model.
- n is the number of data samples.
- R²_m is the R-squared of a linear regression model in which the dependent variable is the mth regression variable x_m and the explanatory variables are the rest of the variables in the X matrix. Thus, R²_m is the R-squared of the regression of x_m on the rest of X.
- Var(x_m) is the variance of x_m, given by the usual formula: Var(x_m) = (1/n)·Σ(x_mi − x̄_m)², where x̄_m is the mean of x_m.
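As a sanity check, the diagonal element of σ²(X’X)^(-1) for a given variable should match equation (5) exactly. The sketch below (hypothetical data, error variance treated as known) computes Var(β_cap_1|X) both ways:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
sigma2 = 1.0  # error variance treated as known, for illustration

# Intercept plus two mildly correlated regressors (hypothetical data)
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

def r_squared(X, y):
    """R-squared of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# R^2_1: regress x1 on the rest of X (the intercept and x2)
r2_1 = r_squared(np.column_stack([np.ones(n), x2]), x1)

# Equation (5): sigma^2 / (n * Var(x1) * (1 - R^2_1)); Var uses the 1/n form
var_formula = sigma2 / (n * x1.var() * (1.0 - r2_1))

# The same quantity read off the diagonal of sigma^2 (X'X)^{-1}
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

print(var_formula, var_matrix)  # the two values agree
```

The agreement is exact (up to floating-point error), since equation (5) is just an unpacking of the corresponding diagonal element of equation (4).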
Before we analyze equation (5), let’s recollect that for the mth regression variable, the greater the variance of β_cap_m, the lower the precision of the estimate, and vice versa.
Now let’s consider the following scenarios:
In this scenario, we will assume that variable x_m happens to be highly correlated to the other variables in the model.
In this case, R²_m, which is the R-squared obtained from regressing x_m on the rest of X, will be close to 1.0. In equation (5), this would cause (1 − R²_m) in the denominator to be close to zero, thereby causing the variance of β_cap_m to be extremely large and the estimate imprecise. Hence, we have the following result:
When you add a variable that is highly correlated to other regression variables in the model, the coefficient estimate of this highly correlated variable in the trained model becomes imprecise. The greater the correlation, the greater the imprecision in the estimated coefficient.
Correlation between regression variables is called multicollinearity.
A well known consequence of having multicollinearity among regression variables is loss of precision in the coefficient estimates.
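The effect of multicollinearity on precision is easy to demonstrate numerically. The sketch below (hypothetical data) constructs a second regressor with a chosen correlation ρ to the first and reports the standard error of β_cap_1 as ρ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

def slope_std_error(rho):
    """Standard error of beta_cap_1 when x2 is built to have correlation ~rho with x1."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    s2 = e @ e / (n - X.shape[1])
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

# The standard error of beta_cap_1 blows up as the correlation approaches 1
for rho in (0.0, 0.9, 0.99, 0.999):
    print(rho, slope_std_error(rho))
```

At ρ = 0 the standard error is modest; as ρ approaches 1, the (1 − R²_m) term in equation (5) collapses toward zero and the standard error grows by more than an order of magnitude.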
Now consider a second regression variable x_j such that x_m is highly correlated with x_j. Equation (5) can also be used to calculate the variance of β_cap_j, the coefficient of x_j, as follows:

Var(β_cap_j|X) = σ² / (n·Var(x_j)·(1 − R²_j))

R²_j is the R-squared value of the linear regression of x_j on the rest of X (including x_m). Since x_m is assumed to be highly correlated with x_j, leaving x_m out of the model would reduce R²_j by a significant amount; (1 − R²_j) in the denominator of the above equation would correspondingly increase, leading to a reduction in the variance of β_cap_j. Unfortunately, the converse of this finding is also true: including the highly correlated variable x_m will increase the variance (i.e. reduce the precision) of β_cap_j. This suggests another important consequence of including a highly correlated variable such as x_m:
When you add a variable that is highly correlated to other regression variables in the model, it reduces the precision of the coefficient estimates of all regression variables in the model.
Consider a third scenario. Irrespective of whether or not x_m is particularly correlated with any other variable in the model, the very presence of x_m in the model will cause R²_j, the R-squared of the regression of x_j on the rest of X, to be larger than when x_m is not included in the model. This behavior arises from the formula for R-squared. From equation (5), we know that when R²_j increases, the denominator of equation (5) becomes smaller, causing the variance of β_cap_j to increase. This loss of precision in β_cap_j is especially pronounced if x_m is also unable to explain any of the variance in the dependent variable y. In that case, the addition of x_m to the model does not reduce the variance σ² of the error term ϵ. Recollect that the error term contains the portion of the variance in y that X is unable to explain. Thus, when x_m is an irrelevant variable, its addition to the model only decreases the denominator of equation (5) without causing a compensatory reduction in the numerator, thereby causing Var(β_cap_j|X) to be larger for all j in X. And thus, we have another important result:
Addition of irrelevant variables to a regression model makes the coefficient estimates of all regression variables less precise.
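This result has a purely algebraic core: for a fixed σ², appending any column to X can only increase (never decrease) the diagonal elements of σ²(X’X)^(-1) belonging to the original variables. A minimal sketch, assuming the error variance is held fixed:

```python
import numpy as np

rng = np.random.default_rng(9)
n, sigma2 = 500, 1.0  # error variance held fixed for the comparison

# Original design matrix: intercept plus two regressors (hypothetical data)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_small = np.column_stack([np.ones(n), x1, x2])

# An extra, irrelevant column of pure noise
junk = rng.normal(size=n)
X_big = np.column_stack([X_small, junk])

def coef_variances(X):
    """Diagonal of sigma^2 (X'X)^{-1}: the conditional coefficient variances."""
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

var_small = coef_variances(X_small)
var_big = coef_variances(X_big)[:3]  # variances of the original three coefficients

# No original coefficient gets more precise after the junk column is added
assert np.all(var_big >= var_small - 1e-12)
```

In practice σ² must be estimated by s², which can fluctuate slightly from fit to fit, but for an irrelevant variable it stays roughly constant, so the inflation of (X’X)^(-1) dominates.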
Finally, let’s review two more things that Equation (5) brings to light:
The n in the denominator is the data set size. We see that the greater the size of the data set on which the model is trained, the smaller the variance of the coefficient estimates, and therefore the greater their precision. This seems intuitive. The limiting case is when the model is trained on the entire population.
The precision of the estimated regression coefficients improves with the increase in the size of the training data set.
Furthermore, we see that the greater the variance of a regression variable such as x_m, the smaller the variance of its estimated regression coefficient. This may not seem entirely intuitive at first reading. We can understand this effect by noting that variables that show little to no variability are unable to explain the variability in the dependent variable y, and vice versa. For such largely ‘rigid’ variables, the training algorithm is unable to precisely estimate their contribution (as quantified by their regression coefficient) to the variability in the model’s output.
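The role of Var(x_m) in equation (5) can also be demonstrated: shrinking a regressor’s spread while keeping everything else fixed inflates the standard error of its coefficient. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400

def std_error_for_scale(scale):
    """Standard error of the slope when the regressor has the given spread."""
    x = rng.normal(scale=scale, size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    s2 = e @ e / (n - 2)
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

# A nearly 'rigid' regressor yields a far less precise coefficient estimate
print(std_error_for_scale(1.0))   # on the order of 0.05
print(std_error_for_scale(0.01))  # on the order of 5
```

Cutting the regressor’s standard deviation by a factor of 100 inflates the standard error of its coefficient by roughly the same factor, exactly as the Var(x_m) term in the denominator of equation (5) predicts.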