What happens when regression variables or the dependent variable contain measurement errors
Measurement errors can seep into an experiment for a variety of reasons: the measuring instrument itself, the design of the experiment, or erroneous responses submitted by survey respondents. One may argue that, more often than not, a variable will be measured with some error. It’s therefore important to know how such measurement errors affect the regression model and how to mitigate that effect.
In this chapter, we’ll set out to do just that using a linear regression model as our workbench.
There are two cases of measurement errors to consider:
- When the error is in the response variable (y) of the regression model.
- When one or more explanatory variables x_i are measured with an error.
Measurement errors in the response variable
Consider the following linear model:
y* = β_1*1 + β_2*x_2 + β_3*x_3 + ϵ    (1)
In the above equation, y*, 1, x_2, x_3, and ϵ are column vectors of size [n x 1] assuming that there are n rows in the data set. The vector 1 is simply a vector of 1s. The multiplication symbol (*) is explicitly shown where needed but it can just as well be dropped for brevity. The * above the y is not a multiplication sign. We’ll explain what it is shortly. The error term ϵ is assumed to be normally distributed with a zero mean and variance σ²_ϵ, i.e. ϵ ~ N(0, σ²_ϵ).
For the ith row in the data set, the above model can be restated as follows:
y*_i = β_1 + β_2*x_2i + β_3*x_3i + ϵ_i    (1a)
For convenience, we’ll continue with Eq (1) instead of (1a). Let’s assume that x_2 and x_3 are exogenous i.e. not correlated with ϵ, so the model in Eq (1) (or 1a) can be estimated consistently using Ordinary Least Squares (OLS) and the OLS estimator will yield unbiased estimates of all regression coefficients.
Let’s suppose that y* is the true (exact) value of the dependent variable but the value we have measured (i.e. observed) is y, and y contains an error ν which we assume to be additive. We can express the relation of y* with y as follows:
y = y* + ν, or equivalently, y* = y - ν    (2)
We’ll assume that ν is normally distributed with a mean of zero and variance σ²_ν. Notation-wise: ν ~ N(0, σ²_ν). As we’ll see, this assumption will help in the analysis.
Substituting Eq (2) in Eq (1) and rearranging the terms yields the following:
y = β_1*1 + β_2*x_2 + β_3*x_3 + (ϵ + ν)    (3)
In the above model, provided the measurement error ν is itself uncorrelated with x_2 and x_3, the variables x_2 and x_3 remain exogenous, i.e. not correlated with the composite error term (ϵ + ν). Therefore, just as with the earlier model, the model in Eq (3) can be estimated consistently using Ordinary Least Squares (OLS), and OLS will yield unbiased estimates of all regression coefficients.
Let’s examine the mean and variance characteristics of the random variable (ϵ + ν). We’ll use the following two identities for the mean and variance of a linear combination of two independent random variables X_1 and X_2 with means μ_1 and μ_2 and variances σ²_1 and σ²_2 respectively, where a and b are constants:
E(a*X_1 + b*X_2) = a*μ_1 + b*μ_2
Var(a*X_1 + b*X_2) = a²*σ²_1 + b²*σ²_2
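Recall that the errors ϵ and ν are zero-centered and normally distributed, i.e. ϵ ~ N(0, σ²_ϵ) and ν ~ N(0, σ²_ν), and that we take them to be independent of each other.
Applying the above two identities, we get the mean and variance of the composite error (ϵ + ν) as zero and (σ²_ϵ + σ²_ν) respectively, as follows:
E(ϵ + ν) = E(ϵ) + E(ν) = 0 + 0 = 0
Var(ϵ + ν) = Var(ϵ) + Var(ν) = σ²_ϵ + σ²_ν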
From this, we can see that while the mean of the composite error is still zero, its variance is larger than that of the error term of the original model.
The larger variance of the error term means that the predictions of the model containing the imprecisely measured y are less precise. They come with wider prediction intervals than those of the model with the exact y*.
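The argument above can be checked with a small simulation. The following is a minimal sketch, not a definitive implementation: the coefficient values, the standard deviations, the sample size and the helper name fit_ols are all illustrative assumptions that do not come from the chapter. It fits the same model once on the exact y* and once on the noisy y, and compares the coefficients and the estimated error variance.

```python
import numpy as np


def fit_ols(X, y):
    """Return the OLS coefficients and the estimated error variance."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    dof = X.shape[0] - X.shape[1]          # n minus number of coefficients
    return beta_hat, residuals @ residuals / dof


rng = np.random.default_rng(0)
n = 100_000
beta = np.array([1.0, 2.0, -3.0])          # assumed beta_1, beta_2, beta_3
sigma_eps, sigma_nu = 1.0, 2.0             # assumed std. devs of eps and nu

x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x2, x3])

y_star = X @ beta + rng.normal(scale=sigma_eps, size=n)   # exact response, Eq (1)
y_obs = y_star + rng.normal(scale=sigma_nu, size=n)       # observed response, Eq (2)

beta_exact, var_exact = fit_ols(X, y_star)
beta_noisy, var_noisy = fit_ols(X, y_obs)

print("coefficients, exact y*:", beta_exact)
print("coefficients, noisy y :", beta_noisy)
print("error variance, exact y*:", var_exact)
print("error variance, noisy y :", var_noisy)
```

Under these assumptions, both fits should recover coefficients close to the assumed ones, while the error variance estimated from the noisy y should be close to σ²_ϵ + σ²_ν rather than σ²_ϵ.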
Now let’s visit the case where one of the regression variables contains an error term.
Measurement errors in a regression variable
Consider the following linear model:
y = β_1*1 + β_2*x_2 + β_3*x*_3 + ϵ    (4)
As before, y, 1, x_2, x*_3, and ϵ are column vectors of size [n x 1].
This time, we’ll assume that y is measured exactly but x*_3 contains a measurement error. Specifically, x*_3 represents the exact value of the variable but the observed value is x_3. Also as before, we’ll assume that the error in the observed value is zero-centered, normally distributed and additive. Thus, we can say the following about x*_3:
x*_3 = x_3 - δ, or equivalently, x_3 = x*_3 + δ    (5)
δ is the measurement error. We assume it to be normally distributed around a zero mean, i.e. δ ~N(0, σ²_δ).
To estimate the effect of this measurement error on the model’s operating characteristics, we’ll follow the same investigatory approach as with y. Let’s substitute Eq (5) into the regression model given by Eq (4) and rearrange the terms a bit as follows:
y = β_1*1 + β_2*x_2 + β_3*x_3 + (ϵ - β_3*δ)    (6)
As with the model in Eq (3), this model too contains a composite error term (ϵ - β_3*δ). We shouldn’t be fooled by the negative sign in the error into thinking that the composite error is smaller than the error term ϵ of the original model. It isn’t necessarily smaller, because its value depends on the signs of ϵ, β_3 and δ. In fact, as we’ll see in a bit, in absolute terms, it may be larger than the error of the original model.
The error (ϵ - β_3*δ) is a linear combination of the presumably independent random variables ϵ and δ. We have also assumed that δ and ϵ are each normally distributed around a zero mean. Hence, we can use the same type of analysis as before to calculate the mean and variance of the composite error (ϵ - β_3*δ). Specifically, we can show that the composite error term (ϵ - β_3*δ) is also normally distributed around a zero mean and it has a variance of (σ²_ϵ + β²_3*σ²_δ).
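For readers who want the intermediate step, the mean and variance work out as follows. This short derivation assumes, as the text does, that ϵ and δ are independent, so that their covariance is zero; normality follows because a linear combination of independent normal variables is itself normal.

```latex
\begin{aligned}
E(\epsilon - \beta_3 \delta) &= E(\epsilon) - \beta_3 E(\delta) = 0 - \beta_3 \cdot 0 = 0 \\
\operatorname{Var}(\epsilon - \beta_3 \delta)
  &= \operatorname{Var}(\epsilon) + \beta_3^2 \operatorname{Var}(\delta)
     - 2 \beta_3 \operatorname{Cov}(\epsilon, \delta) \\
  &= \sigma_\epsilon^2 + \beta_3^2 \sigma_\delta^2
\end{aligned}
```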
The effects of measurement error in the regression variables on the model
The variance (σ²_ϵ + β²_3*σ²_δ) of the composite error of the model in Eq (6) suggests the following effects on the resulting model:
- Since all quantities in (σ²_ϵ + β²_3*σ²_δ) are non-negative, the variance of the composite error is at least as large as (and in practice usually larger than) the variance σ²_ϵ of the original model’s error term. That makes the prediction intervals of the model containing the measurement errors in the regression terms, i.e. the model in Eq (6), wider than those of the exact model. This result is similar to what we saw earlier with the model that contains a measurement error in the response variable y.
- The added component of the error variance, β²_3*σ²_δ, grows with the square of β_3. The magnitude of β_3 measures how strongly y responds to x*_3, the variable whose observed value x_3 contains the measurement error. This finding seems intuitive if one looks at it this way: the model’s precision suffers more seriously if highly relevant regression variables contain measurement errors than if barely relevant variables contain measurement errors.
- There is a flip side to the previous observation. Even a weakly relevant variable (one with a small but nonzero β_3), when it is measured erroneously, will cause some loss of precision in the resulting model. Hence, if you are doubtful about the relevance of a variable, and it is also likely to be difficult to measure precisely, you may be doing your regression model a favor by simply leaving it out.
- Finally, the added component of the error variance is proportional to the variance σ²_δ of the measurement error in x_3. The less precisely x_3 is measured, the larger the error variance in the resulting model. This, too, is intuitive.
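The observations in the list above can be checked numerically. The sketch below simulates the composite error for a few assumed combinations of β_3 and σ_δ and compares its sample variance with σ²_ϵ + β²_3*σ²_δ. The function name composite_error_variance and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np


def composite_error_variance(beta3, sigma_eps, sigma_delta, n=200_000, seed=0):
    """Sample variance of (eps - beta3 * delta) for independent normal eps, delta."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma_eps, size=n)
    delta = rng.normal(scale=sigma_delta, size=n)
    return np.var(eps - beta3 * delta)


sigma_eps = 1.0
# (beta_3, sigma_delta): weakly relevant variable, highly relevant variable,
# and highly relevant variable that is also measured very imprecisely.
for beta3, sigma_delta in [(0.1, 1.0), (2.0, 1.0), (2.0, 3.0)]:
    simulated = composite_error_variance(beta3, sigma_eps, sigma_delta)
    theoretical = sigma_eps**2 + beta3**2 * sigma_delta**2
    print(f"beta_3={beta3}, sigma_delta={sigma_delta}: "
          f"simulated={simulated:.3f}, theoretical={theoretical:.3f}")
```

The simulated and theoretical values should agree closely, with the variance growing as either the magnitude of β_3 or σ_δ increases.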
Correlation between the regression variable and the measurement error
Consider once again the following relationship between the theoretically exact value x*_3, its imperfectly observed value x_3, and the measurement error δ:
x*_3 = x_3 - δ    (5)
There are two interesting scenarios to consider:
- The measurement error δ is not correlated with the observed value x_3.
- The measurement error δ is correlated with the observed value x_3.
Recall the model in Eq (6):
y = β_1*1 + β_2*x_2 + β_3*x_3 + (ϵ - β_3*δ)    (6)
We got this model by replacing x*_3 with (x_3 - δ) in the original model. In the above model, x_2 and x_3 are uncorrelated with the error term ϵ of the original model.
Let’s examine the first one of the above two scenarios.
x_3 is not correlated with δ.
Since x_3 is also uncorrelated with the error term ϵ, it follows that x_3 is uncorrelated with the composite error term (ϵ - β_3*δ). Thus, in Eq (6), x_3 continues to remain exogenous. There is usually no compelling reason to believe that the other regression variables in the model (in this case, x_2) would be correlated with δ, the measurement error in x_3. Hence all regression variables in model (6) are exogenous. Moreover, we showed earlier that the composite error term of the model in Eq (6) is zero-centered. The model in Eq (6) can therefore be consistently estimated using OLS with no bias in the estimated coefficients.
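This first scenario can be illustrated with a short simulation. It is a sketch under assumed values: the coefficients, the error scales and the variable names are all hypothetical. To make δ uncorrelated with the observed x_3, the code draws x_3 and δ independently and then constructs the exact value from Eq (5).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta = np.array([1.0, 2.0, -3.0])           # assumed beta_1, beta_2, beta_3

x2 = rng.normal(size=n)
x3_obs = rng.normal(size=n)                  # observed value x_3
delta = rng.normal(scale=0.8, size=n)        # error, drawn independently of x3_obs
x3_true = x3_obs - delta                     # exact value x*_3, as in Eq (5)
eps = rng.normal(size=n)

# The response is generated from the exact value, as in Eq (4).
y = beta[0] + beta[1] * x2 + beta[2] * x3_true + eps

# The model is estimated with the observed value, as in Eq (6).
X_obs = np.column_stack([np.ones(n), x2, x3_obs])
beta_hat, *_ = np.linalg.lstsq(X_obs, y, rcond=None)

print("sample corr(x3_obs, delta):", np.corrcoef(x3_obs, delta)[0, 1])
print("true coefficients     :", beta)
print("estimated coefficients:", beta_hat)
```

With δ uncorrelated with the observed x_3, the estimated coefficients should land close to the assumed ones; only the precision suffers.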
Now consider the second scenario.
x_3 is correlated with δ
Since δ is part of the error term in this model, it makes x_3 endogenous, i.e. x_3 is correlated with the composite error term of the model. This model can no longer be consistently estimated using OLS.
This second scenario consisting of an endogenous x_3 yields another unfortunate dividend. Since the error δ that x_3 is correlated with is hiding inside the composite error term, δ effectively plays the role of an omitted regression variable, leading to omitted variable bias, i.e. any attempt to estimate the model using OLS will cause the estimated coefficients to be systematically biased away from their true values.
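Under the classical errors-in-variables assumption, namely that δ is uncorrelated with the true value x*_3 (which is exactly what makes δ correlated with the observed value x_3 = x*_3 + δ), it can also be shown that the bias in the estimated coefficient of the error-prone variable is toward zero. That is, the value of β_3 estimated by the OLS estimator is depressed, or attenuated, toward zero, making it smaller in magnitude than the true value. This is known as attenuation bias, and it can cause the experimenter to think that a regression variable is less effective than it actually is in explaining the variance of the response variable.
The coefficients of the other regression variables are, in general, also estimated inconsistently whenever those variables are correlated with the error-prone one, although the direction of their bias is not necessarily toward zero. None of this bias goes away no matter how large the sample is or how well-balanced it is.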
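The attenuation can be seen in a simulation. The sketch below assumes the classical setup in which δ is independent of the exact value x*_3, so that δ is necessarily correlated with the observed x_3. All numeric values and variable names are illustrative assumptions. In this particular setup, the OLS estimate of β_3 converges to β_3 multiplied by σ²_u / (σ²_u + σ²_δ), where σ²_u is the variance of the part of x*_3 that is not explained by x_2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta = np.array([1.0, 2.0, -3.0])            # assumed beta_1, beta_2, beta_3
sigma_u, sigma_delta = 1.0, 1.0              # assumed scales

x2 = rng.normal(size=n)
x3_true = 0.5 * x2 + rng.normal(scale=sigma_u, size=n)   # exact x*_3, correlated with x2
delta = rng.normal(scale=sigma_delta, size=n)            # independent of x3_true
x3_obs = x3_true + delta                                 # observed x_3
y = beta[0] + beta[1] * x2 + beta[2] * x3_true + rng.normal(size=n)

X_obs = np.column_stack([np.ones(n), x2, x3_obs])
beta_hat, *_ = np.linalg.lstsq(X_obs, y, rcond=None)

# Attenuation factor for this setup: sigma_u^2 / (sigma_u^2 + sigma_delta^2).
attenuation = sigma_u**2 / (sigma_u**2 + sigma_delta**2)
expected_beta3 = beta[2] * attenuation

print("sample corr(x3_obs, delta):", np.corrcoef(x3_obs, delta)[0, 1])
print("true beta_3              :", beta[2])
print("expected OLS limit       :", expected_beta3)
print("estimated beta_3         :", beta_hat[2])
print("true vs estimated beta_2 :", beta[1], beta_hat[1])
```

The estimate of β_3 should sit near the attenuated limit rather than near the true value, and because x_2 is correlated with x*_3 in this setup, the estimate of β_2 is pulled away from its true value as well. Increasing n tightens the estimates around these biased limits without moving them toward the truth.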
So what is the experimenter to do when one of the regression variables is being measured imperfectly and the degree of imperfection happens to be correlated with the measured value?
Fortunately, the situation is not hopeless for there are a few remedies:
- The easy and lazy thing to do is to just accept the endogeneity in the resulting model and the resulting bias in the estimated coefficients. This is a perfectly sound choice if the amount of correlation between the error and the observed value is anticipated to be small. This is usually a judgement call on the part of the experimenter as the measurement error cannot be directly observed.
- In panel data models, if the error-prone variable does not vary with time, one can simply difference it out of the model.
- One may identify one or more instrumental variables (IVs) for the error-prone endogenous variable, use them to instrument that variable, and estimate the resulting model consistently using an IV estimator such as 2-stage least squares.
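The instrumental-variable remedy can be sketched as follows. This is a hedged illustration rather than a recommended implementation: it assumes that a second, independent measurement z of the same exact value is available to serve as the instrument, which is one common choice but not something the chapter prescribes. The helper names ols and two_stage_least_squares and all numeric values are hypothetical. The two stages are run manually with NumPy, which gives valid point estimates; standard errors computed naively from the second stage would not be valid.

```python
import numpy as np


def ols(X, y):
    """Return the OLS coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def two_stage_least_squares(y, exog, endog, instruments):
    """Point estimates from 2SLS with a single endogenous regressor.

    exog must already contain the constant column.
    """
    Z = np.column_stack([exog, instruments])
    endog_hat = Z @ ols(Z, endog)                 # stage 1: project endog on Z
    X_hat = np.column_stack([exog, endog_hat])
    return ols(X_hat, y)                          # stage 2: regress y on fitted values


rng = np.random.default_rng(3)
n = 200_000
beta = np.array([1.0, 2.0, -3.0])                 # assumed beta_1, beta_2, beta_3

x2 = rng.normal(size=n)
x3_true = 0.5 * x2 + rng.normal(size=n)           # exact x*_3
x3_obs = x3_true + rng.normal(size=n)             # error-prone measurement x_3
z = x3_true + rng.normal(size=n)                  # assumed second measurement (instrument)
y = beta[0] + beta[1] * x2 + beta[2] * x3_true + rng.normal(size=n)

exog = np.column_stack([np.ones(n), x2])
beta_ols = ols(np.column_stack([exog, x3_obs]), y)
beta_iv = two_stage_least_squares(y, exog, x3_obs, z)

print("true coefficients:", beta)
print("OLS estimates    :", beta_ols)
print("2SLS estimates   :", beta_iv)
```

The instrument works here because z is correlated with x_3 through the shared exact value, while its own measurement error is independent of both δ and ϵ. Under those assumptions, the 2SLS estimates should be close to the assumed coefficients, whereas the OLS estimate of β_3 stays attenuated.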
Summary
- Measurement errors can (and often do) creep into both the response variable and the explanatory variables of a regression model.
- In the case of a linear model, measurement errors in the response variable are usually not a big problem. The model can still be consistently estimated using least squares (or in case of a model with instrumented variables, using 2-stage least squares).
- If the errors are in an explanatory variable, the model can still be consistently estimated using OLS (or 2-stage least squares) provided the errors are not correlated with the observed value of the variable.
- In all cases, measurement errors increase the variance of the error term of the model, which causes the prediction intervals to be wider and the model’s predictions to be less precise. The loss in precision grows with the degree of imprecision in measuring the response variable or the explanatory variables.
- The model’s precision suffers more seriously if highly relevant regression variables contain measurement errors, than if irrelevant variables contain measurement errors.
- If you are doubtful about the relevance of a variable, and it is also likely to be difficult to measure precisely, you may be doing your regression model a favor by simply leaving it out.
- If one or more explanatory variables are measured erroneously and the measurement error is correlated with the observed value of the variable, it makes the variable endogenous. This model cannot be estimated consistently using the OLS estimator, and any attempt to do so will result in biased coefficient estimates, with the coefficient of the error-prone variable biased toward zero, i.e. attenuated.
- When faced with the above situation, common remedies are to either accept the bias, or to difference out the error-prone variable(s), or to identify instrumental variables to take the place of the error-prone variables.