While exogeneity is a good thing, endogeneity can put into question your model’s effectiveness
In this chapter, we’ll look at what exogenous and endogenous variables are in the context of regression analysis. We’ll also explain what happens to your regression model when one or more regression variables turn out to be endogenous. We’ll learn how to spot endogeneity, and we’ll touch upon a few ways to deal with it.
Let’s start with the raw definitions of the terms, and then develop our intuition about them using real-world examples.
What is an exogenous variable?
Consider the following linear regression model:
y = Xβ + ϵ    (1)
In the above model, y is the dependent variable, X is the matrix of explanatory variables including the placeholder for the intercept term, β is the vector of regression coefficients (it includes the intercept β_1), and ϵ is the vector of error terms.
While X may be able to explain some of the variance in y, the remaining unexplained variance in y has to go somewhere. That ‘somewhere’ is the error ϵ. Thus, the error term ϵ represents the effect on the dependent variable of all factors that the explanatory variables of the model have not been able to account for.
Let’s further assume that there are n data points in the data sample, and k regression variables. Thus, y is a column vector of size [n x 1], β is a column vector of size [k x 1], X is a matrix of size [n x k] (which includes the placeholder column of 1s for the intercept), and ϵ is a column vector of size [n x 1], as follows:
y = [y_1, y_2, …, y_n]ᵀ, β = [β_1, β_2, …, β_k]ᵀ, ϵ = [ϵ_1, ϵ_2, …, ϵ_n]ᵀ, and the ith row of X is [x_1_i, x_2_i, …, x_k_i] with x_1_i = 1    (2)
The model’s equation for the ith row in the sample can be expressed as follows (where x_k_i is the value of the kth regression variable x_k in the ith row):
y_i = β_1*x_1_i + β_2*x_2_i + … + β_k*x_k_i + ϵ_i    (3)
With this setup in place, let’s get to the definitions of interest.
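Before turning to the definitions, here is a minimal Python sketch of the setup described above. The sizes n = 5 and k = 3, the coefficient values and the noise scale are illustrative assumptions and not values from this chapter. The sketch builds X with a leading column of 1s and computes each y_i row by row.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, k = 5, 3             # hypothetical: 5 data points, 3 regression variables

# X is n x k; the first column is the placeholder of 1s for the intercept
X = [[1.0] + [rng.gauss(0, 1) for _ in range(k - 1)] for _ in range(n)]
beta = [2.0, 0.5, -1.0]                      # k x 1 vector of made-up coefficients
eps = [rng.gauss(0, 0.1) for _ in range(n)]  # n x 1 vector of errors

# Row-wise form of the model: y_i = sum over j of (beta_j * x_j_i) + eps_i
y = [sum(b * x for b, x in zip(beta, row)) + e for row, e in zip(X, eps)]

print(len(X), len(X[0]), len(beta), len(y))  # the dimensions n, k, k, n
```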
In econometrics, and especially in the context of a regression model such as the one depicted in Eq (1), an exogenous variable is an explanatory variable that is not correlated with the error term.
In the context of the above regression model, the regression variable x_k is exogenous if x_k is not correlated with ϵ. Specifically, for any given row of the data set, the value of x_k_i should be uncorrelated with the corresponding error ϵ_i. Remember that in regression analysis, we consider the value of the kth regression variable for the ith row x_k_i, and the corresponding error ϵ_i, to be random variables. So the concept of correlation (or, in this case, the lack of it) applies to them.
For practical purposes, exogeneity is taken to mean that the mean value of the error term is not influenced by (and is therefore not a function of) the exogenous explanatory variable. This is a slightly stronger requirement than zero correlation, and it is usually assumed alongside it. In other words, an exogenous explanatory variable carries no information about the model errors and therefore cannot be used to predict (even inexactly) the errors.
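A small simulation can make this concrete. The sketch below is an illustration under assumed conditions (standard normal draws, 20,000 rows, and a helper named `corr` written for this example): the error is generated independently of the regressor, so the sample correlation and the mean error in each half of the x range should all come out close to zero.

```python
import random
from statistics import fmean

def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    ma, mb = fmean(a), fmean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(42)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]    # exogenous regressor (simulated)
eps = [rng.gauss(0, 1) for _ in range(n)]  # error drawn independently of x

r = corr(x, eps)
# Knowing whether x is low or high should tell us nothing about the error
mean_eps_low = fmean([e for xi, e in zip(x, eps) if xi < 0])
mean_eps_high = fmean([e for xi, e in zip(x, eps) if xi >= 0])

print(round(r, 3), round(mean_eps_low, 3), round(mean_eps_high, 3))
```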
Example of an exogenous variable
As we’ll soon see in the discussion on endogeneity, truly exogenous explanatory variables are hard to come by. One only need look hard enough to uncover some subtle, underlying link between an explanatory variable and the error term. Nevertheless, let’s make an attempt to illustrate exogeneity with a real-world example.
The year 2005 saw the second most active Atlantic hurricane season in recorded history, with 28 storms causing billions of dollars of property damage in numerous states in the US. Suppose we wish to estimate the effect of hurricane damage on the change in property prices in Atlantic Ocean-facing states from before to after the 2005 hurricane season. In the following model, ΔPrice_i is the change in average sales price of property in Atlantic Ocean-facing state i from 2004 to 2006. Hurricane_Affected_i is a binary (1/0) variable indicating whether state i experienced significant damage from hurricanes, and δ is the coefficient that measures the degree of influence that Hurricane_Affected_i has on ΔPrice_i. x_i, β and ϵ_i have their usual meanings as before. We wish to primarily estimate δ.
ΔPrice_i = x_i β + δ*Hurricane_Affected_i + ϵ_i    (4)
Let’s assume that on a time scale that stretches across hundreds of hurricane seasons, the area that constitutes each Atlantic Ocean-facing state has roughly the same chance of experiencing significant property damage from Atlantic hurricanes. With this assumption, it is easy to see that whether the ith Atlantic Ocean-facing state experienced significant property damage in the 2005 season must be independent of pretty much any factor contained within the error term of the model. Thus, Hurricane_Affected_i is uncorrelated with ϵ_i, making Hurricane_Affected_i an exogenous variable.
The converse of this situation yields an endogenous variable.
What is an endogenous variable?
An endogenous variable is any variable in the regression model that is correlated with the error term. An endogenous variable carries information about the error term (and vice versa). In theory, at least an inexact function can be constructed to predict the mean value of the error given the value of the endogenous variable.
In the regression model shown in Eq (1), if the kth regression variable x_k is endogenous, the following holds true for any row i in the data set:
E(ϵ_i|x_k_i) = f(x_k_i), where f(.) is some non-constant function of x_k_i
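To see what such a function f(.) might look like, here is a hedged simulation sketch. It assumes a single hidden factor w that feeds into both the regressor and the error; all distributions, the sample size and the bin edges are illustrative choices. Under these assumptions the mean error within each bin of x rises steadily with x, which is exactly the dependence that the conditional expectation above describes.

```python
import random
from statistics import fmean

rng = random.Random(7)
n = 20_000
w = [rng.gauss(0, 1) for _ in range(n)]    # hidden factor (assumed for illustration)
x = [wi + rng.gauss(0, 1) for wi in w]     # the regressor is partly driven by w
eps = [wi + rng.gauss(0, 1) for wi in w]   # the error also contains w

def binned_mean_error(lo, hi):
    """Average error among rows whose x falls in [lo, hi)."""
    return fmean([e for xi, e in zip(x, eps) if lo <= xi < hi])

bins = [(-10, -1), (-1, 0), (0, 1), (1, 10)]
means = [binned_mean_error(lo, hi) for lo, hi in bins]

for (lo, hi), m in zip(bins, means):
    print(f"x in [{lo}, {hi}): mean error = {m:.3f}")
```

With these particular settings the theoretical relationship is E(ϵ|x) = x/2, so the binned means should be clearly negative for low x and clearly positive for high x.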
Hidden factors inside the error term
Since correlation is a two-way street, another way of looking at endogeneity is to imagine that the error term of the regression model influences the mean value of the endogenous regression variable.
By this view, one may imagine that there are one or more unobserved factors hiding within the error term of the model. These factors are correlated with the endogenous variables of the model and therefore, when the values of these hidden factors undergo a change, the mean values of all of the correlated endogenous variables also change. That in turn changes the mean value of the response variable of the model. Remember that the endogenous explanatory variables of the model represent observable quantities, while the error term is inherently invisible to the experimenter. Therefore what the experimenter observes is the endogenous variables changing their values and the corresponding changes in the observed value of the dependent variable. What the experimenter does not realize is that at least some portion of the variations in the endogenous explanatory variables are being brought about by the changes in the hidden, unobserved factors in the error terms.
When you have endogenous variables in your model, unbeknownst to you, your model’s error term is influencing the model’s response using all of the endogenous explanatory variables in the model as the communication channel!
As a modeler, this is not a good state of affairs to find oneself in, for a number of reasons. We’ll explain below what those reasons are.
Consider the following data set of prices of passenger vehicles plotted against the number of engine cylinders:
[Figure: scatter plot of vehicle price against number of engine cylinders]
If we suspect that a linear relationship exists between price and number of cylinders, we may represent that relationship using the following model:
price_i = β_1 + β_2*num_of_cylinders_i + ϵ_i    (5)
In the above model, price is the dependent variable, num_of_cylinders is the single explanatory variable, β_1 is the regression intercept, β_2 is the coefficient of num_of_cylinders, and ϵ is the error term. The subscript i indicates the quantities for the ith row (ith vehicle) in the sample.
We are interested in estimating the mean price of the ith vehicle given its number of cylinders. In other words, we want the conditional expectation of price on number of cylinders, which is denoted as E(price_i|num_of_cylinders_i).
Since we have assumed a linear relationship between price and number of cylinders, we would expect this conditional expectation to be a function of only the number of cylinders. Thus, we are seeking a function of num_of_cylinders (and only num_of_cylinders) that would yield the conditional mean price E(price_i|num_of_cylinders_i). In other words, we are seeking the following conditional mean function:
E(price_i|num_of_cylinders_i) = β_1 + β_2*num_of_cylinders_i    (6)
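If we apply the expectation operator E(.), conditioned on num_of_cylinders_i, to both sides of Eq (5), we get the following relationship:
E(price_i|num_of_cylinders_i) = E(β_1|num_of_cylinders_i) + E(β_2*num_of_cylinders_i|num_of_cylinders_i) + E(ϵ_i|num_of_cylinders_i)    (7a)
In the above equation, the first term on the R.H.S. resolves to simply β_1, and the second term resolves to β_2*num_of_cylinders_i:
E(price_i|num_of_cylinders_i) = β_1 + β_2*num_of_cylinders_i + E(ϵ_i|num_of_cylinders_i)    (7b)
The only way that we will be able to construct an estimable linear model of the kind in Eq (6) is if the last term on the R.H.S. of (7b) is zero, i.e.,
E(ϵ_i|num_of_cylinders_i) needs to be zero.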
This conditional expectation is the mean value of error in the modeled price of the ith vehicle conditioned upon the specific value of the number of cylinders.
This conditional expectation is zero (or constant) when the mean of the error does not depend on num_of_cylinders, which is what we rely on when we treat num_of_cylinders as exogenous. (Strictly speaking, zero correlation between the error and num_of_cylinders is implied by this condition but does not by itself guarantee it.) If the conditional mean of the error E(ϵ_i|num_of_cylinders_i) is some non-zero constant, we can simply absorb it into the intercept β_1 of the model, and our desired conditional mean function in Eq (6) remains intact.
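The point about a constant, non-zero error mean can be checked with a short simulation. The following is only a sketch under assumed values: an intercept of 1, a slope of 2, cylinder counts drawn from {4, 6, 8}, and an error whose mean is 3 regardless of the cylinder count. None of these numbers come from the automobile data set. If the reasoning above holds, least-squares should recover the slope of 2 while the fitted intercept lands near 1 + 3 = 4.

```python
import random
from statistics import fmean

rng = random.Random(3)
n = 20_000
cyl = [rng.choice([4, 6, 8]) for _ in range(n)]   # hypothetical regressor values
eps = [3.0 + rng.gauss(0, 1) for _ in range(n)]   # error mean is 3, unrelated to cyl
price = [1.0 + 2.0 * c + e for c, e in zip(cyl, eps)]

# Simple least-squares fit of price on cyl
mx, my = fmean(cyl), fmean(price)
slope = (sum((c - mx) * (p - my) for c, p in zip(cyl, price))
         / sum((c - mx) ** 2 for c in cyl))
intercept = my - slope * mx

print(round(intercept, 2), round(slope, 2))
```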
But if num_of_cylinders_i is endogenous, Eq (6) is no longer the correct mean function. We should not be using a technique such as least-squares to estimate the conditional mean using Eq (6). And if we still go ahead and estimate Eq (6) using least-squares, we will get incorrect, specifically, biased, estimates of regression coefficients β.
Let’s delve into this last point in some detail.
The effect of endogeneity on a regression model
Let’s revisit the model in Eq (1):
y = Xβ + ϵ    (1)
Suppose the kth regression variable x_k is endogenous, while variables x_1 through x_(k-1) are exogenous. Using this supposition, we can partition the X matrix into two matrices as follows:
- A matrix X* of size [n x (k-1)], which contains variables x_1 through x_(k-1) from X which are all assumed to be exogenous.
- A column vector x_k of size [n x 1] which contains the kth column of X. We are assuming that x_k is endogenous, i.e. x_k is correlated with the error ϵ.
We multiply the X* matrix with β*, which is a column vector of size [(k-1) x 1] containing all coefficients from β except the kth coefficient. Notice that, as before, the matrix multiplication of X* with β* yields a column vector of size [n x 1]. To this, we add the [n x 1] column vector x_k scaled by the kth coefficient β_k, and finally, to this sum we add the [n x 1] column vector of errors so as to yield the [n x 1] column vector of the observed y values.
y = X*β* + x_k β_k + ϵ    (8)
Since we have assumed that x_k is correlated with ϵ, there must be at least one hidden factor within ϵ that x_k is correlated with. This hidden factor can be considered as an explanatory variable that the experimenter has omitted from the model simply because it is unobservable or unmeasurable and therefore impossible to include. Let this omitted variable be hypothetically denoted by w.
If w could be included in the model (scaled so that its coefficient is 1), the theoretically correct model would be the following:
y = X*β* + x_k β_k + w + v    (9)
Where all variables X*, x_k and w are now exogenous, and thus, the error term v is not correlated with any of them. This model can be correctly estimated using ordinary least-squares, and all estimated coefficients will be unbiased.
Unfortunately, this is an impossible model as w cannot be observed. The model we must estimate is given by Eq. (8).
In a previous chapter on omitted variable bias, we have seen that:
- If an explanatory variable is omitted from a regression model, and
- The omitted variable is correlated with at least one of the explanatory variables in the model,
the omission has the effect of biasing the estimates of the coefficients of all variables that are included in the model.
The amount of bias is directly proportional to the amount of correlation between the omitted variable and the variables in the model that it is correlated with conditioned upon the remaining variables in the model, and is inversely proportional to the variance of the endogenous variable conditioned upon the remaining variables in the model.
In our case, the correlation between the endogenous x_k and the error term ϵ can be construed as a correlation between x_k and the hypothetical variable w. Since w cannot be observed, it is effectively omitted from the model, causing the coefficients of all variables in the model to be biased away from their true values.
The greater the correlation of x_k with ϵ, the more the estimated regression coefficients will be biased away from their true population values.
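The relationship between the strength of the correlation and the size of the bias can be illustrated with a simulation. This is a sketch under assumed conditions: a true slope of 2, a hidden factor w inside the error, and a regressor x that loads on w with weight rho while keeping unit variance. With this construction the correlation between x and the error is rho/√2, and the least-squares slope should drift further above 2 as rho grows.

```python
import random
from statistics import fmean

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

TRUE_SLOPE = 2.0  # assumed population value
rng = random.Random(11)
n = 20_000
biases = []
for rho in (0.0, 0.3, 0.6, 0.9):
    w = [rng.gauss(0, 1) for _ in range(n)]   # hidden factor
    # x has unit variance and loads on w with weight rho
    x = [rho * wi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for wi in w]
    eps = [wi + rng.gauss(0, 1) for wi in w]  # the error contains w
    y = [1.0 + TRUE_SLOPE * xi + e for xi, e in zip(x, eps)]
    biases.append(ols_slope(x, y) - TRUE_SLOPE)
    print(f"rho = {rho}: estimated slope minus true slope = {biases[-1]:.3f}")
```

In this particular design the bias works out to approximately rho itself, so rho = 0 (an exogenous x) gives an essentially unbiased slope.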
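It is possible to quantify this bias. When Eq (8) is estimated in place of the theoretical model in Eq (9), with w scaled so that its coefficient in Eq (9) is 1, the expected value of the estimated coefficient of x_k comes out to be as follows:
E(β_cap_k) = β_k + Cov(w,x_k|[x_1,…x_(k-1)]) / Var(x_k|[x_1,…x_(k-1)])    (10)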
In the above equation:
- E(β_cap_k) is the mean value of the kth coefficient estimated by least-squares,
- β_k is its true population value,
- Cov(w,x_k|[x_1,…x_(k-1)]) is the conditional covariance of w with x_k
- Var(x_k|[x_1,…x_(k-1)]) is the conditional variance of x_k.
Alas, in practice, this bias cannot be measured for the simple reason that the experimenter cannot observe w.
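In a simulation, unlike in practice, w is visible, so the bias expression can be checked numerically. The sketch below assumes one exogenous control x1, an endogenous x_k, a hidden w that enters y with a coefficient of 1, and made-up coefficient values. The helpers `ols` and `residuals` are written for this illustration. The conditional covariance and variance are approximated by first removing the linear effect of x1 from w and x_k, and the resulting ratio is compared with the bias actually observed when w is left out of the regression.

```python
import random

def ols(columns, y):
    """Least-squares coefficients of y on the given columns via the normal equations."""
    k = len(columns)
    # Augmented matrix [X'X | X'y]
    a = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(columns[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [rv - f * cv for rv, cv in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def residuals(target, controls):
    """Part of target left over after regressing it on the control columns."""
    coefs = ols(controls, target)
    return [t - sum(c * col[i] for c, col in zip(coefs, controls))
            for i, t in enumerate(target)]

rng = random.Random(5)
n = 20_000
BETA_K = 2.0                                 # assumed true coefficient of x_k
ones = [1.0] * n                             # intercept placeholder column
x1 = [rng.gauss(0, 1) for _ in range(n)]     # exogenous control
w = [0.5 * a + rng.gauss(0, 1) for a in x1]  # unobserved factor (visible only here)
xk = [0.8 * wi + 0.5 * a + rng.gauss(0, 1) for wi, a in zip(w, x1)]  # endogenous
y = [1.0 + 1.5 * a + BETA_K * b + wi + rng.gauss(0, 1)
     for a, b, wi in zip(x1, xk, w)]

# Bias actually observed when the model is estimated without w
actual_bias = ols([ones, x1, xk], y)[2] - BETA_K

# Bias predicted by the covariance-to-variance ratio, conditioning on x1
xk_res = residuals(xk, [ones, x1])
w_res = residuals(w, [ones, x1])
predicted_bias = (sum(p * q for p, q in zip(w_res, xk_res))
                  / sum(p * p for p in xk_res))

print(round(actual_bias, 3), round(predicted_bias, 3))
```

The two numbers should agree closely, differing only by sampling noise from the part of the error that is unrelated to x_k.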
Situations that result in endogenous variables
Earlier in the chapter, we mentioned that truly exogenous variables are hard to come by, which is to say that endogenous variables are easy to come by. That fact can have a terrible effect on even the most carefully thought-out experiments. Here are a couple of examples that illustrate how easy it is for endogeneity to creep into your regression model.
Treatment effect models
Often, one is interested in establishing relationships between one or more explanatory variables and the response variable that go beyond just correlation. One wants to know if event A causes event B. Does smoking cause lung cancer (it does!)? Does an Ivy League education lead to greater lifetime earnings (the jury is still out on that one)? There is a vast body of literature on causality. Let’s examine one such model in which we seek to estimate the effect of getting a high GPA (say ≥3.8/4.0) in an Ivy League college on lifetime earnings. The model may be stated as follows:
Lifetime_Earnings_i = x_i β + γ*High_GPA_i + ϵ_i    (11)
X is a matrix of regression variables that may be relevant to the study. X may contain variables such as parents’ education, ethnicity, gender etc. High_GPA is a binary (1/0) variable. It is the ‘treatment’ variable in the model. Our aim is to estimate the value of γ.
Suppose we restrict the scope of the experiment to only include students from Ivy League colleges.
The question is, is High_GPA exogenous? Elite schools have been placing a large emphasis on a high degree of collaboration in coursework. One may suspect that students whose personality traits are conducive toward collaborating effectively with other students would be able to derive maximum benefit from this school policy. Such students may even develop an edge in their college GPAs over other students. But traits that may be conducive toward effective collaboration, such as openness, honesty, likability, extroversion and leadership, are also the ones that could influence the person’s ability to acquire and hold high-paying employment positions or run successful businesses after college. All of these traits are inherently hard to measure in the quantitative setting of a regression model, and therefore their effect is hidden in the error term of the model. At the same time, they appear to be correlated with the observable explanatory variable High_GPA, thereby making High_GPA endogenous. It is reasonable to assume that there is a positive correlation between these personality traits and High_GPA, and therefore the estimated coefficient of High_GPA in Eq (11) will be biased on the higher side of its true value. In other words, the experimenter is likely to overestimate the effect of High_GPA on Lifetime_Earnings.
It should be noted that even if the experimenter goes to great lengths to ensure a perfectly balanced sample in terms of all the variables of the model, the estimated coefficient of High_GPA will still be biased. The same holds true no matter how large the sample is. The over-estimation of the effect of a high GPA on earnings cannot be avoided.
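The claim that a larger sample does not help can be illustrated with a simplified simulation. The sketch below is not the model in this chapter: it drops the X variables, assumes a single unobserved trait that raises both the chance of a high GPA and earnings, and uses a made-up true effect GAMMA = 5. With a lone binary regressor, the least-squares coefficient equals the difference in mean earnings between the two groups, which is what the hypothetical function `estimate_gamma` computes.

```python
import random
from statistics import fmean

GAMMA = 5.0  # hypothetical true effect of High_GPA on earnings

def estimate_gamma(n, seed):
    """Difference in mean earnings between high-GPA and other students."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)  # unobserved personality trait
        high_gpa = 1 if trait + rng.gauss(0, 1) > 1.0 else 0
        earnings = 10.0 + GAMMA * high_gpa + 2.0 * trait + rng.gauss(0, 1)
        (treated if high_gpa else control).append(earnings)
    return fmean(treated) - fmean(control)

# The estimate stays well above GAMMA at every sample size
estimates = {n: estimate_gamma(n, seed=1) for n in (1_000, 10_000, 100_000)}
for n, est in estimates.items():
    print(f"n = {n}: estimated effect = {est:.2f} (true effect = {GAMMA})")
```

Increasing n makes the estimate more precise, but it converges to the wrong number: the true effect plus the contribution of the hidden trait.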
Non-random samples
Consider an experiment that seeks to estimate the effect of weekly orange juice consumption by adult inhabitants of a town on monthly frequency of common colds. The expectation may be one of a negative correlation between the two.
To enlist volunteers, the experimenter posts flyers all across town at places such as outside supermarkets, the public library, bus stops etc.
Unfortunately, this experiment may be doomed due to endogeneity issues.
The flyers are posted only at outdoor locations and are therefore out of reach of inhabitants who are home-bound or physically or mentally challenged. If the text on the flyers is too small, older people may not spot it or be able to read it easily enough. People who do not take public transportation to get around may be less likely to participate. In general, the cohort of study participants may turn out to be a certain set of able-bodied residents who step outdoors regularly and often take public transportation, which is hardly what you may call a randomly selected sample.
If there happens to be an Income variable in this model, it is likely to be highly correlated with exactly these factors (home-boundness, physical fitness, non-use of public transportation etc.), which have not been controlled for by the experimenter and whose effects are therefore hidden in the error term. That would make Income endogenous in the model.
If some of the same unobserved factors are correlated with the weekly intake of orange juice, then the Weekly_Orange_Juice_Intake variable is also endogenous.
If the model in its current form is estimated using OLS, the estimated coefficients of all variables will be biased away from their true values, thereby systematically over- or under-estimating the impact of each variable on the incidence of common colds. Specifically, the experimenter will either over-estimate or under-estimate the true effect (if it exists) of orange juice consumption on the frequency of common colds.
Remedies for endogeneity
We’ll end the chapter with a short overview of techniques and strategies available to us when faced with variable endogeneity. We’ll study how to use these tools in subsequent chapters. They are as follows:
- If we suspect that the variables that are assumed to be endogenous are not heavily correlated with unobserved factors in the error term, then we can assume that the resulting bias in the coefficients will be mild. This can be seen from Eq (10). We can just accept this presumed mild level of bias. Incidentally, recall that we would have no means to numerically estimate this bias.
- We may choose to use one or more proxy variables in place of the unobserved effects hiding in the error term thereby ‘pulling’ their effect out of the error term and into the model. An example of a proxy is number of years of schooling (which is measurable) for education (which is unmeasurable), and the IQ score (measurable) as a proxy for skill or ability (unmeasurable). We’ll cover the use of proxies in a subsequent chapter.
- In time series models, panel data models or treatment effect models, if the unobserved factors responsible for the (suspected) endogeneity do not vary with time, we can simply difference them out.
- A powerful estimation technique called Instrumental Variables (IV) estimation can be used to isolate out the portion of the variance in the endogenous variable that may be correlated with the error term by introducing another set of regression variables called instruments. We’ll study IV estimation in a subsequent chapter.
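As a preview of the last remedy in the list above, here is a hedged sketch of the simplest IV estimator, with one endogenous regressor and one instrument. Everything in it is an assumption made for illustration: the data-generating process, the true slope of 2, and the instrument z, which by construction affects x but is unrelated to the error. In this single-instrument case the IV estimate reduces to the ratio Cov(z, y) / Cov(z, x).

```python
import random
from statistics import fmean

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = fmean(a), fmean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

rng = random.Random(123)
n = 20_000
z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: moves x, unrelated to the error
w = [rng.gauss(0, 1) for _ in range(n)]  # unobserved factor hiding in the error
x = [zi + wi + rng.gauss(0, 1) for zi, wi in zip(z, w)]              # endogenous regressor
y = [1.0 + 2.0 * xi + wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]  # true slope is 2

beta_ols = cov(x, y) / cov(x, x)  # biased, because x is correlated with w
beta_iv = cov(z, y) / cov(z, x)   # uses only the variation in x that comes from z

print(round(beta_ols, 3), round(beta_iv, 3))
```

Under these assumptions the least-squares slope settles noticeably above 2, while the IV estimate stays close to it. In real data the hard part is arguing that a candidate instrument truly is unrelated to the error.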
Summary and key takeaways
- In econometrics, and especially in the context of a regression model, an exogenous variable is an explanatory variable that is not correlated with the error term.
- An endogenous variable is any variable in the regression model that is correlated with the error term. By definition, the dependent variable y of a regression model is always endogenous.
- When a regression model contains one or more endogenous explanatory variables, the model’s error term influences the model’s response via all of the endogenous explanatory variables.
- When a regression model containing endogenous explanatory variables is estimated using least-squares, the estimated coefficients are biased away from their true population values.
- Endogeneity can be surprisingly easy to introduce into the design of an experiment, and when present, it can spoil the usefulness of the experiment’s results.
- Endogeneity, if it is suspected to be severe, can be controlled using techniques such as proxy variables, differencing, and instrumental variables.
Citations and Copyrights
Data set
The Automobile Data Set is sourced from UCI ML data sets repository under CC BY 4.0.
Images
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.