How to use proxies in place of variables that cannot be observed
Sometimes, one or more explanatory variables in a regression model cannot be observed or measured. In this chapter, we’ll look at how to use proxy variables in place of such unmeasurable variables. We’ll explain what makes for a good versus a bad proxy. And we’ll examine the tradeoff between including proxies versus leaving them out.
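Consider the following regression model as a way to estimate the college GPA of a student given several factors such as academic readiness, emotional readiness, drug use while in high school, gender and ethnicity (β_0 through β_5 denote the regression coefficients):
College_GPA_i = β_0 + β_1·Academic_Readiness_i + β_2·Emotional_Readiness_i + β_3·Drug_Use_In_HS_i + β_4·Gender_i + β_5·Ethnicity_i + ϵ_i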
In the above model, the subscript i refers to the ith student entering college. ϵ_i is the error in the ith observation.
Academic readiness depends on several factors such as the ability to grasp and assimilate new and complex information, knowledge of the intended college major, the ability to master complex college-level coursework, and many other things. Emotional readiness covers aspects such as personality traits, family background, and pre-college experiences. One may surmise that Academic_Readiness and Emotional_Readiness are intrinsically unmeasurable variables.
With two major variables in our model not being directly observable, we are in a sticky situation. At this point, we have basically three options on how to proceed:
- We could drop Academic_Readiness and Emotional_Readiness from the model, and accept the bias in the estimated coefficients that could result from omitting them, or
- We could find suitable proxy variables which could to some extent stand in for Academic_Readiness and Emotional_Readiness, or
- We could use a technique called Instrumental Variables to account for our inability to use Academic_Readiness and Emotional_Readiness in our model.
In an upcoming chapter, we’ll study the use of Instrumental Variables. In this chapter, we’ll look at how to use proxy variables.
Consider academic readiness. We cannot measure it directly, but how about we use High School GPA and SAT score as two variables that would act as proxies for academic readiness? And suppose we use the student’s EQ (Emotional Quotient) score as a proxy for emotional readiness?
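If we use these proxies, we will be estimating the following model:
College_GPA_i = β_0 + β_1·HS_GPA_i + β_2·SAT_Score_i + β_3·EQ_Score_i + β_4·Drug_Use_In_HS_i + β_5·Gender_i + β_6·Ethnicity_i + ϵ_i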
We may be able to use these proxies if we can assume (or establish) the following about them:
The Proxy should be correlated with the primary variable
The proxy variable(s) should be correlated with the variable that they are proxying for. This seems like an obvious thing but one should still put this requirement through a sanity check while choosing proxies. Unfortunately, since the primary variable cannot be directly observed, one can only introspect and use one’s judgement to decide whether the proxy satisfies this requirement.
In our example, a student’s High School GPA and SAT score need to be correlated with their academic readiness. One way to express this correlation is via the following linear model:
Academic_Readiness_i = δ_0 + δ_1·HS_GPA_i + δ_2·SAT_Score_i + ϵ_i
The above model cannot be estimated since academic readiness is an unobservable variable. Hence, we can only postulate that if this model were fitted on real data and the coefficients δ_1 and δ_2 were found to be jointly significant (via an F-test), then the proxies for academic readiness would be jointly correlated with academic readiness. As with all proxy-based studies, in the absence of actual data, we must assume that this holds for our two chosen proxies.
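Although this regression cannot be run on real data, the logic of the joint-significance check can be illustrated on simulated data in which the latent variable is known by construction. The following is a minimal sketch and not part of the original analysis: the data-generating process, every numeric value, and the helper name rss are assumptions made purely for illustration. The F-statistic is computed by hand from the restricted (intercept only) and unrestricted residual sums of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical latent variable; with real data it would be unobservable.
academic_readiness = rng.normal(0.0, 1.0, n)

# Assumed data-generating process: both proxies are noisy signals of the latent variable.
hs_gpa = 3.0 + 0.4 * academic_readiness + rng.normal(0.0, 0.3, n)
sat_score = 1050.0 + 120.0 * academic_readiness + rng.normal(0.0, 90.0, n)

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

X_restricted = np.ones((n, 1))                             # intercept only
X_full = np.column_stack([np.ones(n), hs_gpa, sat_score])  # intercept + both proxies

rss_r = rss(X_restricted, academic_readiness)
rss_f = rss(X_full, academic_readiness)

q = 2                # restrictions under the null: delta_1 = delta_2 = 0
k = X_full.shape[1]  # parameters in the unrestricted model
f_stat = ((rss_r - rss_f) / q) / (rss_f / (n - k))
print(f"F-statistic for joint significance of the proxies: {f_stat:.1f}")
```

A large F-statistic here only reflects the correlation that was built into the simulation; with real data the latent variable is unavailable, which is why the requirement has to be argued rather than tested.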
The Proxy should not be endogenous
In the model shown below that regresses academic readiness on its two proxies, the two proxies should not be endogenous:
Academic_Readiness_i = δ_0 + δ_1·HS_GPA_i + δ_2·SAT_Score_i + ϵ_i
That is, HS_GPA_i and SAT_Score_i should not be correlated with the error term ϵ_i of the model.
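A more serious situation arises in the following scenario: Suppose in a model that includes the proxy variables in place of the primary variables, the proxy variables turn out to be endogenous. In our example, it would be the following model:
College_GPA_i = β_0 + β_1·HS_GPA_i + β_2·SAT_Score_i + β_3·EQ_Score_i + β_4·Drug_Use_In_HS_i + β_5·Gender_i + β_6·Ethnicity_i + ϵ_i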
For the above model to be estimable via least-squares, all variables on the R.H.S. should be exogenous, i.e. not correlated with the error ϵ_i.
Gender_i and Ethnicity_i can reasonably be treated as exogenous, and we will assume the same of Drug_Use_In_HS_i.
If one or more of the proxies HS_GPA_i, SAT_Score_i, and EQ_Score_i are endogenous, least-squares estimation of the above model would yield biased and inconsistent coefficient estimates.
Moreover, the presence of endogeneity implies that there are one or more factors hiding within the model’s error term ϵ_i that are influencing the response variable College_GPA_i via the endogenous proxies. One way to look at this situation is that an endogenous proxy is a proxy for another variable which could be the real proxy for the primary variable, thereby making our choice of proxy wrong or at least sub-optimal.
To test whether the proxies are endogenous in the above model, one would normally run one of several available tests for endogeneity.
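To see why an endogenous regressor is a problem for least squares, consider the following minimal simulation. It is a sketch under stated assumptions: the names hidden_factor, proxy_exog and proxy_endog, the true slope of 0.5, and the whole data-generating process are hypothetical. It does not implement a formal endogeneity test; it only shows the kind of bias that such tests are meant to detect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_slope = 0.5  # assumed true effect of the proxy on the response

def ols_slope(x, y):
    """Slope from a least-squares fit of y on an intercept and x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])

# Hypothetical unobserved factor that ends up inside the error term.
hidden_factor = rng.normal(size=n)
error = hidden_factor + rng.normal(size=n)

# Case 1: the proxy is unrelated to the error term (exogenous).
proxy_exog = rng.normal(size=n)
y_exog = 1.0 + true_slope * proxy_exog + error

# Case 2: the proxy is partly driven by the hidden factor (endogenous).
proxy_endog = 0.8 * hidden_factor + rng.normal(size=n)
y_endog = 1.0 + true_slope * proxy_endog + error

slope_exog = ols_slope(proxy_exog, y_exog)
slope_endog = ols_slope(proxy_endog, y_endog)

print(f"true slope:                 {true_slope:.3f}")
print(f"estimate, exogenous proxy:  {slope_exog:.3f}")
print(f"estimate, endogenous proxy: {slope_endog:.3f}")
```

Under these assumptions the first estimate should land close to the true slope, while the second is pulled well away from it because the proxy picks up the effect of the hidden factor.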
The Proxy should not inject additional information into the model
Suppose we were able to measure academic readiness. If a model already contains academic readiness as a variable, adding High School GPA and SAT score to this model should not contribute any additional information about the response College GPA. In other words, the proxies should not be able to explain any additional variance in the response variable beyond what the variable they are proxying for is able to explain.
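Mathematically, if the following theoretical model were fitted to (theoretical) data, the estimated coefficients γ_1, γ_2, and γ_3 of High_School_GPA_i, SAT_Score_i and EQ_Score_i should work out to be statistically insignificant (i.e. in population terms, zero):
College_GPA_i = β_0 + β_1·Academic_Readiness_i + β_2·Emotional_Readiness_i + γ_1·High_School_GPA_i + γ_2·SAT_Score_i + γ_3·EQ_Score_i + β_3·Drug_Use_In_HS_i + β_4·Gender_i + β_5·Ethnicity_i + ϵ_i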
What would happen if they are non-zero?
Let’s imagine the scenario where one or more of γ_1, γ_2, and γ_3 are non-zero. That would mean that the corresponding proxy variables are relevant. In a model that leaves out the relevant proxy variables, these omitted variables would exert their influence via the error term causing the coefficients of all variables in the model to be biased away from their true values. This is the classic Omitted Variable Bias situation.
But things get worse. Since the omitted variables are also proxies, they are by definition correlated with the primary variables Academic_Readiness and Emotional_Readiness. Since the omitted proxies are exerting their influence through the error term, we are now in a situation where the error term is correlated with the primary variables of the model, thereby making the primary variables endogenous. This violates a fundamental requirement (and assumption) of the linear model.
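The "no additional information" requirement can be made concrete with simulated data in which the primary variable is observable by construction. The sketch below is illustrative only: the names latent, proxy and fit, and all numeric values, are assumptions. In the first response the proxy matters only through the latent variable, so its coefficient should be close to zero once the latent variable is in the model. In the second response the proxy also has a direct effect of 0.7, which violates the requirement.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical latent primary variable and a noisy proxy for it.
latent = rng.normal(size=n)
proxy = 0.9 * latent + rng.normal(scale=0.5, size=n)

def fit(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Response A: the proxy affects y only through the latent variable (gamma = 0).
y_redundant = 2.0 + 1.5 * latent + rng.normal(size=n)

# Response B: the proxy also has its own direct effect (gamma = 0.7).
y_direct = 2.0 + 1.5 * latent + 0.7 * proxy + rng.normal(size=n)

# Index 2 is the proxy's coefficient (0 = intercept, 1 = latent).
gamma_redundant = float(fit([latent, proxy], y_redundant)[2])
gamma_direct = float(fit([latent, proxy], y_direct)[2])

print(f"proxy coefficient, redundant case:     {gamma_redundant:.3f}")
print(f"proxy coefficient, direct-effect case: {gamma_direct:.3f}")
```

With real data this comparison is not available, since the latent column cannot be constructed; the simulation only shows what the condition would look like if it could be checked.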
When to use Proxies and when to leave them out
In the above discussion, we examined some of the effects of including a badly chosen or a sub-optimal proxy. Given that there exist no statistical tests to verify most of the qualities of a “good” proxy, we may also want to examine the effects of leaving out what might otherwise prove to be a useful proxy.
In most cases, the chosen proxy may not be able to perfectly satisfy all requirements. For example, a proxy may be endogenous but the degree of endogeneity may be mild. One may also suspect the proxy of injecting additional information into the model above and beyond what is needed for it to stand in for the primary variable. But we may also suspect this additional effect to be tiny and practically insignificant.
On the other hand, our common sense may suggest that there is an obvious and significant correlation between the unobservable primary variable and its observable proxy. In such a situation, simply leaving out the proxy would provide us with no means to account for the effect of an unobservable but important explanatory variable. Such a fitted model without either the primary explanatory variable or its proxy would result in higher residual error and a lower goodness-of-fit than when the proxy is included. Furthermore, if the omitted primary variable also happens to be correlated with any of the explanatory variables in the model, it would bias all estimated coefficients away from their true values.
In summary, unless one has strong reasons for leaving out a proxy, it is on balance better to include it and tolerate a possible loss of precision in the model’s estimated coefficients, than to entirely omit the proxy.
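The tradeoff described above can be sketched with a small simulation. Everything in it is an assumption made for illustration: the variable names (latent, observed_x, proxy), the true coefficient of 0.5 on observed_x, and the noise levels. The observed regressor is correlated with the unobservable one, so omitting the latter biases the coefficient of the former; including an imperfect proxy reduces that bias without removing it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

latent = rng.normal(size=n)                       # unobservable primary variable
observed_x = 0.6 * latent + rng.normal(size=n)    # observed regressor, correlated with it
proxy = latent + rng.normal(scale=0.5, size=n)    # imperfect (noisy) proxy
y = 1.0 + 0.5 * observed_x + 1.0 * latent + rng.normal(size=n)

def fit(columns, y):
    """Return least-squares coefficients and R-squared for y on intercept + columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid.var() / y.var()
    return beta, float(r2)

beta_omit, r2_omit = fit([observed_x], y)            # neither primary nor proxy
beta_proxy, r2_proxy = fit([observed_x, proxy], y)   # proxy included

b_omit = float(beta_omit[1])
b_proxy = float(beta_proxy[1])

print("true coefficient of observed_x: 0.500")
print(f"proxy omitted:  coef = {b_omit:.3f}, R-squared = {r2_omit:.3f}")
print(f"proxy included: coef = {b_proxy:.3f}, R-squared = {r2_proxy:.3f}")
```

In this setup, including the proxy moves the estimate closer to the true value and raises the goodness-of-fit, but some bias remains because the proxy measures the latent variable with noise.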
Let’s illustrate the use of proxy variables via a real world example.
Suppose we wish to estimate incidence of poverty in a community. Our unit of estimation will be a county in the United States. We’ll define the incidence of poverty in the county as the percentage of households in the county that are below the federal poverty level.
Let’s hypothesize that this percentage is correlated with the following factors:
- The median age of the county’s inhabitants.
- How strong the demand for housing (especially home ownership) is in the county. We will measure this factor in terms of the Homeowner Vacancy Rate, a federally published statistic which measures the percentage of the county’s homeowner housing inventory that is vacant and for sale.
- On average, how well educated is the population of the county.
There will undoubtedly be many other factors that would be correlated with the degree of prevalence of poverty in the county. But for now, we’ll consider the above three factors.
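A linear model that links the above three factors with the incidence of poverty in the county would look like this (β_0 through β_3 denote the regression coefficients):
Percent_Households_Below_Poverty_Level_i = β_0 + β_1·Median_Age_i + β_2·Homeowner_Vacancy_Rate_i + β_3·Percent_Well_Educated_i + ϵ_i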
In the above model, the subscript i indicates that the data refers to county i in the data set.
The Median Age in the county is directly measurable, and so is the Homeowner Vacancy Rate.
The Percent_Well_Educated which measures the percentage of people in the county who are “well educated” is not directly observable. Although people may generally agree that being well-educated is an important step toward achieving financial independence, being well-educated means different things to different people.
Furthermore, education as an achieved quality is arguably the integral effect of various observable and unobservable quantities such as number of years of formal schooling, participation in various kinds of trainings and apprenticeships—both formal and informal, general knowledge about one’s chosen profession, lateral knowledge about related topics, and the “soft” skills that one brings to bear for solving the problem at hand.
These factors make education exactly the sort of variable for which one would want to use proxies. In our model, we will use the variable Percentage of People in the County with a College Education or Higher as the proxy for the variable Percent_Well_Educated.
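Doing so yields the following model:
Percent_Households_Below_Poverty_Level_i = β_0 + β_1·Median_Age_i + β_2·Homeowner_Vacancy_Rate_i + β_3·Percent_With_College_Or_Higher_Educ_i + ϵ_i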
Let’s look at whether our chosen proxy satisfies the requirements of a “good” proxy.
Is the proxy correlated with the primary variable?
If our definition of being well educated includes (among other things) having at least a college degree, then (and only then) the higher the percentage of county inhabitants with at least a college degree, the more well-educated the county’s populace, and vice versa.
Is the proxy endogenous?
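If the proxy is endogenous, it is correlated with the error term of the model. In the following model, we’ll have to test if the proxy Percent_With_College_Or_Higher_Educ_i is endogenous:
Percent_Households_Below_Poverty_Level_i = β_0 + β_1·Median_Age_i + β_2·Homeowner_Vacancy_Rate_i + β_3·Percent_With_College_Or_Higher_Educ_i + ϵ_i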
This is not a straightforward test and it deserves its own chapter. We’ll show how to perform this test in an upcoming chapter.
Does the proxy inject additional information into the model?
To answer this question, we would need to construct a model that contains both the primary variable and the proxy, and check whether its goodness-of-fit is greater than that of a model containing only the primary variable. Since the primary variable Percentage of Well Educated People is inherently unmeasurable, this is an impossible experiment. Instead, we appeal to our view of what makes a well-educated individual. We posit that while it is possible for an individual to be well-educated without having a college degree, more often than not, such an individual will have a college degree. The data, if the right sort can be collected, may well prove us wrong in this notion. But if we are right, then adding the variable Percentage of People With College Or Higher Education will bring no additional information into a model that already contains the primary variable Percentage of Well Educated People.
With our introspection completed, let’s turn to the task of building the regression model.
Building and training the model
To fit the model, we’ll use a data set pulled from the US Census Bureau.
Specifically, we’ll use data about several socioeconomic indicators collected by the Census Bureau and aggregated at a county level. The data is a subset of the 2015–2019 American Community Survey (ACS) 5-Year Estimates conducted by the US Census Bureau. The following table contains the data that we will use:
The data set used in this example can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs, or directly from the Census Bureau’s Community Explorer website.
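To recap, we want to fit the following model to this data set:
Percent_Households_Below_Poverty_Level_i = β_0 + β_1·Median_Age_i + β_2·Homeowner_Vacancy_Rate_i + β_3·Percent_With_College_Or_Higher_Educ_i + ϵ_i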
In the data set, we’ve made a slight change to the name of the proxy variable Percent_With_College_Or_Higher_Educ_i so as to reflect the name used for this variable by the US Census Bureau, namely, Percent_Pop_25_And_Over_With_College_Or_Higher_Educ.
Let’s start by importing the required packages and loading the data file into a Pandas DataFrame:
    import pandas as pd
    import statsmodels.formula.api as smf

    # Load the US Census Bureau data into a DataFrame
    df = pd.read_csv('us_census_bureau_acs_2015_2019_subset.csv', header=0)
Construct the model’s equation in Patsy syntax. Statsmodels will automatically add the intercept of regression to the model, so we don’t have to explicitly specify it in the model’s equation:
    reg_expr = ('Percent_Households_Below_Poverty_Level ~ Median_Age'
                ' + Homeowner_Vacancy_Rate'
                ' + Percent_Pop_25_And_Over_With_College_Or_Higher_Educ')
Build and train the model and print the training summary:
    olsr_model = smf.ols(formula=reg_expr, data=df)
    olsr_model_results = olsr_model.fit()
    print(olsr_model_results.summary())
We see the following summary:
We will ignore the value of R-squared (or adjusted R-squared) as our interest lies in estimating the main effects of the observed explanatory variables on the response variable, namely, the poverty level in the county.
As an aside, we see that the coefficients of all explanatory variables are found to be significant at p < .001.
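Besides reading them off the printed summary, the coefficient estimates, p-values and confidence intervals can be pulled from the fitted results object programmatically. The following is a minimal sketch that uses a synthetic stand-in data frame so that it runs without the census file: demo_df, its randomly generated values, and the coefficients used to generate them are assumptions for illustration only and are not the ACS data. The attributes params and pvalues and the method conf_int belong to the statsmodels regression results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500

# Synthetic stand-in for the census data (NOT real ACS values).
demo_df = pd.DataFrame({
    'Median_Age': rng.normal(40.0, 5.0, n),
    'Homeowner_Vacancy_Rate': rng.uniform(0.0, 5.0, n),
    'Percent_Pop_25_And_Over_With_College_Or_Higher_Educ': rng.uniform(10.0, 50.0, n),
})
# Arbitrary generating coefficients, chosen only so that the fit has something to find.
demo_df['Percent_Households_Below_Poverty_Level'] = (
    30.0
    - 0.3 * demo_df['Median_Age']
    + 1.0 * demo_df['Homeowner_Vacancy_Rate']
    - 0.3 * demo_df['Percent_Pop_25_And_Over_With_College_Or_Higher_Educ']
    + rng.normal(0.0, 3.0, n)
)

reg_expr = ('Percent_Households_Below_Poverty_Level ~ Median_Age'
            ' + Homeowner_Vacancy_Rate'
            ' + Percent_Pop_25_And_Over_With_College_Or_Higher_Educ')
results = smf.ols(formula=reg_expr, data=demo_df).fit()

# Collect estimates, p-values and 95% confidence intervals in one table.
ci = results.conf_int(alpha=0.05)
ci.columns = ['ci_lower', 'ci_upper']
coef_table = pd.DataFrame({'coef': results.params, 'p_value': results.pvalues}).join(ci)
print(coef_table.round(4))
```

Replacing demo_df with the data frame loaded from the census file should give the same kind of table for the real model.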
Our model has estimated the main effects of each variable as follows:
The coefficient of Median_Age is -0.3317, indicating that a one-year increase in the median age in a county is correlated with a 0.33 percentage point drop in the share of below-poverty level households in the county.
The coefficient of Homeowner_Vacancy_Rate is 0.9198, indicating that every percentage point increase in the vacancy rate is correlated with a roughly 0.92 percentage point increase in the share of below-poverty level households in the county, and vice versa. This is a sizable effect.
The effect of the proxy variable Percent_Pop_25_And_Over_With_College_Or_Higher_Educ has a size and direction similar to that of Median_Age. Its coefficient is -0.3169, implying that roughly every 3 percentage point increase in the share of people aged 25 and over with a college or higher degree correlates with a one percentage point reduction in the share of below-poverty level households in the county.
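The "roughly 3 percentage points" figure follows from inverting the coefficient. A short worked calculation, using only the three coefficient values reported above (the dictionary name units_per_point is just a label chosen for this sketch):

```python
# Fitted coefficients as reported in the training summary.
coefficients = {
    'Median_Age': -0.3317,
    'Homeowner_Vacancy_Rate': 0.9198,
    'Percent_Pop_25_And_Over_With_College_Or_Higher_Educ': -0.3169,
}

# Change in each regressor associated with a one percentage point change
# in the share of households below the poverty level.
units_per_point = {name: 1.0 / abs(coef) for name, coef in coefficients.items()}

for name, coef in coefficients.items():
    direction = 'rise' if coef > 0 else 'drop'
    print(f"{name}: a {units_per_point[name]:.2f}-unit increase goes with "
          f"a 1 percentage point {direction} in the poverty share")
```

For the education proxy this gives 1 / 0.3169, which is about 3.16 percentage points of college-educated population per percentage point of poverty share.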
Correlation and not necessarily causation
Notice that we have been careful to state all our results in terms of correlation. Without the benefit of causal analysis, it is not at all clear whether the so-called “explanatory” variables influence the percentage of households below poverty level in a county, or if the arrow of influence points the other way.
Higher levels of education may very well lead to a reduction in poverty. One could also argue that affluent households and households generally above the poverty line would be able to afford college education more easily than ones below the poverty line. The nature of influence of the median age and the demand for housing on poverty level is even murkier.
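A regression coefficient by itself cannot tell these two stories apart. The following stylized simulation, in which all names and numbers are assumptions, generates data where poverty drives the college share and not the other way round. Regressing poverty on the college share still produces a clearly negative slope.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Assumed causal direction in this simulation: poverty -> college share.
poverty = rng.normal(15.0, 5.0, n)
college = 40.0 - 0.8 * poverty + rng.normal(0.0, 4.0, n)

# Regress poverty on college share, i.e. the "wrong" causal direction.
X = np.column_stack([np.ones(n), college])
beta, *_ = np.linalg.lstsq(X, poverty, rcond=None)
slope = float(beta[1])

print(f"slope of poverty on college share: {slope:.3f}")
```

The fitted slope is far from zero even though, by construction, changing the college share would do nothing to poverty in this simulated world.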
Summary and takeaways
- When one or more explanatory variables in a regression model cannot be observed or measured, we may want to use proxy variables in place of such unmeasurable variables.
- A proxy variable must ideally satisfy three requirements: 1) It should be correlated (preferably strongly) with the primary variable, 2) It should not be endogenous, 3) It should bring no additional information into the model beyond what is required for it to be a stand-in for the primary variable.
- Most of the requirements of what makes a good proxy are inherently untestable since the primary variable cannot be observed. Therefore, we must use our judgement to determine whether, and to what degree, the chosen proxy variable satisfies or violates these requirements.
- Faced with unobservable explanatory variables, on balance, it is better to include proxies and deal with the consequent loss of precision in the model’s coefficients, than to omit the proxies and suffer the consequences of a badly fitted model and possibly biased coefficient estimates.