And a Python tutorial on how to build and train a Fixed Effects model on a real-world panel data set
The Fixed Effects regression model is used to estimate the effect of intrinsic characteristics of individuals in a panel data set. Examples of such intrinsic characteristics are genetics, acumen and cultural factors. Such factors are not directly observable or measurable but one needs to find a way to estimate their effects since leaving them out leads to a sub-optimally trained regression model. The Fixed Effects model is designed to address this problem.
This chapter is PART 2 of the following three-part section on Panel Data Analysis:
- The Pooled OLS Regression Model For Panel Data Sets
- The Fixed Effects Regression Model For Panel Data Sets
- The Random Effects Regression Model for Panel Data Sets
A primer on panel data
A panel data set contains data that is collected over a certain number of time periods for one or more uniquely identifiable “units”. Examples of units are animals, persons, trees, lakes, corporations and countries. A data panel is called a balanced or an unbalanced panel depending on whether or not all units are tracked for the same number of time periods. If the same set of units is tracked throughout the study, it’s called a fixed panel but if the units change during the study, it’s called a rotating panel.
Panel data sets usually arise out of longitudinal studies. The Framingham Heart Study is possibly the most well known example of a longitudinal study that has been running since 1948.
In this chapter, we’ll look at a real world panel data set containing the Year-over-Year % growth in per capita GDP of seven countries measured from 1992 through 2014. Along with GDP growth data, the panel also contains Y-o-Y % growth in Gross Capital Formation in each country:
In the above data set, the unit is a country, the time frame is 1992 through 2014 (23 time periods), and the panel data is fixed and balanced.
The set of data points pertaining to one unit (one country) is called a group. In the above data panel, there are seven groups.
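As a quick illustration of these definitions, the following sketch counts the groups in a flattened panel and checks whether the panel is balanced. The tiny DataFrame and the helper name `describe_panel` are made up for illustration; only the column names `COUNTRY` and `YEAR` mirror the data set used in this chapter.

```python
import pandas as pd

def describe_panel(df, unit_col, time_col):
    """Return (number of groups, periods per group, is_balanced) for a flattened panel."""
    # Number of distinct time periods observed for each unit (group)
    periods_per_unit = df.groupby(unit_col)[time_col].nunique()
    # Treat the panel as balanced when every unit is observed
    # for every time period that appears anywhere in the panel
    is_balanced = bool((periods_per_unit == df[time_col].nunique()).all())
    return len(periods_per_unit), periods_per_unit.to_dict(), is_balanced

# Hypothetical toy panel: two units, the second one is missing 1994
toy = pd.DataFrame({
    'COUNTRY': ['A', 'A', 'A', 'B', 'B'],
    'YEAR':    [1992, 1993, 1994, 1992, 1993],
})
n_groups, periods, balanced = describe_panel(toy, 'COUNTRY', 'YEAR')
print(n_groups, periods, balanced)
```

Applied to the World Bank panel described above, a check of this kind should report seven groups of 23 periods each and a balanced panel.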
A Regression model for per-capita GDP growth
Suppose we wish to investigate the influence of Y-o-Y % growth in gross capital formation on Y-o-Y % growth in GDP.
Our dependent or response variable y is Y-o-Y % growth in per capita GDP. The independent or explanatory variable X is Y-o-Y % growth in gross capital formation.
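In notation form, the Y-o-Y % growth in per capita GDP can be expressed as a function of Y-o-Y % growth in gross capital formation as follows:
y_i_t = f(x_i_t) + ϵ_i_t
Here, y_i_t and x_i_t denote the values of y and X for country i during year t.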
In the above regression equation, ϵ_i_t is the error term of regression and it captures the variance in Y-o-Y Growth in per capita GDP of country i during year t that the model isn’t able to “explain”.
Let’s create a scatter plot of y versus X to see what the data looks like.
We’ll start by importing all the required Python packages including ones we would use later on to construct the Fixed Effects model.
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
import seaborn as sns
df_panel = pd.read_csv('wb_data_panel_2ind_7units_1992_2014.csv', header=0)
We’ll use Seaborn to plot per capita GDP growth across all time periods and across all countries versus gross capital formation growth in each country:
colors = ['blue', 'red', 'orange', 'lime', 'yellow', 'cyan', 'violet']
sns.scatterplot(x=df_panel['GCF_GWTH_PCNT'], y=df_panel['GDP_PCAP_GWTH_PCNT'],
                hue=df_panel['COUNTRY'], palette=colors).set(
    title='Y-o-Y % Change in per-capita GDP versus Y-o-Y % Change in Gross capital formation')
plt.show()
We see the following plot:
The Y-o-Y % growth in per capita GDP appears to be linearly related to the Y-o-Y % growth in gross capital formation, so we’ll assume the following linear functional form for our regression model for each unit (country) i:
y_i = X_i β_i + ϵ_i
In the above equation, all variables are matrices of a certain dimension. Assuming n units, k regression variables per unit, and T time periods per unit, the dimensions of each matrix variable in the above equation are as follows:
- y_i is the response variable (per capita GDP growth) for unit i. It is a column vector of size [T x 1].
- X_i is the regression variables matrix of size [T x k].
- β_i is the coefficients matrix of size [k x 1] containing the population value of the coefficients for the k regression variables in X_i.
- ϵ_i is a column vector of size [T x 1] containing the error terms, one error for each of the T time periods.
Following is the matrix form of the above equation for unit i:
In our example, T=23, k=1 and n=7.
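To make these dimensions concrete, here is a minimal NumPy sketch for a single unit with T=23 and k=1. The numbers are random placeholders rather than the World Bank data, and the coefficient value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 23, 1                        # time periods and regression variables per unit

X_i = rng.normal(size=(T, k))       # regression variables matrix, [T x k]
beta_i = np.array([[0.5]])          # coefficients, [k x 1] (arbitrary value)
eps_i = rng.normal(size=(T, 1))     # error terms, [T x 1]

y_i = X_i @ beta_i + eps_i          # response vector, [T x 1]
print(y_i.shape)
```

The matrix product X_i β_i collapses the k columns of X_i into a single column, which is why y_i and ϵ_i must both have T rows and one column.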
Let’s focus our attention on the error terms of the model, ϵ_i. The following are the important sources of errors:
- Errors are introduced due to random environmental noise, by the measuring apparatus, or by the experimenter using the measuring apparatus incorrectly.
- Errors are introduced due to the omission of explanatory variables which were observable and measurable. These variables would have been able to ‘explain’ some of the variance in the response variable y, and therefore their omission from the X matrix causes the unexplained variance to ‘leak’ into the error term of the regression model.
- Errors are introduced due to an incorrect functional form or missing variable transformations for some of the regression variables or for the response variable. For example, suppose we need to regress the logarithm of GDP change on the gross capital formation change but we fail to log transform the response variable.
- There is always the possibility that our choice of regression model is wrong. For example, if the correct model happens to be the Nonlinear Least Squares model but instead we use the OLS linear regression model, it would lead to additional regression errors.
- Finally, there will be errors introduced due to the omission of variables that are not measurable. Such variables represent qualities that are intrinsic to the unit being measured. For our countries data panel where the unit is the country, examples of unit-specific variables are the socioeconomic fabric of the country that fuels or inhibits GDP growth under different environmental circumstances, and the cultural aspects of decision-making in business and government that have evolved over hundreds of years in that country. All such factors impact the Y-o-Y % change in GDP but they cannot be directly measured. However, the omission of such factors from the regression matrix X has the same effect as the omission of observable variables described in the second item of this list, that is, their effect leaks into additional variance observed in the error term.
Keeping the above commentary in mind, we can express the general form of the linear regression model for country i as follows:
y_i = X_i β_i + Z_i γ_i + ε_i
In the above equation:
- y_i is a matrix of size [T x 1] containing the T observations for country i.
- X_i is a matrix of size [T x k] containing the values of k regression variables all of which are observable and relevant.
- β_i is a matrix of size [k x 1] containing the population (true) values of regression coefficients for the k regression variables.
- Z_i is a matrix of size [T x m] containing the (theoretical) values of all the variables (m in number) and effects that cannot be directly observed.
- γ_i is a matrix of size [m x 1] containing the (theoretical) population values of regression coefficients for the m unobservable variables.
- ε_i is a matrix of size [T x 1] containing the errors corresponding to the T observations for country i.
Here is what the matrix multiplications and additions look like:
All unit-specific effects are assumed to be introduced by the term Z_iγ_i. The matrix Z_i and its coefficients vector γ_i are purely theoretical terms since what they represent cannot be in reality observed and measured.
Our objective is to find a way to estimate the impact of all unobservable effects contained in Z_i on y, i.e. we need to estimate the impact of the Z_iγ_i term of the regression equation on y_i.
To simplify the estimation, we’ll combine the effect of all country-specific unobservable effects into one variable which we will call z_i for country i. z_i is a matrix of size [T x 1] since it contains only one variable z_i and it has T rows corresponding to T number of “measurements” of z_i for T time periods.
Since z_i is not directly observable, in order to measure the effects of z_i, we need to formalize the effect of leaving out z_i. Fortunately, there is a well-studied concept in statistics called the omitted-variable bias which we can use for this purpose.
Omitted variable bias
While training the model on the panel data set, if we leave out z_i from the model, it will cause what is known as the omitted variable bias. It can be shown that if the regression model is estimated without considering z_i, then the estimated values β_cap_i of the coefficients β_i will be biased as follows:
One can see that the bias introduced in the estimated value β_cap_i is proportional to the covariance between the omitted variable z_i and the explanatory variables X_i.
The above equation suggests an approach for constructing two kinds of models, the Fixed Effects model and the Random Effects model, depending on whether or not the Covariance term in the above equation is zero, i.e. whether or not the unobservable effects z_i are correlated with the regression variables.
In the rest of this chapter, we’ll focus on the Fixed Effects model, while in the next chapter, I’ll explain how to build and train the Random Effects model.
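The omitted variable bias described in this section can be illustrated with a small simulation. The sketch below is not based on the World Bank data: it generates a single regression variable x that is correlated with an "unobservable" variable z, leaves z out of the regression, and compares the resulting slope with the standard single-regressor result that the slope converges to β + γ·Cov(x, z)/Var(x). All coefficient values and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
beta, gamma = 2.0, 1.5                 # true coefficients (arbitrary)

z = rng.normal(size=N)                 # the "unobservable" variable
x = 0.8 * z + rng.normal(size=N)       # x is correlated with z
y = beta * x + gamma * z + rng.normal(size=N)

# Slope of a simple regression of y on x (with intercept), i.e. with z left out
var_x = np.var(x, ddof=1)
beta_hat_omitted = np.cov(x, y)[0, 1] / var_x

# Slope predicted by the omitted variable bias formula
predicted = beta + gamma * np.cov(x, z)[0, 1] / var_x

print(round(beta_hat_omitted, 3), round(predicted, 3))
```

The two printed values should agree closely and both sit well above the true slope of 2.0. If x were generated independently of z, the covariance term would be close to zero and the bias would vanish, which is the situation the Random Effects model assumes.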
The Fixed Effects Regression Model
In this model, we assume that the unobservable individual effects z_i are correlated with the regression variables. In effect, it means that the Covariance(X_i, z_i) in the above equation is non-zero.
In many panel data studies, this assumption about correlation is a reasonable one to make. For example, in a stock trading scenario, a trader’s trading acumen or “knack” for making a profit is unmeasurable and unique to that individual. This acumen or knack can be presumed to vary with measurable factors such as age and education level. One may propose (rightly or wrongly) that the process of getting an advanced degree boosts one’s intrinsic acumen or knack at performing some task.
In the Fixed Effects model, we also assume that the bias introduced due to the omission of the unit-specific factors is group-specific.
To compensate for this bias, we will introduce a group-specific intercept called c_i into the model. c_i is assumed to act in a direction that is opposite (in a vector sense) to the effect of the omitted-variable bias.
With these two assumptions in place, we will express the Fixed Effects regression model’s equation as follows:
y_i = X_i β_i + c_i + ϵ_i
Here is the matrix form:
Notice that we have replaced the z_iγ_i term in the earlier equation which represented the effect of the unobservable factors, with c_i which is a unit specific matrix of size [T x 1]. For a given unit i, each element of this matrix has the same value c_i and c_i is assumed to be constant across all time periods.
For a particular time period t, the Fixed Effects model’s equation can be expressed as follows:
y_i_t = x_i_t β_i + c_i + ϵ_i_t
Here’s the matrix form:
In this form, y_i_t, c_i and ϵ_i_t are scalars as they pertain to a specific observation at time t, and x_i_t is the t-th row vector of size [1 x k] in the X_i matrix. We assume there are k regression variables represented in the X matrix.
Estimates (c_cap_i) of unit-specific effects (c_i) are random variables
Notice that c_i does not carry the time subscript t as it is the same for a given country for all time periods T. Having said that, the estimated value c_cap_i of the country-specific effect c_i is just as much a random variable as any coefficient in the estimated coefficients matrix β_cap_i. To see why, imagine that the fixed effects model is trained hundreds of times, each time on a different, randomly chosen (but continuous) sub-set of the panel data set. After each training run, all estimated coefficients β_cap_i and the estimated unit-specific effect c_cap_i will attain a somewhat different set of values. If we plot all these estimated values of c_i from different training runs, their frequency distribution will have some shape having a certain mean value and some variance. For example, we may theorize that they are normally distributed around the true population level values of the respective coefficients in β_i and c_i. Thus, the estimated unit-specific effect c_cap_i behaves like a random variable having some probability distribution.
In the Fixed Effects model, we assume that the estimated values of all unit-specific effects have the same constant variance σ². It is also convenient (although not necessary) to assume a normally distributed c_cap_i. Thus, we have:
c_cap_i ~ N(c_i, σ²)
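The idea that c_cap_i is a random variable can be sketched with a simulation. Instead of re-training on subsets of one data set as described above, the sketch below takes a simpler route and draws 500 fresh hypothetical panels from a Fixed Effects model with known (arbitrary) values of the slope and of c_i for three units, and estimates the unit-specific effects each time by least squares on the regression variable plus one dummy per unit. Every number in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 3, 23                             # units and time periods (hypothetical panel)
beta = 0.4                               # common slope (arbitrary)
c_true = np.array([1.0, 2.0, 3.0])       # true unit-specific effects (arbitrary)
D = np.kron(np.eye(n), np.ones((T, 1)))  # [nT x n] unit dummies, rows ordered unit by unit

c_hats = []
for _ in range(500):                     # 500 simulated "training runs"
    x = rng.normal(size=n * T)
    y = beta * x + D @ c_true + rng.normal(scale=0.5, size=n * T)
    design = np.column_stack([x, D])     # no intercept, so all n dummies can stay
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    c_hats.append(coefs[1:])             # drop the slope, keep the n estimated effects
c_hats = np.array(c_hats)

print(c_hats.mean(axis=0))   # centered close to c_true
print(c_hats.std(axis=0))    # non-zero spread: each estimate is a random variable
```

Plotting a histogram of each column of `c_hats` would give one bell-shaped curve per unit, each centered near the corresponding true c_i.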
The following figure illustrates the probability distributions of c_i for three units in a hypothetical panel data set:
What if there are also some observable variables that are omitted?
In practice, the X matrix is often incomplete. One may have omitted one or more observable variables from the model for a variety of reasons. Perhaps the cost of measuring a variable w.r.t. its presumed effect on y is prohibitive. Perhaps there are moral reasons for not measuring some variable. Or a variable may have been left out of X just out of plain oversight on the part of the experimenter.
In such a case, their omission will bias all the parameter estimates of the fitted model including the estimated value of the unit-specific factor c_i for all units.
Estimating the Fixed Effects regression model
Estimation of a Fixed Effects model involves estimating the coefficients β_i and the unit-specific effect c_i for each unit i.
In practice, we pool together the models of all units into one common regression model by adding unit-specific dummy variables d_1, d_2,…,d_n corresponding to the n units or groups as follows:
y_i_t = x_i_t β_i + d_i_t c_i + ϵ_i_t
In the above equation:
- y_i_t is a scalar containing a specific observation for unit (country) i at time t.
- x_i_t is a row vector of size [1 x k] containing the values of all k regression variables for unit i at time t.
- β_i is a column vector of size [k x 1] containing the population (true) values of regression coefficients for the k regression variables.
- d_i_t is a row vector of size [1 x n] containing one-hot-encoded dummy variables d_i_j_t, where j goes from 1 through n — one dummy variable for each of the n units in the data panel. For example: d= [0 1 0 0 0 0 0] is the dummies vector for unit #2. The idea is that the j-th element of the dummies vector should be 1 when j=i and 0 otherwise.
- c_i is a column vector of size [n x 1] containing the population values of unit-specific effects associated with the n units.
- ϵ_i_t is a scalar containing the error term of regression for unit i at time t.
The above model is a linear model and can be easily estimated using the OLS regression technique. This type of a linear regression model with dummy variables is called Least Squares with Dummy Variables (LSDV for short).
Model training involves doing the following:
- Pool together the unit specific matrices y_i, X_i, β_i, d_i, c_i and ϵ_i for all n units into one model.
- Train the pooled model to generate estimates for the coefficients vector β of size [k x 1] corresponding to the k regression variables, and also the estimates for the unit-specific effects vector c of size [n x 1] for the n units contained in the data panel.
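The two steps above can be sketched on a toy panel. The data below is hypothetical and deliberately noise-free (y = 2x + c_i, with c_A = 1 and c_B = 5) so that the least squares solution recovers the slope and the unit-specific effects exactly. The sketch uses NumPy's least squares solver on the pooled design matrix rather than Statsmodels, and because it does not include a separate intercept, it keeps all n dummies.

```python
import numpy as np
import pandas as pd

# Hypothetical noise-free toy panel: y = 2*x + c_i with c_A = 1 and c_B = 5
toy = pd.DataFrame({
    'UNIT': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x':    [0.0, 1.0, 2.0, 0.0, 1.0, 3.0],
})
toy['y'] = 2.0 * toy['x'] + toy['UNIT'].map({'A': 1.0, 'B': 5.0})

# One-hot encoded dummies, one column per unit: the pooled dummies matrix of size [nT x n]
d = pd.get_dummies(toy['UNIT'], dtype=float)

# Pooled design matrix [x | d]
design = np.column_stack([toy[['x']].to_numpy(), d.to_numpy()])
coefs, *_ = np.linalg.lstsq(design, toy['y'].to_numpy(), rcond=None)

beta_hat = coefs[0]                       # common slope
c_hat = dict(zip(d.columns, coefs[1:]))   # unit-specific effects
print(beta_hat, c_hat)
```

With real, noisy data the estimates would only approximate the population values, and their standard errors would come from the fitted model's summary.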
The common coefficients assumption
In the pooled model, we are making the implicit and important assumption that the estimated coefficients β_cap are common for all n units. The Chow test can be used to test this assumption (although we’ll not go into it here).
For the World Bank countries data panel, what the poolability assumption means is that the population value of the slope (β) of the gross capital formation change (GCF_GWTH_PCNT) is the same for each country. In other words, a unit change in GCF_GWTH_PCNT is expected to translate into the same amount of change in the % GDP for each country. Therefore, it is the country-specific effect c_i and the error term ϵ_i_t that are likely to cause the total % GDP change to vary across different countries for each unit change in GCF_GWTH_PCNT.
This behavior is a direct outcome of the common coefficients assumption and it happens to be an important but not immediately obvious characteristic of Fixed Effects models.
Here is the final thing to remember about the FE model before we dive into the tutorial section of the chapter:
The estimates generated from training the Fixed Effects regression model apply to only the units that are in the panel data set. The estimates from the Fixed Effects model do not generalize to other units of the same nature in the population.
What this means for the countries data panel is that the estimates of β and c_i apply to only the 7 countries in the data panel. One should not generalize the country-specific effect c_cap_i that is estimated by training the FE model on the data set to represent in any way the country-specific effect for any country that is not represented in the data set.
If we want the unit-specific effects to carry through to the population of similar units, the Random Effects model may be more suitable.
How to build a Fixed Effects regression model using Python and Statsmodels
Let us build and train a Fixed Effects model for the World Bank data panel.
We’ll continue using the Pandas DataFrame that we loaded at the beginning of the chapter. We will build and train the FE model on the flattened-out version of the panel data set, which looks like this:
Notice that in this flattened version, there is a column for the unit (country) and one for the time period (year).
Printing out the Pandas Dataframe reveals this structure:
Let’s create the country-specific dummy variables:
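unit_col_name = 'COUNTRY'
time_period_col_name = 'YEAR'
# Create the dummy variables, one for each country.
# dtype=int keeps the dummies as numeric 0/1 columns; recent pandas versions
# default to booleans, which the formula interface treats as categorical.
df_dummies = pd.get_dummies(df_panel[unit_col_name], dtype=int)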
Join the dummies Dataframe to the panel data set:
df_panel_with_dummies = df_panel.join(df_dummies)
Here’s what the data panel with dummies looks like:
Define the y and X variable names:
y_var_name = 'GDP_PCAP_GWTH_PCNT'
X_var_names = ['GCF_GWTH_PCNT']
Define the units (countries) of interest:
unit_names = ['Belgium', 'CzechRepublic', 'France', 'Ireland', 'Portugal', 'UK', 'USA']
unit_names.sort()
Construct the regression equation. Note that we are leaving out one dummy variable so as to avoid perfect Multicollinearity between the 7 dummy variables. The regression model’s intercept will hold the value of the coefficient for the omitted dummy variable for USA.
# Response variable on the left; on the right, the regression variables
# followed by all dummy variables except the last one (USA)
lsdv_expr = y_var_name + ' ~ ' + ' + '.join(X_var_names + unit_names[:-1])
print('Regression expression for OLS with dummies=' + lsdv_expr)
We see the following output:
Regression expression for OLS with dummies=GDP_PCAP_GWTH_PCNT ~ GCF_GWTH_PCNT + Belgium + CzechRepublic + France + Ireland + Portugal + UK
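The reason for leaving out one dummy can be checked numerically. The sketch below builds the dummy columns for a balanced panel of 7 units and 23 periods (the dimensions of our data panel, but with no actual data) and compares the rank of the design matrix with and without the seventh dummy. The regression variable is left out because it plays no part in the collinearity between the intercept and the dummies.

```python
import numpy as np

n, T = 7, 23
D = np.kron(np.eye(n), np.ones((T, 1)))     # [161 x 7] one-hot dummies, one column per unit
intercept = np.ones((n * T, 1))

with_all = np.column_stack([intercept, D])                  # intercept + 7 dummies: 8 columns
with_one_dropped = np.column_stack([intercept, D[:, :-1]])  # intercept + 6 dummies: 7 columns

print(with_all.shape[1], np.linalg.matrix_rank(with_all))
print(with_one_dropped.shape[1], np.linalg.matrix_rank(with_one_dropped))
```

With all 7 dummies, the 8 columns have a rank of only 7 because the dummies sum to the intercept column, so the coefficients are not uniquely determined. Dropping one dummy leaves 7 columns of rank 7, which is a full-rank design.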
Build and train an LSDV model on the panel data containing dummies:
lsdv_model = smf.ols(formula=lsdv_expr, data=df_panel_with_dummies)
lsdv_model_results = lsdv_model.fit()
print(lsdv_model_results.summary())
We see the following output:
How to Interpret the Fixed Effects model’s training output
Statistical significance of estimated coefficients
The first thing to look at are the fitted model’s coefficients:
We see that the coefficient for the Y-o-Y % change in Gross Capital Formation (GCF_GWTH_PCNT) is significant at p < .001, as indicated in the P > |t| column. That is good news.
Estimated values of country-specific effects
Next, let’s look at the intercept and the coefficients of the 6 dummy variables, which together represent the country-specific effects of the 7 countries. Estimating these effects is the whole reason we built this model.
We observe that the Intercept of regression which represents the country-specific effect for USA (the omitted variable) is 0.6693 and it is statistically significant (meaning, its population value is estimated to be non-zero), at a p-value of 0.041.
The coefficient for the dummy variable for Ireland is 1.3879 and it is significant at a p-value of 0.003. The actual country-specific effect for Ireland is calculated as 0.6693+1.3879 = 2.0572.
The coefficients for the dummy variables that represent the rest of the countries (Belgium, the Czech Republic, France, Portugal and UK) are not statistically significant at a significance level of 0.05. What that means is that their country-specific effects c_i cannot be statistically distinguished from that of USA (0.6693) and, for practical purposes, can be considered to be identical to it.
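The arithmetic behind these country-specific effects can be sketched as a small helper. The function name `unit_effects` is hypothetical, and only the two coefficient values quoted above are used as inputs. With the fitted model at hand, the same inputs could be read from `lsdv_model_results.params`.

```python
def unit_effects(intercept, dummy_coefs, baseline):
    """Convert LSDV output into per-unit effects: the baseline unit gets the
    intercept, every other unit gets the intercept plus its dummy coefficient."""
    effects = {baseline: intercept}
    for unit, coef in dummy_coefs.items():
        effects[unit] = intercept + coef
    return effects

# Only the two values quoted in the text are used here
effects = unit_effects(intercept=0.6693, dummy_coefs={'Ireland': 1.3879}, baseline='USA')
print({unit: round(value, 4) for unit, value in effects.items()})
```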
Here is the table of the estimated country-specific effects (c_cap_i) for all 7 countries in the data panel:
Goodness of fit of the Fixed Effects model
Let us now analyze the goodness-of-fit of the FE model from a variety of angles and see how well it measures up.
The first thing we’ll look at is the adjusted-R-squared value which is reported as 0.639:
The adjusted R-squared measures the fraction of the variance in the response variable y that the model was able to explain after accounting for the degrees of freedom lost due to the presence of regression variables (this model has 7 of those). The adjusted-R-squared of 0.639 (or about 64%) suggests a decent fit but not a very good fit. In the chapter on the Pooled OLS regression model, we had fitted a Pooled OLS model on the same panel data set and it came out with an adjusted R-squared of 0.619. In terms of the goodness-of-fit, the FE model seems to have improved upon the Pooled OLS model by a small amount. We will corroborate this fact in one more way soon.
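For reference, the adjustment for degrees of freedom can be written as a one-line function. This is the usual textbook formula for a model with an intercept, and the R-squared value passed in below is purely illustrative; it is not the R-squared of the fitted FE model.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and n_regressors other variables."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Illustrative input only: an R-squared of 0.5 is NOT taken from the fitted model.
# n_obs=161 and n_regressors=7 match the dimensions of the FE model in this chapter.
print(adjusted_r_squared(0.5, n_obs=161, n_regressors=7))
```

The penalty grows with the number of regression variables, which is why adding six dummies to the Pooled OLS model does not automatically raise the adjusted R-squared.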
Next, let’s look at the F-statistic reported in the training summary:
The F-test for regression analysis tests whether all model coefficients are jointly significant and therefore if the goodness-of-fit of the FE model is better than that of the intercept only (a.k.a. mean) model. We see that the F-test’s statistic of 41.48 is significant at a p < .001 thereby implying that the model’s goodness-of-fit is indeed better than that of the mean model.
Log-likelihood and the AIC score
Let’s now look at the Log-likelihood and the AIC score measures:
These values by themselves are meaningless. We need to compare them with the corresponding values of a competing model. Our competing model is the Pooled OLS regression model which we had trained in the earlier chapter on the same WB panel data set. Here is a side-by-side comparison of the Log-LL and AIC scores of the two models:
The log-LL and the AIC of the FE model are respectively slightly larger and slightly smaller than those for the Pooled OLSR model which is an indication that the goodness-of-fit of the FE model is somewhat better than that of the Pooled OLSR model, although not by a big margin.
Testing for the significance of the fixed effect using the F-test
Finally, let’s directly test the null hypothesis of the Fixed Effects model that all country-specific effects c_i are jointly zero, meaning that, in reality, there is no fixed effect at play in this data set. We can do this test by running an F-test between two models, a restricted model and an unrestricted model:
- The restricted model (the one with fewer variables) is the Pooled OLSR model covered in the earlier chapter, and,
- The unrestricted model is the Fixed Effects model.
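The test statistic for the F-test is calculated as follows:
F = [(RSS_restricted_model - RSS_unrestricted_model) / (k_2 - k_1)] / [RSS_unrestricted_model / (N - k_2)]
In the above formula: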
- RSS_restricted_model = the sum of squares of residual errors of the restricted (lesser number of parameters) model.
- RSS_unrestricted_model = the sum of squares of residual errors of the unrestricted (greater number of parameters) model.
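- k_1 = degrees of freedom of the restricted model, taken here as the number of parameters it estimates (including the intercept)
- k_2 = degrees of freedom of the unrestricted model, counted the same way. So, k_2 is necessarily greater than k_1.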
- N = number of data samples
The F-statistic obeys the F-distribution with (k_2 - k_1, N - k_2) degrees of freedom.
Let’s perform the F-test. We’ll start by setting up the variables for calculating the F-test:
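# n = number of groups
n = len(unit_names)
# T = number of time periods per unit (shape[0] is the number of rows)
T = df_panel.shape[0] / n
# N = total number of rows in the panel data set
N = n * T
# k = number of regression variables of the Pooled OLS model including the intercept
k = len(X_var_names) + 1
Get the Residual Sum of Squares for the Pooled OLS model. Here, pooled_olsr_model_results is the fitted results object of the Pooled OLS model from the earlier chapter, so that model must have been fitted on the same data panel before this line is run: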
ssr_restricted_model = pooled_olsr_model_results.ssr
Get the Residual Sum of Squares for the Fixed Effects model:
ssr_unrestricted_model = lsdv_model_results.ssr
Get the degrees of freedom of the Pooled OLSR model:
k1 = len(pooled_olsr_model_results.params)
Get the degrees of freedom of the Fixed Effects model:
k2 = len(lsdv_model_results.params)
Calculate the F statistic:
f_statistic = ((ssr_restricted_model - ssr_unrestricted_model) / ssr_unrestricted_model) * ((N - k2) / (k2 - k1))
print('F-statistic for FE model=' + str(f_statistic))
Calculate the critical value of the F-distribution at alpha=.05:
alpha = 0.05
f_critical_value = st.f.ppf((1.0 - alpha), (k2 - k1), (N - k2))
print('F test critical value at alpha of 0.05=' + str(f_critical_value))
We see the following output:
F-statistic for FE model=2.448840073192174
F test critical value at alpha of 0.05=2.158306686033669
We see that the F-statistic is greater than the critical value at alpha=.05. We therefore reject the null hypothesis that the country-specific effects are jointly zero, which leads us to the conclusion that the goodness-of-fit of the LSDV Fixed Effects model is better than that of the Pooled OLSR model.
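An equivalent way to reach the same decision is to convert the F-statistic into a p-value and compare it with alpha. The sketch below does this with SciPy's survival function for the F-distribution. The values of k1, k2 and N are not read from the fitted models; they are derived here from the model definitions in this chapter (2 parameters for the Pooled OLS model, 8 for the FE model, and 7 x 23 = 161 observations), so treat them as assumptions if your models differ.

```python
import scipy.stats as st

f_statistic = 2.448840073192174   # value reported in the output above
k1, k2, N = 2, 8, 161             # parameters of each model and number of observations

# Survival function = P(F > f_statistic) under the null hypothesis
p_value = st.f.sf(f_statistic, k2 - k1, N - k2)
print(p_value)
```

A p-value below 0.05 corresponds to the F-statistic exceeding the critical value at alpha=.05, so both checks lead to the same conclusion.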
Here is the complete source code used in this chapter: