And a detailed analysis of its goodness-of-fit using Python and statsmodels
In this chapter, we’ll get to know panel data sets, and we’ll learn how to build and train a Pooled OLS regression model for a real-world panel data set using statsmodels and Python.
After training the Pooled OLSR model, we’ll learn how to analyze the goodness-of-fit of the trained model using Adjusted R-squared, Log-likelihood, AIC and the F-test for regression. We will take an even deeper look at the goodness-of-fit of the model via a detailed analysis of its residual errors.
Along with the Fixed Effects, Random Effects, and Random Coefficients models, the Pooled OLS regression model is one of the commonly used models for panel data sets. In fact, the Pooled OLSR model often serves as the reference or baseline model against which the performance of other panel models is compared.
This chapter is PART 1 of the following three part section on Panel Data Analysis:
- The Pooled OLS Regression Model For Panel Data Sets
- The Fixed Effects Regression Model For Panel Data Sets
- The Random Effects Regression Model for Panel Data Sets
What is panel data?
A panel data set contains data that is collected over a period of time for one or more uniquely identifiable individuals or “things”. In panel data terminology, each individual or “thing” for which data is collected is called a unit.
Here are three real world examples of panel data sets:
The Framingham Heart Study: The Framingham heart study is a long-running experiment that was started in 1948 in the city of Framingham, Massachusetts. Each year, health data from 5000+ individuals is captured with the goal of identifying risk factors for cardiovascular disease. In this data set, the unit is a person.
The Grunfeld Investment Data: This is a popular research data set that contains corporate performance data of 10 US companies that was accumulated over a period of 20 years. In this data set, the unit is a company.
The British Household Panel Survey: This is a survey of a sample of British households. Since 1991, members of each sampled household were asked a set of questions and their responses were recorded. The same sample of households was interviewed again each subsequent year. The goal of the survey is to analyze the effects of socioeconomic changes happening in Britain on British households. In this data set, the unit is a household.
While building a panel data set, researchers measure one or more parameters called variables for each unit and record their values in a tabular format. Examples of variables are sex, race, weight and lipid levels for individuals or employee count, outstanding shares and EBITDA for companies. Notice that some variables may change across time periods, while others stay constant.
What results from this data collection exercise is a three-dimensional data set in which each row represents a unique unit, each column contains the data from one of the measured variables for that unit, and the z-axis contains the sequence of time periods over which the unit has been tracked.
Panel data sets arise out of longitudinal studies in which the researchers wish to study the impact of the measured variables on one or more response variables such as the yearly investment made by a company, or GDP growth of a country.
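As a concrete (and entirely made-up) illustration of this unit × variable × time structure, a panel can be represented in pandas as a DataFrame with a two-level (unit, year) index. All names and numbers below are purely illustrative:

```python
import pandas as pd

# A toy balanced panel: 2 units ('A' and 'B') tracked over 3 years.
panel = pd.DataFrame({
    'unit': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2001, 2002, 2003, 2001, 2002, 2003],
    'x':    [1.0, 1.5, 1.2, 2.0, 2.1, 1.9],   # a time-varying explanatory variable
    'y':    [3.0, 3.4, 3.1, 5.0, 5.2, 4.8],   # the response variable
}).set_index(['unit', 'year'])

# Each (unit, year) pair indexes one row; each column holds one variable.
print(panel)
```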
A real world panel data set
The following panel data set contains the year-over-year % growth in per capita GDP of seven countries measured from 1992 through 2014. Along with GDP growth data, the panel also contains Y-o-Y % growth in Gross Capital Formation in each country:
In the above data set, each country (the “unit”) is tracked over the same number of time periods resulting in what is called a balanced panel. An imbalanced or unbalanced panel is one in which different units are tracked over different time periods.
The above data set is also an example of a fixed panel (as opposed to a rotating panel) because we are tracking the same set of countries in each time period.
In the rest of this chapter, we’ll look at data panels that are fixed and balanced.
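As an aside, one quick way to check whether a flattened panel is balanced is to verify that every unit is observed over the same set of time periods. Here is a minimal sketch (the helper is my own, not from the chapter; the column names mirror the World Bank panel):

```python
import pandas as pd

def is_balanced(df, unit_col, time_col):
    """True if every unit appears in exactly the same set of time periods."""
    periods_per_unit = df.groupby(unit_col)[time_col].apply(frozenset)
    return periods_per_unit.nunique() == 1

# Demo on a toy flattened panel:
toy = pd.DataFrame({
    'COUNTRY': ['US', 'US', 'UK', 'UK'],
    'YEAR':    [2001, 2002, 2001, 2002],
})
print(is_balanced(toy, 'COUNTRY', 'YEAR'))                 # balanced panel
print(is_balanced(toy.drop(index=3), 'COUNTRY', 'YEAR'))   # UK loses 2002: unbalanced
```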
In a panel data set, the set of data points pertaining to one unit is called a group. Often, the words unit and group are used interchangeably in discussions about panel data sets.
Circling back to the World Bank data set, suppose we wish to study the relationship between growth in gross capital formation and a country’s GDP growth. To that end, we form the following regression goal:
Our Regression goal
Our goal is to precisely define the relationship between growth in gross capital formation in a country and the yearly GDP growth experienced by that country.
Our Regression strategy
Our strategy will be to choose and fit a regression model that is suitable for panel data sets, and specifically for the WB panel data. Our regression model should allow us to express yearly GDP growth experienced by country i at time period (year) t, as some function f(.) of yearly growth in gross capital formation in country i during time period t.
In notation form:
GDP_PCAP_GWTH_PCNT_it = f(GCF_GWTH_PCNT_it) + ε_it
In the above regression equation, ε_it is the residual error of regression. It captures the variance in the Y-o-Y GDP growth of country i during year t that the model isn’t able to “explain”.
The Pooled OLS regression model
As mentioned earlier, the Pooled OLS regression model is often a good starting point and a reference model for several panel data sets. We’ll study its applicability to the World Bank data set.
To do so, we will “flatten” the panel data by considering Country and Year as two columns so that it looks like this:
Our dependent (endogenous) and explanatory (exogenous) variables are as follows:
Dependent variable y = GDP_PCAP_GWTH_PCNT
Explanatory variable X = GCF_GWTH_PCNT
The data set is available for download over here.
We’ll start by importing all the required packages including ones we will use later on in the chapter:
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsap
from statsmodels.compat import lzip
from statsmodels.stats.diagnostic import het_white
from matplotlib import pyplot as plt
import seaborn as sns
Load the flattened data set into a Pandas Data frame:
df_panel = pd.read_csv('wb_data_panel_2ind_7units_1992_2014.csv', header=0)
Use Seaborn to plot GDP growth over all time periods and across all countries versus gross capital formation growth:
sns.scatterplot(x=df_panel['GCF_GWTH_PCNT'],
                y=df_panel['GDP_PCAP_GWTH_PCNT'],
                hue=df_panel['COUNTRY']).set(
    title='Y-o-Y % Change in per-capita GDP versus Y-o-Y % Change in Gross capital formation')
plt.show()
We see the following plot:
There seems to be a linear relationship between Y-o-Y % growth in GDP and Y-o-Y % growth in gross capital formation across all countries in the data panel. This bodes well for fitting a linear model using the OLS technique.
However, we also observe signs of heteroskedasticity in the response variable GDP_PCAP_GWTH_PCNT. Specifically, the variance in GDP_PCAP_GWTH_PCNT is not constant for different values of GCF_GWTH_PCNT. That does not bode well for using the OLS estimation technique.
At any rate, let’s plow ahead with fitting an OLS regression model to this flattened data panel. Later on in the chapter, we’ll see how to measure the suitability of the model using a battery of goodness-of-fit tests.
The Pooled OLS regression model’s equation is as follows:
GDP_PCAP_GWTH_PCNT_it = β_0 + β_1*GCF_GWTH_PCNT_it + ε_it
The objective of training the model on the panel data set is to find the fitted coefficients β_cap_1 and β_cap_0. The ‘cap’ in β_cap signifies that it is the value of the coefficient that the model has estimated, and not the true (population level) value β which is always unknown.
ε is the residual error of the fitted model and it is a random variable with a certain mean and variance. If the OLS estimation technique does its job correctly, ε will have a mean value of zero and ε will have a constant variance conditioned upon GCF_GWTH_PCNT (i.e. ε will not be heteroskedastic), and ε will not be auto-correlated.
We’ll use statsmodels’ OLS class to build and fit the OLS regression model as follows:
Define the y and X variables:
y_var_name = 'GDP_PCAP_GWTH_PCNT'
X_var_names = ['GCF_GWTH_PCNT']
Carve out the y vector from the data panel:
pooled_y = df_panel[y_var_name]
Carve out the X matrix from the data panel:
pooled_X = df_panel[X_var_names]
Add the placeholder for the regression intercept. When the model is fitted, the coefficient of this variable is the regression model’s intercept β_0.
pooled_X = sm.add_constant(pooled_X)
Build the OLS regression model:
pooled_olsr_model = sm.OLS(endog=pooled_y, exog=pooled_X)
Train the model on the (y, X) data set and fetch the training results:
pooled_olsr_model_results = pooled_olsr_model.fit()
Print the training summary:
print(pooled_olsr_model_results.summary())
We get the following output:
How to interpret the Pooled OLSR model’s training output
The first thing to note is the values of the fitted coefficients: β_cap_1 and β_cap_0
β_cap_0 = 0.9720, and β_cap_1=0.2546
Both coefficients are estimated to be significantly different from 0 at p < .001. This is good news.
The trained Pooled OLS model’s equation is as follows:
GDP_PCAP_GWTH_PCNT = 0.9720 + 0.2546*GCF_GWTH_PCNT + ε
How to interpret the Pooled OLSR model’s performance
We will analyze whether the Pooled OLS model is an adequate model for our regression problem. We will analyze the model’s goodness-of-fit using direct measures and tests such as R-squared and the F-test, and the Log-likelihood and AIC scores, and also indirectly via residual analysis.
Analysis of goodness-of-fit via R-squared, F-test, Log-Likelihood and AIC
The adjusted R-squared, which measures the fraction of the total variance in y that is explained by X after accounting for the degrees of freedom lost due to the inclusion of regression variables, is 0.619, or about 62%. That is certainly not a bad number, but still nothing to be ecstatic about.
The F-test for regression, which measures the joint significance of the model’s parameters, has produced a test statistic of 261.4 with a p-value of 2.15e-35, thereby leading us to conclude that the model’s coefficient estimates are jointly significant at p < .001.
The Log-likelihood of the model is -300.29 and the AIC score is 604.6. These goodness-of-fit values by themselves are meaningless unless we compare them to those of a competing model. In the next chapter, we’ll fit a Fixed Effects model on the same data panel and compare the quality of fit of the FE model with that of the Pooled OLSR model using these two measures.
Analysis of residual errors
Recollect that each raw residual error ε_it = y_obs_it — y_pred_it, i.e. it is the difference between the observed and the predicted value of GDP_PCAP_GWTH_PCNT. Let’s print out the Pandas Series object containing the raw residual errors of the trained model:
print(pooled_olsr_model_results.resid)
Following is the mean value of the residual errors:
print('Mean value of residual errors='+str(pooled_olsr_model_results.resid.mean()))
Mean value of residual errors=3.682354628259836e-16
The mean is practically zero which is the expected outcome of using the OLS estimation technique.
Are the residual errors normally distributed?
Let’s plot the Q-Q plot of the residual errors:
sm.qqplot(data=pooled_olsr_model_results.resid, line='45')
plt.show()
Here is where we observe the first sign of a problem. The Q-Q plot of the residuals is a visual test of normality, and it clearly shows that the residuals of the fitted model are not normally distributed. The visual evidence is backed up by the output of the Jarque-Bera and Omnibus tests for normality shown in the bottom panel of the training summary. Both tests indicate that the residuals are not normally distributed at p < .001.
Even though the residual errors are not normally distributed, the Pooled OLS estimator can still be the Best Linear Unbiased Estimator (BLUE) for the panel data regression problem: normality of residuals is not one of the Gauss-Markov conditions, so non-normality by itself does not affect the BLUE-ness of the OLS regression model.
One disadvantage of having residual errors that are not normally distributed is that one cannot build reliable confidence intervals for the model’s predictions. We can tolerate small departures from normality but large departures invalidate the use of either the Normal or the Student’s t-distributions. Consequently, reliable confidence intervals cannot (and therefore should not) be calculated.
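As an aside, the same Jarque-Bera test reported in the training summary can be run directly on any residual series via statsmodels. Here is a sketch using a deliberately skewed synthetic sample in place of the model’s residuals:

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera

# A skewed stand-in for a non-normal residual series.
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=1.0, size=500)

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(skewed_sample)
print('JB statistic:', jb_stat, 'p-value:', jb_pvalue)
# A tiny p-value rejects the null hypothesis that the sample is normally distributed.
```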
Are the residual errors homoskedastic?
The OLS estimator is not efficient (although it’s still unbiased) if the residual errors of the OLSR model are heteroskedastic, i.e. the variance of residual errors is not constant across all values of X.
Let’s visually inspect the residuals to see if there is any trend present in a plot of the residuals against X:
fig, ax = plt.subplots()
fig.suptitle('Raw residuals of Pooled OLS versus X')
plt.ylabel('Residual (y - mu)')
plt.xlabel('X='+str(X_var_names))
ax.scatter(pooled_X[X_var_names], pooled_olsr_model_results.resid, s=4, c='black', label='Residual Error')
plt.show()
We see the following plot. The residuals do not seem to have a constant variance for different values of X. I have marked up the trend in variance using the red arrows:
The heteroskedasticity can be confirmed by running the White test, in which we regress the square of the residual on X and test the significance of the resulting regression model’s coefficients as follows:
keys = ['Lagrange Multiplier statistic:', 'LM test\'s p-value:',
        'F-statistic:', 'F-test\'s p-value:']
results = het_white(resid=pooled_olsr_model_results.resid, exog=pooled_X)
print('Results of the White test for heteroskedasticity of residual errors ===> ')
print(lzip(keys, results))
We see the following output:
Results of the White test for heteroskedasticity of residual errors ===>
[('Lagrange Multiplier statistic:', 9.918681580385458), ("LM test's p-value:", 0.007017552347485667), ('F-statistic:', 5.186450932829045), ("F-test's p-value:", 0.006582734100668208)]
The LM test’s p-value is about 0.007 (< .01), indicating a rejection of the null hypothesis of the White test that the residuals are homoskedastic.
As mentioned earlier, the Pooled OLS regression model will produce unbiased estimates of the population values even if the residual errors of the fitted model are heteroskedastic. But heteroskedasticity in the residuals violates one of the Gauss-Markov assumptions that make the OLS estimator the Best Linear Unbiased Estimator for the problem at hand. Specifically, when the residuals are heteroskedastic, the OLS estimator becomes inefficient, i.e. it loses the ability to generate estimates having the lowest possible variance amongst all possible linear unbiased estimators. When the residuals are heteroskedastic, the OLS estimator will under- or over-estimate the variance in the parameter estimates, causing the standard errors of the parameter estimates to be mis-specified. Since standard errors are used for calculating confidence intervals, the confidence intervals of the parameter estimates also become incorrect. The same kind of mis-specification is seen in the standard errors and confidence intervals associated with the model’s predictions.
Are the residual errors correlated with the response variable y?
Let’s plot the residual errors against y=GDP_PCAP_GWTH_PCNT:
fig, ax = plt.subplots()
fig.suptitle('Raw residuals of Pooled OLS versus y')
plt.ylabel('Residual (y - mu)')
plt.xlabel('y')
ax.scatter(pooled_y, pooled_olsr_model_results.resid, s=4, c='black', label='Residual Error')
plt.show()
We get the following plot:
There appears to be a linear trend between the residual errors and y. A correlation test using Pearson’s r confirms this visual judgement:
keys = ['Pearson\'s r:', 'p-value:']
results = st.pearsonr(x=pooled_y, y=pooled_olsr_model_results.resid)
print('Results of the Pearson\'s r test of correlation between the residual errors and the response variable y ===>')
print(lzip(keys, results))
We see the following output:
Results of the Pearson's r test of correlation between the residual errors and the response variable y ===>
[("Pearson's r:", 0.6149931069935411), ('p-value:', 3.996454333518694e-18)]
The first value of 0.61499 is the correlation (~61%) between the residuals and y, and the second value of 3.99645e-18 is the p-value of the result. We will ignore the reported p-value since we know the residuals are far from normally distributed. At any rate, the reported correlation of ~61% is by itself substantially greater than zero, and therefore meaningful.
The high degree of correlation between the residual errors of regression and the response variable indicates that our Pooled OLSR model is missing important explanatory variables which would otherwise have been able to “explain” some of this correlation. Whatever variance in the per capita GDP growth rate (y) of the country that the gross capital formation growth rate (X) has not been able to explain has leaked into the residuals, in the form of both a correlation with y and heteroskedasticity.
Are the residuals auto-correlated?
Let’s plot the Auto-Correlation Function (ACF) plot of the residual errors:
tsap.plot_acf(pooled_olsr_model_results.resid)
plt.show()
We see the following plot:
The perfect correlation of 1.0 at lag 0 is to be ignored as a number is always perfectly correlated with itself. But we see significant auto-correlation between residual errors at lags 1, 2 and 3.
Just as with heteroskedasticity in the residuals, auto-correlation in the residuals violates one of the Gauss-Markov assumptions that make the OLS estimator BLUE. Specifically, auto-correlated residual errors cause the standard errors to be mis-specified (under-estimated), which causes the t-values (or z-values) to be over-estimated and the confidence intervals of the parameter estimates to be mis-specified. Coefficients which are in reality zero, i.e. insignificant, may get erroneously reported as non-zero (significant).
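As an aside, a quick numeric check for lag-1 autocorrelation in residuals is the Durbin-Watson statistic (also reported in the statsmodels training summary). A sketch using a synthetic AR(1) series as a stand-in for the residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Build a positively auto-correlated AR(1) series: e_t = 0.7*e_{t-1} + noise.
rng = np.random.default_rng(0)
e = np.zeros(300)
for t in range(1, 300):
    e[t] = 0.7 * e[t - 1] + rng.normal()

dw = durbin_watson(e)
print('Durbin-Watson statistic:', dw)
# DW ranges over [0, 4]; values well below 2 signal positive autocorrelation.
```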
Summary of findings
In summary, we have found that the Pooled OLS regression model which we built for the World Bank data set has the following properties:
- Its adjusted R-squared is around 62% which is not bad for a real-world data set.
- The model’s parameter coefficients are found to be significant at p < .001.
- The F-test indicates that the parameter coefficients are jointly significant at p < .001.
- The residual errors of the model are not normally distributed, implying that the standard errors and confidence intervals associated with the model’s predictions may not be entirely reliable.
- The residual errors are heteroskedastic, implying that the results of the t-test for parameter significance, the corresponding confidence intervals for the parameter estimates, and the results of the F-test are not entirely reliable. The same conclusion holds for the standard errors and confidence intervals associated with the model’s predictions.
- The residual errors are correlated with the response variable y, which means that the model has left out important regression variables that would otherwise have been correlated with y; their absence has caused the remaining correlation to leak into the residual errors.
- The residual errors are auto-correlated at lags 1, 2 and 3, which means that the standard errors of the model’s parameter estimates are likely under-estimated and the reported z-values (or t-values) are correspondingly over-estimated. Functionally, auto-correlation in the residuals implies a general mis-specification of the regression model.
Overall, the analysis of residuals of the Pooled OLSR model points toward a mis-specification of the regression model for the problem at hand. We may be able to do better with one of the other two kinds of regression models that are indicated for panel data sets, namely the Fixed Effects and the Random Effects regression models.
In the next chapter, we’ll do a deep dive into the Fixed Effects regression model and we’ll look at how to build and fit the FE model on the World Bank data set. We’ll compare its goodness-of-fit with that of the Pooled OLSR model.
Here is the download link for the World Bank data set used in this chapter.
And here is the complete source code used in the chapter:
References, Citations and Copyrights
Paper and Book Links
Badi H. Baltagi, Econometric Analysis of Panel Data, 6th Edition, Springer
William H. Greene, Econometric Analysis, 8th Edition, 2018, Pearson