###### A goodness of fit test for regression models

The Chi-Squared test (pronounced as *Kai-*squared as in *Kai*zen or *Kai*ser) is one of the most versatile tests of statistical significance.

Here are some of the uses of the Chi-Squared test:

**Goodness of fit to a distribution:**The Chi-squared test can be used to determine whether your data obeys a known theoretical probability distribution such as the Normal or Poisson distribution.**Goodness of fit of a regression model:**The Chi-squared test can be used to measure the goodness-of-fit of your trained regression model on the training, validation, or test data sets.**Minimum Chi-squared estimation:**For data sets that obey parametric distributions such as the Normal, Poisson or Binomial distributions, the the Chi-squared test can be used to find the parameter range over which the observed values will obey the theoretical distribution.**Test for independence:**The Chi-squared test can be used to test whether two categorical random variables such as Age and Income whose values have been measured in an experiment, are independent of each other.**Residual Analysis:**In certain Generalized Linear Regression Models, the**Pearson residuals**obey a (scaled) Chi-square distribution under the Null hypothesis that the residual errors are Independent, Identically distributed Normal variables indicating a high goodness of fit of the fitted model.**Goodness of fit of nested regression models:**The**Deviance statistic**which can be used to compare the log likelihoods of nested regression models follows a Chi-squared distribution under the Null Hypothesis that adding regression variables doesn’t increase the goodness of fit of the model. So one might be better off with going with the simpler one of the two models.**Relationship to the G-squared test:**The G-test for goodness of fit reduces to a Chi-squared test for goodness of fit test when the observed and expected values are nearly equal to each other.

In the rest of this section, **we’ll focus on the use of the Chi-squared test in regression analysis**.

We’ll use a real world data set of **TAKEOVER BIDS** which is a popular data set in regression modeling literature. We’ll use the *SciPy* and *Statsmodels* libraries as our implementation tools.

## The genesis of the Chi-squared test

The Chi-squared test is based on the **Chi-squared distribution**. The Chi-squared distribution arises from summing up the squares of *n* independent random variables, each one of which follows the standard normal distribution, i.e. each normal variable has a zero mean and unit variance.

In notation form:

The *N(0, 1)* in the summation indicates a normally distributed random variable with a zero mean and unit variance. If you take *k *such variables and sum up the squares of their ‘realized’ values, you get a chi-squared (also called Chi-square) distribution with *k* degrees of freedom.

The unit variance constraint can be relaxed if one is willing to add a 1/variance scaling factor to the resulting distribution.

*𝛘²(k) *distribution has a mean of *k* and a variance of *2k*. The following figure taken from Wikimedia Commons illustrates the shape of *𝛘²(k) *for increasing values of *k*:

The Chi-squared test can used for those test statistics which are proven to asymptotically follow the Chi-square distribution under the Null hypothesis.

Let us now see how to use the Chi-squared goodness of fit test.

## Using the Chi-squared test to test goodness of fit to a known probability distribution

To test whether a given data set obeys a known probability distribution, we use the following test statistic known as the Pearson’s Chi-squared statistic:

Where:

*O_i *is the observed frequency of the *ith *outcome of the random variable.*E_i *is the expected frequency of the *ith *outcome of the random variable.

It can be shown that for ‘large enough’ values of *O_i *and *E_i *and when *O_i *are not very different than *E_i*, i.e. the data is not heavily dispersed, ** T** follows a Chi-square distribution with

*N — p*degrees of freedom where

*N*is the number of categories over which the frequencies are calculated and

*p*is the number of parameters of the theoretical probability distribution used to calculate the expected frequencies

*E_i*. For example, when the theoretical distribution is Poisson,

*p=1*since the Poisson distribution has only one parameter — the mean rate.

## Example

Let’s see how to use this test on an actual data set of observations which we will presuppose are Poisson distributed and we’ll use the Chi-squared goodness of fit test to prove or disprove our supposition.

The data set of observations we will use contains a set of 126 observations of corporate takeover activity that was recorded between 1978 and 1985 . Each observation contains several parameters such as size of the company (in billions of dollars) which experienced the take over event. The dependent ‘** y**’ variable is the number of take over bids that were made on that company. A point to note is that all 126 companies in this data set were eventually taken over within a certain period following the final recorded takeover bid on each company.

The data set can be downloaded from here.

Let’s start by importing all the required Python packages:

```
import pandas as pd
import numpy as np
from patsy import dmatrices
from matplotlib import pyplot as plt
from scipy.stats import poisson
import scipy.stats as stats
import statsmodels.api as sm
```

Let’s read the data set into a Pandas Dataframe:

```
df = pd.read_csv('takeover_bids_dataset.csv', header=0, index_col=[0])
```

Print out the first 15 rows. Each row contains takeover related activity for a unique company:

```
df.head(15)
```

The variables of interest to us are as follows:

## Explanatory variables (X matrix):

*BIDPREM*: The bid premium = Bid price/market price of the stock 15 days prior to the bid.*FINREST:*Indicator variable (1/0) indicating if the ownership structure of the company is proposed to be changed.*INSTHOLD:*Percentage of institutional holding.*LEGLREST:*Indicator variable (1/0) indicating whether the company that was the target of the take over launched any legal defense.*REALREST:*Indicator variable (1/0) indicating if the asset structure of the company is proposed to be changed.*REGULATN:*Indicator variable (1/0) indicating if the US Department of Justice intervened.*SIZE:*Size of the company in billions of dollars*SIZESQ:*Square of the size to account for any non-linearity in size.*WHITEKNT:*Indicator variable (1/0) indicating if the company’s management invited any friendly bids such as used to stave off a hostile takeover.

## Dependent variable (y vector):

*NUMBIDS**: *Integer containing number of takeover bids that were made on the company.

Print out the summary statistics for the dependent variable: NUMBIDS.

```
stats_labels = ['Mean NUMBIDS', 'Variance NUMBIDS', 'Minimum NUMBIDS', 'Maximum NUMBIDS']
stats_values = [round(df['NUMBIDS'].mean(), 2), round(df['NUMBIDS'].var(), 2), df['NUMBIDS'].min(), df['NUMBIDS'].max()]
print(set(zip(stats_labels, stats_values)))
```

We see the following stats:

```
{(‘Mean NUMBIDS’, 1.74), (‘Variance NUMBIDS’, 2.05), (‘Minimum NUMBIDS’, 0), (‘Maximum NUMBIDS’, 10)}
```

We note that the mean of *NUMBIDS* is 1.74 while the variance is 2.05. They are close but not the same. There is a small amount of over-dispersion but it may not be enough to rule out the possibility that *NUMBIDS* might be Poisson distributed with a theoretical mean rate of 1.74.

So let’s form our two hypotheses:

**H0: ***NUMBIDS *follows a Poisson distribution with a mean of 1.74.

**H1: **H0 is false. *NUMBIDS *is not Poisson distributed.

We’ll proceed with our quest to prove (or disprove) H0 using the Chi-squared goodness of fit test. To do so, we’ll use the following procedure:

- Define the two Hypotheses. We have already done that.
- Compute the observed frequencies
*O_i*for each value of the*y=**NUMBIDS*variable. - Assuming a Poisson distributed
, calculate the expected frequencies*y**E_i*for each value of*y=**NUMBIDS*. - Calculate the test statistic that we have presented above .
- Look up the p-value of the test statistic in the Chi-square table. If the p-value is less than 0.05, reject H0 at a 95% confidence level, else accept H0 (
*i.e. NUMBIDS*are Poisson distributed) at the same confidence level.

To calculate the observed frequencies *O_i*, let’s create a grouped data set, grouping by frequency of *NUMBIDS*.

```
grouped_df = df.groupby('NUMBIDS').count().sort_values(by='NUMBIDS')
grouped_df['NUMBIDS_OBS_FREQ'] = grouped_df['SIZESQ']
```

Print the grouped data set:

```
print(grouped_df['NUMBIDS_OBS_FREQ'])
```

We see that the frequencies for *NUMBIDS >= 5* are very less. The Chi-squared test is not accurate for bins with very small frequencies. To get around this issue, we’ll sum up frequencies for all *NUMBIDS >= 5* and associate that number with *NUMBIDS=5.*

```
grouped_df.at[5, 'NUMBIDS_OBS_FREQ'] = 5
```

Let’s also drop the rows for *NUMBIDS > 5* since *NUMBID=5* captures frequencies for all NUMBIDS >=5.

```
grouped_df = grouped_df.drop([6,7,10])
```

Also calculate and store the observed probabilities of *NUMBIDS*.

```
grouped_df = grouped_df.drop([6,7,10])
grouped_df['NUMBIDS_OBS_PROBA'] = grouped_df['NUMBIDS_OBS_FREQ']/len(df)
```

Now calculate and store the expected probabilities of *NUMBIDS *assuming that *NUMBIDS *are Poisson distributed.

```
grouped_df['NUMBIDS_POISSON_PMF'] = poisson.pmf(k=grouped_df.index, mu=df['NUMBIDS'].mean())
```

For *NUMBIDS >=5*, we will use the Poisson **Survival Function** which will give us the probability of seeing* NUMBIDS >=5.*

The Survival Function *S(X=x)* gives you the probability of observing a value of *X *that is greater than *x. i.e. S(X=x) = Pr(X > x)*.

```
grouped_df.at[5, 'NUMBIDS_POISSON_PMF'] = poisson.sf(k=4, mu=df['NUMBIDS'].mean())
```

Calculate the Poisson distributed expected frequency *E_i *of each *NUMBIDS:*

```
grouped_df['NUMBIDS_POISSON_FREQ'] = grouped_df['NUMBIDS_POISSON_PMF']*len(df)
```

Print the observed and expected values:

```
print(grouped_df[['NUMBIDS_OBS_PROBA', 'NUMBIDS_OBS_FREQ', 'NUMBIDS_POISSON_PMF', 'NUMBIDS_POISSON_FREQ']])
```

We see the following output:

Plot the Observed (*O_i*) and Expected (*E_i*) for all *i*:

```
labels=['0', '1', '2', '3', '4', ' >=5']
x = np.arange(len(labels))
width = 0.35
fig, ax = plt.subplots()
ax.set_title('Poisson Predicted versus Observed Frequencies of NUMBIDS')
ax.set_xlabel('NUMBIDS')
ax.set_ylabel('Frequency of NUMBIDS')
ax.set_xticks(x)
ax.set_xticklabels(labels)
bar1 = ax.bar(x=x - width/2, height=list(grouped_df['NUMBIDS_OBS_FREQ']), width=width, label='Observed Frequency')
bar2 = ax.bar(x=x + width/2, height=list(grouped_df['NUMBIDS_POISSON_FREQ']), width=width, label='Expected Frequency')
ax.legend()
ax.bar_label(bar1, padding=3)
ax.bar_label(bar2, padding=3)
fig.tight_layout()
plt.show()
```

We see the following plot:

Now let’s calculate the Chi-squared test statistic:

```
grouped_df['CHI_SQUARED_STAT'] = np.power(grouped_df['NUMBIDS_OBS_FREQ']-grouped_df['NUMBIDS_POISSON_FREQ'], 2)/grouped_df['NUMBIDS_POISSON_FREQ']
chi_squared_value = grouped_df['CHI_SQUARED_STAT'].sum()
```

Before we calculate the p-value for the above statistic, we must fix the degrees of freedom. Here are the total degrees of freedom:

```
total_degrees_of_freedom = len(grouped_df)
```

We have to reduce this number by *p *where *p*=number of parameters of the Poisson distribution. So p=1.

```
reduced_degrees_of_freedom = total_degrees_of_freedom - 1
```

Get the p-value of the Chi-squared test statistic with *(N-p)* degrees of freedom. Notice that we are once again using the Survival Function which gives us the probability of observing an outcome that is greater than a certain value, in this case that value is the Chi-squared test statistic.

```
chi_squared_p_value = stats.chi2.sf(x=chi_squared_value, df=reduced_degrees_of_freedom)
```

We will also get the test statistic value corresponding to a critical alpha of 0.05 (95% confidence level). We will use the Inverse of the Survival Function for getting this value.

Since the Survival Function *S(**X**=x) = Pr(**X** > x)*, Inverse of *S(**X**=x)* will give you the *X**=x* such that the probability of observing any *X** > x* is the given *q* value (e.g. *q=0.05 or 5%*)

```
critical_chi_squared_value_at_95p = stats.chi2.isf(q=0.05, df=reduced_degrees_of_freedom)
```

Print out all the values that we have calculated so far:

```
stats_labels=['Degrees of freedom', 'Chi-squared test statistic', 'p-value', 'Maximum allowed Chi-squared test statistic for H0 at alpha=0.05']
stats_values=[reduced_degrees_of_freedom, chi_squared_value, chi_squared_p_value, critical_chi_squared_value_at_95p]
print(set(zip(stats_labels, stats_values)))
```

We see the following output:

{('Degrees of freedom', 5), ('p-value', 4.9704641133403614e-05), ('Chi-squared test statistic', 27.306905068684152), ('Maximum allowed Chi-squared test statistic for H0 at alpha=0.05', 11.070497693516355)}

We see that the calculated value of the Chi-squared goodness of fit statistic is 27.306905068684152 and its p-value is 4.9704641133403614e-05 which is much smaller than *alpha=0.05*.

**Thus we conclude that Null Hypothesis H0 that NUMBIDS is Poisson distributed can be resolutely REJECTED at 95% (indeed even at 9.99%) confidence level.**

Notice further that the Critical Chi-squared test statistic value to accept H0 at 95% confidence level is 11.07, which is much smaller than 27.31.

We can visualize this situation by plotting Chi-squared(5):

```
plt.xlabel('Chi-squared distributed random number')
plt.ylabel('Chi-squared(df='+str(reduced_degrees_of_freedom)+') probability density')
rv = stats.chi2.rvs(df=reduced_degrees_of_freedom, size=100000)
plt.hist(x=rv, bins=100, density=True, histtype='stepfilled')
plt.show()
```

## Testing the Goodness of Fit of a Regression Model

We’ll now see how to use the Chi-squared test to test the Goodness of Fit of a Poisson Regression Model.

To start with, let’s fit the Poisson Regression Model to our takeover bids data set. We’ll construct the model equation using the syntax used by Patsy. In the below expression we are saying that *NUMBIDS *is the dependent variable and all the variables on the RHS are the explanatory variables of regression.

```
expr = 'NUMBIDS ~ LEGLREST + REALREST + FINREST + WHITEKNT + BIDPREM + INSTHOLD + SIZE + SIZESQ + REGULATN'
```

Using Patsy, carve out the ** X **and

**matrices:**

*y*```
y_train, X_train = dmatrices(expr, df, return_type='dataframe')
```

Build and fit a Poisson regression model on the training data set:

```
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
```

Print the training summary:

```
print(poisson_training_results.summary())
```

We get the following output:

Only 3 regression variables — WHITEKNT, SIZE and SIZESQ — are seen to be statistically significant at an alpha of 0.05 as evidenced by their z scores.

Incidentally, ignore the value of the Pearson chi2 reported by statsmodels. It is the sum of the Pearson residuals of the regression. Incidentally, this sum is also Chi-square distributed under the Null Hypothesis but it’s not what we are after.

What we want to find out is if the Poisson regression model, by way of addition of regressions variables, has been able to explain some of the variance in *NUMBIDS* leading to a better goodness of fit of the model’s predictions to the data set.

In the earlier section, we have already proved the following about NUMBIDS:

*Pr(NUMBIDS=k) *does not obey *Poisson(µ=1.73)*

We now want to see if:

*Pr(NUMBIDS=k | **X**=x) ~ Poisson(µ|**X=**x)*

*i.e. is NUMBIDS Poisson distributed conditioned upon the values of the regression variables?*

Let’s start by printing out the predictions of the Poisson model on the training data set.

```
print(poisson_training_results.mu)
```

Here are how they look like:

[2.72889817 1.30246609 2.15499739 1.1900047 1.21599906 2.09184785

1.82652758 1.06685393 0.79514164 1.14477171 0.99898271 1.92814556

2.20779028 1.43286411 1.73940628 0.90328133 2.27706959 1.87184259

1.54905093 0.93227952 0.93033507 1.51971811 2.08993763 3.19936595

1.89919831 1.59212915 1.52079672 1.00089633 2.00393774 1.1054032

1.57265323 1.76745435 2.00783088 1.61220849 1.83879678 4.41843589

1.04601311 1.31612754 1.6392279 1.37670449 1.06614718 1.66242252

2.34573342 2.04544149 0.883722 2.90131042 2.31600229 2.29251975

0.91892439 2.20332645 1.44903512 1.13939068 3.47003593 0.71567673

1.6650563 1.69170618 1.03992301 2.84548307 1.69812667 1.27352075

1.03938293 1.79108836 1.37244044 2.21050863 1.86560189 1.8648059

1.86403231 1.41217325 1.67946352 0.87193693 2.17762357 3.19137732

1.57576042 2.58713174 2.76491122 1.85562101 0.95351014 1.07514803

1.03481096 2.88297676 1.12614708 2.32323011 3.14995344 1.1434731

1.01097106 1.9373812 1.07089043 1.26067475 1.83816698 1.94861201

1.53542301 1.78289842 1.27713579 1.99077092 2.13675676 1.92808338

1.29938303 1.58595018 2.05336739 1.18530808 1.1440799 4.00075889

1.11596048 1.63979224 2.01608871 1.09786493 2.73455853 1.40900652

1.15158314 0.9380958 2.58107119 2.16988211 1.07493149 1.17958898

1.33171499 1.78776301 1.37353721 2.68417969 1.61699478 2.23254292

1.02306676 1.76736974 2.31367053 1.38449909 0.91364522 4.22397452]

Each number in the above array is the expected value *µ* of *NUMBIDS* conditioned upon the corresponding values of the regression variables in that row, i.e. ** X=x**. There are a total of 126 expected values printed corresponding to the 126 rows in

**. Thus, the above array gives us the set of**

*X**conditional expectations*

*µ**|*

*X**.*

Our task is to calculate the expected probability (and therefore frequency) for each observed value of *NUMBIDS* given the expected values *µ *of the Poisson rate generated by the trained model. To do so, we will take each observed value of *NUMBIDS *in the training set and we’ll calculate the Poisson probability of observing that value given each one of the predicted rates *µ *in the array of *µ *values. For that *NUMBIDS* value, we’ll average over all such predicted probabilities to get the predicted probability of observing that value of *NUMBIDS *under the trained Poisson model.

```
def get_poisson_conditional_probability(row):
num_bids = int(row.name)
if num_bids < 5:
return np.mean(poisson.pmf(k=num_bids, mu=poisson_training_results.mu))
else: #use the Survival Function for NUMBIDS >=5
return np.mean(poisson.sf(k=num_bids, mu=poisson_training_results.mu))
grouped_df['PM_NUMBIDS_POISSON_PMF'] = grouped_df.apply(get_poisson_conditional_probability, axis=1)
grouped_df['PM_NUMBIDS_POISSON_FREQ'] = grouped_df['PM_NUMBIDS_POISSON_PMF']*len(df)
```

Now that we have our Expected Frequency *E_i *under the Poisson regression model for each value of *NUMBIDS*, let’s once again run the Chi-squared test of goodness of fit on the Observed and Expected Frequencies:

```
stats.chisquare(f_obs=grouped_df['NUMBIDS_OBS_FREQ'], f_exp=grouped_df['PM_NUMBIDS_POISSON_FREQ'], ddof=reduced_degrees_of_freedom)
```

We see the following output:

```
Power_divergenceResult(statistic=33.69923990742465, pvalue=nan)
```

We see that with the Poisson Regression model, our Chi-squared statistic is 33.69 which is even bigger than the value of 27.30 we got earlier. The p-value is also too low to be printed (hence the nan). The Poisson regression model has not been able to explain the variance in the dependent variable *NUMBIDS* as evidenced by its poor goodness of fit on the Poisson probability distribution (this time conditioned upon ** X)**.

**Hence we reject the Poisson regression model for this data set.**

Perhaps another regression model such as the Negative Binomial or the Generalized Poisson model would be better able to account for the over-dispersion in *NUMBIDS *that we had noted earlier and therefore may be achieve a better goodness of fit than the Poisson model.

Here is the complete source code:

## Citations and Copyrights

### Data set

Jaggia, S., Thosar, S. Multiple bids as a consequence of target management resistance: A count data approach. *Rev Quant Finan Acc* **3, **447–457 (1993). https://doi.org/10.1007/BF02409622 **PDF Download link** **Curated data set download link**

### Paper and Book Links

Cameron A. Colin, Trivedi Pravin K., *Regression Analysis of Count Data*, Econometric Society Monograph №30, Cambridge University Press, 1998. ISBN: 0521635675

McCullagh P., Nelder John A., *Generalized Linear Models*, 2nd Ed., CRC Press, 1989, ISBN 0412317605, 9780412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **The Akaike Information Criterion

**NEXT: **Schoenfeld Residuals