###### A goodness of fit measure that is based on Information Theory

## Introduction to the AIC

The **A**kaike **I**nformation **C**riterion (**AIC**) lets you test how well your model fits the data set without over-fitting it.

The AIC score rewards models that achieve a high goodness-of-fit score and penalizes them if they become overly complex.

By itself, the AIC score is not of much use unless it is compared with the AIC score of a competing model.

The model with the lower AIC score is expected to strike a superior balance between its ability to fit the data set and its ability to avoid over-fitting the data set.

## Formula for the AIC score

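The formula for the AIC score is as follows:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

Here $k$ is the number of parameters in the model and $\ln(\hat{L})$ is the maximized log-likelihood of the model.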

The AIC formula is built upon 4 concepts which themselves build upon one another as follows:

Let’s take another look at the AIC formula, but this time, let’s re-organize it a bit:

Recall that a smaller AIC score is preferable to a larger one. Using the rewritten formula, one can see that the AIC score of the model increases as the numerator grows, and the numerator contains the number of parameters in the model (i.e. a measure of model complexity). The AIC score decreases as the denominator grows, and the denominator contains the maximized likelihood of the model (which, as we just saw, is a measure of the goodness-of-fit of the model).
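One rearrangement that matches this description folds both terms of the score into a single logarithm. The symbols are chosen here for illustration: $k$ is assumed to denote the number of model parameters and $\hat{L}$ the maximized value of the likelihood function.

```latex
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln(\hat{L}) \\
             &= 2\ln(e^{k}) - 2\ln(\hat{L}) \\
             &= 2\ln\!\left(\frac{e^{k}}{\hat{L}}\right)
\end{aligned}
```

Because the logarithm is monotonically increasing, a larger numerator $e^{k}$ (more parameters) raises the score, while a larger denominator $\hat{L}$ (a better fit) lowers it.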

## Comparing two models using their AIC scores

The AIC score is useful only when it is used to compare two models. Let’s say we have two such models with *k1* and *k2* parameters, and AIC scores AIC_1 and AIC_2.

Assume that AIC_1 < AIC_2 i.e. model 1 is better than model 2.

How much worse is model 2 than model 1? This question can be answered by using the following formula:
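A commonly used form of this relative likelihood is exp((AIC_1 − AIC_2)/2), where AIC_1 is the lower of the two scores. The sketch below assumes that form. The function name `relative_likelihood` and the two example scores are hypothetical and are not taken from the Boston data.

```python
import math

def relative_likelihood(aic_best, aic_other):
    """Relative likelihood of the higher-AIC model with respect to the best one.

    Assumes the common form exp((AIC_best - AIC_other) / 2).
    """
    return math.exp((aic_best - aic_other) / 2.0)

# Hypothetical scores for two competing models
aic_1 = 100.0
aic_2 = 102.0

rel = relative_likelihood(aic_1, aic_2)
print(f"Relative likelihood of model 2 vs. model 1: {rel:.3f}")
```

With these made-up scores the result is exp(−1), roughly 0.368. Under this form, a gap of only 2 AIC points already makes model 2 noticeably less plausible than model 1.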

Why use the *exp()* function to compute the relative likelihood? Why not just subtract AIC_2 from AIC_1? For one thing, the *exp()* function ensures that the relative likelihood is always a positive number and hence easier to interpret.

## Example

If you build and train an **O**rdinary **L**east **S**quares **R**egression model using the Python *statsmodels* library, statsmodels reports the AIC score of the fitted model for you.
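As a minimal sketch of where that score can be read, the snippet below fits an OLS model on synthetic data (made up for illustration, not the Boston temperature data) and compares the `aic` attribute of the fitted results with a hand computation from the `llf` attribute (the maximized log-likelihood). Here the parameter count is assumed to be the number of fitted coefficients, including the intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x': rng.normal(size=200)})
demo['y'] = 1.5 + 2.0 * demo['x'] + rng.normal(scale=0.5, size=200)

results = smf.ols('y ~ x', demo).fit()

# statsmodels exposes the AIC score and the maximized log-likelihood directly
print('AIC reported by statsmodels:', results.aic)
print('Maximized log-likelihood:', results.llf)

# Recompute the AIC by hand: k counts the fitted coefficients (intercept + slope)
k = len(results.params)
aic_by_hand = 2 * k - 2 * results.llf
print('AIC computed by hand:', aic_by_hand)
```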

### How to select an optimal model using AIC

Let’s perform what might hopefully turn out to be an interesting model selection experiment. We’ll use a data set of daily average temperatures in the city of Boston, MA from 1978 to 2019. This data can be downloaded from NOAA’s website.

The raw data set (which you can access over here) contains the daily average temperature values. The first few rows of the raw data are reproduced below:

### Exploring the data set

For our model selection experiment, we’ll aggregate the data at a month level.

After aggregation, which we’ll soon see how to do in *pandas*, the plotted values for each month look as follows:

Let’s also plot the average temperature TAVG against a time lagged version of itself for various time lags going from 1 month to 12 months. Following is the set of resulting scatter plots:

There is clearly a strong correlation at LAGS 6 and 12 which is to be expected for monthly averaged temperature data. Other lags such as LAG1, LAG5 and LAG7 may also exhibit a significant ability to explain some of the variance in the target variable’s value. We’ll find out soon enough if that’s true.
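The lag plots can also be summarized numerically. The sketch below uses a synthetic monthly series with a 12-month cycle (not the NOAA data; the amplitude and noise level are arbitrary assumptions) and computes the correlation between the series and each of its lagged versions with pandas’ `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle (illustrative only)
months = np.arange(240)
rng = np.random.default_rng(42)
tavg_demo = pd.Series(50 + 20 * np.sin(2 * np.pi * months / 12)
                      + rng.normal(scale=2.0, size=months.size))

# Correlation of the series with itself shifted by 1..12 months
lag_corr = {lag: tavg_demo.autocorr(lag=lag) for lag in range(1, 13)}
for lag, corr in lag_corr.items():
    print(f'LAG {lag:2d}: correlation = {corr:+.2f}')
```

For a purely seasonal series like this one, the lag-6 correlation is strongly negative (summer versus winter) and the lag-12 correlation is strongly positive, which is the pattern one would expect in the scatter plots.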

### Regression goal

Our regression goal will be to create a model that will predict the monthly average temperature in Boston, namely the TAVG value. Therefore our target, a.k.a. the response variable, will be TAVG.

### Regression strategy

Our regression strategy will be as follows:

- Since we have seen a strong seasonality at LAGS 6 and 12, we will hypothesize that the target value TAVG can be predicted using one or more lagged versions of the target value, up through LAG 12.
- Therefore, we’ll add lagged variables TAVG_LAG_1, TAVG_LAG_2, …, TAVG_LAG_12 to our data set. These are going to be our explanatory variables.
- Next, we’ll build several Ordinary Least Squares Regression (OLSR) models using the *statsmodels* library. Each model will seek to explain the variance in TAVG using some combination of time-lagged variables. Since we don’t know what combination of lagged variables will lead to the ‘optimal’ model, we’ll do a brute force search through all possible combinations of lagged variables. That’s 4000+ model combinations in all!
- We will build a lagged variable model corresponding to each one of these combinations, train the model and check its AIC score.
- During our search through the model space, we’ll keep track of the model with the lowest AIC score.
- In the end, we’ll print out the summary characteristic of the model with the lowest AIC score. We’ll inspect this optimal model using a couple of other model evaluation criteria also, such as the t-test and the F-test.
- Lastly, we’ll test the optimal model’s performance on the test data set.
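The “4000+” figure in the strategy above can be checked with a short calculation: every non-empty subset of the 12 lagged variables is one candidate model. The variable names below are illustrative.

```python
import math

n_lags = 12

# Number of non-empty subsets of 12 lagged variables:
# the sum of C(12, i) for i = 1..12
n_models = sum(math.comb(n_lags, i) for i in range(1, n_lags + 1))

print(n_models)                      # 4095
print(n_models == 2**n_lags - 1)     # True
```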

Let’s implement this strategy.

### Implementing the regression strategy using Python, pandas and statsmodels

Import all the required packages.

```
import pandas as pd
from patsy import dmatrices
from collections import OrderedDict
import itertools
import statsmodels.formula.api as smf
import sys
import matplotlib.pyplot as plt
```

Read the data set into a pandas data frame.

```
# parse_dates=[0] turns the first column into datetimes; index_col=[0] makes it the index
df = pd.read_csv('boston_daily_temps_1978_2019.csv', header=0, parse_dates=[0], index_col=[0])
```

The data set contains daily average temperatures. We want monthly averages. So let’s roll up the data to a month level. This turns out to be a simple thing to do using *pandas*.

```
# 'M' is the month-end alias; pandas 2.2+ deprecates it in favor of 'ME'
df_resampled = df.resample('M').mean()
```

We are about to add lagged variable columns into the data set. Let’s create a copy of the data set so that we don’t disturb the original data set.

```
df_lagged = df_resampled.copy()
```

Add 12 columns, each one containing a time-lagged version of TAVG.

```
for i in range(1, 13, 1):
    df_lagged['TAVG_LAG_' + str(i)] = df_lagged['TAVG'].shift(i)
```

Print out the first 15 rows of the lagged variables data set.

```
print(df_lagged.head(15))
```

This prints out the following output:

The first 12 rows contain NaNs introduced by the *shift* function. Let’s remove these 12 rows.

```
for i in range(0, 12, 1):
    df_lagged = df_lagged.drop(df_lagged.index[0])
```
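To see why the leading rows end up as NaN, here is a small sketch on a made-up six-row frame with only two lags. It also shows `dropna()` as one possible alternative to the loop above. That alternative assumes the data has no other missing values, since `dropna()` removes every row that contains a NaN.

```python
import pandas as pd

# Tiny illustrative frame; the temperature values are made up
toy = pd.DataFrame({'TAVG': [30.0, 32.5, 41.0, 50.2, 60.1, 69.8]})

# shift(i) moves values down by i rows, leaving NaN in the first i rows
for i in range(1, 3):
    toy['TAVG_LAG_' + str(i)] = toy['TAVG'].shift(i)
print(toy)

# dropna() removes every row that contains at least one NaN,
# which here is the first 2 rows (the largest lag used)
toy_clean = toy.dropna()
print(toy_clean)
```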

Print out the first few rows just to confirm that the NaNs have been removed.

```
print(df_lagged.head())
```

Before we do any more peeking and poking into the data, we will put aside 20% of the data set for testing the optimal model.

```
split_index = round(len(df_lagged)*0.8)
split_date = df_lagged.index[split_index]
df_train = df_lagged.loc[df_lagged.index <= split_date].copy()
df_test = df_lagged.loc[df_lagged.index > split_date].copy()
```
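The split above is chronological rather than random, so the test set holds only months that come after every training month. The sketch below replays the same logic on a hypothetical 50-month frame (a stand-in for `df_lagged`) to show the resulting sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame standing in for df_lagged
idx = pd.date_range('2000-01-01', periods=50, freq='MS')
toy = pd.DataFrame({'TAVG': np.arange(50, dtype=float)}, index=idx)

split_index = round(len(toy) * 0.8)
split_date = toy.index[split_index]
toy_train = toy.loc[toy.index <= split_date]
toy_test = toy.loc[toy.index > split_date]

print(len(toy_train), len(toy_test))
# Every training month precedes every test month
print(toy_train.index.max() < toy_test.index.min())
```

Because the comparison uses `<=`, the row at `split_index` itself lands in the training set, so the training share is slightly above 80%.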

Now let’s create all possible combinations of lagged values. For this, we’ll create a dictionary in which the keys contain different combinations of the lag numbers 1 through 12.

```
lag_combinations = OrderedDict()
l = list(range(1,13,1))
for i in range(1, 13, 1):
    for combination in itertools.combinations(l, i):
        lag_combinations[combination] = 0.0
print('Number of combinations to be tested: ' + str(len(lag_combinations)))
```

Next, we will iterate over all the generated combinations. For each lag combination, we’ll build the model’s expression using the Patsy syntax, build the linear regression model for that combination of variables, train it on the training data set, and ask *statsmodels* for the model’s AIC score. If that score is lower than the minimum seen so far, we’ll record it along with the current ‘best model’. We’ll do all of this in the following piece of code:

```
expr_prefix = 'TAVG ~ '
min_aic = sys.float_info.max
best_expr = ''
best_olsr_model_results = None
# Iterate over each combination
for combination in lag_combinations:
    # Set up the model expression using patsy syntax,
    # e.g. 'TAVG ~ TAVG_LAG_1 + TAVG_LAG_6'
    expr = expr_prefix + ' + '.join('TAVG_LAG_' + str(lag_num) for lag_num in combination)
    print('Building model for expr: ' + expr)
    # Build and train the OLSR model on the training data set
    olsr_results = smf.ols(expr, df_train).fit()
    # Store its AIC value
    lag_combinations[combination] = olsr_results.aic
    print('AIC=' + str(lag_combinations[combination]))
    # Keep track of the best model (the one with the lowest AIC score) seen so far
    if olsr_results.aic < min_aic:
        min_aic = olsr_results.aic
        best_expr = expr
        best_olsr_model_results = olsr_results
# Carve out the X, y matrices for the best expression using patsy.
# We will use X_test, y_test later for testing the model.
y_test, X_test = dmatrices(best_expr, df_test, return_type='dataframe')
```
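The same search pattern can be tried on a problem small enough to inspect by hand. The sketch below is not the Boston experiment. It uses synthetic data in which `y` depends on `x1` and `x3` only (all names and coefficients are made up), fits every non-empty subset of three candidate variables, and ranks the subsets by AIC.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 and x3 only; x2 is pure noise
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
demo['y'] = 2.0 * demo['x1'] - 3.0 * demo['x3'] + rng.normal(scale=0.5, size=n)

candidates = ['x1', 'x2', 'x3']
aic_by_combo = {}
for size in range(1, len(candidates) + 1):
    for combo in itertools.combinations(candidates, size):
        expr = 'y ~ ' + ' + '.join(combo)
        aic_by_combo[combo] = smf.ols(expr, demo).fit().aic

# The combination with the lowest AIC score wins
best_combo = min(aic_by_combo, key=aic_by_combo.get)
for combo, aic in sorted(aic_by_combo.items(), key=lambda kv: kv[1]):
    print(f'{aic:10.2f}  {" + ".join(combo)}')
print('Best combination:', best_combo)
```

Subsets that omit `x1` or `x3` should score far worse than those that include both. Whether the noise variable `x2` sneaks into the winning subset depends on the random draw, since the AIC penalty of 2 per parameter is fairly mild.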

Finally, let’s print out the summary of the best OLSR model as per our evaluation criterion. This is the model with the lowest AIC score.

```
print(best_olsr_model_results.summary())
```

This prints out the following output. I have highlighted a few interesting areas in the output:

Let’s inspect the highlighted sections.

### Choice of model parameters

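Our AIC score based model evaluation strategy has identified a model with the following parameters: the intercept, plus the coefficients of TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11 and TAVG_LAG_12.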

The other lags (3, 4, 7, 8 and 9) have been determined to not be significant enough to *jointly* explain the variance of the dependent variable TAVG. For example, we see that TAVG_LAG_7 is not present in the optimal model even though, from the scatter plots we saw earlier, there seemed to be a good amount of correlation between the response variable TAVG and TAVG_LAG_7. The reason for the omission might be that most of the information in TAVG_LAG_7 has already been captured by TAVG_LAG_6, and we can see that TAVG_LAG_6 is included in the optimal model.

### Statistical significance of model parameters (the t-test)

The second thing to note is that all parameters of the optimal model, **except for TAVG_LAG_10**, are *individually* statistically significant at a 95% confidence level on the two-tailed t-test. The reported p-values for their ‘t’ scores are smaller than 0.025, which is comfortably below the 0.05 threshold that applies to the two-sided p-values *statsmodels* reports at a 95% confidence level.

### Joint significance of model parameters (the F-test)

The third thing to note is that all parameters of the model are *jointly significant* in explaining the variance in the response variable TAVG.

This can be seen from the F-statistic of 1458. Its p-value is 1.15e-272, far below the 0.05 threshold at a 95% confidence level. This probability value is so incredibly tiny that you don’t even need to look up the F-distribution table to verify that the F-statistic is significant. The model is definitely much better at explaining the variance in TAVG than an intercept-only model.
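Both the t-test and the F-test figures can be read programmatically from the fitted results instead of from the printed summary. The sketch below uses a synthetic model (made-up data in which `x1` matters and `x2` does not) and the `pvalues`, `fvalue` and `f_pvalue` attributes of the results object. The 0.05 cutoff is an assumption corresponding to a 95% confidence level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration: x1 matters, x2 does not
rng = np.random.default_rng(7)
demo = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
demo['y'] = 1.0 + 2.0 * demo['x1'] + rng.normal(scale=0.5, size=200)

results = smf.ols('y ~ x1 + x2', demo).fit()

# Two-sided p-values of the per-coefficient t-tests (the P>|t| column)
print(results.pvalues)

# F-statistic and its p-value for the joint test against an intercept-only model
print('F-statistic:', results.fvalue)
print('Prob (F-statistic):', results.f_pvalue)

alpha = 0.05  # assumed cutoff for a 95% confidence level
individually_significant = results.pvalues[results.pvalues < alpha].index.tolist()
print('Individually significant terms:', individually_significant)
```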

### The AIC score and the Maximized Log-Likelihood of the fitted model

Finally, let’s take a look at the **AIC** score of **1990.0** reported by *statsmodels*, and the maximized log-likelihood of **-986.86**.

We can see that the model contains **8 parameters (7 time-lagged variables + intercept)**. So as per the formula for the AIC score:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) = 2 \times 8 - 2 \times (-986.86) = 1989.72$$

where $k = 8$ is the number of parameters and $\ln(\hat{L}) = -986.86$ is the maximized log-likelihood. Rounded to 1990.0, this matches the value reported by statsmodels.

### Testing the model’s performance on out-of-sample data

The final step in our experiment is to test the optimal model’s performance on the test data set. Remember that the model has not seen this data during training.

We will ask the model to generate predictions on the test data set using the following single line of code:

```
olsr_predictions = best_olsr_model_results.get_prediction(X_test)
```

Let’s get the summary frame of predictions and print out the first few rows.

```
olsr_predictions_summary_frame = olsr_predictions.summary_frame()
print(olsr_predictions_summary_frame.head(10))
```

The output looks like this:

Next, let’s pull out the actual and the forecasted TAVG values so that we can plot them:

```
predicted_temps=olsr_predictions_summary_frame['mean']
actual_temps = y_test['TAVG']
```
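For orientation, the sketch below shows what `get_prediction()` and `summary_frame()` return for a small synthetic model. The data and variable names are made up, and `alpha=0.05` is passed explicitly to request 95% intervals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training and test data for illustration
rng = np.random.default_rng(3)
train = pd.DataFrame({'x': rng.normal(size=100)})
train['y'] = 3.0 + 1.5 * train['x'] + rng.normal(scale=0.3, size=100)
test = pd.DataFrame({'x': [-1.0, 0.0, 1.0]})

results = smf.ols('y ~ x', train).fit()

# get_prediction() returns an object holding predictions and their uncertainty
frame = results.get_prediction(test).summary_frame(alpha=0.05)
print(frame.columns.tolist())
print(frame)
```

The `mean` column holds the point predictions used for plotting. The `mean_ci_*` columns bound the predicted mean, and the wider `obs_ci_*` columns bound an individual observation.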

Finally, let’s plot the predicted TAVG versus the actual TAVG from the test data set.

```
fig = plt.figure()
fig.suptitle('Predicted versus actual monthly average temperatures')
predicted, = plt.plot(X_test.index, predicted_temps, 'go-', label='Predicted monthly average temp')
actual, = plt.plot(X_test.index, actual_temps, 'ro-', label='Actual monthly average temp')
plt.legend(handles=[predicted, actual])
plt.show()
```

The plot looks like this:

In the above plot, it might seem like our model is amazingly capable of forecasting temperatures for several years out into the future! However, the reality is quite different. What we are asking the model to do is to predict the current month’s average temperature by considering the temperatures of the previous month, the month before that, and so on. In other words, it uses the observed values of the model’s explanatory variables TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11 and TAVG_LAG_12, together with the intercept of regression.

We are asking the model to make this forecast for each time period, and we are asking it to do so for as many time periods as the number of samples in the test data set. Thus our model can reliably make only one-month-ahead forecasts. This behavior is entirely expected given that one of the explanatory variables in the model is the previous month’s average temperature value TAVG_LAG_1.

This completes our model selection experiment.

The data set is available here.

## Summary

Let’s summarize the important points:

- The **AIC** score gives you a way to measure the goodness-of-fit of your model, while at the same time penalizing the model for over-fitting the data.
- By itself, an AIC score is not useful. One needs to compare it with the AIC score of other models while performing model selection. A lower AIC score indicates superior goodness-of-fit and a lesser tendency to over-fit.
- While performing model selection using the AIC score, one should also run other tests of significance such as the Student’s t-test and the F-test so as to perform a 360 degree assessment of the model’s suitability for the data set under consideration.

## Citations and Copyrights

### Data set

Monthly average temperature in the city of Boston, Massachusetts (Source: NOAA)

### Papers and books

Akaike H. (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In: Parzen E., Tanabe K., Kitagawa G. (eds) Selected Papers of Hirotugu Akaike. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY. https://doi.org/10.1007/978-1-4612-1694-0_15

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
