# The Akaike Information Criterion

## Introduction to the AIC

The Akaike Information Criterion (AIC) lets you test how well your model fits the data set without over-fitting it.

The AIC score rewards models that achieve a high goodness-of-fit score and penalizes them if they become overly complex.

By itself, the AIC score is not of much use unless it is compared with the AIC score of a competing model.

The model with the lower AIC score is expected to strike a superior balance between its ability to fit the data set and its ability to avoid over-fitting the data set.

## Formula for the AIC score

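The formula for the AIC score is as follows:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

where $k$ is the number of parameters in the model and $\hat{L}$ is the maximized value of the model’s likelihood function.

The AIC formula is built upon 4 concepts which themselves build upon one another.

Figure: The concepts on which the AIC is based (Image by Author)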

Let’s take another look at the AIC formula, but this time, let’s re-organize it a bit:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) = 2\ln\left(\frac{e^{k}}{\hat{L}}\right)$$

Recall that a smaller AIC score is preferable to a larger one. Using the rewritten formula, one can see that the AIC score of the model increases as the numerator $e^{k}$ grows; the numerator contains the number of parameters $k$ in the model (i.e. a measure of model complexity). And the AIC score decreases as the denominator $\hat{L}$ grows; the denominator is the maximized likelihood of the model, which is a measure of the model’s goodness-of-fit.
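To make this trade-off concrete, here is a minimal Python sketch of the standard form of the score, AIC = 2k - 2 × (maximized log-likelihood). The function name `aic_score` and all the numbers below are illustrative assumptions; they are not taken from the experiment later in this article.

```python
def aic_score(num_params, max_log_likelihood):
    """AIC = 2k - 2 * (maximized log-likelihood)."""
    return 2 * num_params - 2 * max_log_likelihood

# Hypothetical models: the last two each add one parameter to the first.
simple = aic_score(num_params=3, max_log_likelihood=-120.0)
better_fit = aic_score(num_params=4, max_log_likelihood=-117.5)     # log-likelihood improves by 2.5
barely_better = aic_score(num_params=4, max_log_likelihood=-119.6)  # log-likelihood improves by 0.4

print(simple, better_fit, barely_better)
```

In this sketch, the extra parameter lowers the AIC score only when it raises the maximized log-likelihood by more than 1; a smaller improvement in fit does not pay for the added complexity.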

## Comparing two models using their AIC scores

The AIC score is useful only when it’s used to compare two models. Let’s say we have two such models with k_1 and k_2 parameters respectively, and AIC scores AIC_1 and AIC_2.

Assume that AIC_1 < AIC_2 i.e. model 1 is better than model 2.

How much worse is model 2 than model 1? This question can be answered by computing the relative likelihood of model 2 with respect to model 1:

$$\text{Relative likelihood} = \exp\left(\frac{\mathrm{AIC}_1 - \mathrm{AIC}_2}{2}\right)$$

Why use the exp() function to compute the relative likelihood? Why not just subtract AIC_2 from AIC_1? For one thing, the exp() function ensures that the relative likelihood is always a positive number and hence easier to interpret.
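The comparison can be sketched in a few lines of Python. This assumes the commonly used form exp((AIC_min - AIC_i)/2) for the relative likelihood; the function name and the AIC values below are illustrative assumptions.

```python
import math

def relative_likelihood(aic_best, aic_other):
    """Relative likelihood of the higher-AIC model with respect to the lower-AIC model."""
    return math.exp((aic_best - aic_other) / 2.0)

# Hypothetical scores with AIC_1 < AIC_2
aic_1, aic_2 = 100.0, 102.0
rel = relative_likelihood(aic_1, aic_2)
print(rel)  # exp(-1), roughly 0.37
```

Because AIC_1 is never larger than AIC_2 here, the result always lies between 0 and 1. A value close to 1 means the two models are nearly indistinguishable by this criterion, while a value close to 0 means model 2 is far less plausible than model 1.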

## Example

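If you build and train an Ordinary Least Squares Regression model using the Python statsmodels library, statsmodels reports the model’s AIC score as part of the fitted results: it appears in the training summary and is available through the `aic` attribute of the results object.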

### How to select an optimal model using AIC

Let’s perform what might hopefully turn out to be an interesting model selection experiment. We’ll use a data set of daily average temperatures in the city of Boston, MA from 1978 to 2019. This data can be downloaded from NOAA’s website.

The raw data set (which you can access over here) contains the daily average temperature values. The first few rows of the raw data are reproduced below:

### Exploring the data set

For our model selection experiment, we’ll aggregate the data at a month level.

After aggregation, which we’ll soon see how to do in pandas, the plotted values for each month look as follows:

Figure: Monthly average temperature in the city of Boston, Massachusetts (Source: NOAA) (Image by Author)

Let’s also plot the average temperature TAVG against a time lagged version of itself for various time lags going from 1 month to 12 months. Following is the set of resulting scatter plots:

Figure: Scatter plots of average monthly temperature against lagged versions of itself (Image by Author)

There is clearly a strong correlation at LAGS 6 and 12 which is to be expected for monthly averaged temperature data. Other lags such as LAG1, LAG5 and LAG7 may also exhibit a significant ability to explain some of the variance in the target variable’s value. We’ll find out soon enough if that’s true.
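To illustrate why lags 6 and 12 stand out for seasonal data, here is a small sketch that computes lagged correlations on a synthetic series. The series is an assumption made purely for illustration (a noiseless 12-month sine cycle, not the Boston data), and `pearson` and `lag_correlation` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lag_correlation(series, lag):
    """Correlation between series[t] and series[t - lag]."""
    return pearson(series[lag:], series[:-lag])

# Synthetic stand-in for monthly temperatures: a pure 12-month cycle over 10 years.
synthetic = [50 + 20 * math.sin(2 * math.pi * t / 12) for t in range(120)]

lag_corrs = {lag: lag_correlation(synthetic, lag) for lag in range(1, 13)}
for lag, corr in lag_corrs.items():
    print(f'LAG {lag:2d}: {corr:+.3f}')
```

On this idealized series the correlation is close to -1 at lag 6 (opposite seasons) and close to +1 at lag 12 (same season a year earlier). Real temperature data is noisier, so the values would be less extreme.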

### Regression goal

Our regression goal will be to create a model that will predict the monthly average temperature in Boston, namely the TAVG value. Therefore our target, a.k.a. the response variable, will be TAVG.

### Regression strategy

Our regression strategy will be as follows:

1. Since we have seen a strong seasonality at LAGS 6 and 12, we will hypothesize that the target value TAVG can be predicted using one or more lagged versions of the target value, up through LAG 12.
2. Therefore, we’ll add lagged variables TAVG_LAG_1, TAVG_LAG_2, …, TAVG_LAG_12 to our data set. These are going to be our explanatory variables.
3. Next, we’ll build several Ordinary Least Squares Regression (OLSR) models using the statsmodels library. Each model will seek to explain the variance in TAVG using some combination of time-lagged variables. Since we don’t know what combination of lagged variables will lead to the ‘optimal’ model, we’ll do a brute force search through all possible non-empty combinations of the 12 lagged variables. That’s 2^12 - 1 = 4,095 model combinations in all!
4. We will build a lagged variable model corresponding to each one of these combinations, train the model and check its AIC score.
5. During our search through the model space, we’ll keep track of the model with the lowest AIC score.
6. In the end, we’ll print out the summary characteristic of the model with the lowest AIC score. We’ll inspect this optimal model using a couple of other model evaluation criteria also, such as the t-test and the F-test.
7. Lastly, we’ll test the optimal model’s performance on the test data set.
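The size of the search space in step 3 can be checked with a few lines of standard-library Python. This is only a counting sketch; the variable names are illustrative.

```python
import math
from itertools import combinations

lags = range(1, 13)

# Count the non-empty subsets of the 12 lags, size by size.
total = sum(math.comb(12, size) for size in range(1, 13))

# Cross-check by actually enumerating the combinations.
enumerated = sum(1 for size in range(1, 13) for _ in combinations(lags, size))

print(total, enumerated)  # both equal 2**12 - 1 = 4095
```

A brute force search is feasible here because each OLS fit is cheap. The count doubles with every additional candidate variable, so this approach does not scale to large sets of candidates.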

Let’s implement this strategy.

### Implementing the regression strategy using Python, pandas and statsmodels

Import all the required packages.

```python
import pandas as pd
from patsy import dmatrices
from collections import OrderedDict
import itertools
import statsmodels.formula.api as smf
import sys
import matplotlib.pyplot as plt
```

Read the data set into a pandas data frame.

```python
# Assumption: the date is the first CSV column (the column indexes were lost from the original listing)
df = pd.read_csv('boston_daily_temps_1978_2019.csv', header=0, parse_dates=[0], index_col=0)
```

The data set contains daily average temperatures. We want monthly averages. So let’s roll up the data to a month level. This turns out to be a simple thing to do using pandas.

```python
# 'M' is the month-end frequency alias; pandas 2.2 and later prefer 'ME'
df_resampled = df.resample('M').mean()
```

We are about to add lagged variable columns into the data set. Let’s create a copy of the data set so that we don’t disturb the original data set.

```python
df_lagged = df_resampled.copy()
```

Add 12 columns, each one containing a time-lagged version of TAVG.

```python
for i in range(1, 13, 1):
    df_lagged['TAVG_LAG_' + str(i)] = df_lagged['TAVG'].shift(i)
```
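To make the effect of the shift concrete, here is a minimal pure-Python sketch of lagging a sequence. The helper `lag` is hypothetical and is not part of pandas; pandas pads the shifted column with NaN rather than None, and the sample temperatures are made up.

```python
def lag(values, k):
    """Return values shifted down by k positions, padding the start with None."""
    return [None] * k + values[:len(values) - k]

temps = [30.1, 32.4, 40.2, 51.0, 60.3]
print(lag(temps, 1))  # [None, 30.1, 32.4, 40.2, 51.0]
print(lag(temps, 2))  # [None, None, 30.1, 32.4, 40.2]
```

A lag of k leaves the first k positions without a value, which is why a data set carrying lags 1 through 12 has incomplete rows at the start.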

Print out the first 15 rows of the lagged variables data set.

```python
print(df_lagged.head(15))
```

This prints out the following output:

The first 12 rows contain NaNs introduced by the shift function. Let’s remove these 12 rows.

```python
# Drop the first 12 rows, which contain NaNs in at least one lagged column
df_lagged = df_lagged.iloc[12:]
```

Print out the first few rows just to confirm that the NaNs have been removed.

```python
print(df_lagged.head())
```

Before we do any more peeking and poking into the data, we will put aside 20% of the data set for testing the optimal model.

```python
split_index = round(len(df_lagged)*0.8)
split_date = df_lagged.index[split_index]
df_train = df_lagged.loc[df_lagged.index <= split_date].copy()
df_test = df_lagged.loc[df_lagged.index > split_date].copy()
```

Now let’s create all possible combinations of lagged values. For this, we’ll create a dictionary in which the keys contain different combinations of the lag numbers 1 through 12.

```python
lag_combinations = OrderedDict()
l = list(range(1,13,1))

for i in range(1, 13, 1):
    for combination in itertools.combinations(l, i):
        lag_combinations[combination] = 0.0

print('Number of combinations to be tested: ' + str(len(lag_combinations)))
```

Next, we will iterate over all the generated combinations. For each lag combination, we’ll build the model’s expression using the Patsy syntax, build the linear regression model for that combination of variables, train it on the training data set, and ask statsmodels for the model’s AIC score. If that score is less than the minimum value seen so far, we’ll record it along with the current ‘best model’. We’ll do all of this in the following piece of code:

```python
expr_prefix = 'TAVG ~ '

min_aic = sys.float_info.max
best_expr = ''
best_olsr_model_results = None

# Iterate over each combination
for combination in lag_combinations:
    expr = expr_prefix
    i = 1
    # Set up the model expression using patsy syntax
    for lag_num in combination:
        if i < len(combination):
            expr = expr + 'TAVG_LAG_' + str(lag_num) + ' + '
        else:
            expr = expr + 'TAVG_LAG_' + str(lag_num)

        i += 1

    print('Building model for expr: ' + expr)

    # Build and train the OLSR model on the training data set
    olsr_results = smf.ols(expr, df_train).fit()

    # Store its AIC value
    lag_combinations[combination] = olsr_results.aic

    # Keep track of the best model (the one with the lowest AIC score) seen so far
    if olsr_results.aic < min_aic:
        min_aic = olsr_results.aic
        best_expr = expr
        best_olsr_model_results = olsr_results

    print('AIC=' + str(lag_combinations[combination]))

# Carve out the X, y matrices for the best model using patsy.
# We will use X_test, y_test later for testing the model.
y_test, X_test = dmatrices(best_expr, df_test, return_type='dataframe')
```
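The counter-based string building in the loop above can also be written more compactly with `str.join`. The following is a sketch of an alternative, not the code used in this experiment; `build_expr` is a hypothetical helper name.

```python
def build_expr(combination, target='TAVG', prefix='TAVG_LAG_'):
    """Build a patsy formula such as 'TAVG ~ TAVG_LAG_1 + TAVG_LAG_6'."""
    return target + ' ~ ' + ' + '.join(prefix + str(lag_num) for lag_num in combination)

print(build_expr((1, 6, 12)))  # TAVG ~ TAVG_LAG_1 + TAVG_LAG_6 + TAVG_LAG_12
print(build_expr((3,)))        # TAVG ~ TAVG_LAG_3
```

Joining the terms avoids having to treat the last term specially, which is what the counter `i` is for in the loop.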

Finally, let’s print out the summary of the best OLSR model as per our evaluation criterion. This is the model with the lowest AIC score.

```python
print(best_olsr_model_results.summary())
```

This prints out the following output. I have highlighted a few interesting areas in the output:

Let’s inspect the highlighted sections.

### Choice of model parameters

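Our AIC score based model evaluation strategy has identified a model with the following parameters: the intercept and the coefficients of TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11 and TAVG_LAG_12.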

The other lags, 3, 4, 7, 8, 9 have been determined to not be significant enough to jointly explain the variance of the dependent variable TAVG. For example, we see that TAVG_LAG_7 is not present in the optimal model even though from the scatter plots we saw earlier, there seemed to be a good amount of correlation between the response variable TAVG and TAVG_LAG_7. The reason for the omission might be that most of the information in TAVG_LAG_7 may have been captured by TAVG_LAG_6, and we can see that TAVG_LAG_6 is included in the optimal model.

### Statistical significance of model parameters (the t-test)

The second thing to note is that all parameters of the optimal model, except for TAVG_LAG_10, are individually statistically significant at a 95% confidence level on the two-tailed t-test. The reported p-values for their t-scores are smaller than 0.025. The p-values that statsmodels reports in the P>|t| column are already two-sided, so the threshold at a 95% confidence level is 0.05, and these parameters clear it comfortably.

Figure: The t value and the p-value of the model parameters (Image by Author)

### Joint significance of model parameters (the F-test)

The third thing to note is that all parameters of the model are jointly significant in explaining the variance in the response variable TAVG.

This can be seen from the F-statistic of 1458. Its p-value is 1.15e-272, far below the 0.05 threshold that applies at a 95% confidence level. This probability value is so incredibly tiny that you don’t even need to look up the F-distribution table to verify that the F-statistic is significant. The model is definitely much better at explaining the variance in TAVG than an intercept-only model.

Figure: F-statistic and its p-value. All model parameters are jointly significant (Image by Author)

### The AIC score and the Maximized Log-Likelihood of the fitted model

Finally, let’s take a look at the AIC score of 1990.0 reported by statsmodels, and the maximized log-likelihood of -986.86.

We can see that the model contains 8 parameters (7 time-lagged variables + intercept). So as per the formula for the AIC score:

$$\text{AIC score} = 2 \times \text{number of parameters} - 2 \times \text{maximized log-likelihood} = 2 \times 8 - 2 \times (-986.86) = 1989.72$$

This rounds to 1990, which matches the value reported by statsmodels.

### Testing the model’s performance on out-of-sample data

The final step in our experiment is to test the optimal model’s performance on the test data set. Remember that the model has not seen this data during training.

We will ask the model to generate predictions on the test data set using the following single line of code:

```python
olsr_predictions = best_olsr_model_results.get_prediction(X_test)
```

Let’s get the summary frame of predictions and print out the first few rows.

```python
olsr_predictions_summary_frame = olsr_predictions.summary_frame()
print(olsr_predictions_summary_frame.head())
```

The output looks like this:

Next, let’s pull out the actual and the forecasted TAVG values so that we can plot them:

```python
predicted_temps = olsr_predictions_summary_frame['mean']
actual_temps = y_test['TAVG']
```

Finally, let’s plot the predicted TAVG versus the actual TAVG from the test data set.

```python
fig = plt.figure()
fig.suptitle('Predicted versus actual monthly average temperatures')

predicted, = plt.plot(X_test.index, predicted_temps, 'go-', label='Predicted monthly average temp')

actual, = plt.plot(X_test.index, actual_temps, 'ro-', label='Actual monthly average temp')
plt.legend(handles=[predicted, actual])

plt.show()
```

The plot looks like this:

Figure: Predicted versus actual values of average monthly temperatures (Image by Author)

In the above plot, it might seem like our model is amazingly capable of forecasting temperatures for several years out into the future! However, the reality is quite different. What we are asking the model to do is to predict the current month’s average temperature by considering the temperatures of the previous month, the month before etc., in other words by considering the values of the model’s explanatory variables: TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11 and TAVG_LAG_12, together with the intercept of regression.

We are asking the model to make this forecast for each time period, and we are asking it to do so for as many time periods as the number of samples in the test data set. Each forecast uses the actual, observed temperatures of the preceding months. Thus our model can reliably make only one-month-ahead forecasts. This behavior is entirely expected given that one of the explanatory variables in the model is the previous month’s average temperature value TAVG_LAG_1.
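The difference between one-month-ahead forecasts and forecasts several months out can be sketched with a toy model. Everything here is an assumption made for illustration: a hypothetical single-lag model with made-up coefficients, not the 7-lag model selected above.

```python
# Hypothetical one-lag model: y_t = INTERCEPT + COEF * y_{t-1}
INTERCEPT, COEF = 5.0, 0.9

def one_step_ahead(actuals):
    """Predict each value from the actual previous value, as the test-set evaluation does."""
    return [INTERCEPT + COEF * prev for prev in actuals[:-1]]

def recursive(start, steps):
    """Predict several steps out by feeding each prediction back in as the next lag."""
    preds, prev = [], start
    for _ in range(steps):
        prev = INTERCEPT + COEF * prev
        preds.append(prev)
    return preds

actuals = [40.0, 52.0, 47.0, 55.0]
one_step = one_step_ahead(actuals)    # roughly [41.0, 51.8, 47.3]
multi_step = recursive(actuals[0], 3) # roughly [41.0, 41.9, 42.71]
print(one_step)
print(multi_step)
```

The one-step-ahead predictions are re-anchored to an observed value at every step, whereas the recursive predictions only ever see the first observation. The plot above corresponds to the first kind of forecast.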

This completes our model selection experiment.

The data set is available here.

## Summary

Let’s summarize the important points:

• The AIC score gives you a way to measure the goodness-of-fit of your model, while at the same time penalizing the model for over-fitting the data.
• By itself, an AIC score is not useful. One needs to compare it with the AIC score of other models while performing model selection. A lower AIC score indicates a better balance between goodness-of-fit and model complexity, and hence a lesser tendency to over-fit.
• While performing model selection using the AIC score, one should also run other tests of significance such as the Student’s t-test and the F-test so as to perform a 360 degree assessment of the model’s suitability for the data set under consideration.

### Data set

Monthly average temperature in the city of Boston, Massachusetts (Source: NOAA)

### Papers and books

Akaike H. (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In: Parzen E., Tanabe K., Kitagawa G. (eds) Selected Papers of Hirotugu Akaike. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY. https://doi.org/10.1007/978-1-4612-1694-0_15

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
