An Introduction To Fisher Information: Gaining The Intuition Into A Complex Concept

A brief Introduction to the concept

Consider a random variable X which is assumed to follow some probability distribution f(.), such as the Normal or the Poisson distribution. Suppose also that the function f(.) accepts some parameter θ. Examples of θ are the mean μ of the normal distribution, or the mean event rate λ of the Poisson distribution. The Fisher Information of X measures the amount of information that X contains about the true population value of θ (such as the true mean of the population).

The formula for Fisher Information
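
Written in LaTeX, and reading the formula the way the rest of this article describes it (the variance of the partial derivative of the log-likelihood with respect to θ), it can be stated as:

```latex
I(\theta) \;=\; \mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\,\ell(\theta \mid X = x)\right]
```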

Clearly, there is a lot to take in at one go in the above formula. Indeed, Fisher Information can be a complex concept to understand. So we will explain it using a real world example. Along the way, we’ll also take apart the formula for Fisher Information and put it back together block by block so as to gain insight into why it is calculated the way it is.

An illustrative example

Consider the following data set of 30K+ data points downloaded from Zillow Research under their free to use terms:

Each row in the data set contains a forecast of Year-over-Year percentage change in house prices in a specific geographical location within the United States. This value is in the column ForecastYoYPctChange.

Let’s load the data set into memory using Python and Pandas and let’s plot the frequency distribution of ForecastYoYPctChange.

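```python
import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

# Load the data set into a DataFrame.
# NOTE: 'zillow_forecast.csv' is a placeholder file name (an assumption);
# point it at the CSV you downloaded from Zillow Research.
df = pd.read_csv('zillow_forecast.csv')

# Plot the frequency distribution of ForecastYoYPctChange
plt.hist(df['ForecastYoYPctChange'], bins=1000)
plt.xlabel('YoY % change in house prices in some geographical area of the US')
plt.ylabel('Frequency of occurrence in the dataset')
plt.show()
```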

We see the following frequency distribution plot:

Defining the random variable X

In the above example, ForecastYoYPctChange is our random variable of interest. Thus, X = ForecastYoYPctChange.

The probability distribution of X

Looking at the above frequency distribution plot of ForecastYoYPctChange, we’ll assume that the random variable ForecastYoYPctChange is normally distributed with some unknown mean μ and variance σ². For reference, here is the Probability Density Function (PDF) of such a N(μ, σ²) distributed random variable:

The PDF of ForecastYoYPctChange peaks at the population level mean μ, which is unknown. Incidentally, here is the code that produced the above plot, using μ = 6 and σ = 2 purely for illustration:

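```python
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

# Illustrative parameters only: mean 6, standard deviation 2.
# Plot the PDF between the 1st and 99th percentiles of that distribution.
xlower = norm.ppf(0.01, loc=6, scale=2)
xupper = norm.ppf(0.99, loc=6, scale=2)
x = np.linspace(xlower, xupper, 100)

plt.plot(x, norm.pdf(x, loc=6, scale=2))
plt.xlabel('X')
plt.ylabel('Probability density f(X=x)')
plt.show()
```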

The relationship between Fisher Information of X and variance of X

Now suppose we observe a single value of the random variable ForecastYoYPctChange such as 9.2%. What can be said about the true population mean μ of ForecastYoYPctChange by observing this value of 9.2%?

If the distribution of ForecastYoYPctChange peaks sharply at μ and the probability is vanishingly small at most other values of ForecastYoYPctChange, then common sense suggests that the chance of the observed value of 9.2% being very different from the true mean is also vanishingly small. By implication, the uncertainty about whether the observed value of 9.2% is a ‘good’ estimate of μ is also very small. This holds true for any particular observed value of ForecastYoYPctChange. Therefore, we would expect the Fisher Information contained in ForecastYoYPctChange about the population mean μ to be large.

Conversely, if the distribution of ForecastYoYPctChange is spread out pretty widely around the population mean μ, then the chance of a particular observation of ForecastYoYPctChange such as 9.2% being at or close to μ is small, and therefore in this case the Fisher Information contained in ForecastYoYPctChange about the population mean μ is small.

Clearly, the Fisher Information of X for some population parameter θ (such as the mean μ) is closely tied to the variance of the probability distribution of X around θ: the smaller that variance, the larger the Fisher Information, and vice versa. This connection to variance is reflected in the formula for Fisher Information, which is itself a variance.

So far, we have argued that the Fisher Information of X about the population parameter θ has an inverse relationship with the variance of X around θ. However, it is not equal to the variance of X. Instead, it is equal to the variance of the partial derivative of the log-likelihood with respect to θ. To see why that is, let’s first look at the concepts of Likelihood, log-Likelihood and its partial derivative.

The concept of the Likelihood function

Returning to our data set of house price changes, since we have assumed that ForecastYoYPctChange is normally distributed, the probability (density) corresponding to a specific observation of ForecastYoYPctChange such as 9.2% is as follows:
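
In LaTeX, assuming the usual density of a N(μ, σ²) random variable evaluated at x = 9.2, this is:

```latex
f(X = 9.2 \mid \mu; \sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(9.2 - \mu)^2}{2\sigma^2}\right)
```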

Let’s make a simplifying substitution. We’ll use the following sample variance as a substitute for the variance of the population:
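
Assuming the usual n − 1 (Bessel-corrected) definition, which is the one that is unbiased and also what pandas' `var()` computes by default, the sample variance can be written in LaTeX as:

```latex
S^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad \bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i
```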

It can be shown that S² is an unbiased estimate of the population variance σ². So this is a valid substitution, especially for large samples.

In our house prices data set, the sample variance S² can be computed as follows:

```python
# pandas' var() uses the n-1 denominator (ddof=1) by default, i.e. it returns S^2
S_squared = df['ForecastYoYPctChange'].var()
print('S^2='+str(S_squared))
```

This prints out the following:

`S^2=2.1172115582214155`

Substituting S² for σ² in the PDF of ForecastYoYPctChange , we have:
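
With S² = 2.11721 plugged in, the two constants of the density can be computed directly. The sketch below uses only the standard library, and its names (`s_squared`, `coef`, `k`, `density_at_9_2`) are illustrative rather than taken from the article's code. It should reproduce, to rounding, the values 0.27418 and 0.23616 used in the plotting code later on:

```python
import math

s_squared = 2.11721  # sample variance reported above, rounded to 5 decimals

# Normalising constant of the normal density: 1 / sqrt(2 * pi * sigma^2)
coef = 1.0 / math.sqrt(2.0 * math.pi * s_squared)

# Multiplier of (9.2 - mu)^2 inside the exponent: 1 / (2 * sigma^2)
k = 1.0 / (2.0 * s_squared)

def density_at_9_2(mu):
    """f(X=9.2 | mu; sigma^2 = S^2), viewed as a function of mu."""
    return coef * math.exp(-k * (9.2 - mu) ** 2)

print(f"coef = {coef:.5f}, k = {k:.5f}")
```

In other words, f(X=9.2 | μ; σ²=2.11721) ≈ 0.27418 × exp(−0.23616 × (9.2 − μ)²).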

Notice one important thing about the above equation:

f(X=9.2 | μ; σ²=2.11721) is actually a function of the population mean μ. In this form, as a function of the population parameter μ, we call this function the Likelihood function, denoted by ℒ(μ | X=9.2), or in general ℒ(θ | X=x).

ℒ( θ | X=x) is literally the likelihood of observing the particular value x of X, for different values of the population mean μ.

Let’s plot ℒ( μ | X=9.2) w.r.t. μ :

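```python
import numpy as np
from matplotlib import pyplot as plt

# Candidate values of the population mean mu
mu = np.linspace(-20, 20, 1000)
# Likelihood of mu given X=9.2, with S^2=2.11721:
# 0.27418 = 1/sqrt(2*pi*S^2) and 0.23616 = 1/(2*S^2)
y = 0.27418 * np.exp(-0.23616 * np.power(9.2 - mu, 2))
plt.plot(mu, y)
plt.xlabel('mu')
plt.ylabel('L(mu|X=9.2)')
plt.show()
```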

The Likelihood function peaks at μ =9.2, which is another way of saying that if X follows a normal distribution, the likelihood of observing a value of X=9.2 is maximum when the mean of the population μ = 9.2. That seems kind of intuitive.

The concept of the Log-Likelihood function

Often, one is dealing with a sample of many observations [x_1, x_2, x_3,…,x_n] which form one’s data set. The likelihood of observing that particular data set of values under some assumed distribution of X, is simply the product of the individual likelihoods, in other words, the following:
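
For independent observations, and writing f(x_i; θ) for the density of a single observation (a notation assumed here), this product is:

```latex
\mathcal{L}(\theta \mid x_1, x_2, \ldots, x_n) \;=\; \prod_{i=1}^{n} f(x_i; \theta)
```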

Continuing with our example of house prices data set, the likelihood equation for a data set of YoY % increase values [x_1, x_2, x_3,…,x_n] is the following joint probability density function:
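
Under the article's assumption that each observation is N(μ, σ²) distributed, and additionally assuming the observations are independent, the joint density in LaTeX is:

```latex
\mathcal{L}(\mu \mid x_1, \ldots, x_n; \sigma^2)
\;=\; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\,
\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
```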

We would like to know what value of the true mean μ would maximize the likelihood of observing this particular sample of n observations. This is accomplished by taking the partial derivative of the joint probability w.r.t. μ, setting it to zero and solving for μ.

It is a lot easier to solve the partial derivative if one takes the natural logarithm of the above likelihood function. The logarithm turns the product into a sum, and for many probability distributions the logarithm of the density is a concave function, which makes it easier to find the maximum. Finally, log(x) rises and falls with x, so whatever value maximizes the likelihood also maximizes its logarithm.
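
The claim that taking the logarithm leaves the maximizing value unchanged can be checked numerically. The following is a minimal sketch using only the standard library; the five observations in `sample` are made up for illustration and are not taken from the Zillow data set, and σ² is held at 2.11721 as in the running example:

```python
import math

SIGMA2 = 2.11721                      # variance treated as known, as in the text
sample = [9.2, 7.5, 8.1, 10.4, 6.9]   # hypothetical observations, for illustration only

def likelihood(mu):
    """Product of the individual normal densities."""
    return math.prod(
        math.exp(-(x - mu) ** 2 / (2 * SIGMA2)) / math.sqrt(2 * math.pi * SIGMA2)
        for x in sample
    )

def log_likelihood(mu):
    """Sum of the individual log-densities (the log turns the product into a sum)."""
    return sum(
        -0.5 * math.log(2 * math.pi * SIGMA2) - (x - mu) ** 2 / (2 * SIGMA2)
        for x in sample
    )

# Evaluate both functions on a grid of candidate means from 0.00 to 20.00
grid = [i / 100 for i in range(2001)]
mu_hat_from_l = max(grid, key=likelihood)
mu_hat_from_ll = max(grid, key=log_likelihood)

print(mu_hat_from_l, mu_hat_from_ll, sum(sample) / len(sample))
```

Both searches land on the same grid point, which is the sample mean, the well-known maximum likelihood estimate of μ for a normal sample.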

The logarithm of the Likelihood function is called the Log-Likelihood and is often denoted using the stylized small ‘l’: ℓ(θ | X=x)

For our house prices example, the log-likelihood of μ for a single observed value of X=9.2% and σ² =2.11721 can be expressed as follows:
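
Taking the natural logarithm of 0.27418 × exp(−0.23616 × (9.2 − μ)²) step by step gives the following (a LaTeX sketch using the rounded constants from the plotting code):

```latex
\begin{aligned}
\ell(\mu \mid X = 9.2) &= \ln\!\left(0.27418\; e^{-0.23616\,(9.2 - \mu)^2}\right) \\
&= \ln(0.27418) \;-\; 0.23616\,(9.2 - \mu)^2 \,\ln(e) \\
&= -1.29397 \;-\; 0.23616\,(9.2 - \mu)^2
\end{aligned}
```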

In the above expression, we have made use of a couple of basic rules of logarithms, namely: ln(A*B) = ln(A) + ln(B), ln(A^x) = x*ln(A), and the natural logarithm ln(e) = 1.

As with the Likelihood function, the Log-Likelihood is a function of some population parameter θ (in our example, θ = μ). Let’s plot this log-likelihood function w.r.t. μ:

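```python
import numpy as np
from matplotlib import pyplot as plt

# Candidate values of the population mean mu
mu = np.linspace(-10, 100, 10000)
# Log-likelihood of mu given X=9.2; -1.29397 = ln(0.27418)
y = -1.29397 - 0.23616 * np.power(9.2 - mu, 2)
plt.xlabel('mu')
plt.ylabel('LL(mu|X=9.2;sigma^2=2.11721)')
plt.plot(mu, y)
plt.show()
```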

As with the Likelihood function, the Log-Likelihood achieves its maximum value (in this case, −1.29397) when μ = 9.2%.

Maximization of the Log-Likelihood function: The Maximum Likelihood Estimate of θ

As mentioned earlier, one is often dealing with a sample of many observations [x_1, x_2, x_3,…,x_n] which form one’s sample data set, and one would like to know the likelihood of observing that particular data set under some assumed distribution of X. As we have seen by now, this likelihood (or Log-Likelihood) varies depending on the true mean of the underlying population.

For a set of observed values x = [x_1, x_2, x_3,…,x_n], the log-likelihood ℓ(θ | X=x) of observing x is maximized for that value of θ for which the partial derivative of ℓ(θ | X=x) w.r.t. θ is 0. In notation form:
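
In LaTeX, the condition that the maximum likelihood estimate has to satisfy is:

```latex
\frac{\partial}{\partial\theta}\,\ell(\theta \mid X = x) \;=\; 0
```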

For our house prices example, the maximum likelihood estimate is calculated as follows:
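
Differentiating the single-observation log-likelihood −1.29397 − 0.23616 × (9.2 − μ)² with respect to μ and setting the result to zero gives (a LaTeX sketch using the rounded constants from the plotting code):

```latex
\frac{\partial}{\partial\mu}\,\ell(\mu \mid X = 9.2)
\;=\; \frac{\partial}{\partial\mu}\left[-1.29397 - 0.23616\,(9.2 - \mu)^2\right]
\;=\; 0.47232\,(9.2 - \mu) \;=\; 0
\quad\Longrightarrow\quad \hat{\mu} = 9.2
```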

The partial derivative 0.47232 × (9.2 − μ) is the equation of a straight line in μ with slope −0.47232 and y-intercept 0.47232 × 9.2. This line crosses the horizontal axis at μ = 9.2%, where the partial derivative is zero. Let’s plot this line.

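```python
import numpy as np
from matplotlib import pyplot as plt

# Candidate values of the population mean mu
mu = np.linspace(-10, 100, 10000)
# Partial derivative of the log-likelihood w.r.t. mu; 0.47232 = 2 * 0.23616 = 1/S^2
y = 0.47232 * (9.2 - mu)
plt.xlabel('mu')
plt.ylabel('Partial derivative of Log-Likelihood')
plt.plot(mu, y)
plt.show()
```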

Recall that we have assumed that our data set has a variance σ² = 2.11721. If, instead, we don’t make this assumption, the equation for the maximum likelihood estimate of μ is as follows:
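
Keeping σ² symbolic and differentiating the single-observation normal log-likelihood gives the expression below (a LaTeX sketch under the article's normality assumption). Setting it to zero yields the estimate μ = x:

```latex
\frac{\partial}{\partial\mu}\,\ell(\mu \mid X = x; \sigma^2)
\;=\; \frac{\partial}{\partial\mu}\left[-\tfrac{1}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{(x - \mu)^2}{2\sigma^2}\right]
\;=\; \frac{x - \mu}{\sigma^2}
```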

From the above equation, we can see that the variance σ² of the probability distribution of X has an inverse relationship with the absolute value of the slope of the partial derivative line (the slope is −1/σ²), and therefore also with the variance of the partial derivative function.

In other words, when X has a large spread around the true mean μ, the variance σ² is large, the slope of the partial derivative function is shallow, and therefore the variance of the partial derivative of the log-likelihood function is small. Conversely, when X is tightly spread around the mean μ, the variance σ² is small, the slope of the partial derivative function is steep, and therefore the variance of this function is large.
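
This inverse relationship can be checked with a small simulation. The sketch below uses only the standard library. `score_variance` is an illustrative helper, not part of the article's code, and μ = 9.2 and the three variances are arbitrary choices. It draws values of X from N(μ, σ²), evaluates the partial derivative (X − μ)/σ² at each draw, and compares the variance of those values with 1/σ²:

```python
import random

def score_variance(mu, sigma2, n_draws=50_000, seed=0):
    """Monte Carlo estimate of Var[(X - mu) / sigma^2] for X ~ N(mu, sigma^2)."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    sigma = sigma2 ** 0.5
    scores = [(rng.gauss(mu, sigma) - mu) / sigma2 for _ in range(n_draws)]
    mean = sum(scores) / n_draws
    return sum((s - mean) ** 2 for s in scores) / (n_draws - 1)

for sigma2 in (0.5, 2.11721, 8.0):
    est = score_variance(mu=9.2, sigma2=sigma2)
    print(f"sigma^2 = {sigma2}: simulated variance = {est:.4f}, 1/sigma^2 = {1 / sigma2:.4f}")
```

The simulated variance should shrink as σ² grows, staying close to 1/σ² in each case.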

This observation is exactly in line with the formulation of Fisher Information of X for μ, namely that it is the variance of the partial derivative of the log-likelihood of X=x:

Or in general terms, the following formulation:
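
Written in LaTeX, the statement for μ and its general form for a parameter θ are:

```latex
\begin{aligned}
I(\mu) &= \mathrm{Var}\!\left[\frac{\partial}{\partial\mu}\,\ell(\mu \mid X = x)\right] \\
I(\theta) &= \mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\,\ell(\theta \mid X = x)\right]
\end{aligned}
```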

Let’s use the above concepts to derive the Fisher Information of a Normally distributed random variable.

Fisher Information of a Normally distributed random variable

We have shown that the Fisher Information of a Normally distributed random variable with mean μ and variance σ² can be represented as follows:
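
In LaTeX, using the partial derivative (X − μ)/σ² of the normal log-likelihood with respect to μ:

```latex
I(\mu) \;=\; \mathrm{Var}\!\left[\frac{X - \mu}{\sigma^2}\right]
```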

To find out the variance on the R.H.S., we will use the following identity:
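
Stated for a generic random variable Y (the symbol Y is introduced here only for the identity):

```latex
\mathrm{Var}(Y) \;=\; E\!\left[Y^2\right] - \left(E[Y]\right)^2
```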

Using this formula, we solve the variance as follows:
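
Applying the variance identity Var(Y) = E[Y²] − (E[Y])² with Y = (X − μ)/σ², and pulling the constant 1/σ⁴ out of both terms, gives:

```latex
\begin{aligned}
\mathrm{Var}\!\left[\frac{X - \mu}{\sigma^2}\right]
&= E\!\left[\left(\frac{X - \mu}{\sigma^2}\right)^{2}\right]
 - \left(E\!\left[\frac{X - \mu}{\sigma^2}\right]\right)^{2} \\
&= \frac{1}{\sigma^4}\,E\!\left[(X - \mu)^2\right]
 - \frac{1}{\sigma^4}\,\left(E[X - \mu]\right)^{2}
\end{aligned}
```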

The first expectation E[(X − μ)²] is simply the variance σ². And the second expectation E(X − μ) is zero, as the expected value, a.k.a. the mean, of X is μ.

Therefore, the R.H.S. works out to σ² / σ⁴ = 1/σ², which is the Fisher Information of a normally distributed random variable with mean μ and variance σ².