A Guide to Estimator Efficiency And The Cramér–Rao Bound On Variance

The concept of efficiency and its usage explained using a real world data set

We’ll cover the following topics:

What is efficiency?
What is a statistical estimator, and how is its efficiency defined?
How to calculate the efficiency of an estimator?
How to use efficiency to build better regression models?

Let’s dive in!

What is Efficiency?

In layperson terms:

Efficiency is a measure of how much use you can get out of something for every unit of time, energy or money you have poured into it.

The efficiency of just about anything can be expressed as the ratio of the useful output to the total input:

Efficiency = Useful output / Total input

The general definition of efficiency (Image by Author)

Following are two examples of efficiency:

Efficiency of an electric motor: The efficiency of a direct current motor is the ratio of the power output measured at its shaft to the total DC electrical power pumped into it.
Production efficiency: The production efficiency of an entire country or even the whole planet is often measured as the ratio of its GDP to the population a.k.a. GDP per capita. The per capita GDP of planet Earth was $10,925 in 2020.

Efficiency is usually a dimensionless quantity, but there are exceptions. In some fields, especially economics, efficiency has a dimension, often a monetary one, as in GDP per capita.

When expressed as a dimensionless quantity, efficiency is a real number that varies from 0.0 to 1.0, signifying that the useful output from any device can be at most as high as the total input pumped into the device.

With this background, let’s turn our attention to efficiency as defined in statistical science. We’ll begin by introducing a fundamental device in statistical science, namely the Statistical Estimator.

What is a statistical estimator?

Let’s state an informal definition of what an estimator is:

A statistical estimator is a statistical device used to estimate the true, but unknown, value of some parameter of the population such as the mean or the median. It does this by using the information contained in the data points that make a sample of values.

Examples of an estimator

Given a sample of n values [y_1, y_2,…,y_n], here are some examples (both bad and good) of an estimator of the population mean μ:

A randomly chosen mean (not good!):

μ_cap = y_i, where i is chosen at random from 1, 2, …, n

This estimator estimates the population mean μ by designating a randomly selected value from the sample as μ (Image by Author)

Average of n values (good):

μ_cap = (y_1 + y_2 + … + y_n) / n

This estimator estimates the population mean μ by taking the average of n sample values (Image by Author)

It can be proved that the average-of-n-values estimator has much nicer properties than the random-choice estimator. Specifically, the average-of-n-values estimator has a lower variance than the random-choice estimator, and it is a consistent estimator of the population mean μ.
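The variance claim can be checked informally with a short simulation. The sketch below assumes a synthetic, normally distributed population (the values of mu, sigma, n and trials are arbitrary illustrative choices, not taken from the article) and compares the spread of the two estimators over many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Assumed synthetic population; these values are illustrative only
mu, sigma, n, trials = 5.0, 2.0, 100, 5000

# Each row is one sample of n values drawn from the population
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))

# Random-choice estimator: pick one value at random from each sample
random_choice = samples[np.arange(trials), rng.integers(0, n, size=trials)]

# Average-of-n-values estimator: the mean of each sample
averages = samples.mean(axis=1)

var_random_choice = random_choice.var()
var_average = averages.var()
print('Variance of random-choice estimates:', var_random_choice)
print('Variance of average-of-n estimates :', var_average)
```

Under these assumptions, the random-choice estimates should vary about as much as the population itself (σ²), while the averages should vary roughly n times less.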

The linear model’s estimator

Let’s also look at an estimator used in a commonly used regression model. The following estimator estimates the conditional mean μ, i.e. the mean value that is conditioned upon the regression variables vector X taking on a specific set of observed values [x_1,x_2,…x_m]. μ_cap is the estimated conditional mean calculated using θ_cap which is the vector of the fitted model’s coefficients.
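As a rough illustration of the linear model's estimator, the sketch below fits θ_cap by ordinary least squares on synthetic data and then computes μ_cap for one specific row of observed values. The data, the true coefficients and the choice of least squares as the fitting method are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: an intercept column plus 2 regression variables
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
theta_true = np.array([1.0, 2.0, -0.5])  # hypothetical true coefficients
y = X @ theta_true + rng.normal(scale=0.1, size=200)

# Fit the coefficient vector theta_cap by ordinary least squares
theta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated conditional mean for one specific set of observed values
x_new = np.array([1.0, 0.3, -1.2])
mu_cap = x_new @ theta_cap

print('theta_cap =', theta_cap)
print('mu_cap    =', mu_cap)
```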

Now let’s return our attention to the topic at hand: Efficiency.

Just as with everything else, it is possible to calculate the efficiency of a statistical estimator.

Efficiency of an estimator

Consider an estimator T that is designed to estimate (predict) some population parameter θ. We just reviewed a few examples of T and θ. For example, T=average-of-n-values estimator of population mean μ i.e. θ=μ. The efficiency of such an estimator T is expressed as the ratio of two variances, as follows:

Efficiency(T) = (Cramér–Rao lower bound on the variance of T) / Var(T)

With the above mental picture in place, it should be easy to see that if you were to design two different types of estimators T1 and T2 for the same population parameter θ, then it is possible (indeed quite likely) that they would each exhibit a different characteristic variance. Suppose that Var(T1) > Var(T2). Now, suppose someone comes up with a third kind of estimator T3 for θ such that Var(T3) is even lower than that of T2, and so on. One might reasonably wonder whether, for a given population parameter θ, there is a lower bound on the variance exhibited by an estimator of θ. It turns out that under certain conditions there is such a lower bound, and it’s called the Cramér–Rao bound.

Turning our attention back to the equation for efficiency, we see that the numerator in the above equation is the Cramér–Rao bound.

If you design an estimator T whose actual variance equals the Cramér–Rao bound, the efficiency of your estimator is a perfect 1.0. In all other cases, the efficiency of an estimator lies in the interval [0, 1.0).

Efficiency of an unbiased estimator

If the estimator is unbiased, the Cramér–Rao bound is the reciprocal of the Fisher Information I(T(θ)) of the estimator.

Thus, for an unbiased estimator T of some population parameter θ, the efficiency of T(θ) is expressed as:

Efficiency(T) = (1 / I(T(θ))) / Var(T)

Efficiency of an unbiased estimator T of population parameter θ (Image by Author)

Fisher Information can be a complex concept to take in. We’ll describe it in the following way:

Fisher Information

Suppose you are working with a random variable T which is assumed to follow some probability distribution f(.), such as the Normal or the Poisson distribution. Suppose that the function f(.) accepts some parameter θ. Examples of θ are the mean μ of the normal distribution, or the mean event rate λ of the Poisson distribution. Then the Fisher Information of T provides a way to measure the amount of information that T contains about the true population value of θ (such as the true mean of the population).
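One way to make this concrete is the standard characterization of Fisher Information as the variance of the score, i.e. the derivative of the log-likelihood with respect to θ. That definition is not spelled out in this section, so treat the sketch below as an illustration under stated assumptions: a normal distribution with known σ, with arbitrary values for mu, sigma and n.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma, n = 5.0, 2.0, 100  # assumed illustrative values

# Score of one observation y ~ N(mu, sigma^2) with respect to mu:
#   d/dmu log f(y; mu) = (y - mu) / sigma^2
y = rng.normal(loc=mu, scale=sigma, size=200_000)
score = (y - mu) / sigma**2

# Fisher Information is the variance of the score
fisher_one_obs = score.var()

# Information adds up over n independent observations
fisher_n_obs = n * fisher_one_obs

print('Estimated I(mu), one observation:', fisher_one_obs, '| theory:', 1 / sigma**2)
print('Estimated I(mu), n observations :', fisher_n_obs, '| theory:', n / sigma**2)
```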

Three different ways of understanding the concept of Estimator Efficiency

The efficiency of an estimator is a measure of more than one aspect of its characteristics. Following are three related ways of looking at the efficiency of an estimator:

Estimator efficiency as a measure of its precision

The efficiency of an estimator is a measure of how ‘tight’ its estimates are around the true population value of the parameter that it is estimating, as compared to a perfectly efficient estimator. A perfectly efficient estimator is one whose variance is equal to the Cramér–Rao bound for that class of estimators. Thus, the notion of efficiency is directly based upon the degree of variation in the estimator’s predictions. This concept of variance of an estimator’s predictions is a very important one and we’ll soon illustrate how to calculate it using a real world data set.

Estimator efficiency as a determinant of minimum sample size required

The efficiency of an estimator is also a measure of how many (or how few, depending on your perspective) data points you would need so as to achieve the desired level of estimation quality. The quality of estimation can be measured using a variety of ways. One popular measure is a loss function such as the Mean Squared Error (MSE). The idea here is that a highly efficient estimator will require a smaller sized sample than its lower efficiency brethren to generate predictions at or below the desired threshold of MSE.

Since getting data is always an expensive affair, all other things being approximately the same, it can help to get your hands on a highly efficient estimator instead of chasing the biggest data set available for the problem.

Estimator efficiency as a way to compare two estimators

The efficiency property also gives you a way to compare the estimation precision (and accuracy) of two competing estimators for the same problem and the same sample data set. Alternatively, the estimator’s efficiency gives the modeler a means to determine how much bigger (or smaller) the sample size needs to be if their estimator of choice needs to match the precision (or accuracy; remember that they are not the same thing!) of a competing estimator.
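To illustrate the sample-size interpretation, the sketch below compares two estimators of the mean of a normal population: the sample mean and the sample median. The choice of the median as the competitor and all numeric values are assumptions made for this example. Because the variance of both estimators shrinks roughly as 1/n, the ratio of their variances indicates how many more data points the less efficient one needs.

```python
import numpy as np

rng = np.random.default_rng(7)

mu, sigma, n, trials = 0.0, 1.0, 101, 20000  # assumed illustrative values

# Each row is one sample of n values from a normal population
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

# Relative efficiency of the median with respect to the mean
relative_efficiency = var_mean / var_median

# Approximate sample size the median needs to match the precision
# that the mean achieves with n data points
n_needed_by_median = n / relative_efficiency

print('Relative efficiency of median vs mean:', relative_efficiency)
print('Sample size needed by the median     :', n_needed_by_median)
```

For normally distributed data the relative efficiency should come out somewhere around 2/π ≈ 0.64, meaning the median needs roughly one and a half times as many data points as the mean.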

An important special case

The efficiency of two estimators can be compared by simply comparing the variance of the respective estimator’s predictions, i.e. the one with a lower variance is considered to be more efficient, provided the following conditions are satisfied:

The two estimators are used to predict the same parameter of the population. For example, both are estimators of the population mean.
The two estimators belong to the same class, i.e. the predictions produced by the two estimators follow the same probability distribution. For example, the estimates produced by both estimators are Poisson distributed. In this situation, both estimators have the same Fisher Information for the population parameter that they are estimating.
Both estimators are unbiased estimators of the population parameter that they are estimating. In this situation, the reciprocal of their Fisher Information is the Cramér–Rao bound on variance, in turn making the Cramér–Rao bound on variance the same for both estimators.

When the above three conditions hold true, the numerator of the efficiency equation, namely the lower bound on variance, is identical for both estimators. Therefore, the efficiency of such estimators can be compared by simply comparing the variance of their respective predictions.

Thus, we have the following important result:

Among a group of unbiased estimators whose predictions of some population parameter θ follow identical probability distributions, the estimator whose predictions have the least variance is the most efficient estimator.

We shall now look at how to calculate the numerator and the denominator of the efficiency equation for the average-of-n-values estimator.

Let’s start with the denominator: the variance of the estimator’s predictions.

What is meant by the variance of an estimator and how to calculate it?

The concept of variance of an estimator’s predictions is best understood using the following thought experiment:

Suppose you wish to estimate the mean count of some bacteria per ml of seawater at some public beach during the summer months. To do so, you collect 100 water samples at the beach at different times of the day and measure the bacterial count in each sample. This is your sample data set of bacterial counts: [y_1, y_2, …,y_100]. Next, you decide to use the average-of-n-values estimator to estimate the mean bacterial count, and you use this observed sample mean y_bar as your estimate of the population mean μ. In summary, you are using the average-of-n-values sample mean y_bar as the estimator of the population mean μ.

But now suppose a friend of yours collects another set of water samples at 100 randomly selected places on the beach. They would get a second set of 100 bacterial counts: [y_1, y_2, …,y_100] and another sample mean y_bar_2. If 200 people repeat this procedure, they will end up with 200 sample means y_bar_1, y_bar_2,…,y_bar_200 among themselves. These 200 sample means would themselves be distributed (approximately normally) around the true population mean μ. Therefore, one would be able to calculate the variance of these sample means.

Thus, the average-of-n-values estimator of the population mean μ is itself a random variable that follows a probability distribution that has both a mean and a variance associated with it.
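The thought experiment can be sketched in a few lines. The code below assumes, purely for illustration, that bacterial counts are Poisson distributed with a hypothetical true mean of 35 per water sample. Neither the distribution nor the value comes from real beach data.

```python
import numpy as np

rng = np.random.default_rng(11)

true_mean_count = 35.0  # hypothetical true mean bacterial count per sample
people, n = 200, 100

# Each person collects n water samples; counts are modelled here as Poisson
counts = rng.poisson(lam=true_mean_count, size=(people, n))

# One sample mean y_bar per person
y_bars = counts.mean(axis=1)

print('Mean of the 200 sample means    :', y_bars.mean())
print('Variance of the 200 sample means:', y_bars.var())
print('Theoretical variance (lambda/n) :', true_mean_count / n)
```

The 200 values in y_bars cluster around the assumed true mean, and their spread is what is meant by the variance of the estimator.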

The average-of-n-values estimator is an unbiased estimator of the population mean, i.e. its expected value is the population mean μ. It is also a consistent estimator, meaning that its prediction converges to the population mean μ as n → N, where N is the size of the entire population. In the beach example, N can safely be considered infinite.

It can also be shown that the variance of the predictions of the average-of-n-values estimator is σ²/n, where σ² is the variance of the underlying population of values that we are dipping into so as to build our sample of size n.
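For readers who want the missing step, here is a short derivation of the σ²/n result. It assumes the n values y_1, …, y_n are drawn independently from a population with variance σ², which is the situation that sampling with replacement is meant to approximate.

```latex
\begin{aligned}
\operatorname{Var}(\bar{y})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)
   = \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^{n} y_i\right) \\
  &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(y_i)
   \qquad \text{(the } y_i \text{ are independent)} \\
  &= \frac{1}{n^2}\cdot n\,\sigma^2
   = \frac{\sigma^2}{n}
\end{aligned}
```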

In fact, the concept of estimator variance is so central to the computation of efficiency, that we will illustrate it with a real world example.

A real world example of estimator variance

We will use the following data set of 30K+ data points downloaded from Zillow Research under their free to use terms:

Forecast of Year-over-Year percentage change in house prices (Source: Zillow Research)

Each row in the data set contains a forecast of Year-over-Year percentage change in house prices in a specific geographical location within the United States. This value is in the column ForecastYoYPctChange.

Our goal is to estimate the mean forecast of Year-over-Year percentage change in house prices across the United States, i.e. the population mean μ.

Suppose we did not have access to this complete data set of 30K rows all at once. Instead we happened to have access to only 100 randomly selected locations. And we need to use the 100 point sample to estimate the mean forecast of the Year-over-Year percentage change in house prices across the United States.

The following Python code illustrates this task.

We will load the data set into memory.
Next, we will randomly select 100 data points with replacement. The ‘with replacement’ technique ensures that each data point is selected independently of every other point. This technique can result in duplicates in our sample, but when the population of values is a large one, the chance of duplicates is minimal. On the upside, sampling with replacement is required to make the statistical math work out nicely.
Finally, we will use the average-of-n-values estimator to estimate the population mean μ.

import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm


#Load the data file
df = pd.read_csv('zhvf_uc_sfrcondo_tier_0.33_0.67_month.csv', header=0, parse_dates=['ForecastedDate'])

#Randomly select 100 data points with replacement
df_sample = df.sample(n=100, replace=True)

#Print the mean of the sample. This is our estimate of the population mean mu
print('Estimate of population mean mu='+str(df_sample['ForecastYoYPctChange'].mean()))

Now, let us repeat the above procedure 100 times to yield 100 independent estimates of the population mean μ. We’ll plot these 100 means to see what kind of distribution the predictions have:

means = []
for i in range(100):
    df_sample = df.sample(n=100, replace=True)
    means.append(df_sample['ForecastYoYPctChange'].mean())

#Plot the distribution
plt.hist(means, bins=25)
plt.show()

We see the following plot:

Frequency distribution of sample means (Image by Author)

If we continue this practice of drawing samples of size n (=100), we’ll find that the frequency distribution will start peaking at the true population mean μ.

Here’s the frequency distribution of 10,000 sample means:

We have made the following important observation:

The average-of-n-values estimator is itself a random variable whose predictions follow a probability distribution having a mean and a variance.

If the estimator is unbiased, the mean of its predictions will converge to the true population mean μ as the number of predictions tends to infinity:

E(μ_cap) = μ

If the estimator is unbiased, the expected value of its predictions μ_cap equals the population mean μ (Image by Author)

Our interest is in the variance of the estimator’s predictions.

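Here’s the variance of the average-of-n-values estimator’s 10,000 predictions, i.e. the means list obtained by running the sampling loop above with range(10000) instead of range(100):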

print('Variance of the estimator='+str(np.var(means)))

It prints out the following:

Variance of the estimator=0.020872772633239996

The above deep dive into variance has also yielded us an unexpected dividend. It has enabled us to estimate the variance σ² of the population of house price change forecasts. Recollect that the variance of the average-of-n-values estimator is σ²/n, where σ² is the variance of the underlying population, and n=sample size=100. So we can estimate the variance of the population to be 2.08728.
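This back-calculation can be sanity-checked on data where the population variance is known. The sketch below uses a synthetic stand-in for the ForecastYoYPctChange column (the values are assumed, not the real Zillow data) and the hypothetical names df_demo and demo_means so as not to clash with the variables used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Synthetic stand-in for the Zillow column (assumed values, not the real data)
df_demo = pd.DataFrame(
    {'ForecastYoYPctChange': rng.normal(loc=4.0, scale=1.5, size=30000)}
)
population_variance = df_demo['ForecastYoYPctChange'].var(ddof=0)

n = 100
values = df_demo['ForecastYoYPctChange'].to_numpy()

# Draw 10,000 samples of size n with replacement and record each sample mean
idx = rng.integers(0, len(values), size=(10000, n))
demo_means = values[idx].mean(axis=1)

# Variance of the estimator, scaled back up by n, estimates the population variance
estimated_population_variance = n * np.var(demo_means)

print('Actual population variance   =', population_variance)
print('n * variance of sample means =', estimated_population_variance)
```

The two printed numbers should agree to within a few percent, the difference being simulation noise.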

Circling back…

Let’s circle back to the equation of efficiency of an estimator T that produces unbiased estimates of some population parameter θ:

Efficiency(T) = (1 / I(T(θ))) / Var(T)

So far, we have gotten some insight into the concept of variance in the estimator’s predictions, namely, the denominator in the above equation.

To calculate the numerator, we need to know the Fisher Information for the estimator in question. To know the Fisher Information, we need to know the probability distribution of the estimator’s predictions.

For the average-of-n-values estimator that we have been using to estimate the YoY % change in house prices, we know the following:

The predictions of the average-of-n-values estimator are approximately normally distributed when the sample size n is large (this follows from the Central Limit Theorem).
The average-of-n-values estimator generates an unbiased estimate of the population mean μ. Therefore, the mean of the average-of-n-values estimator’s predictions is simply the population mean μ.
The variance of the estimator’s predictions is σ²/n where σ² is the variance of the underlying population of price changes and n is the size of the sample over which the average is taken. In our example, n=100.

Therefore, we can state the following about the probability distribution of the average-of-n-values estimator of μ:

μ_cap ~ N(μ, σ²/n)

The average-of-n-values estimator’s predictions are normally distributed with mean equal to the population mean μ and variance equal to the population variance σ² scaled by sample size n (Image by Author)

It can be proved that the Fisher Information of an estimator of unknown population mean μ that is normally distributed and has a known variance σ², is simply 1/σ².

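The predictions of the average-of-n-values estimator are normally distributed with variance σ²/n rather than σ². Substituting σ²/n for σ² in the above result, the Fisher Information of the average-of-n-values estimator of the population mean μ is 1/(σ²/n) = n/σ².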

Recollect once again that the variance of this estimator’s predictions is σ²/n.

Therefore, the efficiency of the average-of-n-values estimator of population parameter μ is:

Efficiency(μ_cap) = (1 / (n/σ²)) / (σ²/n) = (σ²/n) / (σ²/n) = 1.0

Efficiency of the average-of-n-values estimator of the population mean μ is a perfect 1.0 (Image by Author)

This is an important result. It shows that the average-of-n-values estimator, for all of its simplicity, is an efficient estimator.

The average-of-n-values estimator of the population mean μ is an efficient estimator.
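The result can also be checked empirically. The sketch below is a minimal illustration under assumed values: a normal population with known σ, and a hypothetical helper named efficiency that simply forms the ratio from the efficiency equation.

```python
import numpy as np

def efficiency(cramer_rao_bound, estimator_variance):
    """Ratio of the lower bound on variance to the estimator's actual variance."""
    return cramer_rao_bound / estimator_variance

rng = np.random.default_rng(2024)
mu, sigma, n, trials = 4.0, 1.5, 100, 20000  # assumed illustrative values

# Variance of the average-of-n-values estimator across many repeated samples
sample_means = rng.normal(loc=mu, scale=sigma, size=(trials, n)).mean(axis=1)
estimator_variance = sample_means.var()

# Cramér–Rao bound for an unbiased estimator of mu: 1 / (n / sigma^2)
fisher_information = n / sigma**2
cramer_rao_bound = 1.0 / fisher_information

eff = efficiency(cramer_rao_bound, estimator_variance)
print('Empirical efficiency of the sample mean:', eff)
```

Because the estimator's variance is itself estimated from a finite number of trials, the printed value will hover around 1.0 and may land slightly above or below it.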

Citations and copyrights

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222, 309–368. http://doi.org/10.1098/rsta.1922.0009

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
