Estimator Bias, And The Bias-Variance Tradeoff

Plus, how to compare estimators based on their bias, variance and mean squared error

A statistical estimator can be evaluated on the basis of how biased it is in its predictions, how consistent its performance is, and how efficiently it can make predictions. And the quality of your model’s predictions is only as good as the quality of the estimator it uses.

In this section, we’ll cover the property of bias in detail and learn how to measure it.

The bias of an estimator happens to be joined at the hip with the variance of the estimator’s predictions via a concept called the bias-variance tradeoff, and so we’ll learn about that concept too.

We’ll close with a discussion of the Mean Squared Error of an estimator and its applicability to regression modeling, and we’ll show how to evaluate various estimators of the population mean using their bias, variance, and Mean Squared Error.


What is a Statistical Estimator?

An estimator is any procedure or formula that is used to predict or estimate the value of some unknown quantity such as say, your flight’s departure time, or today’s NASDAQ closing price.

Let’s state an informal definition of what an estimator is:

A statistical estimator is a device used to estimate the true, but unknown, value of some parameter of the population, such as the mean or the median. It does this by using the information contained in the data points that make up a sample of values.

In our daily lives, we tend to employ various types of estimators without even realizing it. Following are some types of estimators that we commonly use:

Estimators based on good (or bad!) Judgement

You ask your stock broker buddy to estimate how high the price of your favorite stock will go in a year’s time. In this case, you are likely to get an interval estimate of the price, instead of a point estimate.

Estimators based on rules of thumb, and some calculation

You estimate the efforts needed to complete your next home improvement project using some estimation technique such as the Work Breakdown Structure.

Estimators based on polling

You ask an odd number of your friends who they think will win the next election, and you accept the majority result.

In each case, you wish to estimate a parameter you don’t know.

In statistical modeling, the mean, especially the mean of the population, is a fundamental parameter that is often estimated.

Let’s look at a real life data sample.

Following is a data set of surface temperatures in the North Eastern Atlantic ocean at a certain time of year:

This data set contains many missing readings. We’ll clean it up by loading it into a Pandas DataFrame and removing all the rows containing a missing temperature value:

import pandas as pd

# Load the data set, parsing the 'time(UTC)' column as datetime values
df = pd.read_csv('NE_Atlantic_Sea_Surface_Temperatures.csv', header=0, infer_datetime_format=True, parse_dates=['time(UTC)'])

# Drop all rows that contain a missing (NaN) value
df = df.dropna()
print(df)

We get the following output:

The cleaned up data set with all NaN rows removed (Image by Author)

At nearly 800k data points (n=782,668), this data set constitutes a very large data sample. Often, you’ll have to make do with sample sizes that are in the teens or hundreds, and occasionally in the thousands.

Let’s print out the frequency distribution of temperature values in this sample:

from matplotlib import pyplot as plt

# Plot a 50-bin histogram of the temperature values
df['sea_surface_temperature(degree_C)'].hist(bins=50)
plt.show()

We get the following plot:

A histogram of temperature values in Northeast Atlantic ocean (Image by Author)

Now, suppose we wish to estimate the mean surface temperature of the entire North Eastern Atlantic, in other words, the population mean µ.

Here are three possible estimators for the population mean:

  • Estimator #1: We could take the average of the minimum and maximum temperature value in the above sample:
y_min = df['sea_surface_temperature(degree_C)'].min()
y_max = df['sea_surface_temperature(degree_C)'].max()
print('y_min='+str(y_min) + ' y_max='+str(y_max))
# The average of the smallest and the largest value in the sample (the mid-range)
print('Estimate #1 of the population mean='+str((y_max+y_min)/2))

We get the following output:

y_min=0.28704015899999996 y_max=15.02521203
Estimate #1 of the population mean=7.6561260945
  • Estimator #2: We could choose a random value from the sample and use it as our estimate of the population mean:
# Pick a single row at random and use its temperature value as the estimate
rand_temp = df.sample()['sea_surface_temperature(degree_C)'].iloc[0]
print('Estimate #2 of the population mean='+str(rand_temp))

We get the following output:

Estimate #2 of the population mean=13.5832207
  • Estimator #3: We could use the following estimator, which averages out all temperature values in the data sample:
An estimator for the population mean (Image by Author)
# The sample mean: the average of all n temperature values in the sample
y_bar = df['sea_surface_temperature(degree_C)'].mean()
print('Estimate #3 of the population mean='+str(y_bar))

We get the following output:

Estimate #3 of the population mean=11.94113359335031

Which estimator should we use? It can be shown that the third estimator, y_bar, the average of n values, provides an unbiased estimate of the population mean. But then, so does the second one, and, for a population whose distribution is symmetric about its mean, so does the first!

In any case, this is probably a good point to understand a bit more about the concept of bias.

Estimator Bias

Suppose you are shooting basketballs. While some balls make it through the net, you find that most of your throws are hitting a point below the hoop. In this case, whatever technique you are using to estimate the correct angle and speed of the throw is underestimating the (unknown) correct values of angle and speed. Your estimator has a negative bias.

A biased dunking technique that seems to more often than not, undershoot the basket located at the center of the figure (Image by Author)

With practice, your throwing technique improves, and you are able to dunk more often. And from a bias perspective, you begin overshooting the basket approximately just as often as undershooting it. You have managed to reduce your estimation technique’s bias.

A less biased, more ‘balanced’ dunking technique, where the misses are just as often on each side of the basket at the center of the figure (Image by Author)

The bias-variance trade-off

One aspect that might be apparent from the above two figures is that in the first figure, although the bias is large, the ‘dispersion’ of the missed shots is smaller, leading to a lower variance in outcomes. In the second figure, the bias has undoubtedly been reduced because the missed shots are spread more evenly around the basket, but that has also led to a higher spread, a.k.a. a higher variance.

The first technique appears to have a larger bias and a smaller variance, and it is the other way around for the second technique. This is no coincidence, and it can be proved (in fact, we will prove it later!) that there is a direct give and take between the bias and the variance of your estimation technique.
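To make this give and take concrete, here is a minimal simulation sketch, using synthetic, normally distributed data and a deliberately contrived ‘shrunken’ estimator introduced purely for illustration (none of this comes from the temperature data set). The shrunken estimator pulls every sample mean toward zero, so it picks up a bias, but the spread of its estimates is visibly smaller:

import numpy as np

# Illustration only: compare an unbiased estimator (the sample mean) with a
# deliberately biased, lower-variance estimator (the sample mean shrunk toward 0),
# on synthetic data drawn from a normal population with a known mean.
rng = np.random.default_rng(42)
mu, sigma, n, n_trials = 1.0, 5.0, 25, 50000

plain, shrunk = [], []
for _ in range(n_trials):
    sample = rng.normal(mu, sigma, size=n)
    plain.append(sample.mean())          # the ordinary sample mean (unbiased)
    shrunk.append(0.8 * sample.mean())   # shrinking toward 0 adds bias but reduces spread

for name, estimates in [('plain sample mean', plain), ('shrunken sample mean', shrunk)]:
    estimates = np.asarray(estimates)
    print(f'{name}: bias={estimates.mean() - mu:+.3f}, variance={estimates.var():.3f}')

Running this shows the shrunken estimator trading a small bias for a noticeably lower variance, which is exactly the kind of exchange the two figures above depict.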

How to calculate an estimator’s bias

Let’s follow through with the basketball example. We can see that the location of the basket (orange dot at the center of the two figures) is a proxy for the (unknown) population mean for the angle of throw and speed of throw that will guarantee a dunk. The misses (the blue dots) are a proxy for your technique’s estimation of what the population mean values are for the angle and speed of throw.

In statistical parlance, each throw is an experiment that produces an outcome. The value of the outcome is the location of the ball (the blue dot) on the vertical plane that contains the basket.

If you assume that the outcome of each throw is independent of the previous ones (this is pretty much impossible in real life, but let’s play along!), we can define n independent, identically distributed random variables y_1, y_2, y_3,…y_n to represent your n throws.

Let’s recollect our average-of-n-sample-values estimator:

An estimator for the population mean (Image by Author)

Note that this mean y_bar relates to our sample of n values, i.e. n ball throws, or n ocean surface temperatures, etc. Switching to the ocean temperatures example, if we collect another set of n ocean temperature values and average them out, we’ll get a second value for the sample mean y_bar. A third sample of size n will yield a third sample mean, and so on. So, the sample mean y_bar is itself a random variable. And just like any random variable, y_bar has a probability distribution and an expected value, denoted by E(y_bar).
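Here is a quick sketch of this idea using the temperature data (it assumes the cleaned-up DataFrame df from earlier is still in scope): we repeatedly draw samples of a fixed size n and record each sample’s mean. The histogram of those means is the sampling distribution of y_bar:

import numpy as np
from matplotlib import pyplot as plt

# Assumes the cleaned-up DataFrame df from earlier is in scope.
temps = df['sea_surface_temperature(degree_C)']
n, n_samples = 500, 2000

# Draw n_samples independent random samples of size n and record each sample mean
sample_means = [temps.sample(n).mean() for _ in range(n_samples)]

plt.hist(sample_means, bins=30)
plt.xlabel('sample mean of n=500 temperature readings')
plt.ylabel('frequency')
plt.show()

print('Mean of the sample means:', np.mean(sample_means))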

We are now in a position to define the bias of the estimator y_bar for the population mean µ as follows:

The bias of the estimator y_bar for the population mean µ is the difference between the expected value of the sample mean y_bar, and the population mean µ. Here is the formula:

The bias of the estimator for the population mean (Image by Author)

In general, given a population parameter θ (e.g. mean, variance, median etc.), and an estimator θ_cap of θ, the bias of θ_cap is given as the difference between the expected value of θ_cap and the actual (true) value of the population parameter θ, as follows:

The bias of the estimator θ_cap for the population parameter θ (Image by Author)
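As a numerical illustration of this definition (a sketch using a synthetic, right-skewed population rather than the temperature data), we can approximate E(θ_cap) by averaging an estimator’s output over many simulated samples and then subtracting the true parameter value. Here the ‘average of the minimum and maximum’ estimator from earlier picks up a clear positive bias, while the sample mean does not:

import numpy as np

# Monte Carlo approximation of bias: B(theta_cap) ~= average(theta_cap) - theta.
# Synthetic, right-skewed population: exponential with true mean 1.0.
rng = np.random.default_rng(0)
true_mean, n, n_trials = 1.0, 100, 20000

midrange_estimates, mean_estimates = [], []
for _ in range(n_trials):
    sample = rng.exponential(scale=true_mean, size=n)
    midrange_estimates.append((sample.min() + sample.max()) / 2)  # estimator #1 style
    mean_estimates.append(sample.mean())                          # estimator #3 style

print('Approximate bias of the midrange estimator:', np.mean(midrange_estimates) - true_mean)
print('Approximate bias of the sample mean       :', np.mean(mean_estimates) - true_mean)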

The sample mean as an unbiased estimator of the population mean

We will now show that the average-of-n-sample-values estimator y_bar that we saw earlier, demonstrates a zero bias in its ability to predict the population mean µ.

Using the expression of bias, the bias of y_bar is given by:

The bias of the estimator for the population mean (Image by Author)

Now, let’s calculate E(y_bar):

Expected value of sample mean y_bar (Image by Author)

The random variables y_1, y_2, y_3,…y_n are all identically distributed around the population mean µ. Therefore, each one of them has an expected value of µ:

The sample mean y_bar has zero bias (Image by Author)

A zero bias estimator isn’t necessarily a great thing. In fact, recollect the following two estimators of the mean surface temperature which we had also considered:

Two more estimators of the population mean (Image by Author)

Since the expected value of each one of the random variables y_i is the population mean µ, estimator (2), which reports a single randomly chosen y_i, has a bias B(.)=E(y_i)-µ=µ-µ=0. Estimator (1), the average of the sample minimum and maximum, is also unbiased whenever the population distribution is symmetric about µ, since the downward bias of the minimum and the upward bias of the maximum then cancel each other out.

But common sense says that estimators #(1) and #(2) are clearly inferior to the average-of-n-sample-values estimator #(3). So there must be other performance measures by which we can evaluate the suitability of the average-of-n-sample-values estimator and show that it is superior to the others.

One of those measures is the Mean Squared Error (MSE).

Mean Squared Error of an estimator, and its applicability to Regression Modeling

To most people who deal with regression models, the Mean Squared Error is a familiar performance measure. Several estimators, such as the Ordinary Least Squares (OLS) estimator for linear models and many neural network estimators, seek to minimize the mean squared error between the training data and the model’s predictions.

Consider the general equation of a regression model:

The general equation of a regression model having an additive residual error (Image by Author)

where:
y_obs = [y_1, y_2, y_3,…y_n] is the training data sample,
X is the matrix of regression variables,
β_cap is the vector of fitted coefficients, and
ε is the vector of residual errors of the regression.

f(X,β_cap)=y_cap is the vector of fitted values, i.e. the model’s predictions, corresponding to the vector of observed values y_obs. So, the above model equation can be written concisely as follows:

Regression model with an additive residual error (Image by Author)

The mean-squared-error of the estimating function y_cap = f(.) is:

Mean squared error of the estimation function f(X,β_cap) (Image by Author)

A regression model’s estimation function will usually, but not always, try to minimize the MSE.
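As a small, self-contained sketch of this computation (using synthetic data and a plain least-squares fit via NumPy, not the sea surface temperature model), here is the MSE of a fitted linear regression:

import numpy as np

# Fit y = b0 + b1*x by Ordinary Least Squares on synthetic data, then compute the
# Mean Squared Error between the observed values y_obs and the fitted values y_cap.
rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y_obs = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])                   # regression matrix with an intercept column
beta_cap, *_ = np.linalg.lstsq(X, y_obs, rcond=None)   # fitted coefficients
y_cap = X @ beta_cap                                   # fitted values f(X, beta_cap)

mse = np.mean((y_obs - y_cap) ** 2)
print('beta_cap =', beta_cap)
print('MSE of the fitted values =', mse)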

The equation for MSE can also be written in the language of Expectation, as follows:

Mean squared error of the estimation function f(X,β_cap) (Image by Author)

Notice that we have also reversed the order of the terms in the brackets, just to maintain the convention.

The above two equations of MSE are equivalent as can be seen from the following illustration:

Formula for MSE expressed as an expectation (Image by Author)

From the above figure, we see that the commonly used formula for the Mean Squared Error simply assigns an equal probability of 1/n to each squared difference (y_i-y_cap_i)², i.e. it treats the n squared differences as equally likely outcomes.
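A tiny check makes the equivalence easy to see, assuming nothing more than an arbitrary pair of observed and fitted vectors:

import numpy as np

y_obs = np.array([2.1, 3.4, 5.0, 4.2, 6.3])   # arbitrary observed values
y_cap = np.array([2.0, 3.9, 4.6, 4.1, 6.0])   # arbitrary fitted values

sq_diff = (y_obs - y_cap) ** 2
n = len(sq_diff)

mse_plain = sq_diff.sum() / n                      # the usual (1/n) * sum formula
mse_as_expectation = (sq_diff * (1.0 / n)).sum()   # E[squared difference], each with probability 1/n

print(mse_plain, mse_as_expectation)               # the two values are identical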

Let’s return to the general formula for the MSE of an estimator:

Mean squared error of the estimation function f(X,β_cap) (Image by Author)

Suppose we are using the average-of-n-sample-values estimator y_bar. So y_cap=y_bar, and the quantity being estimated (the role played by y_obs above) is now the true, but unobserved, population mean µ. Therefore, the MSE of the sample mean y_bar can be expressed as follows:

Mean Squared Error of the sample mean Y_bar (Image by Author)

To calculate the MSE of y_bar, we will use the following result from expectation theory:

Relation between Expectation and Variance of a random variable Y (Image by Author)

Applying the above formula to the R.H.S. of the MSE expression, we get the following result:

MSE of sample mean Y_bar (Image by Author)

Let’s inspect the R.H.S. of the above equation. The first term, the one that gets squared, is E(y_bar-µ), which, as we know, is the bias of the sample mean y_bar.

The second term on the R.H.S., Variance(y_bar-µ), is simply Variance(y_bar), since the population mean µ is a constant. Thus, the R.H.S. is the sum of the bias (squared) and the variance of y_bar.

We can generalize this finding as follows:

The Mean Squared Error of the estimator θ_cap of any population parameter θ is the sum of the square of the bias B(θ_cap) of the estimator w.r.t. θ, and the variance Var(θ_cap) of the estimator:

The Bias-Variance tradeoff (Image by Author)

While estimating any quantity, one often aims for a certain target Mean Squared Error. The above equation puts into clear light the trade-off one has to make between the bias and the variance of the estimator in order to achieve a specified Mean Squared Error.
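Here is a small simulation sketch (reusing the synthetic skewed population from the earlier bias example) that checks the decomposition numerically: the directly computed Mean Squared Error of the midrange estimator matches its squared bias plus its variance:

import numpy as np

# Check MSE(theta_cap) = Bias(theta_cap)^2 + Var(theta_cap) by simulation,
# for the midrange estimator of the mean of an exponential population.
rng = np.random.default_rng(1)
true_mean, n, n_trials = 1.0, 100, 20000

estimates = []
for _ in range(n_trials):
    sample = rng.exponential(scale=true_mean, size=n)
    estimates.append((sample.min() + sample.max()) / 2)
estimates = np.asarray(estimates)

mse = np.mean((estimates - true_mean) ** 2)   # Mean Squared Error, computed directly
bias = estimates.mean() - true_mean
variance = estimates.var()

print('MSE          :', mse)
print('Bias^2 + Var :', bias ** 2 + variance)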

When θ_cap=y_bar, the average-of-n-sample-values estimator, we can calculate the MSE of y_bar as follows:

Mean Squared Error of the sample mean (Image by Author)

We have already shown that the sample mean is an unbiased estimator of the population mean. So Bias(y_bar) is zero. Let’s calculate the variance of the sample mean.

Variance of the sample mean (Image by Author)

We will now use the following property of the Variance of a linear combination of n independent random variables y_1, y_2, y_3,…y_n:

Variance of a linear combination of n independent random variables (Image by Author)

Here, c_1, c_2, …, c_n are constants. Applying this result to y_bar, we have:

Variance of the sample mean of n i.i.d. random variables (Image by Author)

Not only are y_1, y_2, y_3,…y_n independent random variables, they are also identically distributed, so they all have the same mean and variance, equal to the corresponding population values of those two parameters. Specifically, for all i, Var(y_i) = σ², the population variance. Therefore:

Variance of the sample mean of n i.i.d. random variables (Image by Author)

Thus, we have shown that the variance of the average-of-n-i.i.d.-random-variables estimator is σ²/n.

Variance of the sample mean of n i.i.d. random variables (Image by Author)

This result also bears out our intuition that as the sample size increases, the variance in the sample means ought to reduce.
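Here is a sketch of that intuition on the temperature data (it treats the full cleaned-up data set as the population and assumes the DataFrame df from earlier is still in scope): the empirical variance of the sample mean tracks σ²/n as n grows:

import numpy as np

# Assumes the cleaned-up DataFrame df from earlier is in scope.
temps = df['sea_surface_temperature(degree_C)']
sigma2 = temps.var(ddof=0)   # treat the full data set as the population

for n in [10, 100, 1000]:
    sample_means = [temps.sample(n).mean() for _ in range(2000)]
    print(f'n={n}: empirical Var(y_bar)={np.var(sample_means):.5f}, sigma^2/n={sigma2 / n:.5f}')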

Plugging this result back into the formula for MSE(y_bar), and remembering that Bias(y_bar)=0, we get:

The Mean Squared Error of the sample mean of n i.i.d. random variables (Image by Author)

Let’s now come full circle, back to our data set of surface temperatures in the North Eastern Atlantic ocean. Our sample size n is 782,668. We had considered the following three estimators of the unknown population mean µ:

Three candidate estimators for the population mean µ (Image by Author)

Let’s compare the performance of the three estimators using the measures of bias, variance and MSE as follows:

A comparison of performance of different estimators of the population mean (Image by Author)

We can see that the average-of-n-sample-values estimator (estimator #3) has zero bias, and the lowest variance and the lowest Mean Squared Error among the three candidates.
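One rough way to arrive at numbers like these (a sketch only, which treats the full 782,668-reading data set as the population and repeatedly draws much smaller samples from it) is to simulate all three estimators side by side:

import numpy as np

# Assumes the cleaned-up DataFrame df from earlier is in scope. The full data set
# plays the role of the population, so its mean plays the role of mu.
temps = df['sea_surface_temperature(degree_C)']
mu = temps.mean()
n, n_trials = 1000, 500

results = {'#1 midrange': [], '#2 single random value': [], '#3 sample mean': []}
for _ in range(n_trials):
    s = temps.sample(n)
    results['#1 midrange'].append((s.min() + s.max()) / 2)
    results['#2 single random value'].append(s.iloc[0])
    results['#3 sample mean'].append(s.mean())

for name, estimates in results.items():
    estimates = np.asarray(estimates)
    bias = estimates.mean() - mu
    variance = estimates.var()
    print(f'{name}: bias={bias:+.4f}, variance={variance:.4f}, MSE={bias ** 2 + variance:.4f}')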


Summary

  • An estimator is any procedure or formula that is used to predict or estimate the value of some unknown quantity.
  • Given a population parameter θ (e.g. mean, variance, median etc.), and an estimator θ_cap of θ, the bias of θ_cap is the difference between the expected value of θ_cap and the actual (true) value of the population parameter θ.
  • The sample mean, when expressed as the average of n i.i.d. random variables, is an unbiased estimator of the population mean.
  • The Variance of the sample mean of n i.i.d. random variables is σ²/n, where σ² is the population’s variance.
  • The Mean Squared Error of an estimator θ_cap is the sum of: 1) the square of its bias and 2) its variance. Therefore, for any desired Mean Squared Error, there is always a tradeoff between the bias and the variance of the estimator’s output.
  • A regression model employing an estimator with a small bias won’t necessarily achieve a higher goodness-of-fit than another model that employs an estimator with a higher bias. One should also look at other properties of the estimator, such as its MSE, its consistency, and its efficiency.
  • Some estimators, such as Maximum Likelihood Estimators, do not seek to minimize the Mean Squared Error. For such estimators, one measures their performance using goodness-of-fit measures such as deviance.

References and Copyrights

Data set

North East Atlantic Real Time Sea Surface Temperature data set downloaded from data.world under CC BY 4.0

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

