###### Plus, how to compare estimators based on their bias, variance and mean squared error

A statistical estimator can be evaluated on the basis of how biased it is in its prediction, how consistent its performance is, and how efficiently it can make predictions. And the quality of your model’s predictions is only as good as the quality of the estimator it uses.

In this section, we’ll cover the property of **bias** in detail and learn how to measure it.

The bias of an estimator happens to be joined at the hip with the **variance** of the estimator’s predictions via a concept called the **bias-variance tradeoff**, and so, we’ll learn about that concept too.

We’ll close with a discussion on the **Mean Squared Error** of the estimator, its applicability to regression modeling, and we’ll show how to evaluate various estimators of the population mean using the properties of bias, variance and their Mean Squared Error.

## What is a Statistical Estimator?

An estimator is any procedure or formula that is used to predict or estimate the value of some unknown quantity such as, say, your flight’s departure time, or today’s NASDAQ closing price.

Let’s state an informal definition of what an estimator is:

A statistical estimator is a statistical device used to estimate the true, but unknown, value of some parameter of the population such as the mean or the median. It does this by using the information contained in the data points that make a sample of values.

In our daily lives, we tend to employ various types of estimators without even realizing it. Following are some types of estimators that we commonly use:

### Estimators based on good (or bad!) Judgement

You ask your stock broker buddy to estimate how high the price of your favorite stock will go in a year’s time. In this case, you are likely to get an **interval estimate** of the price, instead of a **point estimate**.

### Estimators based on rules of thumb, and some calculation

You estimate the efforts needed to complete your next home improvement project using some estimation technique such as the Work Breakdown Structure.

### Estimators based on polling

You ask an odd number of your friends who they think will win the next election, and you accept the majority result.

In each case, you wish to estimate a parameter you don’t know.

In statistical modeling, the mean, especially the **mean of the population**, is a fundamental parameter that is often estimated.

Let’s look at a real life data sample.

Following is a data set of surface temperatures in the North Eastern Atlantic ocean at a certain time of year:

This data set contains many missing readings. We’ll clean it up by loading it into a pandas DataFrame and removing all the rows that contain a missing value:

```
import pandas as pd

# Load the data set, parsing the timestamp column as datetimes
df = pd.read_csv('NE_Atlantic_Sea_Surface_Temperatures.csv', header=0, parse_dates=['time(UTC)'])
# Drop every row that contains a missing value
df = df.dropna()
print(df)
```

We get the following output:

At nearly 800k data points (n=782,668), this data set constitutes a very large sample. Often, you’ll have to make do with sample sizes in the teens or the hundreds, and occasionally in the thousands.

Let’s plot the frequency distribution of temperature values in this sample:

```
from matplotlib import pyplot as plt

# Histogram of the temperature readings, using 50 equal-width bins
df['sea_surface_temperature(degree_C)'].hist(bins=50)
plt.show()
```

We get the following plot:

Now, suppose we wish to estimate the **mean surface temperature** of the *entire* North Eastern Atlantic, in other words, the population mean *µ*.

Here are three possible estimators for the population mean:

**Estimator #1:** We could take the average of the minimum and maximum temperature values in the above sample:

```
y_min = df['sea_surface_temperature(degree_C)'].min()
y_max = df['sea_surface_temperature(degree_C)'].max()
print('y_min='+str(y_min) + ' y_max='+str(y_max))
# The average of the two extremes is (max + min)/2, not (max - min)/2
print('Estimate #1 of the population mean='+str((y_max+y_min)/2))
```

We get the following output (the estimate is shown to ten decimal places):

y_min=0.28704015899999996 y_max=15.02521203

Estimate #1 of the population mean=7.6561260945

**Estimator #2:** We could choose a random value from the sample and designate it as the population mean:

```
rand_temp = df.sample()['sea_surface_temperature(degree_C)'].iloc[0]
print('Estimate #2 of the population mean='+str(rand_temp))
```

We get the following output:

Estimate #2 of the population mean=13.5832207

**Estimator #3:** We could use the following estimator, which averages out all temperature values in the data sample:

```
y_bar = df['sea_surface_temperature(degree_C)'].mean()
print('Estimate #3 of the population mean='+str(y_bar))
```

We get the following output:

Estimate #3 of the population mean=11.94113359335031

Which estimator should we use? It can be shown that the third estimator, *y_bar*, the average of *n* values, provides *an unbiased estimate of the population mean*. But then, so do the first two!

In any case, this is probably a good point to understand a bit more about the concept of *bias*.

## Estimator Bias

Suppose you are shooting basketballs. While some balls make it through the net, you find that most of your throws are hitting a point below the hoop. In this case, whatever technique you are using to estimate the correct angle and speed of the throw is underestimating the (unknown) correct values of angle and speed. *Your estimator has a negative bias.*

With practice, your throwing technique improves, and you are able to dunk more often. And from a bias perspective, you begin overshooting the basket approximately just as often as undershooting it. *You have managed to reduce your estimation technique’s bias.*

## The bias-variance trade-off

One thing that might be apparent from the above two figures is that in the first figure, although the bias is large, the ‘dispersion’ of the missed shots is small, leading to a lower variance in outcomes. In the second figure, the bias has undoubtedly reduced because the missed shots are spread more evenly around the basket, but that has also led to a larger spread, a.k.a. a higher variance.

The first technique appears to have a larger bias and a smaller variance and it is vice versa for the second technique. This is no coincidence and it can be easily proved (in fact, we will prove it later!) that there is a direct give and take between the bias and variance of your estimation technique.

## How to calculate an estimator’s bias

Let’s follow through with the basketball example. We can see that the location of the basket (orange dot at the center of the two figures) is a proxy for the (unknown) population mean for the angle of throw and speed of throw that will guarantee a dunk. The misses (the blue dots) are a proxy for your technique’s estimation of what the population mean values are for the angle and speed of throw.

In statistical parlance, each throw is an experiment that produces an outcome. The value of the outcome is the location of the ball (the blue dot) on the vertical plane that contains the basket.

If you assume that the outcome of each throw is independent of the previous ones (this is pretty much impossible in real life, but let’s play along!), we can define *n* independent, identically distributed random variables *y_1, y_2, y_3,…y_n* to represent your *n* throws.

Let’s recollect our average-of-n-sample-values estimator:

*y_bar = (y_1 + y_2 + … + y_n) / n*

Note that this mean *y_bar* relates to our sample of *n* values, i.e. *n* ball throws, or *n* ocean surface temperatures, etc. Switching to the ocean temperatures example, if we collect another set of *n* ocean temperature values and average them out, we’ll get a second value for the sample mean *y_bar*. A third sample of size *n* will yield a third sample mean, and so on. So, the sample mean *y_bar* is itself a random variable. And just like any random variable, *y_bar* has a probability distribution and an expected value, denoted by *E(y_bar)*.
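The idea that the sample mean is itself a random variable can be sketched with a small simulation. The snippet below is only an illustration: it assumes a synthetic, normally distributed population (the values 12.0 and 3.0 are made up, not taken from the sea surface temperature data) and draws five samples of size *n*, each of which yields a different *y_bar*.

```python
import numpy as np

# Assumed, purely illustrative population: normal with mean 12.0 and std. dev. 3.0
rng = np.random.default_rng(seed=42)
mu, sigma, n = 12.0, 3.0, 100

# Draw 5 independent samples of size n and compute the sample mean of each one
sample_means = np.array([rng.normal(mu, sigma, size=n).mean() for _ in range(5)])

# Each sample produces a different value of y_bar
print(sample_means)
```

The five printed values differ from one another, and they scatter around the assumed population mean.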

We are now in position to define the bias of the estimator *y_bar* for the population mean *µ* as follows:

The bias of the estimator *y_bar* for the population mean *µ* is the difference between the expected value of the sample mean *y_bar* and the population mean *µ*. Here is the formula:

*Bias(y_bar) = E(y_bar) - µ*

In general, given a population parameter *θ* (e.g. mean, variance, median etc.), and an estimator *θ_cap* of *θ*, the **bias of *θ_cap*** is given as the difference between the expected value of *θ_cap* and the actual (true) value of the population parameter *θ*, as follows:

*Bias(θ_cap) = E(θ_cap) - θ*
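The definition of bias as *E(θ_cap) - θ* can be approximated numerically by averaging an estimator over many simulated samples. The sketch below assumes a synthetic normal population whose mean is known by construction. The helper `empirical_bias` is a hypothetical name introduced here for illustration, and the sample maximum is used only as an example of an obviously biased estimator of the mean.

```python
import numpy as np

def empirical_bias(estimator, theta, mu, sigma, n, trials, rng):
    """Approximate E(theta_cap) - theta by averaging the estimator over many samples."""
    samples = rng.normal(mu, sigma, size=(trials, n))  # one simulated sample per row
    estimates = np.apply_along_axis(estimator, 1, samples)
    return estimates.mean() - theta

# Assumed population parameters (illustrative values only)
rng = np.random.default_rng(seed=0)
mu, sigma, n, trials = 12.0, 3.0, 50, 5000

# The parameter being estimated (theta) is the population mean mu in both cases
bias_mean = empirical_bias(np.mean, mu, mu, sigma, n, trials, rng)
bias_max = empirical_bias(np.max, mu, mu, sigma, n, trials, rng)

print('Empirical bias of the sample mean:', bias_mean)
print('Empirical bias of the sample maximum:', bias_max)
```

Under these assumptions, the bias of the sample mean should come out close to zero, while the sample maximum should show a clearly positive bias.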

## The sample mean as an unbiased estimator of the population mean

We will now show that the average-of-n-sample-values estimator *y_bar* that we saw earlier, demonstrates a **zero bias** in its ability to predict the population mean *µ*.

Using the expression of bias, the bias of *y_bar* is given by:

*Bias(y_bar) = E(y_bar) - µ*

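Now, let’s calculate *E(y_bar)*. Since expectation is a linear operation:

*E(y_bar) = E[(y_1 + y_2 + … + y_n) / n] = [E(y_1) + E(y_2) + … + E(y_n)] / n*

The random variables *y_1, y_2, y_3, …, y_n* are all identically distributed around the population mean *µ*. Therefore, each one of them has an expected value of *µ*:

*E(y_bar) = (µ + µ + … + µ) / n = nµ / n = µ*

So *Bias(y_bar) = E(y_bar) - µ = µ - µ = 0*.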

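A zero bias estimator isn’t necessarily a great thing. In fact, recollect the following two estimators of the mean surface temperature which we had also considered: (1) the average of the minimum and maximum sample values, and (2) a single value picked at random from the sample.

Estimator (2) is simply one of the random variables *y_i*, whose expected value is the population mean *µ*, so its bias is *E(y_i) - µ = µ - µ = 0*. Estimator (1) is also unbiased whenever the distribution of the values is symmetric about *µ*, since the minimum and the maximum then miss *µ* by the same amount, on average, in opposite directions.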

But common sense says that estimators (1) and (2) are clearly inferior to the average-of-n-sample-values estimator (3). So there must be other performance measures by which we can evaluate the average-of-n-sample-values estimator and show it to be superior to the others.

One of those measures is the **Mean Squared Error** (MSE).

## Mean Squared Error of an estimator, and its applicability to Regression Modeling

To most people who deal with regression models, the **Mean Squared Error** is a familiar performance measure. Several estimators such as the Ordinary Least Squares (OLS) estimator for linear models, and several neural network estimators seek to minimize the mean squared error between the training data and the model’s predictions.

Consider the general equation of a regression model:

*y_obs = f(X, β_cap) + ε*

*where **y_obs** = [y_1, y_2, y_3, …, y_n] = the training data sample,*

***X** = the regression variables matrix,*

***β_cap** = the fitted variable coefficients, and*

***ε** = the residual errors of regression.*

*f(X, β_cap) = y_cap* is the vector of fitted values corresponding to the vector of observed values *y_obs*. These fitted values are the model’s predictions. So, the above model equation can be written concisely as follows:

*y_obs = y_cap + ε*

The mean-squared-error of the estimating function *y_cap = f(.)* is:

*MSE(y_cap) = [(y_1 - y_cap_1)² + (y_2 - y_cap_2)² + … + (y_n - y_cap_n)²] / n*

A regression model’s estimation function will usually, but not always, try to minimize the MSE.
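As a rough illustration of the MSE of a fitted regression model, the sketch below fits a straight line by ordinary least squares to synthetic data and then averages the squared differences between observed and fitted values. The data-generating relationship, the noise level, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200

# Assumed true relationship (illustrative): y = 2 + 0.5*x + Gaussian noise
x = rng.uniform(0.0, 10.0, size=n)
y_obs = 2.0 + 0.5 * x + rng.normal(0.0, 0.5, size=n)

# Regression matrix X with an intercept column
X = np.column_stack([np.ones(n), x])

# OLS fit: beta_cap minimizes the sum of squared residuals
beta_cap, *_ = np.linalg.lstsq(X, y_obs, rcond=None)

# Fitted values and the mean squared error between observations and fit
y_cap = X @ beta_cap
mse = np.mean((y_obs - y_cap) ** 2)

print('beta_cap =', beta_cap)
print('MSE =', mse)
```

With this setup, the MSE should land near the assumed noise variance of 0.25.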

The above equation for MSE can be written in the language of Expectation as follows:

*MSE(y_cap) = E[(y_cap - y_obs)²]*

Notice that we have also reversed the order of the terms in the brackets, just to maintain the convention.

The above two equations of MSE are equivalent as can be seen from the following illustration:

From the above figure, we see that the commonly used formula for the Mean Squared Error simply assumes that each of the *n* squared differences *(y_i-y_cap_i)²* is equally likely to occur. In other words, the squared differences follow a discrete uniform distribution in which each one has a probability of *1/n*.
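The equivalence of the two forms of the MSE can be checked on a tiny example. The numbers below are made up for illustration: the first calculation is the familiar average of squared differences, and the second is an expectation in which each squared difference carries a probability of *1/n*.

```python
import numpy as np

# Made-up observed and fitted values, for illustration only
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_cap = np.array([2.5, 5.5, 6.0, 9.5])
n = len(y_obs)

squared_diffs = (y_obs - y_cap) ** 2

# Form 1: the commonly used average of squared differences
mse_average = squared_diffs.sum() / n

# Form 2: an expectation with each squared difference weighted by probability 1/n
probabilities = np.full(n, 1.0 / n)
mse_expectation = np.sum(probabilities * squared_diffs)

print(mse_average, mse_expectation)  # both equal 0.4375
```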

Let’s return to the general formula for the MSE of an estimator:

*MSE(y_cap) = E[(y_cap - y_obs)²]*

Suppose we are using the average-of-n-sample-values estimator *y_bar*. So, *y_cap = y_bar*, and *y_obs* is replaced by the quantity being estimated, namely the true (but in practice unobserved) population mean *µ*. Therefore, the MSE of the sample mean *y_bar* can be expressed as follows:

*MSE(y_bar) = E[(y_bar - µ)²]*

To calculate the MSE of *y_bar*, we will use the following result from expectation theory, which holds for any random variable *Z* with a finite variance:

*E(Z²) = [E(Z)]² + Variance(Z)*

Applying the above formula to the R.H.S. of the MSE expression, we get the following result:

*MSE(y_bar) = E[(y_bar - µ)²] = [E(y_bar - µ)]² + Variance(y_bar - µ)*

Let’s inspect the R.H.S. of the above equation. The first term under the square is *E(y_bar-µ)*, which as we know, is the bias of the sample mean *y_bar*.

The second term on the R.H.S. *Variance(y_bar-µ)* is simply *Variance(y_bar)* since the population mean is a constant. Thus, the R.H.S is the sum of the Bias (squared) and the Variance of *y_bar.*

We can generalize this finding as follows:

The Mean Squared Error of the estimator *θ_cap* of any population parameter *θ* is the sum of the square of the bias *B(θ_cap)* of the estimator w.r.t. *θ*, and the variance *Var(θ_cap)* of the estimator:

*MSE(θ_cap) = [B(θ_cap)]² + Var(θ_cap)*
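The decomposition of the MSE into squared bias plus variance can be verified numerically. The sketch below assumes a synthetic normal population and uses a deliberately biased, hypothetical estimator (the sample mean shrunk by 10%) so that neither term is zero.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu, sigma, n, trials = 12.0, 3.0, 30, 20000  # assumed, illustrative values

# One simulated sample of size n per row
samples = rng.normal(mu, sigma, size=(trials, n))

# Hypothetical biased estimator of mu: the sample mean shrunk by 10%
estimates = 0.9 * samples.mean(axis=1)

# Empirical MSE, bias and variance of the estimator across all trials
mse = np.mean((estimates - mu) ** 2)
bias = estimates.mean() - mu
variance = estimates.var()  # population-style variance (ddof=0)

print('MSE               =', mse)
print('Bias^2 + Variance =', bias ** 2 + variance)
```

The two printed numbers agree up to floating point error, because the identity also holds exactly for empirical moments when the variance is computed with a divisor of the number of trials.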

While estimating any quantity, one often aims for a certain target Mean Squared Error. The above equation makes clear the trade-off one has to make between the *Bias* and the *Variance* of the estimator in order to achieve a specified mean squared error.

When *θ_cap = y_bar*, the average-of-n-sample-values, we can calculate the MSE of *y_bar* as follows:

*MSE(y_bar) = [Bias(y_bar)]² + Var(y_bar)*

We have already shown that the sample mean is an unbiased estimator of the population mean. So *Bias(y_bar)* is zero. Let’s calculate the variance of the sample mean.

We will now use the following property of the Variance of a linear combination of *n* **independent** random variables *y_1, y_2, y_3, …, y_n*:

*Var(c_1·y_1 + c_2·y_2 + … + c_n·y_n) = c_1²·Var(y_1) + c_2²·Var(y_2) + … + c_n²·Var(y_n)*

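Here, *c_1, c_2, …, c_n* are constants. Applying this result to *y_bar*, in which every constant equals *1/n*, we have:

*Var(y_bar) = Var(y_1/n + y_2/n + … + y_n/n) = [Var(y_1) + Var(y_2) + … + Var(y_n)] / n²*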

Not only are *y_1, y_2, y_3, …, y_n* independent random variables, they are also identically distributed, and so they all have the same mean and variance, which are equal to the respective population values of those two parameters. Specifically, for all *i*, *Var(y_i) = σ²*. Therefore:

*Var(y_bar) = (σ² + σ² + … + σ²) / n² = nσ² / n² = σ²/n*

Thus, we have shown that the variance of the *average-of-n-i.i.d.-random-variables* estimator is *σ²/n.*

This result also bears out our intuition that as the sample size increases, the variance in the sample means ought to reduce.
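The *σ²/n* result can be checked with a simulation. The sketch below assumes a synthetic normal population with made-up parameters, repeatedly draws samples of increasing size *n*, and compares the empirical variance of the resulting sample means with *σ²/n*.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma, trials = 12.0, 3.0, 4000  # assumed, illustrative values

results = {}
for n in (10, 100, 1000):
    # Draw 'trials' samples of size n and compute the sample mean of each
    sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    # Store (empirical variance of y_bar, theoretical sigma^2/n)
    results[n] = (sample_means.var(), sigma ** 2 / n)
    print(f'n={n}: empirical Var(y_bar)={results[n][0]:.5f}, sigma^2/n={results[n][1]:.5f}')
```

Each tenfold increase in *n* should cut the variance of the sample mean by roughly a factor of ten.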

Plugging this result back into the formula for *MSE(y_bar)*, and remembering that *Bias(y_bar)=0*, we get:

*MSE(y_bar) = 0² + σ²/n = σ²/n*

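Let’s now come full circle back to our data set of surface temperatures of the North Eastern Atlantic ocean. Our sample size *n* is 782,668. We had considered the following three estimators of the unknown population mean *µ*: (1) the average of the minimum and maximum sample values, (2) a single value chosen at random from the sample, and (3) the average of all *n* sample values, *y_bar*.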

Let’s compare the performance of the three estimators using the measures of bias, variance and MSE as follows:

We can see the *average-of-n-sample-values* estimator (estimator #3) has a zero bias, and the lowest variance and the lowest Mean Squared Error among the three candidates.
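A comparison of this kind can also be approximated by simulation. The sketch below does not use the sea surface temperature data. It assumes a synthetic normal population with a known mean, applies the three estimators to many simulated samples, and reports the empirical bias, variance and MSE of each. The population parameters and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
mu, sigma, n, trials = 12.0, 3.0, 50, 20000  # assumed, illustrative values

# One simulated sample of size n per row
samples = rng.normal(mu, sigma, size=(trials, n))

estimates = {
    # Estimator #1: average of the minimum and maximum of each sample
    'mid-range (#1)': (samples.min(axis=1) + samples.max(axis=1)) / 2,
    # Estimator #2: one randomly chosen value from each sample
    'random value (#2)': samples[np.arange(trials), rng.integers(0, n, size=trials)],
    # Estimator #3: average of all n values in each sample
    'sample mean (#3)': samples.mean(axis=1),
}

summary = {}
for name, est in estimates.items():
    bias = est.mean() - mu
    variance = est.var()
    mse = np.mean((est - mu) ** 2)
    summary[name] = (bias, variance, mse)
    print(f'{name}: bias={bias:.4f}, variance={variance:.4f}, MSE={mse:.4f}')
```

For this symmetric population, all three estimators should show a bias near zero, while the sample mean should have by far the smallest variance and MSE.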

## Summary

- An estimator is any procedure or formula that is used to predict or estimate the value of some unknown quantity.
- Given a population parameter *θ* (e.g. mean, variance, median etc.), and an estimator *θ_cap* of *θ*, the **bias of *θ_cap*** is the difference between the expected value of *θ_cap* and the actual (true) value of the population parameter *θ*.
- The **sample mean**, when expressed as the average of *n* i.i.d. random variables, **is an unbiased estimator of the population mean**.
- The **Variance of the sample mean** of *n* i.i.d. random variables is *σ²/n*, where *σ²* is the population’s variance.
- The **Mean Squared Error of an estimator** *θ_cap* is the sum of: 1) the square of its bias and 2) its variance. Therefore, for any desired Mean Squared Error, there is always a tradeoff between the bias and the variance of the estimator’s output.
- A regression model employing an estimator with a small bias won’t necessarily achieve a higher goodness-of-fit than another model that employs an estimator with a higher bias. One should also look at other properties of the estimator such as its MSE, its consistency and its efficiency.
- Some estimators, such as Maximum Likelihood Estimators, do not seek to minimize the Mean Squared Error. For such estimators, one measures their performance using goodness-of-fit measures such as deviance.

## References and Copyrights

### Data set

North East Atlantic Real Time Sea Surface Temperature data set downloaded from data.world under CC BY 4.0

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
