A consistent estimator is one that produces a better and better estimate of the parameter it is estimating as the size of the data sample it works on grows. In the limiting case, when the sample becomes as large as the population, the estimate equals the true value of the parameter.

Consistency is one of several properties of an estimator, alongside bias, mean squared error, and efficiency.

Let’s illustrate the concept with a real-world data set. We’ll use the same data as in the previous section, namely, ocean surface temperatures in the Northeast Atlantic:

Let’s load the data set into memory using the Pandas library, and clean it by removing all rows with missing values:

```
import pandas as pd

# Load the data set, parsing the time(UTC) column as datetimes
df = pd.read_csv('NE_Atlantic_Sea_Surface_Temperatures.csv', header=0, parse_dates=['time(UTC)'])
# Drop all rows containing missing values
df = df.dropna()
```

Let’s print out the data set:

```
print(df)
```

The cleaned-up data set contains almost 800K data points, which is pretty big. Let’s treat this data set as the population of values. In this case, then, we have access to the entire population, although in real life we would always be dealing with a sample of values and would never know the full extent of the population.

Let’s calculate and print out the mean and standard deviation of the ‘population’:

```
pop_mean = df['sea_surface_temperature(degree_C)'].mean()
pop_stddevs = df['sea_surface_temperature(degree_C)'].std(ddof=0)
print('Population mean (mu)=' + str(pop_mean))
print('Population standard deviation (sigma)=' + str(pop_stddevs))
```

Here are the values:

```
Population mean (mu)=11.94113359335031
Population standard deviation (sigma)=1.3292003777893815
```

## Sampling with replacement

To illustrate the concept of estimator consistency, we’ll draw a randomly selected sample *[y_1, y_2, …, y_i, …, y_n]* of size *100 (i.e. n=100)* from this ‘population’ of values. We will draw the sample using a technique called **sampling with replacement**: we randomly draw the first data point *y_1*, note down its value, and put it back into the population. We repeat this procedure for all *n* values.

Sampling with replacement can yield duplicates, and therefore it’s not always a practical sampling technique. For instance, imagine that you are selecting volunteers for a clinical trial. If you use sampling with replacement, you could in theory enroll the same person multiple times, which is clearly absurd. However, if your population of values is very large, choosing the same data point more than once is extremely unlikely even with replacement.

The big advantage of using sampling with replacement is that it ensures that each variable *y_i* of the sample can be considered an **independent, identically distributed (i.i.d.) random variable**, an assumption that simplifies a lot of analysis. Although truly i.i.d. variables are practically impossible to encounter in real life, ironically, the i.i.d. assumption underpins several foundational results in statistical science.

After our little detour into sampling land, let’s get back to our discussion on estimator consistency.

For each sample *[y_1, y_2, …, y_i, …, y_n]*, we’ll use the sample mean *y_bar = (1/n) Σ y_i* and the sample standard deviation *s = sqrt( Σ (y_i − y_bar)² / (n−1) )* as estimators of the population mean *µ* and standard deviation *σ*.

We’ll calculate the estimates of the population mean *(µ)* and standard deviation *(σ)* using the above formulae on a sample of size *100*. Next, we’ll increase the sample size by *100* and repeat the estimation of *µ* and *σ*, continuing until the sample size *n* approaches the population size *N=782668*.

Here’s the Python code:

```
import matplotlib.pyplot as plt
from tqdm import tqdm

increment = 100
# Define arrays to store the sample sizes and the absolute estimation errors
sample_sizes = []
sample_means_deltas = []
sample_stddevs_deltas = []

# Increase the sample size by 100 in each iteration, and use tqdm to show a progress bar while we are at it
for sample_size in tqdm(iterable=range(100, len(df), increment), position=0, leave=True):
    # Select a random sample of size=sample_size, with replacement
    random_sample = df.sample(n=sample_size, replace=True)
    # Calculate the sample mean
    y_bar = random_sample['sea_surface_temperature(degree_C)'].mean()
    # Calculate the sample standard deviation
    s = random_sample['sea_surface_temperature(degree_C)'].std()
    # Store the absolute differences between the sample and population values
    sample_sizes.append(sample_size)
    sample_means_deltas.append(abs(y_bar - pop_mean))
    sample_stddevs_deltas.append(abs(s - pop_stddevs))

# Plot |y_bar-mu| versus sample size
plt.plot(sample_sizes, sample_means_deltas)
plt.xlabel('Sample size (n)')
plt.ylabel('|y_bar-mu|')
plt.show()

# Plot |s-sigma| versus sample size
plt.plot(sample_sizes, sample_stddevs_deltas)
plt.xlabel('Sample size (n)')
plt.ylabel('|s-sigma|')
plt.show()
```

We see the following two plots:

In both cases, observe that the absolute difference between the estimate and the true value of the parameter progressively reduces as the sample size increases.

Also, notice that the absolute difference between the sample and population means does not become zero even when the sample size (*n*) equals the population size *N=782668*. This might seem counter-intuitive: why would the sample mean not be exactly equal to the population mean if the sample is as large as the population? The answer lies in recollecting that we are using the sampling-with-replacement technique to generate the sample. When this technique is used on a finite population, the ‘replacement’ aspect causes the sample to contain several duplicate values, even when the sample size equals the population size. Therefore, even in the case of *n=N*, the sample is never identical to the population.
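Here’s a small sketch (not part of the original code; it uses a simulated population of index values rather than the temperature data) showing just how incomplete a with-replacement sample of size *n=N* is. On average, only about *1 − 1/e ≈ 63.2%* of the population’s distinct values make it into the sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# Draw N indices with replacement from a 'population' of N distinct values
sample_idx = rng.integers(0, N, size=N)
# Measure what fraction of the population's distinct values actually appear
unique_fraction = len(np.unique(sample_idx)) / N
print(f'Fraction of distinct population values in the sample: {unique_fraction:.3f}')
```

The remaining ~37% of the population is replaced by duplicates, which is why the sample mean never exactly matches the population mean.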

## The Consistent Estimator

It’s not just happenstance that the estimators for the population mean and standard deviation seem to converge to the corresponding population values as the sample size increases.

We can prove that they would always converge to the population values.

Before we prove that, let’s recollect what a consistent estimator is:

A consistent estimator is one that produces a better and better estimate of the parameter it is estimating as the size of the data sample it works on grows. This improvement continues to the limiting case in which the sample becomes as large as the population and the estimate equals the true value of the parameter.

We can express consistency in probabilistic terms as follows:

*lim(n→∞) P(|y_bar(n) − µ| > ε) = 0*

In the above equation, we are saying that, no matter how tiny a positive value *ε* you choose, as the sample size *n* tends to *∞*, the probability *P()* that the absolute difference between the average of *n* sample values *y_bar(n)* and the population mean *µ* exceeds *ε* tends to *zero*.

## A thought experiment

One can visualize the above equation using a thought experiment as follows:

Choose some tiny positive value of *ε*. Say *ε=0.01*.

1. Start with a randomly selected (with replacement) sample of *n* values. Compute its average *y_bar(n)*, subtract it from the population mean *µ*, take the absolute value of the difference and store it away. Repeat this procedure a thousand times to yield one thousand values of the absolute difference *|y_bar(n)−µ|*.
2. Divide these *1000* differences into two sets as follows: the first set *S1* contains differences that are less than or equal to *0.01*, i.e. *|y_bar(n)−µ| ≤ ε*. The second set *S2* contains differences that are greater than *0.01*, i.e. *|y_bar(n)−µ| > ε*.
3. Compute the probability of the absolute difference being greater than *0.01*. This is simply the size of the second set divided by *1000*, i.e. *P(|y_bar(n)−µ| > ε) = sizeof(S2)/1000*.
4. Now increase the sample size by 100, and repeat steps 1, 2 and 3 to recalculate the probability *P(|y_bar(n)−µ| > ε)*.

What you’ll find is that as the sample size *n* increases, the probability *P(|y_bar(n)−µ| > ε)* gets closer and closer to zero. And no matter how small you choose *ε*, you’ll still see *P(|y_bar(n)−µ| > ε)* approaching zero as *n* increases.
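The thought experiment above can be sketched in code. The following is a simulation using a Normal ‘population’ with assumed parameters (*µ=20*, *σ=5*, *ε=0.5*), not the article’s temperature data set:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, eps = 20.0, 5.0, 0.5
trials = 1000

p_exceed = {}
for n in [50, 200, 1000, 5000]:
    # Draw `trials` samples of size n and compute each sample's mean
    y_bars = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    # Fraction of samples whose mean misses mu by more than eps
    p_exceed[n] = np.mean(np.abs(y_bars - mu) > eps)
    print(f'n={n:5d}  P(|y_bar(n) - mu| > {eps}) ~ {p_exceed[n]:.3f}')
```

As *n* grows, the estimated probability shrinks toward zero, exactly as the limit equation predicts.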

## The general condition of consistency for any estimator

For any estimator *θ_cap(n)* used to estimate the population-level parameter *θ*, *θ_cap(n)* is a consistent estimator of *θ* if and only if, for every *ε > 0*:

*lim(n→∞) P(|θ_cap(n) − θ| > ε) = 0*

## The average-of-n-sample-values estimator is a consistent estimator of the population mean

Recollect that we said that we can show that *y_bar* is a consistent estimator of *µ*.

To show this, we will first introduce the **Bienaymé–Chebyshev inequality**, a fascinating result that applies to a wide variety of probability distributions:

Consider a probability distribution, such as a Poisson distribution, having mean *µ* and standard deviation *σ*. In the sample Poisson distribution shown below, *µ=20* and *σ = sqrt(20) = 4.47213*.

The **Bienaymé–Chebyshev inequality** says that the probability of the random variable *X* attaining a value that is more than *k* standard deviations away from the mean *µ* of the probability distribution of *X* is at most *1/k²*. It’s expressed as follows:

*P(|X − µ| ≥ kσ) ≤ 1/k²*

Continuing with our example of a Poisson distributed variable *X* with mean *µ=20* and standard deviation *σ = sqrt(20) = 4.47213*, we get the following table:

*k* does not have to be an integer. For example, if *X=26*, its separation from the mean *20* is *|X−µ|/σ = |26−20|/4.47213 = 1.34164* times the standard deviation. So, as per the **Bienaymé–Chebyshev** inequality, at most *100/(1.34164)² ≈ 55.6%* of the values of *X* would lie more than *1.34164* standard deviations away from the mean, i.e. above *26* or below *14*.
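We can check the inequality numerically. Here’s a sketch (not part of the original article) that samples a Poisson distribution with *µ=20* using NumPy and compares the observed tail probabilities against the *1/k²* bound:

```python
import numpy as np

rng = np.random.default_rng(7)
mu = 20.0
sigma = np.sqrt(mu)                      # for a Poisson distribution, variance equals the mean
X = rng.poisson(lam=mu, size=1_000_000)

tail = {}
for k in [1.0, 1.34164, 2.0, 3.0]:
    # Observed probability of landing k or more standard deviations from the mean
    tail[k] = np.mean(np.abs(X - mu) >= k * sigma)
    print(f'k={k:.5f}  P(|X-mu| >= k*sigma) = {tail[k]:.4f}  <=  1/k^2 = {1.0 / k**2:.4f}')
```

Note that the Chebyshev bound is quite loose: the observed tail probabilities sit well below *1/k²*, which is the price the inequality pays for applying to almost any distribution.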

So how does all this help us in proving that the average-of-n-sample-values mean *y_bar* is a consistent estimator of the population mean *µ?*

Let’s state once again what we want to prove, alongside the Bienaymé–Chebyshev inequality:

Let’s make the following substitution in the Bienaymé–Chebyshev inequality:

If the random variable *X* is set to the sample mean *y_bar*, then the *mean of the sample mean* is simply the population mean *µ*, and *the variance of the sample mean can be shown to be σ²/n*.

So when we substitute *y_bar* for *X* in the Bienaymé–Chebyshev inequality, we keep *µ* intact and replace the standard deviation *σ* with *σ/sqrt(n)*:

*P(|y_bar − µ| ≥ kσ/sqrt(n)) ≤ 1/k²*
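The claim that the variance of the sample mean is *σ²/n* can be verified with a quick simulation (a sketch with an assumed Normal ‘population’ of variance *σ²=4* and sample size *n=100*, not the article’s data):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 4.0                # assumed population variance
n = 100
trials = 50_000

# Draw many samples of size n and measure the spread of their means
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
var_of_mean = samples.mean(axis=1).var()
print(f'Empirical Var(y_bar) = {var_of_mean:.5f}; theory sigma^2/n = {sigma2 / n:.5f}')
```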

Now let’s make a second substitution: set *ε = kσ/sqrt(n)*, so that *k = ε·sqrt(n)/σ* and *1/k² = σ²/(nε²)*.

We get the following:

*P(|y_bar − µ| ≥ ε) ≤ σ²/(nε²)*

Now we increase the sample size *n* to the point where it equals the theoretically infinite population size:

*lim(n→∞) P(|y_bar − µ| ≥ ε) ≤ lim(n→∞) σ²/(nε²) = 0*

Solving the limit yields the result we were looking for, namely that the probability tends to zero as the sample size becomes arbitrarily large, thereby proving that the average-of-n-sample-values mean *y_bar* is a consistent estimator of the population mean *µ*.

## Applicability to regression modeling

A regression model is usually trained on a sample: the training data set. After it is trained, the model’s parameters acquire a set of fitted values *β_cap*. If you train the model on another randomly selected sample of the same size, chances are the trained model will acquire a different set of fitted values *β’_cap*. Training on a third sample data set will yield a third set of fitted parameter values *β’’_cap*, and so on. Thus, the fitted coefficients of a regression model form a vector of random variables *β_cap*, which has a mean and a standard deviation. Practically, a regression model cannot be trained on the entire population of values, so *β_cap* cannot ever attain the true population-level values *β* of the coefficients. This is where the connection with consistency comes in: if the model were to be trained on randomly selected samples of larger and larger size, the estimation procedure of *β* is said to be consistent if *P(|β_cap − β| > ε) → 0* as the size of the training data set tends to infinity.
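Here’s a sketch of that idea (simulated data with a hypothetical true *β = [2, 3]*, fitted by ordinary least squares via NumPy; not from the original article), showing the fitted coefficients drifting toward the true values as the training set grows:

```python
import numpy as np

rng = np.random.default_rng(5)
beta_true = np.array([2.0, 3.0])          # hypothetical true intercept and slope

errors = {}
for n in [100, 10_000, 1_000_000]:
    x = rng.uniform(0, 10, size=n)
    X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
    y = X @ beta_true + rng.normal(0, 2.0, size=n)
    # Ordinary least squares fit of the coefficients
    beta_cap, *_ = np.linalg.lstsq(X, y, rcond=None)
    errors[n] = np.abs(beta_cap - beta_true).max()
    print(f'n={n:8d}  max |beta_cap - beta| = {errors[n]:.5f}')
```

Each larger training set pulls *β_cap* closer to *β*, which is the consistency property at work in the regression setting.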

## Citations and Copyrights

### Data set

North East Atlantic Real Time Sea Surface Temperature data set downloaded from data.world under CC BY 4.0

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS**: Estimator Bias And The Bias-Variance Tradeoff

**NEXT**: An Introduction To The Fisher Information: Gaining The Intuition Into A Complex Concept