…and its applicability to confidence intervals
Suppose you know the mean value of a sample and you wish to use the sample mean to estimate the interval that the population’s mean will lie in. The Interval Estimation technique can be used to arrive at this estimate at some specified confidence level. This technique can be easily extended to estimating the interval for other population level statistics such as the variance.
To illustrate, suppose you have polled one hundred randomly selected households in Boston, Massachusetts, and you have found their average annual income to be $65,713. Can you use this single number to estimate, with some quantifiable measure of confidence such as 90%, the range in which the average annual income of all three hundred thousand households in Boston will lie? It turns out, you can do this!
The following figure illustrates this situation:
In the above illustration, you may notice that we have assumed a certain probability distribution for the annual incomes of all 300K households in Boston. While it helps to have some idea what this distribution might be, it is not a strict requirement. You can get a decent interval estimate of the unknown population mean even if you have absolutely no idea what distribution the population follows!
In real life decision making, interval estimates can be more useful than relying on a point estimate. For example, next time you hear a news story on some drug having shown an efficacy of 70% in curing some civilization-threatening disease, you may want to check up the 95% confidence interval that the drug maker would have reported. If that interval turns out to be wide one, say 25%–90%, then the 70% point estimate may not be a good indication of the true efficacy.
Estimating the interval for the population mean
Let’s illustrate the process of interval estimation using a real world data set.
We’ll use the following data set of 23.7K water samples taken from various beaches in the New York City metro area from 2005 to 2021. But instead of using the entire data set, we’ll use only the water samples data for Midland beach in Staten Island, New York.
You may download the data either straight from the NYC OpenData site, or (preferably) download the curated version of the data set from over here.
We’ll use the Pandas library to load data set into memory. We’ll start by importing all the Python packages that we would need:
import math import pandas as pd import numpy as np from matplotlib import pyplot as plt from scipy.stats import norm #Load the data file df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, infer_datetime_format=True, parse_dates=['Sample_Date'])
Our focus will be on the water quality data gathered for MIDLAND BEACH:
df_midland = df[df['Beach_Name']=='MIDLAND BEACH'] #print the data frame print(df_midland)
We see the following output:
This is our sample. Before we proceed further, notice that there are NaNs in the DataFrame in rows containing microbial values that were too small for the measuring apparatus to detect. We’ll replace all these NaNs with 0.
Let’s print out the summary statistics for the sample:
We’ll print out one more statistic, which is the most frequently occurring value a.k.a. the mode:
We get a value of 0 for the mode, which means most samples which were taken from Midland beach contain zero (or undetectable level) levels of Enterococci.
The following plot shows the frequency distribution of sample values:
plt.hist(df_midland['Enterococci_Results'], bins=100) plt.xlabel('Number of Enterococci detected in the sample') plt.ylabel('Number of samples') plt.show()
We see that the samples from Midland beach have a large variance.
Let’s review what we have learned so far about the data:
- Our sample size n is 969
- The sample mean Y_bar is 25.311042
- The sample standard deviation S is 149.939190
- The sample is highly skewed in favor of zero values (as indeed it should for any well maintained public beach!)
What we do not know are:
- The population mean, µ. The population mean is the mean of the infinite number of samples that could theoretically have been collected at Midland beach from 2005 to 2021.
- The population standard deviation, σ
- The frequency distribution of the population.
What we want to find out is:
Can we create a probabilistic estimate for the interval in which the population mean µ might lie in? Specifically, can we identify an interval [µ_low, µ,high] so that µ lies in [µ_low, µ,high] with some probability such as 95%?
Let’s see how to estimate this interval.
We will define a random variable Y which represents the quantity we are measuring, namely the Enterococci count in the water sample.
We know that Y takes values such as 0, 10, 100, 4, 8 etc. and its units are MPN per 100 ml (the units are immaterial as long as they are all in the same units).
Given a specified probability ρ, we wish to find out two numbers µ_low and µ,high such that:
It is customary to use the notation (1-α) in place of ρ.
What we are saying here is that we want to carve out a region from the Probability Distribution Function of Y, such that:
- The area of this region is (1-α), and,
- This region extends from µ_low and µ_high on the X-axis.
The figure below illustrates this situation in the context of a purely hypothetical distribution for the random variable Y. Remember, we don’t know the actual distribution of the population Y! Note that we are using the customary notation x and f(x) in the figure below because showing a variable called Y on the X-axis might make one’s head spin! So just think of Y, when you see x in this figure.
Let’s look at the following expression again:
We know that the total area under the curve of the Probability Density Function is 1.0. So the un-shaded area is equal to α. We can re-express the above equation as follows:
We could slide µ_low and µ_high back and forth along the X-axis of the PDF curve until each un-shaded region has area α/2. The following figure illustrates this:
Thus we have:
P(Y ≤ µ_low) and P(Y ≤ µ_high) are simply the cumulative probabilities denoted by the Cumulative Distribution Function F(.) of Y. Thus, we have:
We will now (temporarily) assume that Y follows the standard normal distribution N(0, 1), i.e. it is normally distributed with a zero mean and a unit standard deviation.
You may be surprised that we are assuming that Y is N(0,1) distributed since we have seen that the Y is nowhere close to being even normally distributed, leave alone N(0,1) distributed! Here’s a reminder of how the distribution of sample Y looks like:
But worry not.
You will see that we will take this rather bold risk, and we will still get away with it!
Now, an important consequence of assuming that Y ~ N(0,1) is that CDF of Y, i.e. F(.) becomes the CDF of the standard normal distribution denoted by Φ (pronounced as Phi). Thus, F(x) = Φ(x), and therefore we have:
Let us take another small leap and define a real number p such that:
Since 0≤ α ≤ 1, we have: 0 ≤ p ≤ 1
Next, we define a quantity z_p as follows:
In the above definition, (1-p) is interpreted as the cumulative probability. For example:
This is illustrated below on the PDF of the N(0,1) distribution:
Again, illustrated below:
Thus, we see that:
Or, in general due to the symmetry of N(0,1) distribution around 0, we have:
Now, recollect that:
We have, after combining the above equations, the following result:
We can now express the (1-α)*100% confidence interval for the Enterococci count Y in terms of z-values as follows:
The following figure illustrates the two z-values for the 90% confidence interval around the zero mean of a standard normal distribution:
Let’s keep in mind that all of the above assumes that the Enterococci count Y has a standard normal distribution even though we know very well that it is not. We have to find a way to deal with this Achilles’ heel.
At the risk of mixing some metaphors, we will now play our Ace of Spades and ‘cure’ our Achilles’ heel, in the following two steps:
- We will redefine Y as the mean count of Enterococci found in n samples. Recollect that earlier, Y was just a point estimate, a single count. Now, we redefine Y to be the mean count as follows:
Here, Y_1, Y_2,…Y_n are the Enterococci counts that were measured in samples 1,2,3,…n. Note that Y_1, Y_2,…Y_n are themselves independent, identically distributed random variables. And therefore, the redefined Y as the mean of Y_1, Y_2,…Y_n, is also a random variable. To see why the mean count is also a random variable, imagine collecting another sample of size n from Midland beach. It will yield a different mean value. Now collect a third sample of size n, and you’ll get a third, possibly different value for the mean value Y, and so one. Thus, the mean count of Enterococci, Y, is itself a random variable. And therefore, everything that we have derived so far regarding the confidence intervals of Y still stands true.
- Now, let’s define another random variable Z such that:
In the above formula:
- n is the sample size (in our data set, n=969)
- µ and σ are respectively the population mean and population standard deviation.
It can be shown that if n is ‘large enough’, then by the Central Limit Theorem, Z is approximately N(0, 1) distributed.
It can also be shown that for large n, the sample standard deviation S provides an unbiased estimate of the population standard deviation σ. Simply put, we can replace σ with S in the above formula for Z.
Just like Y, the variable Z is also a random variable, and Z has an approximately standard normal distribution. Therefore, we can formulate the (1-α) confidence interval estimate for Z in the same way as for Y:
Substituting the formula for Z in the above equation:
Rearranging, we have the following result:
What we have accomplished is to arrive at the interval estimate [µ_low, µ_high] for the population mean µ at a (1-α) confidence level.
Let’s use the above formula to calculate the interval estimate for µ at 95% confidence level for the samples collected at Midland beach:
#sample size n n = len(df_midland['Enterococci_Results']) #sample mean Y Y = df_midland['Enterococci_Results'].mean() #sample standard deviation S = df_midland['Enterococci_Results'].std() #significance alpha (1-alpha)*100 = 95% alpha = 0.05 #p-value for required alpha p = alpha / 2 #z value for the specified p-value z_p=norm.ppf(1-p) #mu_low mu_low = Y - z_p*S/math.sqrt(n) #mu_high mu_high = Y + z_p*S/math.sqrt(n) print('95% Confidence intervals for the population mean (mu)='+str((mu_low, mu_high)))
We see the following output:
95% Confidence intervals for the population mean (mu)=(15.870403932430401, 34.751680690892634)
The rather wide confidence interval is due to the highly skewed distribution of the sample values which has in turn led to a large value of sample standard deviation S=149.94.
Applicability of Interval Estimation to Regression Modeling
When you train a regression model, the coefficients of the regression variables acquire their ‘fitted’ values as follows:
The bold notation indicates that the values are matrices.
For e.g., the equation of a trained linear regression model can be expressed as follows:
Each one of the fitted coefficients β_cap is a random variable. To see why, imagine that the model is trained on another training sample of size n. On the second sample, the fitted coefficients are likely to take on a different vector of values. On a third training sample of size n, β_cap would take on yet another vector of values and so one. Thus, each β_cap_j in β_cap is a random variable having some unknown distribution.
Thus, whenever you train your regression model on a training data set, what you are getting are mere point estimates of the true population values of β_cap.
Given this situation, it would be useful to know, at some level of confidence, what is the interval in which the ‘true’ population value of each β_cap_j in β_cap lies in. In other words, it would be useful to know the interval estimate for β_cap at some confidence level (1-α)*100%.
Whenever you train a regression model, the your modeling software will usually reports these confidence intervals for all regression coefficients.
Let us see a quick illustration of the application of interval estimates by building a regression model on the Enterococci counts data set.
Our variables are as follows:
Y = Enterococci_Results
X = [Sample_Location, MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH], i.e. 3 regression variables.
Start by importing all the required packages:
import pandas as pd from patsy import dmatrices import statsmodels.api as sm
Load the data set into memory using Pandas:
df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, infer_datetime_format=True, parse_dates=['Sample_Date'])
Carve out the data for the Midland beach:
df_midland = df[df['Beach_Name']=='MIDLAND BEACH']
Add the dummy variables: MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH
df_midland['MEASUREMENT_DAY_OF_WEEK'] = df['Sample_Date'].dt.dayofweek df_midland['MEASUREMENT_MONTH'] = df['Sample_Date'].dt.month
Form the regression expression:
expr = 'Enterococci_Results ~ Sample_Location + MEASUREMENT_DAY_OF_WEEK + MEASUREMENT_MONTH'
Use Patsy to carve out the X and y matrices:
y_train, X_train = dmatrices(expr, df_midland, return_type='dataframe')
Build and train a Generalized Linear Model with a Poisson link function:
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
Print the fitted model’s summary:
We see the following output. I have highlighted the fitted coefficients of the model. Notice also that statsmodels has printed out the interval estimates for the true population level values of the coefficients at the 95% confidence level:
References and Copyrights
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
PREVIOUS: A Guide to Efficiency of an Estimator
NEXT: How To Adjust For Inflation In Monetary Data Sets