###### …and its applicability to confidence intervals

Suppose you know the mean value of a sample and you wish to use the sample mean to estimate the *interval *that the population’s mean will lie in. The Interval Estimation technique can be used to arrive at this estimate at some specified confidence level. This technique can be easily extended to estimating the interval for other population level statistics such as the variance.

To illustrate, suppose you have polled one hundred randomly selected households in Boston, Massachusetts, and you have found their average annual income to be $65,713. Can you use this single number to estimate, with some quantifiable measure of confidence such as 90%, the range in which the average annual income of *all* three hundred thousand households in Boston will lie? It turns out, you can do this!

The following figure illustrates this situation:

In the above illustration, you may notice that we have assumed a certain probability distribution for the annual incomes of all 300K households in Boston. While it helps to have some idea what this distribution might be, it is not a strict requirement. You can get a decent interval estimate of the unknown population mean even if you have absolutely no idea what distribution the population follows!

In real life decision making, interval estimates can be more useful than relying on a point estimate. For example, next time you hear a news story on some drug having shown an efficacy of 70% in curing some civilization-threatening disease, you may want to check up the 95% confidence interval that the drug maker would have reported. If that interval turns out to be wide one, say 25%–90%, then the 70% point estimate may not be a good indication of the true efficacy.

## Estimating the interval for the population mean

Let’s illustrate the process of interval estimation using a real world data set.

We’ll use the following data set of 23.7K water samples taken from various beaches in the New York City metro area from 2005 to 2021. But instead of using the entire data set, we’ll use only the water samples data for **Midland beach in Staten Island, New York**.

You may download the data either straight from the **NYC OpenData site**, or (preferably) download the curated version of the data set from **over here**.

We’ll use the Pandas library to load data set into memory. We’ll start by importing all the Python packages that we would need:

```
import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
#Load the data file
df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, infer_datetime_format=True, parse_dates=['Sample_Date'])
```

Our focus will be on the water quality data gathered for MIDLAND BEACH:

```
df_midland = df[df['Beach_Name']=='MIDLAND BEACH']
#print the data frame
print(df_midland)
```

We see the following output:

This is our sample. Before we proceed further, notice that there are NaNs in the DataFrame in rows containing microbial values that were too small for the measuring apparatus to detect. We’ll replace all these NaNs with 0.

```
df_midland.fillna(value=0,inplace=True)
```

Let’s print out the summary statistics for the sample:

```
df_midland['Enterococci_Results'].describe()
```

We’ll print out one more statistic, which is the most frequently occurring value a.k.a. the mode:

```
print(df_midland['Enterococci_Results'].mode())
```

We get a value of 0 for the mode, which means most samples which were taken from Midland beach contain zero (or undetectable level) levels of Enterococci.

The following plot shows the frequency distribution of sample values:

```
plt.hist(df_midland['Enterococci_Results'], bins=100)
plt.xlabel('Number of Enterococci detected in the sample')
plt.ylabel('Number of samples')
plt.show()
```

We see that the samples from Midland beach have a large variance.

Let’s review what we have learned so far about the data:

- Our
**sample size**is 969*n* - The
**sample mean**is 25.311042*Y_bar* - The
**sample standard deviation**is 149.939190*S* - The
**sample is highly skewed**in favor of zero values (as indeed it should for any well maintained public beach!)

What we **do not know** are:

- The
**population mean,**. The*µ**population mean*is the mean of the infinite number of samples that could theoretically have been collected at Midland beach from 2005 to 2021. - The
**population standard deviation,***σ* - The
**frequency distribution**of the population.

What we want to find out is:

*Can we create a probabilistic estimate for the interval in which the population mean **µ **might lie in? Specifically, can we identify an interval [µ_low, µ,high] so that µ lies in [µ_low, µ,high] with some probability such as 95%?*

Let’s see how to estimate this interval.

We will define a random variable ** Y** which represents the quantity we are measuring, namely the Enterococci count in the water sample.

We know that ** Y **takes values such as

*0, 10, 100, 4, 8 etc.*and its units are MPN per 100 ml (the units are immaterial as long as they are all in the same units).

Given a specified probability ρ, we wish to find out two numbers *µ_low *and *µ,high* such that:

It is customary to use the notation *(1-α) *in place of *ρ*.

What we are saying here is that we want to carve out a region from the **P**robability **D**istribution **F**unction of *Y**, *such that:

- The area of this region is
*(1-α)*, and, - This region extends from
*µ_low*and*µ_high*on the X-axis*.*

The figure below illustrates this situation in the context of a purely hypothetical distribution for the random variable ** Y**. Remember, we don’t know the actual distribution of the population

**! Note that we are using the customary notation**

*Y**x*and

*f(x)*in the figure below because showing a variable called

**on the**

*Y**X-axis*might make one’s head spin! So just think of

**, when you see**

*Y**x*in this figure.

Let’s look at the following expression again:

We know that the total area under the curve of the Probability Density Function is 1.0. So the un-shaded area is equal to *α*. We can re-express the above equation as follows:

We could slide *µ_low *and *µ_high *back and forth along the *X-axis* of the PDF curve until each *un-shaded *region has area *α/2*. The following figure illustrates this:

Thus we have:

P(Y ≤ µ_low) and P(Y ≤ µ_high) are simply the cumulative probabilities denoted by the **C**umulative *D**istribution **F**unction F(.) of **Y**.* Thus, we have:

We will now (temporarily) assume that ** Y** follows the standard normal distribution

*N(0, 1)*, i.e. it is normally distributed with a zero mean and a unit standard deviation.

You may be surprised that we are assuming that ** Y** is

*N(0,1)*distributed since we have seen that the

**is nowhere close to being even normally distributed, leave alone**

*Y**N(0,1)*distributed! Here’s a reminder of how the distribution of sample

**looks like**

*Y*

*:*But worry not.

You will see that we will take this rather bold risk, and we will still get away with it!

Now, an important consequence of assuming that *Y** ~ N(0,1) *is that *CDF of **Y**, i.e. F(.) becomes the CDF of the standard normal distribution denoted by Φ (pronounced as Phi)***. **Thus, *F(x) = Φ**(**x)*, and therefore we have:

Let us take another small leap and define a real number *p* such that:

Since *0≤ α ≤ 1*, we have: *0 ≤ p ≤ 1*

Next, we define a quantity *z_p* as follows:

In the above definition, *(1-p)* is interpreted as the cumulative probability. For example:

This is illustrated below on the PDF of the *N(0,1) *distribution:

Similarly:

Again, illustrated below:

Thus, we see that:

Or, in general due to the symmetry of *N(0,1)* distribution around *0*, we have:

Now, recollect that:

And since:

We have, after combining the above equations, the following result:

We can now express the *(1-α)*100% *confidence interval for the Enterococci count ** Y** in terms of z-values as follows:

The following figure illustrates the two z-values for the 90% confidence interval around the zero mean of a standard normal distribution:

Let’s keep in mind that all of the above assumes that the Enterococci count ** Y** has a standard normal distribution even though we know very well that it is not. We have to find a way to deal with this Achilles’ heel.

At the risk of mixing some metaphors, we will now play our Ace of Spades and ‘cure’ our Achilles’ heel, in the following two steps:

- We will redefine
as the*Y***mean count of Enterococci**found in*n*samples. Recollect that earlier,was just a point estimate, a single count. Now, we redefine*Y*to be the mean count as follows:*Y*

Here, ** Y_1, Y_2,…Y_n** are the Enterococci counts that were measured in samples

*1,2,3,…n*. Note that

**are themselves independent, identically distributed random variables. And therefore, the redefined**

*Y_1, Y_2,…Y_n***as the mean of**

*Y***, is also a random variable. To see why the mean count is also a random variable, imagine collecting another sample of size**

*Y_1, Y_2,…Y_n**n*from Midland beach. It will yield a different mean value. Now collect a third sample of size

*n*, and you’ll get a third, possibly different value for the mean value

**, and so one. Thus, the**

*Y***. And therefore, everything that we have derived so far regarding the confidence intervals of**

*mean count of Enterococci, Y, is itself a random variable***still stands true.**

*Y*- Now, let’s define another random variable
such that:*Z*

In the above formula:

*n*is the sample size (in our data set,*n=969*)*µ*and*σ*are respectively the population mean and population standard deviation.

It can be shown that if *n *is ‘large enough’, then by the Central Limit Theorem, *Z** is approximately N(0, 1) distributed.*

It can also be shown that **for large n, the sample standard deviation S provides an unbiased estimate of the population standard deviation σ**

*.*Simply put, we can replace

*σ*with

*S*in the above formula for

**.**

*Z*Just like ** Y**, the variable

**is also a random variable, and**

*Z***has an approximately standard normal distribution. Therefore, we can formulate the**

*Z**(1-α)*confidence interval estimate for

**in the same way as for**

*Z***:**

*Y*Substituting the formula for ** Z** in the above equation:

Rearranging, we have the following result:

What we have accomplished is to arrive at the interval estimate *[µ_low, µ_high]* for the population mean* µ *at a *(1-α)* confidence level.

Let’s use the above formula to calculate the interval estimate for *µ at 95% *confidence level for the samples collected at Midland beach:

```
#sample size n
n = len(df_midland['Enterococci_Results'])
#sample mean Y
Y = df_midland['Enterococci_Results'].mean()
#sample standard deviation
S = df_midland['Enterococci_Results'].std()
#significance alpha (1-alpha)*100 = 95%
alpha = 0.05
#p-value for required alpha
p = alpha / 2
#z value for the specified p-value
z_p=norm.ppf(1-p)
#mu_low
mu_low = Y - z_p*S/math.sqrt(n)
#mu_high
mu_high = Y + z_p*S/math.sqrt(n)
print('95% Confidence intervals for the population mean (mu)='+str((mu_low, mu_high)))
```

We see the following output:

95% Confidence intervals for the population mean (mu)=(15.870403932430401,34.751680690892634)

The rather wide confidence interval is due to the highly skewed distribution of the sample values which has in turn led to a large value of sample standard deviation *S=149.94*.

## Applicability of Interval Estimation to Regression Modeling

When you train a regression model, the coefficients of the regression variables acquire their ‘fitted’ values as follows:

The bold notation indicates that the values are matrices.

For e.g., the equation of a **trained linear regression model** can be expressed as follows:

Each one of the fitted coefficients ** β_cap** is a random variable. To see why, imagine that the model is trained on another training sample of size

*n*. On the second sample, the fitted coefficients are likely to take on a different vector of values. On a third training sample of size

*n*,

**would take on yet another vector of values and so one. Thus, each**

*β_cap**β_cap_j*

**is a random variable having some unknown distribution.**

*in β_cap*Thus, whenever you train your regression model on a training data set, what you are getting are mere point estimates of the true population values of *β_cap.*

Given this situation, it would be useful to know, at some level of confidence, what is the interval in which the ‘true’ population value of each *β_cap_j* in ** β_cap** lies in. In other words, it would be useful to know the interval estimate for

*β_cap**at some confidence level*

*(1-α)*100%*.

Whenever you train a regression model, the your modeling software will usually reports these confidence intervals for all regression coefficients.

Let us see a quick illustration of the application of interval estimates by building a regression model on the Enterococci counts data set.

Our variables are as follows:*Y** = Enterococci_Results*

*X**= [Sample_Location, MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH], i.e. 3 regression variables.*

Start by importing all the required packages:

```
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
```

Load the data set into memory using Pandas:

```
df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, infer_datetime_format=True, parse_dates=['Sample_Date'])
```

Carve out the data for the Midland beach:

```
df_midland = df[df['Beach_Name']=='MIDLAND BEACH']
```

Add the dummy variables: *MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH*

```
df_midland['MEASUREMENT_DAY_OF_WEEK'] = df['Sample_Date'].dt.dayofweek
df_midland['MEASUREMENT_MONTH'] = df['Sample_Date'].dt.month
```

Form the regression expression:

```
expr = 'Enterococci_Results ~ Sample_Location + MEASUREMENT_DAY_OF_WEEK + MEASUREMENT_MONTH'
```

Use Patsy to carve out the ** X** and

**matrices:**

*y*```
y_train, X_train = dmatrices(expr, df_midland, return_type='dataframe')
```

Build and train a Generalized Linear Model with a Poisson link function:

```
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
```

Print the fitted model’s summary:

```
print(poisson_training_results.summary())
```

We see the following output. I have highlighted the fitted coefficients of the model. Notice also that statsmodels has printed out the **interval estimates** for the true population level values of the coefficients at the 95% confidence level:

## References and Copyrights

### Data set

DOHMH Beach Water Quality Data taken from NYC OpenData under their Terms of Use

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **A Guide to Efficiency of an Estimator

**NEXT: **How To Adjust For Inflation In Monetary Data Sets