The Science of Statistical Expectation

What it means to expect something, and how to compute the expected value of discrete and continuous random variables in 1-dimensional and higher dimensional spaces


It was the summer of 1988 when I stepped onto a boat for the first time in my life. It was a passenger ferry from Dover, England to Calais, France. I didn’t know it then, but I was catching the tail end of the golden era of Channel crossings by ferry. This was right before budget airlines and the Channel Tunnel almost kiboshed what I still think is the best way to make that journey.

I expected the ferry to look like one of the many boats I had seen in children’s books. Instead, what I came upon was an impossibly large, gleaming white skyscraper with small square windows. And the skyscraper appeared to be resting on its side for some baffling reason. From my viewing angle on the dock, I couldn’t see the ship’s hull and funnels. All I saw was its long, flat, windowed exterior. I was looking at a horizontal skyscraper.

Passenger ferry
Photo by Martin on Unsplash

Thinking back, it’s amusing to recast my experience in the language of statistics. In all likelihood, my brain had computed the expected shape of a ferry from the data sample of boat pictures I had seen.

But my sample was hopelessly unrepresentative of the population, which made the sample mean equally unrepresentative of the population mean.

I was trying to decode reality using a heavily biased sample mean.

Seasickness

This trip across the Channel was also the first time I got seasick. They say when you get seasick you should go out onto the deck, take in the fresh, cool, sea breeze and stare at the horizon. No, sorry, not for me. The only thing that really works for me is to sit down, close my eyes, and sip my favorite soda until my thoughts drift slowly away from the harrowing nausea roiling my stomach.

By the way, I am not drifting slowly away from the topic of this chapter which is statistical expectation. I’ll get right into the statistics in a minute. In the meantime, let me explain my understanding of why you get sick on a boat so that you’ll see the connection to the topic at hand.

On most days of your life, you are not getting rocked about on a ship. On land, when you tilt your body to one side, your inner ears and every muscle in your body tell your brain that you are tilting to one side. Yes, your muscles talk to your brain too! Your eyes eagerly second all this feedback and you come out just fine.

On a ship, all hell breaks loose on this affable pact between eye and ear.

On a ship, when the sea makes the ship tilt, rock, sway, roll, drift, bob, or any of the other things, what your eyes tell your brain can be startlingly different than what your muscles and inner ear tell your brain. Your inner ear might say, “Watch out! You are tilting left. You should adjust your expectation of how your world will appear.” But your eyes are saying, “Nonsense! The table I am sitting at looks perfectly level to me, as does the plate of food resting upon it. The picture on the wall of that thing that is screaming also appears straight and level. Do not listen to the ear.”

The Scream
The Scream (public domain)

Your eyes could report something even more confusing to your brain such as “Yeah, you are tilting alright. But the tilt is not as significant or rapid as your overzealous inner ears might lead you to believe.”

It’s as if your eyes and your inner ears are each asking your brain to create two different expectations of how your world is about to change. Your brain obviously cannot do that. It gets confused. And for reasons buried in evolution your stomach expresses a strong desire to empty its contents.

Let’s try to explain this wretched situation by using the framework of statistical reasoning. This time, we’ll use a little bit of math to aid our explanation.

Should you expect to get seasick? Getting into the statistics of seasickness

Let’s define a random variable X that takes two values: 0 and 1. X is 0 if the signals from your eyes don’t agree with the signals from your inner ears. X is 1 if they do agree:

The random variable X
The random variable X (Image by Author)

In theory, each value of X ought to carry a certain probability P(X=x). The probabilities P(X=0) and P(X=1) together constitute the Probability Mass Function of X. We state it as follows:

PMF of X
PMF of X (Image by Author)

The overwhelming majority of the time, the signals from your eyes will agree with the signals from your inner ears. So p is almost equal to 1, and (1 — p) is a really, really tiny number.

Let’s hazard a wild guess about the value of (1 — p). We’ll use the following line of reasoning to arrive at an estimate:

According to the United Nations, the average life expectancy of humans at birth in 2023 is approximately 73 years. In seconds, that corresponds to 2302128000 (about 2.3 billion). Suppose an average individual experiences seasickness for a total of about 8 hours (roughly 28000 seconds) in their lifetime. Let’s not quibble about the 8 hours. It’s a wild guess. So 28000 seconds gives us a working estimate of (1 — p) of 28000/2302128000 = 0.0000121626 and p = (1 — 0.0000121626) = 0.9999878374. So during any second of the average person’s life, the unconditional probability of their experiencing seasickness is only 0.0000121626.
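These back-of-the-envelope numbers are easy to verify in Python. (The 73 years and the 28000-second working figure are, of course, the wild guesses from above.)

```python
# Verify the back-of-the-envelope estimate of (1 - p)
seconds_per_year = 365 * 24 * 3600
lifetime_seconds = 73 * seconds_per_year   # 2,302,128,000 seconds, about 2.3 billion
seasick_seconds = 28_000                   # lifetime seconds spent seasick (a wild guess)

one_minus_p = seasick_seconds / lifetime_seconds
p = 1 - one_minus_p
print(one_minus_p, p)  # ≈ 0.0000121626, ≈ 0.9999878374
```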

With these probabilities, we’ll run a simulation lasting 1 billion seconds (around 43% of a typical lifespan) in the lifetime of a certain John Doe. Let’s suppose JD prefers to spend most of this time on solid ground. He takes the occasional sea cruise, on which he often gets seasick. We’ll simulate whether JD will experience seasickness during each of the 1 billion seconds of the simulation. To do so, we’ll conduct 1 billion trials of a Bernoulli random variable having probabilities of p and (1 — p). The outcome of each trial will be 1 if JD doesn’t get seasick, or 0 if JD gets seasick. Upon conducting this experiment, we’ll get 1 billion outcomes. You too can run this simulation using the following fragment of Python code:

import numpy as np

# Probability that, during any given second, the signals agree (no seasickness)
p = 0.9999878374
num_trials = 1_000_000_000

# Each trial yields 1 (not seasick, probability p) or 0 (seasick, probability 1 - p)
outcomes = np.random.choice([0, 1], size=num_trials, p=[1 - p, p])

Let’s count the number of outcomes of value 1 (=not seasick) and 0 (=seasick):

num_outcomes_in_which_not_seasick = outcomes.sum()  # count of 1s
num_outcomes_in_which_seasick = num_trials - num_outcomes_in_which_not_seasick

We’ll print these counts. When I printed them, I got the following values. You may get slightly differing results each time you run your simulation:

num_outcomes_in_which_not_seasick= 999987794
num_outcomes_in_which_seasick= 12206

We can now calculate whether JD should expect to feel seasick during any one of those 1 billion seconds.

The expectation is calculated as the weighted average of the two possible outcomes in our sample space {0, 1}, the weights being the frequencies of the two outcomes. So let’s perform this calculation:

Expected value of the outcome
Expected value of the outcome (Image by Author)

The expected outcome is 0.999987794 which is practically 1.0. The math is telling us that during any randomly chosen second in the 1 billion seconds in JD’s simulated existence, JD should not expect to get seasick. The data seems to almost forbid it.
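In code, the weighted average is a one-liner. (The counts below are the ones from my run; yours may differ slightly.)

```python
num_trials = 1_000_000_000
num_not_seasick = 999_987_794   # outcomes equal to 1 in my run
num_seasick = 12_206            # outcomes equal to 0 in my run

# Weighted average of the outcomes 1 and 0, weighted by their frequencies
expected_outcome = (1 * num_not_seasick + 0 * num_seasick) / num_trials
print(expected_outcome)  # 0.999987794
```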

Let’s play with the above formula a bit. We’ll start by rearranging the terms as follows:

Expected value of the outcome
Expected value of the outcome (Image by Author)

When rearranged in this manner, we see a delightful sub-structure emerging. The ratios in the two brackets represent the probabilities associated with the two outcomes. They are sample probabilities, not population probabilities, because we calculated them from our 1-billion-strong data sample. Having said that, the values 0.999987794 and 0.000012206 should be pretty close to the population values of p and (1 — p) respectively.

By plugging in the probabilities p and (1 — p), we can restate the formula for expectation as follows:

Expected value of X
Expected value of X (Image by Author)

Notice that we used E(), the notation for expectation. Since X is a Bernoulli(p) random variable, the above formula also shows us how to compute the expected value of a Bernoulli random variable: the expected value of X ~ Bernoulli(p) is simply p.
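You can sanity-check this fact numerically: the mean of a large number of Bernoulli(p) draws should hover near p. A quick sketch (the choice of p = 0.3 here is arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
p = 0.3  # an arbitrary Bernoulli parameter for illustration

# One million Bernoulli(p) trials; the sample mean should be close to E(X) = p
samples = rng.binomial(n=1, p=p, size=1_000_000)
print(samples.mean())  # close to 0.3
```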

Of sample means, population means, and a word to make you sound cool

E(X) is also called the population mean, denoted by μ, because it uses the probabilities p and (1 — p) which are the population values of probability. These are the ‘true’ probabilities that you will observe should you have access to the entire population of values, which is practically never. Statisticians use the word ‘asymptotic’ while referring to these and similar measures. They are referred to as asymptotic because their meaning is significant only when something, such as the sample size, approaches infinity or the size of the entire population. I think statisticians just like to say ‘asymptotic’.

I also think it’s a convenient cover for the troublesome truth that you can never measure the exact value of anything.

On the bright side, the impossibility of getting your hands on the population is ‘the great leveler’ in the field of statistical science. Whether you are a freshly minted graduate or a Nobel laureate in Economics, that door to the ‘population’ remains firmly closed for you. As a statistician, you are relegated to working with the sample whose shortcomings you must suffer in silence.

But it’s really not as bad a state of affairs as it sounds. Imagine what would happen if you knew the exact values of things. If you had access to the population. If you could calculate the mean, the median, and the variance with bullseye accuracy. If you could foretell the future with pinpoint precision. There would be little need to estimate anything. Great big branches of statistics would cease to exist. The world would need hundreds of thousands fewer statisticians, not to mention data scientists. Imagine the impact on global unemployment, on the world economy, on world peace…

But I digress. My point is, if X is Bernoulli(p), then to calculate E(X), you cannot in practice use the actual population values of p and (1 — p). Instead, you must make do with estimates of p and (1 — p). These estimates you will calculate using not the entire population — no chance of doing that — but, more often than not, a modest-sized data sample. And so with much regret I must inform you that the best you can do is get an estimate of the expected value of the random variable X. Following convention, we denote the estimate of p as p̂ (p with a little cap or hat on it) and we denote the estimated expected value as Ê(X).

Estimated expectation of X
Estimated expectation of X (Image by Author)

Since Ê(X) uses sample probabilities, it’s called the sample mean. It is denoted by x̄, an x with a bar placed on its head (pronounced ‘x bar’).

The population mean and the sample mean are the Batman and Robin of statistics.

A great deal of Statistics is devoted to calculating the sample mean and to using the sample mean as an estimate of the population mean.

And there you have it — the sweeping expanse of Statistics summed up in a single sentence. 😉

Diving into the deep end of expectation

Our thought experiment with the Bernoulli random variable has been instructive in that it has unraveled the nature of expectation to some extent. The Bernoulli variable is a binary variable, and it was simple to work with. However, the random variables we often work with can take on many different values. Fortunately, we can easily extend the concept and the formula for expectation to many-valued random variables. Let’s illustrate with another example.

The expected value of a multi-valued, discrete random variable

The following table shows information about automobiles. Specifically, the table displays the number of cylinders within the engine of each vehicle. There happen to be a total of 205 automobiles in this data set.

Cylinder counts of automobiles
Cylinder counts of automobiles (Data source: UCI machine learning dataset repository under (CC BY 4.0) license) (Image by Author)

Let Y be a random variable that contains the number of cylinders of a randomly chosen vehicle from this dataset. We happen to know that the dataset contains vehicles with cylinder counts of 2, 3, 4, 5, 6, 8, or 12. So the range of Y is the set E = {2, 3, 4, 5, 6, 8, 12}.

We’ll group the data rows by cylinder count such that each row in the grouped table contains the number of vehicles in the data set for the corresponding cylinder count. The table below shows the grouped counts. The Id and Make columns contain these counts. The last column indicates the corresponding sample probability of occurrence of each count. This probability is calculated by dividing each grouped count by 205, the number of vehicles in the data set:

Frequency distribution of cylinder counts

Using the sample probabilities, we can construct the Probability Mass Function P(Y) for Y. If we plot it against Y, it looks like this:

PMF of Y
PMF of Y (Image by Author)

If a randomly chosen vehicle rolls out in front of you, what will you expect its cylinder count to be? Just by looking at the PMF in the above figure, the number you’ll want to guess is 4. But unbeknownst to you, there’s also some cold, hard math backing this guess. Similar to the Bernoulli X, you can calculate the expected cylinder count of a randomly selected vehicle, i.e. the expected value of the random variable Y, as follows:

The expected value of Y
The expected value of Y (Image by Author)

If you calculate the sum using the probabilities from the PMF plot of Y, it comes to 4.38049, which is pretty close to your guess of 4 cylinders.
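Here’s the same calculation in Python. The per-cylinder vehicle counts below are assumed from the grouped table (they total 205 and reproduce the 4.38049 figure):

```python
# Number of vehicles for each cylinder count (grouped counts from the table)
counts = {2: 4, 3: 1, 4: 159, 5: 11, 6: 24, 8: 5, 12: 1}
n = sum(counts.values())  # 205 vehicles in total

# E(Y) = sum over y of y * P(Y = y), where P(Y = y) = count / n
expected_cylinders = sum(y * (c / n) for y, c in counts.items())
print(round(expected_cylinders, 5))  # 4.38049
```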

Recall that the range of Y is the set E = {2, 3, 4, 5, 6, 8, 12} where each element in E denotes a distinct number of cylinders. We can express the above sum as a summation over E as follows:

Formula for the expected value of the discrete random variable Y
Formula for the expected value of the discrete random variable Y (Image by Author)

You can use the above formula to calculate the expected value of any discrete random variable whose range is the set E.

The expected value of a continuous random variable

If you are dealing with a continuous random variable, the situation changes a bit.

Let’s return to our dataset of vehicles. This time, we’ll look at the lengths of vehicles:

Lengths of automobiles (Data source: UCI machine learning dataset repository under (CC BY 4.0) license) (Image by Author)

Suppose Z is a random variable that holds the length in inches of a randomly selected vehicle. The range of Z is no longer a discrete set of values. Instead, it’s a subset of the set of real numbers. Since lengths are always positive, it’s the set of all positive real numbers, denoted ℝ>0.

Since the set of all positive real numbers has an (uncountably) infinite number of values, it’s meaningless to assign a probability to an individual value of Z. If you don’t believe me, consider a quick thought experiment: imagine assigning some fixed positive probability to each possible value of Z. You’ll find that the probabilities sum to infinity, which is absurd. So the probability P(Z=z) simply doesn’t exist. Instead, you must work with the Probability Density Function f(Z=z), which assigns a probability density to different values of Z.

We previously discussed how to calculate the expected value of a discrete random variable using the Probability Mass Function.

Formula for the expected value of the discrete random variable Y
Formula for the expected value of the discrete random variable Y (Image by Author)

Can we repurpose this formula to calculate the expected value of a continuous random variable such as Z? The answer is yes. To know how, imagine yourself with an electron microscope.

Take that microscope and focus it on the range of Z, which is ℝ>0. Now, zoom in on a very, very tiny interval (z, z+δz] within this range. At this microscopic scale, you’ll observe that the probability density f(Z=z) is more or less constant across δz. Consequently, the product of f(Z=z) and δz can approximate the probability that a randomly selected vehicle’s length falls within the half-open interval (z, z+δz].

Armed with this approximate probability, you can approximate the expected value of Z as follows:

An approximate evaluation of E(Z) when Z is continuous
An approximate evaluation of E(Z) when Z is continuous (Image by Author)

Notice how we pole-vaulted from E(Y) to E(Z). To summarize, we did the following:

  • We replaced the discrete yi with the real-valued zi.
  • In the summation, we replaced P(Y=y), the PMF of Y, with f(Z=z)δz, the approximate probability of finding z in the microscopic interval (z, z+δz].
  • Instead of summing over the discrete, finite range of Y, which is E, we summed over the continuous, infinite range of Z, which is ℝ>0.
  • Finally, we replaced the equals sign with the approximation sign. And therein lies our guilt. We cheated. We sneaked in f(Z=z)δz as an approximation of the exact probability P(Z=z). We cheated because the exact probability, P(Z=z), cannot exist for a continuous Z. Nevertheless, we must make amends for this transgression. And that’s what we’ll do next.

The above formula gives only an approximate value for E(Z) because we assumed that f(Z=z) is constant over δz. In reality, it won’t be exactly constant, since δz has a finite width. We need a way to go from this approximation to an exact calculation, and the only way to do that is to shrink δz into an infinitesimal interval, over which f(Z=z) truly is constant. And for that, we now execute our master stroke, our pièce de résistance:

Since ℝ>0 is the set of positive real numbers, there are an infinite number of microscopic intervals of size δz in ℝ>0. Therefore, the summation over ℝ>0 is effectively a summation over an infinite number of terms. This fact presents us with the perfect opportunity to replace the approximate summation with an exact integral, as follows:

The expected value of Z
The expected value of Z (Image by Author)

If Z’s range is the real valued interval [a, b], we set the limits of the definite integral to a and b instead of 0 and ∞.

If you know the PDF of Z and if the integral of z times f(Z=z) exists over [a, b], you will be able to solve the above integral and get E(Z) for your troubles.

For example, if Z is uniformly distributed over the range [a, b], it’s denoted as Z ~ Uniform(a, b). In this case, the PDF of Z will be as follows:

PDF of Z ~ Uniform(a, b)
PDF of Z ~ Uniform(a, b) (Image by Author)

If you set a=1 and b=5,

f(Z=z) = 1/(5 − 1) = 0.25.

The probability density is a constant 0.25 from Z=1 to Z=5, and it is zero everywhere else. Here’s what the PDF of Z looks like:

PDF of Z ~ Uniform(1, 5)
PDF of Z ~ Uniform(1, 5) (Image by Author)

It’s basically a continuous flat horizontal line from (1,0.25) to (5,0.25) and it’s zero everywhere else.

Given the above f(Z), you can calculate E(Z) using the following procedure:

Procedure for calculating the expected value of a continuous random variable that is uniformly distributed over the interval [a, b]
Procedure for calculating the expected value of a continuous random variable that is uniformly distributed over the interval [a, b] (Image by Author)

If a=1 and b=5, the mean of Z ~ Uniform(1, 5) is simply (1+5)/2 = 3 which also agrees perfectly with our intuition. If each one of the infinitely many values between 1 and 5 is equally likely, we’d expect the mean to work out to the simple average of 1 and 5.
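You can also verify the integral numerically with a midpoint Riemann sum. Since z·f(z) is linear on [1, 5], the midpoint rule recovers E(Z) = 3 essentially exactly:

```python
import numpy as np

a, b = 1.0, 5.0
f = 1.0 / (b - a)  # constant PDF of Z ~ Uniform(1, 5), i.e. 0.25

# Midpoint Riemann sum of z * f(z) over [a, b]
edges = np.linspace(a, b, 100_001)
mid = (edges[:-1] + edges[1:]) / 2
dz = edges[1] - edges[0]
expected_z = np.sum(mid * f * dz)
print(expected_z)  # ≈ 3.0
```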

Now I hate to deflate your spirits but in practice, you are more likely to spot double rainbows landing on your front lawn than come across continuous random variables for which you will use the integral method to calculate their expected value.

You see, delightful looking PDFs that can be integrated to get the expected value of the corresponding variables have a habit of ensconcing themselves in end-of-the-chapter exercises of college textbooks. They are like house cats. They don’t ‘do outside’. But as a practicing statistician, ‘outside’ is where you must live. Outside, you will find yourself staring at data samples of continuous values like lengths of vehicles. To model the PDF of such real-world random variables, you are likely to use one of the well-known continuous functions such as the Normal, the Log-Normal, the Chi-square, the Exponential, the Weibull and so on, or a mixture distribution. In other words, whatever seems to best fit your data.

Here are a couple of such continuous distributions and their means:

The PDFs and the means of continuous random variables that are Normally distributed and Chi-square distributed
The PDFs and the expected values of continuous random variables that are Normally distributed and Chi-square distributed (Image by Author)

For many commonly used PDFs, someone has already taken the trouble to derive the mean of the distribution by integrating ( x times f(x) ) just like we did with the Uniform distribution. I’ve mentioned the means of two such commonly used distributions in the table above.

  • The normal distribution with parameters μ and σ², denoted as N(μ, σ²), has a mean μ and variance σ².
  • The Chi-squared distribution with k degrees of freedom, denoted Chi-squared(k) or χ²(k), has a mean k and variance 2k.
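Both bullet points are easy to confirm by simulation. The parameter choices below (μ = 3, σ = 2, k = 4) are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Normal(mu, sigma^2): the sample mean should approach mu
normal_draws = rng.normal(loc=3.0, scale=2.0, size=1_000_000)
print(normal_draws.mean())  # ≈ 3

# Chi-squared(k): the sample mean should approach k
chisq_draws = rng.chisquare(df=4, size=1_000_000)
print(chisq_draws.mean())   # ≈ 4
```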

Here are a couple more of such distributions and their means. As a practicing statistician, it helps to remember these means instead of calculating them from first principles:

The PDFs and the means of continuous random variables that are Exponentially distributed and Gamma distributed
The PDFs and the expected values of continuous random variables that are Exponentially distributed and Gamma distributed

Finally, in some situations, actually in many situations, real life datasets exhibit patterns that are too complex to be modeled by any one of these distributions. It’s like when you come down with a virus that mobs you with a horde of symptoms. To help you overcome them, your doctor puts you on a drug cocktail, with each drug having a different strength, dosage, and mechanism of action.

When you are mobbed with data that exhibits many complex patterns, you must deploy a small army of probability distributions to model it. Such a combination of different distributions is known as a mixture distribution. A commonly used mixture is the potent Gaussian Mixture, a weighted sum of the Probability Density Functions of several normally distributed random variables, each one having a different mean and variance.
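As a sketch, here is a two-component Gaussian mixture (the weights, means, and variances are made up purely for illustration). Its mean is the weighted sum of the component means:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 1_000_000

# Hypothetical mixture: 30% N(0, 1) and 70% N(10, 4)
weights = [0.3, 0.7]
comp_is_second = rng.random(n) < weights[1]
samples = np.where(comp_is_second,
                   rng.normal(10.0, 2.0, n),  # second component
                   rng.normal(0.0, 1.0, n))   # first component

# Mixture mean = 0.3 * 0 + 0.7 * 10 = 7
print(samples.mean())  # ≈ 7
```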

Given a sample of real valued data, you may find yourself doing something dreadfully simple: you will take the average of the continuous valued data column and anoint it as the sample mean. Hidden within this simple approach is the assumption that every single value has the same probability of occurrence. To illustrate: to calculate the expected length of an automobile in the autos dataset, you simply take the average of the length column, which in our data set comes to 174.04927 inches. And that’s it. All done. But that is not it, and all is not done. For there is one question you still have to answer.

How good is your sample mean? Getting a feel for its accuracy

How do you know how accurately your sample mean estimates the population mean? While gathering the data, you may have been unlucky, or lazy, or ‘data-constrained’ (which is often an excellent euphemism for good old laziness). Either way, you are staring at a sample that is not proportionately random: it does not proportionately represent the different characteristics of the population. Take the autos dataset: you may have collected data for a large number of medium-sized cars, and for too few large cars. And stretch limos may be completely missing from your sample. As a result, the expected vehicle length you calculate from your biased sample will be skewed toward the mean length of only the medium-sized cars in the population. Like it or not, you are now working on the belief that practically everyone drives a medium-sized car.

To thine own self be true

If you’ve gathered a heavily biased sample and you don’t know it and you don’t care to find out, then you should seriously reconsider a career in statistics or data science. But if you are willing to at least entertain the possibility of bias and you have some clues on what kind of data you may be missing (e.g. sports cars), then statistics will come to your rescue with powerful mechanisms to help you estimate this bias.

Unfortunately, no matter how hard you try you will never, ever, be able to gather a perfectly balanced sample. It will always contain biases because the exact proportions of various elements within the population remain forever inaccessible to you. Remember that door to the population? Remember how the sign on it always says ‘CLOSED’?

Your most effective course of action is to gather a sample that contains roughly the same fractions of all the things that exist in the population — the so-called well-balanced sample. The mean of this well-balanced sample is the best possible sample mean that you can set sail with.

But the laws of nature don’t always take the wind out of statisticians’ sailboats. There is a magnificent property of nature expressed in a theorem called the Central Limit Theorem (CLT).

You can use the Central Limit Theorem to determine how well your sample mean estimates the unknown population mean. Put another way, you can use the Central Limit Theorem to estimate the unknown population mean given the sample mean.

The CLT is not a silver bullet for dealing with badly biased samples. If your sample predominantly consists of mid-sized cars, you have effectively redefined your notion of the population. If you are intentionally studying only mid-sized cars, you are absolved. In this situation, feel free to use the CLT. It will help you estimate the population mean of all mid-sized cars ever produced.

On the other hand, if you wish to estimate the mean length of the entire population of vehicles ever produced, but your sample contains mostly mid-sized cars, you have a problem. If you use the CLT on this biased sample, it will still give you an estimate of the population mean of all mid-sized cars ever produced. Clearly not the mean you were looking for.

To the student of statistics, let me restate that in slightly different words. If your college thesis is on how often pets yawn but your recruits are 20 cats and your neighbor’s Poodle, then CLT or no CLT, no amount of statistical wizardry will help you assess the accuracy of your sample mean.

The essence of the CLT

A comprehensive understanding of CLT is the stuff for another chapter but the essence of what it states is the following:

If you draw a random sample of data points from the population and calculate the mean of the sample, then repeat this exercise many times, you’ll end up with…many different sample means. Well, duh! But something astonishing happens next. If you plot a frequency distribution of all these sample means, you’ll see that they are approximately normally distributed, and the approximation gets better as the size of each sample grows. What’s more, the mean of this distribution of sample means converges on the mean of the population you are studying. It is this eerily captivating facet of our universe’s personality that the Central Limit Theorem describes using the language of math.
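Here is a minimal sketch of that experiment. The population is deliberately non-normal (exponential with mean 5), yet the distribution of sample means comes out centered on the population mean:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A decidedly non-normal population: Exponential with mean 5
population_mean = 5.0
num_samples, sample_size = 10_000, 200

# Draw 10,000 samples of 200 points each and record each sample's mean
sample_means = rng.exponential(scale=population_mean,
                               size=(num_samples, sample_size)).mean(axis=1)

# The sample means cluster (approximately normally) around the population mean
print(sample_means.mean())  # ≈ 5
```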

The sample mean length of 174.04927 inches marked off on a normally distributed Z that has a hypothetical population mean of 180 inches (Image by Author)

Going beyond the first dimension: Expectation in a multi-dimensional sample space

So far, our discussion about expectation has been confined to a single dimension, but it needn’t be so. We can easily extend the concept of expectation to two, three, or higher dimensions. To calculate the expectation over a multi-dimensional space, all we need is a joint Probability Mass Function or a joint Probability Density Function that is defined over the N-dimensional space. A joint PMF or PDF takes multiple random variables as parameters and returns the probability of jointly observing those values.

Earlier in the chapter, we defined a random variable Y that represents the number of cylinders in a randomly chosen vehicle from the autos dataset. Y is your quintessential single dimensional discrete random variable and its expected value is given by the following equation:

Expected value of a single dimensional discrete random variable
Expected value of a single dimensional discrete random variable (Image by Author)

Let’s introduce a new discrete random variable, X. The joint Probability Mass Function of X and Y is denoted by P(X=xi, Y=yj), or simply as P(X, Y). This joint PMF lifts us out of the cozy, one-dimensional space that Y inhabits and deposits us into a more interesting 2-dimensional space. In this 2-D space, a single data point or outcome is represented by the tuple (xi, yj). If the range of X contains ‘p’ outcomes and the range of Y contains ‘q’ outcomes, the 2-D space will have (p × q) outcomes, and we use the tuple (xi, yj) to denote each one of them. To calculate E(Y) in this 2-D space, we must adapt the formula of E(Y) as follows:

The expected value of the discrete random variable Y over a 2-dimensional space
The expected value of the discrete random variable Y over a 2-dimensional space (Image by Author)

Notice that we are summing over all possible tuples (xi, yj) but for each tuple, we multiply the corresponding joint probability with only yj. Let’s tease apart this sum into a nested summation as follows:

The expected value of the discrete random variable Y over a 2-dimensional space
The expected value of the discrete random variable Y over a 2-dimensional space (Image by Author)

In the nested sum, the inner summation computes the sum of the products of yj and P(X=xi, Y=yj) over all values of yj for a given xi. The outer sum then repeats the inner sum for each value of xi. Afterward, it collects all these individual sums and adds them up to compute E(Y).
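To make the nested sum concrete, here is a sketch with a small, made-up joint PMF over a 2 × 3 space. Computing E(Y) through the nested sums gives the same answer as first collapsing the joint PMF into the marginal P(Y):

```python
# A hypothetical joint PMF P(X = xi, Y = yj); the probabilities sum to 1
x_values = [0, 1]
y_values = [2, 4, 6]
joint_pmf = {
    (0, 2): 0.10, (0, 4): 0.25, (0, 6): 0.15,
    (1, 2): 0.05, (1, 4): 0.30, (1, 6): 0.15,
}

# Outer sum over xi, inner sum over yj: E(Y) = sum_i sum_j yj * P(xi, yj)
e_y_nested = sum(sum(y * joint_pmf[(x, y)] for y in y_values) for x in x_values)

# Equivalent route: marginalize out X first, then take the 1-D expectation
marginal_y = {y: sum(joint_pmf[(x, y)] for x in x_values) for y in y_values}
e_y_marginal = sum(y * p for y, p in marginal_y.items())

print(e_y_nested, e_y_marginal)  # both ≈ 4.3
```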

We can extend the above formula to any number of dimensions by simply nesting the summations within each other. All you need is a joint PMF that is defined over the N-dimensional space. For instance, here’s how to extend the formula to 4-D space:

The formula for the expected value of the discrete random variable Y over a 4-dimensional space
The expected value of the discrete random variable Y over a 4-dimensional space (Image by Author)

Notice how we are always positioning the summation of Y at the deepest level. You may arrange the remaining summations in any order you want — you’ll get the same result for E(Y).

You may ask, why would you ever want to define a joint PMF and go bat-crazy working through all those nested summations? What does E(Y) mean when calculated over an N-dimensional space?

The best way to understand the meaning of expectation in a multi-dimensional space is to illustrate its use on real-world multi-dimensional data.

The data we’ll use comes from a certain boat which, unlike the one I took across the English Channel, tragically did not make it to the other side.

RMS Titanic departing Southampton on April 10, 1912
RMS Titanic departing Southampton on April 10, 1912 (Public domain)

The following figure shows some of the rows in a dataset of 887 passengers aboard the RMS Titanic:

The Pclass column represents the passenger’s cabin-class with integer values of 1, 2, or 3. The Siblings/Spouses Aboard and the Parents/Children Aboard variables are binary (0/1) variables that indicate whether the passenger had any siblings, spouses, parents, or children on board the ship. In statistics, we commonly, and somewhat cruelly, refer to such binary indicator variables as dummy variables. There is nothing block-headed about them to deserve the disparaging moniker.

As you can see from the table, there are 8 variables that jointly identify each passenger in the dataset. Each of these 8 variables is a random variable. The task before us is three-fold:

  1. We’d like to define a joint Probability Mass Function over a subset of these random variables,
  2. using this joint PMF, we’d like to compute the expected value of one of these variables over the multi-dimensional space, and
  3. we’d like to understand how to interpret this expected value.

To simplify things, we’ll ‘bin’ the Age variable into bins of size 5 years and label the bins 5, 10, 15, 20,…,80. For instance, a binned age of 20 means that the passenger’s actual age lies in the (15, 20] years interval. We’ll call the binned random variable Age_Range.

Once Age is binned, we’ll group the data by Pclass and Age_Range. Here are the grouped counts:

Frequency distribution of passengers by their Pclass and (binned) age
Frequency distribution of passengers by their cabin class and (binned) age (Image by Author)

In the above table, the third column contains the number of passengers aboard the Titanic for each cohort (group) that is defined by the characteristics Pclass and Age_Range. For example, there are 3 passengers with ages in the range (0, 5] occupying 1st class cabins. Incidentally, cohort is yet another word (along with asymptotic) that statisticians downright worship. Here’s a tip: every time you want to say ‘group’, just say ‘cohort’. I promise you this, whatever it was that you were planning to blurt out will instantly sound ten times more significant. To illustrate: “Eight different cohorts of alcohol enthusiasts (oenophiles?) were given fake wine to drink and their reactions were recorded.” See what I mean?
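As a concrete sketch, here’s how the binning and grouping might be done in pandas. The handful of rows below are made up for illustration (the real dataset has 887), but the column names match those described above:

```python
import pandas as pd

# A few illustrative rows in the shape of the Titanic dataset
# (values are made up for this sketch).
df = pd.DataFrame({
    'Pclass': [1, 3, 3, 2, 1, 3],
    'Age':    [22.0, 4.0, 26.0, 35.0, 58.0, 19.0],
})

# Bin Age into 5-year intervals labeled 5, 10, ..., 80.
# pd.cut uses right-closed intervals by default, so a label of 20
# means the actual age lies in (15, 20].
df['Age_Range'] = pd.cut(df['Age'],
                         bins=list(range(0, 85, 5)),
                         labels=list(range(5, 85, 5)))

# Group by cabin class and binned age; count passengers per cohort.
counts = (df.groupby(['Pclass', 'Age_Range'], observed=True)
            .size()
            .reset_index(name='Count'))
print(counts)
```

The right-closed intervals of `pd.cut` reproduce the (15, 20] → 20 convention used above.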

To be honest, ‘cohort’ does carry a precise meaning in statistics that ‘group’ simply lacks. Still, it can be instructive to say ‘cohort’ once in a while and watch the respect grow on your listeners’ faces.

At any rate, we’ll add another column to the table of frequencies. This new column will hold the probability of observing that particular combination of Pclass and Age_Range. This probability, P(Pclass, Age_Range), is the ratio of the cohort’s frequency (the count in the third column) to the total number of passengers in the dataset (887).

Frequency and joint probability distribution of passengers by their cabin class and (binned) age
Frequency and joint probability distribution of passengers by their cabin class and (binned) age (Image by Author)

The probability P(Pclass, Age_Range) is the joint Probability Mass Function of the random variables Pclass and Age_Range. It gives us the probability of observing a passenger who is described by a particular combination of Pclass and Age_Range. For example, look at the row where Pclass is 3 and Age_Range is 25. There are 103 such passengers. The corresponding joint probability is 103/887 = 0.116122. That number tells us that about 11.6% of all passengers aboard the Titanic were 20–25 year olds traveling in 3rd class cabins.

As with the one-dimensional PMF, the joint PMF also sums up to a perfect 1.0 when evaluated over all combinations of values of its constituent random variables. If your joint PMF doesn’t sum up to 1.0, you should look closely at how you have defined it. There might be an error in its formula or worse, in the design of your experiment.

In the above dataset, the joint PMF does indeed sum up to 1.0. Feel free to take my word for it!
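If you’d rather not take my word for it, the check is a one-liner once the joint PMF column exists. The cohort counts below are made-up stand-ins for the real grouped table, but the computation is the same:

```python
import pandas as pd

# Made-up cohort counts standing in for the real grouped table.
counts = pd.DataFrame({
    'Pclass':    [1, 1, 3, 3],
    'Age_Range': [25, 30, 25, 30],
    'Count':     [20, 30, 103, 50],
})

# Joint PMF: each cohort's count divided by the grand total.
total = counts['Count'].sum()
counts['P'] = counts['Count'] / total

# A valid joint PMF must sum to 1.0 over all cohorts.
print(counts['P'].sum())
```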

To get a visual feel for what the joint PMF P(Pclass, Age_Range) looks like, you can plot it in 3 dimensions. In the 3-D plot, set the X and Y axes to Pclass and Age_Range respectively, and the Z axis to the probability P(Pclass, Age_Range). What you’ll see is a fascinating 3-D chart.

A 3-D plot of the joint PMF of Pclass and Age_Range
A 3-D plot of the joint PMF of Pclass and Age_Range (Image by Author)

If you look closely at the chart, you’ll notice that the joint PMF consists of three parallel plots, one for each cabin class on the Titanic. The 3-D plot brings out some of the demographics of the humanity aboard the ill-fated ocean-liner. For instance, across all three cabin classes, it’s the 15 to 40 year old passengers that made up the bulk of the population.
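A chart along these lines can be sketched with matplotlib’s 3-D bar plot. The probabilities below are made up for illustration; with the real joint PMF table you’d feed in its three columns instead:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Illustrative (Pclass, Age_Range, probability) triples.
pclass = np.array([1, 1, 2, 2, 3, 3])
age    = np.array([25, 35, 25, 35, 25, 35])
prob   = np.array([0.10, 0.15, 0.20, 0.05, 0.30, 0.20])

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# One vertical bar per (Pclass, Age_Range) combination; height = probability.
ax.bar3d(pclass, age, np.zeros_like(prob), 0.2, 2.0, prob)
ax.set_xlabel('Pclass')
ax.set_ylabel('Age_Range')
ax.set_zlabel('P(Pclass, Age_Range)')
fig.savefig('joint_pmf.png')
```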

Now let’s work on the calculation for E(Age_Range) over this 2-D space. E(Age_Range) is given by:

Expected value of Age_Range
Expected value of Age_Range (Image by Author)

We run the inner sum over all values of Age_Range: 5, 10, 15,…,80, and the outer sum over all values of Pclass: [1, 2, 3]. For each combination of (Pclass, Age_Range), we pick the joint probability from the table. The expected value of Age_Range comes out to 31.48252537 years, which falls in the bin labeled 35. We can expect the ‘average’ passenger on the Titanic to be 30 to 35 years old.
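The nested summation translates directly into two loops, with the Age_Range sum at the deepest level. The joint PMF below is a made-up, much smaller table than the real Titanic one (though, like any valid joint PMF, it sums to 1.0), so the result here is close to 29.0 rather than 31.48:

```python
# A made-up joint PMF P(Pclass, Age_Range), keyed by (pclass, age_range).
joint_pmf = {
    (1, 25): 0.10, (1, 35): 0.15,
    (2, 25): 0.20, (2, 35): 0.05,
    (3, 25): 0.30, (3, 35): 0.20,
}

pclasses = [1, 2, 3]
age_ranges = [25, 35]

e_age = 0.0
for pclass in pclasses:        # outer sum: over Pclass
    for age in age_ranges:     # inner sum: over Age_Range, the variable of interest
        e_age += age * joint_pmf[(pclass, age)]

print(e_age)  # ≈ 29.0 for this toy PMF
```

Swapping the loop order leaves the result unchanged, just as the formula promises.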

If you take the mean of the Age_Range column in the Titanic dataset, you’ll arrive at exactly the same value: 31.48252537 years. So why not just take the average of the Age_Range column to get E(Age_Range)? Why build a Rube Goldberg machine of nested summations over an N-dimensional space only to arrive at the same value?

It’s because in some situations, all you’ll have is the joint PMF and the ranges of the random variables. In this instance, if you had only P(Pclass, Age_Range), and you knew that Pclass ranges over [1, 2, 3] and Age_Range over [5, 10, 15, 20,…,80], you could still use the nested summations technique to calculate E(Pclass) or E(Age_Range).

If the random variables are continuous, the expected value over a multi-dimensional space can be found using a multiple integral. For instance, if X, Y, and Z are continuous random variables and f(X,Y,Z) is the joint Probability Density Function defined over the 3-dimensional continuous space of tuples (x, y, z), the expected value of Y over this 3-D space is given in the following figure:

Expected value of the continuous random variable Y defined over a continuous 3-D space
Expected value of the continuous random variable Y defined over a continuous 3-D space (Image by Author)

Just as in the discrete case, you integrate first over the variable whose expected value you want to calculate, and then integrate over the rest of the variables.
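Numerically, the triple integral can be sketched with SciPy. As a toy example, take the joint density to be uniform on the unit cube, f(x, y, z) = 1, for which E(Y) works out to 0.5:

```python
from scipy import integrate

# Toy joint PDF: uniform on the unit cube [0,1]^3, so f(x, y, z) = 1.
def integrand(z, y, x):   # tplquad passes arguments innermost-first
    f_xyz = 1.0
    return y * f_xyz      # y times the joint density

# E(Y) = triple integral of y * f(x,y,z) dz dy dx over the unit cube.
e_y, abs_err = integrate.tplquad(integrand, 0, 1, 0, 1, 0, 1)
print(e_y)  # ≈ 0.5
```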

Expectation at the quantum scale

A famous example demonstrating the application of the multiple-integral method for computing expected values exists at a scale that is too small for the human eye to perceive. I am referring to the wave function of quantum mechanics. The wave function is denoted as Ψ(x, y, z, t) in Cartesian coordinates or as Ψ(r, θ, ɸ, t) in polar coordinates. Ψ is pronounced Psi. It’s used to describe the properties of seriously tiny things that enjoy living in really, really cramped spaces, like electrons in an atom. The wave function Ψ returns a complex number of the form A + jB, where A represents the real part and B represents the imaginary part.

We can interpret the square of the absolute value of Ψ as a joint probability density function defined over the 4-dimensional space described by the tuple (x, y, z, t) or (r, θ, ɸ, t). Specifically, for an electron in a Hydrogen atom, |Ψ|² dV is approximately the probability of finding the electron in an infinitesimally tiny volume dV around (x, y, z), or around (r, θ, ɸ), at time t. Knowing |Ψ|², we can run a triple integral over x, y, and z to calculate the expected location of the electron along the X, Y, or Z axis (or their polar equivalents) at any given time t.

Parting thoughts

I began this chapter with my experience of seasickness. And I wouldn’t blame you if you winced at the brash use of a Bernoulli random variable to model what is a remarkably complex and somewhat poorly understood human ordeal. My objective was to illustrate how expectation affects us, literally, at a biological level. One way to explain that ordeal was to use the cool and comforting language of random variables.

Starting with the deceptively simple Bernoulli variable, we swept our illustrative brush across the statistical canvas all the way to the magnificent, multi-dimensional complexity of the quantum wave function. Throughout, we sought to understand how expectation operates on discrete and continuous scales, in single and multiple dimensions, and at microscopic scales.

Conditional probability and conditional estimation

There is one more area in which expectation makes an immense impact. That area is conditional probability in which one calculates the probability that a random variable X will take a value ‘x’ assuming that certain other random variables A, B, C, etc. have already taken values ‘a’, ‘b’, ‘c’. The probability of X conditioned upon A, B, and C is denoted as P(X=x|A=a,B=b,C=c) or simply as P(X|A,B,C).

In all the formulae for expectation that we have seen, if you replace the probability (or probability density) term with its conditional version, what you get are the corresponding formulae for conditional expectation, denoted as E(X|A=a,B=b,C=c). Conditional expectation lies at the heart of regression analysis and estimation. And that’s fodder for future chapters!
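As a small taste of what’s to come, here’s that substitution in code: starting from a made-up joint PMF over (Pclass, Age_Range), replace the joint probability with the conditional probability P(Age_Range | Pclass=1) to get E(Age_Range | Pclass=1):

```python
# A made-up joint PMF over (Pclass, Age_Range); it sums to 1.0.
joint_pmf = {
    (1, 25): 0.10, (1, 35): 0.15,
    (2, 25): 0.20, (2, 35): 0.05,
    (3, 25): 0.30, (3, 35): 0.20,
}

# P(Pclass = 1): marginalize the joint PMF over Age_Range.
p_class1 = sum(p for (pc, _), p in joint_pmf.items() if pc == 1)

# E(Age_Range | Pclass = 1): the usual expectation formula, but with the
# conditional probability P(age | Pclass=1) = P(1, age) / P(Pclass=1).
e_age_given_1 = sum(age * (p / p_class1)
                    for (pc, age), p in joint_pmf.items() if pc == 1)

print(e_age_given_1)  # ≈ 31.0 for this toy PMF
```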

Citations and Copyrights

Dataset

The automobile dataset is downloaded from the UC Irvine Machine Learning Repository under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The Titanic dataset is downloaded from Kaggle under the CC0 license.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.


PREVIOUS: Understanding The Central Limit Theorem


NEXT: Correlation and Auto-correlation – An Easy Primer on 2 Key Concepts


UP: Table of Contents