An exploration of the Weak Law of Large Numbers and the Central Limit Theorem through the long lens of history.
In the previous chapter, I introduced you to the Central Limit Theorem. We took apart its definition, looked at its applications, and watched it do its magic in a simulation.
I ended that chapter with a philosophical question:
“How do you determine the true probability of an event?”
This question – posed by a famous 17th century mathematician – was to lead to the discovery of the Central Limit Theorem more than a century later.
In this chapter, I’ll dig into this question, into the life of the mathematician who pondered it, and into the big discovery that unfolded from it.
The discovery of the Weak Law of Large Numbers
It all started with Jacob Bernoulli. Sometime around 1687, the 32-year-old first-born son of the large Bernoulli family of Basel, in present-day Switzerland, started working on the 4th and final part of his magnum opus, Ars Conjectandi (The Art of Conjecturing). In the 4th part, Bernoulli focused on probability and its use in “Civilibus, Moralibus & Oeconomicis” (civil, moral and economic) affairs.

In Part 4 of Ars Conjectandi, Bernoulli posed the following question: How do you determine the true probability of an event in situations where the sample space isn’t fully accessible? He illustrated his question with a thought experiment. When stated in modern terms, it goes like this:
Imagine an urn filled with r black tickets and s white tickets. You don’t know the values of r and s. Thus, you don’t know the ‘true’ probability p=r/(r+s) of drawing a black ticket in a single random trial.
Let X represent a random variable defined over the sample space {white ticket, black ticket}. Let the range of X be the set {0, 1}. Thus, X maps the sample space {white ticket, black ticket} to the values {0, 1}.
We’ll assign the following physical meaning to this mapping:
If a ticket selected at random (with replacement) from the urn turns out to be a black ticket, X takes the value 1. Otherwise, X takes the value 0.
Let X1, X2, X3,…,Xn represent the random variables corresponding to different independent ticket draws from the urn. Each Xi represents the outcome of a single draw from the urn. Similar to X, each Xi maps the sample space {white ticket, black ticket} to the set {0, 1}. In other words, Xi takes the value of either 0 (when a white ticket is drawn) or 1 (when a black ticket is drawn).
Thus, the sum X1 + X2 + X3 +…+ Xn is the count of black tickets in a random sample of size n.
Let’s define another random variable X_barn (X with bar on it, and with a subscript ‘n’) to hold the value of this count.
X_barn can also be interpreted as the number of successes (i.e. the number of black tickets) found in n i.i.d. trials. Therefore, X_barn is Binomially distributed. We represent this as follows:
X_barn ~ Binomial(n,p)
Where ‘p’ is the true (and unknown) probability that a ticket drawn at random from the urn is a black ticket.
Keep in mind that in this chapter X_barn is the count or sum, not the mean, of n i.i.d. random variables. Thus:
- X_barn/n is the proportion of black tickets that you have observed, and
- The absolute difference between X_barn/n and p, |X_barn/n — p|, is the unknown error in your estimate of the real, unknown ratio p.
What Bernoulli theorized was that as the sample size n becomes very large, the odds of the error |X_barn/n — p| being smaller than any arbitrarily small positive number ϵ of your choice become incredibly, unfathomably large.
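Before going further, here is a minimal simulation sketch of the urn experiment (Python, with NumPy assumed; the urn composition, the threshold ϵ, and every number below are illustrative assumptions, not values from the text). It estimates, for increasing sample sizes n, how often the observed proportion of black tickets lands within ϵ of the true p, and the corresponding odds:

```python
# A minimal simulation sketch of Bernoulli's urn (NumPy assumed).
# The urn composition (r black, s white tickets) and the threshold eps
# are purely illustrative choices, not values from the text.
import numpy as np

rng = np.random.default_rng(42)
r, s = 3_000, 7_000          # hypothetical ticket counts
p = r / (r + s)              # the 'true' (normally unknown) probability of a black ticket
eps = 0.01                   # error threshold of our choice
trials = 100_000             # Monte Carlo repetitions per sample size

for n in (100, 1_000, 10_000, 100_000):
    # Count black tickets in each of 'trials' samples of size n (drawn with replacement).
    black_counts = rng.binomial(n, p, size=trials)
    # Fraction of samples whose observed proportion lands within eps of p.
    within = np.mean(np.abs(black_counts / n - p) <= eps)
    odds = within / (1 - within) if within < 1 else float("inf")
    print(f"n = {n:>7}: P(|X_bar_n/n - p| <= {eps}) ~ {within:.4f}, odds ~ {odds:,.1f}")
```

As n grows, the odds of landing within ϵ of p explode, which is exactly Bernoulli's claim.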
This is known as Bernoulli’s theorem in probability (not to be confused with Bernoulli’s principle in fluid dynamics, discovered by Daniel Bernoulli, who was Jacob Bernoulli’s nephew).
Shaped into an equation, Bernoulli’s theorem in probability can be expressed as follows:
P(|X_barn/n — p| ≤ ϵ) = c · P(|X_barn/n — p| > ϵ)
In the above equation:
- The probability P(|X_barn/n — p| ≤ ϵ) is the probability of the estimation error being at most ϵ.
- P(|X_barn/n — p| > ϵ) is the probability of the estimation error being greater than ϵ.
- The constant ‘c’ is some seriously large positive number.
Some texts replace the equals sign with a ‘≥’ or a simple ‘>’.
A little bit of algebraic manipulation yields the following three alternate forms of Bernoulli’s theorem:
P(|X_barn/n — p| ≤ ϵ) = c/(c + 1)

P(np — nϵ ≤ X_barn ≤ np + nϵ) = c/(c + 1)

P(np — δ ≤ X_barn ≤ np + δ) = 1 — α
The first equation comes from expressing the probability P(|X_barn/n — p| > ϵ) on the R.H.S. of Bernoulli’s theorem as 1 — P(|X_barn/n — p| ≤ ϵ) and then solving for the probability term on the L.H.S.
The second equation comes from expressing the inequality |X_barn/n — p| ≤ ϵ inside the probability term as -ϵ ≤ X_barn/n — p ≤ +ϵ, then multiplying the entire inequality by n, and finally adding np to all three terms.
The third equation comes from making simple variable substitutions in the second equation: δ = nϵ and (1 — α) = c/(c + 1).
Did you notice how similar the third equation looks to the modern-day definition of a confidence interval?
Well, don’t let yourself be deceived by the similarity. Look closely at the center term of that equation: it’s the sample sum X_barn, the quantity you can observe. It is not the unknown population mean μ.
The third equation gives a (1 — α) probability interval for the known sample mean (or sum) X_barn, whereas the modern-day confidence interval is defined for the unknown population mean (or sum) μ.
Despite the teasing similarity of the third equation to the confidence interval of the mean as we know it today, in the late 1600s Jacob Bernoulli was phenomenally far away from giving us the formula for the confidence interval of the unknown population mean (or sum) μ.
Bernoulli’s big discovery, stated in Bernoulli’s theorem in probability, eventually came to be known as the Weak Law of Large Numbers.
Bernoulli was well aware that what he was stating was, in a colloquial sense, already woven into the common-sense thinking of his time. He expressed this in his characteristically vivid style in Ars Conjectandi:
“…even the most stupid person, all by himself and without any preliminary instruction, being guided by some natural instinct (which is extremely miraculous) feels sure that the more such observations are taken into account, the less is the danger of straying from the goal.”
The “goal” that Bernoulli refers to is that of being “morally certain” that the observed ratio of black tickets to total tickets in a random sample approaches the unknown true ratio. In Ars Conjectandi, Bernoulli defines “moral certainty” as “that whose probability is almost equal to complete certainty so that the difference is insensible”.
It’s possible to be somewhat precise about what is meant by moral certainty if you state it as follows:
For any error threshold ϵ > 0, no matter how tiny it might be, there always exists some really large sample size ‘n0‘ such that as long as your sample’s size ‘n’ exceeds ‘n0‘:
P(|(X_barn/n) — p| ≤ ϵ) = 1.0 for all practical purposes.
Jacob Bernoulli’s singular breakthrough on the Law of Large Numbers was to take the common sense intuition about how nature works and mold it into the exactness of a mathematical statement. In that respect Bernoulli’s thoughts on probability were deeply philosophical for his era. He wasn’t simply seeking a solution for a practical problem. Bernoulli was, to borrow a phrase from Richard Feynman, probing the very “character of physical law”.
Over the next two and a half centuries, a long parade of mathematicians chiseled away at Bernoulli’s 1689 theorem to shape it into the modern form we recognize so well. Many improvements were made to it. The theorem was freed from the somewhat suffocating straitjacket of Bernoulli’s binomial thought experiment of urns and tickets. The constraints of identical distribution, and even of independence of the random variables that make up the random sample, were eventually relaxed. And the proof of the Weak Law of Large Numbers (WLLN, as it’s often called) was greatly simplified using Markov’s and Chebyshev’s inequalities.
Today, the WLLN says simply the following:
Let X1, X2, X3,…,Xn be i.i.d. random variables forming a sample of size n, and let X_barn be their sample mean. Assume that the sample is drawn randomly (with replacement) from a population with an unknown mean μ. Then, for any positive number ε, no matter how tiny your choice of ε, the probability of the error |X_barn — μ| being at most ε approaches absolute certainty as you progressively dial up the sample size:

lim (n → ∞) P(|X_barn — μ| ≤ ε) = 1,  for every ε > 0
The WLLN uses the concept of convergence in probability.
In terms of convergence in probability, the above equation states that the sample mean (or sum), X_barn, converges in probability to the real population mean (or sum), μ. This fact can be succinctly stated as follows:
X_barn →p μ  as n → ∞   (where ‘→p’ denotes convergence in probability)
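The modern statement isn’t tied to urns and tickets at all. Here is a small sketch (Python, with NumPy assumed; the exponential population and every number in it are illustrative assumptions) that watches the sample mean of exponentially distributed data converge in probability to its population mean:

```python
# Convergence in probability of the sample mean: an illustrative sketch.
# Population: exponential with mean mu = 2.0 (an arbitrary, assumed choice).
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 2.0, 0.05, 2_000

for n in (10, 100, 1_000, 10_000):
    # Draw 'trials' samples of size n and compute each sample's mean.
    samples = rng.exponential(scale=mu, size=(trials, n))
    sample_means = samples.mean(axis=1)
    # Fraction of sample means that land within eps of the population mean.
    prob_within = np.mean(np.abs(sample_means - mu) <= eps)
    print(f"n = {n:>6}: P(|X_bar_n - mu| <= {eps}) ~ {prob_within:.4f}")
```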
The Weak Law of Large Numbers’ connection to the Central Limit Theorem
Let’s recall what the Central Limit Theorem says:
The standardized sum or mean of a sample of i.i.d. random variables converges in distribution to N(0,1).
Let’s dig into this definition a little bit to see its connection with the Weak Law of Large Numbers.
Assume X1, X2, X3, …,Xn represents a random sample of size n drawn from a population with mean μ and a finite, positive variance σ². Let X_barn be the sample mean or sample sum.
Let’s define a random variable Zn as the standardized version of X_barn. Zn can be calculated as follows:
Zn = (X_barn — μ) / (σ/√n)   (when X_barn is the sample mean)

Zn = (X_barn — nμ) / (σ√n)   (when X_barn is the sample sum)
According to the Central Limit Theorem, as the sample size n tends to infinity, the Cumulative Distribution Function (CDF) of Zn converges to the CDF of the standard normal random variable N(0,1). In other words, Zn converges in distribution to N(0,1):

lim (n → ∞) P(Zn ≤ z) = Φ(z)  for every real z, where Φ is the CDF of N(0, 1). In short, Zn →d N(0, 1).
Now, as per the WLLN, as the sample size n tends to infinity, the sample mean X_barn converges in probability to the population mean μ. Equivalently, the centered (but un-standardized) difference X_barn — μ converges in probability to zero:

X_barn →p μ,  i.e.,  (X_barn — μ) →p 0  as n → ∞

The Weak Law of Large Numbers guarantees a probabilistic convergence of the sample mean to the population mean μ.
The Central Limit Theorem guarantees a distributional convergence of the CDF of the standardized sample mean to the CDF of the standard normal random variable, N(0, 1). The two statements coexist: the WLLN says the fluctuations of X_barn around μ die out, while the CLT magnifies those fluctuations by a factor of √n and describes the bell-curve shape they settle into.
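Here is a small sketch (Python, with NumPy and SciPy assumed; the exponential population and all numbers are illustrative assumptions) that shows the two convergences side by side:

```python
# A sketch contrasting the two modes of convergence (NumPy and SciPy assumed).
# Population: exponential with mean mu = 2 and sigma = 2 (illustrative choices).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu = sigma = 2.0
trials = 5_000

for n in (20, 200, 2_000):
    samples = rng.exponential(scale=mu, size=(trials, n))
    x_bar = samples.mean(axis=1)                  # sample means
    z = (x_bar - mu) / (sigma / np.sqrt(n))       # standardized sample means

    # WLLN: the probability that X_bar_n stays within 0.05 of mu rises toward 1 ...
    p_close = np.mean(np.abs(x_bar - mu) <= 0.05)
    # ... while the CLT says Z_n settles into a stable N(0, 1) shape.
    ks_distance = stats.kstest(z, "norm").statistic
    print(f"n = {n:>5}: P(|X_bar_n - mu| <= 0.05) ~ {p_close:.3f} | "
          f"KS distance of Z_n from N(0,1): {ks_distance:.3f}")
```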
Two big weaknesses of the Weak Law of Large Numbers
The path from the discovery of the Weak Law of Large Numbers to the discovery of the Central Limit Theorem turned out to be full of thorny brambles that took Jacob Bernoulli’s successors several decades to hack through. Look once again at the equation at the heart of Bernoulli’s theorem:
P(np — δ ≤ X_barn ≤ np + δ) = c/(c + 1) = 1 — α
Let’s focus our attention on the probability P(np — δ ≤ X_barn ≤ np + δ).
Recall that X_barn is the count of black tickets in a random sample. The probability distribution of X_barn, conditioned on the sample size n and the true proportion of black tickets p, is denoted as P(X_barn = i | n, p). If you know P(X_barn = i | n, p), you should be able to calculate the probability P(np — δ ≤ X_barn ≤ np + δ), which is the probability of X_barn lying between the extents i = np — δ and i = np + δ. We’ll see how to do that in a minute.
Meanwhile, the question is: what is the probability distribution P(X_barn = i | n, p)?
Bernoulli chose to frame his investigations within a Binomial setting. Bernoulli’s ticket-filled urn is the sample space for what is a binomial experiment. Therefore, X_barn is Binomial(n, p) distributed.
The binomial probability, P(X_barn = i | n, p), or in short P(X_barn = i), is nCi · p^i · (1 — p)^(n — i).
In the above formula, nCi is pronounced ‘n choose i’. It’s the number of ways in which i objects can be chosen from n objects.
Now that you know the formula for P(X_barn = i), you can calculate P(np — δ ≤ X_barn ≤ np + δ) as follows:
P(np — δ ≤ X_barn ≤ np + δ) = Σ nCi · p^i · (1 — p)^(n — i), where the sum runs over all integers i from np — δ to np + δ.
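Here is what that calculation might look like in code: a minimal sketch assuming SciPy is available, with n, p and δ as made-up illustrative values. Note that it works only because we pretend to know p.

```python
# Probability that the black-ticket count lands within np +/- delta, computed
# from the binomial distribution. Possible only when the true p is known.
import numpy as np
from scipy.stats import binom

n, p, delta = 10_000, 0.3, 50        # illustrative, assumed values
lo = int(np.ceil(n * p - delta))     # smallest count inside the band
hi = int(np.floor(n * p + delta))    # largest count inside the band

# Sum of binomial probabilities from lo to hi, via a difference of CDFs.
prob = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
print(f"P({lo} <= X_bar_n <= {hi} | n={n}, p={p}) = {prob:.4f}")
```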
There is one little problem with the above approach. You can calculate P(X_barn=i) only if you know the proportion p of black tickets in the urn.
But how will you ever know this proportion without knowing the number of black and white tickets? Jacob Bernoulli, with his Calvinist leanings, and Abraham de Moivre, whom we’ll meet in a subsequent chapter, seemed to believe that a divine being might know the true ratio p.
In their writings on probability, both scientists made unmistakable references to Fatalism and ORIGINAL DESIGN. Bernoulli brought up Fatalism in the final paragraph of Ars Conjectandi. De Moivre mentioned ORIGINAL DESIGN (in capitals!) in his book on probability, The Doctrine of Chances. Neither man made a secret of his belief that population values of probabilities such as p actually exist, and that they come about via a Creator’s intention or ORIGINAL DESIGN. Furthermore, physical laws such as the Law of Large Numbers are a way for mortals to estimate these true values.
At any rate, none of this theology helps you or me in a practical setting. Almost never will you know the true value of pretty much any property of any non-trivial system in any part of the universe.
If, by an unusually freaky stroke of good fortune, you were to stumble upon the true value of some parameter, such as the true proportion of black tickets in Bernoulli’s urn, then case closed, right?
When you have a God’s Eye view of the data, why waste your time drawing random samples to estimate what you already know? To paraphrase another famous scientist, God has no use for statistical inference.
Down here on Earth, all you have is a random sample, its mean or sum X_barn, and its variance S². Using them, you’ll want to draw inferences about the population. For example, you’ll want to build a (1 — α)100% confidence interval around the unknown population mean μ.
In short, in a practical setting you are more interested in estimating the unknown population mean and population variance than in the sample mean or sum itself. In other words, you won’t have as much use for the probability of the sample mean or sum X_barn lying between two limits, namely the following probability:
P(np — δ ≤ X_barn ≤ np + δ)
as you will for the unknown population mean ‘np’ lying between two limits, given a single observed value of the sample mean or sum, namely the following probability:
P(X_barn— δ ≤ np ≤ X_barn+δ).
Notice how subtle but crucial is the difference between the two probabilities.
The probability P(X_barn— δ ≤ np ≤ X_barn+δ) can be expressed as a difference of two cumulative probabilities:
P(X_barn — δ ≤ np ≤ X_barn + δ) = P(p ≤ (X_barn + δ)/n | X_barn, n) — P(p ≤ (X_barn — δ)/n | X_barn, n)
To estimate the two cumulative probabilities, you’ll need a way to estimate the probability P(p|X_barn,n).
The probability P(p|X_barn,n) is the exact inverse of the binomial probability P(X_barn|n,p) = nCi · p^i · (1 — p)^(n — i).
In all his dealings with the law of large numbers, Jacob Bernoulli worked with this binomial probability.
With P(X_barn|n,p), you are asking the question:
Given that p is the proportion of black tickets in the urn, what is the probability of observing X_barn black tickets in a random sample of size n drawn from the urn?
With the inverse binomial probability P(p|X_barn,n), you are asking a fundamentally different question:
Given that you have observed X_barn number of black tickets in a random sample of size n, what is the probability density function of the unknown true proportion p of black tickets in the urn?
The path to the Central Limit Theorem’s discovery runs straight through a mechanism to compute this inverse binomial probability — a mechanism that an English Presbyterian minister named Thomas Bayes (of Bayes’ Theorem fame) and the ‘Isaac Newton of France’, Pierre-Simon Laplace, were to independently discover in the late 1700s to early 1800s using two strikingly different approaches.
The way to understand inverse probability is to look at the true fraction of black tickets p as the cause that is ‘causing’ the effect of observing X_barn number of black tickets in a randomly selected sample of n tickets.
Notice how different this way of perceiving reality is from saying that there is a true unknown ‘p’ that comes from – in de Moivre’s words – the Creator’s ORIGINAL DESIGN, and that X_barn/n is simply an estimate of this true p.
In the world of inverse probability, there are an infinite number of possible values for p for each observed value of X_barn. Attached to each of these values of p is a probability density. Thus, the inverse binomial probability P(p|X_barn,n) is actually the Probability Density Function (PDF) of p conditioned upon the sample size n, and the observed sample mean or count X_barn.
Earlier, we saw how, knowing the ‘forward’ binomial probability distribution P(X_barn|n,p) = nCi · p^i · (1 — p)^(n — i), you can calculate the probability of X_barn lying between the bounds np — δ and np + δ.
Similarly, knowing the inverse probability density function P(p|X_barn,n), you can calculate the probability that the unknown p will lie within some bounds p_low and p_high, namely P(p_low ≤ p ≤ p_high). Since p is unknown, this probability has great practical use.
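To give a feel for what the inverse density makes possible, here is a preview sketch (Python, with SciPy assumed). It uses a uniform prior on p, the device that Bayes and Laplace would later justify, under which the density of p, given k observed black tickets in n draws, is the Beta(k + 1, n - k + 1) density. All numbers are illustrative assumptions.

```python
# A preview of inverse probability: the density of the unknown p, given the
# observed black-ticket count. With a uniform prior on p (the assumption Bayes
# and Laplace later worked with), P(p | X_bar_n = k, n) is Beta(k+1, n-k+1).
from scipy.stats import beta

n, k = 1_000, 312                        # illustrative sample size and black-ticket count
posterior = beta(k + 1, n - k + 1)       # density of p given the observed data

p_low, p_high = 0.28, 0.34               # illustrative bounds
prob = posterior.cdf(p_high) - posterior.cdf(p_low)
print(f"P({p_low} <= p <= {p_high} | X_bar_n={k}, n={n}) = {prob:.4f}")
```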
Unfortunately, Jacob Bernoulli’s theorem isn’t expressed in terms of the inverse probability density function P(p|X_barn,n). Instead, it’s expressed in terms of the ‘forward’ binomial probability distribution P(X_barn|n,p), which requires you to know the true ratio p.
Having come as far as stating and proving the Weak Law of Large Numbers in terms of the ‘forward’ binomial probability P(X_barn|n,p), you’d think Bernoulli would take the natural next step, which is to invert the statement of his theorem and show future generations of mathematicians how to calculate the inverse binomial probability density function P(p|X_barn,n).
But Bernoulli did no such thing, choosing instead to mysteriously bring the whole of Ars Conjectandi to a sudden, unexpected end with a rueful sounding paragraph on, you guessed it, Fatalism:
“…if eventually the observations of all should be continued through all eternity (from probability turning to perfect certainty), everything in the world would be determined to happen by certain reasons and by the law of changes. And so even in the most casual and fortuitous things we are obliged to acknowledge a certain necessity, and if I may say so, fate,…”

There is another practical problem with Bernoulli’s treatment of binomial probabilities, one which Bernoulli himself ran into.
Look at the summations on the R.H.S. of the following equation:
P(np — δ ≤ X_barn ≤ np + δ) = Σ(i = 0 to np + δ) nCi · p^i · (1 — p)^(n — i) — Σ(i = 0 to np — δ — 1) nCi · p^i · (1 — p)^(n — i)
They contain big, bulky factorials that are all but impossible to crank out for large n. In fact, everything about Bernoulli’s theorem is about large n. And the calculations must have become especially tedious in the year 1689 under the unsteady, dancing glow of grease lamps and using nothing more than paper and pen.
In Part 4, Bernoulli did a few of these calculations, particularly to work out the minimum sample sizes required to achieve different degrees of accuracy. But he left the matter there.
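With modern software, the kind of minimum-sample-size calculation that Bernoulli slogged through by hand takes a few lines. The sketch below (SciPy assumed; the values of p, ϵ and the certainty level are illustrative assumptions) searches for the smallest n, to the nearest 100, at which the binomial probability reaches a chosen level of ‘moral certainty’:

```python
# Smallest sample size n (to the nearest 100) at which
# P(|X_bar_n/n - p| <= eps) reaches a chosen level of 'moral certainty'.
import numpy as np
from scipy.stats import binom

p, eps, certainty = 0.6, 0.02, 0.999     # illustrative, assumed values

def prob_within(n: int) -> float:
    """P(np - n*eps <= X_bar_n <= np + n*eps) for a Binomial(n, p) count."""
    lo = int(np.ceil(n * (p - eps)))
    hi = int(np.floor(n * (p + eps)))
    return binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)

n = 100
while prob_within(n) < certainty:
    n += 100
print(f"Smallest n (to the nearest 100) with certainty >= {certainty}: n = {n}")
```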

Bernoulli neither showed how to approximate the factorial (a technique that was to be discovered four decades later by Abraham De Moivre and James Stirling, in that order), nor made the crucial, conceptual leap of showing how to attack the problem of inverse probability.
Jacob Bernoulli’s program of inquiry into probability’s role in different aspects of social, moral and economic affairs was, to put it lightly, ambitious even by the standards of the current era. To illustrate: at one point in Part 4 of Ars Conjectandi, Bernoulli ventures so far as to confidently define human happiness in terms of probabilities of events.
Death, and a path to the Central Limit Theorem
During the final two years of his life, a sick and enfeebled Jacob Bernoulli corresponded with Gottfried Leibniz, the co-inventor (or the primary inventor, depending on which side of the Newton-Leibniz controversy your sympathies lie) of differential and integral calculus. In his correspondence, Bernoulli complained about his struggles in completing his book and lamented how his laziness and failing health were getting in the way.
Sixteen years after starting work on Part 4, in the summer of 1705, a relatively young and possibly dispirited Jacob Bernoulli succumbed to tuberculosis, leaving both Part 4 and Ars Conjectandi essentially unfinished.
Since Jacob’s children weren’t mathematically inclined, the task of publishing his unfinished work ought to have fallen into the capable hands of his younger brother Johann, also a prominent mathematician. Unfortunately, for a good fraction of their professional lives, the two Bernoulli brothers had bickered and quarreled, often bitterly and publicly, and often in the way only first-rate scholars can: through the pages of eminent mathematics journals. At any rate, by Jacob’s death in 1705 they were barely on speaking terms.
The publication of Ars Conjectandi eventually landed upon the reluctant shoulders of Nicolaus Bernoulli (1687–1759), who was both Jacob’s and Johann’s nephew and an accomplished mathematician in his own right. At one point, Nicolaus asked Abraham De Moivre in England if he was interested in completing Jacob’s program on probability. De Moivre politely refused and, curiously, chose to go on record with his refusal.
Finally in 1713, eight years after his uncle Jacob’s death, and more than two decades after his uncle’s pen rested for the final time on the pages of Ars Conjectandi, Nicolaus published Jacob’s work in pretty much the same state that Jacob left it.

Just in case you are wondering, Jacob Bernoulli’s family tree was packed to bursting with math and physics geniuses. One would be hard-pressed to find a family tree as densely adorned with scientific talent as the Bernoullis’. Perhaps the closest contenders are the Curies, of Marie and Pierre Curie fame.
Now get this:
Pierre Curie was a great-great-great-great-great grandson of Jacob Bernoulli’s younger brother Johann.
Ars Conjectandi had fallen short of addressing the basic needs of statistical inference even for the limited case of binomial processes. But Jacob Bernoulli had sown the right kinds of seeds in the minds of his fellow academics.
Jacob’s contemporaries, particularly his nephew Nicolaus Bernoulli (1687–1759), and two French mathematicians Pierre Remond de Montmort (1678–1719), and our friend Abraham de Moivre (1667–1754), knew the general direction in which to take Bernoulli’s work to make it useful. In the decades following Bernoulli’s death, all three mathematicians made progress. And in 1733, de Moivre finally broke through with one of the finest discoveries in mathematics – the discovery of the Normal Curve.
Join me in the next chapter, in which I’ll cover De Moivre’s theorem, the birth of the Normal curve, and how it was to inspire the solution to inverse probability and lead to the discovery of the Central Limit Theorem. Stay tuned.
References and Copyrights
Books and Papers
Bernoulli, Jakob: On the Law of Large Numbers, Part Four of Ars Conjectandi (English translation), translated by Oscar Sheynin, Berlin: NG Verlag, ISBN 978-3-938417-14-0 (2005) [1713]
Seneta, E.: A Tricentenary history of the Law of Large Numbers, Bernoulli, Vol. 19, No. 4, pp 1088–1121 (2013), https://doi.org/10.3150/12-BEJSP12
Fischer, H.: A History of the Central Limit Theorem. From Classical to Modern Probability Theory, Springer Science & Business Media (2010)
Shafer, G.: The significance of Jacob Bernoulli’s Ars Conjectandi for the philosophy of probability today, Journal of Econometrics, Vol. 75, No. 1, pp 15–32, ISSN 0304-4076 (1996), https://doi.org/10.1016/0304-4076(95)01766-6
Polasek, W.: The Bernoullis and the origin of probability theory: Looking back after 300 years, Resonance, Vol. 5, pp 26–42 (2000), https://doi.org/10.1007/BF02837935
Stigler, S. M.: The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press (1986)
Todhunter, I.: A History of the Mathematical Theory of Probability: From the Time of Pascal to That of Laplace, Macmillan and Co. (1865)
Hald, A.: A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935, Springer (2007)
Images and Videos
All images and videos in this chapter are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image or video.
PREVIOUS: Understanding the Central Limit Theorem
NEXT: The Science Of Statistical Expectation
