It would be several years into his new life in England before a middle-aged de Moivre took a real and abiding interest in Jacob Bernoulli's work on the law of large numbers. To see where that interest led, let's revisit Bernoulli's theorem and the thought experiment that led Bernoulli to its discovery.
In Ars Conjectandi (The Art of Conjecturing), Bernoulli had imagined a large urn containing r black tickets and s white tickets. Both r and s are unknown to you, and so is the true fraction p = r/(r+s) of black tickets in the urn. Now suppose you draw n tickets from the urn at random with replacement, and your random sample contains x_bar_n black tickets. Here, x_bar_n is the sum of n i.i.d. random variables. Thus, x_bar_n/n is the proportion of black tickets you observe. In essence, x_bar_n/n is your estimate of the true value of p.
The number of black tickets x_bar_n found in a random sample of black and white tickets has the well-known binomial distribution. That is:
x_bar_n ~ Binomial(n, p)
where n is the sample size and p = r/(r+s) is the true probability that a single ticket is black. Of course, you don't know p, since in Bernoulli's experiment you don't know the number of black (r) or white (s) tickets.
Since x_bar_n is binomially distributed, its expected value is E(x_bar_n) = np and its variance is Var(x_bar_n) = np(1 − p). Again, since p is unknown, both the mean and variance of x_bar_n are also unknown.
You also don't know how far your estimate x_bar_n/n lies from the actual value of p. This distance is the estimation error |x_bar_n/n − p|.
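Bernoulli's thought experiment is easy to simulate. The sketch below assumes a hypothetical urn with r = 300 black and s = 100 white tickets; these counts are illustrative choices, not values from Bernoulli's text (his r and s are unknown by design):

```python
import random

# Hypothetical urn: r black and s white tickets. These counts are
# illustrative assumptions; Bernoulli's r and s are unknown by design.
r, s = 300, 100
p = r / (r + s)          # true fraction of black tickets: 0.75

random.seed(42)
n = 10_000               # sample size
# Draw n tickets at random with replacement; 1 = black, 0 = white.
draws = [1 if random.random() < p else 0 for _ in range(n)]
x_bar_n = sum(draws)     # number of black tickets in the sample

estimate = x_bar_n / n   # the observed proportion, our estimate of p
error = abs(estimate - p)
print(f"estimate = {estimate:.4f}, true p = {p}, error = {error:.4f}")
```

With ten thousand draws, the observed proportion lands very close to the true p, which is exactly the phenomenon Bernoulli set out to quantify.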
Bernoulli's great discovery was to demonstrate that as the sample size n becomes very large, the probability of the error |x_bar_n/n − p| being smaller than any arbitrarily small positive number ϵ of your choice becomes overwhelmingly large relative to the probability of the error exceeding ϵ. As an equation, his discovery can be expressed as follows:

P(|x_bar_n/n − p| ≤ ϵ) > c · P(|x_bar_n/n − p| > ϵ), for sufficiently large n
The previous equation is the weak law of large numbers (WLLN). In the previous equation:
P(|x_bar_n/n − p| ≤ ϵ) is the probability that the estimation error is at most ϵ.
P(|x_bar_n/n − p| > ϵ) is the probability that the estimation error is greater than ϵ.
c is an arbitrarily large positive number.
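The WLLN can also be watched in action. The sketch below (all values are illustrative assumptions, not from the article) estimates, by repeated simulation, how often the estimation error exceeds a fixed ϵ for increasing sample sizes:

```python
import random

random.seed(7)
p, eps, trials = 0.75, 0.02, 1000   # illustrative values

def error_exceeds_eps_rate(n):
    """Fraction of simulated samples whose error |x_bar_n/n - p| exceeds eps."""
    count = 0
    for _ in range(trials):
        x = sum(1 for _ in range(n) if random.random() < p)
        if abs(x / n - p) > eps:
            count += 1
    return count / trials

# As n grows, the probability of an error larger than eps shrinks.
rates = {n: error_exceeds_eps_rate(n) for n in (100, 400, 1600)}
for n, rate in rates.items():
    print(n, rate)
```

The rate at n = 1600 is a small fraction of the rate at n = 100, which is the WLLN doing its work.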
The WLLN can be expressed in three other ways highlighted in the blue boxes below. These alternative forms result from doing some simple algebraic gymnastics as follows:
Now look at the probability in the third blue box:
P(μ − δ ≤ x_bar_n ≤ μ + δ) = (1 − α)

Or, substituting μ = np:

P(np − δ ≤ x_bar_n ≤ np + δ) = (1 − α)
Since x_bar_n ~ Binomial(n, p), it is easy to express this probability as a difference of two binomial probabilities as follows:
But it is at this point that things stop being simple. For large n, the factorials within the two sums become huge and almost impossible to compute by hand. Imagine having to calculate 20!, let alone 100! or 1000!. What is needed is a good approximation technique for n!. In Ars Conjectandi, Jacob Bernoulli made some weak attempts to approximate these probabilities, but the quality of his approximations left much to be desired.
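De Moivre had no computer, but the modern workaround for the exploding factorials is worth a look: work in log-space. A minimal sketch using Python's math.lgamma (the function name log_binom_pmf is mine, not anything from the article):

```python
import math

def log_binom_pmf(n, k, p):
    """log of C(n, k) * p^k * (1-p)^(n-k), computed via log-gamma
    so the huge factorials never have to be formed explicitly."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

# P(x = 750) for n = 1000, p = 0.75. 1000! has 2568 digits,
# yet the log-space computation is instant.
pmf_750 = math.exp(log_binom_pmf(1000, 750, 0.75))
print(pmf_750)
```

The log-gamma route sidesteps the very obstacle that stalled Bernoulli, which is why a fast factorial approximation mattered so much.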
Abraham De Moivre's great idea
In the early 18th century, when de Moivre began examining Bernoulli's work, he immediately perceived the need for a fast, high-quality approximation technique for the factorial terms in the two sums. Without such a technique, Bernoulli's great achievement was like a big, beautiful kite without a string: a law of great beauty but of little practical use.
De Moivre reformulated the problem as an approximation to the sum of successive terms in the expansion of (a + b) raised to the nth power. This expansion, known as the binomial formula, says the following:
De Moivre's reasons for reformulating the WLLN probabilities in terms of the binomial formula were surprisingly simple. It was known that if the sample sum x_bar_n has a binomial distribution, the probability that x_bar_n is less than or equal to some value n can be expressed as a sum of (n+1) probabilities as follows:
If you compare the coefficients of the terms on the right-hand side of the above equation with the coefficients of the terms in the expansion of (a+b)^n, you will find that they are identical. And so de Moivre theorized: if you find a way to approximate the factorial terms on the right-hand side of (a+b)^n, you will have paved the way to approximating P(x_bar_n ≤ n), and therefore also the probability at the center of the weak law of large numbers, namely:
P(np − δ ≤ x_bar_n ≤ np + δ) = (1 − α)
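The correspondence de Moivre exploited is easy to check numerically: setting a = p and b = q in the binomial expansion turns its terms into exactly the binomial probabilities. A small illustrative example (the values of n, p, and m are my own choices):

```python
import math

# Illustrative values (not from the article): a small n so the
# expansion is easy to inspect.
n, p = 10, 0.3
q = 1 - p

# Terms of the binomial expansion (p + q)^n: C(n, k) * p^k * q^(n-k).
terms = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

# With a = p and b = q the expansion sums to (p + q)^n = 1, which is
# exactly the statement that the n+1 binomial probabilities sum to 1.
total = sum(terms)
print(total)

# Partial sums of the same terms give the cumulative probability
# P(x_bar_n <= m).
m = 4
cdf_m = sum(terms[:m + 1])
print(cdf_m)
```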
For more than 10 years, de Moivre worked hard on the approximation problem, creating increasingly precise approximations of the factorial terms. In 1733, he had largely concluded his work when he published what came to be called de Moivre's theorem (known today as the de Moivre-Laplace theorem).
At this point, I could just state De Moivre's theorem, but that would ruin half the fun. Instead, let's follow de Moivre's line of thinking. We will work on the calculations that lead to the formulation of his great theorem.
Our requirement is a fast and high-precision approximation technique for the probability that lies at the heart of Bernoulli's theorem, namely:
P(|x_bar_n/n − p| ≤ ϵ)
Or equivalently its transformed version:
P(np − δ ≤ x_bar_n ≤ np + δ)
Or in the most general form, the following probability:
P(x_1 ≤ x ≤ x_2)
In this final form, we have assumed that x is a discrete random variable that has a binomial distribution. Specifically, x ~ Binomial(n,p).
The probability P(x_1 ≤ x ≤ x_2) can be expressed as follows:
Now let p, q be two real numbers such that:
0 ≤ p ≤ 1, 0 ≤ q ≤ 1, and q = (1 − p).
Since x ~ Binomial(n, p), E(x) = μ = np and Var(x) = σ² = npq.
Let's define a new random variable z as follows:

z = (x − μ)/σ = (x − np)/√(npq)
z is clearly the standardized version of x; as n grows large, z behaves like a standard normal random variable. Thus,

If x ~ Binomial(n, p), then z ~ N(0, 1) asymptotically.
Keep this in mind, because we will return to this fact in just a minute.
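This asymptotic normality can be checked by simulation. The sketch below standardizes many binomial draws and compares the empirical frequency of z ≤ 1 against the standard normal CDF (all numeric values are illustrative assumptions):

```python
import math, random

random.seed(1)
n, p = 1000, 0.75          # illustrative values
q = 1 - p
mu, sigma = n * p, math.sqrt(n * p * q)

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Simulate binomial draws, standardize each one, and compare the
# empirical frequency of z <= 1 against Phi(1) ~ 0.8413.
trials = 2000
zs = [(sum(1 for _ in range(n) if random.random() < p) - mu) / sigma
      for _ in range(trials)]
empirical = sum(1 for z in zs if z <= 1) / trials
print(empirical, phi(1))
```

The two numbers agree to within simulation noise, which is what "z ~ N(0, 1) asymptotically" promises.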
With the above framework in place, de Moivre showed that for very large values of n, the probability:

P(x_1 ≤ x ≤ x_2)

can be approximated by evaluating the following integral:

P(x_1 ≤ x ≤ x_2) ≃ ∫ from z_1 to z_2 of (1/√(2π)) e^(−z²/2) dz
The ≃ sign means that the LHS is asymptotically equal to the RHS. In other words, as the sample size grows to ∞, the LHS converges to the RHS.
Did you notice anything familiar about the integral in the RHS? It is the formula for the area under the probability density curve of a standard normal variable from z_1 to z_2.
And the formula within the integral is the probability density function of the standard normal random variable z:
Let's split the integral on the right-hand side into a difference of two integrals as follows:
The two new integrals in the RHS are, respectively, the cumulative probabilities P(z ≤ z_2) and P(z ≤ z_1).
The cumulative distribution function P(Z ≤ z) of a standard normal random variable is represented by the standard notation:

Φ(z)
Therefore, the integral in the LHS of the previous equation is equal to:
Φ(z_2) − Φ(z_1).
Putting it all together, we can see that the probability:
P(x_1 ≤ x ≤ x_2)

converges asymptotically to Φ(z_2) − Φ(z_1):
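A sketch of this approximation at work: the exact binomial probability P(x_1 ≤ x ≤ x_2) next to Φ(z_2) − Φ(z_1), using illustrative values of n, p, x_1, and x_2 chosen by me:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative values, not from the article.
n, p = 1000, 0.75
q = 1 - p
mu, sigma = n * p, math.sqrt(n * p * q)

x1, x2 = 730, 770
# Exact binomial probability: sum of the pmf over x1..x2.
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(x1, x2 + 1))
# De Moivre's normal approximation: Phi(z2) - Phi(z1).
z1, z2 = (x1 - mu) / sigma, (x2 - mu) / sigma
approx = phi(z2) - phi(z1)
print(exact, approx)
```

The two values agree to about two decimal places at n = 1000, and the agreement tightens as n grows.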
Now recall how we defined z as the standardized version of x:
And so we also have the following:
In formulating his theorem, De Moivre defined the limits x_1 and x_2 as follows:
Substituting these values of x_1 and x_2 into the set of equations above, we obtain:
And therefore, de Moivre showed that for very large n:
Remember, what de Moivre really wanted was to approximate the probability in the LHS of Bernoulli's theorem:
Which he achieved by doing the following simple substitutions:
Which produces the following asymptotic equality:
With a single elegant stroke, de Moivre showed how to approximate the probability of Bernoulli's theorem for large sample sizes. And Bernoulli's theorem is all about large sample sizes. There is, however, a caveat to de Moivre's achievement: the integral in the RHS does not have a closed form, and de Moivre approximated it using an infinite series.
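De Moivre's own series differed in its details, but the idea of evaluating the integral term by term can be sketched with the Taylor expansion of e^(−t²/2), integrated termwise:

```python
import math

def phi_series(z, terms=40):
    """Standard normal CDF via term-by-term integration of the Taylor
    series of e^(-t^2/2). A sketch of the series idea only; de Moivre's
    own expansion differed in its details."""
    total = 0.0
    for k in range(terms):
        # Integrating (-1)^k t^(2k) / (2^k k!) from 0 to z gives:
        total += (-1)**k * z**(2 * k + 1) / ((2 * k + 1) * 2**k * math.factorial(k))
    return 0.5 + total / math.sqrt(2 * math.pi)

def phi_erf(z):
    """Reference value via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(phi_series(1.0), phi_erf(1.0))
```

For moderate z the alternating series converges quickly, which is what made a series attack on the integral practical in an era of hand computation.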
An illustration of De Moivre's theorem
Suppose there are exactly three times as many black tickets as white tickets in the urn. So the true fraction of black tickets, p, is 3/4. Suppose also that you draw a random sample, with replacement, of 1,000 tickets. Given that p = 0.75, the expected number of black tickets is np = 750. Suppose the number of black tickets you observe in the sample is 789. What is the probability of drawing a sample whose count of black tickets lies within 39 of the expected 750?
Let's state the facts:
We want to know:
P(750 − 39 ≤ x_bar_n ≤ 750 + 39)
We will use De Moivre's theorem to find this probability. As we know, the theorem can be stated as follows:
We know that n=1000, p=0.75, x_bar_n=789 and δ=39. We can find k as follows:
Plugging in all the values:
In approximately 99.56% of random samples of 1,000 tickets each, the number of black tickets will be between 711 and 789.
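The whole worked example can be verified in a few lines. The sketch below computes both de Moivre's normal approximation and, for comparison, the exact binomial probability (the exact sum is my addition for checking purposes):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, delta = 1000, 0.75, 39
q = 1 - p
mu, sigma = n * p, math.sqrt(n * p * q)   # mu = 750, sigma = sqrt(187.5)

# De Moivre's approximation: k = delta / sigma, then Phi(k) - Phi(-k).
k = delta / sigma
approx = phi(k) - phi(-k)

# Exact binomial probability P(711 <= x <= 789), for comparison.
exact = sum(math.comb(n, j) * p**j * q**(n - j) for j in range(711, 790))

print(f"normal approximation: {approx:.4f}")   # ~0.9956
print(f"exact binomial:       {exact:.4f}")
```

The approximation reproduces the article's 99.56% and sits within a fraction of a percent of the exact binomial value.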