Statistical estimates can be fascinating, can’t they? By just sampling a few instances from a population, you can infer properties of that population such as the mean value or the variance. Likewise, under the right circumstances, it is possible to estimate the total size of the population, as I want to show you in this article.
I will use the example of drawing lottery tickets to estimate how many tickets there are in total, and hence calculate the likelihood of winning. More formally, this means estimating the population size given a discrete uniform distribution. We will see different estimates and discuss their differences and weaknesses. In addition, I will point you to some other use cases where this approach can be applied.
Playing the lottery
Let’s imagine I go to a state fair and buy some tickets in the lottery. As a data scientist, I want to know the probability of winning the main prize, of course. Let’s assume there is just a single ticket that wins the main prize. So, to estimate the likelihood of winning, I need to know the total number of lottery tickets N, as my chance of winning is 1/N then (or k/N, if I buy k tickets). But how can I estimate that N by just buying a few tickets (which are, as I saw, all losers)?
I will make use of the fact that the lottery tickets have numbers on them, and I assume that these are consecutive running numbers (which means that I assume a discrete uniform distribution). Say I have bought some tickets and their numbers in order are (242, 412, 823, 1429, 1702). What do I know about the total number of tickets now? Well, obviously there are at least 1702 tickets (as that is the highest number I have seen so far). That gives me a first lower bound on the number of tickets, but how close is it to the actual number? Just because the highest number I have drawn is 1702, that doesn't mean there are no numbers higher than that. It is very unlikely that my sample happens to contain the lottery ticket with the highest number of all.
However, we can make more out of the data. Let us think as follows: If we knew the middle number of all the tickets, we could easily derive the total number from it: If the middle number is m, then there are m-1 tickets below that middle number, and there are m-1 tickets above it. That is, the total number of tickets would be (m-1) + (m-1) + 1 (with the +1 being ticket number m itself), which equals 2m-1. We don't know that middle number m, but we can estimate it by the mean or the median of our sample. My sample above has the (rounded) average of 922, which yields 2*922-1 = 1843. That is, from this calculation the estimated number of tickets is 1843.
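Just to make the arithmetic explicit, here is a minimal Python sketch of this calculation (using the sample from above):

sample = [242, 412, 823, 1429, 1702]

# estimate the middle number m by the rounded sample mean, then apply 2*m - 1
m = round(sum(sample) / len(sample))   # 922
print(2 * m - 1)                       # 1843, the estimated total number of tickets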
That was quite interesting so far, as just from a few lottery ticket numbers, I was able to give an estimate of the total number of tickets. However, you may wonder if that is the best estimate we can get. Let me spoil it right away: it is not.
The method we used has some drawbacks. Let me demonstrate that with another example: Say we have the numbers (12, 30, 88), which leads us to 2*43-1 = 85. That means the formula suggests there are 85 tickets in total. However, we have ticket number 88 in our sample, so this cannot be true at all! There is a general problem with this method: the estimated N can be lower than the highest number in the sample. In that case, the estimate has no meaning at all, as we already know that the highest number in the sample is a natural lower bound for the overall N.
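This failure mode is easy to reproduce with the same few lines as above:

sample = [12, 30, 88]
m = round(sum(sample) / len(sample))   # 43
estimate = 2 * m - 1                   # 85
print(estimate < max(sample))          # True: the estimate falls below a ticket number we have already seen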
A better approach: Using even spacing
Okay, so what can we do? Let us think in a different direction. The lottery tickets I bought were sampled randomly from the distribution that goes from 1 to the unknown N. My ticket with the highest number is 1702, and I wonder how far away it is from being the highest ticket of all. In other words, what is the gap between 1702 and N? If I knew that gap, I could easily calculate N from it. What do I know about that gap, though? Well, I have reason to assume that it is expected to be as big as all the other gaps between two consecutive tickets in my sample. The gap between the first and the second ticket should, on average, be as big as the gap between the second and the third ticket, and so on. There is no reason why any of those gaps should be bigger or smaller than the others, except for random deviation, of course. Since I sampled my lottery tickets independently, they should be evenly spaced over the range of all possible ticket numbers. On average, the numbers in the range of 1 to N would look like birds on a power line, all having the same gap between them.
That means I expect N-1702 to equal the average of all the other gaps. The other gaps are 242-0=242, 412-242=170, 823-412=411, 1429-823=606, 1702-1429=273, which gives the average 340. Hence I estimate N to be 1702+340=2042. In short, this can be denoted by the following formula:

N ≈ x + (x - k) / k
Here x is the biggest number observed (1702, in our case), and k is the number of samples (5, in our case). This is essentially a short form of calculating the average gap as we just did.
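As a small illustration, here is a minimal Python sketch of this gap-based estimate (again using the sample from above):

sample = [242, 412, 823, 1429, 1702]
k = len(sample)
x = max(sample)

# average gap between consecutive ticket numbers, counting the gap below the smallest ticket as well
gaps = [b - a for a, b in zip([0] + sample[:-1], sample)]
avg_gap = sum(gaps) / len(gaps)   # 340.4
print(round(x + avg_gap))         # 2042, the gap-based estimate of N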
Let’s do a simulation
We just saw two estimates of the total number of lottery tickets. First, we calculated 2*m-1, which gave us 1843, and then we used the more sophisticated approach of x + (x-k)/k and obtained 2042. Which of the two estimates is more accurate? Are my chances of winning the lottery 1/1843 or 1/2042?
To show some properties of the estimates we just used, I ran a simulation. I drew samples of different sizes k from a distribution whose highest number is 2000, and I repeated that a few hundred times for each k. Hence we would expect our estimates to also return 2000, at least on average. This is the outcome of the simulation:
What do we see here? On the x-axis, we see k, i.e. the number of samples we take. For each k, we see the distribution of the estimates from a few hundred simulations for the two formulas we just got to know. The dark point indicates the mean value of the simulations for each k, which is always very close to 2000, independent of k. That is a very interesting point: if the estimation is repeated many times, both estimates give the correct value on average.
However, besides the common average, the distributions differ quite a lot. We see that the formula 2*m-1 has a higher variance, i.e. its estimates are far away from the real value more often than those of the other formula. The variance tends to decrease with higher k, though. This decrease does not always hold perfectly, as this is just a simulation and is still subject to random influences. However, it is quite understandable and expected: the more samples I take, the more precise my estimate becomes. That is a very common property of statistical estimates.
For the first formula, we also see that the deviations are symmetrical, i.e. underestimating the real value is as likely as overestimating it. For the second approach, this symmetry does not hold: while most of the density is above the real value, there are more and larger outliers below. How does that come about? Let's retrace how we computed that estimate. We took the biggest number in our sample and added the average gap size to it. Naturally, the biggest number in our sample can only be as big as the biggest number in total (the N that we want to estimate). In that case, we add the average gap size to N, but we can't get any higher than that with our estimate. In the other direction, the biggest number can be very low. If we are unlucky, we could draw the sample (1,2,3,4,5), in which case the biggest number in our sample (5) is very far away from the actual N. That is why larger deviations are possible when underestimating the real value than when overestimating it.
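To make this asymmetry concrete, here is a tiny sketch of the two extreme cases for k=5 and N=2000 (both samples are hypothetical, of course):

k, N = 5, 2000

# worst case downwards: we happen to draw the five lowest ticket numbers
x = max([1, 2, 3, 4, 5])
print(round(x + (x - k) / k))   # 5 -> a massive underestimation of N

# most extreme case upwards: the highest ticket N itself is in our sample
x = max([242, 412, 823, 1429, 2000])
print(round(x + (x - k) / k))   # 2399 -> the largest value the estimate can take for N=2000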
Which is better?
From what we just saw, which estimate is better now? Well, both give the correct value on average. However, the formula x + (x-k)/k has lower variance, and that is a big advantage. It means that you are closer to the real value more often. Let me demonstrate that. In the following, you see the probability density plots of the two estimates for a sample size of k=5.
I highlighted the point N=2000 (the real value of N) with a dotted line. First of all, we see the same pattern as before: in the left plot, the density is distributed symmetrically around N=2000, while in the right plot, it is shifted to the right and has a longer tail to the left. Now let's take a look at the grey area under each curve. In both cases, it goes from N=1750 to N=2250. However, in the left plot, this area accounts for 42% of the total area under the curve, while in the right plot, it accounts for 73%. In other words, with the left estimate, you have a 42% chance that your estimate deviates by no more than 250 in either direction; with the right estimate, that chance is 73%. That means you are much more likely to be that close to the real value, although you are somewhat more likely to slightly overestimate than underestimate.
I can tell you that x + (x-k)/k is the so-called uniformly minimum variance unbiased estimator, i.e. among all unbiased estimators, it is the one with the smallest variance. You won't find any unbiased estimate with lower variance, so this is the best you can use, in general.
Use cases
We just saw how to estimate the total number of elements in a pool, if these elements are indicated by consecutive numbers. Formally, this is a discrete uniform distribution. This problem is commonly known as the German tank problem. In the Second World War, the Allies used this approach to estimate how many tanks the German forces had, just by using the serial numbers of the tanks they had destroyed or captured so far.
We can now think of more examples where we can use this approach. Some are:
- You can estimate how many instances of a product have been produced if they are labeled with a running serial number.
- You can estimate the number of users or customers if you are able to sample some of their IDs.
- You can estimate how many students are (or have been) at your university if you sample students' matriculation numbers (provided that the university has not started reusing the lowest numbers after reaching the maximum).
However, be aware that some requirements need to be fulfilled to use this approach. The most important one is that you indeed draw your samples randomly and independently of each other. If you ask your friends, who all enrolled in the same year, for their matriculation numbers, they won't be evenly spaced over the whole range of matriculation numbers but will be quite clustered. Likewise, if you buy articles with running numbers from a store, you need to make sure that this store received these articles in a random fashion. If it was supplied with the products numbered 1000 to 1050, you are not drawing randomly from the whole pool.
Conclusion
We just saw different ways of estimating the total number of instances in a pool under a discrete uniform distribution. Although both estimates give the same expected value in the long run, they differ in terms of their variance, with one being superior to the other. This is interesting because it is not a matter of one approach being right and the other wrong: both are backed by reasonable theoretical considerations and estimate the real population size correctly (in frequentist terms, both are unbiased).
I now know that my chance of winning the state fair lottery is estimated to be 1/2042 ≈ 0.049% (or about 0.24% with the 5 tickets I bought). Maybe I should rather invest my money in cotton candy; that would be a safe win.
References & Literature
Mathematical background on the estimates discussed in this article can be found here:
- Johnson, R. W. (1994). Estimating the size of a population. Teaching Statistics, 16(2), 50–52.
Also feel free to check out the Wikipedia articles on the German tank problem and related topics, which are quite explanatory.
This is the script to do the simulation and create the plots shown in the article:
import numpy as np
import random
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

if __name__ == "__main__":
    N = 2000
    n_simulations = 500

    # Estimate 1: twice the (rounded) sample mean minus one
    estimate_1 = lambda sample: 2 * round(np.mean(sample)) - 1
    # Estimate 2: sample maximum plus the average gap size
    estimate_2 = lambda sample, k: round(max(sample) + (max(sample) - k) / k)

    estimate_1_per_k, estimate_2_per_k = [], []
    k_range = range(2, 10)
    for k in k_range:
        # sample ticket numbers 1..N without duplicates:
        samples = [random.sample(range(1, N + 1), k) for _ in range(n_simulations)]
        estimate_1_per_k.append([estimate_1(sample) for sample in samples])
        estimate_2_per_k.append([estimate_2(sample, k) for sample in samples])

    # Violin plots of the estimate distributions per sample size k
    fig, axs = plt.subplots(1, 2, sharey=True, sharex=True)
    axs[0].violinplot(estimate_1_per_k, positions=k_range, showextrema=True)
    axs[0].scatter(k_range, [np.mean(d) for d in estimate_1_per_k], color="purple")
    axs[1].violinplot(estimate_2_per_k, positions=k_range, showextrema=True)
    axs[1].scatter(k_range, [np.mean(d) for d in estimate_2_per_k], color="purple")
    axs[0].set_xlabel("k")
    axs[1].set_xlabel("k")
    axs[0].set_ylabel("Estimated N")
    axs[0].set_title(r"$2\times m-1$")
    axs[1].set_title(r"$x+\frac{x-k}{k}$")
    plt.show()
    plt.gcf().clf()

    # Density plots of both estimates for k = 5
    k = 5
    xs = np.linspace(500, 3500, 500)
    fig, axs = plt.subplots(1, 2, sharey=True)
    density_1 = gaussian_kde(estimate_1_per_k[k_range.index(k)])
    axs[0].plot(xs, density_1(xs))
    density_2 = gaussian_kde(estimate_2_per_k[k_range.index(k)])
    axs[1].plot(xs, density_2(xs))
    axs[0].vlines(2000, ymin=0, ymax=0.003, color="grey", linestyles="dotted")
    axs[1].vlines(2000, ymin=0, ymax=0.003, color="grey", linestyles="dotted")
    axs[0].set_ylim(0, 0.0025)

    # Shade the area between N=1750 and N=2250 under each density curve
    a, b = 1750, 2250
    ix = np.linspace(a, b)
    verts = [(a, 0), *zip(ix, density_1(ix)), (b, 0)]
    poly = plt.Polygon(verts, facecolor="0.9", edgecolor="0.5")
    axs[0].add_patch(poly)
    print("Integral for estimate 1: ", density_1.integrate_box_1d(a, b))
    verts = [(a, 0), *zip(ix, density_2(ix)), (b, 0)]
    poly = plt.Polygon(verts, facecolor="0.9", edgecolor="0.5")
    axs[1].add_patch(poly)
    print("Integral for estimate 2: ", density_2.integrate_box_1d(a, b))
    axs[0].set_ylabel("Probability Density")
    axs[0].set_xlabel("N")
    axs[1].set_xlabel("N")
    axs[0].set_title(r"$2\times m-1$")
    axs[1].set_title(r"$x+\frac{x-k}{k}$")
    plt.show()
Like this article? Follow me to be notified of my future posts.