# Coupon Collecting and Football Stickers

Although Euro 2016 is still months away from officially starting, another footballing tradition has just started: collecting football stickers. For children and adults alike the quest to complete the entire sticker album is something that takes time and money but the excitement and joy makes it worth it.

As the number of stickers in each album have dramatically increased from the first few albums produced around 40 years ago, the task of collecting all of the stickers has become much harder. Mathematically this problem happens to be a classic problem in combinatorial optimisation: the coupon-collector problem. This problem is when there is a finite number of different coupons any of which can be given to a person one at a time. In a statistical sense this is the problem of sampling with replacement. The question is then how many times do you need to sample to obtain a copy of every single sticker?

You probably already have some intuition to the answer. The first sticker you get will be guaranteed to be one that you don’t have, whereas the chance of the same sticker being the one to complete your collection is tiny. Clearly the probability of finding a duplicate increases with the more stickers you own, it takes a short time to find the first few stickers but a very long time to find the last few.

### Forming the Problem

The classic way to model the problem is by using the geometric distribution. Suppose that there are a total of $n$ stickers in the album. Then we introduce a geometric random variable $h_i$ that is the number of stickers needed to be purchased to move from having $i-1$ different stickers to $i$ different stickers in our collection. This means that the total number of stickers we need to purchase to complete the collection $H$ is the sum of all these $h_i$.

$H = h_1 + h_2 + h_3 + \ldots + h_n$

We now need to consider what the probability of finding a new sticker actually is. Let’s start by assuming that the probability of getting a sticker is exactly the same as getting any other. If there are a total of N stickers to choose from, then the chance of finding a new sticker the first time is N/N. This is because there are N stickers we need out of a possible N types of stickers. To get a second sticker we have a N-1 stickers left that we need again out of a possible N types of stickers. Similarly for the third sticker we have the N-2 stickers left that we want out of a possible N stickers and so on and so forth.

$p_1 = \cfrac{N}{N}, p_2 = \cfrac{N-1}{N}, p_3 = \cfrac{N-2}{N}, \ldots, p_n = \cfrac{N-n+1}{N}$

This means that the probability $p_i$ of finding the $i$th new sticker is $(N-i+1)/N$. As a result $h_i$ is a geometric random variable with expectation $1/p_i$.

By taking into account the independence of the $h$ variables and the linearity of their expectations we can calculate a formula for the expected number of stickers needed to be bought to complete the album, $E(H)$.

$E(H) = E(h_1) + E(h_2) + E(h_3) + \ldots + E(h_n) = \cfrac{1}{p_1} + \cfrac{1}{p_2} + \cfrac{1}{p_3} + \ldots + \cfrac{1}{p_n}$

If we then substitute the probabilities into the above equation and factorise the equation we produce the result given below.

$E(H) = \cfrac{n}{n} + \cfrac{n}{n-1} + \ldots + \cfrac{n}{2} + \cfrac{n}{1} = n(\cfrac{1}{n} + \cfrac{1}{n-1} + \ldots ++ \cfrac{1}{2} +\cfrac{1}{1}) = n \sum\limits_{i=1}^n \cfrac{1}{i}$

This result produces an interesting result, it involves the sum of the reciprocals of the first $n$ natural numbers. This is known as the nth harmonic number. It should be noted that this differs from the harmonic series as we are not summing to infinity. This sequence crops up in many interesting scenarios, but we are interested in finding its value. It turns out that there is no general formula to find harmonic numbers, we have to use asymptotic expansions or approximations to find it.

### Asymptotic Expansion

The simplest way to illustrate this is to look at its relation with the natural logarithm and the area under the graph of y=$\cfrac{1}{x}$.

If we look at $H_n - \cfrac{1}{n} = 1 + \cfrac{1}{2} + \ldots + \cfrac{1}{n-1}$ we can see that this is an overestimation of the area beneath the graph by using rectangles.

Overestimate; source

If we also use$H_n -1 = \cfrac{1}{2} + \cfrac{1}{3} + \ldots + \cfrac{1}{n}$ we can see that this turns out to be an underestimation of the area.

Underestimate; source

As a result the true value of H_n is bounded by these two values,

$H_n - 1 < \int\limits_1^n \frac{1}{x} dx < H_n - \cfrac{1}{n}$

$\rightarrow \ln{n} + \cfrac{1}{n} < H_n < \ln{n} + 1$

It is clear that the natural logarithm has something to do with this number, which was proved later by Euler who found that the difference between a harmonic number and the natural logarithm was bounded by a constant.

$\text{lim}_{n \rightarrow \infty} (H_n - \ln{n}) = \gamma$

This constant $\gamma$ is known as the Euler-Mascheroni constant and has the value of $\gamma \approx 0.5772156649$.

Rearranging this sequence and multiplying it by $n$ yields the answer to earlier equation.

$E(H) = nH_n \approx n(\ln{n} + \gamma)$

As the current Euro 2016 sticker album contains 680 stickers, we see that using this value for $n$ that the expected number of stickers needed is about 4827. Since stickers are bought in packets of 5 and each packet costs 50p, if you wanted to complete the album by only purchasing packets it would most likely take 966 packets and cost you £483.

The real life problem is slightly more tricky since every sticker in a pack is supposed to be different. This shouldn’t change the answer significantly, 5 stickers out of 660 is pretty small. The correct analytical answer is given below.

$\text{Expected Number of Packets} = {{680}\choose{5}} \sum\limits_{j=1}^{680}(-1)^{j+1} \cfrac{{{680}\choose{5}}}{{{680}\choose{5}} + {{680-j}\choose{5}}}$

### Swapping Stickers

Traditionally people who want to complete their album don’t just rely on buying packets, but also swap stickers with other people. By swapping in person or over the internet it is possible to reduce the number of packets needed by turning duplicates into missing stickers.

Swapping stickers with my cool sister Hannah

Sardy and Velenik did numerical computation to determine the cheapest way of completing a collection. First you should buy a box of 100 packets to get about 500 different stickers. Then you should buy packets and swap stickers until you have about 50 stickers to go. These final 50 can be ordered directly to complete the album. As a result the cost of this album could be as low as £85 depending on the price of the box.

Swapping reduces the number of sticker needed; source