# Multi-Armed Bandits

One of the classic problems in statistics, and one that many people at STOR-i work on, is the ‘Multi-Armed Bandit’ problem. In fact, there was a conference on the topic and its applications at Lancaster earlier this year. In this post I will explain what the problem is and touch on how it is solved.

This problem was first thought about in the casinos of Las Vegas. It concerns how to pick which slot machine out of a group you should play in order to maximise your total winnings. Whilst the problem is no longer just applicable to casinos, the name ‘one-armed bandit’ has stuck, and the casino analogy is the one I will continue to use.

This problem was originally considered to be exceptionally hard to solve and puzzled many researchers for a long time. At first glance you might think that the best thing to do (or policy to take) is to simply play the bandit that gives the highest return (i.e. the best chance of winning). The trouble with this approach is that you don’t actually learn anything about the other machines you could play. In fact the key to solving this problem is that by playing other machines we do learn something: information.

## Mathematical Formulation (Feel free to ignore!)

• You can choose one of K independent one-armed bandits to play at each time $t = 0, 1,2,...$
• Bandit $i$ has an unknown success probability $p_i$ at every pull
• Bayesian formulation: $p_i \sim \pi_i$ (independent prior distributions)
• If the play at time $t$ is successful, you get a (discounted) reward of $a^t$, where $0 < a < 1$ is the discount factor
• At each time $t$, your choice of which bandit to play is informed by each bandit’s record of success/failure to date (and hence by the current posterior distributions $\pi(t)$)
• How should you pull arms to maximise your total expected winnings from all pulls?
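The setup above can be sketched in a few lines of Python. To keep it concrete I assume Beta priors, a standard conjugate choice for Bernoulli success probabilities (the post leaves the priors $\pi_i$ general), and a placeholder round-robin choice rule just to show the posterior bookkeeping; the rest of the post develops the actual choice rule.

```python
import random

def pull(p):
    """Simulate one pull of a bandit with success probability p."""
    return 1 if random.random() < p else 0

def update(alpha, beta, outcome):
    """Posterior update for a Beta(alpha, beta) prior after one Bernoulli pull."""
    return (alpha + outcome, beta + (1 - outcome))

random.seed(0)
true_p = [0.3, 0.7]            # hidden success probabilities (unknown to the player)
posteriors = [(1, 1), (1, 1)]  # uniform Beta(1, 1) priors for each arm

for t in range(100):
    i = t % 2                  # placeholder rule: alternate arms to gather data
    r = pull(true_p[i])
    posteriors[i] = update(*posteriors[i], r)

for i, (al, be) in enumerate(posteriors):
    print(f"arm {i}: posterior mean {al / (al + be):.2f}")
```

The posterior mean for each arm drifts towards its hidden success probability as pulls accumulate, which is exactly the ‘record of success/failure to date’ the formulation refers to.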

## Calibrating a One-Armed Bandit

We would like a method that works like this:

1. Look at each machine in turn and give it a score.
2. Pull the lever of the machine with the highest score.

The question is: what is this mysterious score?

To illustrate the mathematical theory behind it, let’s consider a one-armed bandit (the ‘black’ bandit) whose success probability $p$ is unknown, with prior distribution $\pi$, and an ‘imaginary’ one-armed bandit (the ‘red’ bandit) with a known success probability $\upsilon$.

How do we optimally pick which of these two machines to play? Let’s think about what we would do for some simple possibilities:

• If $\upsilon = 0$, there is no chance of winning on the red bandit, so we will never play it and always play the black bandit.
• If $\upsilon = 1$, we are guaranteed to win on the red bandit, so we will always play it and never play the black bandit.
• If $\upsilon = E_\pi(p)$, the chance of winning is the same for both bandits: this is the expected value, under the prior distribution, of the black bandit’s success probability.

But whilst the rewards will be the same for playing on either bandit, we can learn more by playing the black bandit, since its success probability is unknown, whereas there is nothing to learn from playing the red bandit.

We can illustrate this more clearly on the number line below.

As you can see, it is logical to assume that there exists some point on the number line where the black and red bandits are equally good to play. This point is called the ‘Gittins Index’ and is the mysterious score we mentioned earlier.

The other thing you can see in the diagram is the ‘Uncertainty Gap’. The less certain the prior $\pi$ is about $p$, the more information we can gain by playing the black bandit, so the larger the uncertainty gap. However, if we only play for a short time or discount heavily (a small $a$, i.e. current rewards matter more), the uncertainty gap will be smaller.

## Finding the Gittins Index

Let’s finish by working through a simple example. Suppose the black bandit’s success probability $p$ can only take the values $0$ or $1$, and let the prior be $\pi = \mathrm{Prob}(p = 1)$. Then we can work out the payoffs as seen below.

• Choosing to play the red bandit at the start means we get a payoff of $\upsilon$ at every pull, discounted over time, since we play it forever.

Payoff $= \upsilon(1 + a + a^2 + a^3 + ... )= \frac{\upsilon}{1-a}$
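A quick numerical check of the geometric series above (the values of $\upsilon$ and $a$ are chosen arbitrarily):

```python
# Truncated geometric series vs. the closed form v/(1-a).
v, a = 0.6, 0.9
approx = sum(v * a**t for t in range(1000))  # 1000 terms is effectively infinite here
exact = v / (1 - a)
print(approx, exact)  # both approximately 6.0
```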

• Choosing to play the black bandit at the start gives us two outcomes,
1. There is a probability $\pi$ that $p = 1$, so every pull succeeds and we play the black bandit forever.
2. There is a probability $1-\pi$ that $p = 0$, so the first pull fails and we switch to the red bandit forever.

Payoff $= \pi(1 + a + a^2 + a^3 + ... ) + (1-\pi)(0 + a\frac{\upsilon}{1-a})$

Then the Gittins Index is simply the value of $\upsilon$ at which these two payoffs are equal.

$\frac{\upsilon}{1-a} = \frac{\pi}{1-a} + \frac{(1-\pi)a\upsilon}{1-a}$

$\upsilon = \frac{\pi}{1-(1-\pi)a} = \pi + \frac{\pi(1-\pi)a}{1-(1-\pi)a}$
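We can sanity-check this algebra numerically; the values of $\pi$ and $a$ below are arbitrary:

```python
# At v = pi/(1 - (1-pi)*a), playing the red bandit forever and trying the
# black bandit first give the same expected payoff.
pi_, a = 0.4, 0.9

v = pi_ / (1 - (1 - pi_) * a)                        # claimed Gittins index
red = v / (1 - a)                                    # red bandit forever
black = pi_ / (1 - a) + (1 - pi_) * a * v / (1 - a)  # black first, switch on failure
print(v, red, black)
```

Note that $\upsilon > \pi$ whenever $0 < \pi < 1$: the index exceeds the prior mean, and that excess is exactly the uncertainty gap from the diagram.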

So how do we solve the multi-armed bandit problem? At each time step we calculate the Gittins index for each bandit in turn, and we play the bandit with the highest index every time!
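As a sketch, here is what that policy looks like for the two-point model above. The arm priors are illustrative, and the closed-form index is specific to this toy example; for richer bandit models the Gittins index has to be computed numerically.

```python
# Index policy for the toy model: each arm i has prior pi_i = Prob(p_i = 1);
# play the arm whose closed-form Gittins index is largest.

def gittins_index(pi_i, a):
    # Closed form derived in the example above.
    return pi_i / (1 - (1 - pi_i) * a)

a = 0.9
priors = [0.2, 0.5, 0.35]  # hypothetical priors for three arms
indices = [gittins_index(p, a) for p in priors]
best = max(range(len(priors)), key=lambda i: indices[i])
print(best, indices[best])
```

After each pull the chosen arm's prior would be updated in light of the outcome, and the indices recomputed before the next choice.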