# Baseball Elimination Problem

Recently we had a masterclass from Pitu Mirchandani on the subject of network flows. Network flow problems have been studied since the 1950s; one example is my previous post on Dijkstra’s algorithm. The main thing I took away from the masterclass was that the ideas of shortest paths, maximum flows and min-cost max-flows can be used to describe and solve a surprisingly wide variety of problems.

One interesting problem which can be solved using the ideas of network flows is the baseball elimination problem. Baseball in the USA consists of two leagues, each split into three divisions of about 5 teams. During the regular season each team plays 162 games in total, most of which are against teams in their own division (76 vs the same division, 66 vs the same league, 20 vs the other league). The aim is to qualify for the playoffs, which can be done by finishing first in the division.

All this adds up to a desire to work out whether a particular team can still potentially win its division and make it to the playoffs given the current league standings. Solving this problem means looking at tables like the one below.

Who can make the playoffs?
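To sketch how a max-flow formulation answers this question, here is a minimal implementation. The standard construction puts a source feeding the remaining games between the other teams, each game node feeding its two teams, and each team feeding a sink with capacity equal to the wins it can accumulate without overtaking our team; the team is eliminated exactly when not all remaining games can be absorbed. The team names and standings in the usage example are made up for illustration.

```python
from collections import defaultdict, deque

def max_flow(capacity, s, t):
    # Edmonds-Karp: repeatedly find augmenting paths with BFS.
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, cap in capacity[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Trace the augmenting path back from t and find its bottleneck.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= aug
            capacity[v][u] += aug
        flow += aug

def eliminated(team, wins, remaining, games_left):
    """Can `team` still finish (at least joint) top of the division?
    wins: current wins per team
    remaining: (i, j) -> games left between rivals i and j (excluding `team`)
    games_left: total games each team still has to play
    """
    best = wins[team] + games_left[team]
    # Trivial elimination: someone already has more wins than we can reach.
    if any(wins[t] > best for t in wins):
        return True
    cap = defaultdict(lambda: defaultdict(int))
    total = 0
    for (i, j), g in remaining.items():
        cap['s'][(i, j)] += g          # source -> game node
        cap[(i, j)][i] = float('inf')  # game -> either team can take the win
        cap[(i, j)][j] = float('inf')
        total += g
    for t in wins:
        if t != team:
            cap[t]['t'] = best - wins[t]  # wins t can add without passing us
    # Eliminated iff the remaining games cannot all be played without
    # some rival exceeding `best` wins.
    return max_flow(cap, 's', 't') < total
```

For example, if team X can reach at most 84 wins while rivals A (83 wins) and B (82 wins) still have 4 games against each other, those 4 games must generate 4 wins between A and B, but they can only absorb 1 + 2 = 3 without one of them passing X, so X is eliminated even though no single rival is trivially out of reach.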

# Updated Pythagorean Tables

With the Premier League season drawing to a close I thought I’d go back and see what the ‘Pythagorean Expectation’ for the current league table looks like. Using the same values as in my previous post, $\gamma_1 = 1.18$ and $\gamma_2 = 1.23$, we can see whether the same teams seem to be under- or over-performing.
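As a rough sketch of how these residuals can be computed from a league table: the exact functional form is in the earlier post, so the two-exponent variant below, which converts goals scored and conceded into an expected share of the available points, is an assumption for illustration only.

```python
def pythagorean_points(goals_for, goals_against, games_played,
                       g1=1.18, g2=1.23, points_per_game=3):
    # Expected points implied by goals scored and conceded
    # (assumed two-exponent football variant of the formula).
    win_frac = goals_for ** g1 / (goals_for ** g1 + goals_against ** g2)
    return points_per_game * games_played * win_frac

def residual(points, goals_for, goals_against, games_played):
    # Positive residual: more points than the goals suggest (over-performing).
    return points - pythagorean_points(goals_for, goals_against, games_played)
```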

This is the league table from earlier in the year.

Old League Table

Below is the current league table I’m working from with about 5 or 6 games to go.

Current League Table

Observations

• Leicester seem to be unstoppable in real life, and despite my prediction of them hitting a blip, nothing of the sort has happened to them yet. This is reflected in them having outperformed the average team's points total for their goals scored and conceded by a whopping 9.29 pts. This has almost doubled in the last 10 games (they have only lost 1 league game in that stretch, and that was by a single goal)!
• The other prediction I made was that Tottenham seemed to be under-achieving earlier in the season and could challenge for the title. Despite only a modest improvement in their residual, from -5.64 to -4.81, they have maintained their good run of form and look like the most likely title challengers to Leicester.
• Other significant improvements have been made by the likes of Bournemouth and Stoke who have turned their form around to pick up more points than they might have expected to.
• Everton continue to have a shocking season based on Pythagorean expectation, which matches the conventional wisdom about them. In a more typical season, based on their goals scored and conceded, they would have almost 9 more points.
• Finally, despite my current worries about Watford, the maths suggests that they will still be fine in the league this year.

# A Brief Look at Bayesian Statistics

There are two main strands of statistics, Classical statistics and Bayesian statistics. These aren’t necessarily conflicting ideologies, though many statisticians throughout history would beg to differ, but are simply two different ways to tackle a problem. Hopefully this post will give you some brief insight into the uses and differences of the two approaches.

Classical statistics is the first type of statistics that people come across and concerns what we expect to happen in a repeatable experiment. This might be the idea that if we flip a fair coin a very large number of times, the proportion of heads we obtain approaches a half; hence the well-known probability of a head of 1/2. This is why classical statistics is often known as frequentist statistics, and it covers ideas such as confidence intervals and p-values.

Bayesian statistics evolved out of Bayes’ Theorem which I talked about in a previous post.

$P(A|B) = \cfrac{P(B|A)P(A)}{P(B)}$

• P(A), P(B) are known as prior probabilities, because we know them before we learn any more information.
• P(A|B), P(B|A) are known as posterior probabilities, because they are found after we have learnt some additional information.

You can think of Bayesian statistics as an evolution of Bayes’ Theorem. Instead of dealing with point probabilities we now deal with probability distributions. As a result we now have prior and posterior distributions to consider.

$f(\theta|x) = \cfrac{f(x|\theta)f(\theta)}{f(x)}$

As the term $f(x)$ is just a normalising constant we can drop it to get the commonly seen Bayes’ Rule.

$f(\theta|x) \propto f(x|\theta)f(\theta)$

Here $f(\theta|x)$ is the posterior distribution, $f(x|\theta)$ is the likelihood which accounts for the statistical model, and $f(\theta)$ is the prior which represents the expert beliefs before seeing the data. The key point is that the Bayesian approach can quantify beliefs in theories and hypotheses, something that can be desirable.
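As a small worked example of this prior-to-posterior updating, here is the conjugate Beta-Binomial model for a coin, chosen here purely for illustration: the Beta prior and binomial likelihood multiply, via Bayes' Rule above, into another Beta distribution.

```python
# Beta(a, b) prior on a coin's head probability theta.
# With k heads in n flips the likelihood is binomial, so the posterior
# f(theta|x) ∝ f(x|theta) f(theta) is Beta(a + k, b + n - k).
def posterior(a, b, heads, flips):
    return a + heads, b + flips - heads

a, b = posterior(1, 1, 7, 10)  # flat Beta(1, 1) prior, 7 heads in 10 flips
posterior_mean = a / (a + b)   # 8/12, pulled slightly back towards 1/2
```

Notice the posterior mean of 8/12 sits between the raw data estimate (7/10) and the prior mean (1/2): the prior acts like extra pseudo-observations.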

# Coupon Collecting and Football Stickers

Although Euro 2016 is still months away from officially starting, another footballing tradition has just started: collecting football stickers. For children and adults alike the quest to complete the entire sticker album is something that takes time and money, but the excitement and joy make it worth it.

As the number of stickers in each album has dramatically increased from the first few albums produced around 40 years ago, the task of collecting all of the stickers has become much harder. Mathematically this happens to be a classic problem in probability: the coupon collector's problem. In this problem there is a finite number of different coupons, one of which is given to a person at each step. In a statistical sense this is the problem of sampling with replacement. The question is then: how many times do you need to sample to obtain a copy of every single sticker?
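Under the standard assumptions (every sticker equally likely, drawn one at a time), the expected number of stickers needed to complete an n-sticker album is n times the n-th harmonic number, since after collecting k distinct stickers the next new one takes n/(n-k) draws on average. A quick sketch, with the album size below made up for illustration:

```python
def expected_stickers(n):
    # E[draws to collect all n] = n * (1 + 1/2 + ... + 1/n)
    return n * sum(1 / k for k in range(1, n + 1))

# For a hypothetical 640-sticker album:
expected_stickers(640)  # roughly 4,500 stickers
```

The harmonic sum grows like log n, so doubling the album size more than doubles the expected cost of completing it.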

# Optimisation in Car-Sharing Systems

For many people the costs of owning, running and maintaining a car are simply too big to justify having one. Having to pay hundreds, if not thousands of pounds to cover insurance, road tax and fuel costs for only a few short journeys a week to work or to the shops doesn’t make financial sense. One alternative that has proven popular is ‘car-sharing’ or ‘car-clubs’. These schemes allow members to access vehicles owned by the company by reserving them in advance, so they can use them for the short trips they might make in a day. This could be anything from picking up the kids from school, popping down to the shops or going out for a few hours at the weekend.

There are several other benefits to using a car-sharing scheme in addition to any potential financial savings. Using a single vehicle for several different people's short trips means fewer cars are on the road in total, reducing traffic congestion and vehicle emissions. Even something as simple as being able to guarantee a parking space when you reach your destination can be attractive, particularly in busy inner cities.

# Optimal Search Problems

One active area of research within STOR-i is the problem of how to optimally search for something or someone. This is a problem of practical importance with many real-world applications: bomb disposal squads need to search for unexploded ordnance, search parties want to look for survivors of disasters, and the police want to find people on the run from the law.

As with most things, the best way to build intuition for a problem is to start by constructing smaller toy models of the system. We can represent the places where an object might be hidden as arcs in a network, joined to each other at nodes. This type of structure is deliberately similar to that of a road system: the arcs represent streets and the nodes are the street corners. The object that we want to find will be hidden along one of the streets, and the job of those searching is to find it as quickly as possible.

Example Network for Optimal Searches
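To make this concrete, here is a minimal sketch of evaluating a search route on such a network. The assumptions are mine for illustration: each street has a known length and a known probability of hiding the object, the searcher moves at unit speed, spots the object the moment they pass it, and on average finds it halfway along a street.

```python
def expected_search_time(route, lengths, hide_prob):
    """Expected time to find the object for a route given as a sequence
    of street (arc) labels; revisited streets still cost travel time
    but have already been searched."""
    t, expectation, searched = 0.0, 0.0, set()
    for street in route:
        length = lengths[street]
        if street not in searched:
            # Found here with prob hide_prob[street], on average mid-street.
            expectation += hide_prob[street] * (t + length / 2)
            searched.add(street)
        t += length  # walking an already-searched street still takes time
    return expectation

# Two dead-end streets off a junction; the searcher must walk back out
# of whichever one they try first.
lengths = {'a': 1.0, 'b': 4.0}
prob = {'a': 0.5, 'b': 0.5}
expected_search_time(['a', 'a', 'b'], lengths, prob)  # 2.25: short street first
expected_search_time(['b', 'b', 'a'], lengths, prob)  # 5.25: long street first
```

Even on this two-street toy network the order matters: checking the short, equally-likely street first more than halves the expected time to find the object.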

# Why Lawyers Need Statistics

I previously talked about Bayes’ theorem and its often misunderstood applications. Normally these mistakes aren’t particularly costly or harmful in the world of statistics, but if they are used to make decisions that impact the real world then getting things wrong can be extremely costly.

One place where statistics can be called upon to influence important matters is the court. Throughout the last 50 years there has been an increase in the use of statistics in court matters and it is important that everybody involved understands them and their use. If any or all of the prosecution, defence or jury misinterpret the information given to them then the chances of a miscarriage of justice will greatly increase.

$\text{Prob of matching a description} \neq \text{Prob of matching a description and being guilty}$

The classic mistake made in the past by many prosecuting teams is the ‘prosecutor’s fallacy’. This is when the prosecution or defence present the jury with a statistic, typically the probability of the evidence arising if the defendant were innocent, as though it were the probability that the defendant is innocent given the evidence, and manage to convince the jury to accept it.
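A quick illustration, with made-up numbers, of why the two probabilities in the inequality above can be wildly different:

```python
# Hypothetical: a forensic characteristic matched by 1 in 1,000 people,
# in a city of 500,000 potential suspects.
p_match_given_innocent = 1 / 1000
population = 500_000
expected_matches = population * p_match_given_innocent  # about 500 people

# The fallacy is to quote 1/1000 as the probability the defendant is
# innocent. If exactly one of the matching people is guilty, a matching
# person chosen at random is guilty with probability only:
p_guilty_given_match = 1 / expected_matches  # 0.002, i.e. 1 in 500
```

A jury hearing "only a 1 in 1,000 chance of an innocent match" might conclude near-certain guilt, when the match alone actually points to one suspect among roughly 500.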

Scales of justice