The use of data and statistics has grown rapidly in most sports in recent years. Nowadays many metrics and formulas exist for measuring and predicting the performance of players and teams. One of my favourite tools, because of its simplicity, is ‘Pythagorean Expectation’.

The most famous early pioneer of baseball statistics, Bill James, came up with the original formula to predict the number of wins for a team over a baseball season, based on the number of runs they scored and conceded.
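For reference, James's formula is usually written as follows. The symbols are shorthand chosen here, not notation from this post: RS is runs scored and RA is runs allowed (conceded) over the season.

```latex
\text{Win\%} = \frac{RS^{2}}{RS^{2} + RA^{2}}
```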

The formula gives a win percentage, so to work out the number of wins in a season you simply multiply it by the number of games played. James called it Pythagorean because runs scored and runs conceded both appear as squared terms, much like the sides of the triangle in Pythagoras’ Theorem. Whoever said Pythagoras’ Theorem was boring?

Many people have since refined this formula for baseball and found that an exponent of around 1.81 is more accurate than 2. This reduces the margin of error from about 4 wins per season to around 3 wins, which is pretty good considering each team plays around 160 games a season.
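To make the effect of the exponent concrete, here is a minimal Python sketch. The function names and the example team (800 runs scored, 700 conceded, 162 games) are made up for illustration and are not taken from any real season.

```python
def pythagorean_win_pct(scored: float, conceded: float, exponent: float = 2.0) -> float:
    """Expected win fraction from runs scored and runs conceded."""
    s = scored ** exponent
    c = conceded ** exponent
    return s / (s + c)


def expected_wins(scored: float, conceded: float, games: int, exponent: float = 2.0) -> float:
    """Expected wins = win fraction multiplied by games played."""
    return games * pythagorean_win_pct(scored, conceded, exponent)


# Hypothetical team, not real data.
classic = expected_wins(800, 700, 162, exponent=2.0)
refined = expected_wins(800, 700, 162, exponent=1.81)
print(f"exponent 2.00: {classic:.1f} wins")
print(f"exponent 1.81: {refined:.1f} wins")
```

A lower exponent pulls the prediction towards an even record, so for a team that outscores its opponents the refined estimate comes out slightly below the classic one.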

This idea has also branched out into other sports such as basketball and American football through the work of people such as Daryl Morey, the general manager of the Houston Rockets NBA team. It turns out that every sport has its own Pythagorean exponent; in basketball either 13.9 or 16.5 can be used, whilst in the NFL it is 2.37. This raises the question: what is it in football?

As you can see above, Pythagorean Expectation is a pretty good predictor of a football team’s total points over a season. The trouble is that, whilst it is pretty good, it is nowhere near as accurate as it is for other sports such as baseball. It seems that the method works very well for other sports in general, but not football.

So why is football so much harder to predict? As you might guess, one of the main differences from games like baseball is that draws are a common occurrence in football. Unlike most American sports, where the idea of not having a winner is unfathomable, past results show us that draws happen in nearly a quarter of all football matches played.

This is bad news, as the Pythagorean Expectation doesn’t allow for the possibility of draws. However, some work has gone into extending the formula to handle draws in football, such as Howard Hamilton’s proposal here.

On a more subtle level there is another flaw as well. If one team wins a match then the total points awarded for that match is 3, but if a match is drawn the total points awarded is 2. So clearly not all outcomes are created equal. This means that the Pythagorean Expectation tends to overestimate a team’s points haul over an entire season.
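One way to see the size of this gap is to count the points handed out across a whole league. The sketch below assumes a 380-match season (20 teams playing each other home and away) and uses the rough one-in-four draw rate mentioned earlier; the function name is just illustrative.

```python
def total_points_awarded(matches: int, draw_fraction: float) -> float:
    """League-wide points under 3 for a win and 1 each for a draw."""
    draws = matches * draw_fraction
    decisive = matches - draws
    return 3 * decisive + 2 * draws


matches = 380  # 20 teams, each playing 38 games
no_draws = total_points_awarded(matches, 0.0)
quarter_draws = total_points_awarded(matches, 0.25)
print(f"no draws:      {no_draws:.0f} points")
print(f"25% draws:     {quarter_draws:.0f} points")
print(f"points 'lost': {no_draws - quarter_draws:.0f}")
```

Any model that implicitly treats every match as worth 3 points has to spread those missing points somewhere, which is one plausible source of the overestimate.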

Finally, the simplicity of the formula means that it treats all goals as equal, which we know is simply not true. Obviously a goal that turns a 2-2 draw into a 3-2 win is worth far more than a goal that takes the score from 3-0 to 4-0.

So having finally gained some mastery of R and its optim function this year, I thought that I’d have a go at finding these Pythagorean exponents myself. To do this I read data from the last 10 years of BPL matches into R, created the Pythagorean league table and used optim to pick the parameters that minimised the root-mean-square error between the predicted and actual points in the league table.
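The fitting step can be sketched in a few lines. This is not the R code used for this post: it is a Python stand-in that swaps optim for a simple golden-section search over a single exponent, and it assumes a one-exponent model of the form 3 × games × GF^x / (GF^x + GA^x). The table is synthetic, generated from a known exponent purely so that we can check the search recovers it.

```python
import math


def predicted_points(gf, ga, games, exponent):
    # Assumed model form: 3 points times the Pythagorean share, per game.
    share = gf ** exponent / (gf ** exponent + ga ** exponent)
    return 3 * games * share


def rmse(exponent, table):
    """Root-mean-square error between predicted and actual points."""
    errors = [(predicted_points(gf, ga, games, exponent) - pts) ** 2
              for gf, ga, games, pts in table]
    return math.sqrt(sum(errors) / len(errors))


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimise a unimodal one-dimensional function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Synthetic (goals for, goals against) pairs, NOT real results.
TRUE_EXPONENT = 1.3
goal_pairs = [(80, 30), (65, 40), (55, 50), (45, 55), (40, 60), (30, 75)]
table = [(gf, ga, 38, predicted_points(gf, ga, 38, TRUE_EXPONENT))
         for gf, ga in goal_pairs]

best = golden_section_min(lambda x: rmse(x, table), 0.5, 3.0)
print(f"fitted exponent: {best:.3f}, RMSE: {rmse(best, table):.4f}")
```

With real league tables the error would not drop to zero, and models with more than one parameter would need a multi-dimensional optimiser, which is where something like optim comes in.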

There are several models that we could try as well as the classic one-exponent model. The models I’ve looked at are the following, each of which gives predicted Pythagorean points when multiplied by games played.

There is very little difference in the root-mean-square error between the 2-, 3- and 4-parameter cases. So for the sake of simplicity let’s continue to use just the 2-parameter case. Interestingly, this matches what was found in this 538 article, where they took a similar approach but used results from all the major European leagues over a longer time period.

As we mentioned, despite the limitations of Pythagorean Expectation we can still use it throughout the season to gauge how well teams are doing. As it is now the end of January, we can compare how each BPL team is doing with what our model suggests. Differences between the actual points and the Pythagorean points that are greater than the RMS error (about 4 points) should be noted in particular.
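The comparison itself is simple enough to sketch in Python. The team names and points below are made up for illustration and are not the real table; only the threshold of roughly 4 points comes from the fit described above.

```python
RMS_ERROR = 4.0  # approximate model error quoted in the text


def flag_outliers(rows, threshold=RMS_ERROR):
    """Return (team, difference) pairs where |actual - expected| > threshold."""
    flagged = []
    for team, actual, expected in rows:
        diff = actual - expected
        if abs(diff) > threshold:
            flagged.append((team, round(diff, 1)))
    # Biggest overachievers first, biggest underachievers last.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical (team, actual points, Pythagorean points) rows.
rows = [
    ("Team A", 47, 41.2),
    ("Team B", 33, 32.5),
    ("Team C", 13, 17.9),
]

outliers = flag_outliers(rows)
for team, diff in outliers:
    label = "overachieving" if diff > 0 else "underachieving"
    print(f"{team}: {diff:+.1f} pts ({label})")
```

Teams inside the threshold, like Team B here, are treated as being in line with expectation.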

Some Observations

- Those at Villa Park should be worried as they are underperforming their expectation by 4.9 pts, yet remain at the bottom of the league.
- The mighty Watford are doing well and are in line with their statistical expectations.
- It’s well documented how well Leicester have been doing, and at some point they will surely have a blip. They are currently the biggest overachievers with a difference of +5.8 pts, and seem to be strong contenders for a Champions League spot.
- Tottenham are a real surprise as they seem to be underachieving their expected Pythagorean points total. This suggests that if they get a bit ‘luckier’, they could challenge for the title!