Every student of statistics I know has at least once thought about making easy money by predicting the stock market or by predicting sports results. To be honest, I was certainly no exception. Intimidated by the uncapped randomness of the stock market, I always leaned more towards the second option, sports betting. Nevertheless, it was not until recently that I asked the guys from the team and we decided to actually try whether we can make an easy living from predicting football results. [For impatient readers: it looks promising, but it is not so easy.]
In the beginning we knew absolutely nothing about how betting works, where to find the data, or what the standard prediction methods are. So let me walk you through what we have learned.
Football betting
The most common bet in football is the so-called 1×2, decoded as a win of the home team, a tie, and a win of the away team. For each match, bookmakers present odds for each of the possible outcomes. These odds reflect the probabilities of each outcome as seen by the bookmaker, plus some small margin making the bookie’s living. Because we used my account at Bwin, we worked with odds in decimal format, which relate to implied probability as probability = 1/odds (so odds of 2 mean 50%, odds of 2.5 mean 40%, and so on). The game is very easy: if I place a bet on an outcome with odds of 2.5 and I win, I get 2.5 times the bet, and zero otherwise. Tempting, right?
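As a small illustration of the odds-to-probability relation (the odds below are made up), the implied probabilities and the bookmaker’s margin can be computed like this:

```python
# hypothetical decimal odds for a single 1x2 market
odds = {'home_win': 2.10, 'draw': 3.40, 'away_win': 3.80}

# implied probability is the reciprocal of the decimal odds
implied = {outcome: 1 / o for outcome, o in odds.items()}

# the implied probabilities sum to slightly more than 1;
# the excess is the bookmaker's margin
margin = sum(implied.values()) - 1
print(implied)
print(round(margin, 4))
```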
Statistical approach
The simplest way to pick the right bet is by using the expected value. It works as follows: when I place a bet of 100 on a win of the home team with odds of 4.0, and I estimate the probability that the home team actually wins at 0.3, the expected value of my bet is 120 (100 * 4.0 * 0.3). This is good because 120 is bigger than 100, so I am actually winning some money. The term expected value is a bit tricky because in fact I don’t expect to win 120; I rather expect to either win 400 or lose 100. But statistically, in the long run, making hundreds of similar bets, I expect to get back 120 per such bet on average. But that’s not the point here.
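The calculation above, with the same numbers as in the text, in a couple of lines:

```python
def expected_value(stake, odds, probability):
    # expected gross payoff: win stake * odds with the given probability,
    # get nothing otherwise
    return stake * odds * probability

# a 100 bet at odds 4.0 with an estimated win probability of 0.3
print(expected_value(100, 4.0, 0.3))  # 120.0
```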
So now the task can be broken down into three parts:
 how to estimate the probability of each outcome
 how to decide how much money to place (100 or rather 10?)
 and at what threshold I should place the bet (is an expected value of 120 on a 100 bet enough?).
First we need some historical data.
Where to get the data
One of the first links in our rather unsophisticated search for “football data” was http://www.football-data.co.uk/, which turned out to be just perfect for our needs. It offers weekly updated fixtures, results, odds and many other match statistics for all major European leagues, going back up to 20 seasons.
The files are nicely organised, so it is very convenient to get data for specified leagues and seasons. For example, the following code gets data for the current season (2016/17) of the Premier League. Please note that all the coding was done in Python 3.5.
```python
import posixpath

import pandas as pd

# base URL of the CSV files (not shown in the original snippet, assumed here)
root_url = 'http://www.football-data.co.uk/mmz4281'
season = '1617'
league = 'E0'  # Premier League
data_file_suffix = '.csv'

remote_file = posixpath.join(root_url, season, '{0}{1}'.format(league, data_file_suffix))
data = pd.read_csv(remote_file)
print(data.shape)
```
Predicting match results
Once we had the data, we could focus on building a predictive model for match outcomes. One of the first models for this was developed by Maher [1]. Imagine a match played by team i at home against team j away. Maher modelled the number of goals scored by the home team as a variable X_{ij} with a Poisson distribution, and similarly the number of goals scored by the away team as Y_{ij}. Each variable is driven by its own parameter (λ and μ respectively), and the two are assumed to be independent.
The number of goals scored by the home team (X_{ij}) is driven by the attacking strength of the home team α_{i}, the defensive strength of the away team β_{j}, and some home field advantage γ. On the other hand, the number of goals scored by the away team (Y_{ij}) is driven by the attacking strength of the away team (α_{j}) and the defensive strength of the home team (β_{i}). Or, in other words, by a nice formula:
P(X_{ij} = x, Y_{ij} = y | λ, μ) = Poisson(x; λ) · Poisson(y; μ),
where λ = γ · α_{i}β_{j} and μ = α_{j}β_{i}.
Practically, it means that we take some relevant historical results and use them to estimate the attacking strength α and defensive strength β of each team, together with the general home field advantage γ, by maximum likelihood.
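As an illustration, assuming the parameters have already been estimated (the parameter values below are invented), the probability of a particular scoreline under Maher’s model could be computed as:

```python
from scipy.stats import poisson

def maher_score_probability(x, y, alpha_home, beta_away, alpha_away, beta_home, gamma):
    # P(home scores x, away scores y) under the independent-Poisson model
    lam = gamma * alpha_home * beta_away  # home scoring rate
    mu = alpha_away * beta_home           # away scoring rate
    return poisson.pmf(x, lam) * poisson.pmf(y, mu)

# invented parameter values, for illustration only
p = maher_score_probability(2, 1, alpha_home=1.3, beta_away=0.9,
                            alpha_away=1.0, beta_home=0.8, gamma=1.4)
print(p)
```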
Maher’s model was further improved by Dixon and Coles [2]. Firstly, they realised that in reality the low-score ties (0-0, 1-1) happen more often, and the results 0-1 and 1-0 less often, than they should according to Maher’s model. To correct for that, they introduced a function τ(x, y, λ, μ, ρ) defined as follows. Note that we get Maher’s model back for ρ = 0.
```python
def tau_function(x, y, lambdaa, mu, rho):
    # Dixon-Coles correction for the four low-score outcomes
    if x == 0 and y == 0:
        tau = 1 - lambdaa * mu * rho
    elif x == 0 and y == 1:
        tau = 1 + lambdaa * rho
    elif x == 1 and y == 0:
        tau = 1 + mu * rho
    elif x == 1 and y == 1:
        tau = 1 - rho
    else:
        tau = 1
    return tau
```
The second improvement: they considered it unrealistic that the attacking and defensive strengths do not evolve in time, so they introduced time-variant versions of the parameters, α_{it} and β_{it}. Technically, they used exponential forgetting, so that recent results matter more for the model fitting than historical ones. How quickly the importance of historical results decreases, in other words how quickly they are forgotten, depends on a parameter ε. And again, we get Maher’s time-invariant model for ε = 0.
```python
import numpy as np

def get_time_weights(dates, epsilon):
    delta_days = [(max(dates) - d).days for d in dates]
    # future games not relevant, older games forgotten exponentially
    return list(map(lambda x: 0 if x < 0 else np.exp(-epsilon * x), delta_days))
```
When we put all of that together, we can calculate maximum likelihood estimates for the parameters of the following model:
P(X_{ij} = x, Y_{ij} = y | λ, μ, ρ) = τ(x, y, λ, μ, ρ) · Poisson(x; λ) · Poisson(y; μ),
where λ = γ · α_{i}β_{j} and μ = α_{j}β_{i}, and all α’s and β’s vary in time. Because we don’t want to optimise a product of exponential functions, we will work with the log-likelihood function, and because it is easier to minimise than maximise (using scipy.optimize.minimize), we will find the estimates by minimising the negative log-likelihood function.
The log-likelihood function looks something like this:
```python
def time_ln_likelihood(values):
    # weighted log-likelihood of the model; the constant log(x!) Poisson terms
    # are omitted since they do not affect the optimization
    return sum([value['time_weights'] *
                (np.log(tau_function(value['home_goals'], value['away_goals'],
                                     value['lambda'], value['mu'], value['rho']))
                 - value['lambda'] + value['home_goals'] * np.log(value['lambda'])
                 - value['mu'] + value['away_goals'] * np.log(value['mu']))
                for value in values])
```
Before the optimization we need to make sure the model is identified, by adding the following constraint. The identification condition ensures that the log-likelihood has a unique global maximum.
```python
def norm_alphas(params, number_of_teams):
    # identification constraint: the attacking strengths must average to 1
    return sum(params[:number_of_teams]) / number_of_teams - 1
```
And once that is done we can run the optimization, where we address teams by IDs rather than names and x0 is an array with the initial set of parameter estimates.
```python
import scipy.optimize as op

# optimize the negative log-likelihood
minimize_result = op.minimize(
    fun=lambda *args: -time_ln_likelihood(*args),
    x0=x0,
    args=(input_data_frame['HomeId'].tolist(),   # home teams
          input_data_frame['AwayId'].tolist(),   # away teams
          input_data_frame['FTHG'].tolist(),     # home goals
          input_data_frame['FTAG'].tolist(),     # away goals
          number_of_teams,
          get_time_weights(input_data_frame['Date'].tolist(), epsilon)),
    constraints=({'type': 'eq',
                  'fun': lambda *args: norm_alphas(*args),
                  'args': [number_of_teams]}))
```
This will give us all the parameters needed for our model, so we can easily calculate the probability of each possible score for any future match. If we have a matrix game_probabilities of all possible match results (goals of the home team in rows, goals of the away team in columns), then the following simple aggregation gives us the probabilities of the 1×2 outcomes.
```python
win_probability = sum(sum(np.tril(game_probabilities, -1)))  # lower triangle for home win
draw_probability = game_probabilities.trace()                # diagonal for draw
loss_probability = sum(sum(np.triu(game_probabilities, 1)))  # upper triangle for home loss
```
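For completeness, here is a sketch of how such a game_probabilities matrix could be built from estimated λ, μ and ρ (the parameter values below are made up): an outer product of the two Poisson distributions with the Dixon-Coles correction applied to the four low-score cells.

```python
import numpy as np
from scipy.stats import poisson

def score_matrix(lam, mu, rho, max_goals=10):
    # P(home goals = row, away goals = column) under the Dixon-Coles model
    home = poisson.pmf(np.arange(max_goals + 1), lam)
    away = poisson.pmf(np.arange(max_goals + 1), mu)
    probs = np.outer(home, away)
    # tau correction of the four low-score outcomes
    probs[0, 0] *= 1 - lam * mu * rho
    probs[0, 1] *= 1 + lam * rho
    probs[1, 0] *= 1 + mu * rho
    probs[1, 1] *= 1 - rho
    return probs

game_probabilities = score_matrix(lam=1.6, mu=1.1, rho=-0.05)
print(game_probabilities.sum())  # close to 1 (the tail above max_goals is cut off)
```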
Betting strategy
Having mentioned the expected value approach to betting, let me stress that a good betting strategy is not about correctly predicting match results. It is about finding opportunities where our probability is higher than the bookie’s probability implied by the odds. By definition, these opportunities typically lie in low-probability outcomes. So we do not expect to bet on favourites, where the odds are very low, but rather on outsiders.
How much money to bet and when to actually place the bet is called a betting strategy. We decided to try three different methods to calculate the size of a bet:
 Fixed: betting a fixed amount of money scale_constant on each match
 Kelly’s [3]: bankroll * ((probability * odds - 1) / (odds - 1)), where bankroll is simply the overall budget, which reflects the profits and losses over time
 Variance-adjusted: min(bankroll, scale_constant / (2 * odds * (1 - probability)))
The threshold for placing a bet controls how risky the bets we make are. The more bets we make, the closer we should theoretically get to the expected value. And because the number of accepted bets declines as the threshold grows, the final payoff becomes more volatile.
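The three stake-sizing rules and the threshold check can be sketched as follows (the function names and the numbers in the example are our own invention):

```python
def bet_size(strategy, bankroll, scale_constant, probability, odds):
    # stake size under the three strategies described above
    if strategy == 'fixed_bet':
        return scale_constant
    if strategy == 'Kelly':
        return bankroll * (probability * odds - 1) / (odds - 1)
    if strategy == 'variance_adjusted':
        return min(bankroll, scale_constant / (2 * odds * (1 - probability)))
    raise ValueError('unknown strategy: ' + strategy)

def should_bet(probability, odds, bet_threshold):
    # bet only when the expected value per unit staked exceeds the threshold
    return probability * odds > bet_threshold

# made-up example: our model gives 30% to an outcome priced at odds 4.0
if should_bet(0.3, 4.0, bet_threshold=1.1):
    print(bet_size('Kelly', bankroll=1000, scale_constant=100,
                   probability=0.3, odds=4.0))
```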
What works best
At this stage we have all the building blocks we need, so the next step is to give it a try. We decided to test our solution on the Premier League because it was the first data set on http://www.football-data.co.uk/. To determine which prediction method and which betting strategy to use, we ran a grid search, which simply tried all specified parameter settings and stored how well they performed. The grid consisted of the following 4 800 combinations.
```python
from numpy import append, arange

MODEL_FITTING_GRID = {
    "method": ['DixonColes', 'Maher'],
    "epsilon": append(arange(.0005, .00225, .00025).tolist(), [0.0])}
BETTING_GRID = {
    "betting_strategy": ['fixed_bet', 'Kelly', 'variance_adjusted'],
    "bet_threshold": arange(1.05, 1.55, .005)}
```
We tested this grid using a rolling prediction over season 2015/2016, taking data back to 2013 and also considering The Championship (the second highest league in England), so that matches of teams promoted or relegated over time were included. The quantity we wanted to maximize is a simple return on investment (ROI), in our case (total payoffs - total bets) / total bets.
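The ROI formula from the text as a trivial helper, with made-up season totals:

```python
def roi(total_payoffs, total_bets):
    # return on investment as defined above
    return (total_payoffs - total_bets) / total_bets

# made-up totals: 5300 paid out on 5000 staked
print(roi(5300, 5000))  # 0.06
```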
The winning combination turned out to be the time-variant Dixon-Coles model with epsilon equal to 0.002, Kelly’s betting strategy and a bet threshold of 1.1. Truth be told, the best-performing threshold was higher, but we wanted to place more bets.
So how is this combination working in the current season?
We can see that we started off really well but then got hit by a series of unsuccessful rounds in which we lost almost 50% of our initial budget. A few recent rounds have put us back in the game, but we are still far from “easily making money”.
When I mentioned this exercise to my college friends, who have been dabbling in sports betting, they told me that 1×2 bets on the Premier League at Bwin are the hard mode of sports betting. So if you really want to give it a try, don’t follow us; try to find some niche league and bet on the number of goals, for instance.
In the light of that information we tried our approach on a few other leagues, namely the German Bundesliga, French Ligue 1, Belgian Jupiler League and Portuguese Primeira Liga. Again we ran the grid search over past seasons and chose the combination of parameters with the highest ROI and a reasonable number of bets.
The best combination for the Bundesliga was the time-variant Maher’s model with epsilon equal to 0.00075, Kelly’s betting strategy and a bet threshold of 1.47. This had an ROI of 0.21. And the performance in the current season so far?
For Ligue 1 we chose the time-variant Maher’s model with epsilon of 0.00175, Kelly’s betting strategy and a threshold of 1.41, with an ROI of 0.54 in season 2015/2016.
For the Belgian Jupiler League we decided on the time-variant Maher’s model with epsilon of 0.00175, Kelly’s betting strategy and a threshold of 1.44, with an ROI of 0.19 in season 2015/2016.
And finally the Primeira Liga, with the time-variant Maher’s model, epsilon of 0.00225, Kelly’s betting strategy and a threshold of 1.11, which was losing the least in 2015/2016, with an ROI of -0.09.
Final thoughts
As we can see, the models can win some money, though with huge volatility. That is positive news, given we only tried basic statistical approaches. In the future one might play with machine learning approaches, or add more predictors into the model, e.g. weather or even data on individual players.
On the other hand, we haven’t dealt with automatically placing bets in higher volumes, which is often a very tricky part of this business if you don’t want to spend hours placing bets by hand.
So, all in all, despite all the fun we had with this, we’ll probably stick to our jobs for the moment. What about you, won’t you give it a try?
Big thanks goes to my friends Martin and Michal from aLook Analytics who turned my simple Python scripts into a fully automated betting tool.
I would also like to mention three great online sources of information and data:
 Historical football data http://www.football-data.co.uk/
 Short paper by H. Langseth with all important knowledge https://www.idi.ntnu.no/~helgel/papers/LangsethSCAI14.pdf
 Great blog about sports betting and much more http://opisthokonta.net/
References
[1] Mike J. Maher. Modelling association football scores. Statistica Neerlandica, 36(3):109–118, 1982
[2] Mark J. Dixon and Stuart G. Coles. Modeling association football scores and inefficiencies in the football betting market. Applied Statistics, 46:265–280, 1997
[3] John L. Kelly. A new interpretation of information rate. IRE Transactions on Information Theory, 2(3):185–189, 1956