Blog

Predicting football results in 2016-2017 with machine learning - Bayesian hierarchical modelling

Jun 27, 2017 5 min read

And so we come to the end of another season of football and, more importantly, Predictaball! This season has seen several large updates that I was meaning to detail at the start of the season, but life got in the way: the predictive model is now fully Bayesian; I’ve added a betting system that identifies value bets; and I’ve expanded it to include the 3 other main European leagues (La Liga, Serie A, and the Bundesliga). Rather than detailing these new aspects as well as summarising the season’s performance in one massive blog, I’ll split this into two parts.

Simulating win probabilities of the CamelUp boardgame

May 20, 2017 7 min read

Camel Up is a deceptively simple board game in which the aim is to predict the outcome of a camel race. I’ll quickly try to explain the game now, although it’s always hard to explain a boardgame without an actual demonstration. The camel movement is randomly generated from dice rolls as follows. Five dice coloured for each of the five camels, each labelled with the numbers 1-3 twice, are placed into a container (decorated as a pyramid, since the game is set in Egypt), which is then shaken.
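As a rough illustration of the dice mechanism described above, here is a minimal Python sketch, not the code from the post. The colour names are placeholders, and the step where the dice leave the pyramid one at a time in a random order is an assumption taken from the game's rules rather than from the excerpt.

```python
import random

# Illustrative colour names; the excerpt only says there are five camels.
CAMELS = ["blue", "green", "orange", "yellow", "white"]

# Each die is labelled with the numbers 1-3 twice.
DIE_FACES = [1, 1, 2, 2, 3, 3]


def roll_leg(rng):
    """Simulate one pass through all five dice.

    Returns a list of (camel, roll) pairs in the order the dice come out.
    Assumption: dice leave the pyramid one at a time, in random order,
    and each die is rolled exactly once per pass.
    """
    order = CAMELS[:]
    rng.shuffle(order)
    return [(camel, rng.choice(DIE_FACES)) for camel in order]


if __name__ == "__main__":
    rng = random.Random(2017)
    for camel, roll in roll_leg(rng):
        print(f"{camel} moves {roll}")
```

Repeating `roll_leg` many times and tracking camel positions would be one way to estimate win probabilities by simulation.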

Building a 2D No Man's Sky - NASA Space Apps

May 5, 2017 5 min read

I’ve never really been much of a hacker; I much prefer to think my projects through entirely and plan them out with pen and paper before starting to write any code. As such, I’ve never really had much interest in hackathons. With a bit of apprehension, then, I participated in my first one over the weekend. The particular event was NASA Space Apps, where NASA provide lots of data and offer challenges related to modelling certain natural phenomena, providing data visualisation, or prototyping hardware tools that fit a particular niche.

An interactive Multi-State Modelling Shiny web app

Last updated on Oct 6, 2017 3 min read

In the last couple of months I’ve been teaching myself about multi-state survival models for use in an upcoming project. While I found the theoretical concepts relatively straightforward, I started having issues when I began implementing the models in software. There are many considerations to be made when building a multi-state model, such as: converting the data into a suitable long format; deciding whether to use parametric or semi-parametric models; selecting different subsets of the available covariates for each of the transition hazards; optionally forcing covariates to have the same hazard ratio on every transition; choosing between clock-forward and clock-reset (semi-Markov) time-scales; deciding whether to further violate the Markov assumption by including the state arrival times as part of the transition hazard, which often has theoretical justification; and deciding whether to keep the baseline hazards stratified by transition or to assume that certain ones are proportional. Needless to say, actually building a model was very time-consuming.

Predicting AFL results with hierarchical Bayesian models using JAGS

Apr 5, 2017 21 min read

I’ve recently expanded my hierarchical Bayesian football (aka soccer) prediction framework to predict the results of Australian Rules Football (AFL) matches. I have no personal interest in AFL; instead, I got involved through an email sent to a statistics mailing list advertising a competition held by Monash University in Melbourne. Sensing an opportunity to quickly adapt my soccer prediction method to AFL results and to compare my technique to others, I decided to get involved.

Guide to publishing R packages on CRAN

Sep 27, 2016 1 min read

I recently gave a talk at my university’s R User group on how to publish packages to CRAN (slides here). This isn’t an easy topic to distill into a 60-minute slot, and so I had to abandon my original idea of a hands-on workshop with examples in favour of a condensed summary of the main challenges in the submission process. This mostly focused on the issue of namespaces, since this is a rather complex topic to understand if you’re coming from a non-software-engineering background, as it doesn’t come up in day-to-day statistical analysis.

Is La Liga the most predictable European football league?

Jul 23, 2016 11 min read

I’ve always been curious to know if any of the 4 major European leagues (Serie A, Bundesliga, Premiership, La Liga) are more predictable than others. La Liga certainly has a reputation as being dull and predictable, although this is due to the sheer dominance of Barcelona and Real Madrid in recent years. I’ve increased my database of football matches in order to improve my football prediction bot this summer, and so now have sufficient data to investigate.

An R package for estimating disease prevalence by simulation: rprev

Jun 8, 2016 3 min read

At ECSG (Epidemiology and Cancer Statistics Group), we primarily work with myeloid and lymphoid disease registries. Resulting from our successful collaborative research project - HMRN (Haematological Malignancy Research Network) - we have access to a large observational dataset of haematological malignancies across Yorkshire. From this we can estimate various measures of interest, such as the effect of standard demographic factors (mainly age and sex) on incidence rates, any longitudinal incidence trends, in addition to numerous statistics related to survival, for example noting any clinical or demographic factors associated with a high risk level.

Predictaball end of season review for 2015-2016

May 29, 2016 3 min read

This post summarises Predictaball’s performance in the 2015-2016 season. I’ll look at overall performance, accuracy per week, how it fared in terms of making profit, and finally the annual comparison with Lawro. Compared to last year, when it achieved 48% overall, Predictaball has fared less well this season with 43%. This isn’t particularly surprising, since this season has been full of upsets to say the least, with Leicester beating out the traditional top four for the title, and Spurs doing their best to break the monopoly (despite failing in typical Spurs fashion).

Generating Iron Maiden lyrics with Markov chains

Apr 19, 2016 9 min read

I’ve been wanting to play with Markov Chains for a while now, and now that I’m starting to get into Bayesian analysis I’m going to need to use them more often. One fun use of them is to generate text which can (at a stretch) pass as written by a drunk person. For nice examples of them in action have a look at Garkov (generated Garfield strips) or even an entire subreddit generated with them reddit.
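To make the idea of generating text with a Markov chain concrete, here is a minimal word-level sketch in Python. It is an illustration rather than the code from the post: the function names `build_chain` and `generate`, the `order` parameter, and the placeholder corpus are all assumptions.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    # Placeholder corpus; a real run would use a much larger body of text.
    corpus = "the quick fox jumps over the lazy dog and the quick dog sleeps"
    print(generate(build_chain(corpus), length=12, seed=1))
```

Storing repeated followers in a list means frequent transitions are sampled more often, so no explicit probability table is needed. A larger `order` tends to give more coherent output but copies longer runs from the source text.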