Blog

While writing the previous post on the two ‘cultures’ of statistical modeling for prediction and inference, I realised that I was glossing over an extremely important area of predictive modeling, and, judging by frequent StackExchange posts, one that is often misunderstood. As you will have surmised from the title, I’m talking about cross-validation. If done correctly, cross-validation (CV) provides a thorough assessment of a predictive model, giving you: unbiased, publishable results; a means of selecting the final model instance to use for your application; and an accurate estimate of the model’s performance on future data.
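As a rough illustration of the idea, here is a minimal k-fold cross-validation sketch using only the standard library. Everything in it is an assumption made for illustration: the helper names (`k_fold_indices`, `cross_validate`), the toy mean-predicting model and the synthetic data are hypothetical and are not taken from the post itself.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, score, k=5, seed=0):
    """Return one held-out score per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, so the held-out fold stays unseen.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in held_out]
        scores.append(score([ys[i] for i in held_out], preds))
    return scores


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return mean(ys)


def predict_mean(model, x):
    return model


def mse(y_true, y_pred):
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores = cross_validate(xs, ys, fit_mean, predict_mean, mse, k=5)
print(fold_scores, mean(fold_scores))
```

Averaging the per-fold scores gives the estimate of performance on future data; the spread across folds hints at how stable that estimate is.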

“Two Cultures”

One aspect of statistical modeling which can be taken for granted by those with a bit of experience, but may not be immediately obvious to newcomers, is the difference between modeling for explanation and modeling for prediction. When you’re a newbie to modeling you may think that this only has an effect on how you interpret your results and what conclusions you’re aiming to make, but it has a far bigger impact than that, from influencing the way you form the models, to the types of learning algorithms you use, and even how you evaluate their performance.

I’ve tinkered around with Predictaball a bit recently in an effort to increase its accuracy, with the overall goal of beating Paul Merson and Lawro so that I can claim ‘human competitiveness’. I’ve mentioned in previous posts that I envisage 2 potential ways to achieve this: including more player data, and incorporating bookies’ odds. Adding more player data (such as a variable for each player indicating whether they are in the squad or not) would allow the model to account for situations when a player who is strongly associated with the team winning is now injured - for an example see City’s abysmal record when Kompany isn’t playing.
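To make the per-player variable concrete, the sketch below builds one 0/1 indicator per player for a given matchday squad. The function name `squad_indicators`, the feature-name prefix and the placeholder player names other than Kompany are all hypothetical; this is not how Predictaball itself encodes squads.

```python
def squad_indicators(squad, all_players):
    """Return a 0/1 feature per known player: 1 if they are in this squad."""
    in_squad = set(squad)
    return {f"in_squad_{player}": int(player in in_squad)
            for player in all_players}


# Placeholder roster; only "Kompany" comes from the text above.
all_players = ["Kompany", "player_a", "player_b"]
features = squad_indicators(["player_a", "player_b"], all_players)
print(features)
```

A model given these features can learn a separate effect for each player's absence, which is the situation described for Kompany.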

It’s been a while since I’ve posted anything as I’ve spent my summer in a thesis-related haze, which I’m starting to come out of now, so expect more frequent updates - particularly as I work my way through the backlog of ideas I’ve been meaning to write about. I’ll start with assessing Predictaball’s performance last season. Just to summarise, this was a classification task attempting to predict the outcome (W/L/D) of every Premier League match from the end of September onwards.

One thing that struck me as odd when I was studying for my teaching award was the way in which teachers must consider the varied learning styles of their students when planning lessons. Clearly this is common sense and good teaching practice: what works for one person does not necessarily work for another. The issue I have with it, however, is that the students themselves, in my experience, are never taught to consider their own learning method despite it having a huge impact upon an individual’s performance.

It’s gradually getting closer to the three-year PhD deadline by which I intend to submit, meaning I’ve got two and a half months to not only finish up my experiments but also write up my entire thesis. To help motivate myself to work on this huge document (and definitely not as a form of procrastination) I’ve started recording my writing progress and am publicly displaying the data here. The idea is that I won’t want people (family, supervisors, colleagues) to notice that I’m slacking.

Today I gave a talk introducing Python to early-stage researchers in my department. It’s always hard deciding what material to include in an hour’s talk, particularly when the subject material is so vast. This wasn’t helped by the fact that in my department there is a large range of programming experience, from researchers with backgrounds in Computer Science to Electronic Engineers who are only comfortable with Matlab. I attempted to address both of these groups by introducing Python as a language in terms of its syntax, data structures and control flow, before discussing how you can emulate Matlab by using the SciPy stack.

Now that I’ve finished my teaching qualification (the York Learning and Teaching award) I’ve had some time to get back into research. I’ve been updating various bits of software that I haven’t used much over the last month or so, including upgrading PyPy from version 2.3 to 2.5, skipping a version in the process. I expected that I might get a few speed bonuses but that there wouldn’t be a significant improvement from 2.

Receiver Operating Characteristic (ROC) curves are increasingly used in machine learning as they offer a valuable insight into how your model is performing that isn’t captured by log-loss alone, facilitating diagnosis of any issues. I won’t go into much detail about what ROC actually is here, as this post is mainly intended to help people looking for a MAUC Python implementation. If, however, you are looking for an overview of ROC then I’d recommend Fawcett’s tutorial here.
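For orientation, the sketch below assumes that MAUC refers to the Hand and Till multi-class AUC (the average of pairwise AUCs over all class pairs). It is a minimal standard-library illustration, not the implementation this post links to; the function names and the toy W/D/L data are hypothetical.

```python
from itertools import combinations


def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Runs in O(len(pos) * len(neg)).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mauc(labels, probs, classes):
    """Average pairwise AUC; probs is a list of {class: probability} dicts.

    Assumes every class in `classes` appears at least once in `labels`.
    """
    pair_aucs = []
    for i, j in combinations(classes, 2):
        # A(i|j): rank class-i and class-j samples by their class-i probability.
        a_ij = auc([p[i] for y, p in zip(labels, probs) if y == i],
                   [p[i] for y, p in zip(labels, probs) if y == j])
        # A(j|i): the same pair, ranked by class-j probability instead.
        a_ji = auc([p[j] for y, p in zip(labels, probs) if y == j],
                   [p[j] for y, p in zip(labels, probs) if y == i])
        pair_aucs.append((a_ij + a_ji) / 2.0)
    return sum(pair_aucs) / len(pair_aucs)


# Toy example with perfectly separated predictions.
labels = ["W", "D", "L", "W"]
probs = [
    {"W": 0.8, "D": 0.1, "L": 0.1},
    {"W": 0.1, "D": 0.8, "L": 0.1},
    {"W": 0.1, "D": 0.1, "L": 0.8},
    {"W": 0.8, "D": 0.1, "L": 0.1},
]
score = mauc(labels, probs, ["W", "D", "L"])
print(score)
```

The nested loop in `auc` is quadratic in the number of samples; a rank-based computation would be the usual choice for larger datasets.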

I haven’t blogged in a while, mainly because I’ve been so busy with teaching work. It’s fantastic experience and very rewarding, but at the same time I find myself sometimes wishing I had more time to do my research, especially now that I’m in my third year. The other day I came across a very well done hierarchical Bayesian modelling approach for football games. I’ve been thinking a lot recently about what area I want to go into for my first post-doc research, and learning standard statistical techniques (including Bayesian methods) is something I’ve been considering.
