HOF prediction model – 2015

Like many baseball fans, I spent the weeks leading up to the hall-of-fame voting obsessing over the publicly released ballots, and especially the data gathered by Ryan Thibodaux (@NotMrTibbs) in his hall-of-fame tracker. This post describes how I used the data provided by Ryan to make a hall-of-fame voting prediction model. The model grew in part out of discussions on Tom Tango’s blog here, and I’ll comment on some of that below. All of my code is available on github here:
https://github.com/bdilday/hofTracker

The basic idea behind the model is to take a linear combination of the public ballots to predict the public + non-public ballot overall results. I downloaded Ryan’s HOF tracker data going back to 2011 and used this for training the model. There are a number of changes between 2011 and the present that affect the choice of which data to use.

The number of voters that make their votes public has grown year by year, which means the further back in time you go to draw your training set from, the fewer voters you’ll have available. The ballot has also changed a lot, as a flood of worthy players, most with real-or-imagined PED issues, have come onto it. In addition, the Maddux-Thomas-Glavine group doesn’t really have a precedent pre-2014, but does have an analog in the Johnson-Pedro-Smoltz group for 2015. This means training on 2012 data to predict 2014 doesn’t necessarily give a good representation of how the model will perform on the 2015 data. Additionally, I found that using 2013+2014 data to predict 2015 gave results for Johnson-Pedro-Smoltz that were so low they just didn’t pass the sniff test, so in the end I used only 2014 data to predict 2015.

The model I used was linear regression; in other words, the prediction for each player’s fraction of the vote is

$p_j = \sum_i a_i v_{ij}$

where $v_{ij}$ is the result for voter $i$ voting on player $j$, and the $a_i$ are the fitted coefficients.
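In matrix form this is just a dot product of the voter weights with the ballot matrix. A minimal sketch with made-up toy numbers (not the real ballot data):

```python
import numpy as np

# Toy ballot matrix: rows are voters, columns are players;
# v[i, j] = 1 if voter i voted for player j, else 0.
v = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])

# Made-up fitted weights a_i, one per voter.
a = np.array([0.4, 0.3, 0.3])

# Predicted vote fraction for each player: p_j = sum_i a_i * v_ij
p = a @ v
print(p)  # [0.7 0.6 0.7]
```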

There were two main things that came out of the discussion on Tango’s blog. One was that a couple of people suggested that I should not use a linear model for a probability, and that I would be better off transforming to the log-odds and fitting a linear model to that. I tried this and it gave me worse results, so I abandoned it. A related thing that I did do, however, was to weight each observation according to $1/(x_j (1-x_j))$; that is, for a binomial distribution the variance is $\sigma^2 = x (1-x)/N$, and I weighted each observation by the inverse of its variance (N is the same for every data point). In practice I did this by making M copies of each data point, where M is an integer proportional to the inverse of the variance, while also adding a bit of noise to each copy of the target variable (which is the actual percentage obtained by the player). In a weighted least squares that wouldn’t be necessary; the only reason for the kludge is that the scikit-learn routine I was using doesn’t support weighted linear regression; more on that below.
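The replication kludge can be sketched like this; the function name, the cap on the number of copies, and the noise scale are my assumptions for illustration, not the author’s exact code:

```python
import numpy as np

rng = np.random.default_rng(0)

def replicate_by_weight(X, y, max_copies=20, noise_scale=0.005):
    """Emulate weighted least squares by replicating data points.

    Each observation gets M copies, with M proportional to the inverse
    binomial variance 1 / (y * (1 - y)), plus a bit of noise added to
    each copy of the target. Cap and noise scale are illustrative.
    """
    w = 1.0 / (y * (1.0 - y))                                  # inverse variance
    M = np.maximum(1, np.round(max_copies * w / w.max()).astype(int))
    X_rep = np.repeat(X, M, axis=0)                            # M_i copies of row i
    y_rep = np.repeat(y, M) + rng.normal(0, noise_scale, size=M.sum())
    return X_rep, y_rep

# A player near 50% has the largest variance, so it gets the fewest copies.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([0.5, 0.9])
X_rep, y_rep = replicate_by_weight(X_demo, y_demo)
```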

The second thing was that the public ballots are already in the bank, and my first iteration didn’t account for this. What I did at first was to fit the actual results (from 2014) as a linear combination of public ballots (from 2013); the better thing is to subtract the public ballots from the total ballots, which gives me the private ballots (for 2014), fit the public ballots (from 2013) to those private ballots, and then finally add back in the ballots in the bank. This is a very convenient way to account for sample size, and as a natural consequence it makes it more and more difficult for the model to deviate from the average of public ballots as the fraction of public ballots approaches 100%.
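Here is the accounting for a single player, with made-up counts (all of the numbers and variable names are illustrative):

```python
# Illustrative single-player accounting with made-up numbers.
n_total = 500            # total ballots cast
n_public = 150           # ballots made public
public_votes = 90        # yes votes among the public ballots
total_votes = 280        # yes votes overall (known for training years)

# Training target: the private-ballot vote fraction.
private_frac = (total_votes - public_votes) / (n_total - n_public)

# At prediction time the model outputs a private-ballot fraction, and
# the public ballots are added back in "from the bank".
predicted_private_frac = 0.55    # model output (made up)
predicted_total = (public_votes
                   + predicted_private_frac * (n_total - n_public)) / n_total
print(round(predicted_total, 6))  # 0.565
```

Note how the bank term dominates as n_public approaches n_total: with every ballot public, the prediction is pinned to the known public average.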

For this particular data analysis, the number of voters in the training sample is comparable to, or even larger than, the number of players (the target variables), so regression doesn’t make sense without regularization. And if you’re going to use regularization, then it makes sense to use cross-validation to determine the amount of regularization. I used the RidgeCV fitter from the scikit-learn Python package. This worked well for this problem since it gives an easy interface to regularized (Ridge) regression, automatically including a fancy cross-validation procedure. The downside is that it doesn’t support weighted linear regression (i.e. it can’t account for heteroscedasticity), which is why I ended up kludging the data by making copies of the more influential data points (as mentioned above). Part of the lesson I got from messing around with different samples (e.g. 2012+2013 vs only 2013 to predict 2014, or considering only players whose public-ballot result was above a certain threshold in the fit…) was that the results can vary quite a lot depending on the subset of voters+players you choose. Therefore, I also used bootstrapping to estimate uncertainties on my predictions. As opposed to the cross-validation, I did code this part myself, which basically meant doing a loop, choosing a random sample of voters (“sampling with replacement”), and repeating the fit procedure. For my final results, I used samples the same size as the full pool of voters, and something like 1000 iterations (256 vs 1024 vs 2048 really doesn’t make much difference).
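A sketch of the fit-plus-bootstrap loop on synthetic stand-in data (the real model trained on Ryan’s ballot data; the sizes, seed, and alpha grid here are my assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)

# Synthetic stand-in: 40 voters (features), 30 players (observations).
n_voters, n_players = 40, 30
X = rng.integers(0, 2, size=(n_players, n_voters)).astype(float)
y = X.mean(axis=1) + rng.normal(0, 0.02, size=n_players)

alphas = np.logspace(-3, 3, 13)

# Ridge regression, cross-validating over the regularization strength.
model = RidgeCV(alphas=alphas).fit(X, y)

# Bootstrap: resample the voters with replacement, refit, and use the
# spread of the resulting predictions as the uncertainty.
n_boot = 100
preds = np.empty((n_boot, n_players))
for b in range(n_boot):
    cols = rng.integers(0, n_voters, size=n_voters)   # voters, with replacement
    m = RidgeCV(alphas=alphas).fit(X[:, cols], y)
    preds[b] = m.predict(X[:, cols])

lo, hi = np.percentile(preds, [16, 84], axis=0)       # ~1-sigma band per player
```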

So in summary, my HOF model used the public ballots provided by Ryan Thibodaux’s HOF tracker; the voters I considered were those that made their ballots public in both 2014 & 2015; I used ridge-regression with cross validation to determine the regularization parameter; I used bootstrapping to generate uncertainties on my predictions. The results were (click on the image to get a larger version),

The blue x marks show the actual results, the yellow points show my model predictions, with uncertainties determined using bootstrapping.

Here are my predictions compared against some other models I could find on the internet; if you’ve got more, please send them my way! Yes, my model has the lowest RMS error of the five, but I swear I would have written this anyway; I’m less interested in showing off than I am in learning something.

I can’t speak in great detail about every other model, but I know Tango split voters into categories based on whether they voted for Bonds/Clemens or not, and whether they had a full ballot or not, and made an inference from that. The “baseballot” model is described in detail here,

http://baseballot.blogspot.com

As far as I understand it, it is similar to what I did in that it uses a linear combination of voters, based on a fit to previous years’ results. Poznanski and Lindholm gave predictions that used the public ballot results plus some subject-matter expertise, as opposed to a statistical model.