# Principal components analysis for baseball HOF voters

As I discussed in my previous post on SportVU player-tracking data, principal components analysis (PCA) is a technique that can be used both for dimensionality reduction (describing the data effectively using fewer numbers) and for revealing something about the nature of the data.

To apply PCA to voting patterns for the baseball Hall of Fame, we can consider each player to be a “dimension” (i.e., x, y, z, etc.) and each voter to occupy a point in that space. Intuitively, we would expect that a voter who votes for Bonds (a 1 in the x direction) would usually also vote for Clemens (a 1 in the y direction). From the dimensionality-reduction point of view, this means you don’t really have to tell me both numbers (aye or nay on Bonds and aye or nay on Clemens); just tell me one and I’ll essentially know the other. On the structure-of-the-data side, this tells me something interesting about the way voters make decisions and value, or discount, different aspects of a player’s career. The Bonds-Clemens association is the most intuitive one, and PCA basically exists to systematize this logic and extend it to less obvious correlations.
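To make the "tell me one and I'll know the other" idea concrete, here is a minimal sketch of a ballot matrix and the correlation between two of its columns. The ballots are made up for illustration (they are not real votes), and the helper names `column` and `pearson` are my own.

```python
# Toy ballots, invented for illustration only (NOT real Hall of Fame votes).
# Each row is a voter, each column is a player; 1 = voted for, 0 = did not.
players = ["Bonds", "Clemens", "Smith"]
ballots = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]


def column(matrix, j):
    """Return column j of a list-of-rows matrix (one player's votes)."""
    return [row[j] for row in matrix]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


bonds = column(ballots, players.index("Bonds"))
clemens = column(ballots, players.index("Clemens"))
smith = column(ballots, players.index("Smith"))

r_bonds_clemens = pearson(bonds, clemens)
r_bonds_smith = pearson(bonds, smith)

print(f"Bonds vs Clemens: {r_bonds_clemens:+.2f}")
print(f"Bonds vs Smith:   {r_bonds_smith:+.2f}")
```

In this toy data the Bonds and Clemens columns move together while the Bonds and Smith columns move in opposite directions. PCA generalizes this pairwise check to all players at once.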

To start with, I’ll focus on the pool of publicly released ballots from last year (2015), awesomely provided by Ryan Thibodaux (@NotMrTibbs), since there’s a large pool to draw from. As more ballots come in for the current election, it is straightforward to add those or analyze them independently.

This chart shows the mean ballot (black) and the 1st (blue) and 2nd (red) principal components. It says that, after the overall mean, the most important information you can tell me about a voter is how they voted on Bonds and Clemens and, to a lesser extent, Piazza, McGwire, Mussina, Edgar, Raines, etc., that is, all the players with non-zero values on the blue line. The fact that the Bonds and Clemens values go in the same direction tells us right away that there’s a strong correlation between votes for those two. Likewise, votes for Bonds and Clemens are positively correlated with votes for McGwire and Piazza, and negatively correlated with votes for Edgar and Lee Smith. In the 2nd component, it becomes important how they voted on Schilling, Bagwell, Smith and Raines.
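For anyone who wants to see how a mean ballot and principal components can be computed, here is a rough sketch using NumPy's SVD. This is one standard way to do PCA, not necessarily the exact code behind the chart. The ballots are synthetic: I generate them from a hypothetical hidden "tolerant" trait so that Bonds and Clemens votes tend to agree, and all probabilities are invented.

```python
import numpy as np

# Synthetic ballots (NOT real votes). A hypothetical hidden trait makes some
# voters likely to vote for Bonds and Clemens together.
rng = np.random.default_rng(0)
players = ["Bonds", "Clemens", "Piazza", "Smith"]
n_voters = 200

tolerant = rng.random(n_voters) < 0.4
bonds = tolerant.astype(int)
# Clemens matches the Bonds vote about 95% of the time (made-up rate).
clemens = np.where(rng.random(n_voters) < 0.95, bonds, 1 - bonds)
piazza = (rng.random(n_voters) < np.where(tolerant, 0.85, 0.55)).astype(int)
smith = (rng.random(n_voters) < np.where(tolerant, 0.20, 0.40)).astype(int)

# Rows are voters, columns are players.
X = np.column_stack([bonds, clemens, piazza, smith]).astype(float)

# Step 1: the mean ballot (fraction of voters naming each player).
mean_ballot = X.mean(axis=0)

# Step 2: center the data and take the SVD. The rows of Vt are the
# principal components, ordered from most to least variance explained.
X_centered = X - mean_ballot
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The overall sign of a component is arbitrary; flip the first one so the
# Bonds loading is positive, which makes it easier to read.
pc1 = Vt[0] if Vt[0, 0] >= 0 else -Vt[0]

for name, m, loading in zip(players, mean_ballot, pc1):
    print(f"{name:8s} mean={m:.2f}  PC1={loading:+.2f}")
```

Players whose loadings share a sign in a component tend to be voted for together along that component, and opposite signs indicate votes that tend to trade off.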

Instead of continuing to plot higher-order components, I’ll turn the data on its side, putting the player name on the y-axis and the component number on the x-axis. Here is what that looks like:

So we can see that Bonds and Clemens are the most important elements of the first component, but don’t really factor in again until a much higher component. Because the signs of their values differ there, that is basically where a vote for one and not the other comes in.

Another interesting question to consider is how much of the variance in the data we can explain as a function of how many components we use; this is the data-reduction aspect of PCA. For this data the correspondence looks like this:

So, just by telling me the 1st component, which is more or less the vote for Bonds/Clemens, we explain about 20% of the variance. With 15 components, we explain about 95% of the variance.
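Here is a small sketch of how that variance-versus-components curve can be computed from the singular values. Again the ballots are synthetic (two hypothetical voter groups with made-up vote probabilities), so the numbers it prints will not match the 20% and 95% figures above.

```python
import numpy as np

# Synthetic ballots (NOT real votes): two hypothetical voter groups, each
# with its own invented probability of voting for each player.
rng = np.random.default_rng(1)
n_voters, n_players = 150, 10

p_group_a = rng.uniform(0.1, 0.9, n_players)
p_group_b = rng.uniform(0.1, 0.9, n_players)
in_group_a = rng.random(n_voters) < 0.5
probs = np.where(in_group_a[:, None], p_group_a, p_group_b)
X = (rng.random((n_voters, n_players)) < probs).astype(float)

# Center the data; the squared singular values are proportional to the
# variance captured by each principal component.
X_centered = X - X.mean(axis=0)
S = np.linalg.svd(X_centered, compute_uv=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
cumulative = np.cumsum(explained)  # fraction explained by the first k

# Smallest number of components whose cumulative fraction reaches 95%.
k95 = int(np.searchsorted(cumulative, 0.95) + 1)

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"components={k:2d}  this={e:.3f}  cumulative={c:.3f}")
print(f"components needed for 95% of the variance: {k95}")
```

Plotting `cumulative` against the component number gives the kind of curve described above.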

Finally, the player-by-player plot of component magnitude reminded me of the iconic Joy Division album cover

so just for fun I whipped up a similar visualization using d3.js. The code for that is available here.