## Building a VR data visualization with Statcast batted ball data

Virtual reality (VR) and its proliferation into our lives is a popular topic right now, and it is being greeted with healthy doses of both excitement and skepticism. I, for one, am bullish on VR! Having been fortunate enough to play with an HTC Vive in the office, I find that even mundane things, like mini golf, become mind-blowing experiences! An exciting use case for VR is data visualization (although a more precise description would be promoting insight and intuition about data; visualization is a powerful way of achieving this, but not necessarily the only way). Three-dimensional data visualizations have a long history of being maligned by data visualization academics and practitioners, and I think it's safe to say the consensus view is that they are gimmicky, misleading, and harmful to the goal of accurately representing data. Making compelling data visualizations in VR is going to mean more than taking a bar chart or scatter plot and putting it in 3d. What's new and different about VR, in my view, is that one gets a visceral sense of expansiveness, and successful VR data visualization will take advantage of this to give us new ways of interacting with and experiencing data. Many people are thinking about this very interesting challenge and opportunity, but no one has quite cracked it yet.

A few prominent examples of VR data visualizations are:

• The 3d Nasdaq roller coaster from the WSJ
• A d3 and aframe roller coaster from Ian Johnson
• A tour through England, showing (simulated) dislike of Piers Morgan for each town

For my visualization I'm using batted ball data from MLBAM and Statcast, obtained from this Statcast-Modeling R package. The layout is similar to one I built with d3.js some months back, which looks like this,

In this visualization the circles show hits (blue) and outs (red) at their landing positions according to hit f/x. The grids on the right-hand side use mouseover to let the user filter in any of the launch-angle / launch-speed, launch-angle / hang-time, and launch-speed / hang-time planes. The fully interactive version of this visualization is available on my github page.

The main feature of extending this to VR is using the device orientation to control the filtering in the launch-angle / launch-speed plane. The user changes launch angle by tilting their head up and down, and launch speed by tilting their head left and right. There is also a mouseover fallback for users on desktop, i.e. moving the mouse up the screen simulates tilting one's head up and down, and moving the mouse left to right simulates tilting one's head left and right. In addition, I tilted the plane of the field down into more of a perspective view, and I pop the filtered batted balls in the vertical direction for additional highlighting. The end result looks like the image below; the fully interactive version is available at https://vr-batted-ball-vis.herokuapp.com/

There are a number of technical details about building this visualization that may be interesting, but a full description of them is beyond the scope of this post. In short, the visualization was built using three.js, with a BufferGeometry and a custom shader to render the points as a particle system; I map the output of the DeviceOrientation controls from the three.js examples to the launch-angle and launch-speed filters. I welcome any additional technical questions in the comments or via the contact form.
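For the desktop fallback especially, the mapping boils down to clamped linear interpolation from a head (or mouse) angle to a filter center. Here is a small Python sketch of that idea; the actual visualization does this in JavaScript via the three.js controls, and all of the ranges below are illustrative rather than the real ones:

```python
def orientation_to_filter(pitch_deg, roll_deg,
                          pitch_limits=(-45.0, 45.0),
                          roll_limits=(-45.0, 45.0),
                          angle_range=(-30.0, 60.0),
                          speed_range=(40.0, 120.0)):
    """Map head pitch (up/down) and roll (left/right), in degrees, to the
    centers of the launch-angle and launch-speed filters."""
    def lerp(x, lo, hi, out_lo, out_hi):
        x = min(max(x, lo), hi)                # clamp to the usable head range
        t = (x - lo) / (hi - lo)               # normalize to [0, 1]
        return out_lo + t * (out_hi - out_lo)  # interpolate into output range
    launch_angle = lerp(pitch_deg, *pitch_limits, *angle_range)
    launch_speed = lerp(roll_deg, *roll_limits, *speed_range)
    return launch_angle, launch_speed
```

Looking straight ahead lands on the midpoint of each filter range, and tilting past the limits simply pins the filter at its extreme.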

## Batted-ball data visualization using an alternative to a heatmap

The availability of tracking data has been an exciting development in the world of baseball statistics and analytics over the last several years. Most significantly this includes pitch f/x and, more recently, batted ball data from MLBAM Statcast. The data I am working with here are a limited set from the hit f/x system, from April 2009, but it is straightforward to extend the ideas outlined here to a larger data set from Statcast. I have smoothed the data with an ad-hoc Gaussian kernel. The batted ball data include measured quantities such as launch angle off the bat, speed off the bat, hang time, and distance. A natural question to ask is how the outcome, specifically the probability of a batted ball being a hit rather than an out, depends on launch angle and speed. This presents the challenge of representing three variables in a two-dimensional graphic. One way of doing this is a heatmap, i.e. using a 2d plane for the explanatory variables and color for the outcome. Here's an example,
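For concreteness, the kind of ad-hoc Gaussian kernel smoothing I mean can be sketched as a Gaussian-weighted average of the 0/1 hit outcomes on a launch-angle / launch-speed grid. The bandwidths here are made-up values, not the ones used for the actual graphic:

```python
import numpy as np

def smoothed_hit_prob(angle_grid, speed_grid, angles, speeds, is_hit,
                      h_angle=3.0, h_speed=3.0):
    """Kernel-smoothed estimate of P(hit) on a launch-angle / launch-speed
    grid: a Gaussian-weighted average of the 0/1 hit outcomes."""
    aa, ss = np.meshgrid(angle_grid, speed_grid, indexing="ij")
    # weights: shape (n_angle_grid, n_speed_grid, n_observations)
    w = (np.exp(-0.5 * ((aa[..., None] - angles) / h_angle) ** 2)
         * np.exp(-0.5 * ((ss[..., None] - speeds) / h_speed) ** 2))
    return (w * is_hit).sum(axis=-1) / w.sum(axis=-1)

# Two batted balls at 20 degrees: a hit at 100 mph, an out at 60 mph.
p = smoothed_hit_prob(np.array([20.0]), np.array([60.0, 100.0]),
                      np.array([20.0, 20.0]), np.array([100.0, 60.0]),
                      np.array([1.0, 0.0]))
```

With the hit and the out well separated in launch speed, the smoothed probability is near 1 on the grid point close to the hit and near 0 close to the out.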

Data visualization research suggests that spatial separation and length are the most effective ways of showing quantitative comparisons, and in particular that color is better suited to categorical variables than quantitative ones. My goal here is to explore an alternative to a heatmap that uses a line graph instead of color to show the quantitative dependence of batting average on launch speed. One complication is that the way batting average changes with launch angle depends on launch speed, which gives the data interesting spatial behavior in the launch-angle / launch-speed plane. To try to preserve this information, I came up with the idea of brushing on the launch angle variable to highlight a given value of launch angle, while also highlighting the few neighboring values to show the gradient in the launch angle direction. The result looks like this,

The idea is that you can use mouseover on the bar on the left-hand side to highlight a particular value of the launch angle, and the blue-to-red color variation shows the way the curve changes at adjacent values. The graphic and the source code, which uses d3.js, are available on my github page. I also have a version that uses hang time and distance as the variables, as shown below, and one that uses batted-ball wOBA instead of batted-ball batting average.

In a recent post I talked about applying principal components analysis to SportsVU player tracking data and generating “eigenspectra”. I sort of ran into a wall with that because the data are too numerous to analyze on my personal machine. I thought about setting up an AWS instance, and Chris Long generously agreed to run the code for me on his sports-analytics cluster. In the end, though, I was able to extend the method to more data by using an online estimate of the covariance. The results are available in interactive form here, and illustrated in the gif below
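The online covariance estimate can be sketched as a Welford-style single-pass update. This is a generic illustration of the idea, not the exact code used for the eigenspectra analysis:

```python
import numpy as np

class OnlineCovariance:
    """Single-pass (Welford-style) running estimate of a mean vector and
    covariance matrix, so the full data set never has to sit in memory."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self._m2 = np.zeros((dim, dim))  # running sum of deviation outer products

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += np.outer(delta, x - self.mean)

    @property
    def cov(self):
        return self._m2 / (self.n - 1)  # unbiased sample covariance

# Feed rows one at a time; the result matches the batch estimate exactly.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
est = OnlineCovariance(3)
for row in data:
    est.update(row)
```

Once the covariance matrix has been accumulated this way, the principal components follow from an ordinary eigendecomposition (e.g. `np.linalg.eigh(est.cov)`), with no need to hold the raw data.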

## Power ranking & multi-year park effects using Markov chain Monte Carlo

I've recently had the opportunity to use pymc, a python module for Bayesian statistical modeling, including hierarchical models. It's so much fun to use that I find myself looking for models to apply it to. One example is building a power-ranking model that includes a home field advantage and a park effect. This applies very generally, but I will apply it to baseball, where the concept of a park effect is most meaningful. The idea of building a power-ranking model is on my mind because this is the first year I participated in the kaggle march madness competition (my model finished 44th), and a non-linear power-ranking model with a home-court advantage was the basic approach I used (although not with pymc). I am using the term power-ranking here with the very specific meaning that the output of the model tells you the probability that team A will beat team B, terminology I picked up from Chris Long.

The basic idea behind Bayesian hierarchical modeling is you define random variables as being generated from some combination of likelihood and prior distributions, and then make inferences about the parameters of your model by sampling from the posterior distribution, e.g. using MCMC. I’ve defined the baseball power-ranking model in the following way,

$\alpha_i \sim N(1, \tau_{\alpha})$

$\beta_i \sim N(1, \tau_{\beta})$

$A \sim N(4.5, \tau_{A})$

$h \sim N(0, \tau_{h})$

$\rho_i \sim N(0, \tau_{\rho})$

$\pi \sim Beta(\alpha=1, \beta=1)$

$\mu^{s}_{ij} = A \times \alpha_i \times \beta_j \times (1 + h) \times (1 + \rho_i)$

$\mu^{a}_{ij} = A \times \alpha_j \times \beta_i \times (1 - h) \times (1 + \rho_i)$

$RS_{ij} \sim Poisson0(\mu^{s}_{ij}, \pi)$

$RA_{ij} \sim Poisson0(\mu^{a}_{ij}, \pi)$,

where $Poisson0$ is a zero-inflated Poisson distribution,

$P(0, \lambda | \pi) = \pi + (1-\pi) e^{-\lambda}$

$P(n, \lambda | \pi) = (1-\pi) \frac{\lambda^n e^{-\lambda}}{n!}$
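As a sanity check on the notation, the zero-inflated Poisson pmf is straightforward to implement directly:

```python
import math

def zip_pmf(n, lam, pi):
    """P(n | lam, pi) for a zero-inflated Poisson: with probability pi the
    count is forced to zero, otherwise it is an ordinary Poisson draw."""
    poisson = math.exp(-lam) * lam ** n / math.factorial(n)
    return pi + (1 - pi) * poisson if n == 0 else (1 - pi) * poisson

# The probabilities sum to one (the Poisson tail beyond n=60 is negligible
# for lam near a typical runs-per-game value).
total = sum(zip_pmf(n, 4.5, 0.05) for n in range(60))
```

The extra mass at zero is what lets the model accommodate shutouts occurring more often than a plain Poisson would predict.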

The parameters are defined as follows: $\alpha_i$ is the offensive strength of team i, $\beta_i$ the defensive strength of team i, $A$ an overall scale, $\rho_i$ a park effect for the stadium belonging to team i, $h$ the home field advantage, $\mu^{s}_{ij}$ and $\mu^{a}_{ij}$ the mean runs scored and allowed, respectively, when team i is at home against team j, and $RS_{ij}$ and $RA_{ij}$ the observed runs scored and allowed in those games. The $\tau$ parameter in the Normal distributions is the inverse variance, and the values have been set to $\tau_{\alpha} = \tau_{\beta} = 100$, $\tau_{A} = 10$, $\tau_{h} = 100$, $\tau_{\rho} = 100$.

The results of my model for 2012-2015 MLB data, after taking 1500 samples from the posterior probability distribution (not a really big number, but about as much as my laptop can handle), are (mean & standard deviation):

home = 0.010 +- 0.003

$\pi$ = 0.053 +- 0.002

A = 4.32 +- 0.035

Top 5 offenses:

TOR 2015 1.22
SLN 2013 1.19
ANA 2012 1.18
BOS 2013 1.17
SLN 2012 1.15

Top 5 defenses:

TBA 2012 0.84
SLN 2015 0.84
SEA 2014 0.86
BAL 2014 0.87
KCA 2013 0.87

Park effects, highest to lowest:

0.215 DEN02
0.127 SYD01
0.102 BOS07
0.082 BAL12
0.072 ARL02
0.066 MIN04
0.065 MIL06
0.060 TOR02
0.058 CHI12
0.056 DET05
0.054 KAN06
0.038 NYC21
0.038 CLE08
0.024 CHI11
0.016 PHO01
0.013 CIN09
0.001 WAS11
-0.004 HOU03
-0.007 PHI13
-0.008 MIA02
-0.009 STP01
-0.041 ATL02
-0.052 OAK01
-0.052 TOK01
-0.054 STL10
-0.059 ANA01
-0.078 PIT08
-0.100 SFO03
-0.100 NYC20
-0.115 LOS03
-0.126 SEA03
-0.129 SAN02

## True talent estimate using Gaussian process regression

A while back I wrote about estimating true talent using two correlated samples. The resulting equations, for talent $t_1$ in time period one, and talent $t_2$ in time period two, are,

$t_1 = \frac{x_1 (\frac{1}{\sigma_1^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}) + x_2(\frac{\rho}{\sigma_2^2}) + t_0 (\frac{1}{\sigma_t^2} + \frac{(1-\rho)}{\sigma_2^2}) } {\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} + \frac{1}{\sigma_t^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}}$
$t_2 = \frac{x_2 (\frac{1}{\sigma_2^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}) + x_1(\frac{\rho}{\sigma_1^2}) + t_0 (\frac{1}{\sigma_t^2} + \frac{(1-\rho)}{\sigma_1^2}) } {\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} + \frac{1}{\sigma_t^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}}$

As an example, let's look at Mike Trout's 2012 & 2013 seasons. His wOBA was 0.409 in 639 PA in 2012 and 0.423 in 716 PA in 2013. For this I will assume that the mean true talent for wOBA is 0.315 and the standard deviation of the true-talent distribution is 0.030. Applying these to the equations above, my estimates of true talent for year 1 (2012) and year 2 (2013), as a function of $\rho$, will be,

As we would logically expect, when $\rho = 0$, it’s exactly the same as regressing each season to the mean independently, and when $\rho=1$, it’s the same as pooling the two seasons into one observation of 639+716=1355 PA. Now, there’s no reason we have to stop at two seasons, or two intervals of time; I could estimate true talent in 2013 based on 2012, 2013, 2014, 2015… Moreover, I could break each season into halves and estimate true talent in the latter half of 2013 based on the first half of 2012, the second half of 2012, the first half of 2013, etc… The logical endpoint of this process is to take true talent as something that isn’t fixed over a year or fraction of a year, but that changes from moment to moment, and to estimate what its value is by correlating the value from one instant to the next.  This is where Gaussian processes come in; but before getting to that, I want to motivate their use by looking at Gaussian distributions.
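Both limits are easy to check numerically. Here is a small Python sketch that transcribes the $t_1$ formula above and verifies that $\rho=0$ reproduces independent regression to the mean while $\rho=1$ reproduces pooling the two seasons:

```python
import math

def true_talent_t1(x1, x2, s1, s2, st, t0, rho):
    """Posterior mean of year-one true talent from two correlated samples,
    transcribing the t_1 formula above."""
    c = (1 - rho ** 2) * st ** 2 / (s1 ** 2 * s2 ** 2)
    num = (x1 * (1 / s1 ** 2 + c)
           + x2 * rho / s2 ** 2
           + t0 * (1 / st ** 2 + (1 - rho) / s2 ** 2))
    den = 1 / s1 ** 2 + 1 / s2 ** 2 + 1 / st ** 2 + c
    return num / den

# Mike Trout 2012-13: wOBA 0.409 in 639 PA and 0.423 in 716 PA,
# prior mean 0.315, true-talent spread 0.030, noise ~ sqrt(0.5 / PA).
x1, x2, t0, st = 0.409, 0.423, 0.315, 0.030
s1, s2 = math.sqrt(0.5 / 639), math.sqrt(0.5 / 716)

# rho = 0 should reduce to regressing 2012 to the mean on its own ...
indep = (x1 / s1 ** 2 + t0 / st ** 2) / (1 / s1 ** 2 + 1 / st ** 2)
# ... and rho = 1 to pooling both seasons into one observation.
pooled = ((x1 / s1 ** 2 + x2 / s2 ** 2 + t0 / st ** 2)
          / (1 / s1 ** 2 + 1 / s2 ** 2 + 1 / st ** 2))
```

The intermediate values of $\rho$ interpolate smoothly between these two extremes.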

Let's look at the Mike Trout 2012 & 2013 example again, from a different angle. Before 2012, if I wanted to estimate Mike Trout's true talent for wOBA, since I don't have any observations, the best I could do is assume his true talent matches my prior, which I'm taking to have mean wOBA = 0.315 and standard deviation 0.030. That distribution looks like this,

Once I observe Mike Trout to have a 0.409 wOBA in 639 PA, I have a new, posterior distribution for what I think his true talent is. Since the statistical uncertainty on wOBA is $\approx \sqrt{\frac{0.5}{PA}}$, or 0.028 in this case, the posterior distribution has mean $\mu = [0.409/0.028^2 + 0.315/0.03^2 ]/[1/0.028^2 + 1/0.03^2] \approx 0.365$ and standard deviation $\sigma = \sqrt{(1/0.028^2 + 1/0.03^2)^{-1}} \approx 0.020$. This new distribution, compared to the old, looks like this,

So now suppose I have 2012 & 2013 wOBA data and I want to estimate true talent in each year. Again, a priori I don't know what Mike Trout's true talent is, but I can apply a prior expectation, with a mean of 0.315 and a standard deviation of 0.03. The crucial point here is that I don't take these seasons to be independent; I expect that if his true talent is above average in one season, it is probably above average in the other as well, and vice versa. As detailed in my post on estimating true talent from correlated samples, this means that the covariance matrix of the true-talent prior has non-zero off-diagonal terms. I'm not saying anything at this point about how strong the correlation of true talent from season to season is, but for the sake of argument let's say $\rho = 0.5$. The contours of the resulting prior true-talent distribution are shown in the left image below, and the prior & posterior distributions in the right image. The crosshair is the 2012 & 2013 data.

The same is shown for $\rho = 0$ (left) and $\rho = 0.95$ (right) below.

So that's the basic idea: I have a prior distribution for true talent that includes a correlation, I make some (noisy) observations, and then I get a posterior distribution for true talent. If I apply the same logic but treat time as a continuous variable, i.e. one that has an infinite number of possible values, then I have a Gaussian distribution in infinite dimensions: a Gaussian process.

The key to using Gaussian process regression is that, because the Normal distribution has a lot of nice mathematical properties (namely that the marginal distributions over any set of variables are also Normal), you can get closed-form solutions for the posterior mean function (i.e. an infinite-dimensional vector) and covariance function. Formulas are given in, e.g., Gaussian Processes for Machine Learning by Rasmussen and Williams. To do the computations here, I am using the Gaussian process tools in scikit-learn. As of this writing the stable version is 0.17 and the dev version is 0.18; I highly recommend using the version 0.18 Gaussian process module, as its interface and functionality are significantly better.

Although I began by looking at Mike Trout's first few years, I'm going to switch it up now and look at Pedro Martinez's career wOBA on a start-by-start basis. This is mainly because this example has been looked at several times, e.g. when I looked at the fat tails of true talent, by Tom Tango a couple of different times (e.g., here), and recently by Neil Paine and Jay Boice at 538, and because a game for a pitcher has less statistical noise than a game for a batter. For this analysis, I am using a prior with a mean value of wOBA = 0.32.

To do a Gaussian process regression you essentially have to specify a covariance function: how strongly correlated the underlying true talent is from moment to moment. The most convenient covariance function is the radial basis function, or squared exponential, which has the same form as a Gaussian distribution function. This has two parameters, the overall scale of the true-talent variance and the length scale of the true-talent covariance. In addition, noise on the measurements is handled by adding a term to the diagonal of the covariance matrix, which I am fixing to be $0.01 \approx (0.32 (1.1-0.32))/24$. The graphs below show what the mean of the true-talent posterior distribution looks like for a variety of length scales, with the overall scale set to a standard deviation of 25 wOBA points. The black dots are Pedro's game-by-game wOBA in starts, the black line is the mean of the posterior true-talent distribution, and the blue ribbon is proportional to the diagonal of the posterior covariance matrix, i.e. the marginal uncertainty on the true-talent estimate.
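A sketch of this setup with the version 0.18 scikit-learn interface might look like the following. The data here are synthetic stand-ins for Pedro's game log, and the hyperparameter values (25 points of wOBA for the scale, a 300-day length scale) are illustrative; `optimizer=None` keeps the hand-set hyperparameters fixed rather than fitting them:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Synthetic stand-in for start-by-start wOBA: x is days into the career.
rng = np.random.RandomState(0)
days = np.sort(rng.uniform(0, 2000, 80))[:, None]
woba = 0.32 - 0.04 * np.sin(days[:, 0] / 400.0) + rng.normal(0, 0.1, 80)

# Squared-exponential (RBF) kernel: 0.025 (25 wOBA points) overall scale,
# 300-day length scale; alpha adds the per-start measurement noise
# (variance ~0.01) to the diagonal of the covariance matrix.
kernel = ConstantKernel(0.025 ** 2) * RBF(length_scale=300.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01, optimizer=None)
gp.fit(days, woba - 0.32)        # regress relative to the 0.32 prior mean

grid = np.linspace(0, 2000, 200)[:, None]
mean, sd = gp.predict(grid, return_std=True)
mean += 0.32                     # posterior mean of true talent
```

Plotting `mean` with a ribbon of width proportional to `sd` reproduces the kind of figure described above.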

Some qualitative features: when the length scale is small, i.e. true talent from moment to moment is not strongly correlated, the estimate fluctuates rapidly and, in regions where there are no observations, returns quickly to the prior, wOBA = 0.32. At the other extreme, where the length scale is very large, i.e. true talent is very strongly correlated at different times, the estimate becomes the overall mean for Pedro's career, wOBA ~ 0.280. The estimate of 25 points of wOBA for the standard deviation of true talent came from looking at seasonal values; the graphs below show the true-talent estimates assuming the standard deviation from start to start was larger, specifically 50 points of wOBA.

For this particular use case, a pitcher's true talent over his career, I was interested in a model that has short-term correlations that fade away quickly, but that also has a lesser long-term correlation that persists over the course of a career. In terms of Gaussian process covariance functions, this means defining a kernel with a decaying correlation (like a radial basis function) but with a constant added on. An example is shown below,
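In scikit-learn kernel algebra this is just an RBF term plus a constant term; a sketch, with illustrative values for the scales:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Short-term component: correlation that decays over a few hundred days.
short_term = ConstantKernel(0.04 ** 2) * RBF(length_scale=200.0)
# Career-long component: a constant term that never decays, so any two
# starts remain weakly correlated no matter how far apart they are.
career = ConstantKernel(0.03 ** 2)
kernel = short_term + career

# Covariance between two starts 5000 days apart: the RBF part has died
# off, leaving only the constant floor.
K = kernel(np.array([[0.0], [5000.0]]))
```

The diagonal carries the full variance (both terms), while distant off-diagonal entries settle onto the constant floor, which is exactly the "lesser long-term correlation" described above.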

Using this covariance function, and again varying the length scale gives the following for the true talent estimates,

Gaussian process regression is an appealing method for the problem of estimating true talent because it's flexible and doesn't require specifying a particular functional form for true talent as a function of time, and because it deals directly with the quantity we usually think of in terms of true talent: the correlation from time interval to time interval. One deficiency in what I've shown above is that the prior assumes a mean equal to the league mean; the model would be improved by considering age, park factors, etc.

Ok, so the big question is: when was Pedro's peak? It depends on how you model the data, but if I use a model where the standard deviation of true talent is 50 points of wOBA and vary the length scale from 100 days to 1200 days, then I get a mean estimate of Aug 19, 2000, with a standard error of about 10 days.

## 2016 BBHOF candidates as Chernoff faces

A few weeks ago a blog post about Chernoff’s faces came across my twitter feed

http://www.futilitycloset.com/2015/11/29/chernoffs-faces/

I found this intriguing and thought it'd be fun to apply it to some data set, and the metrics for the 2016 candidates for the baseball hall-of-fame are perfectly suited. Before I show my result, I want to mention that after working on this, I realized that it was done at the Hardball Times, by Kevin Lai, back in 2011,

http://www.hardballtimes.com/tht-live/chernoff-faces

His version uses R and looks, frankly, much better than my home-cooked python version. In the comments to that article, Max Marchi gives a link to a version he did in 2006.

For my versions, I hacked together some very basic faces using matplotlib. The facial-feature-to-metric correspondence is

• pupil size : world-series championships
• pupil location : mvps + cy youngs
• mouth curvature : all-star appearances
• eyebrow length : “power”
• eyebrow slope : “power” per PA
• face height : sum of WAR for best 5 seasons
• nose width : defensive WAR (as defined by baseball-reference)
• nose height : offensive WAR (as defined by baseball-reference)
• mouth length : number of world series appearances

“Power” is defined as HR for batters and K for pitchers. So with that said, here is the result:
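For anyone curious about the mechanics, here is a stripped-down sketch of drawing one such face with matplotlib patches. The feature scalings are invented for illustration and are much cruder than the real versions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Ellipse

def draw_face(ax, face_height=1.2, pupil_size=0.04, mouth_curve=1.0,
              nose_width=0.10, nose_height=0.20):
    """Minimal Chernoff-style face: each keyword argument stands in for one
    player metric (e.g. pupil_size for world-series championships)."""
    ax.add_patch(Ellipse((0, 0), 1.0, face_height, fill=False))      # head
    for x in (-0.2, 0.2):                                            # eyes
        ax.add_patch(Circle((x, 0.15), 0.08, fill=False))
        ax.add_patch(Circle((x, 0.15), pupil_size, color="k"))       # pupils
    ax.add_patch(Ellipse((0, -0.05), nose_width, nose_height, fill=False))
    xs = np.linspace(-0.2, 0.2, 50)                                  # mouth
    ax.plot(xs, -0.35 + mouth_curve * (xs ** 2 - 0.04), color="k")
    ax.set_xlim(-0.7, 0.7)
    ax.set_ylim(-0.8, 0.8)
    ax.set_aspect("equal")
    ax.axis("off")

fig, ax = plt.subplots(figsize=(3, 3))
draw_face(ax, face_height=1.3, pupil_size=0.05, mouth_curve=2.0)
fig.savefig("face.png")
```

Drawing one face per player in a grid of subplots, with each keyword driven by that player's metrics, gives the full chart.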

## Principal components analysis for baseball HOF voters

As I talked about in my previous post on sportsVU player-tracking data, principal components analysis (PCA) is a technique that can be used for both dimensionality reduction (describe the data effectively using fewer numbers) and to reveal something about the nature of the data.

To apply PCA to voting patterns for the baseball hall-of-fame, we can consider each player to be a “dimension”, i.e., x, y, z, etc., and each voter to occupy a point in that space. Intuitively we would expect that a voter who votes for Bonds (a 1 in the x direction) would usually vote for Clemens as well (a 1 in the y direction). From the dimensionality-reduction point of view, this means you don't really have to tell me both numbers (aye or nay on Bonds and aye or nay on Clemens); just tell me one and I'll essentially know the other. On the structure-of-the-data side, this tells me something interesting about the way voters make decisions and value, or discount, different aspects of a player's career. The Bonds-Clemens association is the most intuitive one, and PCA basically exists to systematize this logic and extend it to less obvious correlations.

To start with, I'll focus on the pool of publicly released ballots from last year (2015), awesomely provided by Ryan Thibodaux (@NotMrTibbs), since there's a large pool to draw from. As more ballots come in for the current election, it is straightforward to add those or analyze them independently.
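To make the mechanics concrete, here is a small sketch of running PCA on a ballot matrix with scikit-learn. The ballots below are randomly generated stand-ins (with Clemens votes engineered to track Bonds votes), not the real tracker data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy ballot matrix: rows are voters, columns are players (1 = voted for).
players = ["Bonds", "Clemens", "Piazza", "Raines", "Smith"]
rng = np.random.RandomState(0)
bonds = rng.randint(0, 2, 200)
flip = (rng.rand(200) < 0.1).astype(int)  # 10% of voters split the pair
ballots = np.column_stack([
    bonds,                # Bonds
    bonds ^ flip,         # Clemens: almost always matches Bonds
    rng.randint(0, 2, 200),
    rng.randint(0, 2, 200),
    rng.randint(0, 2, 200),
])

pca = PCA()
pca.fit(ballots)
first = pca.components_[0]                 # loadings of the 1st component
explained = pca.explained_variance_ratio_  # variance captured per component
```

Because the Bonds and Clemens columns move together, the first component loads heavily, and with the same sign, on those two; the cumulative sum of `explained_variance_ratio_` gives the variance-explained curve as a function of the number of components.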

This chart shows the mean ballot (black) and the 1st (blue) and 2nd (red) principal components. So the chart says that, after the overall mean, the most important information you can tell me about a ballot is what it said for Bonds & Clemens, and to a lesser extent Piazza, McGwire, Mussina, Edgar, Raines, etc., i.e. all the players with non-zero values on the blue line. The fact that the Bonds and Clemens values go in the same direction tells us right away that there's a strong correlation between votes for those two. Likewise, votes for Bonds and Clemens are positively correlated with votes for McGwire and Piazza, and negatively correlated with votes for Edgar and Lee Smith. In the 2nd component, what a ballot said for Schilling, Bagwell, Smith, and Raines becomes important.

Instead of continuing to plot higher-order components, I'll turn the data on its side and put the player name on the y-axis and the component number on the x-axis. Here is what that looks like,

So we can see that Bonds and Clemens are the most important elements of the first component, but don't really factor in again until a much higher component, and because the signs of their values differ there, that is basically where a vote for one and not the other comes in.

Another interesting question is how much of the variance in the data we can explain as a function of how many components we use; this is the data-reduction aspect of PCA. For this data the correspondence looks like this

So, just by telling me the 1st component, which is more or less the vote for Bonds/Clemens, we explain about 20% of the variance. With 15 components, we explain about 95% of the variance.

Finally, the player-by-player plot of component magnitude reminded me of the iconic Joy Division album cover

so just for fun I whipped up a similar visualization using d3.js. The code for that is available here