A few prominent examples of VR data visualizations are,

- 3D Nasdaq roller coaster from the WSJ
- d3 and A-Frame roller coaster from Ian Johnson
- A tour through England, showing (simulated) dislike of Piers Morgan for each town

For my visualization I’m using batted ball data from MLBAM and Statcast, obtained from the Statcast-Modeling R package. The layout is similar to one I built with d3.js some months back, which looks like this,

In this visualization the circles show hits (blue) and outs (red), at their landing positions according to hit f/x. The grids on the right-hand side use mouseover to let the user filter in any of the launch-angle / launch-speed, launch-angle / hang-time, and launch-speed / hang-time planes. The fully interactive version of this visualization is available on my github page.

The main feature of extending this to VR is using the device orientation to control the filtering in the launch-angle / launch-speed plane. The user changes launch angle by tilting their head up and down, and launch speed by tilting their head left to right. There is also a mouseover fallback for desktop users: moving the mouse up and down the screen simulates tilting one’s head up and down, and moving it left to right simulates tilting left to right. In addition, I tilted the plane of the field down into more of a perspective view, and I pop the filtered batted balls in the vertical direction for additional highlighting. The end result looks like the image below; the fully interactive version is available at https://vr-batted-ball-vis.herokuapp.com/

There are a number of technical details about building this visualization that may be interesting, but a full description is beyond the scope of this post. In short, the visualization was built using `three.js`, with a `BufferGeometry` and a custom shader to render the points as a particle system; I map the output of the `DeviceOrientation` controls from the `three.js` examples to the launch-angle and launch-speed filters. I welcome any additional technical questions in the comments or via the contact form.

Data visualization research suggests that spatial separation and length are the most effective channels for quantitative comparisons, and in particular that color is better suited to categorical variables than quantitative ones. My goal here is to explore an alternative to a heatmap that uses a line graph, instead of color, to show the quantitative dependence of batting average on launch speed. One complication is that the way batting average changes with launch angle depends on launch speed, which gives the data interesting spatial behavior in the launch-angle / launch-speed plane. To try to keep this information, I came up with the idea of brushing on the launch angle variable to highlight a given value of launch angle, but also highlighting the few neighboring values, to show the gradient in the launch-angle direction. The result looks like this,

The idea is that you can use mouseover on the bar on the left-hand side to highlight a particular value of the launch angle, and the blue-to-red color variation shows how the curve changes at adjacent values. The graphic and the source code, which uses d3.js, are available on my github page. I also have a version that uses hang time and distance as the variables, as shown below, and one that uses batted-ball wOBA instead of batted-ball batting average.

The basic idea behind Bayesian hierarchical modeling is that you define random variables as being generated from some combination of likelihood and prior distributions, and then make inferences about the parameters of your model by sampling from the posterior distribution, e.g. using MCMC. I’ve defined the baseball power-ranking model in the following way,

where ZIP denotes a zero-inflated Poisson distribution.

The parameters are defined as follows: o_i, the offensive strength of team i; d_i, the defensive strength of team i; A, an overall scale; park_i, a park effect for the stadium belonging to team i; home, the home-field advantage; mu_RS and mu_RA, the mean runs scored and allowed, respectively, when team i is at home against team j; and RS and RA, the observed runs scored and allowed when team i is at home against team j. The tau parameter in the Normal prior distributions is the inverse variance, and the tau values have been fixed a priori.
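As a concrete reference for the likelihood, here is a minimal sketch of a zero-inflated Poisson in plain Python/numpy; the function names and the example parameters are mine for illustration, not taken from the model code:

```python
import math
import numpy as np

def sample_zip(pi, lam, size, rng=None):
    """Draw from a zero-inflated Poisson: with probability `pi` emit a
    structural zero, otherwise draw from Poisson(lam)."""
    rng = np.random.default_rng(rng)
    counts = rng.poisson(lam, size)
    counts[rng.random(size) < pi] = 0  # overwrite with structural zeros
    return counts

def zip_pmf(k, pi, lam):
    """P(X = k) under the zero-inflated Poisson."""
    base = (1.0 - pi) * math.exp(-lam) * lam**k / math.factorial(k)
    return pi + base if k == 0 else base
```

The zero inflation lets the model capture shutouts occurring more often than a plain Poisson in runs scored would predict.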

The results of my model for 2012-2015 MLB data, after taking 1500 samples from the posterior probability distribution (not a really big number, but about as much as my laptop can handle), are (mean & standard deviation)

home = 0.010 +- 0.003

= 0.053 +- 0.002

A = 4.32 +- 0.035

top 5 offenses:

TOR 2015 1.22
SLN 2013 1.19
ANA 2012 1.18
BOS 2013 1.17
SLN 2012 1.15

top 5 defenses:

TBA 2012 0.84
SLN 2015 0.84
SEA 2014 0.86
BAL 2014 0.87
KCA 2013 0.87

park effects, highest to lowest:

0.215 DEN02
0.127 SYD01
0.102 BOS07
0.082 BAL12
0.072 ARL02
0.066 MIN04
0.065 MIL06
0.060 TOR02
0.058 CHI12
0.056 DET05
0.054 KAN06
0.038 NYC21
0.038 CLE08
0.024 CHI11
0.016 PHO01
0.013 CIN09
0.001 WAS11
-0.004 HOU03
-0.007 PHI13
-0.008 MIA02
-0.009 STP01
-0.041 ATL02
-0.052 OAK01
-0.052 TOK01
-0.054 STL10
-0.059 ANA01
-0.078 PIT08
-0.100 SFO03
-0.100 NYC20
-0.115 LOS03
-0.126 SEA03
-0.129 SAN02

As an example, let’s look at Mike Trout’s 2012 & 2013 seasons. His wOBA was 0.409 in 639 PA in 2012, and 0.423 in 716 PA in 2013. For this I will assume that the mean true talent for wOBA is 0.315 and the standard deviation of the true-talent distribution is 0.030. Applying these to the equations above, my estimates of true talent for year 1 (2012) and year 2 (2013), as a function of the season-to-season correlation ρ, will be,

As we would logically expect, when ρ = 0, it’s exactly the same as regressing each season to the mean independently, and when ρ = 1, it’s the same as pooling the two seasons into one observation of 639 + 716 = 1355 PA. Now, there’s no reason we have to stop at two seasons, or two intervals of time; I could estimate true talent in 2013 based on 2012, 2013, 2014, 2015… Moreover, I could break each season into halves and estimate true talent in the latter half of 2013 based on the first half of 2012, the second half of 2012, the first half of 2013, etc. The logical endpoint of this process is to take true talent as something that isn’t fixed over a year, or a fraction of a year, but that changes from moment to moment, and to estimate its value by correlating the value from one instant to the next. This is where Gaussian processes come in; but before getting to that, I want to motivate their use by looking at Gaussian distributions.

Let’s look at the Mike Trout 2012 & 2013 example again, from a different angle. Before 2012, if I wanted to estimate Mike Trout’s true talent for wOBA, since I don’t have any observations, the best I could do is assume his true talent will match my prior, which I’m taking to have mean wOBA = 0.315 and standard deviation 0.030. That distribution looks like this. Once I observe Mike Trout to have a 0.409 wOBA in 639 PA, I have a new, posterior, distribution for what I think his true talent is. Since the statistical uncertainty on an observed wOBA is roughly 0.7/sqrt(PA), or 0.028 in this case, the posterior distribution has mean (0.315/0.030² + 0.409/0.028²) / (1/0.030² + 1/0.028²), with standard deviation (1/0.030² + 1/0.028²)^(-1/2). This new distribution, compared to the old, looks like this,
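This conjugate update can be sketched in a few lines of Python; the 0.7/sqrt(PA) noise scale is my assumption for the statistical uncertainty on an observed wOBA:

```python
import math

def posterior_normal(prior_mean, prior_sd, obs, obs_sd):
    """Conjugate Normal update: precision-weighted average of prior and data."""
    w0, w1 = 1.0 / prior_sd**2, 1.0 / obs_sd**2
    mean = (w0 * prior_mean + w1 * obs) / (w0 + w1)
    sd = math.sqrt(1.0 / (w0 + w1))
    return mean, sd

# Trout 2012: 0.409 wOBA in 639 PA, against a 0.315 +/- 0.030 prior
mean, sd = posterior_normal(0.315, 0.030, 0.409, 0.7 / math.sqrt(639))
```

The posterior mean lands between the prior and the observation, and the posterior standard deviation is always smaller than either input uncertainty.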

So now suppose I have 2012 & 2013 wOBA data and I want to estimate true talent in each year. Again, a priori I don’t know what Mike Trout’s true talent is, but I can apply a prior expectation, with a mean of 0.315 and a standard deviation of 0.030. The crucial point here is that I don’t take these seasons to be independent; I expect that if his true talent is above average in one season, it is probably above average in the other one as well, and vice versa. As detailed in my post on estimating true talent from correlated samples, this means that the covariance matrix in the true-talent prior has non-zero off-diagonal terms. I’m not saying anything at this point about how strong the season-to-season correlation of true talent is, but for the sake of argument let’s fix ρ at a moderate value. The contours of the resulting prior true-talent distribution are shown in the left image below, and the prior & posterior distributions in the right image. The crosshair is the 2012 & 2013 data.
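A minimal numpy sketch of this correlated-prior update, with a hypothetical function name and my assumed 0.7/sqrt(PA) observation noise; it reproduces the two limiting cases (ρ = 0 gives independent regression to the mean, ρ = 1 gives pooling):

```python
import numpy as np

def posterior_correlated(mu0, sd0, rho, obs, obs_sd):
    """Posterior mean of per-season true talent, given a bivariate Normal
    prior with correlation `rho` and independent Normal observation noise."""
    obs = np.asarray(obs, dtype=float)
    obs_sd = np.asarray(obs_sd, dtype=float)
    prior_cov = sd0**2 * np.array([[1.0, rho], [rho, 1.0]])
    noise_cov = np.diag(obs_sd**2)
    # standard multivariate-Normal update: mu0 + K (x - mu0)
    gain = prior_cov @ np.linalg.inv(prior_cov + noise_cov)
    return mu0 + gain @ (obs - mu0)

# Trout 2012-13: 0.409 in 639 PA, 0.423 in 716 PA; noise ~ 0.7/sqrt(PA)
est = posterior_correlated(0.315, 0.030, 0.5,
                           [0.409, 0.423],
                           [0.7 / np.sqrt(639), 0.7 / np.sqrt(716)])
```

Sweeping `rho` from 0 to 1 moves the two season estimates from independently shrunk values toward a single pooled value.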

The same is shown below for two other values of ρ (left and right).

So that’s the basic idea: I have a prior distribution for true talent that includes a correlation, I make some (noisy) observations, and then I get a posterior distribution for true talent. If I apply the same logic but treat time as a continuous variable, i.e. one that has an infinite number of possible values, then I have a Gaussian distribution in infinite dimensions: a *Gaussian process*.

The key to using Gaussian process regression is that, because the Normal distribution has a lot of nice mathematical properties (namely that the marginal distribution over any set of variables is also Normal), you can get closed-form solutions for the posterior mean function (i.e. an infinite-dimensional vector) and covariance function. Formulas are given in, e.g., Gaussian Processes for Machine Learning by Rasmussen and Williams. To do the computations here, I am using the Gaussian process tools in scikit-learn. As of this writing the stable version is 0.17 and the dev version is 0.18; I highly recommend using the version 0.18 Gaussian process module, since its interface and functionality are significantly better.

Although I began by looking at Mike Trout’s first few years, I’m going to switch it up now and look at Pedro Martinez’s career wOBA on a start-by-start basis. This is mainly because this example has been looked at several times, e.g. when I looked at the fat tails of true talent, by Tom Tango a couple of different times (e.g., here), and recently by Neil Paine and Jay Boice at 538, and because a game for a pitcher has less statistical noise than a game for a batter. For this analysis, I am using a prior with a mean value of wOBA = 0.32.

To do a Gaussian process regression you essentially have to specify a covariance function: how strongly correlated the underlying true talent is from moment to moment. The most convenient covariance function is the radial basis function, or squared exponential, which has the same form as a Gaussian distribution function. It has two parameters: the overall scale of the true-talent variance, and the length scale of the true-talent covariance. In addition, noise on the measurements is handled by adding a term to the diagonal of the covariance matrix, which I fix to a constant. The graphs below show what the mean of the true-talent posterior distribution looks like for a variety of length scales, with the overall scale set to a standard deviation of 25 wOBA points. The black dots are Pedro’s game-by-game wOBA in starts, the black line is the mean of the posterior true-talent distribution, and the blue ribbon is proportional to the diagonal of the posterior covariance matrix, i.e. the marginal uncertainty on the true-talent estimate.
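A sketch of this setup with the scikit-learn 0.18-style Gaussian process module; the data below are synthetic stand-ins for the start-by-start wOBA, and all hyperparameter values are illustrative rather than fitted:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# toy stand-ins for game dates (days) and per-start wOBA
rng = np.random.default_rng(0)
days = np.sort(rng.uniform(0, 3000, 120))[:, None]
woba = 0.280 + 0.050 * rng.standard_normal(120)

prior_mean = 0.320
# 25 points of wOBA for the true-talent sd and a 300-day length scale,
# held fixed (optimizer=None) rather than optimized
kernel = ConstantKernel(0.025**2, constant_value_bounds="fixed") * \
    RBF(length_scale=300.0, length_scale_bounds="fixed")
# measurement noise enters as a constant added to the covariance diagonal
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.050**2, optimizer=None)
gp.fit(days, woba - prior_mean)  # center on the prior mean
mu, sd = gp.predict(days, return_std=True)
mu += prior_mean
```

Centering on the prior mean before fitting is what makes the estimate revert to wOBA = 0.32 far from any observations.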

Some qualitative features: when the length scale is small, i.e. true talent from moment to moment is not strongly correlated, the estimate fluctuates rapidly and, in regions with no observations, returns quickly to the prior, wOBA = 0.32. At the other extreme, where the length scale is very large, i.e. true talent is very strongly correlated across times, the estimate approaches the overall mean for Pedro’s career, wOBA ~ 0.280. The estimate of 25 points of wOBA for the standard deviation of true talent came from looking at seasonal values; the graphs below show the true-talent estimates assuming the start-to-start standard deviation was larger, specifically 50 points of wOBA.

For this particular use case – a pitcher’s true talent over his career – I was interested in a model that has short-term correlations that fade away quickly, but also a lesser long-term correlation that persists over the course of a career. In terms of Gaussian process covariance functions, this means defining a kernel with a decaying correlation – like a radial basis function – but with a constant added on. An example is shown below,
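In scikit-learn kernel algebra this composite covariance can be sketched like so; all of the scales here are illustrative choices, not fitted values:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# short-term correlation that decays over ~200 days, plus a constant
# long-term component that never decays
short_term = ConstantKernel(0.040**2, constant_value_bounds="fixed") * \
    RBF(length_scale=200.0, length_scale_bounds="fixed")
long_term = ConstantKernel(0.030**2, constant_value_bounds="fixed")
kernel = short_term + long_term

# covariance between a pair of starts 1000 days apart
X = np.array([[0.0], [1000.0]])
K = kernel(X)
```

At large separations the RBF term vanishes and the covariance flattens out at the constant, which is what keeps a career-long correlation in the model.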

Using this covariance function, and again varying the length scale gives the following for the true talent estimates,

Gaussian process regression is an appealing method for the problem of estimating true talent because it’s flexible and doesn’t require specifying a particular functional form for true talent as a function of time, and because it deals directly with the quantity we usually think of in terms of true talent: the correlation from time interval to time interval. One deficiency in what I’ve shown above is that the prior assumes a mean equal to the league mean; the model could be improved by considering age, park factors, etc.

Ok, so the big question is, when was Pedro’s peak? It depends on how you model the data, but if I use a model where the standard deviation of true talent is 50 points of wOBA and vary the length scale from 100 days to 1200 days, then I get a mean estimate of Aug 19, 2000, with a standard error of about 10 days.


http://www.futilitycloset.com/2015/11/29/chernoffs-faces/

I thought this was intriguing and figured it’d be fun to apply it to some data set – and the metrics for the 2016 candidates for the baseball hall-of-fame are perfectly suited. Before I show my result, I want to mention that after working on this, I realized it had been done at the Hardball Times, by Kevin Lai, back in 2011,

http://www.hardballtimes.com/tht-live/chernoff-faces

His version uses R and looks, frankly, much better than my home-cooked python version. In the comments on that article, Max Marchi gives a link to a version he did in 2006.

For my version, I hacked together some very basic faces using matplotlib. The facial-feature-to-metric correspondence is:

- pupil size : World Series championships
- pupil location : MVPs + Cy Youngs
- mouth curvature : All-Star appearances
- eyebrow length : “power”
- eyebrow slope : “power” per PA
- face height : sum of WAR for best 5 seasons
- nose width : defensive WAR (as defined by baseball-reference)
- nose height : offensive WAR (as defined by baseball-reference)
- mouth length : number of World Series appearances

“Power” is defined as HR for batters and K for pitchers. So with that said, here is the result:
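For anyone curious what “hacked together with matplotlib” amounts to, here is a stripped-down sketch of drawing one face; the feature scalings are arbitrary choices of mine, and the real version maps the metrics above onto these arguments:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Ellipse

def draw_face(ax, face_height=0.5, pupil_size=0.5, mouth_curve=0.5,
              brow_slope=0.5, nose_width=0.5, nose_height=0.5):
    """Draw a crude Chernoff face; every argument is a metric scaled to [0, 1]."""
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.axis("off")
    # face outline, with height driven by one metric
    ax.add_patch(Ellipse((0, 0), 1.6, 1.0 + face_height, fill=False, lw=2))
    # pupils
    for x in (-0.35, 0.35):
        ax.add_patch(Circle((x, 0.25), 0.03 + 0.07 * pupil_size, color="k"))
    # eyebrows, with slope driven by a metric
    for sgn in (-1, 1):
        xs = sgn * np.array([0.2, 0.5])
        ys = 0.45 + sgn * 0.15 * (brow_slope - 0.5) * np.array([-1, 1])
        ax.plot(xs, ys, "k-")
    # nose: a simple triangle
    ax.plot([0, -0.1 * nose_width, 0.1 * nose_width, 0],
            [0.15, 0.15 - 0.3 * nose_height, 0.15 - 0.3 * nose_height, 0.15],
            "k-")
    # mouth: smile (1) to frown (0)
    xs = np.linspace(-0.3, 0.3, 50)
    ax.plot(xs, -0.45 + (mouth_curve - 0.5) * (0.3 - xs**2), "k-")

fig, ax = plt.subplots()
draw_face(ax, face_height=0.8, pupil_size=0.3, mouth_curve=0.9)
fig.savefig("face.png")
```

One face per axes means a grid of candidates is just a grid of subplots, each fed that player’s scaled metrics.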


To apply PCA to voting patterns for the baseball hall-of-fame, we can consider each player to be a “dimension”, i.e., x, y, z, etc., and each voter to occupy a point in that space. Intuitively we would expect that a voter who votes for Bonds (a 1 in the x direction) would usually vote for Clemens also (a 1 in the y direction). From the dimensionality-reduction point of view, this means you don’t really have to tell me both numbers (aye or nay on Bonds and aye or nay on Clemens); just tell me one and I’ll essentially know the other. On the structure-of-the-data side, this tells me something interesting about the way voters make decisions and value, or discount, different aspects of a player’s career. The Bonds-Clemens association is the most intuitive one, and PCA basically exists to systematize this logic and extend it to less obvious correlations.
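To make the setup concrete, here is a small sketch using scikit-learn’s PCA on a synthetic ballot matrix (rows are voters, columns are players); the latent “tolerance” variable is a made-up stand-in for whatever actually drives the Bonds-Clemens correlation:

```python
import numpy as np
from sklearn.decomposition import PCA

players = ["Bonds", "Clemens", "Piazza", "Raines", "Smith"]

# synthetic ballots: one latent variable drives perfectly correlated
# Bonds/Clemens votes; the other votes are independent coin flips
rng = np.random.default_rng(0)
n_voters = 200
tolerance = rng.random(n_voters) < 0.4
ballots = np.zeros((n_voters, len(players)))
ballots[:, 0] = tolerance                          # Bonds
ballots[:, 1] = tolerance                          # Clemens
ballots[:, 2:] = rng.random((n_voters, 3)) < 0.6   # everyone else

pca = PCA()
scores = pca.fit_transform(ballots)

# the first component loads almost entirely on Bonds & Clemens, and the
# explained-variance curve shows the data-reduction tradeoff
first = pca.components_[0]
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

With real ballots the components are messier, but the mechanics are the same: fit, read the loadings, and track cumulative explained variance.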

To start with I’ll focus on the pool of publicly released ballots from last year (2015), awesomely provided by Ryan Thibodaux (@NotMrTibbs), since there’s a large pool to draw from. As more ballots come in for the current election, it is straightforward to add those or analyze them independently.

This chart shows the mean ballot (black) and the 1st (blue) and 2nd (red) principal components. So the chart says that, after the overall mean, the most important information you can tell me about a ballot is what it said for Bonds & Clemens, and to a lesser extent Piazza, McGwire, Mussina, Edgar, Raines, etc. – all the players with non-zero values on the blue line. The fact that the Bonds and Clemens values go in the same direction tells us right away that there’s a strong correlation between votes for those two. Likewise, votes for Bonds and Clemens are positively correlated with votes for McGwire and Piazza, and negatively correlated with votes for Edgar and Lee Smith. In the 2nd component, it becomes important what a ballot said for Schilling, Bagwell, Smith, and Raines.

Instead of continuing to plot higher-order components, I’ll turn the data on its side, putting the player names on the y-axis and the component number on the x-axis. Here is what that looks like,

So we can see that Bonds and Clemens are the most important elements of the first component, but don’t really factor in again until a much higher component; and because the signs of their values differ there, that is basically where a vote for one but not the other comes in.

Another interesting question is how much of the variance in the data can be explained as a function of how many components we use; this is the data-reduction aspect of PCA. For this data the correspondence looks like this,

So, just by telling me the 1st component, which is more or less the vote for Bonds/Clemens, we explain about 20% of the variance. With 15 components, we explain about 95% of the variance.

Finally, the player-by-player plot of component magnitude reminded me of the iconic Joy Division album cover

so just for fun I whipped up a similar visualization using d3.js. The code for that is available here.

What I mean by “eigenspectra” is taking the configuration of the players and/or the ball and applying principal components analysis (PCA) to it, then reconstructing any particular configuration as a linear combination of the top N principal components. This serves two purposes: (i) reducing the dimensionality of the space, so we can describe the configuration of players + ball with fewer coordinates, and (ii) illustrating something about the structure of the game.

The most basic introduction to PCA is usually given with a pair of correlated variables, as shown in this image (straight from wikipedia),

The principal components are new coordinates, shown as the arrows aligned with the data. Technically, they are derived by finding a coordinate system in which the covariance matrix is diagonal, i.e. the measurements expressed in these coordinates are uncorrelated. The principal component along which the data show the largest variance is the most important one, called the first component, and you work your way down, in order of variance, to the least important one. One way of thinking about their meaning is that, if you’re going to use one number to describe a point, you’re better off telling me where it lies along the long axis of the elliptical cloud of points than either x or y by itself.

PCA becomes more powerful and useful when applied to higher-dimensional data. When you’ve got two variables you can plot them in a plane like the one above and basically see for yourself what’s going on, but with higher and higher dimensional data it becomes impossible to visualize them in such a concise way.

An example of this is decomposing galaxy spectra. In astronomy “spectra” (plural of spectrum) refers to the measurement of the intensity of light at particular wavelengths – or energies – of the light. In a broad sense galaxies are made up of lots of stars, gas, and dust. Stars give off light in a thermal spectrum, where the intensity varies smoothly as a function of wavelength; gas in the atmosphere of stars absorbs light at certain wavelengths and makes divots in the spectrum; cold gas backlit by stars absorbs light in other wavelengths; hot gas gives off light in particular wavelengths; and all of these combine to produce a complicated integrated intensity-vs-wavelength spectrum. In order to apply PCA to galaxy spectra, you treat each wavelength bin as a coordinate, forming a vector that might range from, say, 3500 Angstroms to 8000 Angstroms, in steps of 1 Angstrom; or 4500 dimensions.

The really interesting thing about applying this procedure to galaxy spectra – and the thing I thought would apply to basketball player-tracking data – is that each component has a physical meaning: the mean spectrum (PCA is always applied after subtracting out the mean) tells you the average temperature of the stars; the 1st component tells you about hot gas; the 2nd component tells you about light absorbed by cold gas; etc. The other really interesting thing is that it only takes about 10 components to accurately reconstruct a galaxy spectrum. So instead of telling me 4500 numbers – the intensity at each discrete wavelength – you can tell me 10 numbers – the temperature, plus a few things about the gas and dust.

To apply this to basketball I took the court and chopped it up into 2d bins. Each bin gets an index, which is effectively the coordinate (like wavelength in the galaxy example), and a configuration is specified by putting a 1 in a bin if a player is there and a 0 otherwise. I experimented with binning and found that you need bins at least as small as 2×2 feet, and 1×1 is really better. Since a basketball court is approximately 100 ft by 50 ft, this means somewhere between ~1250 (100/2 × 50/2) and 5000 (100 × 50) dimensions. This is a somewhat artificial way of representing the data in order to make it work with the eigenspectra decomposition; it’d be easier, and probably better suited to analysis, to just use the x, y coordinates of the 10 players plus the x, y, z coordinates of the ball.
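The binning scheme can be sketched as a small helper; the function name and the toy positions below are mine, not from the SportVU pipeline:

```python
import numpy as np

def bin_configuration(xy, court_x=100, court_y=50, cell=1):
    """Turn (x, y) positions in feet into a flat 0/1 occupancy vector
    over a court_x-by-court_y court with `cell`-foot square bins."""
    nx, ny = court_x // cell, court_y // cell
    grid = np.zeros((nx, ny))
    for x, y in xy:
        i = min(int(x // cell), nx - 1)  # clamp points on the boundary
        j = min(int(y // cell), ny - 1)
        grid[i, j] = 1.0
    return grid.ravel()

# three players at arbitrary spots -> a 5000-dimensional 0/1 vector
vec = bin_configuration([(47.0, 25.0), (12.3, 8.8), (90.1, 40.2)])
```

Stacking one such vector per tracking frame gives the matrix that PCA decomposes into the mean configuration plus components.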

With all of that said, here is the result of determining the mean + the first 19 principal components, with a 1×1 foot binning,

The top left is a heat map of the mean configuration, and reading across, left to right, shows the first 19 components. What this says about the structure of the game is that players are usually taking free throws or shooting corner threes. The principal components basically show variations on free-throw arrangements, players crossing back and forth under the basket, and shifting from foot to foot on the line. Is this interesting? I’m not really sure. Clearly this is not the result I was hoping for. In follow-up posts I will possibly take a look at variations on this technique, such as measuring position relative to the mean position of all players, focusing on one half of the court, etc. It’s also possible that something interesting would pop out of higher components, but that is going to require more compute power than I have on my laptop. I am considering using AWS or something similar to run a much larger sample of the SportVU data.

]]>

I will describe my entry in more detail below, but here is a screenshot. A fully interactive version is available here. My entry was awarded an honorable mention, which puts it ~~somewhere in the top 10, I think~~ in the top 7.

Just to clarify what my chart is showing: I left my y-axes unlabeled in an attempt to make the visualization less cluttered (and because of time constraints), but my units are always “best of all time”. So when you hover on Rickey Henderson – Steals, you will see the thin blue line go all the way to the top (1 in units of best of all time), and his bar chart in the top left, which plots individual year totals as a function of age, also goes to the top for the 1982 season (when he was 23). In the bar-chart inset for Babe Ruth – Home Runs shown in the screenshot, you can see he had 3 seasons (1920, 1921, 1927) where he was at about 0.8 in units of “best of all time”. You can also see the big dip he had in 1925, when he had health problems of some kind.

The contest was to visualize the careers of the top players in major league baseball history. The rules were open-ended as far as which players to include: they offered a list, but also said that if you had some criteria you wanted to apply and could justify it, then go ahead. I took the top 20 batters, ordered by baseball-reference rWAR, including only WAR accumulated from 1901 to the present. In practice this means I computed career rWAR myself, using my copy of the baseball-reference daily-updated WAR database, and joined that with season-by-season statistics from the Lahman database. The original announcement said one should visualize the careers of the top 10 batters & top 10 pitchers, but after working on my design, I decided to use the top 20 batters. I wrote my visualization from scratch using d3.js.

My first idea was to do a “mountains out of molehills” type chart, e.g. http://bl.ocks.org/enjalot/754c7d061c2d0b71be37

My concern with that was that it’d be difficult to make comparisons from player to player, but it would be interesting to try. A stacked-area or stream chart would also look pretty cool, I think, e.g., http://bl.ocks.org/mbostock/4060954

but would be challenging to make it actually informative, because there are so many statistical categories and such a large span of time.

My next design was a complicated series of grouped and stacked bar charts, kind of a mash up of these two,

http://bl.ocks.org/bdilday/ac943080045043d53971

http://bl.ocks.org/bdilday/7c6277cf3f626d552dbc

This is about as far as I got with that idea,

If it’s not obvious, that’s WAR and PA, by age, for my top 10 batters (can you figure out who is who?). My next step was to make grouped bars for hits (singles, doubles, triples, home runs) and run-related stats (runs and RBIs), then add some sorting options, and then repeat for 10 pitchers, but it seemed like the result was going to end up informative but kind of dry. I happened to be looking at Edward Tufte’s Visual Display of Quantitative Information, which features the Paris-to-Lyon train schedule on the cover,

It’s a really cool-looking visualization, and I wanted to go for a similar aesthetic: densely packed information laid over a fine grid, with some interactive elements for highlighting. After working on my design, I decided that a fine grid of years/percentiles would make it too busy, so I didn’t add that part.

The gold, silver and bronze winners are listed at the graphicacy website, here (note: they have changed the URL, I updated this link on 06/01/2016). One feature I really like that the winners have in common is including a picture of the players.

The gold medal winner had a similar layout to mine, using thin lines to represent the quantities. One difference is that it uses a drop-down menu to isolate one stat, instead of showing all of them and using mouse hover (via Voronoi tessellation) to highlight one; both designs have merit, I think. It also includes a number of different viewing options, like shifting the x-axis from calendar year to career year, and shifting from cumulative to year-by-year totals, which are great additions. It also explicitly allows comparison between two individuals, which is a really nice touch.

The silver medal winner is quite different and has a lot of bells and whistles. It focuses on WAR and uses text to highlight years in which a player led the league (black ink). It also shows a little trophy icon to highlight years a player’s team won the championship, which is a really clever addition.

The bronze medal winner is a static png file that basically uses stacked area charts, colored by player.

If you’re reading this, it’s probably too late to enter – entries were due ~~Nov. 6, 2015~~ (extended to Nov. 8, 2015) – but keep an eye on Graphicacy; it looks like they will organize more of these.

Ever since I saw the announcement I’ve been totally obsessed with this, and I will post my entry here after the judging is complete. I’ve got a few other designs I’m interested in, so if I get time I may put those together also just for fun.

And, if you’ve got a design or an idea, please leave it in the comments, I would love to see what people do with this challenge!
