# Basketball eigenspectra – applying principal component analysis to SportsVU data

In trying to think of things to do with SportsVU player-tracking data if I ever had the chance to work with it, one of the first things that came to mind was to create “eigenspectra”. So I was really excited when I saw that the raw data was available on GitHub and I finally had the chance to give the idea a try.

What I mean by “eigenspectra” is taking the configuration of the players and/or the ball, applying principal component analysis (PCA) to it, and then reconstructing any particular configuration as a linear combination of the top N principal components. This serves two purposes: (i) it reduces the dimensionality of the space, so we can describe the configuration of players+ball with fewer coordinates, and (ii) it illustrates something about the structure of the game.

The most basic introduction to PCA is usually given as a pair of correlated variables, as shown in this image (straight from Wikipedia),

The principal components are new coordinates, shown as the arrows aligned with the data. Technically, they are derived by finding a coordinate system where the covariance matrix is diagonal, i.e. the measurements expressed in these coordinates are uncorrelated. The principal component along which the data show the largest variance is the most important one, called the first component, and you work your way down in order of variance to the least important one. One way of thinking about their meaning is that, if you’re going to use one number to describe a point, you’re better off telling me where it lies along the long axis of the elliptical cloud of points than either x or y by itself.
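Here is a minimal sketch of that two-variable picture in NumPy. The data are made up (a synthetic correlated cloud, not anything from SportsVU), and the function name `pca_via_covariance` is just something chosen for illustration. It diagonalizes the covariance matrix and sorts the resulting axes by variance, which is one common way to compute PCA.

```python
import numpy as np

def pca_via_covariance(points):
    """Return (mean, variances, components) for an (n_samples, n_dims) array.

    Components are rows, sorted from largest to smallest variance.
    """
    mean = points.mean(axis=0)
    centered = points - mean
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order].T

# Synthetic, correlated x/y cloud (assumed data, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 0.8 * x + 0.3 * rng.normal(size=2000)
points = np.column_stack([x, y])

mean, variances, components = pca_via_covariance(points)

# Express every point in the new coordinates; their covariance should be diagonal
projected = (points - mean) @ components.T
projected_cov = np.cov(projected, rowvar=False)
```

In the new coordinates the off-diagonal entries of `projected_cov` come out as zero up to floating-point error, which is the "uncorrelated" property described above, and `variances[0]` is the variance along the long axis of the cloud.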

PCA becomes more powerful and useful when applied to higher dimensional data. When you’ve got two variables you can plot them in a plane like above and basically see for yourself what’s going on, but as the number of dimensions grows it becomes impossible to visualize the data in such a concise way.

An example of this is decomposing galaxy spectra. In astronomy “spectra” (plural of spectrum) refers to the measurement of the intensity of light at particular wavelengths (or energies) of the light. In a broad sense galaxies are made up of lots of stars, gas, and dust. Stars give off light in a thermal spectrum, where the intensity varies smoothly as a function of wavelength; gas in the atmosphere of stars absorbs light at certain wavelengths and makes divots in the spectrum; cold gas backlit by stars absorbs light at other wavelengths; hot gas gives off light at particular wavelengths; and all of these combine to produce a complicated integrated intensity-vs-wavelength spectrum. In order to apply PCA to galaxy spectra, you treat each wavelength bin as a coordinate, forming a vector that might range from, say, 3500 Angstroms to 8000 Angstroms in steps of 1 Angstrom, which gives about 4500 dimensions.

The really interesting thing about applying this procedure to galaxy spectra, and the thing I thought would apply to basketball player-tracking data, is that each component has a physical meaning: the mean spectrum (PCA is always applied after subtracting out the mean) tells you the average temperature of the stars; the 1st component tells you about hot gas; the 2nd component tells you about light absorbed by cold gas; and so on. The other really interesting thing is that it only takes about 10 components to accurately reconstruct the galaxy spectrum. So instead of telling me 4500 numbers (the intensity at each discrete wavelength) you can tell me 10 numbers (the temperature, plus a few things about the gas and dust).
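To make the "few numbers instead of thousands" idea concrete, here is a rough sketch of compressing and reconstructing high-dimensional vectors with the top N components. Everything in it is an assumption for illustration: the "spectra" are random mixtures of three hidden templates plus a little noise, not real galaxy data, and the helper names (`fit_pca`, `compress`, `reconstruct`) are made up. It uses the SVD of the mean-subtracted data, which gives the same components as diagonalizing the covariance matrix.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the mean vector and the top n_components principal axes (as rows)."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(data, mean, components):
    """Describe each sample by its coefficients along the kept components."""
    return (data - mean) @ components.T

def reconstruct(coeffs, mean, components):
    """Rebuild full-length vectors as mean + linear combination of components."""
    return mean + coeffs @ components

def relative_error(data, approx, mean):
    """Reconstruction error relative to the spread of the data about its mean."""
    return np.linalg.norm(data - approx) / np.linalg.norm(data - mean)

# Synthetic stand-in for spectra: 100 samples, 500 "wavelength bins",
# each a mixture of 3 hidden templates plus small noise (assumed data)
rng = np.random.default_rng(1)
templates = rng.normal(size=(3, 500))
weights = rng.normal(size=(100, 3))
data = weights @ templates + 0.01 * rng.normal(size=(100, 500))

mean, comps3 = fit_pca(data, n_components=3)
coeffs3 = compress(data, mean, comps3)          # 3 numbers per sample
error3 = relative_error(data, reconstruct(coeffs3, mean, comps3), mean)

_, comps1 = fit_pca(data, n_components=1)
coeffs1 = compress(data, mean, comps1)          # 1 number per sample
error1 = relative_error(data, reconstruct(coeffs1, mean, comps1), mean)
```

Because this toy data really is built from three templates, three coefficients per sample reconstruct it almost perfectly while one coefficient does not. Real data won't be this clean, so how many components you need is something to check empirically.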

To apply this to basketball I took the court and chopped it up into 2d bins. Each bin gets an index which is effectively the coordinate (like wavelength in the galaxy example), and a configuration is specified by putting a 1 in a bin if a player is there and 0 otherwise. I experimented with binning and found that you need to use bins at least as small as 2×2 feet, and 1×1 is really better. Since a basketball court is approximately 100 ft by 50 ft, this means somewhere between ~1250 (100/2 × 50/2) and 5000 (100 × 50) dimensions. This is sort of an artificial way of representing the data in order to make it work with the eigenspectra decomposition. It’d be easier and probably better suited to analysis to just use the x,y coordinates of the 10 players plus the x,y,z coordinates of the ball.
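The binning step could look something like the sketch below. This is not the code used for this post, just one way to do it under some assumptions: the court is treated as exactly 100 ft by 50 ft, positions are (x, y) in feet measured from one corner, out-of-bounds positions are clamped to the nearest edge bin, and the function name `occupancy_vector` and the player positions are made up.

```python
import numpy as np

COURT_LENGTH_FT = 100  # approximate court size used in the text above
COURT_WIDTH_FT = 50

def occupancy_vector(positions, bin_size=1.0):
    """Turn (x, y) positions in feet into a flat 0/1 vector with one entry per bin."""
    n_x = int(COURT_LENGTH_FT / bin_size)
    n_y = int(COURT_WIDTH_FT / bin_size)
    grid = np.zeros((n_x, n_y), dtype=np.uint8)
    for x, y in positions:
        # Clamp so a player standing slightly out of bounds lands in an edge bin
        ix = min(max(int(x // bin_size), 0), n_x - 1)
        iy = min(max(int(y // bin_size), 0), n_y - 1)
        grid[ix, iy] = 1
    # Flatten row by row: the entry for bin (ix, iy) sits at index ix * n_y + iy
    return grid.ravel()

# Five made-up player positions, in feet
positions = [(25.3, 10.0), (19.0, 25.0), (5.5, 3.2), (47.0, 40.1), (30.0, 25.9)]

vec_1ft = occupancy_vector(positions, bin_size=1.0)  # 100 * 50 = 5000 entries
vec_2ft = occupancy_vector(positions, bin_size=2.0)  # 50 * 25 = 1250 entries
```

Stacking one such vector per moment of tracking data gives the (samples × bins) matrix that PCA is run on. Note that two players in the same bin still produce a single 1, which is one of the ways this representation throws information away.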

With all of that said, here is the result of determining the mean + the first 19 principal components, with a 1×1 foot binning,

The top left is a heat map of the mean configuration, and reading across left to right shows the first 19 components. What this says about the structure of the game is that players are usually taking free throws or shooting corner threes. The principal components basically show variations on free throw arrangements, players crossing back and forth under the basket, and shifting from foot to foot on the line. Is this interesting? I’m not really sure. Clearly this is not the result I was hoping for. In follow-up posts I will possibly take a look at variations on this technique, such as measuring position relative to the mean position of all players, focusing on one half of the court, etc. It’s also possible that something interesting would pop out of higher components, but that is going to require more compute power than I have on my laptop. I am considering using AWS or something similar to run a much larger sample of the SportsVU data.