### True talent estimate using Gaussian process regression

A while back I wrote about estimating true talent using two correlated samples. The resulting equations, for talent $t_1$ in time period one, and talent $t_2$ in time period two, are,

$t_1 = \frac{x_1 (\frac{1}{\sigma_1^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}) + x_2(\frac{\rho}{\sigma_2^2}) + t_0 (\frac{1}{\sigma_t^2} + \frac{(1-\rho)}{\sigma_2^2}) } {\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} + \frac{1}{\sigma_t^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}}$
$t_2 = \frac{x_2 (\frac{1}{\sigma_2^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}) + x_1(\frac{\rho}{\sigma_1^2}) + t_0 (\frac{1}{\sigma_t^2} + \frac{(1-\rho)}{\sigma_1^2}) } {\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} + \frac{1}{\sigma_t^2} + \frac{(1-\rho^2) \sigma_t^2}{\sigma_1^2 \sigma_2^2}}$

As an example, let’s look at Mike Trout’s 2012 & 2013 seasons. His wOBA was 0.409 in 639 PA, and 0.423 in 716 PA in 2012 & 2013 respectively. For this I will assume that the mean true talent for wOBA is 0.315 and the standard deviation of the true-talent distribution is 0.030. Applying these to the equations above, my estimates of true talent for year 1 (2012) and year 2 (2013), as a function of $\rho$,  will be,

As we would logically expect, when $\rho = 0$, it’s exactly the same as regressing each season to the mean independently, and when $\rho=1$, it’s the same as pooling the two seasons into one observation of 639+716=1355 PA. Now, there’s no reason we have to stop at two seasons, or two intervals of time; I could estimate true talent in 2013 based on 2012, 2013, 2014, 2015… Moreover, I could break each season into halves and estimate true talent in the latter half of 2013 based on the first half of 2012, the second half of 2012, the first half of 2013, etc… The logical endpoint of this process is to take true talent as something that isn’t fixed over a year or fraction of a year, but that changes from moment to moment, and to estimate what its value is by correlating the value from one instant to the next.  This is where Gaussian processes come in; but before getting to that, I want to motivate their use by looking at Gaussian distributions.

Let’s look at the Mike Trout 2012 & 2013 example again, from a different angle. Before 2012, if I wanted to estimate Mike Trout’s true talent for wOBA, since I don’t have any observations, the best I could do is to assume his true talent will match my prior, which I’m taking to be mean wOBA = 0.315, with standard deviation 0.030. That distribution looks like this, Once I observe Mike Trout to have a 0.409 wOBA in 639 PA, I have a new, posterior, distribution for what I think his true talent is. Since the statistical uncertainty on wOBA is $\approx \sqrt{\frac{0.5}{PA}}$, or 0.028 in this case,  the posterior distribution has mean, $\mu = [0.409/0.028^2 + 0.315/0.03^2 ]/[1/0.028^2 + 1/0.03^2] \approx 0.365$, with standard deviation, $\sigma = \sqrt{(1/0.028^2 + 1/0.03^2)^{-1}} \approx 0.020$. This new distribution, compared to the old, looks  like this,

So now suppose I have 2012 & 2013 wOBA data and I want to estimate true talent in each year. Again, a priori I don’t know what Mike Trout’s true talent is, but I can apply a prior expectation, which is a mean of 0.315 and a standard deviation of 0.03.  The crucial point here is that I don’t take these seasons to be independent, I expect that if his true talent is above average in one season, it is probably above average in the other one as well, and visa versa. As detailed in my post on estimating true talent from correlated samples, this means that the covariance matrix in the true talent prior has non-zero off-diagonal terms. I’m not saying anything at this point about how strong the correlation of true talent from season to season is, but for the sake of argument let’s say it $\rho = 0.5$. The contours of the resulting prior true talent distribution are shown in the left image below, and the prior & posterior distribution in the right image. The crosshair is the 2012 & 2013  data.

The same is shown for $\rho = 0$ (left) and $\rho = 0.95$ (right) below.

So that’s the basic idea; I have a prior distribution for true talent that includes a correlation, I make some (noisy) observations, and then I get a posterior distribution for true talent. If I apply the same logic but treat time as a continuous variable, i.e one that has an infinite number of possible values, then I have a Gaussian distribution in infinite dimensions; a Gaussian process.

The key in using Gaussian process regression is that, because the Normal distribution has a lot of nice mathematical properties (namely that the marginal distributions over any set of variables is also Normal) , you can get closed form solutions for the posterior mean function (i.e. infinite dimensional vector) and covariance function. Formulas are given in e.g. Gaussian Process for Machine Learning by Rasmussen and Williams. To do the computations here, I am using the Gaussian process tools in scikit-learn. As of this writing the stable version is 0.17, and the dev version is 0.18, and I highly recommend using the version 0.18 Gaussian process module, its interface and functionality is very significantly better.

Although I began by looking at Mike Trout’s first few years, I’m going to switch it up now and look at Pedro Martinez’s career wOBA on a start by start basis. This is mainly because this example has been looked at several times, e.g. when I looked at the fat tails of true talent, by Tom Tango a couple different times (e.g., here), and recently by Neil Paine and Jay Boice at 538, and because a game for a pitcher has less statistical noise than a game for a batter. For this analysis, I am using a prior with the mean vale of wOBA = 0.32.

To do a Gaussian process regression you essentially have to specify a covariance function; how strongly correlated is the underlying true talent from moment to moment. The most convenient covariance function is the radial basis function, or squared exponential, which has the same form as a Gaussian distribution function. This has two parameters, the overall scale of the true talent variance, and the length scale of the true talent covariance. In addition, noise on the measurements is handled by adding a term to the diagonal of the covariance matrix, which I am fixing to be $0.01 \approx (0.32 (1.1-0.32))/24$. The graphs below show what the mean of the true talent posterior distribution looks like for a variety of length scales, with the overall value set to a standard deviation of 25 wOBA points. The black dots are Pedro’s game-by-game wOBA in starts, the black line is the mean of the posterior true talent distribution and the blue ribbon is proportional to the diagonal of the posterior covariance matrix, i.e. the marginal uncertainty on the true talent estimate.

Some qualitative features are that, when the length scale is small, i.e. true talent from moment to moment is not strongly correlated, then the estimate fluctuates rapidly, and in regions where there are no observations, returns quickly to the prior, wOBA=0.32. On the other extreme, where the length scale is very large, i.e. true talent is very strongly correlated at different times, the estimate becomes the overall mean for Pedro’s career, wOBA ~ 0.280. The estimate of 25 points wOBA for the standard deviation of true talent came from looking at seasonal values, the graphs below show the true talent estimates assuming the standard deviation from start to start was larger, specifically 50 points of wOBA.

For this particular use case – a pitchers true talent over his career – I was interested in a model that has short-term correlations that fade away quickly, but that also has a lesser long term correlation that persists over the course of a career. In terms of Gaussian process covariance functions, this means defining a kernel with a decaying correlation -like a radial basis function – but with a constant added on. An example is shown below,

Using this covariance function, and again varying the length scale gives the following for the true talent estimates,

Gaussian process regression is an appealing method for the problem of estimating true talent because it’s flexible and doesn’t require specifying a certain functional form for true talent as a function of time; and because it deals directly with the quantity we usually think of in terms of true talent – the correlation from time interval to time interval. One deficiency in what I’ve shown above is that the prior assumes a mean that’s equal to the league mean; where the model would be improved by considering age, park factors, etc…

Ok, so the big question is, when was Pedro’s peak? It depends on how you model the data, but if I use a model where the standard deviation of true talent is 50 points of wOBA and vary the length scale from 100 days to 1200 days, then I get a mean estimate of Aug 19, 2000, with a standard error of about 10 days.