In my previous post I posted some code to make DRA-value and retro-CSAA databases, starting from the Retrosheet, Baseball-Reference WAR (for fielding), and Lahman (for ID matching) databases, and provided a link to some data for 1997-2004. I've since run the models for some additional years, and done some analysis of the results.
The first thing to note from my analysis is that CSAA doesn't have that big an impact, as long as the (1|catcher) term is in the mixed-effects model. That is, if you run the model with CSAA and not (1|catcher) it makes a big difference (compared to a baseline), and if you run it with (1|catcher) and not CSAA it makes a big difference, but baseline + CSAA + (1|catcher) isn't much different from baseline + (1|catcher). For the 1997-2004 data set, I found the difference is ~4 runs per year in the most extreme cases. So it matters, but much less than other things, and to a first approximation we can leave it out. This is good news, since computing CSAA is much more computationally expensive than running the DRA-value model.
With that in mind, here is an updated data set, showing the DRA-value results for a variety of models, for the years 1983-2004.
DRA Models – 1983-2004
icpt_dra_0 refers to the baseline model, lwts ~ (1|pitcher), the others refer to the baseline + the additional term. For example, icpt_dra_catcher refers to the model lwts ~ (1|pitcher) + (1|catcher). As previously, icpt refers to the random intercept returned by the mixed effects model, and full refers to the full model specified in the (second iteration) baseball prospectus article. npa refers to number-of-plate-appearances.
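For readers who'd rather see the mechanics than the formula notation, here is a minimal numeric sketch (in Python rather than R, with made-up data, and with the variance components assumed known rather than estimated as lme4 would actually do) of what a random intercept like (1|pitcher) amounts to: each pitcher's mean linear-weight value, shrunk toward the overall mean by an amount that depends on how many plate appearances he has.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up per-PA linear-weight outcomes, keyed by pitcher id (10 pitchers).
pitchers = rng.integers(0, 10, 2000)
true_icpt = rng.normal(0.0, 0.02, 10)            # true pitcher effects
lwts = true_icpt[pitchers] + rng.normal(0.0, 0.25, 2000)

sigma2_e = 0.25**2   # residual (per-PA) variance, assumed known here
sigma2_p = 0.02**2   # pitcher-to-pitcher variance, assumed known here

mu = lwts.mean()
icpt = np.zeros(10)
for i in range(10):
    y_i = lwts[pitchers == i]
    n_i = len(y_i)
    # BLUP / regression-to-the-mean: shrink the raw mean toward mu,
    # more heavily when n_i is small relative to sigma2_e / sigma2_p.
    k = n_i / (n_i + sigma2_e / sigma2_p)
    icpt[i] = k * (y_i.mean() - mu)
```

lme4 additionally estimates the variance components from the data and fits all the random effects jointly; they're fixed here just to make the shrinkage step visible.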
Ok, so on to the analysis.
In this plot, negative values are cases where the baseline value was better, i.e. where adding all the other effects made the DRA estimates of pitcher quality worse. The 99 Rockies are clearly impacted by a huge park effect. The 94 Braves and 03 Dodgers seem to be a mix of park effect and catcher. The 94 Braves and 03 Dodgers are also distinctive for being good team-level staffs, which suggests that the pitchers impacted most strongly – in a negative sense – are those on good staffs. It's a little less clear what to make of the John Tudor 1985 data point.
Again, what jumps out is that the big impacts are for staffs with good overall performance. The small spread in 1990 is due to the fact that the (1|pitcher) + (1|catcher) model did not converge for that year.
So these two plots suggest to me that there may be something odd happening when the staff overall is good.
As a specific example, consider the 94 Braves. Were Maddux, Glavine, Steve Avery, Smoltz, and Mark Wohlers really good at getting people out? Or was Javy Lopez a really good framer? The fixed-effects part of the model cannot distinguish between these two alternatives: it has complete freedom to make the pitchers better and the catcher worse, or vice versa. So the constraint comes from the random-effects part of the model, which essentially applies regression-to-the-mean to the pitcher and catcher intercepts (and to batter and umpire in the full model).
So part of what is happening is that the DRA-value model likes to minimize the combined variance of both pitchers and catchers, and it can best achieve this by splitting the credit between the two. When a staff – like the 94 Braves or 03 Dodgers – is composed exclusively of good-performing pitchers, the model gives a lot of credit to the catcher. This is kind of like minimizing $\sum_i p_i^2/\sigma_p^2 + \sum_j c_j^2/\sigma_c^2$, subject to the constraint $p_i + c_j = \ell_{ij}$ (where $p_i$ are the pitcher effects, $c_j$ are the catcher effects, and $\ell_{ij}$ is the linear-weight value of the outcome). The constraint of course doesn't have a unique solution; the problem is only tractable because of the priors placed on $p$ and $c$ (represented by $\sigma_p$ and $\sigma_c$). On the other hand, when the staff has poor performance overall, the catchers get a lot of the blame, and the pitchers that benefit according to the DRA model are those that have good performances for poor staffs.
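The credit-splitting can be seen concretely in a small simulation (made-up numbers, with a ridge penalty standing in for the Gaussian priors): with indicator columns for pitchers and catchers, the unpenalized least-squares problem is singular, and once the penalty is added, a staff of uniformly good pitchers leaks some of its shared credit into the catcher estimates.

```python
import numpy as np

# Toy version of the pitcher/catcher credit-assignment problem:
# 5 pitchers, 2 catchers, all sizes and effect values made up.
rng = np.random.default_rng(0)
n_pitchers, n_catchers, n_obs = 5, 2, 400

pitcher = rng.integers(0, n_pitchers, n_obs)
catcher = rng.integers(0, n_catchers, n_obs)

# A uniformly good staff (all negative lwts effects), truly neutral catchers.
true_p = np.array([-0.02, -0.03, -0.025, -0.015, -0.02])
true_c = np.array([0.0, 0.0])
lwts = true_p[pitcher] + true_c[catcher] + rng.normal(0, 0.05, n_obs)

# Design matrix: one indicator column per pitcher and per catcher.
X = np.zeros((n_obs, n_pitchers + n_catchers))
X[np.arange(n_obs), pitcher] = 1.0
X[np.arange(n_obs), n_pitchers + catcher] = 1.0

# Without a prior the normal equations are singular: adding a constant to
# every pitcher and subtracting it from every catcher leaves X @ beta unchanged.
assert np.linalg.matrix_rank(X.T @ X) < n_pitchers + n_catchers

# A ridge penalty (stand-in for the random-effects priors) makes the
# solution unique by splitting the shared credit between the two groups.
lam = 5.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ lwts)
p_hat, c_hat = beta[:n_pitchers], beta[n_pitchers:]
```

Even though the true catcher effects are zero here, `c_hat` comes out negative: part of the staff-wide signal gets assigned to the catchers, which is exactly the behavior described above for the 94 Braves.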
This characteristic of the model could explain why the 2004 Jason Schmidt season comes out looking so good – he had a great year for a poor-overall staff. Based on Schmidt’s observed performance, the model would like to give credit to the catchers, but it can’t because that would mean taking credit away from his teammates, who are already contributing a lot to the regularization/regression-to-the-mean term.
In principle this could also explain A.J. Pierzynski going from a great framer with Minnesota circa 2001-2003 to an average framer with S.F. in 2004: he got a lot of credit with Minnesota and then a lot of blame with S.F., and that's due more to the change in the staff – and the way the model apportions credit between pitchers and catchers – than to a change in his framing performance. That said, a quick look at Baseball-Reference doesn't show those Twins teams to be much better than average in terms of runs allowed. Another big factor here is the stadium term in the fixed-effects part of the model, which certainly contributes, and which I haven't looked at in great detail.
So based on that logic, and on the 94 Braves/03 Dodgers/04 Schmidt examples, we would expect that the pitchers that get adjusted the most are good pitchers on good staffs (in a negative sense) and bad pitchers on bad staffs (in a positive sense). Those that would come out well according to DRA are good pitchers on poor staffs and those that would come out worst are poor pitchers on good staffs.
This plot shows the DRA icpt values, for the baseline model, and then for the full model, as a function of the team RA9. The full values are offset from the baseline values in order to show the change from one to the next. Red means the full model considers the pitcher worse than the baseline model does, and blue means the full model considers the pitcher better than the baseline model.
It is worth emphasizing that in the above plot, the baseline model is identical to regressing to the mean, so the comparison of the full model isn't to the raw performance numbers, it's to regressed true-talent estimates. Also, the units are an offset to the expected linear-weight value, so multiplying by 1000 PA gives a rough estimate of the impact in runs for a full-time/workhorse starter. So, for example, the model says that, in comparison to the regressed-to-the-mean estimate, the combined impact of the other factors – catcher + batters faced + umpire + parks + temperature + inning and score differential + … – is about 18 runs for 97 Maddux, about 20 runs for 96 K. Brown, etc. The direction of the adjustments makes logical sense; if a team had a good RA9, it was more likely to be in a pitcher's park, and vice versa. So the plot supports my suggestion, but doesn't demonstrate it conclusively without looking in more detail at the other factors, like stadium/park factors.
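The runs conversion is just the per-PA offset times a workload. As a sketch (the 0.018 and 0.020 offsets are the values implied by the 18- and 20-run figures above, not exact model output):

```python
# Back-of-the-envelope conversion from a DRA icpt offset (linear-weight
# runs per PA) to full-season runs for a workhorse starter (~1000 BF).
def runs_impact(icpt_offset_per_pa, batters_faced=1000):
    return icpt_offset_per_pa * batters_faced

maddux_97 = runs_impact(0.018)   # ~18 runs
brown_96 = runs_impact(0.020)    # ~20 runs
```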
Another thing to keep in mind is that decreasing the variance at the cost of increasing the bias is precisely what this kind of Bayesian analysis does. For above-average performances, the Bayesian point estimate of true talent will always always always assume the player had good luck, even though that obviously isn't true for each and every player; some players had bad luck and still ended up with an above-average performance, we just have no way of knowing which ones.
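This trade-off is easy to see in a small simulation (made-up talent and luck scales, normal-normal shrinkage): every above-average observed performance gets pulled back toward the mean, i.e. treated as partly good luck, even though a nonzero fraction of those players actually got unlucky.

```python
import numpy as np

rng = np.random.default_rng(2)

tau2, sigma2 = 1.0, 4.0    # made-up talent and luck variances
talent = rng.normal(0.0, np.sqrt(tau2), 100_000)
observed = talent + rng.normal(0.0, np.sqrt(sigma2), 100_000)

# Posterior mean of talent given one observation (normal-normal model):
# shrink the observation toward the league mean of zero.
shrink = tau2 / (tau2 + sigma2)
estimate = shrink * observed

above = observed > 0
# Every above-average performance is shrunk down, i.e. assumed partly lucky...
all_shrunk = bool(np.all(estimate[above] < observed[above]))
# ...even though some of those players actually underperformed their talent
# (noise was negative despite an above-average result).
unlucky_share = float(np.mean(talent[above] > observed[above]))
```

The point estimates minimize average error across all players; for any individual player the assumption of good luck can be, and often is, wrong.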
In working on this, I have read a bunch of references on mixed models (also called multilevel models, among other things). Two references that were really helpful in clarifying what they are and what they do were these:
The technical reference for the R lme4 package is pretty dense, but very in-depth and informative.