In my previous post I shared some code to build DRA-value and retro-CSAA databases, starting from Retrosheet + Baseball-Reference WAR (for fielding) + Lahman (for ID matching) databases, and provided a link to some data for 1997-2004. I’ve since run the models for some additional years and done some analysis of the results.

The first thing to note about my analysis is that CSAA doesn’t seem to have that big of an impact, as long as you have the (1|catcher) term in the mixed-effects model. That is, if you run the model with CSAA and not (1|catcher) it makes a big difference (compared to a baseline), and if you run the model with (1|catcher) and not CSAA it makes a big difference, but running it with baseline+CSAA+(1|catcher) isn’t too different from baseline+(1|catcher). For the 1997-2004 data set, I found it makes a difference of ~4 runs per year in the most extreme cases. So it matters, but much less than other things, and to a first approximation we can leave it out. This is good news since computing CSAA is much more computationally intensive than running the DRA-value model.

With that in mind, here is an updated data set, showing the DRA-value results for a variety of models, for the years 1983-2004.

DRA Models – 1983-2004

icpt_dra_0 refers to the baseline model, lwts ~ (1|pitcher); the others refer to the baseline plus the named additional term. For example, icpt_dra_catcher refers to the model lwts ~ (1|pitcher) + (1|catcher). As before, icpt refers to the random intercept returned by the mixed-effects model, and full refers to the full model specified in the (second-iteration) Baseball Prospectus article. npa refers to number of plate appearances.
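As a quick reference, the naming convention can be sketched as a mapping from result-column names to lme4-style formula strings. The two formulas shown are the ones stated above; any other column would follow the same baseline-plus-one-term pattern:

```python
# Map result-column names to the lme4-style model formulas described
# above: icpt_dra_0 is the baseline, and each other column is the
# baseline plus one additional term.
baseline = "lwts ~ (1|pitcher)"

def variant_formula(term):
    """Formula string for the baseline model plus one additional term."""
    return f"{baseline} + {term}"

models = {
    "icpt_dra_0": baseline,
    "icpt_dra_catcher": variant_formula("(1|catcher)"),
}
print(models["icpt_dra_catcher"])  # lwts ~ (1|pitcher) + (1|catcher)
```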

Ok, so on to the analysis.

Here is a plot of (baseline minus full) vs. year.

In this plot, negative values are cases where the baseline value was better, i.e. adding all the other effects made the DRA estimates of pitcher quality worse. The 99 Rockies are clearly impacted by a huge park effect. The 94 Braves and 03 Dodgers seem to be a mix of park effect and catcher. The 94 Braves and 03 Dodgers are also distinctive for being good team-level staffs, so this would seem to suggest that the pitchers impacted most strongly – in a negative sense – are those on good staffs. It’s a little less clear what to make of the John Tudor 1985 data point.

To focus in on the effect of catcher specifically, here is a plot of (baseline minus baseline+catcher) vs. year.

Again, what jumps out is that the big impacts are for staffs with good overall performance. The small spread in 1990 is due to the fact that the (1|pitcher) + (1|catcher) model did not converge.

So these two plots suggest to me that there may be something odd happening when the staff overall is good.

As a specific example, consider the 94 Braves. Were Maddux and Glavine and Steve Avery and Smoltz and Mark Wohlers really good at getting people out? Or was Javy Lopez a really good framer? The fixed-effects part of the model cannot distinguish between these two alternatives. The model has complete freedom to make the pitchers better and the catcher worse – or vice versa. So the constraint comes from the random-effects part of the model, which essentially applies regression to the mean to the pitcher and catcher intercepts (batter and umpire also, in the full model).

So part of what is happening is that the DRA-value model likes to minimize the combined variance of both pitchers and catchers, and it can best achieve this by splitting the credit between the two. When a staff – like the 94 Braves or 03 Dodgers – is composed exclusively of good-performing pitchers, the model gives a lot of credit to the catcher. This is kind of like minimizing $\sum_i p_i^2/\sigma_p^2 + \sum_j c_j^2/\sigma_c^2$, subject to the constraint $p_i + c_j = \ell_{ij}$ ($p$ is pitchers, $c$ is catchers, and $\ell$ is the linear-weight value of the outcome). The constraint of course doesn’t have a unique solution; the problem is only tractable because of the priors placed on $p$ and $c$ (represented by $\sigma_p$ and $\sigma_c$). On the other hand, when the staff has poor performance overall, the catchers get a lot of the blame, and the pitchers that benefit according to the DRA model are those that have good performances for poor staffs.

This characteristic of the model could explain why the 2004 Jason Schmidt season comes out looking so good – he had a great year for a poor-overall staff. Based on Schmidt’s observed performance, the model would like to give credit to the catchers, but it can’t because that would mean taking credit away from his teammates, who are already contributing a lot to the regularization/regression-to-the-mean term.

In principle this could also explain A.J. Pierzynski going from a great framer with Minnesota circa 2001-2003 to an average framer with S.F. in 2004: he got a lot of credit with Minnesota and then a lot of blame with S.F., and that’s due more to the change in staff – and the way the model apportions credit between pitchers and catchers – than to a change in his framing performance. That said, a quick look at baseball-reference doesn’t show those Twins teams to be much better than average in terms of runs allowed. Another big factor here is the stadium term in the fixed-effects part of the model, which certainly comes into play and which I haven’t looked at in great detail.

So based on that logic, and on the 94 Braves/03 Dodgers/04 Schmidt examples, we would expect that the pitchers that get adjusted the most are good pitchers on good staffs (in a negative sense) and bad pitchers on bad staffs (in a positive sense). Those that would come out well according to DRA are good pitchers on poor staffs and those that would come out worst are poor pitchers on good staffs.

This plot shows the DRA icpt values, for the baseline model, and then for the full model, as a function of the *team* RA9. The full values are offset from the baseline values in order to show the change from one to the next. Red means the full model considers the pitcher worse than the baseline model does, and blue means the full model considers the pitcher better than the baseline model.

It is worth emphasizing that in the above plot, the baseline model is identical to regressing to the mean, so the comparison of the full model isn’t to the raw performance numbers; it’s to regressed true-talent estimates. Also, the units are an offset to the expected linear-weights value, so multiplying by 1000 PA gives a rough estimate of the impact in runs for a full-time/workhorse starter. So, for example, the model says that, in comparison to the regressed-to-the-mean estimate, the combined impact of the other factors – catcher + batters faced + umpire + parks + temperature + inning and score differential + … – is about 18 runs for 97 Maddux, about 20 runs for 96 K. Brown, etc. The direction of the adjustments makes logical sense; if a team had a good RA9, it was more likely to be in a pitcher’s park, and vice versa. So the plot does support my suggestion, but doesn’t demonstrate it conclusively without looking in more detail at the other factors, like stadium/park factors.
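The unit conversion above can be written as a tiny helper; the 1000-PA default is just the rough full-season workhorse workload quoted in the text:

```python
def icpt_to_runs(icpt_offset, pa=1000):
    """Convert a DRA intercept offset, in linear-weight runs per PA,
    to a rough full-season run impact over `pa` plate appearances."""
    return icpt_offset * pa

# An offset of 0.018 lwts/PA over a workhorse season is about 18 runs,
# the scale quoted for 97 Maddux above.
print(round(icpt_to_runs(0.018)))  # -> 18
```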

Another thing to keep in mind is that precisely what Bayesian statistical analysis does is decrease the variance at the cost of increasing the bias. For above-average performances, the Bayesian point estimate of true talent will always always always assume the player had good luck, even though that obviously isn’t true for each and every player; some players had bad luck and still ended up with above-average performances, and we just have no way of knowing which.
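A toy simulation makes the point concrete. This is not the DRA model, just a generic sketch with invented variances: unit-variance Gaussian talent plus unit-variance Gaussian luck, looking at the players whose observed performance came out above average:

```python
import random

random.seed(1)

# True talent and single-season luck; variances are invented for
# illustration, not estimated from baseball data.
players = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

# Players whose *observed* performance (talent + luck) is above average.
above = [(talent, luck) for talent, luck in players if talent + luck > 0]

# On average these players did have good luck, so shrinking their point
# estimates toward the mean is the right move in aggregate...
mean_luck = sum(luck for _, luck in above) / len(above)
print(f"mean luck among above-average performers: {mean_luck:+.2f}")

# ...but a sizable fraction of them actually had bad luck, and the model
# has no way of knowing which ones they are.
frac_unlucky = sum(1 for _, luck in above if luck < 0) / len(above)
print(f"fraction of them with bad luck: {frac_unlucky:.2f}")
```

With these symmetric made-up variances the mean luck among above-average performers comes out clearly positive, while roughly a quarter of them were actually unlucky.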

In working on this, I have read a bunch of references on mixed models (also called multilevel models, among other things). Two references that were really helpful in clarifying what they are and what they do were these:

http://multithreaded.stitchfix.com/blog/2015/07/14/glmms/

http://www.stat.columbia.edu/~gelman/research/published/multi2.pdf

The technical reference for the R lme4 package is pretty dense, but very in-depth and informative:

https://cran.r-project.org/web/packages/lme4/vignettes/Theory.pdf

Very interesting analysis. But I was confused by this statement: “In principle this could also explain A.J. Pierzynski going from a great framer with Minnesota circa 2001-2003 to an average framer with S.F. in 2004.” Doesn’t DRA say AJ was a very poor framer in 2004? Schmidt lost 10 runs to bad framing, and AJ was the catcher for 68% of his PA.

The follow-up DRA article says AJ was a below-average framer in 2004. According to the values I computed, he was -0.5 standard deviations from the mean in 2004. For 2001-2003, with Minnesota, he was +0.59, +2.02, and +2.55.

Not disputing your result, but it’s hard to square -0.5 SD with the notion that SFG catchers were -10 framing runs for Schmidt. That would make them about -70 runs for the season, which must be one of the worst team framing seasons in history. And since AJ did 2/3 of the catching, he would have to be at least -40 or so. Not sure how we make that all add up.

I don’t have the answer. When I fit the model I get different results. My implementation of the model agrees that for 2004, Torrealba comes out as a terrible framer and Pierzynski as a below-average one. Specifically, this means -1.8% and -0.4% probability of a called strike per called pitch (not per PA). The coefficient of CSAA in the fit is -0.4, so I have the total impact of framing on Schmidt at +3.2 runs. The (1|catcher) term is -0.75 runs, the total park effect is +18.3 runs, and the batters faced +17.8 runs. So yeah, I dunno.

Park effect is +18.3 runs for Schmidt?! That’s about +0.7 RA9, or a park factor of something like 116. Seems less than plausible. Same for batters faced. Something is off here, I think.

That’s the overall impact of the stadium term in the fixed effects part of the model. It’s not necessarily relative to zero, however. Stadium is a factor in the regression so I think by definition the coefficients are given relative to a reference factor (the first alphabetical park, ANA, in this case) plus there’s an interaction with handedness and an overall intercept. I will go back and do it more carefully. It is definitely true though that in my implementation of the DRA model, park effect is the primary adjustment to Schmidt’s estimated value, not catcher framing.

This is my accounting for Schmidt 2004:

| average | pitcher | difference  |  total | term                   |
|---------|---------|-------------|--------|------------------------|
| +0.0110 | +0.0099 | -0.00108066 |  -0.98 | bat_home_id            |
| -0.0076 | -0.0066 | +0.00095840 |  +0.87 | bats                   |
| -0.0014 | -0.0043 | -0.00286858 |  -2.60 | bats:stadium           |
| +0.0217 | +0.0196 | -0.00208301 |  -1.89 | batter                 |
| -0.0000 | -0.0008 | -0.00079787 |  -0.72 | catcher                |
| -0.0002 | +0.0035 | +0.00369584 |  +3.35 | csaa                   |
| +0.0007 | +0.0040 | +0.00323553 |  +2.93 | fraa                   |
| -0.0002 | -0.0010 | -0.00084763 |  -0.77 | fraa:bat_home_id       |
| -1.7304 | -1.7304 | +0.00000000 |  +0.00 | icpt                   |
| +0.0185 | +0.0172 | -0.00123189 |  -1.12 | inning                 |
| -0.0070 | -0.0061 | +0.00095282 |  +0.86 | inning:bat_home_id     |
| -0.0002 | -0.0002 | -0.00004242 |  -0.04 | inning:scorediff       |
| -0.0056 | -0.0055 | +0.00007651 |  +0.07 | outs_ct                |
| -0.0040 | -0.0552 | -0.05125184 | -46.49 | pitcher                |
| +0.0119 | +0.0182 | +0.00629637 |  +5.71 | role                   |
| -0.0000 | -0.0006 | -0.00055115 |  -0.50 | score_diff             |
| +0.0075 | +0.0202 | +0.01270269 | +11.52 | stadium                |
| -0.0046 | -0.0032 | +0.00145442 |  +1.32 | start_bases_cd         |
| +0.0065 | +0.0053 | -0.00115305 |  -1.05 | start_bases_cd:outs_ct |
| +1.6904 | +1.6878 | -0.00264211 |  -2.40 | temp_log               |

Hey there:

As I’ve said before, this is a neat effort.

A few things. First, I see that you don’t include the second step of the DRA analysis, which is to run the value results through the earth model to scale the run values, cleanse the noise created by turning value into an average, and account for other factors. I’m not sure that would ultimately change what you’ve found (the value model drives much of it), but it’s worth noting.

In terms of the hypothesis, I’m not sure it holds up. The reason the 1994 Braves have a terrific DRA is that they allowed fewer people on base than anybody else, and it was not close. DRA (and pitcher value), at their heart, are about linear weights allowed, and if you prevent those better than anybody, you’re going to have one of the top DRAs. I don’t think catcher framing necessarily has much to do with that. In fact, Lopez had a CSAA of 0.00840467 in 1994, which was just middle of the road for his career.

As for Schmidt, you seem to concur with us that he had a terrible strike zone to work with in 2004. CSAA, it should be made clear, probably goes somewhat beyond “framing” in 2004 because we don’t have PITCHf/x. So it basically is just about getting (or not getting) extra strikes in some way. That is probably going to bleed into game-calling, but it also has to do with the fact that umpires probably had more individuality back then. But it’s difficult to dispute that his catchers struggled to pull down strikes that year overall, and that it all combined to make 2004 much tougher for him than it should have been.

Similarly, the reason Schmidt has a terrific DRA is because, once again, he kept guys off base better than almost anyone else. SF for some reason was also a launching pad for run-scoring in 2004, which explains why the staff got a bonus from DRA. The staff rated as average by cFIP in true talent, so I don’t think we can dismiss their entire staff as bad on average, and thereby allowing Schmidt to magically stand out.

Again, this is good and fun work. I am sorry it sometimes takes us a long time to respond to inquiries, but, probably like you, we do this in our spare time and there is only so much of it.

*****************

In terms of the hypothesis, I’m not sure it holds up. The reason the 1994 Braves have a terrific DRA is that they allowed fewer people on base than anybody else, and it was not close. DRA (and pitcher value), at their heart, are about linear weights allowed, and if you prevent those better than anybody, you’re going to have one of the top DRAs. I don’t think catcher framing necessarily has much to do with that. In fact, Lopez had a CSAA of 0.00840467 in 1994, which was just middle of the road for his career.

******************

That’s not quite what my argument is. My argument is:

the 1994 Braves were a staff full of really great pitchers, and DRA (or value_pa, more accurately) is taking credit away from Maddux and Smoltz and Glavine and Avery and handing it to Javy Lopez (this also applies to other great staffs, e.g. the 2003 Dodgers).

The technical reason for this is that each team is kind of its own island. You never observe Maddux in isolation; it’s always Maddux + Javy Lopez (or whatever other Braves catcher). Maybe more importantly, you observe Lopez + (a whole bunch of really good pitchers) and you don’t observe Lopez + (a collection of pitchers whose talent matches the overall talent distribution). So if I observe that Maddux + Lopez (after controlling for fixed effects & batter & umpire) have linear-weights outcomes that are -0.08 (making this number up, but it’s approximately correct) below average, how do I split up the credit? Is it Maddux = -0.08, Lopez = 0? Or Maddux = -0.10 and Lopez = +0.02? Or…? The model cannot distinguish these alternatives, so it relies on the Gaussian prior, which will always split the difference and assign credit equally (relative to the variance).
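For a single, fully confounded pitcher+catcher pair, that prior-driven split can be written in closed form. This is a sketch of the regularization logic, not the actual DRA-value fit, and the variance values below are invented for illustration:

```python
def split_credit(lwts, var_pitcher, var_catcher, var_noise):
    """Posterior-mean split of an observed linear-weights deviation
    between a pitcher and a catcher who only ever appear together.

    Minimizes (lwts - p - c)^2/var_noise + p^2/var_pitcher
    + c^2/var_catcher. The data cannot separate p from c, so the
    priors decide: credit is assigned in proportion to prior variance.
    """
    total = var_pitcher + var_catcher + var_noise
    return lwts * var_pitcher / total, lwts * var_catcher / total

# Maddux + Lopez at -0.08 below average (the number from the text),
# with invented prior variances.
p, c = split_credit(-0.08, var_pitcher=0.004, var_catcher=0.002, var_noise=0.002)
print(f"pitcher: {p:+.3f}, catcher: {c:+.3f}")  # pitcher: -0.040, catcher: -0.020
```

Widening the pitcher prior pushes more of the -0.08 onto the pitcher; widening the catcher prior pushes it onto the catcher, which is exactly the good-staff/good-catcher ambiguity described above.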

I should also clarify that it’s not the CSAA term I’m looking at, it’s the (1|catcher) term. Presumably framing is the most important thing that (1|catcher) is capturing, so I referred to it as “framing”. Sorry for the ambiguity.

******************

Similarly, the reason Schmidt has a terrific DRA is because, once again, he kept guys off base better than almost anyone else. SF for some reason was also a launching pad for run-scoring in 2004, which explains why the staff got a bonus from DRA. The staff rated as average by cFIP in true talent, so I don’t think we can dismiss their entire staff as bad on average, and thereby allowing Schmidt to magically stand out.

******************

Yes, when I ran your model I got Schmidt’s park effect for 2004 to be really high. SF came out high – but, I also just noticed Schmidt pitched 80 PA in Coors (with something like a 0.250 wOBA), which is a lot for a visiting pitcher. Maybe that has something to do with his counter-intuitive adjustments?

I think that we are agreeing on this point; when I wrote the post it seemed plausible that good-pitcher-on-bad-staff could be the main factor in Schmidt’s surprisingly high ranking by DRA, but when I dug into the numbers more it seemed to be more of a park effect thing.

thanks!

I did a query to see which pitchers fit the good-pitcher/bad-staff mold. For 1994-2004, here’s what I came up with:

| Pitcher        | year | team | pitcher-wOBA | team-wOBA | delta-wOBA |
|----------------|------|------|--------------|-----------|------------|
| Randy Johnson  | 2004 | ARI  | 0.2536       | 0.3571    | -0.10352   |
| Randy Johnson  | 1995 | SEA  | 0.2671       | 0.3591    | -0.09200   |
| Randy Johnson  | 1994 | SEA  | 0.3021       | 0.3655    | -0.06343   |
| Roy Halladay   | 2002 | TOR  | 0.2955       | 0.3516    | -0.05611   |
| Brad Radke     | 1997 | MIN  | 0.3149       | 0.3699    | -0.05501   |
| Jeff D’Amico   | 2000 | MIL  | 0.3056       | 0.3604    | -0.05483   |
| Tom Gordon     | 1997 | BOS  | 0.2991       | 0.3536    | -0.05447   |
| Rodrigo Lopez  | 2002 | BAL  | 0.3016       | 0.3539    | -0.05225   |
| Curt Schilling | 1999 | PHI  | 0.3055       | 0.3569    | -0.05137   |
| Shawn Chacon   | 2003 | COL  | 0.3219       | 0.3707    | -0.04881   |

So my hypothesis says that DRA is going to claim that these guys suffered from bad framing, and that they will come out better by DRA than by RA9 (although park effects come into play also).
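For reference, the query logic can be sketched in Python over a hypothetical list of per-PA records of the form (pitcher, team, year, wOBA value); the record layout and the minimum-PA cutoff are assumptions, not my actual retrosheet query:

```python
from collections import defaultdict

def delta_woba(plate_appearances, min_pa=500):
    """Per (pitcher, team, year): pitcher wOBA allowed minus the team's
    overall wOBA allowed, sorted so the biggest negative deltas
    (good pitcher on a bad staff) come first."""
    p_sum = defaultdict(lambda: [0.0, 0])  # (pitcher, team, year) -> [woba_total, pa]
    t_sum = defaultdict(lambda: [0.0, 0])  # (team, year) -> [woba_total, pa]
    for pitcher, team, year, woba in plate_appearances:
        p_sum[(pitcher, team, year)][0] += woba
        p_sum[(pitcher, team, year)][1] += 1
        t_sum[(team, year)][0] += woba
        t_sum[(team, year)][1] += 1
    rows = []
    for (pitcher, team, year), (w, n) in p_sum.items():
        if n < min_pa:
            continue  # skip small samples
        tw, tn = t_sum[(team, year)]
        rows.append((pitcher, year, team, w / n, tw / tn, w / n - tw / tn))
    return sorted(rows, key=lambda row: row[-1])
```

Each returned row matches the table’s columns: pitcher, year, team, pitcher-wOBA, team-wOBA, delta-wOBA.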