building CSAA and DRA-value databases using retrosheet

Several weeks ago, the crew at baseball prospectus put out a new pitching metric called DRA, or Deserved Run Average. It is distinctive in building in controls for lots of different factors, for including pitcher catcher framing, and for applying the technique of mixed-effect statistical modeling. The authors have put forth a tremendous amount of effort in explaining their metric, in both technical and non-technical terms, even going so far as to post the R commands they used! I can’t emphasize enough how much I appreciate their general openness, and especially posting actual code. That’s why I called my work on this broadDRA instead of openDRA, because DRA is already essentially open.

In any case, I was curious about the ins and outs of the ambitious new metric and so I’ve been working on reproducing the results starting from retrosheet data. This post just provides a few links to some jupyter notebooks, written in Python and R that show the code for this. If you want to use this code as written, you will need a retrosheet database, with at least year_id and woba_pts columns added (and playoff_flag, but you could probably just comment the selection based on that out of the sql query). You can create a retrosheet database, if you don’t already have one, using py-retrosheet from Wells Oliver, and you can add the additional columns using the retrosheet_sql_tools script that I contributed.

Following are links to my code, as jupyter notebooks, and a tarball that includes the fitted models for 2003-2004 as an example.

I mocked up my own estimate of FRAA, based off of baseball-reference defensive-runs-above-average. This means that running that piece of the code has the additional requirement of having an rWAR database (e.g. from the excellent daily war-update by baseball-reference, and a Lahman database, or some other way to make the conversion from retrosheet ID to baseball-reference ID. Because of these extra dependencies, I write the “pull data for DRA from retrosheet” code below to have both “use FRAA” and “ignore FRAA” options.

Here are the IPython notebooks,
pull data for CSAA from retrosheet

fit the CSAA model, and save it for later

pull data for DRA from retrosheet

read the DRA data, join it with the CSAA data, fit the DRA model, and save it for later on

read in the fitted DRA value model, export as a csv

the results for 1997-2004, are available as a Google doc here,

icpt stands for “intercept”, the ranef intercept of the fitted model. “0” is the baseline model lwts ~ (1|pitcher). The additional columns icpt_dra_X are the model lwts ~ (1|pitcher) + X. icpt_dra_full is the full model, as quoted in the baseball prospectus article.

example of the CSAA and DRA models for 2003-2004, saved as R workspaces, are available in dropbox here,

Finally, here are a couple visualizations, shamelessly adopted from Jason Davies parallel coordinates example.


One thought on “building CSAA and DRA-value databases using retrosheet”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s