Power ranking & multi-year park effects using Markov chain Monte Carlo

I’ve recently had the opportunity to use pymc, a Python module for Bayesian statistical modeling, including hierarchical models. It’s so much fun to use that I find myself looking for models to apply it to. One example is a power-ranking model that includes a home field advantage and a park effect. The approach is very general, but I will apply it to baseball, where the concept of a park effect is most meaningful. Power-ranking models are on my mind because this is the first year I participated in the Kaggle March Madness competition (my model finished 44th), and a non-linear power-ranking model with a home-court advantage is the basic approach I used (although not with pymc). I am using the term power-ranking here in the very specific sense that the output of the model tells you the probability that team A will beat team B, which is terminology I picked up from Chris Long.

The basic idea behind Bayesian hierarchical modeling is that you define random variables as being generated from some combination of likelihood and prior distributions, and then make inferences about the parameters of your model by sampling from the posterior distribution, e.g. using MCMC. I’ve defined the baseball power-ranking model in the following way,

$\alpha_i \sim N(1, \tau_{\alpha})$

$\beta_i \sim N(1, \tau_{\beta})$

$A \sim N(4.5, \tau_{A})$

$h \sim N(0, \tau_{h})$

$\rho \sim N(0, \tau_{\rho})$

$\pi \sim Beta(\alpha=1, \beta=1)$

$\mu^{s}_{ij} = A \times \alpha_i \times \beta_j \times (1 + h) \times (1 + \rho_i)$

$\mu^{a}_{ij} = A \times \alpha_j \times \beta_i \times (1 - h) \times (1 + \rho_i)$

$RS_{ij} \sim Poisson0(\mu^{s}_{ij}, \pi)$

$RA_{ij} \sim Poisson0(\mu^{a}_{ij}, \pi)$,

where $Poisson0$ is a zero-inflated Poisson distribution,

$P(0 \mid \lambda, \pi) = \pi + (1-\pi) e^{-\lambda}$

$P(n \mid \lambda, \pi) = (1-\pi) \frac{\lambda^n e^{-\lambda}}{n!}, \quad n \geq 1$
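
To make the zero-inflated Poisson concrete, here is a minimal sketch of its probability mass function in plain Python. The function name `zip_pmf` and the values of `lam` and `pi` are illustrative assumptions, not part of the fitted model. It checks that the probabilities sum to one and that the mean is $(1-\pi)\lambda$, i.e. the zero inflation pulls the expected run total slightly below $\lambda$.

```python
import math

def zip_pmf(n, lam, pi):
    """Zero-inflated Poisson probability of observing n (hypothetical helper)."""
    poisson = math.exp(-lam) * lam ** n / math.factorial(n)
    if n == 0:
        # Extra point mass at zero, plus the ordinary Poisson zero
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson

# Illustrative values only, not the fitted parameters
lam, pi = 4.5, 0.05

# Truncate the support at 60; the tail beyond that is negligible for this lam
probs = [zip_pmf(n, lam, pi) for n in range(60)]
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))

print("total probability:", total)
print("mean:", mean, "vs (1 - pi) * lam:", (1.0 - pi) * lam)
```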

The parameters are defined as follows: $\alpha_i$ is the offensive strength of team i; $\beta_i$ is the defensive strength of team i (it multiplies the opponent's runs, so lower is better); $A$ is an overall scale; $\rho_i$ is a park effect for the stadium belonging to team i; $h$ is the home field advantage; $\mu^{s}_{ij}$ and $\mu^{a}_{ij}$ are the mean runs scored and allowed, respectively, when team i is at home against team j; and $RS_{ij}$ and $RA_{ij}$ are the corresponding observed runs scored and allowed. The $\tau$ parameter in the Normal distributions is the inverse variance (the precision), and the values have been set to $\tau_{\alpha} = \tau_{\beta} = 100$, $\tau_{A} = 10$, $\tau_{h} = 100$, $\tau_{\rho} = 100$. A precision of 100 corresponds to a prior standard deviation of 0.1.
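
As a rough illustration of how these pieces fit together, the sketch below converts a precision $\tau$ into a standard deviation and evaluates the two mean-run formulas for a single matchup. The helper names `prior_sd` and `expected_runs`, and the example strengths, are assumptions made up for this illustration. This is not the pymc model itself, only the deterministic part of it.

```python
import math

def prior_sd(tau):
    """Standard deviation implied by a Normal precision tau = 1 / variance."""
    return 1.0 / math.sqrt(tau)

def expected_runs(A, alpha_home, beta_home, alpha_away, beta_away, h, rho):
    """Mean runs scored (mu_s) and allowed (mu_a) by the home team.

    rho is the park effect of the home team's stadium, h the home advantage.
    """
    mu_s = A * alpha_home * beta_away * (1.0 + h) * (1.0 + rho)
    mu_a = A * alpha_away * beta_home * (1.0 - h) * (1.0 + rho)
    return mu_s, mu_a

# Prior widths implied by the precisions quoted in the text
for name, tau in [("alpha/beta/h/rho", 100.0), ("A", 10.0)]:
    print(name, "prior sd:", prior_sd(tau))

# Hypothetical matchup: a strong home offense against an average visitor
mu_s, mu_a = expected_runs(
    A=4.5, alpha_home=1.15, beta_home=1.0,
    alpha_away=1.0, beta_away=1.0, h=0.01, rho=0.05,
)
print("mu_s:", mu_s, "mu_a:", mu_a)
```

Because the park effect multiplies both means, it raises or lowers the run environment for both teams without changing who is favored, while the home advantage shifts runs from the visitor to the home team.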

The results of my model for 2012-2015 MLB data, after taking 1500 samples from the posterior probability distribution (not a really big number, but about as much as my laptop can handle), are (mean & standard deviation)

home = 0.010 +- 0.003

$\pi$= 0.053 +- 0.002

A = 4.32 +- 0.035

Top 5 offenses ($\alpha$):

TOR 2015 1.22

SLN 2013 1.19

ANA 2012 1.18

BOS 2013 1.17

SLN 2012 1.15

Top 5 defenses ($\beta$, lower means fewer runs allowed):

TBA 2012 0.84

SLN 2015 0.84

SEA 2014 0.86

BAL 2014 0.87

KCA 2013 0.87

Park effect ($\rho$), highest to lowest:

0.215 DEN02
0.127 SYD01
0.102 BOS07
0.082 BAL12
0.072 ARL02
0.066 MIN04
0.065 MIL06
0.060 TOR02
0.058 CHI12
0.056 DET05
0.054 KAN06
0.038 NYC21
0.038 CLE08
0.024 CHI11
0.016 PHO01
0.013 CIN09
0.001 WAS11
-0.004 HOU03
-0.007 PHI13
-0.008 MIA02
-0.009 STP01
-0.041 ATL02
-0.052 OAK01
-0.052 TOK01
-0.054 STL10
-0.059 ANA01
-0.078 PIT08
-0.100 SFO03
-0.100 NYC20
-0.115 LOS03
-0.126 SEA03
-0.129 SAN02
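
Since the point of a power ranking is the probability that one team beats another, here is one possible way to turn the numbers above into a win probability. This is a sketch under several assumptions that are not part of the model description: the posterior means are plugged in directly (ignoring posterior uncertainty), runs scored and allowed are treated as independent, ties are handled by conditioning on a decisive result, and both teams are taken to be exactly average ($\alpha = \beta = 1$). The helper names are hypothetical.

```python
import math

def zip_pmf(n, lam, pi):
    """Zero-inflated Poisson probability of n runs (hypothetical helper)."""
    poisson = math.exp(-lam) * lam ** n / math.factorial(n)
    return pi + (1.0 - pi) * poisson if n == 0 else (1.0 - pi) * poisson

def home_win_prob(mu_s, mu_a, pi, max_runs=40):
    """P(home team outscores visitor), conditional on the game not being tied.

    Assumes independent zero-inflated Poisson run totals, truncated at max_runs.
    """
    p_home = [zip_pmf(n, mu_s, pi) for n in range(max_runs + 1)]
    p_away = [zip_pmf(n, mu_a, pi) for n in range(max_runs + 1)]
    win = sum(p_home[i] * p_away[j]
              for i in range(max_runs + 1) for j in range(i))
    loss = sum(p_home[i] * p_away[j]
               for i in range(max_runs + 1) for j in range(i + 1, max_runs + 1))
    return win / (win + loss)

# Posterior means quoted above
A, h, pi = 4.32, 0.010, 0.053
parks = {"DEN02": 0.215, "SAN02": -0.129}

results = {}
for park, rho in parks.items():
    # Two exactly average teams (alpha = beta = 1), an assumption for illustration
    mu_s = A * (1.0 + h) * (1.0 + rho)
    mu_a = A * (1.0 - h) * (1.0 + rho)
    results[park] = home_win_prob(mu_s, mu_a, pi)
    print(park, "mu_s:", mu_s, "mu_a:", mu_a, "P(home win):", results[park])

p_den, p_san = results["DEN02"], results["SAN02"]
```

The park effect changes the expected run totals considerably between the two stadiums, but because it scales both teams equally, the home win probability for evenly matched teams stays close to what the home field advantage alone implies.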