# NFL Markov: 4 of n (the yards gained distributions (transition matrix))

In football, a basic state consists of a set of down-distance-yardline values. One could also include score differential or time remaining, but I’m not considering those here. The transition matrix can be built from the yards-gained distribution, together with the probabilities of running a play as opposed to punting or attempting a field goal. So the main questions I want to address here are:

• what are the probabilities for passing versus running versus kicking?
• given that a play is run, what is the yards gained distribution?

I am ignoring some context-specific distinctions such as time on the clock, score differential, and weather. This is very much in the same vein as something like RE24 (from the more developed field of baseball analytics) in that it accounts for game state, but is otherwise context neutral.

With that being said, it is not immediately obvious how the yards-gained distribution should depend on down, distance, and yardline. So my starting point was to grab some data, slice and dice it, and make some graphs. To clarify, yardline is encoded as “yards-from-own-goal”, abbreviated yfog, where 1 means you are backed up against your own end zone, and 99 means you are 1 yard away from scoring a touchdown. The main observations are:

• Between, say, yfog=20 and yfog=75, mean yards gained doesn’t depend strongly on field position (Fig. \ref{fignm1}).
• For plays originating between yfog=20 and yfog=75, mean yards gained doesn’t depend strongly on yards-to-go (Fig. \ref{fignm2}).
• The fraction of plays that are passes depends strongly on down and distance, and less strongly on field position (Figs. \ref{fignm3} and \ref{fignm5}).
• On fourth down, the probabilities to punt versus try a field goal versus go for it vary rapidly as a function of yards-to-go and yards-from-own-goal. This is particularly true for yfog $\sim 50$ to yfog $\sim 80$ (Figs. \ref{figep1}, \ref{figep2}, and \ref{figep3}).

So now the question is, how can we model the yards-gained distribution? Since mean yards gained doesn’t depend too strongly on down, distance, and field position, it is instructive to pool a bunch of states and look in more detail at the distribution to get a feel for its shape. In Fig. \ref{fignm6}, I show distributions for passes (left) and rushes (right), for all first-down plays originating between yfog=20 and yfog=75.

After playing around with some functions, the best general agreement I could find came from the following:

$p(y) = A ~\frac{e^{(y-y_0)/\sigma_1}} {1+e^{(y-y_0) (\sigma_1+\sigma_2)/(\sigma_1 \sigma_2)}} + G ~e^{- (y-g_0)^2/ (2 \sigma_g^2)}$.

I refer to this as a “Bazin plus Gauss” function: Bazin because I first encountered the “ratio of exponentials” in a paper by Bazin et al. that used it to model supernova light curves (http://arxiv.org/abs/1109.0948), and Gauss for obvious reasons. The first (Bazin) term basically stitches together two exponentials at the location $y_0$. For $y \ll y_0$, the denominator is close to 1 and the term looks like a rising exponential with scale $\sigma_1$; for $y \gg y_0$, the exponentials combine to $e^{-(y-y_0)/\sigma_2}$, a declining exponential with scale $\sigma_2$. The Gaussian part describes being sacked, and in the football application, $G$ is identically 0 for rushes.
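To make the shape concrete, here is a minimal Python sketch of the function above, assuming NumPy. The function and argument names are my own for illustration and are not necessarily those used in the actual model code; the sketch leaves the overall normalization ($A$ and $G$) to the caller.

```python
import numpy as np

def bazin(y, A, y0, sigma1, sigma2):
    """Bazin term: rising exponential (scale sigma1) joined to a
    declining exponential (scale sigma2) near y0."""
    u = np.asarray(y, dtype=float) - y0
    rate = (sigma1 + sigma2) / (sigma1 * sigma2)  # equals 1/sigma1 + 1/sigma2
    return A * np.exp(u / sigma1) / (1.0 + np.exp(u * rate))

def bazin_plus_gauss(y, A, y0, sigma1, sigma2, G=0.0, g0=0.0, sigma_g=1.0):
    """Bazin term plus a Gaussian bump; G=0 reduces to the rushing form."""
    y = np.asarray(y, dtype=float)
    gauss = G * np.exp(-((y - g0) ** 2) / (2.0 * sigma_g ** 2))
    return bazin(y, A, y0, sigma1, sigma2) + gauss

# Check the two limits with rough rushing-like values (y0=1, sigma1=1.5, sigma2=3.5).
low = bazin(-20.0, 1.0, 1.0, 1.5, 3.5) / np.exp((-20.0 - 1.0) / 1.5)
high = bazin(50.0, 1.0, 1.0, 1.5, 3.5) / np.exp(-(50.0 - 1.0) / 3.5)
print(low, high)  # both ratios should be very close to 1
```

Far below $y_0$ the curve tracks $e^{(y-y_0)/\sigma_1}$ and far above it tracks $e^{-(y-y_0)/\sigma_2}$, which is what the two printed ratios test.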

With this functional form, I use the function minimization package pyminuit to determine the maximum likelihood values of the parameters: $y_0, \sigma_1, \sigma_2$ for rushes, and additionally $G/A, g_0,$ and $\sigma_g$ for passes. For rushes, typical values are $y_0 \sim 1, \sigma_1 \sim 1.5, \sigma_2 \sim 3.5$. For passes, typical values are $y_0 \sim 4.5, \sigma_1 \sim 1.8, \sigma_2 \sim 8.0, G/A \sim 0.12, g_0 \sim -6.5,$ and $\sigma_g \sim 3.0$. Figs. \ref{figye1} and \ref{figym1} compare the model to the empirical distribution for 1st and 10 plays from the 20.
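As a rough illustration of the maximum likelihood step for rushes, here is a sketch that swaps in `scipy.optimize.minimize` for pyminuit. Several things here are assumptions on my part rather than a description of the real fit: the distribution is treated as a probability mass function on integer yards from -15 to 80 (a hypothetical support), $A$ is fixed by normalization, the scales are fit in log space to keep them positive, and the "data" are synthetic draws from the model itself at the typical rushing values quoted above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

YARDS = np.arange(-15, 81)  # hypothetical support for yards gained

def log_pmf_rush(params, yards=YARDS):
    """Log of the normalized Bazin term on a discrete yards grid (G = 0)."""
    y0, log_s1, log_s2 = params
    s1, s2 = np.exp(log_s1), np.exp(log_s2)  # log-parameters keep scales positive
    u = yards - y0
    # log of e^{u/s1} / (1 + e^{u (s1+s2)/(s1 s2)}), written to avoid overflow
    log_shape = u / s1 - np.logaddexp(0.0, u * (s1 + s2) / (s1 * s2))
    return log_shape - logsumexp(log_shape)  # normalization plays the role of A

def neg_log_likelihood(params, counts):
    """Negative log-likelihood of binned yards-gained counts."""
    return -np.sum(counts * log_pmf_rush(params))

# Synthetic sample drawn from the model itself (not real play-by-play data).
rng = np.random.default_rng(0)
true_params = np.array([1.0, np.log(1.5), np.log(3.5)])
sample = rng.choice(YARDS, size=50_000, p=np.exp(log_pmf_rush(true_params)))
counts = np.bincount(sample - YARDS[0], minlength=YARDS.size)

result = minimize(
    neg_log_likelihood,
    x0=[0.0, 0.0, np.log(3.0)],  # starting guess: y0=0, sigma1=1, sigma2=3
    args=(counts,),
    method="Nelder-Mead",
    options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 5000},
)
y0_hat, s1_hat, s2_hat = result.x[0], np.exp(result.x[1]), np.exp(result.x[2])
print(y0_hat, s1_hat, s2_hat)
```

The fitted values should land near the inputs used to generate the sample. For passes, the same approach would add $G/A$, $g_0$, and $\sigma_g$ as free parameters, though the log-space shortcut for the shape no longer applies directly once the Gaussian term is added.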

In the next section I will describe how my model is implemented in code.