# the most informative game – information theory in baseball

in information theory, the information content of an event is -log2(p)

where p is the probability of the event and log2 is the base-2 logarithm.

for each batting event (not including stolen bases) in the retrosheet events table, I computed the probability of the event occurring and used it to compute the information content. The probability is

prob = x/(1+x)
x = b/(1-b)*p/(1-p)*(1-a)/a

in other words, I used the odds-ratio method to compute the probabilities. a is the league-average rate for the event, while b and p are the batter's and pitcher's rates, based on their observed values for that season and regressed to the mean.
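as a minimal sketch, the odds-ratio step above looks like this (the function name is just for illustration):

```python
def odds_ratio_prob(b, p, a):
    """Combine batter rate b, pitcher rate p, and league rate a
    into one event probability via the odds-ratio method."""
    x = (b / (1 - b)) * (p / (1 - p)) * ((1 - a) / a)
    return x / (1 + x)
```

a nice sanity check: when the batter and pitcher rates both equal the league average, the result collapses back to the league average, e.g. odds_ratio_prob(0.025, 0.025, 0.025) gives 0.025 up to floating-point rounding.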

as a concrete example, on 04/11/1961 the Angels played at the Orioles,
http://www.baseball-reference.com/boxes/BAL/BAL196104110.shtml

in the first inning, T. Kluszewski hit a home run off M. Pappas. This had a probability of 0.0356, which makes the information content -log2(0.0356) = 4.81 bits. The value 0.0356 comes from the following:

Kluszewski hit 15 HRs in 291 PA (I computed PA by summing events from the events table, so I may be off by a few here and there compared to baseball-ref). Pappas gave up 16 HRs in 732 PA. The mean HR per PA in 1961 was 0.025. So for Kluszewski I set the HR/PA rate as

(15 + 0.025*250)/(291 + 250) = 0.0392

and for Pappas,

(16 + 0.025*250)/(732 + 250) = 0.0227

then,
x = 0.0392/(1-0.0392)*0.0227/(1-0.0227)*(1-0.025)/0.025 = 0.0370

and
prob = x/(1+x) = 0.0356.
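putting the regression and odds-ratio steps together, the Kluszewski/Pappas numbers can be reproduced like this (a sketch; the 250-PA regression constant is the one described below):

```python
from math import log2

def regress(successes, pa, league_rate, prior_pa=250):
    """Regress an observed rate toward the league mean by adding
    prior_pa plate appearances of league-average results."""
    return (successes + league_rate * prior_pa) / (pa + prior_pa)

league_hr = 0.025                   # 1961 league HR per PA
b = regress(15, 291, league_hr)     # Kluszewski, ~0.039
p = regress(16, 732, league_hr)     # Pappas, ~0.023

x = (b / (1 - b)) * (p / (1 - p)) * ((1 - league_hr) / league_hr)
prob = x / (1 + x)                  # ~0.0356
info = -log2(prob)                  # ~4.81 bits
```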

I consider outs in the field, strikeouts, BBs, HBPs, IBBs, ROE, 1B, 2B, 3B, and HRs (retrosheet event_cd 2, 3, 14, 15, 16, 18, 20, 21, 22, 23). I compute the league mean for each event type in each season to use in the regression, but I regress every quantity with the same fixed 250 PA. I know this isn't exactly correct, but it's my starting point.

Using this scheme, here are the top-5 games with the greatest and least cumulative and average information content.

most informative games+half-inning

| game_id | information | bat_home_id |
|---|---|---|
| PHI197305040 | 218.24 | 0 |
| DET196206240 | 212.26 | 1 |
| SE1196907190 | 211.93 | 0 |
| ATL198507040 | 209.94 | 0 |
| PHI201308240 | 206.58 | 0 |

least informative games+half-inning

| game_id | information | bat_home_id |
|---|---|---|
| SLN198404212 | 17.84 | 1 |
| ATL199704270 | 20.90 | 0 |
| MIN196708060 | 20.98 | 0 |
| SDN198404270 | 21.43 | 1 |
| WS2196507030 | 21.66 | 1 |

most (average) informative games+half-inning

| game_id | information | bat_home_id |
|---|---|---|
| KCA199404120 | 3.12 | 0 |
| TOR200804060 | 3.04 | 1 |
| BOS200907110 | 3.04 | 1 |
| MIL200807060 | 3.04 | 1 |
| HOU197406230 | 3.02 | 1 |

least (average) informative games+half-inning

| game_id | information | bat_home_id |
|---|---|---|
| BAL197406040 | 0.91 | 1 |
| HOU197607230 | 0.92 | 1 |
| CLE196505050 | 0.98 | 0 |
| NYA198905260 | 0.99 | 1 |
| SEA199007150 | 1.00 | 0 |

another quantity in information theory is entropy, which roughly speaking is the expected number of bits you need to represent an outcome. each possible outcome with probability p contributes -p log2(p), and the sum over outcomes is the entropy.
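as a sketch, the standard Shannon entropy of a discrete distribution (summing -p log2 p over the outcome probabilities) looks like:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution:
    -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

a fair coin works out to 1.0 bit, and a plate appearance split 96.44% non-HR / 3.56% HR works out to about 0.22 bits.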

here are the highest and lowest entropy games, cumulative and average (split out by home/away)
———————————————–

highest entropy games+half-inning

| game_id | entropy | bat_home_id |
|---|---|---|
| CHA198405080 | 43.53 | 1 |
| NYN197409110 | 42.47 | 0 |
| NYN197409110 | 41.47 | 1 |
| HOU196804150 | 40.52 | 0 |
| DET196206240 | 40.37 | 0 |

lowest entropy games+half-inning

| game_id | entropy | bat_home_id |
|---|---|---|
| BAL197107300 | 5.52 | 1 |
| DET196907040 | 6.70 | 1 |
| BOS200610010 | 6.72 | 0 |
| PHI198809240 | 6.80 | 1 |
| KCA198706290 | 6.83 | 1 |

highest (average) entropy games+half-inning

| game_id | entropy | bat_home_id |
|---|---|---|
| HOU201304020 | 0.52 | 1 |
| ATL200405180 | 0.52 | 1 |
| SDN200107180 | 0.51 | 1 |
| TEX200707070 | 0.51 | 1 |
| CHN200105250 | 0.51 | 0 |

lowest (average) entropy games+half-inning

| game_id | entropy | bat_home_id |
|---|---|---|
| WS2196104100 | 0.33 | 0 |
| SLN197907011 | 0.34 | 1 |
| MIN197905080 | 0.34 | 1 |
| BOS196606030 | 0.34 | 0 |
| BOS198205230 | 0.35 | 1 |
———————————————–

here are the information and entropy results for full games,
———————————————–

most informative games (both innings)

| game_id | information |
|---|---|
| ATL198507040 | 399.98 |
| MON197705210 | 394.79 |
| CHA198405080 | 393.29 |
| NYN197409110 | 389.37 |
| PHI201308240 | 389.37 |

least informative games (both innings)

| game_id | information |
|---|---|
| BAL197107300 | 50.77 |
| WS2196507030 | 52.28 |
| ATL196607160 | 57.88 |
| SLN197805280 | 59.14 |
| SLN197306180 | 61.14 |

most (average) informative games (both innings)

| game_id | information |
|---|---|
| BOS196606030 | 2.79 |
| LAN200609180 | 2.79 |
| HOU200107180 | 2.76 |
| COL200309230 | 2.75 |
| COL200006280 | 2.75 |

least (average) informative games (both innings)

| game_id | information |
|---|---|
| CAL199209290 | 1.28 |
| PHI198408290 | 1.30 |
| CLE197809150 | 1.32 |
| CAL199104200 | 1.34 |
| MIL197406190 | 1.34 |

highest entropy games (both innings)

| game_id | entropy |
|---|---|
| NYN197409110 | 83.94 |
| CHA198405080 | 82.59 |
| DET196206240 | 78.81 |
| HOU196804150 | 78.36 |
| NYN196405312 | 75.89 |

lowest entropy games (both innings)

| game_id | entropy |
|---|---|
| BAL197107300 | 12.94 |
| MIN196708060 | 14.93 |
| CHN199005090 | 15.17 |
| PHI196708270 | 15.29 |
| NYN196406072 | 15.44 |

highest (average) entropy games (both innings)

| game_id | entropy |
|---|---|
| CHN200105250 | 0.49 |
| ARI200104100 | 0.49 |
| CHN200405180 | 0.48 |
| LAN201308230 | 0.48 |
| COL201009240 | 0.48 |

lowest (average) entropy games (both innings)

| game_id | entropy |
|---|---|
| MIN197905080 | 0.35 |
| CAL198606140 | 0.36 |
| BOS196606030 | 0.37 |
| BOS198205230 | 0.37 |
| MON197707170 | 0.37 |

———————————————–

here are the highest information and entropy results for batter-seasons and pitcher-seasons,
———————————————–

most informative seasons (batters)

| bat_id | information | year_id |
|---|---|---|
| sosas001 | 1846.20 | 2001 |
| sizeg001 | 1835.18 | 2006 |
| bagwj001 | 1833.12 | 1999 |
| vaugm001 | 1820.52 | 1996 |
| delgc001 | 1807.45 | 2000 |

highest entropy seasons (batters)

| bat_id | entropy | year_id |
|---|---|---|
| suzui001 | 338.40 | 2004 |
| boggw001 | 329.41 | 1985 |
| suzui001 | 329.02 | 2007 |
| weekr001 | 328.86 | 2010 |
| dyksl001 | 324.95 | 1993 |

most informative seasons (pitchers)

| pit_id | information | year_id |
|---|---|---|
| lolim101 | 3111.52 | 1971 |
| woodw103 | 3067.14 | 1973 |
| niekp001 | 3045.29 | 1977 |
| ryann001 | 3002.07 | 1974 |
| niekp001 | 2927.42 | 1979 |

highest entropy seasons (pitchers)

| pit_id | entropy | year_id |
|---|---|---|
| lolim101 | 665.86 | 1971 |
| woodw103 | 641.12 | 1973 |
| ryann001 | 628.83 | 1974 |
| niekp001 | 623.37 | 1977 |
| ryann001 | 613.86 | 1973 |

averages don’t really add anything interesting here, since you just pick up players with 1 PA.

—————————————
and finally, here is the most informative (i.e., the unlikeliest) event,
May 18 1992, Albert Belle hits a triple off Kevin Brown,
http://www.baseball-reference.com/boxes/CLE/CLE199205180.shtml

prob = 0.008563, information content = 10.22.