in information theory, information content is -log2(p)

where p is the probability and log2 is the base-2 logarithm.

for each batting event (not including stolen bases) in the Retrosheet events table, I computed the probability of the event occurring and used it to compute the information content. The probability is

prob = x/(1+x)

x = b/(1-b) * p/(1-p) * (1-a)/a

in other words, I used the odds ratio method to figure out the probabilities. Here a is the league-average rate for the event, and b and p are the batter's and pitcher's rates, based on the observed values for that season and regressed to the mean.
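a sketch of that step in Python (the function name is my own, not from any actual code behind the post):

```python
def odds_ratio_prob(b, p, a):
    """Combine a batter rate b, a pitcher rate p, and a league rate a
    into a single event probability via the odds ratio method."""
    x = b / (1 - b) * p / (1 - p) * (1 - a) / a
    return x / (1 + x)
```

a nice sanity check: when b = p = a, the function returns a itself, i.e., an average batter facing an average pitcher produces the league-average rate.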

as a concrete example, on 04/11/1961 the Angels played at the Orioles,

http://www.baseball-reference.com/boxes/BAL/BAL196104110.shtml

in the first inning, T. Kluszewski hit a home run off M. Pappas. This had a probability of 0.0356 of occurring, which makes the information content 4.81 bits. The value 0.0356 comes from the following:

Kluszewski hit 15 HRs in 291 PA (I computed PA by summing events from the events table, so I may be off by a few here and there compared to baseball-ref). Pappas gave up 16 HRs in 732 PA. The league mean HR per PA in 1961 was 0.025. So for Kluszewski I set the HR/PA rate to

(15 + 0.025*250)/(291 + 250) = 0.0392

and for Pappas,

(16 + 0.025*250)/(732 + 250) = 0.0227

then,

x = 0.0392/(1-0.0392) * 0.0227/(1-0.0227) * (1-0.025)/0.025 = 0.0370

and

prob = x/(1+x) = 0.0356.
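the arithmetic above can be reproduced with a short Python snippet (the `regress` helper is my naming, not anything from the original code):

```python
import math

def regress(successes, pa, league_rate, prior_pa=250):
    """Regress an observed rate toward the league mean by adding
    prior_pa plate appearances of league-average performance."""
    return (successes + league_rate * prior_pa) / (pa + prior_pa)

a = 0.025                # league HR/PA in 1961
b = regress(15, 291, a)  # Kluszewski's regressed HR rate, ~0.039
p = regress(16, 732, a)  # Pappas's regressed HR rate, ~0.023

# odds ratio method
x = b / (1 - b) * p / (1 - p) * (1 - a) / a
prob = x / (1 + x)       # ~0.0356
info = -math.log2(prob)  # ~4.81 bits
```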

I consider outs in the field, strikeouts, BBs, HBPs, IBBs, ROE, 1B, 2B, 3B, and HRs (Retrosheet event_cd 2, 3, 14, 15, 16, 18, 20, 21, 22, 23). I compute the league mean for each season to use in the regression, but I use a fixed 250 PA of regression for every event type. I know this isn't exactly correct, but that's my starting point.

Using this scheme, here are the top five games (by half-inning) with the greatest and least cumulative and average information content.

most informative games+half-inning

game_id       information  bat_home_id
PHI197305040  218.24       0
DET196206240  212.26       1
SE1196907190  211.93       0
ATL198507040  209.94       0
PHI201308240  206.58       0

least informative games+half-inning

game_id       information  bat_home_id
SLN198404212  17.84        1
ATL199704270  20.90        0
MIN196708060  20.98        0
SDN198404270  21.43        1
WS2196507030  21.66        1

most (average) informative games+half-inning

game_id       information  bat_home_id
KCA199404120  3.12         0
TOR200804060  3.04         1
BOS200907110  3.04         1
MIL200807060  3.04         1
HOU197406230  3.02         1

least (average) informative games+half-inning

game_id       information  bat_home_id
BAL197406040  0.91         1
HOU197607230  0.92         1
CLE196505050  0.98         0
NYA198905260  0.99         1
SEA199007150  1.00         0

another quantity in information theory is entropy, which roughly speaking is the expected number of bits you need to represent the information. an event with probability p contributes -p log2(p).
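as a minimal sketch (function name is mine), each event's entropy contribution is:

```python
import math

def entropy_contribution(p):
    """Entropy contribution, -p*log2(p), of a single event
    that occurs with probability p."""
    return -p * math.log2(p)

# the Kluszewski HR above (p = 0.0356) carries ~4.81 bits of
# information, but contributes only ~0.17 bits of entropy,
# because such an unlikely event almost never happens.
```

summing this contribution over the events of a game gives the game-level entropy totals.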

here are the highest- and lowest-entropy games, cumulative and average (split out by home/away)

———————————————–

highest entropy games+half-inning

game_id       entropy  bat_home_id
CHA198405080  43.53    1
NYN197409110  42.47    0
NYN197409110  41.47    1
HOU196804150  40.52    0
DET196206240  40.37    0

lowest entropy games+half-inning

game_id       entropy  bat_home_id
BAL197107300  5.52     1
DET196907040  6.70     1
BOS200610010  6.72     0
PHI198809240  6.80     1
KCA198706290  6.83     1

highest (average) entropy games+half-inning

game_id       entropy  bat_home_id
HOU201304020  0.52     1
ATL200405180  0.52     1
SDN200107180  0.51     1
TEX200707070  0.51     1
CHN200105250  0.51     0

lowest (average) entropy games+half-inning

game_id       entropy  bat_home_id
WS2196104100  0.33     0
SLN197907011  0.34     1
MIN197905080  0.34     1
BOS196606030  0.34     0
BOS198205230  0.35     1

———————————————–

here are the information and entropy results for full games,

———————————————–

most informative games (both half-innings)

game_id       information
ATL198507040  399.98
MON197705210  394.79
CHA198405080  393.29
NYN197409110  389.37
PHI201308240  389.37

least informative games (both half-innings)

game_id       information
BAL197107300  50.77
WS2196507030  52.28
ATL196607160  57.88
SLN197805280  59.14
SLN197306180  61.14

most (average) informative games (both half-innings)

game_id       information
BOS196606030  2.79
LAN200609180  2.79
HOU200107180  2.76
COL200309230  2.75
COL200006280  2.75

least (average) informative games (both half-innings)

game_id       information
CAL199209290  1.28
PHI198408290  1.30
CLE197809150  1.32
CAL199104200  1.34
MIL197406190  1.34

highest entropy games (both half-innings)

game_id       entropy
NYN197409110  83.94
CHA198405080  82.59
DET196206240  78.81
HOU196804150  78.36
NYN196405312  75.89

lowest entropy games (both half-innings)

game_id       entropy
BAL197107300  12.94
MIN196708060  14.93
CHN199005090  15.17
PHI196708270  15.29
NYN196406072  15.44

highest (average) entropy games (both half-innings)

game_id       entropy
CHN200105250  0.49
ARI200104100  0.49
CHN200405180  0.48
LAN201308230  0.48
COL201009240  0.48

lowest (average) entropy games (both half-innings)

game_id       entropy
MIN197905080  0.35
CAL198606140  0.36
BOS196606030  0.37
BOS198205230  0.37
MON197707170  0.37

———————————————–

here are the highest information and entropy results for batter-seasons and pitcher-seasons,

———————————————–

most informative seasons (batters)

bat_id    information  year_id
sosas001  1846.20      2001
sizeg001  1835.18      2006
bagwj001  1833.12      1999
vaugm001  1820.52      1996
delgc001  1807.45      2000

highest entropy seasons (batters)

bat_id    entropy  year_id
suzui001  338.40   2004
boggw001  329.41   1985
suzui001  329.02   2007
weekr001  328.86   2010
dyksl001  324.95   1993

most informative seasons (pitchers)

pit_id    information  year_id
lolim101  3111.52      1971
woodw103  3067.14      1973
niekp001  3045.29      1977
ryann001  3002.07      1974
niekp001  2927.42      1979

highest entropy seasons (pitchers)

pit_id    entropy  year_id
lolim101  665.86   1971
woodw103  641.12   1973
ryann001  628.83   1974
niekp001  623.37   1977
ryann001  613.86   1973

averages don’t really add anything interesting here since you just pick up players with 1 PA.

—————————————

and finally, here is the most informative (i.e., the unlikeliest) event,

on May 18, 1992, Albert Belle hit a triple off Kevin Brown,

http://www.baseball-reference.com/boxes/CLE/CLE199205180.shtml

prob = 0.008563, information content = 10.22.