## What does a long game do to teams?
*April 13, 2015*

*Posted by tomflesher in Baseball.*

Tags: extra innings, file drawer, linear model, linear regression, Red Sox, Yankees


Friday, the Red Sox took a 19-inning contest from the Yankees. Both teams had the misfortune of finishing a game around 2:15 a.m. and having to be back on the field at 1:05 p.m. Everyone, including the announcers, discussed how tired the teams would be; in particular, first baseman **Mark Teixeira** spent a long night on the bag to keep backup first baseman and apparent emergency pitcher **Garrett Jones** fresh, leading **Alex Rodriguez** to make his first career appearance at first base on Saturday.

Teixeira wasn’t the only player to sit out the next day – center fielders **Jacoby Ellsbury** and **Mookie Betts**, catchers **Brian McCann** and **Sandy Leon**, and most of the bullpen all sat out, among others. The Yankees called up pitcher **Matt Tracy** for a cup of coffee and sent **Chasen Shreve** down, then sent Tracy back down to Scranton in exchange for **Kyle Davies**. Boston activated starter **Joe Kelly** from the disabled list, sending winning pitcher **Steven Wright** down to make room. Shreve and Wright each had solid outings, with Wright allowing 2 runs over five innings and Shreve pitching 3 1/3 scoreless.

All those moves provide some explanation for a surprising result. Interested in what the effect of these long games is, I dug up all of the games from 2014 that lasted 14 innings or more. In a quick and dirty data set, I recorded each team’s score in its next game, along with the number of outs pitched and the length in minutes of the long game.

I fitted four models: two with the next game’s runs as the dependent variable and two with the difference in runs (next game’s runs – long game’s runs) as the dependent variable. Each used the length of the game in minutes, the number of outs, the average runs scored by the team during 2014, and an indicator variable for the presence of a designated hitter in each game. For each dependent variable, I fitted one model with all variables in linear form and one semilog model that replaced outs and game length with their natural logs.
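As a rough sketch of that setup (not the code actually used for this post), the four fits could be written with plain least squares in NumPy. The function and argument names here are hypothetical, and the sketch returns only coefficients; p-values would need a statistics package on top of this.

```python
import numpy as np

def fit_ols(y, columns):
    """Least-squares fit with an intercept; returns coefficients, intercept first."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y))] + [np.asarray(c, dtype=float) for c in columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_four_models(next_runs, long_runs, minutes, outs, avg_runs, dh):
    """Fit the linear and semilog forms for both dependent variables.

    All arguments are equal-length sequences, one entry per team-game
    (hypothetical layout; the original data set is not reproduced here).
    """
    next_runs = np.asarray(next_runs, dtype=float)
    run_diff = next_runs - np.asarray(long_runs, dtype=float)
    minutes = np.asarray(minutes, dtype=float)
    outs = np.asarray(outs, dtype=float)

    level = [minutes, outs, avg_runs, dh]                     # everything in levels
    semilog = [np.log(minutes), np.log(outs), avg_runs, dh]   # logs of length and outs

    return {
        "runs_linear": fit_ols(next_runs, level),
        "runs_semilog": fit_ols(next_runs, semilog),
        "diff_linear": fit_ols(run_diff, level),
        "diff_semilog": fit_ols(run_diff, semilog),
    }
```

Each entry in the returned dictionary is ordered as intercept, game length, outs, team average runs, DH indicator.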

With runs scored as the dependent variable, nothing was significant. That is, no variable correlated strongly with an increase or decrease in the number of runs scored.

With a run difference model, the length of the game in minutes became marginally significant. For the linear model, extending the length of the game by one minute lowers the difference in runs by about .043 runs – that is, normalizing for the number of runs scored in the long game, extending the game by one minute lowered the runs the next day by about .043. In the semilog model, the coefficient on the log of game length was about -14, so extending the game by about 1% lowered the run difference by about .14 runs; the large negative coefficient was offset by an extremely high intercept term. This is a very high semielasticity, and both coefficients had p-values between .01 and .015. Nothing else was even close.
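To see how the two run-difference estimates relate, here is a small sketch. It reads the semilog figure as a coefficient of roughly -14 on the log of game length, which is an interpretation rather than something stated outright, and the 325-minute game length is a made-up round number used only to compare the two models.

```python
import math

def level_log_effect(beta, pct_change):
    """Change in y when x rises by pct_change percent, for y = a + beta * ln(x)."""
    return beta * math.log(1 + pct_change / 100)

def implied_log_coefficient(slope_per_minute, minutes):
    """Level-log coefficient implied by a linear slope at a given game length."""
    return slope_per_minute * minutes

# Assumed coefficient of -14 on ln(minutes): effect of a 1% longer game.
print(round(level_log_effect(-14, 1), 3))              # -0.139

# Linear slope of -.043 runs per minute at a hypothetical 325-minute game.
print(round(implied_log_coefficient(-0.043, 325), 1))  # -14.0
```

Under those assumptions the linear and semilog models tell the same story: around five and a half hours, one extra minute and an extra third of a percent of game length are about the same thing.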

With all of the usual caveats about statistical analysis, this suggests that teams are actually pretty good at bouncing back from long games, either because most of the time they’re playing the same team (so both teams are equally fatigued) or because of smart roster moves. Either way, it’s a surprise.

## Wins and Revenue
*March 31, 2014*

*Posted by tomflesher in Baseball, Economics.*

Tags: linear model, Marginal Revenue Product of a win, Revenue, Wins


Forbes has released its annual list of baseball team valuations. This is interesting because it accounts for all of the revenue that each team makes, ignoring a lot of the broader factors that play into what causes a team’s value to rise or fall. It also includes a bunch of extra data, including which teams’ values are rising, which are falling, and what each team’s operating income is for the year.

Without getting too in-depth, there are a lot of interesting relationships we can observe by crunching some of the numbers. First, the relationship between wins and revenue is often taken for granted, but the correlation is really very small – only about .26. That means that there’s a great deal more in play determining revenue than just whether a team wins or loses. (This, of course, assumes a linear relationship – one win is worth a fixed dollar amount, and that fixed dollar amount is the same for every team. Correcting this for local income – allowing a win to be worth more in New York than in Pittsburgh – would be an easy extension.)

Under the same assumptions, we can also run a quick linear regression to determine what an average team’s revenue would be at 0 wins and then determine what each marginal win’s revenue product is. Those numbers tell us that, roughly, a 0-win team would make about $129.68 million, gaining around $1.31 million for each win. Again, though, there are a lot of problems with this – obviously, a 0-win team doesn’t exist and would probably have significantly lower revenue than we’d estimate. Even the worst team last year came in at 51 wins. Also, the p-values don’t exactly inspire confidence – the $130 million intercept is significant at the 10% level, but the coefficient on wins has a p-value around .16. That’s a pretty chancy number.
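To put those two coefficients in context, this short sketch plugs a few win totals into the fitted line. It uses only the intercept and slope quoted above, in millions of dollars; the 81- and 97-win rows are just illustrative inputs, not teams from the data.

```python
INTERCEPT = 129.68  # predicted revenue at 0 wins, in $ millions
PER_WIN = 1.31      # marginal revenue per win, in $ millions

def linear_revenue(wins):
    """Predicted revenue ($ millions) from the simple linear fit."""
    return INTERCEPT + PER_WIN * wins

for wins in (51, 81, 97):
    print(wins, round(linear_revenue(wins), 2))
```

Going from 51 wins to 97 moves predicted revenue by about $60 million, which is less than half of the intercept. That is another way of seeing how little of revenue the win total explains in this model.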

Extending it out to include a squared value for wins, we come up with numbers that are astonishingly nonpredictive – the intercept drops to -$34.9 million for a 0-win team (much more reasonable!) with the expected positive marginal value for wins ($5.6 million) and a negative coefficient for squared wins (-$.027 million), indicating that wins have a decreasing marginal effect as would be predicted. (Once you have 97 wins, the 98th doesn’t usually provide much value.) However, those numbers are basically no better than chance, with respective p-values of .936, .619, and .701. Although the signs look nice, the magnitudes are up in the air.
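The decreasing marginal effect can be made concrete with a quick sketch using the three quoted coefficients. It assumes all three are in millions of dollars, and given the p-values above, the outputs are illustrations of the functional form rather than estimates to lean on.

```python
B0, B1, B2 = -34.9, 5.6, -0.027  # intercept, wins, wins squared ($ millions, assumed)

def quadratic_revenue(wins):
    """Predicted revenue ($ millions) from the quadratic fit."""
    return B0 + B1 * wins + B2 * wins ** 2

def marginal_win_value(wins):
    """Slope of the quadratic fit at a given win total."""
    return B1 + 2 * B2 * wins

def peak_wins():
    """Win total at which the fitted curve stops rising."""
    return -B1 / (2 * B2)

print(round(marginal_win_value(51), 3))  # slope at the worst team's win total
print(round(marginal_win_value(97), 3))  # slope at 97 wins
print(round(peak_wins(), 1))             # where an extra win adds nothing
```

With these coefficients the slope falls from a few million dollars per win near the bottom of the league to well under half a million at 97 wins, and it reaches zero a little past 100 wins.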

The sanest model that I can come up with is a log-log regression – that is, starting off with the natural log of revenue and regressing it on the natural log of the number of wins. This gives you an elasticity – a value that explains a percentage change in revenue for a 1% change in the number of wins. This isn’t the most realistic value, of course, since baseball teams play a fixed number of games, but the values look much better – the model looks like:

log(Revenue) = 3.6608 + .4058*log(Wins)

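The 3.6608 value is highly significant (p = .00299) and the .4058 coefficient on the number of wins is the strongest we’ve seen yet (p = .1253). Since log(0) is undefined, the model can’t put a number on a zero-win team at all; the intercept instead implies an unfortunate $38.9 million in revenue (e^3.6608) for a one-win team. The elasticity says that a 1% increase in wins goes with roughly a .41% increase in revenue, so doubling a team’s wins should give about a 32% increase in revenue (2^.4058 ≈ 1.32). That seems a bit more reasonable.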
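The log-log equation is easy to misread, so here is a minimal sketch that turns it back into dollars. It assumes revenue is measured in millions of dollars and that the logs are natural logs, as described above; the function names are made up for illustration.

```python
import math

INTERCEPT = 3.6608   # log of predicted revenue when wins = 1
ELASTICITY = 0.4058  # % change in revenue per 1% change in wins

def loglog_revenue(wins):
    """Predicted revenue ($ millions) from log(Revenue) = a + b * log(Wins)."""
    return math.exp(INTERCEPT) * wins ** ELASTICITY

def revenue_ratio(win_ratio):
    """Factor by which predicted revenue changes when wins are multiplied by win_ratio."""
    return win_ratio ** ELASTICITY

print(round(loglog_revenue(1), 1))    # revenue implied by the intercept alone
print(round(revenue_ratio(1.01), 4))  # a 1% increase in wins
print(round(revenue_ratio(2), 3))     # doubling wins
```

The elasticity is only a good approximation for small changes. For a change as large as doubling, the exact ratio 2^.4058 is the number to use, and it comes out below the 40% that the coefficient alone would suggest.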

There are a couple of other, nicer functional forms we could use, but for now, that’s the best we can do with purely linear models.