## From the File Drawer – 6-run games are an indicator
*August 29, 2015*

*Posted by tomflesher in Baseball, Sports.*

Tags: file drawer


At one point during the Mets’ tough loss to Boston last night, I shouted, “IT GETS THROUGH d’ARNAUD! IT GETS THROUGH d’ARNAUD!” The game started off pretty well, with **Matt Harvey** going six scoreless (including the aforementioned wild pitch to **Travis d’Arnaud**), but **Logan Verrett** had a rough seventh inning. Despite **Tyler Clippard** and **Jeurys Familia** doing what they do so well, **Carlos Torres** continued his slide. (He’ll be the topic of another post soon.)

One thing that surprised me was the number of runs the Mets allowed – six! The Mets have allowed six or more runs in 33 games this year, six of them in August. In those six August games, though, the Mets are 4-2. Does that sort of game really come out in the wash? I decided to crunch some numbers using the Baseball Reference Play Index to find out, with the working hypothesis that the number of high-scoring games for opponents doesn’t really have an effect on the team’s overall record.

The chart attached to this post plots the number of games in which 6 or more runs are allowed on the x-axis and winning percentage on the y-axis. The blue datapoints are individual teams’ win-loss percentages in those 6-plus-run games; the downward trend is pretty clear, although the Toronto Blue Jays are an outlier (48 such games, with a .417 winning percentage in them). The orange datapoints are season win-loss percentages, again as a function of 6-plus-run games. That trend is even clearer: if you allow your opponents to score lots of runs, there’s a definite negative effect on your record, even if it might disappear over a small sample. (Another perfectly good hypothesis busted by data!)

For the record, the correlation between 6-runs-allowed games and win percentage in those games is -.457, meaning that there’s a noticeable negative effect on performance in those games; however, the correlation between the number of those games and season win percentage is even stronger, at -.805. One way to interpret those numbers is to say that a team can recover in an individual high-scoring game, but a team that consistently allows such high scores will eventually see the losses add up.
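If you want to replicate those two correlations from team-level data, the arithmetic is simple enough to do without any statistics package. Here’s a minimal Python sketch; the numbers below are made up for illustration (except the Blue Jays’ 48 games and .417, which come from the chart), so don’t expect to reproduce -.457 and -.805 from them:

```python
# Hypothetical team-level data: games allowing 6+ runs, win percentage
# in those games, and season win percentage. Illustrative values only.
games_6plus = [33, 48, 40, 28, 52, 36]
wpct_in_those = [0.667, 0.417, 0.450, 0.500, 0.350, 0.480]
season_wpct = [0.560, 0.530, 0.490, 0.580, 0.420, 0.510]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Correlation of 6-plus-run game counts with performance in those games...
print(pearson(games_6plus, wpct_in_those))
# ...and with the season-long record.
print(pearson(games_6plus, season_wpct))
```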

## What does a long game do to teams?
*April 13, 2015*

*Posted by tomflesher in Baseball.*

Tags: extra innings, file drawer, linear model, linear regression, Red Sox, Yankees


Friday, the Red Sox took a 19-inning contest from the Yankees. Both teams had the unfortunate circumstance of finishing the game around 2:15 AM and having to be back on the field at 1:05 PM. Everyone, including the announcers, discussed how tired the teams would be; in particular, first baseman **Mark Teixeira** spent a long night on the bag to keep backup first baseman and apparent emergency pitcher **Garrett Jones** fresh, leading **Alex Rodriguez** to make his first career appearance at first base on Saturday.

Teixeira wasn’t the only player to sit out the next day – center fielders **Jacoby Ellsbury** and **Mookie Betts**, catchers **Brian McCann** and **Sandy Leon**, and most of the bullpen all sat out, among others. The Yankees called up pitcher **Matt Tracy** for a cup of coffee and sent **Chasen Shreve** down, then swapped Tracy back down to Scranton for **Kyle Davies**. Boston activated starter **Joe Kelly** from the disabled list, sending winning pitcher **Steven Wright** down to make room. Shreve and Wright each had solid outings, with Wright pitching five innings and allowing two runs and Shreve pitching 3 1/3 scoreless.

All those moves provide some explanation for a surprising result. Interested in what the effect of these long games is, I dug up all of the games from 2014 that lasted 14 innings or more. In a quick-and-dirty data set, I traced each team’s score in its next game along with the number of outs pitched and the length of the long game in minutes.

I fitted two linear models and two log models: two with the next game’s runs as the dependent variable and two with the difference in runs (next game’s runs minus the long game’s runs) as the dependent variable. Each used the length of the game in minutes, the number of outs, the team’s average runs scored during 2014, and an indicator variable for the presence of a designated hitter in each game. For each dependent variable, I fitted the model once with all variables in linear form and once with the natural logs of outs and game length.
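To make the linear-versus-semilog distinction concrete, here’s a stripped-down sketch with a single regressor (game length in minutes) and made-up data; the real models also included outs, the 2014 scoring average, and the DH indicator, so treat this as shape-of-the-idea only:

```python
import math

# Made-up (minutes, run_diff) pairs standing in for the 2014 long games.
games = [(300, 1.0), (320, -0.5), (350, -1.0), (280, 2.0), (400, -3.0), (330, 0.0)]

def ols_slope_intercept(xs, ys):
    """Closed-form simple OLS regression of ys on xs: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

minutes = [m for m, _ in games]
diffs = [d for _, d in games]

# Linear model: run_diff = a + b * minutes
b_lin, a_lin = ols_slope_intercept(minutes, diffs)

# Semilog model: run_diff = a + b * ln(minutes); here b * 0.01 is roughly
# the effect of a 1% increase in game length (the semielasticity).
b_log, a_log = ols_slope_intercept([math.log(m) for m in minutes], diffs)

print(b_lin, b_log)
```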

With runs scored as the dependent variable, nothing was significant. That is, no variable correlated strongly with an increase or decrease in the number of runs scored.

With the run-difference models, the length of the game in minutes became marginally significant. In the linear model, extending the game by one minute lowered the run difference by about .043 runs – that is, normalizing for the number of runs scored the previous day, each extra minute lowered the next day’s runs by about .043. In the semilog model, extending the game by about 1% lowered the run difference by about 14; this was offset by an extremely high intercept term. That is a very high semielasticity, and both coefficients had p-values between .01 and .015. Nothing else was even close.

With all of the usual caveats about statistical analysis, this suggests that teams are actually pretty good at bouncing back from long games, either because most of the time they’re playing the same team (so both sides are equally fatigued) or because of smart roster moves. Either way, it’s a surprise.

## From the File Drawer: Does Spring Training Predict Wins?
*March 18, 2014*

*Posted by tomflesher in Baseball.*

Tags: Baseball, file drawer, Spring training


Nope.

It’s an idle curiosity, and more information is never a bad thing, but first you need to establish whether there’s actually any information being generated. It would be useful, potentially, to have a sense of at least how the first few weeks of the season might go, so I decided to crunch some numbers to see whether I could torture the data far enough to get a good predictive measure. I grabbed the spring training and regular-season stats from 2012 and 2013 and got to work.

**First round.**

Correlation! Correlations are useful. The correlation of spring winning percentage and regular-season winning percentage? A paltry .069. That’s not even worth looking at. This is going to be harder than I thought.

**Second round.**

Well, maybe if we try a Pythagorean expectation, we might get something useful. Let’s try the 2 exponent…. Hm. That correlation is even worse (.063). Well, maybe the 1.82 “true” exponent will help…. .065. This isn’t going to work very well.
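For reference, the Pythagorean expectation with exponent B is 1/(1 + (RA/RS)^B), which is equivalent to the familiar RS^B/(RS^B + RA^B). A quick sketch (the run totals here are hypothetical, just to show the formula in action):

```python
def pyth(rs, ra, b=2.0):
    """Pythagorean expected winning percentage: 1 / (1 + (RA/RS)^b)."""
    return 1.0 / (1.0 + (ra / rs) ** b)

# A hypothetical team that scores 700 runs and allows 600:
print(pyth(700, 600))        # classic exponent of 2
print(pyth(700, 600, 1.82))  # the 1.82 "true" exponent
```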

**Third round.**

Okay. This is going to involve some functional-form assumptions if we really want to go all Mythbusters on the data’s ass and figure out something that works. First, let’s validate the Pythagorean expectation by running an optimization to minimize the sum of squared errors, with `runratio` = Runs Allowed/Runs Scored and `perc` = regular-season winning percentage:

```r
> min.RSS <- function(data,B) {with(data,sum((1/(1 + runratio^B) - perc)^2))}
> result <- optimize(min.RSS, c(0,10), data=data)
> result
$minimum
[1] 1.799245

$objective
[1] 0.04660422
```

That “$minimum” value means that the optimal value for B (the Pythagorean exponent) is around 1.80 (to the nearest hundredth). The “$objective” value is the sum of squared errors. Let’s try the same thing with the spring data:
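The same one-dimensional least-squares fit is easy to mirror outside R. Here’s a pure-Python grid search over the exponent, with made-up (run ratio, winning percentage) pairs standing in for the actual 2012–13 team data:

```python
# Made-up (runratio, perc) pairs standing in for the 2012-13 team data.
data = [(0.85, 0.58), (1.10, 0.46), (0.95, 0.53), (1.25, 0.40), (1.00, 0.50)]

def rss(b):
    """Sum of squared errors of the Pythagorean prediction at exponent b."""
    return sum((1.0 / (1.0 + runratio ** b) - perc) ** 2
               for runratio, perc in data)

# Grid search over B in (0, 10), mirroring optimize(min.RSS, c(0,10), ...)
candidates = [i / 1000.0 for i in range(1, 10000)]
best_b = min(candidates, key=rss)
print(best_b, rss(best_b))
```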

```r
> spring.RSS <- function(data,SprB) {with(data,sum((1/(1 + runratio.spr^SprB) - Sprperc)^2))}
> springresult <- optimize(spring.RSS, c(0,10), data=data)
> springresult
$minimum
[1] 2.243336

$objective
[1] 0.1253673
```

Alarmingly, even with the same amount of data, the sum of squared errors is almost triple the same measure for the regular-season data. The exponent is also pretty far off. Now for some cross-over: can we set up a model where the spring run ratio yields a useful measure of regular-season win percentage? Let’s try it out:

```r
> cross.RSS <- function(data,crossB) {with(data,sum((1/(1 + runratio.spr^crossB) - perc)^2))}
> crossresult <- optimize(cross.RSS, c(0,10), data=data)
> crossresult
$minimum
[1] 0.08985465

$objective
[1] 0.3214856

> crossperc <- 1/(1 + runratio.spr^crossresult$minimum)
> cor(perc, crossperc)
[1] 0.05433157
```

.054, everybody! That’s the worst one yet!

Now, if anyone ever asks, go ahead and tell them that at least based on an afternoon of noodling around with R, spring training will not predict regular-season wins.

Just for the record, the correlation between the Pythagorean expectation and wins is enormous:

```r
> pythperc <- 1/(1 + runratio^result$minimum)
> cor(perc, pythperc)
[1] 0.9250366
```