## From the File Drawer: Does Spring Training Predict Wins?

March 18, 2014

Posted by tomflesher in Baseball.

Nope.

It’s an idle curiosity, and more information is never a bad thing, but first, you need to establish whether there’s actually any information being generated. It would be useful, potentially, to have a sense of at least how the first few weeks of the season might go, so I decided to crunch some numbers to see whether I could torture the data far enough to get a good predictive measure. I grabbed the spring training and regular-season stats from 2012 and 2013 and went at it.

First round.

Correlation! Correlations are useful. The correlation of spring winning percentage and regular-season winning percentage? A paltry .069. That’s not even worth looking at. This is going to be harder than I thought.
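For anyone following along at home, the first-round check can be sketched roughly like this. The data frame `teams` and its random contents are hypothetical stand-ins (assuming one row per team-season, so 60 rows for 2012 and 2013); only the column names `Sprperc` and `perc` come from the code later in this post.

```r
# Hypothetical stand-in data: one row per team-season (30 teams x 2 years).
set.seed(1)
teams <- data.frame(
  Sprperc = runif(60, 0.30, 0.70),  # spring training winning percentage
  perc    = runif(60, 0.35, 0.65)   # regular-season winning percentage
)

# Pearson correlation between spring and regular-season winning percentage
spring_cor <- with(teams, cor(Sprperc, perc))
spring_cor

# cor.test() adds a confidence interval, which helps judge whether a
# small correlation is distinguishable from zero at this sample size
spring_ci <- with(teams, cor.test(Sprperc, perc))$conf.int
spring_ci
```

With real data in place of the random numbers, an interval that straddles zero would back up the "not even worth looking at" verdict.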

Second round.

Well, maybe if we try a Pythagorean expectation, we might get something useful. Let’s try the 2 exponent…. Hm. That correlation is even worse (.063). Well, maybe the 1.82 “true” exponent will help…. .065. This isn’t going to work very well.
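As a quick refresher, the Pythagorean expectation with exponent B is RS^B / (RS^B + RA^B), which is algebraically the same as 1 / (1 + (RA/RS)^B). Here is a minimal sketch; the helper name `pythag` and the run totals are made up for illustration.

```r
# Pythagorean expectation: RS^B / (RS^B + RA^B)
pythag <- function(rs, ra, B = 2) rs^B / (rs^B + ra^B)

# Hypothetical spring run totals for three teams
spring <- data.frame(rs = c(150, 120, 135), ra = c(130, 140, 135))

pyth2   <- pythag(spring$rs, spring$ra, B = 2)     # classic exponent
pyth182 <- pythag(spring$rs, spring$ra, B = 1.82)  # exponent used in the text

# Same quantity written in terms of the run ratio RA/RS
ratio_form <- 1 / (1 + (spring$ra / spring$rs)^2)

cbind(pyth2, pyth182, ratio_form)
```

A team that scores exactly as many runs as it allows comes out at .500 regardless of the exponent, so changing B only stretches or squeezes the spread around .500. That is part of why swapping exponents barely moves the correlation.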

Third round.

Okay. This is going to involve some functional-form assumptions if we really want to go all Mythbusters on the data’s ass and figure out something that works. First, let’s validate the Pythagorean expectation by running an optimization to minimize the sum of squared errors, with runratio = Runs Allowed/Runs Scored and perc = regular-season winning percentage:

> min.RSS <- function(data,B) {with(data,sum((1/(1 + runratio^B) - perc)^2))}
> result
\$minimum
[1] 1.799245

\$objective
[1] 0.04660422
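The call that produced `result` isn't shown here. Its output has the shape returned by base R's `optimize()`, so a minimal sketch might look like the following. The data frame `season`, its simulated contents, and the search interval `c(0, 5)` are all assumptions for illustration, not the original call.

```r
# Hypothetical data: runratio = runs allowed / runs scored, and a winning
# percentage simulated from a true exponent of 1.8 plus a little noise.
set.seed(42)
season <- data.frame(runratio = runif(60, 0.80, 1.25))
season$perc <- 1 / (1 + season$runratio^1.8) + rnorm(60, sd = 0.02)

# Objective: sum of squared errors for a given exponent B
min.RSS <- function(data, B) {
  with(data, sum((1 / (1 + runratio^B) - perc)^2))
}

# optimize() searches over the first argument of the function it is given,
# so wrap min.RSS to put B first. The interval c(0, 5) is an assumption.
result <- optimize(function(B) min.RSS(season, B), interval = c(0, 5))

result$minimum    # fitted exponent
result$objective  # sum of squared errors at that exponent
```

`optimize()` does a one-dimensional search, which is all that's needed with a single free parameter.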

That “\$minimum” value means that the optimal value for B (the Pythagorean exponent) is 1.80 to the nearest hundredth. The “\$objective” value is the sum of squared errors at that exponent. Let’s try the same thing with the spring data:

> spring.RSS <- function(data,SprB) {with(data,sum((1/(1 + runratio.spr^SprB) - Sprperc)^2))}
> springresult
\$minimum
[1] 2.243336

\$objective
[1] 0.1253673

Alarmingly, even with the same amount of data, the sum of squared errors is nearly triple (about 2.7 times) that of the regular-season fit. The exponent, 2.24, is also well off the regular-season value of 1.80. Now for some cross-over: can we set up a model where the spring run ratio yields a useful measure of regular-season win percentage? Let’s try it out:

> cross.RSS <- function(data,crossB) {with(data,sum((1/(1 + runratio.spr^crossB) - perc)^2))}
> crossresult
\$minimum
[1] 0.08985465

\$objective
[1] 0.3214856

> crossperc <- 1/(1 + runratio.spr^crossresult\$minimum)
> cor(perc,crossperc)
[1] 0.05433157

.054, everybody! That’s the worst one yet!
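It's worth seeing what an exponent that close to zero does to the predictions. A quick sketch, plugging the fitted cross exponent from above into the same formula with a handful of hypothetical spring run ratios:

```r
crossB <- 0.08985465  # fitted cross exponent reported above

# Hypothetical spring run ratios (runs allowed / runs scored),
# ranging from a very good spring to a very bad one
spring_ratio <- c(0.80, 0.90, 1.00, 1.10, 1.25)

# Predicted regular-season winning percentage under the cross model
pred <- 1 / (1 + spring_ratio^crossB)
round(pred, 3)
```

Every prediction lands within roughly half a percentage point of .500. In other words, the best the optimizer can do with spring run ratios is to nearly ignore them and guess that everybody finishes about even.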

Now, if anyone ever asks, go ahead and tell them that at least based on an afternoon of noodling around with R, spring training will not predict regular-season wins.

Just for the record, the correlation between the regular-season Pythagorean expectation and regular-season winning percentage is enormous:

> pythperc <- 1/(1 + runratio^result\$minimum)
> cor(perc,pythperc)
[1] 0.9250366