## Cy Young gives me a headache. January 15, 2010

Posted by tomflesher in Baseball, Economics.

As usual, I’ve started my yearly struggle against a Cy Young predictor. Bill James and Rob Neyer’s predictor (which I’ve preserved for posterity here) did a pretty poor job this year, having predicted the wrong winner in both leagues and even getting the order very wrong compared to the actual results. Inside, I’d like to share some of my pain, since I can’t seem to do much better.

I’m using a dataset I culled from baseball-reference.com’s Play Index, to which I added Cy Young points for each year as well as a number of binary variables for team division wins, team wildcard appearances, and so on. It includes every player who pitched in the 2005 through 2009 seasons, all told about 3,200 observations. Using R, I tried a number of linear regression models to see how well they fit the actual voting.

First, I tried a variation of the James/Neyer formula, CYP = ((5*IP/9)-ER) + (SO/12) + (SV*2.5) + Shutouts + ((W*6)-(L*2)) + VB. I included IP, ER, SO, SV, SHO, W, L, and VB and got this result:

Call:
lm(formula = model <- cypoints ~ IP + ER + SO + SV + SHO + W +
L + VB)

Residuals:
Min       1Q   Median       3Q      Max
-31.2641  -1.4715   0.1084   0.9949 144.4079

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1057887  0.2341857  -0.452    0.651
IP           0.0080245  0.0136774   0.587    0.557
ER          -0.0960892  0.0184517  -5.208 2.03e-07 ***
SO           0.0483835  0.0090107   5.370 8.45e-08 ***
SV           0.0001499  0.0218261   0.007    0.995
SHO          5.5749651  0.4340868  12.843  < 2e-16 ***
W            0.5653568  0.0899062   6.288 3.64e-10 ***
L           -0.3987691  0.0901410  -4.424 1.00e-05 ***
VB          -0.0191531  0.3781868  -0.051    0.960

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.977 on 3213 degrees of freedom
Multiple R-squared: 0.1952,     Adjusted R-squared: 0.1932
F-statistic: 97.43 on 8 and 3213 DF,  p-value: < 2.2e-16

This isn’t promising. Over the past five years, these factors aren’t very predictive at all: the model explains only about 19.5% of the variation in voting. Innings pitched, saves, and the victory bonus aren’t statistically significant, and the victory bonus’s coefficient is actually slightly negative, though it’s indistinguishable from zero. The caveat, of course, is that James and Neyer aren’t predicting actual Cy Young voting points but rather a statistical construct that shows the relative likelihood that a given pitcher will receive the Cy. I’m predicting actual Cy Young points. Still, the effects should be similar.
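For anyone who wants to try the same kind of fit, here is a minimal sketch of the regression and of where the R-squared and p-values come from. The `pitchers` data frame is simulated stand-in data (every number in it is made up), since the Play Index dataset isn’t reproduced here; only the column names follow the model above.

```r
# Simulated stand-in data: NOT the real 2005-2009 dataset.
set.seed(1)
n <- 200
pitchers <- data.frame(
  IP  = runif(n, 10, 230),
  SV  = rpois(n, 2),
  SHO = rpois(n, 0.3),
  VB  = rbinom(n, 1, 0.2)
)
pitchers$ER <- round(pitchers$IP * runif(n, 0.3, 0.6))
pitchers$SO <- round(pitchers$IP * runif(n, 0.5, 1.1))
pitchers$W  <- round(pitchers$IP / 15 * runif(n, 0.5, 1.5))
pitchers$L  <- round(pitchers$IP / 15 * runif(n, 0.5, 1.5))
# Made-up response, only so the regression has something to fit.
pitchers$cypoints <- pmax(0, 0.5 * pitchers$W - 0.3 * pitchers$L +
                             5 * pitchers$SHO + rnorm(n, sd = 5))

# Same right-hand side as the James/Neyer-style model above.
fit <- lm(cypoints ~ IP + ER + SO + SV + SHO + W + L + VB, data = pitchers)
s <- summary(fit)

s$r.squared                    # share of the variance the model explains
s$coefficients[, "Pr(>|t|)"]   # p-value for each regressor

# R-squared by hand: 1 - (residual sum of squares / total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((pitchers$cypoints - mean(pitchers$cypoints))^2)
manual_r2 <- 1 - rss / tss
```

The hand-computed `manual_r2` should agree with `s$r.squared`, which is the “Multiple R-squared” line in the printed summary.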

In fact, the James/Neyer formula grossly overestimates the proclivity of Cy Young voters for choosing relievers: it awards 2.5 points per save, while the regression’s estimate for SV is essentially zero. A pitcher with saves as his primary statistic hasn’t been given the Cy since Eric Gagne in 2003. This is a double-edged sword. On the one hand, saves have apparently been historically significant for the Cy, but on the other hand, the voting appears to be trending away from them. The five-year window I used is a compromise: enough data to work with, without washing out that trend.
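To see how much weight the formula puts on saves, here is a small sketch of the James/Neyer variation quoted above as an R function. The function name and both stat lines are hypothetical, chosen only for illustration, and VB is passed in as a plain number because its value isn’t spelled out here.

```r
# CYP = ((5*IP/9)-ER) + (SO/12) + (SV*2.5) + Shutouts + ((W*6)-(L*2)) + VB
cy_young_points <- function(IP, ER, SO, SV, SHO, W, L, VB = 0) {
  ((5 * IP / 9) - ER) + (SO / 12) + (SV * 2.5) + SHO + ((W * 6) - (L * 2)) + VB
}

# Hypothetical starter: 225 IP, 60 ER, 240 SO, 2 shutouts, 15-7 record.
starter <- cy_young_points(IP = 225, ER = 60, SO = 240, SV = 0, SHO = 2, W = 15, L = 7)

# Hypothetical closer: 70 IP, 15 ER, 80 SO, 40 saves, 3-3 record.
closer <- cy_young_points(IP = 70, ER = 15, SO = 80, SV = 40, SHO = 0, W = 3, L = 3)

# Points the closer gets from saves alone.
closer_save_points <- 40 * 2.5

starter             # 163
closer_save_points  # 100
```

In this made-up comparison, the closer’s saves are worth 100 points on their own, which puts him within striking distance of a very good starter on less than a third of the innings.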

After playing with R for a little while, I ended up creating a few extra measures that seem to capture the voting a little bit better (but not much). First, to approximate the relief effect, I created a “weighted saves” statistic: the square root of SV*GF. For a given number of games finished, the stat is largest when every one of those games is a save. (Every save is a game finished, by definition.) Thus, it helps show that the pitcher was relied on as a clutch player. I did the same thing for Complete Games and Shutouts: weighted shutouts is the square root of CG*SHO. Again, for a given number of complete games, it is largest when every complete game is a shutout. It ends up being far more predictive than CG or SHO alone. Finally, to capture the added value of each marginal win and marginal strikeout and the added penalty for each marginal home run and marginal walk, I included the squares of those terms. (Strikeouts are labeled K rather than SO in these models.) I also tried a dummy variable for previous year winner, since Lincecum’s so-so predicted points must have been bumped up by something.
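The derived measures are simple to build. This is a rough sketch with made-up stat lines; the column names mirror the regressors in the model, and the helper name `weighted_stat` is my own shorthand.

```r
# Square root of the product of two counting stats.
weighted_stat <- function(a, b) sqrt(a * b)

# Three hypothetical relievers, each with 40 games finished.
relievers <- data.frame(SV = c(40, 10, 0), GF = c(40, 40, 40))
relievers$weightedsv <- weighted_stat(relievers$SV, relievers$GF)
relievers$weightedsv   # 40 20 0: largest when every game finished is a save

# Two hypothetical starters, each with 4 complete games.
starters <- data.frame(CG = c(4, 4), SHO = c(4, 1))
starters$weightedsho <- weighted_stat(starters$CG, starters$SHO)
starters$weightedsho   # 4 2: largest when every complete game is a shutout

# Squared terms for the marginal-effect regressors (made-up stat lines).
stats <- data.frame(W = c(19, 8), HR = c(12, 25), K = c(250, 110), BB = c(60, 70))
stats$Wsq  <- stats$W^2
stats$HRsq <- stats$HR^2
stats$Ksq  <- stats$K^2
stats$BBsq <- stats$BB^2
```

Precomputing the squares as columns keeps the model formula readable; writing `I(W^2)` inside the formula would be an equivalent alternative.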

After playing with the stats with parsimony in mind, I came up with a number of models, the best of which is:

Call:
lm(formula = model <- cypoints ~ W + Wsq + HR + HRsq + K + Ksq +
BB + BBsq + weightedsv + weightedsho)

Residuals:
Min       1Q   Median       3Q      Max
-40.7374  -1.0710  -0.1198   1.1044 122.7243

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.995e-03  2.795e-01   0.007   0.9943
W           -1.295e+00  1.315e-01  -9.844  < 2e-16 ***
Wsq          1.260e-01  7.371e-03  17.091  < 2e-16 ***
HR           1.807e-01  7.286e-02   2.480   0.0132 *
HRsq        -1.499e-02  2.143e-03  -6.996 3.19e-12 ***
K           -8.473e-02  1.642e-02  -5.161 2.61e-07 ***
Ksq          5.972e-04  6.734e-05   8.869  < 2e-16 ***
BB           2.292e-01  3.143e-02   7.292 3.82e-13 ***
BBsq        -2.826e-03  3.041e-04  -9.295  < 2e-16 ***
weightedsv   7.411e-02  1.652e-02   4.487 7.49e-06 ***
weightedsho  2.443e+00  3.252e-01   7.513 7.43e-14 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.245 on 3211 degrees of freedom
Multiple R-squared: 0.3367,     Adjusted R-squared: 0.3346
F-statistic:   163 on 10 and 3211 DF,  p-value: < 2.2e-16

It’s not a great predictor, explaining only about 34% of the variation in points. However, every regressor is statistically significant at the 95% level, and all but HR (p = 0.013) clear the 99.9% level. Some of the other models I tried are here, so you can get an idea of how significant or insignificant other stats might have been at predicting the Cy Young winner.
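The mixed signs on the linear and squared terms are easier to read as marginal effects. With a term pair like W and Wsq, the predicted change in points per extra win is the linear coefficient plus twice the squared coefficient times W, holding everything else fixed. The sketch below plugs in the coefficients from the output above; the helper names are my own.

```r
# Coefficients copied from the regression output above.
coefs <- c(W = -1.295, Wsq = 0.1260, HR = 0.1807, HRsq = -0.01499,
           K = -0.08473, Ksq = 0.0005972, BB = 0.2292, BBsq = -0.002826)

# Slope of the fitted curve at x: b1 + 2 * b2 * x.
marginal_effect <- function(linear, squared, x) {
  coefs[[linear]] + 2 * coefs[[squared]] * x
}

# Value of x where the slope changes sign: -b1 / (2 * b2).
turning_point <- function(linear, squared) {
  -coefs[[linear]] / (2 * coefs[[squared]])
}

turning_point("W", "Wsq")         # about 5.1 wins
turning_point("K", "Ksq")         # about 71 strikeouts
marginal_effect("W", "Wsq", 20)   # about 3.7 points per extra win at 20 wins
```

Read this way, the negative coefficient on W doesn’t mean wins hurt. The fitted curve dips slightly over the first five or so wins and then climbs faster and faster, so the extra wins at the top of the leaderboard are the ones that move the predicted points.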

The long and the short of it is that common statistical measures appear to have very little predictive value for Cy Young voting.