Diagnosing the AL

December 22, 2010

Posted by tomflesher in Baseball, Economics.
Tags: 2010, American League, baseball-reference.com, R, regression, statistics, Year of the Pitcher

In the previous post, I crunched some numbers on an earlier forecast I’d made and figured out that it was a pretty crappy forecast. (That’s the fun of forecasting, of course – sometimes you’re right and sometimes you’re wrong.) The funny part of it, though, is that the observed home runs per game for the American League were so far off – 3.4 standard errors below the predicted value – that it’s highly unlikely that the regression model I used controls for all relevant variables. That’s not surprising, since it was only a time trend with a dummy variable for the designated hitter.

There are a couple of things to check for immediately. The first is the most common explanation thrown around when home runs drop – steroids. It seems to me that if the drop in home runs were due to better control of performance-enhancing drugs, then it should mostly be home runs that are affected. For example, intentional walks should probably be below expectation, since intentional walks are used to protect against a home run hitter. Unintentional walks should probably be about as expected, since walks are a function of plate discipline and pitcher control, not of strength. On-base percentage should drop, but by less than home runs do, since some hits that would have been home runs will stay in the park as singles, doubles, or triples rather than all being fly-outs. Finally, slugging average should drop because a loss in power without a corresponding increase in speed will lower total bases.

I’ll analyze each of these in R below.

Using R, I fitted time-series models of the same functional form as the home runs per game model. I pulled the data from the Baseball-Reference.com AL Batting Encyclopedia and regressed the variable of interest on a time trend, its square, and a dummy for the designated hitter.
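
As a rough sketch of that setup, the helper below fits the same specification for any one statistic. It assumes a data frame with one row per season and columns t (the time index), DH (1 in designated-hitter seasons, 0 otherwise), and the per-game statistic. The function name fit_trend and its arguments are placeholders for illustration, not part of the original analysis.

```r
# Fit stat ~ t + tsq + DH for one column of a season-level data frame.
# `data` and `response` are hypothetical names used only for this sketch.
fit_trend <- function(data, response) {
  data$tsq <- data$t^2  # squared time trend
  lm(reformulate(c("t", "tsq", "DH"), response = response), data = data)
}

# Usage, assuming a data frame `al` with columns t, DH and IBB:
# ibb.lm <- fit_trend(al, "IBB")
```

Building the formula with reformulate() keeps the right-hand side identical across statistics, so only the response changes from one model to the next.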

First Assumption: Intentional walks should decrease.

Results:

> ibb.lm <- lm(IBB ~ t + tsq + DH)
> summary(ibb.lm)

Call:
lm(formula = IBB ~ t + tsq + DH)

Residuals:
       Min         1Q     Median         3Q        Max
-0.1350376 -0.0261969  0.0005516  0.0294412  0.1534536

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.656e-01  1.408e-02  18.870  < 2e-16 ***
t            8.037e-03  1.199e-03   6.706 1.01e-09 ***
tsq         -1.393e-04  2.024e-05  -6.882 4.30e-10 ***
DH          -1.140e-01  1.055e-02 -10.805  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04689 on 106 degrees of freedom
Multiple R-squared: 0.5961,     Adjusted R-squared: 0.5847
F-statistic: 52.14 on 3 and 106 DF,  p-value: < 2.2e-16

> ibb.2010.fitted <- (2.656e-01) + (8.037e-03)*56 + (-1.393e-04)*(56**2) + (-1.140e-01)
> ibb.2010.obs <- .2
> residual.ibb <- ibb.2010.obs - ibb.2010.fitted
> se.ibb <- .04689
> residual.ibb/se.ibb
[1] 0.750113


Intentional walks per game came in above the fitted value, but by less than one standard error (about 0.75). Statistically, intentional walks were in line with the trend.
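
The arithmetic in that transcript is the same for every statistic in this post, so it can be wrapped in a small helper. This is only a sketch: the function name std_miss and its argument names are made up for illustration, and the coefficients are the rounded values from the summary table, in the order (Intercept), t, tsq, DH.

```r
# Standardized miss: (observed - fitted) / residual standard error.
# Coefficients are expected in the order (Intercept), t, tsq, DH.
std_miss <- function(coefs, t, dh, observed, sigma) {
  fitted <- sum(coefs * c(1, t, t^2, dh))
  (observed - fitted) / sigma
}

# Intentional walks, 2010: t = 56, DH = 1, observed 0.2 per game.
ibb_coefs <- c(2.656e-01, 8.037e-03, -1.393e-04, -1.140e-01)
ibb_z <- std_miss(ibb_coefs, t = 56, dh = 1, observed = 0.2, sigma = 0.04689)
print(ibb_z)  # about 0.75, matching the transcript
```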

Second Assumption: Unintentional walks should not change.

Results:

> uBB <- (BB-IBB)
> ubb.lm <- lm(uBB ~ t + tsq + DH)
> summary(ubb.lm)

Call:
lm(formula = uBB ~ t + tsq + DH)

Residuals:
     Min       1Q   Median       3Q      Max
-0.69256 -0.12758 -0.01390  0.13178  0.77866

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.0879505  0.0732669  42.147  < 2e-16 ***
t           -0.0190285  0.0062392  -3.050 0.002892 **
tsq          0.0003623  0.0001054   3.439 0.000837 ***
DH           0.1812598  0.0549094   3.301 0.001313 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2441 on 106 degrees of freedom
Multiple R-squared: 0.1876,     Adjusted R-squared: 0.1647
F-statistic: 8.162 on 3 and 106 DF,  p-value: 6.127e-05

> ubb.2010.fitted <- 3.0879505 + (-.0190285)*56 + (.0003623)*(56**2) + .1812598
> ubb.2010.obs <- 3.25 - .2
> residual.ubb <- ubb.2010.obs - ubb.2010.fitted
> se.ubb <- .2441
> residual.ubb/se.ubb
[1] -1.187166


Unintentional walks came in a bit over one standard error below the fitted value (about 1.19). Again, that isn’t a big enough deviation to say that it’s statistically different from our expectation.
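
Retyping rounded coefficients by hand works, but the same number can be pulled from the fitted object directly. The sketch below assumes the model is still in the workspace. The function name std_miss_fit and its arguments are hypothetical, and it keeps the residual standard error as the yardstick, as the transcripts do.

```r
# Standardized miss computed straight from a fitted lm object.
# `fit`, `newdata`, and `observed` are hypothetical argument names.
std_miss_fit <- function(fit, newdata, observed) {
  fitted <- predict(fit, newdata = newdata)  # point prediction for the new row
  sigma <- summary(fit)$sigma                # residual standard error
  unname((observed - fitted) / sigma)
}

# Usage, assuming ubb.lm from the transcript is still available:
# std_miss_fit(ubb.lm, data.frame(t = 56, tsq = 56^2, DH = 1), 3.25 - 0.2)
```

Because predict() uses the unrounded coefficients, the result may differ from the hand calculation in the last few decimal places.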

Third Assumption: OBP drops, but by somewhat less than 3.4 standard errors.

Results:

> obp.lm <- lm(OBP ~ t + tsq + DH)
> summary(obp.lm)

Call:
lm(formula = OBP ~ t + tsq + DH)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0217348 -0.0044903  0.0002799  0.0046695  0.0182481

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.238e-01  2.230e-03 145.199  < 2e-16 ***
t           -5.703e-04  1.899e-04  -3.003  0.00334 **
tsq          1.472e-05  3.207e-06   4.591 1.22e-05 ***
DH           8.245e-03  1.671e-03   4.933 3.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.00743 on 106 degrees of freedom
Multiple R-squared: 0.487,      Adjusted R-squared: 0.4724
F-statistic: 33.54 on 3 and 106 DF,  p-value: 2.532e-15

> obp.2010.fitted <- (3.238e-01) + (-5.703e-04)*56 + (1.472e-05)*(56**2) + 8.245e-03
> obp.2010.obs <- .327
> residual.obp <- obp.2010.obs - obp.2010.fitted
> se.obp <- .00743
> residual.obp/se.obp
[1] -2.593556


OBP dropped, and by quite a bit: about 2.6 standard errors, which is still less than the 3.4 for home runs. Without more information it’s hard to judge whether a change of this magnitude is due to better pitching or power being taken away from hitters.

Fourth Assumption: Slugging average will drop.

Results:

> slg.lm <- lm(SLG ~ t + tsq + DH)
> summary(slg.lm)

Call:
lm(formula = SLG ~ t + tsq + DH)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0357646 -0.0087050 -0.0007988  0.0115133  0.0317497

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.937e-01  4.471e-03  88.050  < 2e-16 ***
t           -2.058e-03  3.807e-04  -5.404 4.04e-07 ***
tsq          5.049e-05  6.429e-06   7.853 3.51e-12 ***
DH           1.693e-02  3.351e-03   5.054 1.82e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01489 on 106 degrees of freedom
Multiple R-squared: 0.6452,     Adjusted R-squared: 0.6352
F-statistic: 64.27 on 3 and 106 DF,  p-value: < 2.2e-16

> slg.2010.fitted <- (3.937e-01) + (-2.058e-03)*56 + (5.049e-05)*(56**2) + (1.693e-02)
> slg.2010.obs <- .407
> residual.slg <- slg.2010.obs - slg.2010.fitted
> se.slg <- .01489
> residual.slg/se.slg
[1] -3.137585


A drop in slugging average of over three standard errors indicates that something has either sapped hitters’ power specifically or hurt their ability to hit in general. The results so far are consistent with either explanation.
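
To put the four misses side by side, the sketch below converts each one into a rough two-sided p-value. Treating a miss as a t statistic with the 106 residual degrees of freedom is my assumption here, not something the models above establish, and it ignores the uncertainty in the fitted values themselves, so the numbers are only a guide.

```r
# Standardized misses copied from the four transcripts in this post.
z <- c(IBB = 0.750113, uBB = -1.187166, OBP = -2.593556, SLG = -3.137585)

# Rough two-sided p-values, treating each miss as a t statistic on the
# 106 residual degrees of freedom reported by summary(). Approximate only.
p <- 2 * pt(-abs(z), df = 106)
print(round(p, 4))
```

On this rough scale the two walk measures sit comfortably inside the usual 5% band, while OBP and SLG fall outside it.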

This isn’t evidence about steroid use one way or the other. In fact, the same results would be consistent with a shift toward pitching talent. More work needs to be done on this year’s data before conclusions can be drawn. However, it does seem to indicate that, at least in the American League, the Year of the Pitcher narrative has some statistical foundation.
