The DH Redux: Japan
June 7, 2010
Posted by tomflesher in Baseball. Tags: Baseball, baseballguru.com, designated hitter, Japan, NPB, OBP, regression, replication
In an earlier post, I analyzed team-level data from Major League Baseball to determine the size of the effect the Designated Hitter rule has on on-base percentage. The conclusion I came to was that, if the model is properly specified, the effect of the designated hitter rule is about .008 in on-base percentage. If that reasoning is correct, then, absent other confounding variables, the effect should be similar in size in any other professional league.
Of course, the other major professional league is Nippon Professional Baseball, the major leagues of Japan. Since it produces players at a level similar to MLB's, and the other factors are comparable – the DH rule was adopted in 1975 by one, but not both, of its two leagues – NPB is an ideal place to test the model I specified in that post.
I’m working with a dataset pulled from Jim Albright’s BaseballGuru.com Japanese Baseball Data archive. First, note that OBP, HBP, and SF data weren’t readily available. As a result, I’m approximating OBP as (H + BB)/(AB + BB). This neglects hit batsmen and sacrifice flies, so OBP is off by a shade; since I have no idea how prevalent HBP and SF are in Japan, I can’t say whether this pseudo-OBP is overstated or understated. Second, it’s worth stating that there may be non-economic concerns, such as strategy preferences (i.e., tastes), that could explain a similar result. Third, I don’t have enough data for the Japanese leagues to determine whether the two leagues are in fact statistically similar. With all that in mind, though, the DH rule is the same in NPB as it is in MLB, so the effect should be of a similar sign and magnitude.
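For concreteness, the approximation is trivial to compute in R. This is a minimal sketch, assuming a data frame named npb with one row per team-season and columns named H, BB, and AB; those names are assumptions on my part, not the archive's actual headers.

```r
# Pseudo-OBP: on-base percentage ignoring hit batsmen and sacrifice flies
# (npb, H, BB, AB are assumed names, not the BaseballGuru.com archive's)
npb$pseudoOBP <- (npb$H + npb$BB) / (npb$AB + npb$BB)
```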
Once again, we’re testing the null hypothesis

$H_0: \beta_3 = 0$

using a regression of the form

$OBP = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 DH + \varepsilon$
Since I have data back to 1937, t begins at t = 0 in that year, rather than in 1955 as with the MLB data.
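A sketch of how the regressors might be constructed, assuming the data frame has Year and League columns (again, assumed names); the DH dummy flags Pacific League team-seasons from 1975 on, since that's the league that adopted the rule.

```r
# Time trend starts at zero in 1937; DH is 1 for Pacific League seasons
# from 1975 onward (Year and League are assumed column names)
npb$t   <- npb$Year - 1937
npb$tsq <- npb$t^2
npb$DH  <- as.numeric(npb$League == "Pacific" & npb$Year >= 1975)
```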
Using R, I ran the regression with the following results:
| | Estimate | Std. Error | t value | Pr(>\|t\|) | Significance |
|---|---|---|---|---|---|
| (Intercept) | 0.03064 | 0.00471 | 65.07800 | 0.00000 | 1.00000 |
| t | -0.00050 | 0.00030 | -1.69800 | 0.09224 | 0.90776 |
| tsq | 0.00001 | 0.00000 | 3.11300 | 0.00233 | 0.99767 |
| DH | 0.01097 | 0.00354 | 3.17700 | 0.00191 | 0.99809 |
Multiple R-squared: 0.3929, Adjusted R-squared: 0.3772
F-statistic: 25.02 on 3 and 116 DF, p-value: 1.475e-12
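For anyone following along, the fit and the heteroskedasticity check discussed below can be reproduced with something along these lines (a sketch; the data frame and column names are the assumptions from above, and the lmtest package supplies the Breusch-Pagan test):

```r
library(lmtest)  # for bptest()

fit <- lm(pseudoOBP ~ t + tsq + DH, data = npb)
summary(fit)  # coefficient table, R-squared, F-statistic
bptest(fit)   # Breusch-Pagan test for heteroskedasticity
```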
The regression had a Breusch-Pagan p-value of .5372, so we cannot reject the null of homoskedasticity (which is good news for us). The adjusted R-squared shows that this regression explains about 38% of the variation in pseudo-OBP using our variables. Let’s look at how the effects stack up against each other:
| | MLB | NPB | ∆ (MLB − NPB) | \|∆\| / S.E. (MLB) | \|∆\| / S.E. (NPB) |
|---|---|---|---|---|---|
| (Intercept) | 0.32310 | 0.03064 | 0.29246 | 130.3879 | 62.1198 |
| t | -0.00047 | -0.00050 | 0.00003 | 0.175 | 0.111074 |
| tsq | 0.000013 | 0.00001 | 0.00000 | 0.083333 | 0.06105 |
| DH | 0.008036 | 0.01097 | -0.00293 | 1.74955 | 0.82835 |
The big surprises are, first, that the difference in the DH terms is so large when measured in MLB standard errors, and second, the difference in the intercepts. The easy one first: the starting years are different, so the intercept is of very little interest to us, and there is more baseline data in the NPB dataset since it extends back to 1937 rather than only to 1955. As for the DH term, the gap looks large in MLB standard errors mainly because the MLB standard error is relatively small. Even so, each estimate still falls within the other's 95% confidence interval, which means we cannot reject the hypothesis that the two effects are equal. The signs are the same and the magnitudes are similar, and, again, we're looking at pseudo-OBP for NPB instead of properly calculated OBP.
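The arithmetic behind the comparison table is just the difference in point estimates scaled by each model's standard error. A sketch, assuming fit_mlb and fit_npb are the two fitted lm objects (the names are mine):

```r
# Difference in DH effects, expressed in units of each model's standard error
# (fit_mlb and fit_npb are assumed names for the two fitted models)
b_mlb  <- coef(fit_mlb)["DH"]
b_npb  <- coef(fit_npb)["DH"]
se_mlb <- summary(fit_mlb)$coefficients["DH", "Std. Error"]
se_npb <- summary(fit_npb)$coefficients["DH", "Std. Error"]

delta <- b_mlb - b_npb
abs(delta) / se_mlb  # |delta| in MLB standard errors
abs(delta) / se_npb  # |delta| in NPB standard errors
```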
If I can find data on HBP and SF, it will be interesting to examine the data more closely.
First of all, comparisons like this between NPB and MLB interest me a great deal. Thank you for doing this little study.
Secondly, I’ve been wanting to learn R, but haven’t yet had a serious project to push me into diving in. Reading “Understanding Sabermetrics” by Costa, Huber and Saccoman, I’ve been thinking that the time may be near to try to figure out how some of the constants were derived, and R may be the tool. But I’d be applying NPB data instead of MLB data.
My main complaint about the book has been that most of the constants used in its formulas are derived “off stage” using some sort of magic. Furthermore, concepts like “regressions of β3=0” are 20 years behind me, like most of my “maths.” I can figure out most any programming language through example, though, so I would be most interested to see the source code for how you created the above tables.
Thirdly, while your MLB data may cover a shorter time period, NPB has played a shorter season than MLB, especially considering that the first few years of its existence were mainly made up of a few tournaments. So I would guess that the total number of games PER TEAM for NPB from 1937 and MLB from 1955 may be fairly close. But the number of teams is much, much lower, so you are actually getting many fewer plate appearance opportunities from the NPB data.
Also, the two-league system started in 1950. This seems to be when a lot of official record books start taking records seriously, like the Modern Era stats in MLB compared to the Dead Ball Era and such. I would suggest starting from there.
Furthermore, the sacrifice bunt has been a big part of the game in Japan since the Yomiuri Giants visited the Dodgers’ training camp in the 1960s. The emphasis on “small ball” techniques may also play a role in the OBP differences especially.
Finally, I have HBP and SF for recent NPB seasons, and with some time, can enter them for seasons going back to 1936 (for HBP) and/or 1939 (when SF started). Seeing your original MLB R code would help in deciding where to concentrate my data input efforts. (Please note, I don’t have so much free time to do this in a timely manner – but would be interested in getting this done slowly.)
Hi, Michael. Good to hear from you!
The coefficients in this post (and most of the estimates I make in other posts) are estimated using linear regression, specifically an ordinary least squares (OLS) formula. Basically, I start off with the idea that, maybe, overall, time has some effect on OBP, since pitching and batting ability don’t increase at the same rate, and then beyond that maybe the DH rule has an effect. Then, I fit a line to the data, and maybe the Betas (amount you multiply the variables by to get the output value) are significant and maybe they aren’t.
OLS is sort of an unholy abomination of calculus and linear algebra – basically, it projects the multidimensional data (here, OBP and a yes-or-no “Is there a DH?” question with yes=1 and no=0) onto a time plot and then uses calculus to find a function that minimizes the sum of squared errors (the total squared distance between the actual data points and the data points that the model would predict).
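To make that concrete, here's a toy sketch of the machinery: the coefficients that minimize the sum of squared errors come straight out of the design matrix as (X'X)⁻¹X'y, which is essentially what lm() computes for you (with better numerics). The npb data frame and its columns are the same assumed names as in the post above.

```r
# Hand-rolled OLS: solve the normal equations (X'X) beta = X'y
X    <- cbind(1, npb$t, npb$tsq, npb$DH)   # design matrix with an intercept column
y    <- npb$pseudoOBP
beta <- solve(t(X) %*% X, t(X) %*% y)      # same estimates lm() would report
beta
```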
As you’d imagine, this is a pretty tedious thing to do, which is why R (and other computer programs) can be so useful – once the data is loaded, it’s as simple as calling a function like “name-I-give-the-model <- lm(OBP ~ t + tsq + DH)". R generates a table that gives me the estimates for the Betas as well as the standard errors (which are functions of, among other things, the amount of data that we have) and the t-values (which are handy representations of how statistically significant the estimates should be).
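If it helps with reading the output, that table can also be pulled out of a fitted model programmatically; a small sketch using a fit object like the one above, where each t value is just the estimate divided by its standard error:

```r
tab <- summary(fit)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)
tab
tab[, "Estimate"] / tab[, "Std. Error"]   # reproduces the t value column
```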
There are some drawbacks to OLS, including the likelihood that omitting a relevant variable can distort the coefficient estimates, and the fact that it can be difficult to find good representations of all relevant variables. For example, it's probably impossible to quantify strategy preferences like the preference for small ball that you cite.