Diagnosing the AL December 22, 2010
Posted by tomflesher in Baseball, Economics. Tags: 2010, American League, baseball-reference.com, R, regression, statistics, Year of the Pitcher
In the previous post, I crunched some numbers on an earlier forecast I’d made and figured out that it was a pretty crappy forecast. (That’s the fun of forecasting, of course – sometimes you’re right and sometimes you’re wrong.) The funny part of it, though, is that the observed home runs per game for the American League were so far off – 3.4 standard errors below the predicted value – that it’s highly unlikely that the regression model I used controls for all relevant variables. That’s not surprising, since it was only a time trend with a dummy variable for the designated hitter.
There are a couple of things to check for immediately. The first is the most common explanation thrown around when home runs drop – steroids. It seems to me that if the drop in home runs were due to better control of performance-enhancing drugs, then it should mostly be home runs that are affected. For example, intentional walks should probably be below expectation, since intentional walks are used to protect against a home run hitter. Unintentional walks should probably be about as expected, since walks are a function of plate discipline and pitcher control, not of strength. On-base percentage should probably drop by a smaller magnitude than home runs, since some hits that would have been home runs will stay in the park as singles, doubles, or triples rather than all being fly-outs. There will be a drop, but it won’t be as big. Finally, slugging average should drop because a loss in power without a corresponding increase in speed will lower total bases.
I’ll analyze these with pretty new R code behind the cut.
What Happened to Home Runs This Year? December 22, 2010
Posted by tomflesher in Baseball, Economics. Tags: baseball-reference.com, forecasting, home runs, R, regression, standard error, statistics, time series, Year of the Pitcher
I was talking to Jim, the writer behind Apparently, I’m An Angels Fan, who’s gamely trying to learn baseball because he wants to be just like me. Jim wondered aloud how much the vaunted “Year of the Pitcher” has affected home run production. Sure enough, on checking the AL Batting Encyclopedia at Baseball-Reference.com, production dropped by about .16 home runs per game (from 1.13 to .97). Is that normal statistical variation or does it show that this year was really different?
In two previous posts, I looked at the trend of home runs per game to examine Stuff Keith Hernandez Says and then examined Japanese baseball’s data for evidence of a structural break. I used the Batting Encyclopedia to run a time-series regression for a quadratic trend and added a dummy variable for the Designated Hitter. I found that the time trend and DH control account for approximately 56% of the variation in home runs per year, and that the functional form is $\widehat{HR}_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 DH_t$, where $DH_t$ is 1 when the designated hitter is in use,
with t=1 in 1955, t=2 in 1956, and so on. That means t=56 in 2010. Consequently, we’d expect home run production per game in 2010 in the American League to be approximately 1.25.
That means we expected production to increase this year, and instead it dropped precipitously, for a residual of -.28. The residual standard error on the original regression was .1092 on 106 degrees of freedom, so the critical t-value using Texas A&M’s table is 1.984 (approximating using 100 df). That means we can be 95% confident that the actual number of home runs per game should fall within .1092*1.984, or about .2167, of the expected value. The lower bound would be about 1.03, meaning we’re still significantly below what we’d expect. In fact, the observed number is about 3.4 standard errors below the expected number. In other words, we’d expect that to happen by chance less than .1% (that is, less than one tenth of one percent) of the time.
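The interval arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the post; the expected value is not quoted directly, so it is backed out here from the observed rate and the residual, which is an assumption.

```python
# Figures quoted in the post.
residual_se = 0.1092   # residual standard error of the regression
t_crit = 1.984         # two-sided 95% critical value, read from a table at 100 df
observed = 0.97        # 2010 AL home runs per game
residual = -0.28       # observed minus expected

# Assumption: the expected value is recovered from the residual, not quoted.
expected = observed - residual

# Half-width of the 95% band around the expected value.
margin = residual_se * t_crit
lower = expected - margin
upper = expected + margin

print(f"expected={expected:.2f} margin={margin:.4f} band=({lower:.2f}, {upper:.2f})")
print("observed below band:", observed < lower)
```

The band uses the residual standard error alone, as the post does; a full prediction interval would be somewhat wider.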
Clearly, something else is in play.
Quickie: Ryan Howard’s Choke Index October 25, 2010
Posted by tomflesher in Baseball. Tags: baseball-reference.com, binomial distribution, Choke Index, Phillies, Ryan Howard, statistics
The Choke Index is alive and well.
Previous to 2010, Ryan Howard of the Philadelphia Phillies hit home runs in three consecutive postseasons. He managed 7 in his 140 plate appearances, averaging out to .05 home runs per plate appearance. Not too shabby. It’s a bit below his regular season rate of about .067, but there are a bunch of things that could account for that.
This year, Ryan made 38 plate appearances and hit a grand total of 0 home runs in the postseason. What’s the likelihood of that happening? I use the Choke Index (one minus the probability of hitting 0 home runs in a given number of plate appearances) to measure that. As always, the closer a player gets to 1, the more unlikely his homer-free streak is.
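The binomial probability can be calculated using the formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Or, since we’re looking for the probability of an event NOT occurring, $P(X = 0) = (1-p)^n$
or $P(X = 0) = (1 - .05)^{38} \approx .142$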
using his career postseason numbers. That means that Ryan Howard’s 2010 postseason Choke Index is .858. Pretty impressive!
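The Choke Index as defined above (one minus the probability of zero home runs in a given number of plate appearances) is easy to reproduce. Here is a minimal Python sketch; `choke_index` is a hypothetical helper name, and the inputs are Howard's career postseason figures from the text.

```python
def choke_index(hr_rate, plate_appearances):
    """One minus the chance of going homerless over the given plate appearances.

    Assumes each plate appearance is an independent trial with the same
    home run probability, as in the binomial setup in the text.
    """
    p_no_homer = (1.0 - hr_rate) ** plate_appearances
    return 1.0 - p_no_homer


# Ryan Howard: 7 home runs in 140 career postseason plate appearances,
# then 38 homerless plate appearances in the 2010 postseason.
howard_rate = 7 / 140
howard_2010 = choke_index(howard_rate, 38)
print(round(howard_2010, 3))
```

The same function works for any rate and streak length, so other homerless stretches can be scored the same way.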
Teixeira and Cano: Picking up slack? August 5, 2010
Posted by tomflesher in Baseball, Economics. Tags: A-Rod, Alex Rodriguez, binomial distribution, Mark Teixeira, probability, Robinson Cano, statistics, Yankees
Michael Kay, the YES broadcaster for the Yankees, often pointed out between July 22 and August 4 that the Yankees were turning up their offense to make up for Alex Rodriguez’s lack of home run production. That seems like it might be subject to significant confirmation bias – seeing a few guys hit home runs when you wouldn’t expect them to might lead you to believe that the team in general has increased its production. So, did the Yankees produce more home runs during A-Rod’s drought?
During the first 93 games of the season, the Yankees hit 109 home runs in 3660 plate appearances for rates of 1.17 home runs per game and .0298 home runs per plate appearance. From July 23 to August 3, they hit 17 home runs in 451 plate appearances over 12 games for rates of 1.42 home runs per game and .0377 home runs per plate appearance. Obviously those numbers are quite a bit higher than expected, but could they be due simply to chance?
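Assume for the moment that the first 93 games represent the team’s true production capabilities. Then, using the binomial distribution, the likelihood of hitting 17 or more home runs in 451 plate appearances follows from the cumulative probability $P(X \le k) = \sum_{i=0}^{k} \binom{451}{i} (.0298)^i (1 - .0298)^{451-i}$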
The cumulative probability at 17 is about .868, meaning the probability of hitting 17 or fewer home runs is .868 and the probability of hitting more than that is about .132. The probability of hitting 16 or fewer is .805, which means out of 100 strings of 451 plate appearances about 81 of them should end with 16 or fewer home runs, and about 19 should end with at least 17. This is a perfectly reasonable number and not inherently indicative of a special performance by A-Rod’s teammates.
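These cumulative probabilities can be reproduced by summing the binomial probability mass directly. The Python sketch below takes the first 93 games as the true rate, as the post does; `binom_cdf` is a hypothetical helper, not anything from the original analysis.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_pa = 451          # plate appearances from July 23 to August 3
p_hr = 109 / 3660   # home runs per plate appearance over the first 93 games

p_17_or_fewer = binom_cdf(17, n_pa, p_hr)
p_16_or_fewer = binom_cdf(16, n_pa, p_hr)
p_at_least_17 = 1 - p_16_or_fewer

print(f"P(X <= 17) = {p_17_or_fewer:.3f}")
print(f"P(X <= 16) = {p_16_or_fewer:.3f}")
print(f"P(X >= 17) = {p_at_least_17:.3f}")
```

Note that "at least 17" is the complement of "16 or fewer", not of "17 or fewer", which is why both cumulative values appear.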
Kay frequently cited Mark Teixeira and Robinson Cano as upping their games. Teixeira hit 18 home runs over the first 93 games and made 423 plate appearances for rates of .194 home runs per game and .0426 home runs per plate appearance. From July 23 to August 3, he had 5 home runs in 12 games and 54 plate appearances for rates of .417 per game and .0926 per plate appearance. A stretch with at least that many home runs is only about 8% likely at his earlier rate, meaning that either Teixeira did up his game considerably or he was exceptionally lucky.
Cano played 92 games up to July 21, hitting 18 home runs in 400 plate appearances for rates of .196 home runs per game and .045 per plate appearance. During A-Rod’s drought, he hit 3 home runs in 50 plate appearances over 12 games for rates of .25 and .06. A stretch with at least that many home runs is about 39% likely at his earlier rate, which means we don’t have enough evidence to reject the idea that Cano’s performance (though better than usual) is just a random fluctuation.
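The post does not say exactly how the 8% and 39% figures were computed. One reading that reproduces them is the binomial upper tail: the chance of hitting at least the observed number of home runs at the player's earlier rate. A Python sketch under that assumption (`binom_upper_tail` is a hypothetical helper):

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below_k


# Teixeira: 18 HR in 423 PA beforehand, then 5 HR in 54 PA.
teixeira_tail = binom_upper_tail(5, 54, 18 / 423)

# Cano: 18 HR in 400 PA beforehand, then 3 HR in 50 PA.
cano_tail = binom_upper_tail(3, 50, 18 / 400)

print(f"Teixeira: {teixeira_tail:.3f}")
print(f"Cano:     {cano_tail:.3f}")
```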
It will be interesting to see if Teixeira slows down as a home-run hitter now that Rodriguez’s drought is over.
Is A-Rod’s Performance Different? August 3, 2010
Posted by tomflesher in Baseball, Economics. Tags: A-Rod, Alex Rodriguez, Choke Index, OBP, p-value, probability, SLG, statistics, t-value, Yankees
In games between milestone home runs, is Alex Rodriguez’s hitting similar to other times? (This is all a very polite way of asking, “Does A-Rod choke?”) It’s difficult to answer, because there’s so little data about those milestone home runs. A-Rod, though, has some statistically improbable results and it would be interesting to look at them a bit more closely.
Over 2008-2009, Alex played in 262 games and had 1129 plate appearances with 281 hits, 65 home runs, a triple:double ratio of 1:50, an OBP of .397, and a SLG of .553. His OBP has a standard error of about .0146, so, taking roughly two standard errors either side, we can be 95% confident that over those years his baseline production would be somewhere between .368 and .426. Absent any time or age effect, that is the range in which A-Rod should produce for any given period.
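The .368 to .426 range is consistent with treating OBP as a proportion over 1129 plate appearances and going 1.96 standard errors either side. The post does not state the multiplier, so 1.96 is an assumption in this Python sketch:

```python
from math import sqrt

obp = 0.397   # 2008-2009 on-base percentage
n_pa = 1129   # plate appearances over the same span
z_95 = 1.96   # assumption: normal critical value for a two-sided 95% interval

# Standard error of a proportion.
se = sqrt(obp * (1 - obp) / n_pa)

lower = obp - z_95 * se
upper = obp + z_95 * se

print(f"se={se:.4f} interval=({lower:.3f}, {upper:.3f})")
```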
Two recent milestone home runs come to mind as examples of Rodriguez’s reputed choking. First, the stretch between home run #499 and #500 was 8 games and 36 plate appearances. (I’m intentionally ignoring extra plate appearances on the days he hit #499 and #500.) During that time, Alex had an OBP of only .306. That’s a difference of .091 over 36 plate appearances and that performance has a standard error of about .078 when compared with his regular performance, implying a t-value of about 1.16. With 35 degrees of freedom, Texas A&M’s t Calculator gives a p-value of about .127, so this difference is marginally within the realm of chance. (The usual cutoff for significance would be .05.)
A-Rod hit his last home run on July 22. Discounting the plate appearances after his last home run, he’s played in 11 games with a paltry .255 OBP and .238 SLG over 47 plate appearances. His .255 OBP has a difference of about .142 and a standard error of about .064. That implies a t-value of about 2.21, with a p-value of about .016. That is, the probability of this difference occurring by chance is less than 2%. That gives us one result as close to significant and one as probably significant.
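The standard errors and t-values for both stretches can be reproduced in Python. The quoted standard errors (.078 and .064) match the sample proportion with n - 1 in the denominator, so that formula is assumed here; the post does not spell it out, and `proportion_t` is a hypothetical helper. The p-values are left out because the standard library has no t distribution.

```python
from math import sqrt


def proportion_t(baseline, observed, n):
    """Return (standard error, t-value) for an observed proportion vs. a baseline.

    Assumption: SE = sqrt(observed * (1 - observed) / (n - 1)), chosen because
    it reproduces the standard errors quoted in the text.
    """
    se = sqrt(observed * (1 - observed) / (n - 1))
    return se, (baseline - observed) / se


baseline_obp = 0.397  # 2008-2009 OBP

# Between home runs #499 and #500: .306 OBP over 36 plate appearances.
se_500, t_500 = proportion_t(baseline_obp, 0.306, 36)

# Since the July 22 home run: .255 OBP over 47 plate appearances.
se_drought, t_drought = proportion_t(baseline_obp, 0.255, 47)

print(f"#499-#500: se={se_500:.3f} t={t_500:.2f} df={36 - 1}")
print(f"drought:   se={se_drought:.3f} t={t_drought:.2f} df={47 - 1}")
```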
As a side note, A-Rod’s Choke Index continues to rise. He’s gone 48 plate appearances without a home run, and at a rate of .055 home runs per plate appearance the probability of that occurring by chance is about .066. That leaves his Choke Index at .934.
The Best Game Ever July 30, 2010
Posted by tomflesher in Baseball. Tags: 600 home runs, Alex Rodriguez, Andy Marte, Chan Ho Park, Colin Curtis, designated hitter, Frank Hermann, Gabe Kapler, Indians, Jess Todd, Joe Girardi, Joe Smith, losing DH, Marcus Thames, Mitch Talbot, Nick Swisher, position players pitching, probability, Rafael Perez, statistics, Tony Sipp, Yankees
Two of my favorite things about baseball happened during tonight’s game between the Yankees and the Indians.
First of all, in the top of the ninth inning, corner infielder Andy Marte pitched for the Indians. Marte pitched a perfect ninth and coincidentally struck out Nick Swisher, who was brought in to pitch for the Yankees in a similar situation last year and struck out Gabe Kapler of the Tampa Bay Rays. I can’t promise it’s true, but I think that puts Swisher at the top of the list for involvement in position player pitcher strikeouts.
Marte’s presence was necessary because the Indians used seven other pitchers. Starter Mitch Talbot went only two innings, and the Indians got another two out of Rafael Perez. Frank Hermann took the loss for the Indians during his 1 1/3 innings. Tony Sipp pitched another 1 1/3, and Joe Smith managed to give up four earned runs in 1/3 of an inning before being removed for Jess Todd for an inning. By the top of the 9th, Marte was all the Indians had left.
Not to be outdone, Joe Girardi gave up his designated hitter by moving his DH – funnily enough, it was Swisher – into right field as part of a triple switch. Swisher moved to right field; Colin Curtis moved from right field to left field; Marcus Thames moved from left field to third base; finally, pitcher Chan Ho Park was put into the batting order in place of Alex Rodriguez, who came out of the game.
Finally, A-Rod is up to 33 plate appearances without a home run. Assuming his standard rate of .064 home runs per plate appearance, the likelihood of this happening by chance is $(1 - .064)^{33} \approx .113$. I stand by my belief that there’s something other than chance (i.e. distraction or other mental factors) causing Rodriguez’s hitting to suffer.
Paul the Octopus: Credible? July 11, 2010
Posted by tomflesher in Economics. Tags: binomial distribution, Paul the Octopus, statistics, World Cup
Paul the Octopus (hatched 2008) is an octopus who correctly predicted 12 of 14 World Cup matches, including Spain’s victory over the Dutch. Is his string of victories statistically significant?
First, I’m going to posit the null hypothesis that Paul is choosing randomly. As such, Paul’s proportion of correct choices should be .5 ($H_0: p = .5$). His observed proportion of correct choices is 12/14 or .857.
The standard error for proportions is $SE = \sqrt{p(1-p)/n}$
The t-value of an observation is $t = (\hat{p} - p)/SE$
According to Texas A&M’s t Distribution Calculator, the probability (or p-value) of this result by chance alone is less than .01.
Using the binomial distribution with $p = .5$, the probability of 12 or more successes in 14 trials is a vanishingly small .0065.
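With a fair-coin null, the binomial tail is just a counting exercise: every sequence of 14 picks is equally likely, so the probability is the number of sequences with 12, 13, or 14 correct picks divided by 2^14. A short Python check (variable names are illustrative only):

```python
from math import comb

n_matches = 14
n_correct = 12

# Number of right/wrong sequences with at least 12 correct picks.
favourable = sum(comb(n_matches, k) for k in range(n_correct, n_matches + 1))

# Under p = .5 each of the 2**14 sequences is equally likely.
p_paul = favourable / 2**n_matches

print(favourable, 2**n_matches, round(p_paul, 4))
```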
So, is Paul an oracle? Almost certainly not. However, not being a zoologist, I can’t explain what biases might be in play. I’d imagine it’s something like an attraction to contrast as well as a spurious correlation between octopus-attractive flags and success at soccer.
Manny’s First 27 Games (or, the Marginal Product of Drug Use) June 4, 2010
Posted by tomflesher in Baseball, Economics. Tags: Baseball, baseball-reference.com, Dodgers, economics, Manny Ramirez, performance-enhancing drugs, sabermetrics, sports economics, statistics, suspension
Last year, Manny Ramirez was suspended for 50 games on May 6. The suspension came after his 27th game of the season. On May 25th of this year, Manny played his 27th game of 2010. That means we can take a look at the first 27 games of each season, when he was using performance-enhancing drugs (in 2009) and when he wasn’t (presumably, this year). The differential line is behind the cut.