
Why the difference in voting? January 5, 2011

Posted by tomflesher in Baseball, Economics.

As much as I love the Angels, I can’t take Jered’s side on this one.

Today, I was browsing the voting results from the various awards being voted on. Each league’s Cy Young award voting included the requisite two closers. No surprises there. There was also a beautiful case study of the AL Cy Young winner, Felix Hernandez, versus Jered Weaver. They had identical records (13-12) in an identical number of starts (34) and similar strikeouts (233 for Weaver versus 232 for Hernandez). What explains Hernandez’ winning total of 167 points contra Weaver’s fifth-place 24?

A few things come to mind:

  • Hernandez went longer. In the same number of games, wins, and losses, King Felix pitched 249 2/3 innings, whereas Weaver pitched 224 1/3. Those extra 25 1/3 innings show not only that Hernandez was considered more reliable by his manager but that he was, in fact, more reliable (since the extra innings didn’t result in his stats taking a hit). Hernandez also pitched a formidable 6 complete games with one shutout, whereas Weaver had no pips in either category.
  • Hernandez was more effective. Felix gave up fewer runs (80 versus 83) and had a much higher proportion of unearned runs – fully 21.25% of his runs were unearned, whereas Weaver had about 9.6% of runs unearned. That means that more of Hernandez’s runs are attributable to defensive mishaps than Weaver’s. That leaves Felix with a minuscule 2.27 ERA, much lower than Weaver’s respectable 3.01, and 6 wins above replacement compared with Weaver’s 5.4.
  • Hernandez was marginally more effective. He had six Tough Losses and no Cheap Wins, while Weaver had five Tough Losses and one Cheap Win. Felix couldn’t rely on his team to supply him with significant run support, while Weaver got that support in his one cheap win.
  • However, Hernandez’s control wasn’t as good. Felix walked 70 batters for a control ratio (walks over strikeouts) of .30 and threw 14 wild pitches. Jered, on the other hand, walked only 54 batters, for a control ratio of .23, and threw only 7 wild pitches. Still, it seems reasonable to assume that control suffers exponentially as innings increase, so part of the apparent lack of control can be explained by Hernandez’s extra innings.
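The rate stats quoted in the list can be checked with a short Python sketch. The earned-run totals (63 for Hernandez, 75 for Weaver) are not stated in the post; they are back-calculated here from the run totals and unearned percentages, so treat them as assumptions. The control ratio is computed as walks divided by strikeouts, which is what reproduces the .30 and .23 figures.

```python
# Sanity check on the Hernandez/Weaver comparison.
# Earned-run totals are back-calculated assumptions:
#   80 runs with 21.25% unearned  -> 63 earned (Hernandez)
#   83 runs with about 9.6% unearned -> 75 earned (Weaver)

def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings

def control_ratio(walks, strikeouts):
    """Walks per strikeout; a lower value means better control."""
    return walks / strikeouts

pitchers = {
    "Hernandez": {"ip": 249 + 2 / 3, "er": 63, "bb": 70, "k": 232},
    "Weaver": {"ip": 224 + 1 / 3, "er": 75, "bb": 54, "k": 233},
}

for name, p in pitchers.items():
    print(name,
          "ERA", round(era(p["er"], p["ip"]), 2),
          "BB/K", round(control_ratio(p["bb"], p["k"]), 2))
```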

Overall, Felix’s marginal value over Weaver more than explains the difference in voting.

Are This Year’s Home Runs Really That Different? December 22, 2010

Posted by tomflesher in Baseball, Economics.

This year’s home runs are quite confounding. On the one hand, home runs per game in the AL have dropped precipitously (as noted and examined in the two previous posts). On the other hand, Jose Bautista had an absolutely outstanding year. How much different is this year’s distribution than those of previous years? To answer that question, I took off to Baseball Reference and found the list of all players with at least one plate appearance, sorted by home runs.

There are several parameters that are of interest when discussing the distribution of events. The first is the mean. This year’s mean was 5.43, meaning that of the players with at least one plate appearance, on average each one hit 5.43 homers. That’s down from 6.53 last year and 5.66 in 2008.

Next, consider the variance and standard deviation. (The variance is the standard deviation squared, so the numbers derive similarly.) A low variance means that the numbers are clumped tightly around the mean. This year’s variance was 68.4, down from last year’s 84.64 but up from 2008’s 66.44.

The skewness and kurtosis represent the length and thickness of the tails, respectively. Since a lot of people have very few home runs, the skewness of every year’s distribution is going to be positive. Roughly, that means that there are observations far larger than the mean, but very few that are far smaller. That makes sense, since there’s no such thing as a negative home run total. The kurtosis number represents how pointy the distribution is, or alternatively how much of the distribution is found in the tail.

For example, in 2009, Mark Teixeira and Carlos Pena jointly led the American League in home runs with 39. There was a high mean, but the tail was relatively thin with a high variance. Compared with this year, when Bautista led his nearest competitor (Paul Konerko) by 15 home runs and only 8 players were over 30 home runs, 2009 saw 15 players above 30 home runs with a pretty tight race for the lead. Kurtosis in 2010 was 7.72 compared with 2009’s 4.56 and 2008’s 5.55. (In 2008, 11 players were above the 30-mark, and Miguel Cabrera’s 37 home runs edged Carlos Quentin by just one.)

The numbers say that 2008 and 2009 were much more similar than either of them is to 2010. A quick look at the distributions bears that out – this was a weird year.
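For readers who want to compute the same four summary statistics themselves, here is a minimal Python sketch. The sample list is hypothetical, not Baseball Reference data, and the kurtosis here is the plain (non-excess) fourth standardized moment. Which convention produced the figures in this post is not stated, so that choice is an assumption.

```python
from statistics import fmean, pvariance

def shape_stats(values):
    """Return mean, population variance, skewness, and non-excess kurtosis."""
    mu = fmean(values)
    var = pvariance(values, mu)
    sd = var ** 0.5
    n = len(values)
    # Third and fourth standardized moments.
    skew = sum((x - mu) ** 3 for x in values) / (n * sd ** 3)
    kurt = sum((x - mu) ** 4 for x in values) / (n * sd ** 4)
    return mu, var, skew, kurt

# Hypothetical home run totals (NOT real data): most players hit few, one hits many.
sample = [0, 0, 0, 1, 2, 3, 5, 8, 12, 20, 31, 54]
mean, variance, skewness, kurtosis = shape_stats(sample)
print(f"mean={mean:.2f} var={variance:.2f} skew={skewness:.2f} kurt={kurtosis:.2f}")
```

A long right tail like the one in the sample shows up as positive skewness, which is the pattern described for every season's home run distribution.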

Diagnosing the AL December 22, 2010

Posted by tomflesher in Baseball, Economics.

In the previous post, I crunched some numbers on a previous forecast I’d made and figured out that it was a pretty crappy forecast. (That’s the fun of forecasting, of course – sometimes you’re right and sometimes you’re wrong.) The funny part of it, though, is that the predicted home runs per game for the American League was so far off – 3.4 standard errors below the predicted value – that it’s highly unlikely that the regression model I used controls for all relevant variables. That’s not surprising, since it was only a time trend with a dummy variable for the designated hitter.

There are a couple of things to check for immediately. The first is the most common explanation thrown around when home runs drop – steroids. It seems to me that if the drop in home runs were due to better control of performance-enhancing drugs, then it should mostly be home runs that are affected. For example, intentional walks should probably be below expectation, since intentional walks are used to protect against a home run hitter. Unintentional walks should probably be about as expected, since walks are a function of plate discipline and pitcher control, not of strength. On-base percentage should probably drop at a lower magnitude than home runs, since some hits that would have been home runs will stay in the park as singles, doubles, or triples rather than all being fly-outs. There will be a drop but it won’t be as big. Finally, slugging average should drop because a loss in power without a corresponding increase in speed will lower total bases.

I’ll analyze these with pretty new R code behind the cut.

(more…)

What Happened to Home Runs This Year? December 22, 2010

Posted by tomflesher in Baseball, Economics.

I was talking to Jim, the writer behind Apparently, I’m An Angels Fan, who’s gamely trying to learn baseball because he wants to be just like me. Jim wondered aloud how much the vaunted “Year of the Pitcher” has affected home run production. Sure enough, on checking the AL Batting Encyclopedia at Baseball-Reference.com, production dropped by about .15 home runs per game (from 1.13 to .97). Is that normal statistical variation or does it show that this year was really different?

In two previous posts, I looked at the trend of home runs per game to examine Stuff Keith Hernandez Says and then examined Japanese baseball’s data for evidence of structural break. I used the Batting Encyclopedia to run a time-series regression for a quadratic trend and added a dummy variable for the Designated Hitter. I found that the time trend and DH control account for approximately 56% of the variation in home runs per year, and that the functional form is

\hat{HR} = .957 - .0188 \times t + .0004 \times t^2 + .0911  \times DH

with t=1 in 1955, t=2 in 1956, and so on. That means t=56 in 2010. Consequently, we’d expect home run production per game in 2010 in the American League to be approximately

\hat{HR} = .957 - .0188 \times 56 + .0004 \times 3136 + .0911 \approx 1.25

That means we expected production to increase this year, and instead it dropped precipitously, for a residual of -.28. The residual standard error on the original regression was .1092 on 106 degrees of freedom, so the t-value using Texas A&M’s table is 1.984 (approximating using 100 df). That means we can be 95% confident that the actual number of home runs should fall within .1092*1.984, or about .2041, of the expected value. The lower bound would be about 1.05, meaning we’re still significantly below what we’d expect. In fact, the observed number is about 3.4 standard errors below the expected number. In other words, we’d expect that to happen by chance less than .1% (that is, less than one tenth of one percent) of the time.
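The point forecast and residual can be reproduced with a few lines of Python. This is only a sketch that plugs the quoted coefficients into the quadratic trend; the function name and the year-to-t conversion are my own shorthand, not part of the original regression code.

```python
def predicted_hr_per_game(year, dh=True):
    """Quadratic time trend with a DH dummy, using the coefficients quoted above."""
    t = year - 1954  # t = 1 in 1955, so t = 56 in 2010
    return 0.957 - 0.0188 * t + 0.0004 * t ** 2 + 0.0911 * (1 if dh else 0)

observed_2010 = 0.97  # AL home runs per game in 2010, as quoted in the post
expected_2010 = predicted_hr_per_game(2010)
residual = observed_2010 - expected_2010

print(f"expected={expected_2010:.2f} residual={residual:.2f}")
```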

Clearly, something else is in play.

Home Run Derby: Does it ruin swings? December 15, 2010

Posted by tomflesher in Baseball, Economics.

Earlier this year, there was a lot of discussion about the alleged home run derby curse. This post by Andy on Baseball-Reference.com asked if the Home Run Derby is bad for baseball, and this Hardball Times piece agrees with him that it is not. The standard explanation involves selection bias – sure, players tend to hit fewer home runs in the second half after they hit in the Derby, but that’s because the people who hit in the Derby get invited to do so because they had an abnormally high number of home runs in the first half.

Though this deserves a much more thorough macro-level treatment, let’s just take a look at the density of home runs in either half of the season for each player who participated in the Home Run Derby. Those players include David Ortiz, Hanley Ramirez, Chris Young, Nick Swisher, Corey Hart, Miguel Cabrera, Matt Holliday, and Vernon Wells.

For each player, plus Robinson Cano (who was of interest to Andy in the Baseball-Reference.com post), I took the percentage of games before the Derby and compared it with the percentage of home runs before the Derby. If the Ruined Swing theory holds, then we’d expect

g(HR) \equiv HR_{before}/HR_{Season} > g(Games) \equiv Games_{before}/162

The table below shows that in almost every case, including Cano (who did not participate), the density of home runs in the pre-Derby games was much higher than the post-Derby games.

Player    HR Before  HR Total  g(Games)  g(HR)     Diff
Ortiz     18         32        0.54321   0.5625    0.01929
Hanley    13         21        0.54321   0.619048  0.075838
Swisher   15         29        0.537037  0.517241  -0.0198
Wells     19         31        0.549383  0.612903  0.063521
Holliday  16         28        0.54321   0.571429  0.028219
Hart      21         31        0.549383  0.677419  0.128037
Cabrera   22         38        0.530864  0.578947  0.048083
Young     15         27        0.549383  0.555556  0.006173
Cano      16         29        0.537037  0.551724  0.014687
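The table's columns can be rebuilt with a short Python sketch. The team-game counts before the Derby (86 to 89) are not listed in the post; they are back-calculated from g(Games) × 162 and should be read as assumptions.

```python
# (HR before the Derby, season HR, team games before the Derby)
# Game counts are back-calculated from the g(Games) column, not taken from the post.
players = {
    "Ortiz": (18, 32, 88),
    "Hanley": (13, 21, 88),
    "Swisher": (15, 29, 87),
    "Wells": (19, 31, 89),
    "Holliday": (16, 28, 88),
    "Hart": (21, 31, 89),
    "Cabrera": (22, 38, 86),
    "Young": (15, 27, 89),
    "Cano": (16, 29, 87),
}

def derby_split(hr_before, hr_total, games_before, season_games=162):
    """Return (g(Games), g(HR), Diff) for one player."""
    g_games = games_before / season_games
    g_hr = hr_before / hr_total
    return g_games, g_hr, g_hr - g_games

rows = {name: derby_split(*stats) for name, stats in players.items()}
for name, (g_games, g_hr, diff) in rows.items():
    print(f"{name:<9}{g_games:.6f}  {g_hr:.6f}  {diff:+.6f}")
```

A positive Diff means the player hit a larger share of his home runs before the Derby than the share of the schedule that had been played.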

Is this evidence that the Derby causes home run percentages to drop off? Certainly not. There are some caveats:

  • This should be normalized based on games the player played, instead of team games.
  • It would probably even be better to look at a home run per plate appearance rate instead.
  • It could stand to be corrected for deviation from the mean to explain selection bias.
  • Cano’s numbers are almost identical to Swisher’s. They play for the same team. If there were an effect to be seen, it would probably show up here, and it doesn’t.

Once finals are up, I’ll dig into this a little more deeply.

Jim Thome’s Marginal Value October 5, 2010

Posted by tomflesher in Baseball, Economics.

I’ve alluded to the similarity between Manny Ramirez and Jim Thome quite a bit. They both played in Cleveland for a few years before moving on to other teams. They’re each in the DH phase of their careers. Thome is about two years older than Ramirez, but otherwise they’ve had relatively similar production. That’s why it was so odd for the White Sox to let Thome go a few years back, only to pick up an injured, probably going-downhill Manny for about a quarter of the season, when Ramirez is making about $18 million and Thome’s maximum salary was about $15.7 million. There’s an argument that Manny still has more productive years left than Thome, of course. (I happen to think that argument is wrong, but that’s just me.)

Just for fun, let’s take a look at their production since Manny’s trade.

In the last 24 games he played, Ramirez had 88 plate appearances, a respectable .420 OBP, and a Jeteresque .261 batting average. His win probability added was -.273, for those of you who are into that sort of thing. Meanwhile, over the same number of games, the flagging, decrepit Thome had only 79 plate appearances, with a paltry .333 batting average, and only a .494 OBP.

Thome’s salary this year for the Twins was $1.5 million.

I think the winner here is clear.

Teixeira and Cano: Picking up slack? August 5, 2010

Posted by tomflesher in Baseball, Economics.

Michael Kay, the YES broadcaster for the Yankees, often pointed out between July 22 and August 4 that the Yankees were turning up their offense to make up for Alex Rodriguez’s lack of home run production. That seems like it might be subject to significant confirmation bias – seeing a few guys hit home runs when you wouldn’t expect them to might lead you to believe that the team in general has increased its production. So, did the Yankees produce more home runs during A-Rod’s drought?

During the first 93 games of the season, the Yankees hit 109 home runs in 3660 plate appearances for rates of 1.17 home runs per game and .0298 home runs per plate appearance. From July 23 to August 3, they hit 17 home runs in 451 plate appearances over 12 games for rates of 1.42 home runs per game and .0377 home runs per plate appearance. Obviously those numbers are quite a bit higher than expected, but can the difference be due simply to chance?

Assume for the moment that the first 93 games represent the team’s true production capabilities. Then, using the binomial distribution, the likelihood of hitting exactly 17 home runs in 451 plate appearances is

p(K = k) = {n\choose k}p^k(1-p)^{n-k} = {451\choose 17}.0298^{17}(.9702)^{434} \approx .0626

The cumulative probability is about .868, meaning the probability of hitting 17 or fewer home runs is .868 and the probability of hitting more than that is about .132. The probability of hitting 16 or fewer is .805, which means out of 100 strings of 451 plate appearances about 81 of them should end with 16 or fewer home runs. This is a perfectly reasonable number and not inherently indicative of a special performance by A-Rod’s teammates.
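These binomial figures can be checked with the Python standard library alone. The sketch below assumes, as the post does, that every plate appearance is an independent trial with a fixed home run probability of .0298; the function names are mine.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 451, 0.0298  # plate appearances during the drought, baseline HR rate
exactly_17 = binom_pmf(17, n, p)
at_most_17 = binom_cdf(17, n, p)
at_most_16 = binom_cdf(16, n, p)

print(f"P(X = 17)  = {exactly_17:.4f}")
print(f"P(X <= 17) = {at_most_17:.3f}")
print(f"P(X <= 16) = {at_most_16:.3f}")
```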

Kay frequently cited Mark Teixeira and Robinson Cano as upping their games. Teixeira hit 18 home runs over the first 93 games and made 423 plate appearances for rates of .194 home runs per game and .0426 home runs per plate appearance. From July 23 to August 3, he had 5 home runs in 12 games and 54 plate appearances for rates of .417 per game and .0926. That rate of home runs per plate appearance is about 8% likely, meaning that either Teixeira did up his game considerably or he was exceptionally lucky.

Cano played 92 games up to July 21, hitting 18 home runs in 400 plate appearances for rates of .196 home runs per game and .045 per plate appearance. During A-Rod’s drought, he hit 3 home runs in 50 plate appearances over 12 games for rates of .25 and .06. That per-plate-appearance rate is about 39% likely, which means we don’t have enough evidence to reject the idea that Cano’s performance (though better than usual) is just a random fluctuation.
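The 8% and 39% figures are consistent with an upper-tail binomial probability, that is, the chance of hitting at least the observed number of home runs at the player's earlier rate. That reading is my inference from the numbers rather than something the post states, so take this Python sketch as one plausible reconstruction.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below

# (home runs during the drought, plate appearances, baseline HR per PA)
teixeira = binom_upper_tail(5, 54, 0.0426)
cano = binom_upper_tail(3, 50, 0.045)

print(f"Teixeira: P(at least 5 HR in 54 PA) = {teixeira:.3f}")
print(f"Cano:     P(at least 3 HR in 50 PA) = {cano:.3f}")
```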

It will be interesting to see if Teixeira slows down as a home-run hitter now that Rodriguez’s drought is over.

Is A-Rod’s Performance Different? August 3, 2010

Posted by tomflesher in Baseball, Economics.

In games between milestone home runs, is Alex Rodriguez’ hitting similar to other times? (This is all a very polite way of asking, “Does A-Rod choke?”) It’s difficult to answer, because there’s so little data about those milestone home runs. A-Rod, though, has some statistically improbable results and it would be interesting to look at it a bit more closely.

Over 2008-2009, Alex played in 262 games and had 1129 plate appearances with 281 hits, 65 home runs, a triple:double ratio of 1:50, an OBP of .397, and a SLG of .553. His OBP has a standard error of .0146, so we can be 95% confident that over those years his baseline production would be somewhere between .368 and .426. Absent any time or age effect, that is the range in which A-Rod should produce for any given period.

Two recent milestone home runs come to mind as examples of Rodriguez’s reputed choking. First, the stretch between home run #499 and #500 was 8 games and 36 plate appearances. (I’m intentionally ignoring extra plate appearances on the days he hit #499 and #500.) During that time, Alex had an OBP of only .306. That’s a difference of .091 over 36 plate appearances and that performance has a standard error of about .078 when compared with his regular performance, implying a t-value of about 1.16. With 35 degrees of freedom, Texas A&M’s t Calculator gives a p-value of about .127, so this difference is marginally within the realm of chance. (The usual cutoff for significance would be .05.)

A-Rod hit his last home run on July 22. Discounting the plate appearances after his last home run, he’s played in 11 games with a paltry .255 OBP and .238 SLG over 47 plate appearances. His .255 OBP has a difference of about .142 and a standard error of about .064. That implies a t-value of about 2.21, with a p-value of about .016. That is, the probability of this difference occurring by chance is less than 2%. That gives us one result as close to significant and one as probably significant.
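The two t-values can be reproduced with a short Python sketch. The standard error formula used here, sqrt(p(1-p)/(n-1)) evaluated at the sample OBP, is inferred from the quoted figures (.078 and .064) and is an assumption about how they were computed. The p-values need a t-distribution table or calculator and are not recomputed.

```python
from math import sqrt

BASELINE_OBP = 0.397  # 2008-2009 OBP quoted above

def obp_t_value(sample_obp, plate_appearances, baseline=BASELINE_OBP):
    """Return (standard error, t-value) for a stretch of plate appearances."""
    se = sqrt(sample_obp * (1 - sample_obp) / (plate_appearances - 1))
    return se, (baseline - sample_obp) / se

se_500, t_500 = obp_t_value(0.306, 36)  # between home runs #499 and #500
se_600, t_600 = obp_t_value(0.255, 47)  # current stretch since #599

print(f"#500 stretch: se={se_500:.3f} t={t_500:.2f}")
print(f"#600 stretch: se={se_600:.3f} t={t_600:.2f}")
```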

As a side note, A-Rod’s Choke Index continues to rise. He’s gone 48 plate appearances without a home run, and at a rate of .055 home runs per plate appearance the probability of that occurring by chance is about .066. That leaves his Choke Index at .934.

The 600 Home Run Almanac July 28, 2010

Posted by tomflesher in Baseball, Economics.

People are interested in players who hit 600 home runs, at least judging by the Google searches that point people here. With that in mind, let’s take a look at some quick facts about the 600th home run and the people who have hit it.

Age: There are six players to have hit #600. Sammy Sosa was the oldest at 39 years old in 2007. Ken Griffey, Jr. was 38 in 2008, as were Willie Mays in 1969 and Barry Bonds in 2002. Hank Aaron was 37. Babe Ruth was the youngest at 36 in 1931. Alex Rodriguez, who is 35 as of July 27, will almost certainly be the youngest player to reach 600 home runs. If both Manny Ramirez and Jim Thome hang on to hit #600 over the next two to three seasons, Thome (who was born in August of 1970) will probably be 42 in 2012; Ramirez (born in May of 1972) will be 41 in 2013. (In an earlier post that’s when I estimated each player would hit #600.) If Thome holds on, then, he’ll be the oldest player to hit his 600th home run.

Productivity: Since 2000 (which encompasses Rodriguez, Ramirez, and Thome in their primes), the average league rate of home runs per plate appearances has been about .028. That is, a home run was hit in about 2.8% of plate appearances. Over the same time period, Rodriguez’ rate was .064 – more than double the league average. Ramirez hit .059 – again, over double the league rate. Thome, for his part, hit at a rate of .065 home runs per plate appearance. From 2000 to 2009, Thome was more productive than Rodriguez.

Standing Out: Obviously it’s unusual for them to be that far above the curve. There were 1,877,363 plate appearances (trials) from 2000 to 2009. The margin of error for a proportion like the rate of home runs per plate appearance is

\sqrt{\frac{p(1-p)}{n-1}} = \sqrt{\frac{.028(.972)}{1,877,362}} = \sqrt{\frac{.027}{1,877,362}} \approx \sqrt{\frac{14}{1,000,000,000}} = .00012

Ordinarily, we expect a random individual chosen from the population to land within the space of p \pm 1.96 \times MoE 95% of the time. That means our interval is

.028 \pm .00024

That means that all three of the players are well outside that confidence interval. (However, it’s likely that home run hitting is highly correlated with other factors that make this test less useful than it is in other situations.)
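The arithmetic behind that interval is easy to check in Python. This is a sketch using the league rate of .028 and the plate appearance count quoted above, with the interval centered on the league rate; the variable names are mine.

```python
from math import sqrt

p = 0.028       # league home runs per plate appearance, 2000-2009
n = 1_877_363   # league plate appearances, 2000-2009

se = sqrt(p * (1 - p) / (n - 1))
half_width = 1.96 * se
low, high = p - half_width, p + half_width

rates = {"Rodriguez": 0.064, "Ramirez": 0.059, "Thome": 0.065}
outside = {name: not (low <= rate <= high) for name, rate in rates.items()}

print(f"se={se:.5f} interval=({low:.5f}, {high:.5f})")
print(outside)
```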

Alex’s Drought: Finally, just how likely is it that Alex Rodriguez will go this long without a home run? He hit his last home run in his fourth plate appearance on July 22. He had a fifth plate appearance in which he doubled. Since then, he’s played in five games totalling 22 plate appearances, so he’s gone 23 plate appearances without a home run. Assuming his rate of .064 home runs per plate appearance, how likely is that? We’d expect (.064*23) = about 1.5 home runs in that time, but how unlikely is this drought?

The binomial distribution is used to model strings of successes and failures in tests where we can say clearly whether each trial ended in a “yes” or “no.” We don’t need to break out that tool here, though – if the probability of a home run is .064, the probability of anything else is .936. The likelihood of a string of 23 non-home runs is

.936^{23} = .218

It’s only about 22% likely that this drought happened only by chance. The better guess is that, as Rodriguez has said, he’s distracted by the switch to marked baseballs and media pressure to finally hit #600.
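The drought calculation fits in a few lines of Python. As in the text, this sketch assumes each plate appearance is independent with a constant home run probability of .064, which is a simplification.

```python
def drought_probability(hr_rate, plate_appearances):
    """Probability of zero home runs in a run of independent plate appearances."""
    return (1 - hr_rate) ** plate_appearances

hr_rate = 0.064
drought_length = 23

expected_hr = hr_rate * drought_length
p_drought = drought_probability(hr_rate, drought_length)

print(f"expected HR = {expected_hr:.2f}")
print(f"P(no HR in {drought_length} PA) = {p_drought:.3f}")
```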

600 Home Runs: Who’s Second? July 25, 2010

Posted by tomflesher in Baseball, Economics.

Alex Rodriguez is, as I’m writing this, sitting at 599 home runs. Almost certainly, he’ll be the next player to hit the 600 home-run milestone, since the next two active players are Jim Thome at 575 and Manny Ramirez at 554. Today’s Toyota Text Poll (which runs during Yankee games on YES) asked which of those two players would reach #600 sooner.

There are a few levels of abstraction to answering this question. First of all, without looking at the players’ stats, Thome gets the nod at the first order because he’s significantly closer than Ramirez. Hitting 25 home runs is easier than hitting 46, so Thome will probably get there first.

At the second order, we should take a look at the players’ respective rates. Over the past two seasons, Thome has averaged a rate of .053 home runs per plate appearance, while Ramirez has averaged .041 home runs per plate appearance. With fewer home runs to hit and a higher likelihood of hitting one each time he makes it to the plate, Thome stays more likely to hit #600 before Ramirez does… but how much more likely?

Using the binomial distribution, I tested the likelihood that each player would hit his required number of home runs in different numbers of plate appearances to see where that likelihood reached a maximum. For Thome, the probability increases until 471 plate appearances, then starts decreasing, so roughly, I expect Thome to hit his 25th home run within 471 plate appearances. For Manny, that maximum doesn’t occur until 1121 plate appearances. Again, the nod has to go to Thome. He’ll probably reach the milestone in less than half as many plate appearances.
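One way to reproduce that search is to scan over the number of plate appearances n and keep the n that maximizes the binomial probability of exactly k home runs. The Python sketch below works in log space with `math.lgamma` to avoid very large intermediate numbers. The function names and the 5000-PA search cap are arbitrary choices of mine, and this may not be the exact procedure used for the post.

```python
from math import lgamma, log

def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def most_likely_pa(k, p, max_n=5000):
    """Plate appearance count n that maximizes P(exactly k home runs in n PA)."""
    return max(range(k, max_n + 1), key=lambda n: log_binom_pmf(k, n, p))

thome_pa = most_likely_pa(25, 0.053)    # 25 HR needed at .053 HR per PA
ramirez_pa = most_likely_pa(46, 0.041)  # 46 HR needed at .041 HR per PA

print(thome_pa, ramirez_pa)
```

The maximum lands just below k/p (25/.053 ≈ 471.7 and 46/.041 ≈ 1122.0), so dividing home runs needed by the per-PA rate gives nearly the same answer without any search.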

But wait. How many plate appearances is that, anyway? Until recently, Manny played 80-90% of the games in a season. Last year, he played 64%. So far the Dodgers have played 99 games and Manny appeared in 61 of them, but of course he’s been on the disabled list this year. Let’s make the generous assumption that Manny will play in 75% of the games in each season starting with this one. Then, let’s look at his average plate appearances per game. For most of his career, he averaged between 4.1 and 4.3 plate appearances per game, but this year he’s down to 3.6. Let’s make the (again, generous) assumption that he’ll get 4 plate appearances in each game from now on. At that rate, to get 1121 plate appearances, he needs to play in 280.25 games, which averages to about 1.73 seasons of 162 games or about 2.31 seasons of 75% playing time.

Thome, on the other hand, has consistently played in 80% or more of his team’s games but suffered last year and this year because he hasn’t been serving as an everyday player. He pinch-hit in the National League last year and has, in Minnesota, played in about 69% of the games averaging only 3 plate appearances in each. Let’s give Jim the benefit of the doubt and assume that from here on out he’ll hit in 70% of the games and get 3.5 appearances (fewer games and fewer appearances than Ramirez). To reach 471 plate appearances, he’d need about 134.6 games, which equates to about five-sixths of a 162-game season or about 1.19 seasons with 70% playing time. Even if we downgrade Thome to 2.5 PA per game and 66% playing time, that still gives us an expectation that he’ll hit #600 within the next 1.8 real-time seasons.
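The conversion from plate appearances to seasons is simple division, sketched here in Python with the playing-time assumptions stated above (4 PA per game and 75% of games for Ramirez; 3.5 PA and 70% for Thome, with 2.5 PA and 66% as the pessimistic case). The helper name is mine.

```python
def seasons_needed(plate_appearances, pa_per_game, share_of_games, season_games=162):
    """Return (games needed, seasons needed at the given share of team games)."""
    games = plate_appearances / pa_per_game
    return games, games / (season_games * share_of_games)

ramirez_games, ramirez_seasons = seasons_needed(1121, 4.0, 0.75)
thome_games, thome_seasons = seasons_needed(471, 3.5, 0.70)
_, thome_pessimistic = seasons_needed(471, 2.5, 0.66)

print(f"Ramirez: {ramirez_games:.2f} games, {ramirez_seasons:.2f} seasons")
print(f"Thome:   {thome_games:.2f} games, {thome_seasons:.2f} seasons")
print(f"Thome (pessimistic): {thome_pessimistic:.2f} seasons")
```

Under every one of these assumptions Thome's figure stays below Ramirez's, so the ordering does not depend on the exact playing-time guess.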

Since Thome and Ramirez are within about two years of each other in age, there’s probably no good reason to expect one to retire before the other, and they’ll probably both be hitting as designated hitters in the AL next year. As a result, it’s very fair to expect Thome to A) reach 600 home runs and B) do it before Manny Ramirez.