## The 600 Home Run Almanac

July 28, 2010

Posted by tomflesher in Baseball, Economics.

People are interested in players who hit 600 home runs, at least judging by the Google searches that point people here. With that in mind, let’s take a look at some quick facts about the 600th home run and the people who have hit it.

Age: Six players have hit #600. Sammy Sosa was the oldest at 39 years old in 2007. Ken Griffey, Jr. was 38 in 2008, as were Willie Mays in 1969 and Barry Bonds in 2002. Hank Aaron was 37. Babe Ruth was the youngest at 36 in 1931. Alex Rodriguez, who is 35 as of July 27, will almost certainly be the youngest player to reach 600 home runs. If both Manny Ramirez and Jim Thome hang on to hit #600 over the next two to three seasons, Thome (who was born in August of 1970) will probably be 42 in 2012; Ramirez (born in May of 1972) will be 41 in 2013. (In an earlier post that’s when I estimated each player would hit #600.) If Thome holds on, then, he’ll be the oldest player to hit his 600th home run.

Productivity: Since 2000 (which encompasses Rodriguez, Ramirez, and Thome in their primes), the average league rate of home runs per plate appearances has been about .028. That is, a home run was hit in about 2.8% of plate appearances. Over the same time period, Rodriguez’ rate was .064 – more than double the league average. Ramirez hit .059 – again, over double the league rate. Thome, for his part, hit at a rate of .065 home runs per plate appearance. From 2000 to 2009, Thome was more productive than Rodriguez.

Standing Out: Obviously it’s unusual for them to be that far above the curve. There were 1,877,363 plate appearances (trials) from 2000 to 2009. The margin of error for a proportion like the rate of home runs per plate appearance is

$\sqrt{\frac{p(1-p)}{n-1}} = \sqrt{\frac{.028(.972)}{1,877,362}} = \sqrt{\frac{.027}{1,877,362}} \approx \sqrt{\frac{14}{1,000,000,000}} = .00012$

Ordinarily, we expect a random individual chosen from the population to land within the space of $p \pm 1.96 \times MoE$ 95% of the time. That means our interval is

$.028 \pm .00024$

That means that all three of the players are well outside that confidence interval. (However, it’s likely that home run hitting is highly correlated with other factors that make this test less useful than it is in other situations.)
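The arithmetic is quick to check. Here is a minimal Python sketch that uses only the figures quoted in this post; the function and variable names are my own, and it assumes the same simple formula for the margin of error of a proportion.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Return (margin of error, lower bound, upper bound) for a proportion."""
    moe = math.sqrt(p * (1 - p) / (n - 1))
    return moe, p - z * moe, p + z * moe

LEAGUE_RATE = 0.028            # league home runs per plate appearance, 2000-2009
PLATE_APPEARANCES = 1_877_363  # league plate appearances, 2000-2009

moe, low, high = proportion_interval(LEAGUE_RATE, PLATE_APPEARANCES)

# Individual rates quoted above for the same period
player_rates = {"Rodriguez": 0.064, "Ramirez": 0.059, "Thome": 0.065}
outside = {name: not (low <= rate <= high) for name, rate in player_rates.items()}

print(f"margin of error: {moe:.5f}")
print(f"interval: {low:.5f} to {high:.5f}")
print(outside)
```

Each player's rate can be compared against the interval directly, which is all the `outside` dictionary does.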

Alex’s Drought: Finally, just how likely is it that Alex Rodriguez will go this long without a home run? He hit his last home run in his fourth plate appearance on July 22. He had a fifth plate appearance in which he doubled. Since then, he’s played in five games totalling 22 plate appearances, so he’s gone 23 plate appearances without a home run. Assuming his rate of .064 home runs per plate appearance, how likely is that? We’d expect (.064*23) = about 1.5 home runs in that time, but how unlikely is this drought?

The binomial distribution is used to model strings of successes and failures in tests where we can say clearly whether each trial ended in a “yes” or “no.” We don’t need to break out that tool here, though – if the probability of a home run is .064, the probability of anything else is .936. The likelihood of a string of 23 non-home runs is

$.936^{23} = .218$

In other words, if nothing about his hitting had changed, a drought this long would show up only about 22% of the time by chance alone. The better guess is that, as Rodriguez has said, he’s distracted by the switch to marked baseballs and the media pressure to finally hit #600.
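The same calculation as a short Python sketch, assuming (as above) that every plate appearance is an independent trial with a fixed .064 chance of a home run. The names are mine.

```python
def drought_probability(hr_rate, plate_appearances):
    """Probability of zero home runs in a run of independent plate appearances."""
    return (1 - hr_rate) ** plate_appearances

RATE = 0.064   # Rodriguez's home runs per plate appearance, from above
DROUGHT = 23   # plate appearances since his last home run

expected_hr = RATE * DROUGHT
p_drought = drought_probability(RATE, DROUGHT)

print(f"expected home runs: {expected_hr:.2f}")
print(f"chance of no home runs: {p_drought:.3f}")
```

Changing `DROUGHT` shows how quickly the probability falls as the streak gets longer.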

## Cheap Wins

July 16, 2010

Posted by tomflesher in Baseball.

The opposite of the Tough Loss discussed below (which R.A. Dickey unfortunately experienced tonight in a duel with Tim Lincecum) is a Cheap Win. Logically, since a Tough Loss is a loss in a quality start, a Cheap Win (invented by Bill James) is a win in a non-quality start – that is, a start with a game score of below 50 (or, officially, a start with fewer than 6.0 innings pitched or more than 3 runs allowed).

The Chicago White Sox’ starter, John Danks, picked up a Cheap Win in Thursday’s game against the Twins. Although he pitched six innings, he gave up six runs (all earned) in the second inning, leading to an abysmal game score of 33. Danks had two of last year’s 304 Cheap Wins. Ricky Romero led the pack with six, and Joe Saunders and Tim Wakefield were both among the six pitchers with five Cheap Wins. Even Roy Halladay had two.

Through the beginning of the All-Star Break, there have been 136 Cheap Wins in 2010. That includes one by my current favorite player, Yovani Gallardo. John Lackey is already up to 5, and Brian Bannister is knocking on the door with 4.

It’s hard to read too much into the tea leaves of Cheap Wins, since they’re not all created equal. In general, they represent a pitcher sliding a little bit off his game, but his team upping their run production to rescue him. To that end, Cheap Wins might be a better measure of a team’s ability than Tough Losses, since, while Tough Losses show a pitcher maintaining himself under fire, Cheap Wins represent an ability to hit in the clutch (assuming that run production in Cheap Wins is significantly different from run production in other games). That’s hard to validate without doing a bit more work, but it’s a project to consider.

## More on Home Runs Per Game

July 9, 2010

Posted by tomflesher in Baseball, Economics.

In the previous post, I looked at the trend in home runs per game in the Major Leagues and suggested that the recent deviation from the increasing trend might have been due to the development of strong farm systems like the Tampa Bay Rays’. That means that if the same data analysis process is used on data in an otherwise identical league, we should see similar trends but no dropoff around 1995. As usual, for replication purposes I’m going to use Japan’s Pro Baseball leagues, the Pacific and Central Leagues. They’re ideal because, just like the American Major Leagues, one league uses the designated hitter and one does not. There are some differences – the talent pool is a bit smaller because of the lower population base that the leagues draw from, and there are only 6 teams in each league as opposed to MLB’s 14 and 16.

As a reminder, the MLB regression gave us a regression equation of

$\hat{HR} = .957 - .0188 \times t + .0004 \times t^2 + .0911 \times DH$

where $\hat{HR}$ is the predicted number of home runs per game, t is a time variable starting at t=1 in 1955, and DH is a binary variable that takes value 1 if the league uses the designated hitter in the season in question.
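To make that equation concrete, here is a small Python sketch that evaluates it for a given season and finds the turning point of the fitted parabola (where the derivative with respect to t is zero). The coefficients come from the equation above; the function name and the example years are my own choices.

```python
def predicted_hr_per_game(t, dh):
    """Fitted MLB home runs per game; t=1 is 1955, dh is 1 for a DH league."""
    return 0.957 - 0.0188 * t + 0.0004 * t ** 2 + 0.0911 * dh

# The quadratic bottoms out where -0.0188 + 2 * 0.0004 * t = 0
turning_point = 0.0188 / (2 * 0.0004)

for year in (1955, 1980, 2009):
    t = year - 1954
    print(year, round(predicted_hr_per_game(t, dh=0), 3))

# The DH term simply shifts the whole curve up by a constant
dh_bump = predicted_hr_per_game(30, dh=1) - predicted_hr_per_game(30, dh=0)
print("turning point t =", round(turning_point, 1))
print("DH bump =", round(dh_bump, 4))
```

The turning point lands in the late 1970s, which fits the rough U shape in the MLB data.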

Just examining the data on home runs per game from the Japanese leagues, the trend looks significantly different.  Instead of the rough U-shape that the MLB data showed, the Japanese data looks almost M-shaped with a maximum around 1984. (Why, I’m not sure – I’m not knowledgeable enough about Japanese baseball to know what might have caused that spike.) It reaches a minimum again and then keeps rising.

After running the same regression with t=1 in 1950, I got these results:

| | Estimate | Std. Error | t-value | p-value | Signif |
|---|---|---|---|---|---|
| B0 | 0.2462 | 0.0992 | 2.481 | 0.0148 | 0.9852 |
| t | 0.0478 | 0.0062 | 7.64 | 1.63E-11 | 1 |
| tsq | -0.0006 | 0.00009 | -7.463 | 3.82E-11 | 1 |
| DH | 0.0052 | 0.0359 | 0.144 | 0.8855 | 0.1145 |

This equation shows two things, one that surprises me and one that doesn’t. The unsurprising factor is the switching of signs for the t variables – we expected that based on the shape of the data. The surprising factor is that the designated hitter rule is insignificant. We can only be about 11% sure it’s significant. In addition, this model explains less of the variation than the MLB version – while that explained about 56% of the variation, the Japanese model has an $R^2$ value of .4045, meaning it explains about 40% of the variation in home runs per game.

There’s a slightly interesting pattern to the residual home runs per game ($Residual = HR - \hat{HR}$). Although it isn’t as pronounced, this data also shows a spike – but the spike is at t=55, so instead of showing up in 1995, the Japan leagues spiked around the early 2000s. Clearly the same effect is not in play, but why might the Japanese leagues see the same effect later than the MLB teams? It can’t be an expansion effect, since the Japanese leagues have stayed constant at 6 teams since their inception.

Incidentally, the Japanese league data is heteroskedastic (Breusch-Pagan test p-value .0796), so it might be better modeled using a generalized least squares formula, but doing so would have skewed the results of the replication.

In order to show that the parameters really are different, the appropriate test is Chow’s test for structural change. To clean it up, I’m using only the data from 1960 on. (It’s quick and dirty, but it’ll do the job.) Chow’s test takes

$\frac{(S_C -(S_1+S_2))/(k)}{(S_1+S_2)/(N_1+N_2-2k)} \sim\ F_{k,N_1+N_2-2k}$

where $S_C = 6.3666$ is the combined sum of squared residuals, $S_1 = 1.2074$ and $S_2 = 2.2983$ are the individual (i.e. MLB and Japan) sum of squared residuals, $k=4$ is the number of parameters, and $N_1 = 100$ and $N_2 = 100$ are the number of observations in each group.

$\frac{(6.3666 -(1.2074 + 2.2983))/4}{(1.2074 + 2.2983)/(100+100-2\times 4)} \sim\ F_{4,100+100-2 \times 4}$

$\frac{(6.3666 - 3.5057)/4}{3.5057/192} \sim\ F_{4,192}$

$\frac{2.8609/4}{.01826} \sim\ F_{4,192}$

$\frac{.7152}{.01826} \sim\ F_{4,192}$

$39.17 \sim\ F_{4,192}$

The critical value at the 10% significance level with 4 and 192 degrees of freedom is 1.974 according to Texas A&M’s F calculator. The test statistic is far above that, so we can reject the hypothesis that the MLB and Japanese parameters are the same. The two sets of leagues really do seem to follow different home run trends.
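For anyone who wants to rerun the test, the statistic is simple to compute from the sums of squared residuals. A minimal Python sketch follows, with function and argument names of my own choosing; it only evaluates the formula given above and leaves the critical value to a table or calculator.

```python
def chow_statistic(s_combined, s_1, s_2, k, n_1, n_2):
    """Chow test statistic, distributed F(k, n_1 + n_2 - 2k) under the null."""
    s_separate = s_1 + s_2
    numerator = (s_combined - s_separate) / k
    denominator = s_separate / (n_1 + n_2 - 2 * k)
    return numerator / denominator

# Sums of squared residuals and sample sizes as stated above
f_stat = chow_statistic(6.3666, 1.2074, 2.2983, k=4, n_1=100, n_2=100)
dof = (4, 100 + 100 - 2 * 4)

print(f"F = {f_stat:.2f} on {dof} degrees of freedom")
```

Note that the denominator uses the separate-regression sum of squares, $S_1 + S_2$, not the number of observations.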

## Back when it was hard to hit 55…

July 8, 2010

Posted by tomflesher in Baseball, Economics.

Last night was one of those classic Keith Hernandez moments where he started talking and then stopped abruptly, which I always like to assume is because the guys in the truck are telling him to shut the hell up. He was talking about Willie Mays for some reason, and said that Mays hit 55 home runs “back when it was hard to hit 55.” Keith coyly said that, while it was easy for a while, it was “getting hard again,” at which point he abruptly stopped talking.

Keith’s unusual candor about drug use and Mays’ career best of 52 home runs aside, this pinged my “Stuff Keith Hernandez Says” meter. After accounting for any time trend and other factors that might explain home run hitting, is there an upward trend? If so, is there a pattern to the remaining home runs?

The first step is to examine the data to see if there appears to be any trend. Just looking at it, there appears to be a messy U shape with a minimum around t=20, which indicates a quadratic trend. That means I want to include a term for time and a term for time squared.

Using the per-game averages for home runs from 1955 to 2009, I detrended the data using t=1 in 1955. I also had to correct for the effect of the designated hitter. That gives us an equation of the form

$\hat{HR} = \hat{\beta_{0}} + \hat{\beta_{1}}t + \hat{\beta_{2}} t^{2} + \hat{\beta_{3}} DH$

The results:

| | Estimate | Std. Error | t-value | p-value | Signif |
|---|---|---|---|---|---|
| B0 | 0.957 | 0.0328 | 29.189 | 0.0001 | 0.9999 |
| t | -0.0188 | 0.0028 | -6.738 | 0.0001 | 0.9999 |
| tsq | 0.0004 | 0.00005 | 8.599 | 0.0001 | 0.9999 |
| DH | 0.0911 | 0.0246 | 3.706 | 0.0003 | 0.9997 |

We can see that there’s an upward quadratic trend in predicted home runs that, together with the DH rule, accounts for about 56% of the variation in the number of home runs per game in a season ($R^2 = .5618$). The Breusch-Pagan test has a p-value of .1610, indicating a possibility of mild heteroskedasticity but nothing we should get concerned about.

Then, I needed to look at the difference between the predicted number of home runs per game and the actual number of home runs per game, which we get by subtracting the prediction from the actual value:

$Residual = HR - \hat{HR}$

This represents the “abnormal” number of home runs per year. The question then becomes, “Is there a pattern to the number of abnormal home runs?”  There are two ways to answer this. The first way is to look at the abnormal home runs. Up until about t=40 (the mid-1990s), the abnormal home runs are pretty much scattershot above and below 0. However, at t=40, the residual jumps up for both leagues and then begins a downward trend. It’s not clear what the cause of this is, but the knee-jerk reaction is that there might be a drug use effect. On the other hand, there are a couple of other explanations.

The most obvious is a boring old expansion effect. In 1993, the National League added two teams (the Marlins and the Rockies), and in 1998 each league added a team (the AL’s Rays and the NL’s Diamondbacks). Talent pool dilution has shown up in our discussion of hit batsmen, and I believe that it can be a real effect. It would be mitigated over time, however, by the establishment and development of farm systems, in particular strong systems like the one that’s producing good, cheap talent for the Rays.

## Tough Losses

July 8, 2010

Posted by tomflesher in Baseball.

Last night, Jonathon Niese pitched 7.2 innings of respectable work (6 hits, 3 runs, all earned, 1 walk, 8 strikeouts, 2 home runs, for a game score of 62) but still took the loss due to his unfortunate lack of run support – the Mets’ only run came in from an Angel Pagan solo homer. This is a prime example of what Bill James called a “Tough Loss”: a game in which the starting pitcher made a quality start but took a loss anyway.

There are two accepted measures of what a quality start is. Officially, a quality start is one with 6 or more innings pitched and 3 or fewer runs. Bill James’ definition used his game score statistic and used 50 as the cutoff point for a quality start. Since a pitcher gets 50 points for walking out on the mound and then adds to or subtracts from that value based on his performance, game score has the nice property of showing whether a pitcher added value to the team or not.
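For reference, here is a sketch of the game score calculation as I understand Bill James’ original formula: start at 50, add a point per out recorded, two points per inning completed after the fourth and a point per strikeout, then subtract two per hit, four per earned run, two per unearned run and one per walk. The function name and arguments are mine, and treating a score of exactly 50 as a quality start is my assumption. The example line is Niese’s start described above.

```python
def game_score(outs, hits, earned_runs, unearned_runs, walks, strikeouts):
    """Bill James game score for a single start."""
    innings_completed = outs // 3
    score = 50                                   # credit for taking the mound
    score += outs                                # +1 per out recorded
    score += 2 * max(0, innings_completed - 4)   # +2 per inning completed after the 4th
    score += strikeouts                          # +1 per strikeout
    score -= 2 * hits                            # -2 per hit
    score -= 4 * earned_runs                     # -4 per earned run
    score -= 2 * unearned_runs                   # -2 per unearned run
    score -= walks                               # -1 per walk
    return score

# Niese: 7.2 innings (23 outs), 6 hits, 3 earned runs, 1 walk, 8 strikeouts
niese = game_score(outs=23, hits=6, earned_runs=3, unearned_runs=0,
                   walks=1, strikeouts=8)
# Assumption: a score of 50 or better counts as a quality start
label = "quality start" if niese >= 50 else "non-quality start"
print(niese, label)
```

Home runs allowed don’t appear separately in the formula; they only count through the hits and runs they produce.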

Using the game score definition, there were 393 losses in quality starts last year, including 109 by July 7th. Ubaldo Jimenez and Dan Haren led the league with 7, Roy Halladay had 6, and Yovani Gallardo (who’s quickly becoming my favorite player because he seems to show up in every category) was also up there with 6.

So far this year, though, it seems to be the Year of the Tough Loss. There have already been 230, and Roy Oswalt is already at the 6-tough-loss mark. Halladay is already up at 4. This is consistent with the talk of the Year of the Pitcher, with better pitching (and potentially less use of performance-enhancing drugs) leading to lower run support. That will require a bit more work to confirm, though.

## How often should Youk take his base?

June 30, 2010

Posted by tomflesher in Baseball, Economics.

Kevin Youkilis is sometimes called “The Greek God of Walks.” I prefer to think of him as “The Greek God of Take Your Base,” since he seems to get hit by pitches at an alarming rate. In fact, this year, he’s been hit 7 times in 313 plate appearances. (Rickie Weeks, however, is leading the pack with 13 in 362 plate appearances. We’ll look at him, too.) There are three explanations for this:

1. There’s something about Youk’s batting or his hitting stance that causes him to be hit. This is my preferred explanation. Youkilis has an unusual batting grip that thrusts his lead elbow over the plate, and as he swings, he lunges forward, which exposes him to being plunked more often.
2. Youkilis is such a hitting machine that he gets hit often in order to keep him from swinging for the fences. This doesn’t hold water, to me. A pitcher could just as easily put him on base safely with an intentional walk, so unless there’s some other incentive to hit him, there’s no reason to risk ejection by throwing at Youkilis. This leads directly to…
3. Youk is a jerk. This is pretty self-explanatory, and is probably a factor.

First of all, we need to figure out whether it’s likely that Kevin is being hit by chance. To figure that out, we need to make some assumptions about hit batsmen and evaluate them using the binomial distribution. I’m also excited to point out that Youk has been overtaken as the Greek God of Take Your Base by someone new: Brett Carroll.

## Edwin Jackson, Fourth No-Hitter of 2010

June 25, 2010

Posted by tomflesher in Baseball, Economics.

Tonight, Edwin Jackson of the Arizona Diamondbacks pitched a no-hitter against the Tampa Bay Rays. That’s the fourth no-hitter of this year, following Ubaldo Jimenez and the perfect games by Dallas Braden and Roy Halladay.

Two questions come to mind immediately:

1. How likely is a season with 4 no-hitters?
2. Does this mean we’re on pace for a lot more?

The second question is pretty easy to dispense with. Taking a look at the list of all no-hitters (which interestingly enough includes several losses), it’s hard to predict a pattern. No-hitters aren’t uniformly distributed over time, so saying that we’ve had 4 no-hitters in x games doesn’t tell us anything meaningful about a pace.

The first is a bit more interesting. I’m interested in the frequency of no-hitters, so I’m going to take a look at the list of frequencies here and take a page from Martin over at BayesBall in using the Poisson distribution to figure out whether this is something we can expect.

The Poisson distribution takes the form

$f(n; \lambda)=\frac{\lambda^n e^{-\lambda}}{n!}$

where $\lambda$ is the expected number of occurrences and we want to know how likely it would be to have $n$ occurrences based on that.

Using Martin’s numbers – 201506 opportunities for no-hitters and an average of 4112 games per season from 1961 to 2009 – I looked at the number of no-hitters since 1961 (120) and determined that an average season should return about 2.44876 no-hitters. That means

$\lambda = 2.44876$

and

$f(n; \lambda = 2.44876)=\frac{2.44876^n (.0864)}{n!}$

[Table: Poisson distribution of no-hitters per season for $\lambda = 2.44876$, with columns n, p, cdf, p49, obs, cp49 and cobs.]

In the distribution table, p is the probability of exactly n no-hitters being thrown in a single season of 4112 games; cdf is the cumulative probability, or the probability of n or fewer no-hitters; p49 is the predicted number of seasons out of 49 (1961-2009) that we would expect to have n no-hitters; obs is the observed number of seasons with n no-hitters; cp49 is the predicted number of seasons with n or fewer no-hitters; and cobs is the observed number of seasons with n or fewer no-hitters.

It’s clear that 4 or even 5 no-hitters is a perfectly reasonable number to expect.
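The predicted columns (p, cdf, p49 and cp49) can be regenerated from the numbers above. Here is a minimal Python sketch, with names of my own choosing; it does not reproduce the observed columns, since those need the season-by-season counts.

```python
import math

def poisson_pmf(n, lam):
    """Probability of exactly n events when lam are expected."""
    return lam ** n * math.exp(-lam) / math.factorial(n)

NO_HITTERS = 120          # no-hitters, 1961-2009
OPPORTUNITIES = 201506    # Martin's count of no-hitter opportunities
GAMES_PER_SEASON = 4112   # average games per season, 1961-2009
SEASONS = 49

lam = NO_HITTERS / OPPORTUNITIES * GAMES_PER_SEASON

rows = []
cdf = 0.0
for n in range(8):
    p = poisson_pmf(n, lam)
    cdf += p
    rows.append((n, p, cdf, p * SEASONS, cdf * SEASONS))
    print(f"n={n}  p={p:.4f}  cdf={cdf:.4f}  "
          f"p49={p * SEASONS:.2f}  cp49={cdf * SEASONS:.2f}")
```

The n=4 row is the one that matters for this season; one minus the cdf at n=3 gives the chance of four or more.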

## AJ Burnett: Statistical Anomaly

June 21, 2010

Posted by tomflesher in Baseball.

Tonight, A.J. Burnett had a weird first inning in a game that’s still going on as I write this. He got the first two outs fairly easily, and then surrendered home runs to Justin Upton, Adam LaRoche, and Mark Reynolds. Before he knew it, he was down 5-0 in the bottom of the first. That can’t happen very often.

I queried Baseball-Reference.com’s event finder for home runs, then narrowed it down to first inning home-runs with two outs this year. Prior to tonight, there had been 82. None of them came in three-homer games – that answers that.

Just for fun, I checked 2009 as well. In total, there were 209 2-out, first-inning home runs in 2009. Only one of those home runs happened in a three-homer game, so it didn’t happen then, either.

Poor AJ.

## Carlos Zambrano, Ace Pinch Hitter?

June 21, 2010

Posted by tomflesher in Baseball.

Earlier this year, Chicago Cubs manager Lou Piniella experimented with moving starting pitcher and relatively big hitter Carlos Zambrano to the bullpen, briefly making him the Major Leagues’ best-paid setup man. Zambrano is back in the rotation as of the beginning of June. I’m curious what the effect of moving him to the bullpen was.

The thing is that not only is Zambrano an excellent pitcher (though he was slumping at the time), he’s also regarded as a very good hitter for a pitcher. He’s a career .237 hitter, with a slump last year to “only” .217 in 72 plate appearances (17th most in the National League). That average was 6th in the National League among pitchers with at least 50 plate appearances. He didn’t walk enough (his OBP was 13th on the same list), but he was 9th of the 51 pitchers on the list in terms of Base-Out Runs Added (RE24) with about 5.117 runs below a replacement-level batter. Ubaldo Jimenez was also up there with a respectable .220 BA, .292 OBP, but -8.950 RE24.

It should be pointed out that pitcher RE24 is almost always negative for starters – the best RE24 on that list is Micah Owings with -2.069. Zambrano’s run contribution was negative, sure, but it was a lot less negative than most starters. Zambrano also lost a bit of flexibility as an emergency pinch hitter (something that Owings is going through right now due to his recent move to the bullpen) – he’s more valuable as a reliever, so they won’t use him to pinch hit. As a result, he loses at-bats, and that not only keeps him from amassing hits but also allows him to get rusty.

It’s hard to precisely value the loss of Zambrano’s contribution, although he’s already on pace for -6.1 batting RE24. It’s likely, in my opinion, that his RE24 will rise as he continues hitting over the course of the year. His pitching value is also negative, however, which is unusual. He’s always been very respectable among Cubs starters. It’s possible that although he was pitching very well in relief, the fact that he has the ability to go long means that it’s inefficient to use him as a reliever. This is the opposite of, say, Joba Chamberlain, who is overpowering in relief but struggles as a starter.

As a starter, Zambrano has never been a net loss of runs. He needs to stay out of the bullpen, and Joba needs to stay there.

## Leadoff Home Runs

June 19, 2010

Posted by tomflesher in Baseball.