Baseball | The World's Worst Sports Blog

Hit Batsman Roundup, 2010 December 26, 2010

Posted by tomflesher in Baseball.
Tags: Brett Carroll, hit batsman, hit by pitch, Hunter Pence, Kevin Youkilis, Omar Infante, Raul Ibanez, regression, Rickie Weeks, Scott Podsednik, spurious correlation, Victor Martinez

There’s very little more subtle and involved than the quiet elegance of a batter getting beaned. In fact, that particular strategy was invoked 1549 times in 2010, with 419 batters getting plunked at least once.

The absolute leader this season was not Kevin Youkilis or Brett Carroll but Rickie Weeks, who led with 25 HBP in 754 plate appearances. Put another way, Weeks got hit in 3.32% of his plate appearances. That’s almost once every 30 plate appearances, or nearly four times the MLB-wide rate of 0.83% of the time. (Incidentally, that’s total HBP divided by total plate appearances. The more skewed mean percentage is 0.58%.) What leads to such a high number of plunkings?
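The two league-wide figures in that parenthetical differ because of how the averaging is done. Here is a minimal sketch of the difference, using Weeks's line from above plus a handful of player lines that are invented purely for illustration (they are not 2010 data):

```python
# Weeks's 2010 line as quoted in the post.
weeks_hbp, weeks_pa = 25, 754
weeks_rate = weeks_hbp / weeks_pa

# Hypothetical (HBP, PA) lines, made up for illustration only.
players = [(25, 754), (5, 600), (0, 650), (0, 12), (0, 3), (1, 40)]

# Pooled rate: total HBP over total PA, so playing time acts as the weight.
pooled = sum(h for h, _ in players) / sum(pa for _, pa in players)

# Mean of individual rates: every player counts equally, so the low-PA
# players who were never hit drag the figure down in this sample.
mean_of_rates = sum(h / pa for h, pa in players) / len(players)

print(f"Weeks: {weeks_rate:.2%}, one HBP per {weeks_pa / weeks_hbp:.1f} PA")
print(f"pooled: {pooled:.2%}, mean of rates: {mean_of_rates:.2%}")
```

In this made-up sample the mean of rates comes out below the pooled rate, the same direction as the 0.58% versus 0.83% gap, though nothing forces that ordering in general.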

I would assume that a few things would go into the decision to hit a batter intentionally:

Pitchers are less likely to be hit by other pitchers.
If a hitter is likely to get on base anyway, he’s more likely to be hit – you don’t lose anything by putting him on base, and you control the damage by limiting him to one base.
If a batter is likely to hit for extra bases, he’s more likely to be hit.
If a batter is likely to steal a base, he’s less likely to be hit, but there is an offsetting effect for caught stealing.
American League batters are more likely to be hit because of the moral hazard effect of pitchers not having to bat.

With that in mind, I set up a regression in R using every player who had at least one plate appearance in 2010. I added binary variables for Pitcher (1 if the player’s primary position is pitcher, 0 otherwise) and Lg (1 if the player played the entire season in the American League, 0 otherwise), then regressed HBP/PA on Pitcher, Lg, BB, HR, OBP, SLG, SB, and CS. The results were somewhat surprising:

Call:
lm(formula = hbppa ~ Pitcher + Lg + BB + HR + OBP + SLG + SB + 
    CS)
 
Residuals:
       Min         1Q     Median         3Q        Max 
-0.0154027 -0.0059081 -0.0018096  0.0001845  0.1397065 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.847e-03  9.815e-04   6.975 5.77e-12 ***
Pitcher     -5.399e-03  9.136e-04  -5.909 4.81e-09 ***
Lg          -1.614e-03  7.054e-04  -2.289   0.0223 *  
BB          -1.412e-05  3.257e-05  -0.434   0.6647    
HR           1.122e-04  7.956e-05   1.411   0.1587    
OBP          8.570e-03  3.477e-03   2.465   0.0139 *  
SLG         -3.451e-03  2.468e-03  -1.398   0.1624    
SB          -6.749e-05  8.693e-05  -0.776   0.4377    
CS           1.770e-04  2.646e-04   0.669   0.5036    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.01042 on 935 degrees of freedom
Multiple R-squared: 0.08839,    Adjusted R-squared: 0.08059 
F-statistic: 11.33 on 8 and 935 DF,  p-value: 2.07e-15
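As a quick way to read that table, here is a small sketch that copies the estimates, standard errors, and p-values from the output above, recomputes each t value as estimate divided by standard error, and lists which predictors clear a given significance threshold. The variable names and the two thresholds are my own choices for illustration.

```python
# (estimate, std. error, p-value) copied from the lm() output above.
coefs = {
    "Pitcher": (-5.399e-03, 9.136e-04, 4.81e-09),
    "Lg":      (-1.614e-03, 7.054e-04, 0.0223),
    "BB":      (-1.412e-05, 3.257e-05, 0.6647),
    "HR":      (1.122e-04, 7.956e-05, 0.1587),
    "OBP":     (8.570e-03, 3.477e-03, 0.0139),
    "SLG":     (-3.451e-03, 2.468e-03, 0.1624),
    "SB":      (-6.749e-05, 8.693e-05, 0.4377),
    "CS":      (1.770e-04, 2.646e-04, 0.5036),
}

# The t value is just the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se, _) in coefs.items()}

def significant(alpha):
    """Predictors whose p-value is below alpha, in table order."""
    return [name for name, (_, _, p) in coefs.items() if p < alpha]

sig_95 = significant(0.05)   # 95% level
sig_80 = significant(0.20)   # 80% level

for name, t in t_values.items():
    print(f"{name:8s} t = {t:6.3f}")
print("95% level:", sig_95)
print("80% level:", sig_80)
```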


That’s right – only Pitcher, Lg, OBP, HR, and SLG are even marginally significant (80% level), and only Pitcher, Lg, and OBP clear the 95% level. BB, SB, and CS aren’t even close. Why not?

Well, for one, the number of stolen bases and times caught stealing are relatively small no matter what. There probably isn’t enough data. For another, there simply probably isn’t as much intent to hit batters as we’d like to pretend.

The league effect also runs the wrong way: American Leaguers are less likely to be hit, not more, which is the opposite of the moral hazard story. This baffles me a little bit.

Also, keep in mind that this model shouldn’t be expected to, and cannot, explain all or even most of the variation in hit batsmen. The R-squared is about .09, meaning that it explains about 9% of the variation. It ignores probably the most important factor, physics, entirely. (That is, the model doesn’t have any way to account for accidental plunkings.) As a side note, other regressions show there might be an effect for plate appearances, meaning you’re more likely to get hit by chance alone if you take enough pitches.

Finally, there are some guys who manage to do the opposite of Weeks’ feat. Houston outfielder Hunter Pence went 156 games and 658 plate appearances without getting plunked at all. Honorable mentions go to Raul Ibanez, Scott Podsednik, Victor Martinez, and Omar Infante, all of whom went over 500 plate appearances without a beaning. Now THAT’S plate discipline.

Weird Pitching Decisions Almanac in 2010 December 24, 2010

Posted by tomflesher in Baseball.
Tags: baseball-reference.com, Carl Pavano, Cheap Wins, Clayton Kershaw, Colby Lewis, Cubs, Felix Hernandez, Francisco Rodriguez, Hiroki Kuroda, Jeremy Affeldt, John Lackey, Justin Verlander, Mariners, Phil Hughes, Red Sox, Rodrigo Lopez, Roy Oswalt, Royals, Tommy Hanson, Tough Losses, Tyler Clippard, vulture wins

I’m a big fan of weird pitching decisions. A pitcher with a lot of tough losses pitches effectively but stands behind a team with crappy run support. A pitcher with a high proportion of cheap wins gets lucky more often than not. A reliever with a lot of vulture wins might as well be taking the loss.

In an earlier post, I defined a tough loss two ways. The official definition is a loss in which the starting pitcher made a quality start – that is, six or more innings with three or fewer earned runs. The Bill James definition is the same, except that James defines a quality start as having a game score of 50 or higher. In either case, tough losses result from solid pitching combined with anemic run support.
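The two definitions can disagree, which a short sketch makes concrete. It uses the standard Bill James game score formula (start at 50, add a point per out, two per inning completed after the fourth, one per strikeout; subtract two per hit, four per earned run, two per unearned run, one per walk). The `Start` record and the two sample lines are hypothetical, not taken from any 2010 box score.

```python
from dataclasses import dataclass

@dataclass
class Start:
    outs: int            # outs recorded (3 per inning)
    hits: int
    earned_runs: int
    unearned_runs: int
    walks: int
    strikeouts: int
    decision: str        # "W", "L", or "" for a no-decision

def game_score(s):
    """Bill James game score for a starting pitcher."""
    completed_innings = s.outs // 3
    return (50 + s.outs + 2 * max(0, completed_innings - 4) + s.strikeouts
            - 2 * s.hits - 4 * s.earned_runs - 2 * s.unearned_runs - s.walks)

def is_quality_start(s):
    # Six or more innings, three or fewer earned runs.
    return s.outs >= 18 and s.earned_runs <= 3

def tough_loss_official(s):
    return s.decision == "L" and is_quality_start(s)

def tough_loss_james(s):
    return s.decision == "L" and game_score(s) >= 50

# Hypothetical lines: a strong seven innings, and a messy six.
strong = Start(outs=21, hits=5, earned_runs=2, unearned_runs=0,
               walks=1, strikeouts=6, decision="L")
messy = Start(outs=18, hits=9, earned_runs=3, unearned_runs=0,
              walks=3, strikeouts=2, decision="L")

for s in (strong, messy):
    print(game_score(s), tough_loss_official(s), tough_loss_james(s))
```

The messy start is a quality start by the innings-and-runs rule but falls short of a game score of 50, so it counts as a tough loss under one definition and not the other.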

This year’s Tough Loss leaderboard had 457 games spread around 183 pitchers across both leagues. The Dodgers’ Hiroki Kuroda led the majors with a whopping eight tough losses in starts with game scores of 50 or more. He was followed by eight players with six tough losses each: Justin Verlander, Carl Pavano, Roy Oswalt, Rodrigo Lopez, Colby Lewis, Clayton Kershaw, Felix Hernandez, and Tommy Hanson. Kuroda’s Dodgers led all teams with 23 tough losses, followed by the Mariners and the Cubs with 22 each.

There were fewer cheap wins, in which a pitcher does not make a quality start but does earn the win. The Cheap Win leaderboard had 248 games and 136 pitchers, led by John Lackey with six and Phil Hughes with five. Hughes pitched to 18 wins, but Lackey’s six cheap wins were almost half of his 14-win total this year. That really shows what kind of run support he had. The Royals and the Red Sox were tied for first place with 15 team cheap wins each.

Finally, a vulture win is one for the relievers. I define a vulture win as a blown save and a win in the same game, so I searched Baseball Reference for players with blown saves and then looked for the largest number of wins. Tyler Clippard was the clear winner here. In six blown saves, he got five vulture wins. Francisco Rodriguez and Jeremy Affeldt each deserve credit, though – each had three blown saves and converted all three for vulture wins. (When I say “converted,” I mean “waited it out for their team to score more runs.”)
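For completeness, the cheap win and vulture win definitions reduce to a couple of one-line checks. This is only a sketch: the function names are mine, innings are counted in outs to dodge the 5.2-means-five-and-two-thirds notation, and the quality start test uses earned runs.

```python
def is_cheap_win(decision, outs, earned_runs):
    """A win earned without a quality start (6+ IP, 3 or fewer ER)."""
    quality_start = outs >= 18 and earned_runs <= 3
    return decision == "W" and not quality_start

def is_vulture_win(decision, blown_save):
    """A reliever blows the save and still gets the win in the same game."""
    return decision == "W" and blown_save

# Blown saves and vulture wins as quoted in the post.
relievers = {"Clippard": (6, 5), "Rodriguez": (3, 3), "Affeldt": (3, 3)}
conversion = {name: wins / blown for name, (blown, wins) in relievers.items()}

print(is_cheap_win("W", outs=15, earned_runs=5))   # five innings, five runs, a win
print(conversion)
```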

Pitchers Hit This Year (or, Two Guys Named Buchholz) December 23, 2010

Posted by tomflesher in Baseball.
Tags: baseball-reference.com, Bruce Chen, Clay Buchholz, Evan Meek, George Sherrill, Gustavo Chacin, hit by pitch, Jack Taschner, Joe Blanton, Kenley Jansen, Manny Aybar, Matt Reynolds, Pitchers batting, Taylor Buchholz, weird lines, Yovani Gallardo

Okay, I admit it. This post was originally conceived as a way to talk about the supremely weird line put up by Gustavo Chacin, who in his only plate appearance for Houston hit a home run to leave him with the maximum season OPS of 5.0. Unfortunately, Raphy at Baseball Reference beat me to it. Instead, I noticed while I was browsing the NL’s home run log to prepare to run some diagnostics on it that Kenley Jansen had two plate appearances comprising one hit and one walk. (Seriously, is there anything this kid can’t do?)
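Why 5.0 is the ceiling: OPS is on-base percentage plus slugging average, OBP tops out at 1.000, and slugging tops out at 4.000 (a home run in every at-bat). A short sketch with Chacin's one-plate-appearance line, using the standard formulas and helper names of my own:

```python
def obp(hits, walks, hbp, at_bats, sac_flies):
    """On-base percentage: times on base over qualifying plate appearances."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def slg(total_bases, at_bats):
    """Slugging average: total bases per at-bat."""
    return total_bases / at_bats

# One plate appearance, one at-bat, one home run (four total bases).
chacin_ops = obp(hits=1, walks=0, hbp=0, at_bats=1, sac_flies=0) + slg(4, 1)
print(chacin_ops)
```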

In Kenley’s case, that’s not entirely surprising, since he was a catcher until this season. His numbers weren’t great, but he was competent. What surprised me was that 75 pitchers since 2000 have finished the season with a perfect batting average. Nine were from this year, including Clay Buchholz and his distant cousin Taylor Buchholz. Evan Meek and Bruce Chen matched Jansen’s two plate appearances without an out. None of the perfect batting average crowd had an extra-base hit except for Chacin.

Since 2000, the most plate appearances by a pitcher to keep the perfect batting average was 4 by Manny Aybar in 2000.

At the other end of the spectrum, this year only three pitchers managed a perfect 1.000 on-base percentage without getting any hits at all. George Sherrill and Matt Reynolds both walked in their only plate appearances; Jack Taschner went them one better by recording a sacrifice hit in a second plate appearance.

Finally, to round things out, this year saw Joe Blanton and Heureusement, ici, c’est le Blog’s favorite pitcher, Yovani Gallardo, each get hit by two pitches. Gallardo had clearly angered other pitchers by being so much more awesome than they were.

Are This Year’s Home Runs Really That Different? December 22, 2010

Posted by tomflesher in Baseball, Economics.
Tags: Carlos Pena, Carlos Quentin, home run distributions, home runs, Jose Bautista, kurtosis, Mark Teixeira, Miguel Cabrera, Paul Konerko, R, skewness, statistics

This year’s home runs are quite confounding. On the one hand, home runs per game in the AL have dropped precipitously (as noted and examined in the two previous posts). On the other hand, Jose Bautista had an absolutely outstanding year. How different is this year’s distribution from those of previous years? To answer that question, I took off to Baseball Reference and found the list of all players with at least one plate appearance, sorted by home runs.

There are several parameters that are of interest when discussing the distribution of events. The first is the mean. This year’s mean was 5.43, meaning that of the players with at least one plate appearance, on average each one hit 5.43 homers. That’s down from 6.53 last year and 5.66 in 2008.

Next, consider the variance and standard deviation. (The variance is the standard deviation squared, so the numbers derive similarly.) A low variance means that the numbers are clumped tightly around the mean. This year’s variance was 68.4, down from last year’s 84.64 but up from 2008’s 66.44.

Skewness and kurtosis describe the shape of the tails: skewness measures how lopsided the distribution is (which tail is longer), and kurtosis measures how heavy the tails are. Since a lot of people have very few home runs, the skewness of every year’s distribution is going to be positive. Roughly, that means that there are observations far larger than the mean, but very few that are far smaller. That makes sense, since there’s no such thing as a negative home run total. The kurtosis number represents how pointy the distribution is, or alternatively how much of the distribution is found in the tail.
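All four of these numbers are moments of the same list of home run totals, so they can be computed in one pass. The sketch below uses the population formulas and the plain (non-excess) kurtosis, which is an assumption on my part; I can't tell from the figures quoted here which conventions produced them. The home run totals are invented for illustration.

```python
def moments(values):
    """Return (mean, variance, skewness, kurtosis) using population formulas."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = var ** 0.5
    # Third and fourth standardized moments.
    skew = sum((v - mean) ** 3 for v in values) / n / sd ** 3
    kurt = sum((v - mean) ** 4 for v in values) / n / sd ** 4
    return mean, var, skew, kurt

# Hypothetical season: lots of players with few homers, one big outlier.
hr_totals = [0, 0, 0, 0, 1, 2, 3, 5, 8, 12, 20, 31, 54]
mean, var, skew, kurt = moments(hr_totals)
print(f"mean={mean:.2f} var={var:.2f} skew={skew:.2f} kurt={kurt:.2f}")
```

A list like this one, bunched near zero with a long right tail, gives a positive skewness, which is the pattern described above.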

For example, in 2009, Mark Teixeira and Carlos Pena jointly led the American League in home runs with 39. There was a high mean, but the tail was relatively thin with a high variance. Compared with this year, when Bautista led his nearest competitor (Paul Konerko) by 15 home runs and only 8 players were over 30 home runs, 2009 saw 15 players above 30 home runs with a pretty tight race for the lead. Kurtosis in 2010 was 7.72 compared with 2009’s 4.56 and 2008’s 5.55. (In 2008, 11 players were above the 30-mark, and Miguel Cabrera’s 37 home runs edged Carlos Quentin by just one.)

The numbers say that 2008 and 2009 were much more similar than either of them is to 2010. A quick look at the distributions bears that out – this was a weird year.

Diagnosing the AL December 22, 2010

Posted by tomflesher in Baseball, Economics.
Tags: 2010, American League, baseball-reference.com, R, regression, statistics, Year of the Pitcher

In the previous post, I crunched some numbers on a previous forecast I’d made and figured out that it was a pretty crappy forecast. (That’s the fun of forecasting, of course – sometimes you’re right and sometimes you’re wrong.) The funny part of it, though, is that the observed home runs per game for the American League was so far off – 3.4 standard errors below the predicted value – that it’s highly unlikely that the regression model I used controls for all relevant variables. That’s not surprising, since it was only a time trend with a dummy variable for the designated hitter.

There are a couple of things to check for immediately. The first is the most common explanation thrown around when home runs drop – steroids. It seems to me that if the drop in home runs were due to better control of performance-enhancing drugs, then it should mostly be home runs that are affected. For example, intentional walks should probably be below expectation, since intentional walks are used to protect against a home run hitter. Unintentional walks should probably be about as expected, since walks are a function of plate discipline and pitcher control, not of strength. On-base percentage should probably drop at a lower magnitude than home runs, since some hits that would have been home runs will stay in the park as singles, doubles, or triples rather than all being fly-outs. There will be a drop but it won’t be as big. Finally, slugging average should drop because a loss in power without a corresponding increase in speed will lower total bases.

I’ll analyze these with pretty new R code behind the cut.


What Happened to Home Runs This Year? December 22, 2010

Posted by tomflesher in Baseball, Economics.
Tags: baseball-reference.com, forecasting, home runs, R, regression, standard error, statistics, time series, Year of the Pitcher

I was talking to Jim, the writer behind Apparently, I’m An Angels Fan, who’s gamely trying to learn baseball because he wants to be just like me. Jim wondered aloud how much the vaunted “Year of the Pitcher” has affected home run production. Sure enough, on checking the AL Batting Encyclopedia at Baseball-Reference.com, production dropped by about .15 home runs per game (from 1.13 to .97). Is that normal statistical variation or does it show that this year was really different?

In two previous posts, I looked at the trend of home runs per game to examine Stuff Keith Hernandez Says and then examined Japanese baseball’s data for evidence of structural break. I used the Batting Encyclopedia to run a time-series regression for a quadratic trend and added a dummy variable for the Designated Hitter. I found that the time trend and DH control account for approximately 56% of the variation in home runs per year, and that the functional form is

$\hat{HR} = .957 - .0188 \times t + .0004 \times t^2 + .0911 \times DH$

with t=1 in 1955, t=2 in 1956, and so on. That means t=56 in 2010. Consequently, we’d expect home run production per game in 2010 in the American League to be approximately

$\hat{HR} = .957 - .0188 \times 56 + .0004 \times 3136 + .0911 \approx 1.25$

That means we expected production to increase this year and it dropped precipitously, for a residual of -.28. The residual standard error on the original regression was .1092 on 106 degrees of freedom, so the t-value using Texas A&M’s table is 1.984 (approximating using 100 df). That means we can be 95% confident that the actual number of home runs should fall within .1092*1.984, or about .2167, of the expected value. The lower bound would be about 1.03, meaning we’re still significantly below what we’d expect. In fact, the observed number is about 3.4 standard errors below the expected number. In other words, we’d expect that to happen by chance less than .1% (that is, less than one tenth of one percent) of the time.
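The arithmetic in that paragraph is easy to replay from the printed figures. This sketch plugs the rounded coefficients from the fitted equation into the trend, then builds the 95% band from the residual standard error and the table t-value. Because the coefficients are printed to only a few significant figures (and the $t^2$ term gets multiplied by 3136), numbers recomputed this way may differ from values taken from the full-precision fit, such as the 3.4 standard errors quoted above.

```python
# Rounded coefficients from the fitted trend: intercept, t, t^2, DH dummy.
b0, b1, b2, b_dh = 0.957, -0.0188, 0.0004, 0.0911

t = 2010 - 1955 + 1          # t = 1 in 1955, so t = 56 in 2010
dh = 1                       # the American League uses the DH

predicted = b0 + b1 * t + b2 * t ** 2 + b_dh * dh
observed = 0.97              # AL home runs per game in 2010, per the post

rse = 0.1092                 # residual standard error of the regression
t_crit = 1.984               # two-sided 95% critical value, about 100 df

margin = rse * t_crit
lower, upper = predicted - margin, predicted + margin
residual = observed - predicted

print(f"predicted {predicted:.2f}, band [{lower:.2f}, {upper:.2f}]")
print(f"residual {residual:.2f}, or {residual / rse:.2f} residual standard errors")
```

Either way, the observed .97 sits below the lower edge of the band.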

Clearly, something else is in play.

Home Run Derby: Does it ruin swings? December 15, 2010

Posted by tomflesher in Baseball, Economics.
Tags: Baseball, baseball-reference.com, Chris Young, Corey Hart, David Ortiz, Hanley Ramirez, home run derby, home runs, Matt Holliday, Miguel Cabrera, Nick Swisher, Vernon Wells

Earlier this year, there was a lot of discussion about the alleged home run derby curse. This post by Andy on Baseball-Reference.com asked if the Home Run Derby is bad for baseball, and this Hardball Times piece agrees with him that it is not. The standard explanation involves selection bias – sure, players tend to hit fewer home runs in the second half after they hit in the Derby, but that’s because the people who hit in the Derby get invited to do so because they had an abnormally high number of home runs in the first half.

Though this deserves a much more thorough macro-level treatment, let’s just take a look at the density of home runs in either half of the season for each player who participated in the Home Run Derby. Those players include David Ortiz, Hanley Ramirez, Chris Young, Nick Swisher, Corey Hart, Miguel Cabrera, Matt Holliday, and Vernon Wells.

For each player, plus Robinson Cano (who was of interest to Andy in the Baseball-Reference.com post), I took the percentage of games before the Derby and compared it with the percentage of home runs before the Derby. If the Ruined Swing theory holds, then we’d expect

$g(HR) \equiv HR_{before}/HR_{Season} > g(Games) \equiv Games_{before}/162$

The table below shows that in almost every case, including Cano (who did not participate), the density of home runs in the pre-Derby games was higher than in the post-Derby games; Swisher is the lone exception.

Player	HR Before	HR Total	g(Games)	g(HR)	Diff
Ortiz	18	32	0.54321	0.5625	0.01929
Hanley	13	21	0.54321	0.619048	0.075838
Swisher	15	29	0.537037	0.517241	-0.0198
Wells	19	31	0.549383	0.612903	0.063521
Holliday	16	28	0.54321	0.571429	0.028219
Hart	21	31	0.549383	0.677419	0.128037
Cabrera	22	38	0.530864	0.578947	0.048083
Young	15	27	0.549383	0.555556	0.006173
Cano	16	29	0.537037	0.551724	0.014687
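The table can be rebuilt from its first two columns. In this sketch the pre-Derby team game counts (86 to 89) are back-calculated from the g(Games) column times 162, so treat them as my reconstruction rather than numbers stated in the post.

```python
TEAM_GAMES = 162

# (player, HR before the Derby, season HR, team games before the Derby)
# The games column is back-calculated from g(Games) * 162.
rows = [
    ("Ortiz", 18, 32, 88), ("Hanley", 13, 21, 88), ("Swisher", 15, 29, 87),
    ("Wells", 19, 31, 89), ("Holliday", 16, 28, 88), ("Hart", 21, 31, 89),
    ("Cabrera", 22, 38, 86), ("Young", 15, 27, 89), ("Cano", 16, 29, 87),
]

# Diff = g(HR) - g(Games); positive means homers were front-loaded.
diffs = {
    name: hr_before / hr_total - games_before / TEAM_GAMES
    for name, hr_before, hr_total, games_before in rows
}

front_loaded = [name for name, d in diffs.items() if d > 0]

for name, d in diffs.items():
    print(f"{name:9s} {d:+.6f}")
print(f"{len(front_loaded)} of {len(rows)} players hit a larger share of "
      "their homers before the Derby than their share of games")
```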

Is this evidence that the Derby causes home run percentages to drop off? Certainly not. There are some caveats:

This should be normalized based on games the player played, instead of team games.
It would probably even be better to look at a home run per plate appearance rate instead.
It could stand to be corrected for deviation from the mean to explain selection bias.
Cano’s numbers are almost identical to Swisher’s. They play for the same team. If there was an effect to be seen, it would probably show up here, and it doesn’t.

Once finals are up, I’ll dig into this a little more deeply.

In Memoriam November 11, 2010

Posted by tomflesher in Baseball.

In Flanders Fields the poppies blow
Between the crosses row on row,
That mark our place; and in the sky
The larks, still bravely singing, fly
Scarce heard amid the guns below.

We are the Dead. Short days ago
We lived, felt dawn, saw sunset glow,
Loved and were loved, and now we lie
In Flanders fields.

Take up our quarrel with the foe:
To you from failing hands we throw
The torch; be yours to hold it high.
If ye break faith with us who die
We shall not sleep, though poppies grow
In Flanders fields.

– John McCrae

Fire Up The Hot Stove November 2, 2010

Posted by tomflesher in Baseball.
Tags: Aubrey Huff, Buster Posey, Cliff Lee, Giants, Rangers, Tim Lincecum, Yankees

Although I’m usually fairly heavy on the statistical content, I can’t help but mention a few impressions from Game 5 of the World Series last night.

If I didn’t have Baseball-Reference.com to tell me different, I’d have assumed Aubrey Huff wasn’t an everyday first baseman from the way he played last night. He was competent and made some nice picks, but he didn’t seem to have the ankle-preservation instinct that most everyday 1Bs do. He seemed to have his heels back quite far on the bag most of the time.

The rumors about the Yankees pursuing Cliff Lee strike me as cartoonish supervillainy. “If I cannot defeat you, I will simply BUY you!”

Game 3 was the Lee vs. Tim Lincecum gem that we all assumed Game 1 would be.

Somewhere, Bengie Molina is secretly pouring champagne all over himself.

If the postseason came before voting, Buster Posey would be a lock for Rookie of the

Quickie: Ryan Howard’s Choke Index October 25, 2010

Posted by tomflesher in Baseball.
Tags: baseball-reference.com, binomial distribution, Choke Index, Phillies, Ryan Howard, statistics

The Choke Index is alive and well.

Previous to 2010, Ryan Howard of the Philadelphia Phillies hit home runs in three consecutive postseasons. He managed 7 in his 140 plate appearances, averaging out to .05 home runs per plate appearance. Not too shabby. It’s a bit below his regular season rate of about .067, but there are a bunch of things that could account for that.

This year, Ryan made 38 plate appearances and hit a grand total of 0 home runs in the postseason. What’s the likelihood of that happening? I use the Choke Index (one minus the probability of hitting 0 home runs in a given number of plate appearances) to measure that. As always, the closer a player gets to 1, the more unlikely his homer-free streak is.

The binomial probability can be calculated using the formula

$f(k;n,p) = \Pr(K = k) = {n\choose k}p^k(1-p)^{n-k}$

Or, since we’re looking for the probability of an event NOT occurring,

$f(0;n,p) = (1-p)^n$

or $.95^{38}= .142$

using his career postseason numbers. That means that Ryan Howard’s 2010 postseason Choke Index is .858. Pretty impressive!
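Here is the same calculation as a short sketch, checking that the binomial formula with zero successes collapses to the simple power. The function names are mine; the inputs are Howard's numbers from above.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def choke_index(p, n):
    """One minus the probability of zero home runs in n plate appearances."""
    return 1 - binom_pmf(0, n, p)

p = 7 / 140                  # career postseason HR per PA before 2010
n = 38                       # 2010 postseason plate appearances

p_no_homers = binom_pmf(0, n, p)
howard = choke_index(p, n)
print(f"P(no HR) = {p_no_homers:.3f}, Choke Index = {howard:.3f}")
```

Swapping in the regular season rate of about .067 for `p` would push the index higher still, since a homerless stretch is less likely at that rate.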
