Padre Differential July 11, 2011
Posted by tomflesher in Baseball, Economics. Tags: Baseball, baseball-reference.com, linear regression, National League, Padre Differential, Padres, Phillies, runs allowed, runs scored, statistics
I was all set to fire up the Choke Index again this year. Unfortunately, Derek Jeter foiled my plan by making his 3000th hit right on time, so I can’t get any mileage out of that. Perhaps Jim Thome will start choking around #600 – but, frankly, I hope not. Since Jeter had such a callous disregard for the World’s Worst Sports Blog’s material, I’m forced to make up a new statistic.
This actually plays into an earlier post I made, which was about home field advantage for the Giants. It started off as a very simple regression for National League teams to see if the Giants’ pattern – a negative effect on runs scored at home, no real effect from the DH – held across the league. Those results are interesting and hold with the pattern that we’ll see below – I’ll probably slice them into a later entry.
The first thing I wanted to do, though, was find team effects on runs scored. Basically, I want to know how many runs an average team of Greys will score, how many more runs they’ll score at home, how many more runs they’ll score on the road if they have a DH, and then how many more runs the Phillies, the Mets, or any other team will score above their total. I’m doing this by converting Baseball Reference’s schedules and results for each team through their last game on July 10 to a data file, adding dummy variables for each team, and then running a linear regression of runs scored by each team against dummy variables for playing at home, playing with a DH, and the team dummies. In equation form,
$$R = \beta_0 + \beta_1\,\mathrm{Home} + \beta_2\,\mathrm{DH} + \sum_{t} \gamma_t\,\mathrm{Team}_t + \varepsilon$$
For technical reasons, I needed to leave a team out, and so I chose the team that had the most negative coefficient: the Padres. Basically, then, the team terms $\gamma_t$ represent how many runs team $t$ scores above what the Padres would score. I call this “RAP,” for Runs Above Padres. I then ran the same equation, but rather than runs scored by the team, I estimated runs allowed by the team’s defense. That, logically enough, was called “ARAP,” for Allowed Runs Above Padres. A positive RAP means that a team scores more runs than the Padres, while a negative ARAP means the team doesn’t allow as many runs as the Padres. Finally, to pull it all together, one handy number shows how many runs better off a team is than the Padres:
$$\text{Padre Differential} = \mathrm{RAP} - \mathrm{ARAP}$$
That is, the Padre Differential shows whether a team’s per-game run differential is higher or lower than the Padres’.
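To make the arithmetic concrete, here is a minimal sketch of how the Padre Differential falls out of the two sets of coefficients. The helper name `padre_differential` is hypothetical, and the RAP and ARAP values are made-up placeholders rather than the regression estimates.

```python
def padre_differential(rap, arap):
    """Padre Differential = RAP - ARAP, in runs per game relative to the Padres."""
    return rap - arap


# Placeholder (RAP, ARAP) pairs for illustration only; NOT the estimated values.
coefficients = {
    "SDP": (0.0, 0.0),      # omitted baseline team, zero by definition
    "Team A": (0.9, -0.4),  # scores more and allows fewer than the Padres
    "Team B": (0.2, 0.5),   # scores a bit more but allows many more
}

# Sort from best to worst, the way the table is described.
table = sorted(
    ((team, padre_differential(rap, arap)) for team, (rap, arap) in coefficients.items()),
    key=lambda row: row[1],
    reverse=True,
)

for team, value in table:
    print(f"{team:8s} {value:+.2f}")
```

Because the Padres are the omitted team, their row is zero in both regressions, so every other team's number reads directly as runs per game better or worse than San Diego.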
The table below shows each team in the National League, sorted by Padre Differential. By definition, San Diego’s Padre Differential is zero. ‘Sig95’ represents whether or not the value is statistically significant at the 95% level.
Unsurprisingly, the Phillies – the best team in baseball – have the highest Padre Differential in the league, at more than 1.3 runs per game better than the Padres. Houston, in the cellar of the NL Central, is the worst team in the league at .8 runs per game worse than the Padres. Florida and Chicago are both worse than the Padres, and both are close to (Florida, 43) or below (Chicago, 37) the Padres’ 40-win total.
Home Field Advantage July 9, 2011
Posted by tomflesher in Baseball, Economics. Tags: Giants, home field advantage, linear regression
The Mets unfortunately played a 10 PM game in San Francisco last night, so I’m short on sleep today. I do remember, though, that Gary Cohen mentioned, repeatedly, the Giants’ significant home field advantage. Even after last night’s loss at the hands of Carlos Beltran (coming from a rare blown save by Brian Wilson), the Giants have a .619 winning percentage at home (26-16) versus a .500 winning percentage on the road (24-24). Interestingly, their run differential is much worse at home – they’ve scored 205 and allowed 184 on the road for a total differential of +21, but their run differential at home is actually negative. They’ve scored 120 but allowed 135 for a differential of -15.
Some of that is due to the way walk-offs are scored – they end an inning immediately, so a scoring inning at home is cut short when the same inning on the road would continue and might lead to further scoring – but it’s still quite shocking to see that large a split. So far, the Giants have scored only 11 walk-off RBIs, compared with 7 RBIs in the 9th inning on the road that came with the Giants ahead. So even adding in a few extra runs wouldn’t account for the difference.
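As a quick check on the split, the sketch below recomputes winning percentage and run differential from the home and road totals quoted above. The function name `split_summary` is just an illustrative choice.

```python
def split_summary(wins, losses, runs_scored, runs_allowed):
    """Return winning percentage and run differential for one home/road split."""
    games = wins + losses
    run_diff = runs_scored - runs_allowed
    return {
        "win_pct": wins / games,
        "run_diff": run_diff,
        "run_diff_per_game": run_diff / games,
    }


# Giants totals as quoted in the post.
home = split_summary(wins=26, losses=16, runs_scored=120, runs_allowed=135)
road = split_summary(wins=24, losses=24, runs_scored=205, runs_allowed=184)

for label, split in (("home", home), ("road", road)):
    print(
        f"{label}: win% {split['win_pct']:.3f}, "
        f"run diff {split['run_diff']:+d} "
        f"({split['run_diff_per_game']:+.2f} per game)"
    )
```

Putting the differential on a per-game basis keeps the comparison fair, since the Giants had played 42 games at home and 48 on the road.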
Last year, there wasn’t much of a home field effect at all. Running a very simple linear regression of runs scored against dummy variables for playing at home and playing with a DH, I estimated that
and only the intercept term, which represents (essentially) the unconditional average number of runs the Giants score, was significant.
For this year, the numbers are quite different.
with both the intercept and Home terms significant at the 95% level. It’s clear that the Giants are winning more at home, but it’s not because they’re scoring more at home.
Take Your Base July 7, 2011
Posted by tomflesher in Baseball, Economics. Tags: hit batsman, hit batsmen, hit by pitch, Kevin Youkilis, statistics
As usual, Kevin Youkilis is getting hit at an alarming rate this year. A quick check of his stats from Baseball Reference shows that from 2004 to 2010, he got hit at about a 2% clip and was intentionally walked about .5% of the time. This year, he’s been hit nine times in 340 plate appearances, for about 2.6% of plate appearances ending in the phrase “Take your base.” He’s only been intentionally walked once, which isn’t out of line with his three IBBs last year. In contrast, he was “only” hit ten times last year, so he’s one away from matching that mark and six away from tying his record 15 times hit (in 2007). Interestingly, Kevin has never been hit in the postseason.
It would be oversimplistic to say that guys who get hit a lot get hit because they’re jerks. There’s a plausible argument that Youkilis’ unorthodox batting stance is responsible for his high rate, and some guys just get hit more often. Crashburn Alley makes the point that getting hit is a legitimate skill, and Plunk Everyone has a truly dizzying array of information about players getting hit. My question, though, is whether it could be the case that Youkilis is hit less often in the postseason because pitchers are more careful.
In 2007, 2008, and 2009, Youkilis made a total of 123 postseason plate appearances. During that time, he was never hit, nor was he intentionally walked. His OBP was .376, compared with a .397 regular-season OBP over those years. It’s possible that he was simply slumping and not seen as a threat.
It’s also possible that Youk’s failure to get hit at a respectable 2% rate (we’d have expected about 2 1/2 plunks) was simply chance. As a quick check, assume that his regular season stats during 2007, 2008, and 2009 represent “true” information, and that the 123 plate appearances he made in the postseasons were all random draws from the same distribution. Since he was hit 43 times in 1834 plate appearances across 2007-09, his true rate would be 2.3% (closer to 2.34, but I rounded down – note that this cuts Youk a little extra slack). Then, 95% of 123-appearance distributions should have hit-by-pitch rates that fall within the window
$$p \pm 2\,se$$
where se is the standard error, calculated as
$$se = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.023 \times .977}{123}} \approx .0135$$
Thus, 95 out of 100 123-appearance runs should fall within the window
$$(.023 - .027,\ .023 + .027) = (-.004,\ .050)$$
Obviously, since there can’t be a negative number of hit batsmen, zero is included in that interval. Youkilis isn’t necessarily being pitched around more effectively in the postseason – he’s just unlucky enough not to get plunked.
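The same check can be run in a few lines. This is a sketch under two assumptions: the window is taken as two standard errors on either side of the rate (the usual rule of thumb for 95%), and the standard error is the ordinary one for a proportion. The function name `proportion_interval` is made up for the example.

```python
import math


def proportion_interval(p, n, z=2.0):
    """Return (low, high) for p +/- z standard errors, with se = sqrt(p(1-p)/n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


p_true = 0.023  # regular-season HBP rate for 2007-09, rounded down from 43/1834
n_post = 123    # postseason plate appearances over the same years

low, high = proportion_interval(p_true, n_post)

# A rate can't be negative, so clamp the lower bound before converting to counts.
low_count = max(0.0, low) * n_post
high_count = high * n_post

print(f"rate window: ({low:.3f}, {high:.3f})")
print(f"plunks in {n_post} PA: {low_count:.1f} to {high_count:.1f}")
print("zero plunks inside the window:", low <= 0.0 <= high)
```

Converting the window back into counts makes the conclusion easier to read: anything from zero up to about six plunks in 123 plate appearances is consistent with his regular-season rate.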
RBIs with Two Outs July 4, 2011
Posted by tomflesher in Baseball, Economics. Tags: Boone Logan, Daniel Murphy, Hector Noesi, Jason Bay, Mets, Ramiro Pena, RBIs, Scott Hairston, statistics, Subway Series, two-out RBIs, Yankees
Sunday’s Subway Series game between the Mets and Yankees ended with a bang – Jason Bay hit a single off Hector Noesi that brought home Scott Hairston. The tenth inning should have been over, but Ramiro Pena committed an error at shortstop that put Daniel Murphy on base for Boone Logan. Hairston’s run was unearned, but no matter – Noesi took the loss and the Mets won the game.
The final score was 3-2, and the interesting thing about the game was that all three of the Mets’ runs came with two outs. (My fiancée, Katie, suggested that this was unusual, and motivated most of the rest of this post.) In fact, so far, the Mets have had 347 RBIs (of 375 runs scored), and 147 of them have come with two outs. That’s about 42.4% of their RBIs. By contrast, only 1070 of 3274 plate appearances – 32.7% – come with two outs. (Less than a third of plate appearances come with two outs because of the double play, among other reasons.) The plurality come with no men out (about 34.8%), with the remainder coming with one out. It seems like the high concentration of 2-out RBIs should be explained by the use of the sacrifice bunt, but the Mets have only had 31 sacrifice bunts this season – not nearly enough to account for the difference between 32.7% of plate appearances and 42.4% of RBIs.
Is that pattern common across baseball? So far, there have been 10,037 RBIs in Major League Baseball in the 2011 season. 3686 of them – about 36.7% – came with two outs. Excluding the Mets’ numbers, that’s 3539 out of 9690, or 36.5%. For the National League only, there were 1928 two-out RBIs of 5212 total, or 37%, with 1781 of 4865 (36.6%) of National League RBIs coming with two outs if you exclude the Mets. (Note that I’m defining ‘in the National League’ as ‘in National League parks,’ since what I’m interested in is whether the Mets’ concentration of RBIs can be partially explained by the rules requiring pitchers to bat.)
Assume that the Mets’ RBIs are drawn from the same distribution as all others’. Then, 95% of the time, I’d expect the proportion of RBIs that come with two outs to be within two standard errors of the National League’s proportion, excluding the Mets. (The ‘two standard errors’ comes from the fact that a t-distribution’s critical value for a large number of trials for 95% significance is 1.96. For less than an infinite number, two standard errors is a handy approximation.) For the Mets’ 347 RBIs, the standard error would be
$$se = \sqrt{\frac{.366\,(1 - .366)}{347}} \approx .026$$
Thus, 95% of the time, the Mets should be within the interval of (.366 – .052, .366 + .052), or (.314, .418). Since, again, the Mets’ proportion is .424, the Mets are slightly outside the 95% confidence interval. That’s pretty close, and certainly could happen by chance, but it’s surprising nonetheless. The question then is whether this is due to some sort of strategy employed by the Mets’ management or to some sort of clutch playing ability by the Mets. Again, there’s more data to collect and crunch (as always).
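For anyone who wants to replay the numbers, the sketch below starts from the raw RBI counts quoted above and reports how many standard errors separate the Mets from the rest of the National League. The variable names are my own, and the two-standard-error cutoff is the same approximation used in the text.

```python
import math

# Counts quoted in the post.
mets_two_out, mets_total = 147, 347
nl_two_out_ex_mets, nl_total_ex_mets = 1781, 4865

p_mets = mets_two_out / mets_total
p_league = nl_two_out_ex_mets / nl_total_ex_mets

# Standard error of a proportion, evaluated at the league rate for a
# sample the size of the Mets' RBI total.
se = math.sqrt(p_league * (1 - p_league) / mets_total)

low, high = p_league - 2 * se, p_league + 2 * se
distance_in_se = (p_mets - p_league) / se

print(f"Mets: {p_mets:.3f}, league ex-Mets: {p_league:.3f}, se: {se:.4f}")
print(f"two-standard-error window: ({low:.3f}, {high:.3f})")
print(f"Mets sit {distance_in_se:.2f} standard errors above the league rate")
print("inside the window:", low <= p_mets <= high)
```

Reporting the gap in standard errors says a little more than a yes/no answer: the Mets are only modestly beyond the two-standard-error line, which fits the "close, but surprising" reading.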
June Wins Above Expectation July 1, 2011
Posted by tomflesher in Baseball, Economics. Tags: Baseball, baseball-reference.com, statistics, wins above expectation
Even though I’ve conjectured that team-level wins above expectation are more or less random, I’ve seen a few searches coming in over the past few days looking for them. With that in mind, I constructed a table (with ample help from Baseball-Reference.com) of team wins, losses, Pythagorean expectations, wins above expectation, and Alpha.
Quick definitions:
- The Pythagorean Expectation (pyth%) is a tool that estimates what percentage of games a team should have won based on that team’s runs scored and runs allowed. Since it generates a percentage, Pythagorean Wins (pythW) are estimated by multiplying the Pythagorean expectation by the number of games a team has played.
- Wins Above Expectation (WAE) are wins in excess of the Pythagorean expected wins. It’s hypothesized by some (including, occasionally, me) that WAE represent an efficiency factor – that is, they represent wins in games that the team “shouldn’t” have won, earned through shrewd management or clutch play. It’s hypothesized by others (including, occasionally, me) that WAE represent luck.
- Alpha is a nearly useless statistic representing wins above expectation per game played, so it sits on the same scale as winning percentage. Basically, W-L% = pyth% + Alpha. It’s an accounting artifact that will be useful in a long time series to test persistence of wins above expectation.
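The three definitions chain together, which a short sketch makes easier to see. The function names are hypothetical, the exponent of 1.81 is an assumption (Baseball-Reference may use a different one for its own Pythagorean numbers), and the sample record is invented for illustration.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=1.81):
    """Pythagorean expectation: RS^x / (RS^x + RA^x). The exponent is an assumption."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def wins_above_expectation(wins, losses, runs_scored, runs_allowed, exponent=1.81):
    """Return (pyth%, pythW, WAE, Alpha) for one team."""
    games = wins + losses
    pyth = pythagorean_pct(runs_scored, runs_allowed, exponent)
    pyth_wins = pyth * games   # expected wins over the games played
    wae = wins - pyth_wins     # wins in excess of expectation
    alpha = wae / games        # WAE per game played
    return pyth, pyth_wins, wae, alpha


# Invented record, for illustration only.
wins, losses, runs_scored, runs_allowed = 50, 31, 400, 334
pyth, pyth_wins, wae, alpha = wins_above_expectation(wins, losses, runs_scored, runs_allowed)

print(f"pyth%: {pyth:.3f}  pythW: {pyth_wins:.1f}  WAE: {wae:+.1f}  Alpha: {alpha:+.4f}")
print(f"W-L%: {wins / (wins + losses):.3f}  pyth% + Alpha: {pyth + alpha:.3f}")
```

The last line is the accounting identity from the definition of Alpha: winning percentage always equals the Pythagorean expectation plus Alpha, whatever exponent is chosen.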
The results are not at all interesting. The top teams in baseball – the Yankees, Red Sox, Phillies, and Braves – have either negative WAE (that is, wins below expectation) or positive WAE so small that they may as well be zero (about 2 wins in the Phillies’ case and half a win in the Braves’). The Phillies’ extra two wins are probably a mathematical distortion due to Roy Halladay’s two tough losses and two no-decisions in quality starts compared with only two cheap wins (and both of those were in the high 40s for game score). In fact, Philadelphia’s 66-run differential, compared with the Yankees’ 115, shows the difference between the two teams’ scoring habits. The Phillies have the luxury of relying on low run production (they’ve produced about 78% of the Yankees’ production) due to their fantastic pitching. On the other hand, the Yankees are struggling with a 3.53 starters’ ERA including Ivan Nova and AJ Burnett, both over 4.00, as full-time starters. The Phillies have three pitchers with 17 starts and an ERA under 3.00 (Halladay, Cliff Lee, and Cole Hamels) and Joe Blanton, who has an ERA of 5.50, has only started 6 games. Even with Blanton bloating it, the Phillies’ starters’ ERA is only 2.88.
That doesn’t, though, make the Yankees a badly-managed team. In fact, there’s an argument that the Yankees are MORE efficient because they’re leading their league, just as the Phillies are, with a much worse starting rotation, through constructing a team that can balance itself out.
That’s the problem with wins above expectation – they lend themselves to multiple interpretations that all seem equally valid.
Tables are behind the cut.
Justin Turner Takes One For The Team June 23, 2011
Posted by tomflesher in Baseball, Economics. Tags: Athletics, Brad Ziegler, Charlie Morton, Dane Sardinha, hit by pitch, Jeff Francoeur, Justin Turner, Mariano Rivera, Mets, Oakland As
The Mets’ Justin Turner quite literally took one for the team last night when he wasn’t trying to get hit, but, oops, managed to get plunked in the bottom of the 13th inning with the bases loaded. Brad Ziegler was the losing pitcher for Oakland. It was the first game-ending hit by pitch since last year, when Mariano Rivera nailed Jeff Francoeur for the loss in a September game.
In 185 plate appearances this year, Turner has been hit three times. The other two were both by Pittsburgh Pirates pitcher Charlie Morton, eleven days apart; Morton is not especially known for hitting batters, since he, too, has only been involved in three hit batsmen this year. (The third plunking was Dane Sardinha.) It was the Mets’ only go-ahead HBP this year, and the only one of this year’s six go-ahead hit batsmen to occur in extra innings.
Turner has a way about him. He has ten go-ahead RBIs this year (and yes, a hit by pitch that forces in a run is an RBI), which accounts for a little over ten percent of the Mets’ 95 go-ahead RBIs. Only Carlos Beltran, with 13, has more. Last night’s was also the Mets’ only game-ending RBI this year. I guess Turner will take what he can get.
Did Run Production Change in 2010? June 2, 2011
Posted by tomflesher in Baseball, Economics. Tags: Chow test, run production, Year of the Pitcher
Part of the narrative of last year’s season was the compelling “Year of the Pitcher” storyline prompted by an unusual number of no-hitters and perfect games. Though it’s too early in the season to say the same thing is happening this year, a few bloggers have suggested that run production is down in 2011 and we might see the same sort of story starting again.
As a quick and dirty check of this, I’d like to compare production in the 2000-2009 sample I used in a previous post to production in 2010. This will introduce a few problems, notably that using one year’s worth of data for run production will lead to possibly spurious results for the 2010 data and that the success of the pitchers may be a result of the strategy used to generate runs. That is, if pitchers get better, and strategy doesn’t change, then we see pitchers taking advantage of inefficiencies in strategy. If batting strategy stays the same and pitchers take advantage of bad batting, then we should see a change in the structure of run production since the areas worked over by hitters – for example, walks and strikeouts – will see shifts in their relative importance in scoring runs.
Hypothesis: A regression model of runs against hits, doubles, triples, home runs, stolen bases, times caught stealing, walks, times hit by pitch, sacrifice bunts, and sacrifice flies using two datasets, one with team-level season-long data for each year from 2000 to 2009 and the other from 2010 only, will yield statistically similar beta coefficients.
Method: Chow test.
Result: There is a difference, significant at the 90% but not 95% level. That might be a result of a change in strategy or of pitchers exploiting strategic inefficiencies.
R code behind the cut.
Is scoring different in the AL and the NL? May 31, 2011
Posted by tomflesher in Baseball, Economics. Tags: American League, Baseball, baseball-reference.com, bunts, Chow test, linear regression, National League, R, structural break
The American League and the National League have one important difference. Specifically, the AL allows the use of a player known as the Designated Hitter, who does not play a position in the field, hits every time the pitcher would bat, and cannot be moved to a defensive position without forfeiting the right to use the DH. As a result, there are a couple of notable differences between the AL and the NL – in theory, there should be slightly more home runs and slightly fewer sacrifice bunts in the AL, since pitchers have to bat in the NL and they tend to be pretty poor hitters. How much can we quantify that difference? To answer that question, I decided to sample a ten-year period (2000 until 2009) from each league and run a linear regression of the form
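$$R = \beta_0 + \beta_1\,\mathrm{H} + \beta_2\,\mathrm{2B} + \beta_3\,\mathrm{3B} + \beta_4\,\mathrm{HR} + \beta_5\,\mathrm{SB} + \beta_6\,\mathrm{CS} + \beta_7\,\mathrm{BB} + \beta_8\,\mathrm{K} + \beta_9\,\mathrm{HBP} + \beta_{10}\,\mathrm{Bunts} + \beta_{11}\,\mathrm{SF} + \varepsilon$$
where runs are presumed to be a function of hits, doubles, triples, home runs, stolen bases, times caught stealing, walks, strikeouts, hit batsmen, bunts, and sacrifice flies. My expectations are: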
- The sacrifice bunt coefficient should be smaller in the NL than in the AL – in the American League, bunting is used strategically, whereas NL teams are more likely to bunt whenever a pitcher appears, so in any randomly-chosen string of plate appearances, the chance that a bunt is the optimal strategy given an average hitter is much lower. (That is, pitchers bunt a lot, even when a normal hitter would swing away.) A smaller coefficient means each bunt produces fewer runs, on average.
- The strategy from league to league should be different, as measured by different coefficients for different factors from league to league. That is, the designated hitter rule causes different strategies to be used. I’ll use a technique called the Chow test to test that. That means I’ll run the linear model on all of MLB, then separately on the AL and the NL, and look at the size of the errors generated.
The results:
- In the AL, a sac bunt produces about .43 runs, on average, and that number is significant at the 95% level. In the NL, a bunt produces about .02 runs, and the number is not significantly different from saying that a bunt has no effect on run production.
- The Chow Test tells us at about a 90% confidence level that the process of producing runs in the AL is different from the process of producing runs in the NL. That is, in Major League Baseball, the designated hitter has a statistically significant effect on strategy. There’s a structural break.
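For readers who haven't seen a Chow test before, here is a minimal sketch of the mechanics in plain Python: fit the model on the pooled data, fit it separately on each group, and compare the sums of squared residuals. This is not the R code from the post. The helper names are hypothetical, and the toy data have a single regressor with invented values, where the real model has eleven.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_ssr(xs, ys):
    """Fit ys on xs plus an intercept; return (sum of squared residuals, k)."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    ssr = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, ys))
    return ssr, k


def chow_f(x1, y1, x2, y2):
    """Chow F statistic for equality of all coefficients across two groups."""
    ssr_pooled, k = ols_ssr(x1 + x2, y1 + y2)
    ssr_1, _ = ols_ssr(x1, y1)
    ssr_2, _ = ols_ssr(x2, y2)
    n = len(y1) + len(y2)
    ssr_separate = ssr_1 + ssr_2
    return ((ssr_pooled - ssr_separate) / k) / (ssr_separate / (n - 2 * k))


# Invented toy data: two groups with visibly different slopes.
x_al = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_al = [3.1, 4.9, 7.2, 8.8, 11.0]
x_nl = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_nl = [3.4, 4.1, 4.4, 5.1, 5.5]

f_stat = chow_f(x_al, y_al, x_nl, y_nl)
print(f"Chow F statistic: {f_stat:.1f} on (2, 6) degrees of freedom")
```

The statistic is compared against an F distribution with k and n − 2k degrees of freedom, where k counts the coefficients including the intercept. A large value means the pooled model fits much worse than the two separate ones, which is what a structural break looks like.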
R code is behind the cut.
Is ‘luck’ persistent? May 25, 2011
Posted by tomflesher in Baseball, Economics. Tags: American League, Baseball, Pythagorean expectation, wins above expectation
I’ve been listening to Scott Patterson’s The Quants in my spare time recently. One of the recurring jokes is Wall Street traders’ use of the word ‘Alpha’ (which usually represents abnormal returns in finance) to refer to a general quality of being skillful or having talent. That led me to think about an old concept I haven’t played with in a while – wins above expectation.
As a quick review, wins above expectation relate a team’s actual wins to its Pythagorean expectation. If the team wins more than expected, it has a positive WAE number, and if it loses more than expected, it has wins below expectation, or, equivalently, a negative WAE. It’s tempting to think of WAE as representing a sort of ‘alpha’ in the traders’ sense – since the Pythagorean Expectation involves groups of runs scored and runs allowed, it generates a probability that a team with a history represented by its runs scored/runs allowed stats will win a given game. If a team has a lot more wins than expected, it seems like that represents efficiency – scoring runs at crucial times, not wasting them on blowing out opponents – or especially skillful management. Alternatively, it could just be luck. Is there any way to test which it is?
It’s difficult. However, let’s break down what the efficiency factor would imply. In general, it would represent some combination of individual player skill (such as the alleged clutch hitting ability) and team chemistry, whether that boils down to on- or off-field factors. Assuming rosters don’t change much over the course of the year, then, efficiency also shouldn’t change much over the course of the year. Similarly, if a manager’s skill was the primary determinant of wins above expectation, then for teams that don’t change managers midyear, we wouldn’t expect much of a change throughout the course of the season. Most managers work up through the minors, so there probably isn’t a major on-the-job training effect to consider.
On the other hand, if wins above expectation are just luck, then we wouldn’t need to place any restrictions on them. Maybe they’d change. Maybe they wouldn’t. Who knows?
In order to test that idea, I pulled some data for the American League off Baseball Reference from last season. I split the season into pre- and post-All-Star Break sets and calculated the Pythagorean expectation (using the 1.81 exponent referred to in Wikipedia) for each team. I found WAE for each team in each session, then found each team’s ‘Alpha’ for that session by dividing WAE by the number of games played. Basically, I assumed that WAE represented extra win probability in some fashion and assumed it existed in every game at about the same level. The results:
As is evident from the table, a whopping 10 out of the 14 teams see a change in the sign of Alpha from before the All-Star Game to after the All-Star Game. The correlation coefficient of Alpha from pre- to post-All-Star is -.549, which is a pretty noisy correlation. (Note also that this very closely describes regression to the mean.) It’s not 0, but it’s also negative, implying one of two things: Either teams become less efficient and/or more badly managed, on average, after the break, or Alpha represents very little more than a realization of a random process, which might just as well be described as luck.
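The two summary numbers in that paragraph, the count of sign changes and the correlation coefficient, are simple to compute once each team's pre- and post-break Alpha is in hand. The sketch below uses invented Alpha values for five imaginary teams, not the American League figures, and the function names are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sign_changes(before, after):
    """Count teams whose Alpha flips sign between the two halves."""
    return sum(1 for b, a in zip(before, after) if b * a < 0)


# Invented Alpha values for illustration only.
alpha_pre = [0.03, -0.02, 0.01, -0.04, 0.02]
alpha_post = [-0.01, 0.03, -0.02, 0.01, 0.02]

print("sign changes:", sign_changes(alpha_pre, alpha_post), "of", len(alpha_pre))
print(f"correlation: {pearson(alpha_pre, alpha_post):.3f}")
```

If Alpha measured a stable skill, most teams would keep the same sign and the correlation would be clearly positive. Frequent flips and a negative correlation point the other way.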
Teixeira’s Ability to Pick Up Slack: Re-Evaluating April 12, 2011
Posted by tomflesher in Baseball, Economics. Tags: Alex Rodriguez, binomial distribution, home runs, Mark Teixeira, Michael Kaye, Robinson Cano, Yankees
In an earlier post, I discussed Yankees broadcaster Michael Kaye’s belief that Mark Teixeira and Robinson Cano were picking up slack during the time in which Alex Rodriguez was struggling to hit his 600th home run. I noticed that Teixeira had hit 18 home runs in 423 plate appearances during the first 93 games of the season for rates of .194 home runs per game and .0426 home runs per plate appearance. During the time between A-Rod’s #599 and #600, Teixeira’s performance was different in a statistically significant way: his production per game was up to .417 home runs per game and .0926 home runs per plate appearance.
Now, let’s take a look at the home stretch of the season. Teixeira played in 52 games, starting 51 of them, and hit 10 home runs in 230 plate appearances. That works out to .1923 home runs per game or .0435 per plate appearance. Those numbers are exceptionally similar to Teixeira’s production in the first stretch of the season, so it seems reasonable to say that those rates represent his standard rate of production.
This is prima facie evidence that Teixeira was working to hit more home runs, consciously or subconsciously, during the time that Rodriguez was struggling. The question then becomes, is there a reason to expect production to increase during the stretch between late July and early August? What if Mark was just operating better following the All-Star Break?
I chose a twelve-game stretch immediately following the All-Star Break to evaluate. This period overlaps with the drought between A-Rod’s 599th and 600th home runs, stretching from July 16 to July 28, so six games overlap and six do not. During that time, Teixeira hit 3 home runs in 56 plate appearances. His rate was therefore .0536 home runs per plate appearance.
If we assume that Teixeira’s true rate of production is about .043 home runs per plate appearance (his average over the season, excluding the drought), then the probability of his hitting exactly 3 home runs in a random 56-plate-appearance stretch is
$$P(X = 3) = \binom{56}{3}(.043)^{3}(.957)^{53} \approx .215$$
He has a 43% chance of hitting 3 or more, compared with the complementary 57% probability of hitting fewer than 3. It’s well within the normal expected range. So, the All-Star Break effect is unlikely to explain Teixeira’s abnormal production last July.
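These probabilities come straight from the binomial distribution, and they're quick to reproduce. The sketch below assumes each plate appearance is an independent trial with a .043 chance of a home run; `binom_pmf` is a hypothetical helper name.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n_pa = 56        # plate appearances in the twelve-game stretch
hr_rate = 0.043  # assumed true home run rate per plate appearance

p_exactly_3 = binom_pmf(3, n_pa, hr_rate)
p_fewer_than_3 = sum(binom_pmf(k, n_pa, hr_rate) for k in range(3))
p_3_or_more = 1 - p_fewer_than_3

print(f"P(exactly 3 HR)    = {p_exactly_3:.3f}")
print(f"P(fewer than 3 HR) = {p_fewer_than_3:.3f}")
print(f"P(3 or more HR)    = {p_3_or_more:.3f}")
```

Summing the first three terms and taking the complement is easier than adding up the 54 terms from 3 through 56, and it gives the same answer.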