binomial distribution | The World's Worst Sports Blog

How has hitting changed this year? Evidence from the first half of 2017 August 22, 2017

Posted by tomflesher in Baseball.
Tags: Baseball, binomial distribution, home runs, MLB, probability, Stuff Gary Cohen Says

It’s no secret that MLB hitters are hitting more home runs this year. In June, USA Today’s Ted Berg called the uptick “so outrageous and so unprecedented” as to require additional examination, and he offered a “juiced” ball as a possibility (along with “juiced” players and statistical changes to players’ approaches). DJ Gallo noted a “strange ambivalence” toward the huge increase in home runs, and June set a record for the most home runs in a month. Neil Greenberg makes a convincing case that the number of homers is due to better understanding of the physics of hitting.

How big a shift are we talking about here? Well, take a look at the numbers from 2016’s first half. (That’s defined as games before the All-Star Game.) That comprises 32670 games and 101450 plate appearances. In that time period, hitters got on base at a .323 clip. About 65% of hits were singles, with 19.6% doubles, 2.09% triples, and 13.2% home runs. Home runs came in about 3.04% of plate appearances (3082 home runs in 101450 plate appearances).

Using 2016’s rate, 2017’s home run count is basically impossible.

Taking that rate as our prior, how different are this year’s numbers? For one, batters are getting on base only a little more – the league’s OBP is .324 – but hitting more extra-base hits every time. Only 63.7% of hits in the first half were singles, with 19.97% of hits landing as doubles, 1.78% triples, and 14.5% home runs. There were, incidentally, more homers (3343) in fewer plate appearances (101269). Let’s assume for the moment that those numbers are significantly different from last year – that the statistical fluctuation isn’t due to weather, “dumb luck,” or anything else, but has to be due to some internal factor. There weren’t that many extra hits – again, OBP only increased by .001 – but the distribution of hits changed noticeably. Almost all of the “extra” hits went to the home run column, rather than more hits landing as singles or doubles.

In fact, there were more fly balls this year – the leaguewide grounder-to-flyer ratio fell from .83 in 2016 to .80 this year. That still doesn’t explain everything, though, since the percentage of fly balls that went out of the park rose from 9.2% to 10%. (Note that those are yearlong numbers, not first-half specific.) Not only are there more fly balls, but more of them are leaving the stadium as home runs. The number of fly balls on the infield has stayed steady at 12%, and although there are slightly more walks (8.6% this year versus 8.2% last year), the strikeout rate rose by about the same number (21.5% this year, 21.1% last year).

Using last year’s rate of 3082 homers per 101450 plate appearances, I simulated 100,000 seasons each consisting of 101269 plate appearances – the number of appearances made in the first half of 2017. To keep the code simple, I recorded only the number of home runs in each season. If the rates were the same, the numbers would be clustered around 3077. In fact, in those 100,000 seasons, the median and mean were both 3076, and the distribution shown above has a clear peak in that region. Note that in the bottom right corner, the distribution’s tail basically disappears above 3300; in those 100,000 seasons, the most home runs recorded was 3340 – 3 fewer than this year’s total. In fact, the probability of having fewer than 3343 home runs is 0.9999992. If everything is the same as last year, the probability of this year’s home runs occurring simply by chance is .0000008, or roughly 8 in 10 million.
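The simulated tail can be cross-checked without any random draws, since a home run count over a fixed number of plate appearances is binomial under the same-rate assumption. What follows is not the code used for the post. It is a minimal Python sketch, with the helper names `log_binom_pmf` and `upper_tail` invented for illustration, that sums the exact binomial tail in log space so the huge binomial coefficients never overflow.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def upper_tail(k, n, p):
    """P(K >= k) for K ~ Binomial(n, p), summed term by term."""
    return sum(math.exp(log_binom_pmf(j, n, p)) for j in range(k, n + 1))

rate_2016 = 3082 / 101450          # first-half 2016 home runs per plate appearance
pa_2017 = 101269                   # first-half 2017 plate appearances
expected_hr = pa_2017 * rate_2016  # where the simulated seasons should cluster
tail = upper_tail(3343, pa_2017, rate_2016)  # chance of 3343 or more homers

print(f"expected home runs: {expected_hr:.1f}")
print(f"P(at least 3343 home runs): {tail:.2e}")
```

The exact tail and a 100,000-season simulation answer the same question; the simulation simply cannot resolve probabilities much smaller than one in 100,000, which is why no simulated season reached 3343.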

Teixeira’s Ability to Pick Up Slack: Re-Evaluating April 12, 2011

Posted by tomflesher in Baseball, Economics.
Tags: Alex Rodriguez, binomial distribution, home runs, Mark Teixeira, Michael Kaye, Robinson Cano, Yankees

In an earlier post, I discussed Yankees broadcaster Michael Kaye’s belief that Mark Teixeira and Robinson Cano were picking up slack during the time in which Alex Rodriguez was struggling to hit his 600th home run. I noticed that Teixeira had hit 18 home runs in 423 plate appearances during the first 93 games of the season for rates of .194 home runs per game and .0426 home runs per plate appearance. During the time between A-Rod’s #599 and #600, Teixeira’s performance was different in a statistically significant way: his production per game was up to .417 home runs per game and .0926 home runs per plate appearance.

Now, let’s take a look at the home stretch of the season. Teixeira played in 52 games, starting 51 of them, and hit 10 home runs in 230 plate appearances. That works out to .1923 home runs per game or .0435 per plate appearance. Those numbers are exceptionally similar to Teixeira’s production in the first stretch of the season, so it seems reasonable to say that those rates represent his standard rate of production.

This is prima facie evidence that Teixeira was working to hit more home runs, consciously or subconsciously, during the time that Rodriguez was struggling. The question then becomes, is there a reason to expect production to increase during the stretch between late July and early August? What if Mark was just operating better following the All-Star Break?

I chose a twelve-game stretch immediately following the All-Star Break to evaluate. This period overlaps with the drought between A-Rod’s 599th and 600th home runs, stretching from July 16 to July 28, so six games overlap and six do not. During that time, Teixeira hit 3 home runs in 56 plate appearances. His rate was therefore .0536 home runs per plate appearance.

If we assume that Teixeira’s true rate of production is about .043 home runs per plate appearance (his average over the season, excluding the drought), then the probability of his hitting exactly 3 home runs in a random 56-plate-appearance stretch is

$p(K = k) = {n \choose k}p^k(1-p)^{n-k} = {56 \choose 3}.043^{3}(.957)^{53} \approx .2146$

He has a 43% chance of hitting 3 or more, compared with the complementary 57% probability of hitting fewer than 3. It’s well within the normal expected range. So, the All-Star Break effect is unlikely to explain Teixeira’s abnormal production last July.
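For anyone who wants to reproduce those figures, here is a minimal Python sketch of the same binomial arithmetic. The function name `binom_pmf` is just an illustrative helper, and the rate of .043 and the 56 plate appearances are taken straight from the text above.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 56, 0.043   # plate appearances in the stretch, assumed true HR rate

exactly_3 = binom_pmf(3, n, p)
fewer_than_3 = sum(binom_pmf(k, n, p) for k in range(3))  # 0, 1 or 2 homers
three_or_more = 1 - fewer_than_3

print(f"P(exactly 3) = {exactly_3:.4f}")
print(f"P(fewer than 3) = {fewer_than_3:.2f}")
print(f"P(3 or more) = {three_or_more:.2f}")
```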

Quickie: Ryan Howard’s Choke Index October 25, 2010

Posted by tomflesher in Baseball.
Tags: baseball-reference.com, binomial distribution, Choke Index, Phillies, Ryan Howard, statistics

The Choke Index is alive and well.

Previous to 2010, Ryan Howard of the Philadelphia Phillies hit home runs in three consecutive postseasons. He managed 7 in his 140 plate appearances, averaging out to .05 home runs per plate appearance. Not too shabby. It’s a bit below his regular season rate of about .067, but there are a bunch of things that could account for that.

This year, Ryan made 38 plate appearances and hit a grand total of 0 home runs in the postseason. What’s the likelihood of that happening? I use the Choke Index (one minus the probability of hitting 0 home runs in a given number of plate appearances) to measure that. As always, the closer a player gets to 1, the more unlikely his homer-free streak is.

The binomial probability can be calculated using the formula

$f(k;n,p) = \Pr(K = k) = {n\choose k}p^k(1-p)^{n-k}$

Or, since we’re looking for the probability of an event NOT occurring (that is, k = 0), the formula reduces to

$(1-p)^n$

or $.95^{38}= .142$

using his career postseason numbers. That means that Ryan Howard’s 2010 postseason Choke Index is .858. Pretty impressive!
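The Choke Index is a one-liner in code. This is a small Python sketch following the definition given above; the function name `choke_index` is made up for the example.

```python
def choke_index(hr_rate, plate_appearances):
    """One minus the chance of hitting zero home runs in the given plate appearances."""
    return 1 - (1 - hr_rate) ** plate_appearances

howard_rate = 7 / 140                       # career postseason HR per PA before 2010
howard_2010 = choke_index(howard_rate, 38)  # 38 homerless plate appearances in 2010

print(f"Choke Index: {howard_2010:.3f}")
```

Swapping in his regular season rate of about .067 instead of the postseason rate would push the index higher, since a better hitter is less likely to go 38 plate appearances without a homer.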

Teixeira and Cano: Picking up slack? August 5, 2010

Posted by tomflesher in Baseball, Economics.
Tags: A-Rod, Alex Rodriguez, binomial distribution, Mark Teixeira, probability, Robinson Cano, statistics, Yankees

Michael Kaye, the YES broadcaster for the Yankees, often pointed out between July 22 and August 4 that the Yankees were turning up their offense to make up for Alex Rodriguez‘s lack of home run production. That seems like it might be subject to significant confirmation bias – seeing a few guys hit home runs when you wouldn’t expect them to might lead you to believe that the team in general has increased its production. So, did the Yankees produce more home runs during A-Rod’s drought?

During the first 93 games of the season, the Yankees hit 109 home runs in 3660 plate appearances for rates of 1.17 home runs per game and .0298 home runs per plate appearance. From July 23 to August 3, they hit 17 home runs in 451 plate appearances over 12 games for rates of 1.42 home runs per game and .0377 home runs per plate appearance. Obviously those numbers are quite a bit higher than expected, but can it be due simply to chance?

Assume for the moment that the first 93 games represent the team’s true production capabilities. Then, using the binomial distribution, the likelihood of hitting exactly 17 home runs in 451 plate appearances is

$p(K = k) = {n\choose k}p^k(1-p)^{n-k} = {451\choose 17}.0298^{17}(.9702)^{434} \approx .0626$

The cumulative probability is about .868, meaning the probability of hitting 17 or fewer home runs is .868 and the probability of hitting more than that is about .132. The probability of hitting 16 or fewer is .805, which means out of 100 strings of 451 plate appearances about 81 of them should end with 16 or fewer home runs and about 19 with 17 or more. This is a perfectly reasonable number and not inherently indicative of a special performance by A-Rod’s teammates.

Kaye frequently cited Mark Teixeira and Robinson Cano as upping their games. Teixeira hit 18 home runs over the first 93 games and made 423 plate appearances for rates of .194 home runs per game and .0426 home runs per plate appearance. From July 23 to August 3, he had 5 home runs in 12 games and 54 plate appearances for rates of .417 per game and .0926 per plate appearance. A rate of home runs per plate appearance at least that high is about 8% likely, meaning that either Teixeira did up his game considerably or he was exceptionally lucky.

Cano played 92 games up to July 21, hitting 18 home runs in 400 plate appearances for rates of .196 home runs per game and .045 per plate appearance. During A-Rod’s drought, he hit 3 home runs in 50 plate appearances over 12 games for rates of .25 and .06. A per-plate-appearance rate at least that high is about 39% likely, which means we don’t have enough evidence to reject the idea that Cano’s performance (though better than usual) is just a random fluctuation.
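The team and player figures above all come from the same two binomial quantities: the probability of exactly k home runs, and the probability of at least k. Here is a rough Python sketch of that calculation, assuming each baseline rate is simply home runs divided by plate appearances before the drought. The helper names and the `samples` layout are illustrative, and because the rates are left unrounded the last digit can differ slightly from the figures quoted in the post.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# (name, baseline HR, baseline PA, drought HR, drought PA), all from the post
samples = [
    ("Yankees", 109, 3660, 17, 451),
    ("Teixeira", 18, 423, 5, 54),
    ("Cano", 18, 400, 3, 50),
]

results = {}
for name, base_hr, base_pa, hr, pa in samples:
    p = base_hr / base_pa                    # assumed "true" rate
    exact = binom_pmf(hr, pa, p)             # exactly this many homers
    at_most = binom_cdf(hr, pa, p)           # this many or fewer
    at_least = 1 - binom_cdf(hr - 1, pa, p)  # this many or more
    results[name] = (exact, at_most, at_least)
    print(f"{name}: exact={exact:.4f} at_most={at_most:.3f} at_least={at_least:.3f}")
```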

It will be interesting to see if Teixeira slows down as a home-run hitter now that Rodriguez’s drought is over.

600 Home Runs: Who’s Second? July 25, 2010

Posted by tomflesher in Baseball, Economics.
Tags: 600 home runs, Alex Rodriguez, binomial distribution, Dodgers, home runs, Jim Thome, Manny Ramirez, quick and dirty stats, Twins

Alex Rodriguez is, as I’m writing this, sitting at 599 home runs. Almost certainly, he’ll be the next player to hit the 600 home-run milestone, since the next two active players are Jim Thome at 575 and Manny Ramirez at 554. Today’s Toyota Text Poll (which runs during Yankee games on YES) asked which of those two players would reach #600 sooner.

There are a few levels of abstraction to answering this question. First of all, without looking at the players’ stats, Thome gets the nod at the first order because he’s significantly closer than Ramirez is. Hitting 25 home runs is easier than hitting 46, so Thome will probably get there first.

At the second order, we should take a look at the players’ respective rates. Over the past two seasons, Thome has averaged a rate of .053 home runs per plate appearance, while Ramirez has averaged .041 home runs per plate appearance. With fewer home runs to hit and a higher likelihood of hitting one each time he makes it to the plate, Thome stays more likely to hit #600 before Ramirez does… but how much more likely?

Using the binomial distribution, I tested the likelihood that each player would hit his required number of home runs in different numbers of plate appearances to see where that likelihood reached a maximum. For Thome, the probability increases until 471 plate appearances, then starts decreasing, so roughly, I expect Thome to hit his 25th home run within 471 plate appearances. For Manny, that maximum doesn’t occur until 1121 plate appearances. Again, the nod has to go to Thome. He’ll probably reach the milestone in less than half as many plate appearances.
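The search for that maximum is easy to sketch. The Python below is an illustration, not the original calculation: it scans candidate numbers of plate appearances and keeps the one where the binomial probability of exactly the required number of home runs peaks. The function names and the `max_pa` search limit are assumptions made for the example.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def most_likely_pa(k, p, max_pa=5000):
    """Plate appearances n at which P(exactly k home runs in n PA) is largest."""
    return max(range(k, max_pa + 1), key=lambda n: log_binom_pmf(k, n, p))

thome_pa = most_likely_pa(25, 0.053)    # 25 homers needed at .053 per PA
ramirez_pa = most_likely_pa(46, 0.041)  # 46 homers needed at .041 per PA

print(thome_pa, ramirez_pa)
```

The scan is not strictly necessary: the ratio of successive probabilities shows the peak sits at the largest whole number of plate appearances not exceeding k/p, which is why the answers are close to 25/.053 and 46/.041.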

But wait. How many plate appearances is that, anyway? Until recently, Manny played 80-90% of the games in a season. Last year, he played 64%. So far the Dodgers have played 99 games and Manny appeared in 61 of them, but of course he’s disabled this year. Let’s make the generous assumption that Manny will play in 75% of the games in each season starting with this one. Then, let’s look at his average plate appearances per game. For most of his career, he averaged between 4.1 and 4.3 plate appearances per game, but this year he’s down to 3.6. Let’s make the (again, generous) assumption that he’ll get 4 plate appearances in each game from now on. At that rate, to get 1121 plate appearances, he needs to play in 280.25 games, which averages to 1.73 seasons of 162 games or about 2.31 seasons of 75% playing time.

Thome, on the other hand, has consistently played in 80% or more of his team’s games but suffered last year and this year because he hasn’t been serving as an everyday player. He pinch-hit in the National League last year and has, in Minnesota, played in about 69% of the games averaging only 3 plate appearances in each. Let’s give Jim the benefit of the doubt and assume that from here on out he’ll hit in 70% of the games and get 3.5 appearances (fewer games and fewer appearances than Ramirez). To reach 471 plate appearances he’d need about 134.6 games, which equates to about 83% of a 162-game season or about 1.19 seasons with 70% playing time. Even if we downgrade Thome to 2.5 PA per game and 66% playing time, that still gives us an expectation that he’ll hit #600 within the next 1.76 real-time seasons.
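The conversion from plate appearances to seasons is two divisions, which makes it easy to try other playing-time assumptions. This is a small Python sketch; the function name `seasons_needed` and its parameter names are invented for the example, and the inputs are the assumptions stated in the text.

```python
def seasons_needed(plate_appearances, pa_per_game, share_of_games, games_per_season=162):
    """Return (games required, seasons required) under the given playing-time assumptions."""
    games = plate_appearances / pa_per_game
    seasons = games / (games_per_season * share_of_games)
    return games, seasons

# Ramirez: 1121 PA at 4 PA per game, playing 75% of games
ramirez_games, ramirez_seasons = seasons_needed(1121, 4.0, 0.75)
# Thome: 471 PA at 3.5 PA per game, playing 70% of games
thome_games, thome_seasons = seasons_needed(471, 3.5, 0.70)
# Thome, pessimistic case: 2.5 PA per game, playing 66% of games
thome_low_games, thome_low_seasons = seasons_needed(471, 2.5, 0.66)

for label, seasons in [("Ramirez", ramirez_seasons),
                       ("Thome", thome_seasons),
                       ("Thome (pessimistic)", thome_low_seasons)]:
    print(f"{label}: {seasons:.2f} seasons")
```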

Since Thome and Ramirez are the same age, there’s probably no good reason to expect one to retire before the other, and they’ll probably both be hitting as designated hitters in the AL next year. As a result, it’s very fair to expect Thome to A) reach 600 home runs and B) do it before Manny Ramirez.

The Kate Smith Effect July 18, 2010

Posted by tomflesher in Baseball.
Tags: binomial distribution, Flyers, hockey-reference.com, Kate Smith, Kate Smith Effect

From the Mountains…
To the Prairies…
To the Oceans…
White with foam….

It’s “well-known” that when Kate Smith sings “God Bless America” – whether live starting in 1969 or on videotape now – the Philadelphia Flyers play better, or at least they’re more likely to win. As Wikipedia indicates, she’s considered a good luck charm for the Flyers. How much does she help?

Since 1969, the Flyers have played in 3268 games and won 1631 of them for an observed win percentage of .4991. That’s very close to the long-term win percentage of .50 that we’d expect for any team. Of those games, Kate Smith sang or was played at 114 of them with a total record of 87-23-4, and the record when Kate Smith did not sing was 1544 wins in 3154 games for a “non-Kate” win proportion of .4895. I’ll make the null hypothesis that the Flyers play exactly the same way in games where “God Bless America” is sung – “Kate games” – as they do when it isn’t. That means that

$H_{0}: p(Win \mid Kate) = p(Win \mid Non-Kate) = .4895$

The simplest way to attack this is to note that the Flyers’ win percentage in Kate games is .7632. Qualitatively, that’s quite a jump – surely, it must be significant. Of course, we can’t leave it at that.

First, note that with an observed proportion of .4895, the binomial probability of winning 87 games in 114 trials is approximately .00000000145 – that’s about 145 in one hundred billion. That’s highly unlikely. However, other methods can help us quantify the Kate Smith Effect.

The standard error for proportions is

$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.7632(.2368)}{114}} = \sqrt{.001585} = .0398$

With 113 degrees of freedom and a 95% confidence interval, I used Texas A&M’s t Calculator to find that the appropriate critical value is 1.98. That means that we can be 95% confident that the win percentage in Kate games after controlling for other factors is somewhere in the range

$.7632 \pm 1.98 \times .0398$ or approximately $.6844 \le p(Win \mid Kate) \le .8420$

Since the true proportion in non-Kate games is .4895, that means the Kate Smith Effect is somewhere in the range

$.1949 \le \hat{\delta} \le .3525$

Though I can’t explain why, it’s apparent that there’s a Kate Smith Effect of at least 19 percentage points in terms of winning percentage. This isn’t to say that playing Kate Smith’s “God Bless America” causes good luck. Since the Kate video is considered a good luck charm, it’s probably more likely that the players play harder in games that are deemed important enough to play it.
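The whole Kate Smith calculation can be recomputed from the raw win totals. Here is a minimal Python sketch, assuming the counts given above and treating the critical value of 1.98 as a constant copied from the text rather than something looked up in code. The variable names are illustrative.

```python
from math import comb, sqrt

kate_wins, kate_games = 87, 114        # games where "God Bless America" was used
other_wins, other_games = 1544, 3154   # all other games since 1969

p_other = other_wins / other_games     # "non-Kate" win proportion
p_kate = kate_wins / kate_games        # observed "Kate" win proportion

# Binomial probability of exactly 87 wins in 114 games at the non-Kate rate
prob_exactly_87 = (comb(kate_games, kate_wins)
                   * p_other**kate_wins
                   * (1 - p_other)**(kate_games - kate_wins))

# Standard error of the observed proportion and the 95% interval around it
std_err = sqrt(p_kate * (1 - p_kate) / kate_games)
t_crit = 1.98                          # critical value quoted in the text
low, high = p_kate - t_crit * std_err, p_kate + t_crit * std_err

# Size of the effect relative to the non-Kate proportion
effect_low, effect_high = low - p_other, high - p_other

print(f"P(exactly 87 wins) = {prob_exactly_87:.3g}")
print(f"standard error = {std_err:.4f}")
print(f"interval = ({low:.4f}, {high:.4f})")
print(f"effect = ({effect_low:.4f}, {effect_high:.4f})")
```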

Paul the Octopus: Credible? July 11, 2010

Posted by tomflesher in Economics.
Tags: binomial distribution, Paul the Octopus, statistics, World Cup

Paul the Octopus (hatched 2008) is an octopus who correctly predicted 12 of 14 World Cup matches, including Spain’s victory over the Dutch. Is his string of victories statistically significant?

First, I’m going to posit the null hypothesis that Paul is choosing randomly. As such, Paul’s proportion of correct choices should be .5 ( $H_o : \bar{p} = .5$ ). His observed proportion of correct choices is 12/14 or .857.

The standard error for proportions is

$\sqrt{\frac{p(1-p)}{n-1}} = \sqrt{\frac{.857(.143)}{13}} = \sqrt{\frac{.123}{13}} = \sqrt{.009} = .097$

The t-value of an observation, measured against the null value of .5, is

$t = \frac{p - .5}{se} = \frac{.857 - .5}{.097} = \frac{.357}{.097} = 3.68 \sim\ t_{13}$

According to Texas A&M’s t Distribution Calculator, the probability (or p-value) of this result by chance alone is less than .01.

Using the binomial distribution with $p = .5$ , the probability of 12 or more successes in 14 trials is a vanishingly small .0065.
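With only 14 trials, the exact binomial tail is simple to check directly. This is a short Python sketch with illustrative variable names; it adds up the chances of 12, 13 and 14 correct picks for a coin-flipping octopus.

```python
from math import comb

n_matches, n_correct, p_guess = 14, 12, 0.5

# P(12 or more correct picks out of 14) if every pick is a coin flip
tail = sum(comb(n_matches, k) * p_guess**k * (1 - p_guess)**(n_matches - k)
           for k in range(n_correct, n_matches + 1))

print(f"P(at least {n_correct} of {n_matches}) = {tail:.4f}")
```

Because every outcome has probability 1/2^14 here, the tail is just the number of ways to get 12, 13 or 14 right (91 + 14 + 1 = 106) divided by 16384.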

So, is Paul an oracle? Almost certainly not. However, not being a zoologist, I can’t explain what biases might be in play. I’d imagine it’s something like an attraction to contrast as well as a spurious correlation between octopus-attractive flags and success at soccer.

Pinch Hitters from the Bullpen July 6, 2010

Posted by tomflesher in Baseball, Economics.
Tags: binomial distribution, bullpen, Carlos Zambrano, Livan Hernandez, margin of error, Micah Owings, pinch hitter, sabermetrics

Occasionally, a solid two-way player shows up in the majors. Carlos Zambrano is known as a solid hitter with a great arm (despite the occasional meltdown), and Micah Owings is the rare pitcher used as a pinch hitter. Even Livan Hernandez has 15 pinch-hit plate appearances (with 2 sacrifice bunts, 6 strikeouts, and a .077 average and .077 OBP, compared with a lifetime .227 average and .237 OBP).

Like Hernandez, Zambrano has a very different batting line as a pinch hitter than as a pitcher. In 24 plate appearances as a pinch hitter, Big Z is hitting only .087 with a .087 OBP, compared to his .243/.249 line when hitting as a pitcher. Since we see the same effect for both of these pitchers, it seems like there’s some sort of difference in hitting as a pinch hitter that causes the pitchers to be less mentally prepared. Of course, these numbers come from a very small sample.

On the other hand, Micah Owings hits .307/.331 as a pitcher, and a quite similar .250/.298 as a pinch hitter. What’s the difference? Owings has almost double Zambrano’s plate appearances as a pinch hitter with 47. That seems to show that maybe Owings’ larger sample size is what causes the similarity. How can this be tested rigorously?

As we did with Kevin Youkilis and his title of Greek God of Take Your Base, we can use the binomial distribution to see if it’s reasonable for Owings, Hernandez and Zambrano to hit so differently as pinch hitters. To figure out whether it’s reasonable or not, let’s limit our inquiry to OBP just because it’s a more inclusive measure and then assume that the OBP as a pitcher (i.e. the one with a larger sample size) is the pitcher’s “true” OBP and use that to represent the probability of getting on base. Each plate appearance is a Bernoulli trial with a binary outcome – we’ll call it a success if the player gets on base and a failure otherwise.

Under the binomial distribution, the probability of a player with OBP p getting on base k times in n plate appearances is:

$\Pr(K = k) = {n\choose k}p^k(1-p)^{n-k}$

with

${n\choose k}=\frac{n!}{k!(n-k)!}$

We’ll also need the margin of error for proportions. If p = OBP as pitcher, and we assume a t-distribution with over 100 plate appearances (i.e. degrees of freedom), then the margin of error is:

$\sqrt{\frac{p(1-p)}{n-1}}$

so that 95% of the time we’d expect the pinch hitting OBP to lie within

$OBP \pm 2\times\sqrt{\frac{p(1-p)}{n-1}}$

Let’s start with Owings. He has an OBP of .331 as a pitcher in 151 plate appearances, so the probability of having at most 14 times on base in 47 plate appearances is .3778. In other words, about 38% of the time, we’d expect a random string of 47 plate appearances to have 14 or fewer times on base. His 95% confidence interval is .254 to .408, so his .298 OBP as a pinch hitter is certainly statistically credible.

Owings is special, though. Hernandez, for example, has 994 plate appearances as a pitcher and a .237 OBP, with only one time on base in 15 plate appearances. It’s a very small sample, but the binomial distribution predicts he would have at most one time on base only about 9.8% of the time. His confidence interval is .210 to .264, which means that it’s very unlikely that he’d end up with an OBP of .077 unless there is some relevant difference between hitting as a pitcher and hitting as a pinch hitter.

Zambrano’s interval breaks down, too. He has 601 plate appearances as a pitcher with a .249 OBP, but an anemic .087 OBP (two hits) in 24 plate appearances as a pinch hitter. We’d expect 2 or fewer hits only 4% of the time, and 95% of the time we’d expect Big Z to hit between .214 and .284.
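The three pitchers’ numbers can be reproduced with a few lines of code. This is a minimal Python sketch, assuming the OBP and plate appearance figures quoted above; the helper names and the `pitchers` layout are made up for illustration. The interval uses the same formula as the post, with n taken as the number of plate appearances as a pitcher.

```python
from math import comb, sqrt

def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def obp_interval(obp, n):
    """OBP plus or minus two standard errors, using the n - 1 form from the post."""
    margin = 2 * sqrt(obp * (1 - obp) / (n - 1))
    return obp - margin, obp + margin

# (name, OBP as pitcher, PA as pitcher, times on base as PH, PA as PH)
pitchers = [
    ("Owings", 0.331, 151, 14, 47),
    ("Hernandez", 0.237, 994, 1, 15),
    ("Zambrano", 0.249, 601, 2, 24),
]

summary = {}
for name, obp, pa, ph_on_base, ph_pa in pitchers:
    tail = binom_cdf(ph_on_base, ph_pa, obp)  # chance of doing this badly or worse
    low, high = obp_interval(obp, pa)
    summary[name] = (tail, low, high)
    print(f"{name}: P(at most {ph_on_base}) = {tail:.3f}, interval {low:.3f} to {high:.3f}")
```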

As a result, we can make two determinations.

Zambrano and Hernandez are hitting considerably below expectations as pinch hitters. It’s likely, though not proven, that this is a pattern among most pitchers.
Micah Owings is a statistical outlier from the pattern. It’s not clear why.

How often should Youk take his base? June 30, 2010

Posted by tomflesher in Baseball, Economics.
Tags: Baseball, baseball-reference.com, binomial distribution, Brett Carroll, Greek God of Take Your Base, hit batsmen, hit by pitch, Kevin Youkilis, R

Kevin Youkilis is sometimes called “The Greek God of Walks.” I prefer to think of him as “The Greek God of Take Your Base,” since he seems to get hit by pitches at an alarming rate. In fact, this year, he’s been hit 7 times in 313 plate appearances. (Rickie Weeks, however, is leading the pack with 13 in 362 plate appearances. We’ll look at him, too.) There are three explanations for this:

There’s something about Youk’s batting or his hitting stance that causes him to be hit. This is my preferred explanation. Youkilis has an unusual batting grip that thrusts his lead elbow over the plate, and as he swings, he lunges forward, which exposes him to being plunked more often.
Youkilis is such a hitting machine that he gets hit often in order to keep him from swinging for the fences. This doesn’t hold water, to me. A pitcher could just as easily put him on base safely with an intentional walk, so unless there’s some other incentive to hit him, there’s no reason to risk ejection by throwing at Youkilis. This leads directly to…
Youk is a jerk. This is pretty self-explanatory, and is probably a factor.

First of all, we need to figure out whether it’s likely that Kevin is being hit by chance. To figure that out, we need to make some assumptions about hit batsmen and evaluate them using the binomial distribution. I’m also excited to point out that Youk has been overtaken as the Greek God of Take Your Base by someone new: Brett Carroll. (more…)
