[by JSC5]

This paragraph from an essay by Dennis Overbye on discoveries in astrophysics really blew me away:

“Call it the two-sigma blues. **Two-sigma** is mathematical jargon for a measurement or discovery of some kind that **sticks up high enough above the random noise to be interesting but not high enough to really mean anything conclusive**. For the record, **the criterion for a genuine discovery is known as five-sigma**, suggesting there is less than one chance in roughly 3 million that it is wrong. Two sigma, leaving a 2.5 percent chance of being wrong, is just high enough to jangle the nerves.”

If you think back to your first statistics class, you’ll remember a bunch of ways for testing an observation for significance, and I’ll bet you $10 you used the 95% confidence level for just about everything you did in that class. It’s become the default level for significance in most social sciences. Run the regression, and if p<.05, bam, you’re done. Call it significant and move on.

Then along comes a hard science like astrophysics that *puts everyone else to shame*. These guys run right past .05 without looking back, at .025 their nerves start “jangling”, but it’s not until 3.33 x 10^-7 that they’re ready to go ahead and say, “Excuse me, sir, but I think I have a genuine discovery on my hands.” Meanwhile the economics/poly sci grad student next door has run 1000 new regressions and ‘discovered’ a bunch of things that turn out not to be quite right but still get published with frightening frequency.

And that’s why people trust astrophysicists.
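For anyone who wants to check the numbers in the Overbye quote, here is a rough sketch of the sigma-to-probability conversion. It assumes the noise is Gaussian and counts only the upper tail (a one-sided test); the function name `one_sided_tail` is made up for illustration. Under those assumptions two-sigma comes out near 2.3 percent and five-sigma near 1 in 3.5 million, both in the same ballpark as the figures quoted above.

```python
import math

def one_sided_tail(sigma):
    """P(Z > sigma) for a standard normal Z (assumption: Gaussian noise, upper tail only)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

for s in (2, 3, 5):
    p = one_sided_tail(s)
    # The two-sided probability is simply double the one-sided tail.
    print(f"{s}-sigma: one-sided p = {p:.3g} (about 1 in {1 / p:,.0f}), "
          f"two-sided p = {2 * p:.3g}")
```

Whether a one-sided or two-sided tail is the right one depends on the measurement, so treat the exact figures as conventions rather than laws.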


The idea that “the criterion for a genuine discovery is known as five-sigma” is a terrible mistake.

The high number of false alarms from statistical tests by the “economics/poly sci grad student next door” is because they follow cookbook statistical recipes without proper appreciation of other factors which affect the actual probability associated with an observation or experiment, and the topics of their study are riddled with extra variables, often unknown, usually uncontrollable, which affect the outcome. It’s unfortunate that some disciplines publish so many mistaken “results” in spite of the fact that if they were analyzed in detail, with proper statistical treatment, they wouldn’t pass muster.

But it’s also unfortunate that some of what you call “hard” sciences similarly fail in their statistical treatment of data. Instead of allowing so many false results to get published, they go to the other extreme and insist on a ridiculously high statistical significance level. The 5-sigma “significance” requirement isn’t because these hard sciences are so reliable. Quite the opposite: it’s unreliability that motivates them to overcompensate in the other direction.

I’m sorry to burst your “astrophysics is so trustworthy” bubble, but the fact is that if you have to resort to setting the statistical bar at 5-sigma, then you’re doin’ it wrong. At least, that’s my opinion. And since I’m a professional statistician who’s done quite a bit of research analyzing astrophysical data, I might know a thing or two about the topic that Dennis Overbye doesn’t.

Interesting. What confidence level do you like to use in your own work? Does it depend on the particular subject or observatory instrument?

Wow … what a can of worms.

Among statisticians the practice of hypothesis testing using a fixed confidence level is declining in popularity. What’s more popular is to compute the “p-value,” i.e., the probability of getting a result at least as extreme as the one observed if the null hypothesis is true. If the p-value is < 0.05 it would pass the "95% confidence" test, but as I say, that's a practice that's less favored these days. For many (myself included) it simply makes more sense to say the p-value is such-and-such and let others decide for themselves how "significant" that is.

An important consideration (and one which leads to a lot of those false alarms in the "soft" sciences) is that one should really take into consideration the effect of so many researchers doing so many tests on so many data sets. This can be *very* hard to estimate realistically, but the fact is that if you run 1000 regressions you *expect* to get 50 or so results at the p < 0.05 level, and that has to be taken into account or you'll be swamped with false alarms. It's one of the real problems in medical research.
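The arithmetic behind that "50 or so" can be sketched in a few lines of Python. This is only an illustration under strong assumptions: every null hypothesis is true, the tests are independent, and the p-values are therefore uniform on [0, 1]. The function name `count_false_alarms` and the seed are made up for the example.

```python
import random

def count_false_alarms(n_tests, alpha, seed=0):
    """Simulate n_tests independent tests where the null is always true.

    Under a true null the p-value is uniform on [0, 1], so each test
    "passes" at level alpha with probability alpha purely by chance.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))

n_tests, alpha = 1000, 0.05

expected = n_tests * alpha                   # average number of chance "discoveries"
observed = count_false_alarms(n_tests, alpha)
p_at_least_one = 1 - (1 - alpha) ** n_tests  # chance of one or more false alarms
bonferroni = alpha / n_tests                 # per-test threshold under a Bonferroni correction

print(f"expected false alarms: {expected:.0f}")
print(f"simulated false alarms: {observed}")
print(f"P(at least one false alarm): {p_at_least_one:.6f}")
print(f"Bonferroni per-test threshold: {bonferroni:.1e}")
```

A Bonferroni-style threshold (alpha divided by the number of tests) is one blunt way to account for the problem. It appears here as a common textbook correction, not as the method anyone in this thread uses, and real tests are rarely independent, which is part of why the effect is so hard to estimate.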

Another important consideration is that often the relevant "null hypothesis" isn't known very well, while the null hypothesis which we use to compute the p-value is rather clearly irrelevant. This is one of the big reasons that researchers' p-value computations can be so unrealistic, and one of the reasons that some disciplines tolerate an unruly number of false alarms while others have just jacked up the significance level to cut down on those false alarms.

Getting a real handle on the probability of a given outcome is hard, and most scientists (whether hard or soft) aren’t statisticians and can’t really afford the time to become one. So I don’t blame them for the choices they make. It’s OK to tolerate a lot of false alarms if it keeps the statistics easy enough to be within reach of the average researcher, and keeps research moving forward. Just be aware (!) that the real significance is less than your estimate. It’s also OK to jack up the significance level in order to have confidence in your outcomes. Just be aware (!) that the real significance is (again) less than your estimate, and that you’re likely going to miss some interesting results because it’s so hard to “make the cut.”

Then there’s the very thorny issue of, what does a p-value really mean? Yes, it means there’s only probability p of getting that result if the null hypothesis is true (assuming we even have a clue what the right null is), and that may be very small (or very large), but compared to what? The observed result may be very unlikely under the null, but might be vastly more unlikely under all other hypotheses. Then we should apply Sherlock Holmes’s maxim that when you’ve eliminated all other possibility, whatever’s left, no matter how unlikely, must be true. This whole issue leads naturally to the use of Bayesian statistics, which is really a new paradigm for statistical analysis. For many statisticians it’s the “darling” of modern statistics and anyone not doing Bayesian analysis is a useless fossil, while for others it’s a flash-in-the-pan and those not firmly rooted in the old ways (“frequentist” statistics) are being dazzled by a glamour queen.

So I guess my answer is: there’s no such thing as a single way to approach analysis, no such thing as a single significance level, each case should be analyzed with care and thoroughness and a healthy awareness that we might not have too many clues, in fact we might be clueless (in more than one sense of the word). I know … cop-out. But it’s for real.

And really, I don’t have a problem with using standard, “naive” methods since most researchers aren’t statisticians and it’s what’s available to them. And it’s OK for physicists who do so to insist on 5-sigma because that gives reliable results most of the time. But let’s not fool ourselves into believing it’s a sign that their science is “better.” The real reason they do it is that it’s a naive 5-sigma, so the extra caution is required.