## Significance

Posted by Daniel Hall on July 11, 2008

Evan’s post last week about statistical versus economic (I might call it *actual*) significance reminded me of a recent set of articles in Wired magazine that announced “the end of theory”. The article offered an interesting (if florid) take on “the Petabyte Age”, where huge volumes of raw data make the scientific method obsolete:

> At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right. …
>
> Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
>
> But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. …
>
> There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

As you can probably imagine, this prompted some discussion among both natural and social scientists. (Both those links are worth reading, as are the posts they link to.) My only specific comment on the Wired articles is that while they do a nice job exploring some fields where data is proving very powerful, the implicit premise seems to be that causation is dead — long live correlation! — and this is oversold.

The link to Evan’s post that I wanted to highlight was that it is exactly this abundance of data (combined with the powerful statistical tools we have to analyze it) that has made for such an intense focus on statistical significance in recent years. This is completely appropriate when you are just mucking around in data looking for correlations. After all, if I look at a set of 20 random variables (that have no relationship in reality) then on average I should find one relationship between them that is significant at the p=0.05 level. Remember, finding a statistically significant correlation at p=0.05 means that if there were really no relationship, there would only be a 1 in 20 chance of seeing a correlation that strong by accident. Well, when you start plowing through hundreds (or thousands) of variables and trying out scores (or hundreds) of specifications, random artifacts start showing up! (Everyone who actually understands statistics is now allowed to bang their head against their desk and correct my horribly sloppy and imprecise language in the comments.)
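A quick simulation makes the point concrete (the sample size and variable count here are illustrative choices of mine, not numbers from the post): generate 20 mutually independent random variables, test one of them against the other 19, and count how many correlations clear the p=0.05 bar on pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 20
data = rng.standard_normal((n_obs, n_vars))  # no real relationships exist

# For n = 100 observations, a two-sided test at p = 0.05 rejects when the
# Pearson correlation exceeds roughly |r| = 0.197 (the standard critical value).
critical_r = 0.197

y = data[:, 0]
false_positives = 0
for j in range(1, n_vars):
    r = np.corrcoef(y, data[:, j])[0, 1]
    if abs(r) > critical_r:
        false_positives += 1  # a "significant" correlation that is pure noise

print(f"{false_positives} of {n_vars - 1} noise variables look significant")
```

With 19 tests at the 5 percent level, we expect about one spurious hit per run; try more variables or more specifications and the artifacts multiply accordingly.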

And this gets back to that difference that Evan was talking about. When you are doing reduced form econometrics and not pretending you have any idea what the relationship between any of your variables should be, you had better show some pretty impressive t stats — and start coming up with a reasonably good post hoc argument — before I pay much attention to your paper.

On the other hand, if theory tells us that there is good reason to suspect that a set of variables share a causal relationship — if we have a *model* that we think represents reality — then it is quite a different thing to find that there is a 1 in 20 (or even 1 in 10!) chance that our correlation is just random. When data analysis backs up a theoretical prediction — even with only 90 (or even 80!) percent confidence — then our belief in the theory should get stronger.
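The asymmetry above can be put in rough Bayesian terms (every number here is an assumption I've picked for illustration, not anything from the post): if theory gives a hypothesis a decent prior, then even a test at the 5 percent level should move our beliefs a lot.

```python
# Illustrative Bayes-rule update (all inputs are assumed, hypothetical values):
prior = 0.50   # assumed prior probability the theoretical prediction is right
power = 0.80   # assumed chance of a significant result if the theory is right
alpha = 0.05   # chance of a significant result if the correlation is random

# P(theory | significant result) by Bayes' rule
posterior = (prior * power) / (prior * power + (1 - prior) * alpha)
print(f"belief after a significant result: {posterior:.0%}")  # about 94%
```

Start instead from a near-zero prior — the reduced-form fishing expedition — and the same significant result barely budges the posterior, which is exactly why those papers need much more impressive t stats before they deserve attention.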

The upshot is that I’m not ready to declare that correlation is enough. Data is great, but it’s a lot better when you combine it with some theory.

## Scott S. said

Thanks for making the argument that I wanted to make after reading the article, but which was mostly a list of complaints in my head. I love ‘teh Wired’, but they tend to overshoot with their conclusions.