November 04, 2020

Why we shouldn't take p-values literally

PSA: if you read a paper claiming p<0.01, you shouldn’t automatically take that p-value literally.

By definition, a p-value is the probability of getting a result at least as extreme as the one in your sample, assuming the null hypothesis H0 is true. In other words, assuming H0 is true, if you collected many more samples and ran the same testing procedure on each one, you'd expect a fraction p of the results to be at least as extreme as the original result (e.g., 1% of them for p=0.01). Notice: this requires running the exact same test on every sample.
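
To make this concrete, here is a minimal simulation sketch. The test, sample size, and distribution are my own illustrative choices (a one-sample t-test of "mean = 0" on Gaussian data), not anything from a particular paper:

```python
# Sketch: under H0, rerunning the *same* test on fresh samples recovers the p-value.
# Illustrative setup: H0 says the population mean is 0; the test is a one-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30

# The "original" sample and its test result.
original = rng.normal(loc=0.0, scale=1.0, size=n)
t_orig, p_orig = stats.ttest_1samp(original, popmean=0.0)

# Collect many more samples under H0 and run the exact same test on each.
reps = 20_000
t_reps = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, size=n), popmean=0.0).statistic
    for _ in range(reps)
])

# Fraction of replications at least as extreme (two-sided) as the original result.
frac_extreme = np.mean(np.abs(t_reps) >= abs(t_orig))
print(f"p-value reported by the original test:        {p_orig:.3f}")
print(f"fraction of replications at least as extreme: {frac_extreme:.3f}")  # ~= p_orig
```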

This amounts to fixing your test in advance, which is exactly what preregistration does. If, instead, you looked at the data before designing your test (to check for correlations, try different dependent variables, or even engage in outright p-hacking), you would violate the definition. Under this data-dependent procedure, every new sample gets its own test, chosen only after peeking at the data. That is, you would potentially run a different test on every sample, which contradicts the requirement to run the same test.
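
A quick simulation shows how badly this breaks the nominal error rate. The setup below is hypothetical (ten unrelated outcomes, two groups, no true effect anywhere); the "data-dependent" analyst simply reports whichever outcome looks best:

```python
# Sketch: "look first, then choose the test" inflates the false positive rate.
# Illustrative setup: H0 is true for all 10 outcomes, but the data-dependent
# analyst reports whichever outcome gives the smallest p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_outcomes, reps, alpha = 30, 10, 5_000, 0.05

false_positives_fixed = 0   # preregistered: always test outcome 0
false_positives_cherry = 0  # data-dependent: test whichever outcome looks best

for _ in range(reps):
    # Two groups, no true effect on any outcome.
    a = rng.normal(size=(n, n_outcomes))
    b = rng.normal(size=(n, n_outcomes))
    pvals = stats.ttest_ind(a, b).pvalue  # one p-value per outcome

    false_positives_fixed += pvals[0] < alpha
    false_positives_cherry += pvals.min() < alpha

print(f"fixed test:          {false_positives_fixed / reps:.3f}")  # ~ 0.05
print(f"data-dependent test: {false_positives_cherry / reps:.3f}") # ~ 0.40
```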

Hence, the burden of proof lies with the researcher to convince us that, given a new sample, they would run the same test. Preregistration is the easiest way to do this: it guarantees that the researcher chose their test before looking at the data. But if the hypothesis is not preregistered, and the analysis looks like Gelman and Loken's "garden of forking paths", then we should feel free to reject their p-value as invalid.

For example, suppose a paper reports a null average effect and a poorly motivated p<0.01 interaction effect. This looks suspiciously like the authors started out expecting an average effect, didn't find one, scrambled to find something, anything with three stars, and then made up a rationalization for why that interaction was what they had planned all along. But this is plainly data-dependent testing. We have little reason to believe the researcher would run the same test on other samples, and hence little reason to take the p-value seriously.
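
Here is a rough sketch of how often that fishing expedition pays off by luck alone. Everything below is hypothetical (a truly null treatment effect, fifteen unrelated binary moderators), and the interaction test is a simple large-sample z-test on the 2x2 contrast, a simplification of the usual regression with an interaction term:

```python
# Sketch: with no real effect anywhere, fishing across candidate interaction
# terms still turns up p<0.01 far more often than 1% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_moderators, reps = 200, 15, 2_000

def interaction_pvalue(y, treat, m):
    """Approximate z-test for the 2x2 interaction contrast:
    (treatment effect when m=1) minus (treatment effect when m=0)."""
    cells = [y[(treat == t) & (m == g)] for g in (0, 1) for t in (0, 1)]
    means = [c.mean() for c in cells]
    variances = [c.var(ddof=1) / len(c) for c in cells]
    contrast = (means[3] - means[2]) - (means[1] - means[0])
    z = contrast / np.sqrt(sum(variances))
    return 2 * stats.norm.sf(abs(z))

hits = 0  # experiments where *some* interaction reaches p < 0.01
for _ in range(reps):
    treat = rng.integers(0, 2, size=n)  # random treatment assignment
    y = rng.normal(size=n)              # outcome is pure noise under H0
    ps = [interaction_pvalue(y, treat, rng.integers(0, 2, size=n))
          for _ in range(n_moderators)]
    hits += min(ps) < 0.01

# With 15 shots at alpha=0.01, roughly 1 - 0.99**15 ~= 0.14 of experiments
# "find" a three-star interaction, even though the rate for any single
# preregistered test would be 0.01.
print(f"experiments with some p<0.01 interaction: {hits / reps:.3f}")
```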

Lesson: consumers of non-preregistered research need to do more than just look at the p-value. We also have to make a judgment call about whether that p-value is well-defined.