P > .05: The incorrect interpretation of "not significant" results is a significant problem.

Richard J Smith

doi:10.1002/ajpa.24092

Abstract

Statistically nonsignificant (p > .05) results from a null hypothesis significance test (NHST) are often mistakenly interpreted as evidence that the null hypothesis is true-that there is "no effect" or "no difference." However, many of these results occur because the study had low statistical power to detect an effect. Power below 50% is common, in which case a result of no statistical significance is more likely to be incorrect than correct. The inference of "no effect" is not valid even if power is high. NHST assumes that the null hypothesis is true; p is the probability of the data under the assumption that there is no effect. A statistical test cannot confirm what it assumes. These incorrect statistical inferences could be eliminated if decisions based on p values were replaced by a biological evaluation of effect sizes and their confidence intervals. For a single study, the observed effect size is the best estimate of the population effect size, regardless of the p value. Unlike p values, confidence intervals provide information about the precision of the observed effect. In the biomedical and pharmacology literature, methods have been developed to evaluate whether effects are "equivalent," rather than zero, as tested with NHST. These methods could be used by biological anthropologists to evaluate the presence or absence of meaningful biological effects. Most of what appears to be known about no difference or no effect between sexes, between populations, between treatments, and other circumstances in the biological anthropology literature is based on invalid statistical inference.

Full Text