Abstract


 
 
What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. This criterion has been criticized since its inception, and the criticisms have been renewed by recent failures to replicate studies published in top psychology journals. Several replacement criteria have been suggested, including reducing the alpha level to .005 or switching to other types of criteria such as Bayes factors or effect sizes. Here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. The signal detection measure of area under the curve (AUC) is a measure of discriminability, with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. Applied to criteria for statistical significance, it provides an estimate of the decision criterion’s performance in discriminating real effects from null effects. AUCs were high (M = .96, median = .97) for p values, suggesting merit in using p values to discriminate significant effects. AUCs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost with questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication at lower power. AUCs were also used to compare performance across p values, Bayes factors, and effect size (Cohen’s d). AUCs were equivalent for p values and Bayes factors and were slightly higher for effect size. Signal detection analysis provides separate measures of discriminability and bias. With respect to bias, the specific thresholds that produced maximal utility depended on sample size, although this dependency was particularly notable for p values and less so for Bayes factors.
The application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.
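
The idea of scoring p values by AUC can be illustrated with a minimal sketch. It is not the simulation reported here: the design (two-group t tests), the effect size (d = 0.5), the sample size (64 per group), the number of simulated studies, and the helper names simulate_p_values and auc_from_scores are all assumptions chosen for illustration. The AUC is computed as the probability that a randomly chosen real effect yields a smaller p value than a randomly chosen null effect.

```python
import numpy as np
from scipy import stats


def simulate_p_values(n_studies, n_per_group, effect_size, rng):
    """Simulate two-group studies and return one two-sided p value per study.

    effect_size is the true standardized mean difference (0.0 = null effect).
    """
    control = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    treatment = rng.normal(effect_size, 1.0, size=(n_studies, n_per_group))
    return stats.ttest_ind(treatment, control, axis=1).pvalue


def auc_from_scores(signal_scores, noise_scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a random signal score exceeds a random
    noise score, with ties counted as one half.
    """
    n_signal, n_noise = len(signal_scores), len(noise_scores)
    ranks = stats.rankdata(np.concatenate([signal_scores, noise_scores]))
    u = ranks[:n_signal].sum() - n_signal * (n_signal + 1) / 2
    return u / (n_signal * n_noise)


rng = np.random.default_rng(0)

# Assumed illustrative settings, not values taken from the paper.
p_null = simulate_p_values(2000, 64, 0.0, rng)    # null effects ("noise")
p_real = simulate_p_values(2000, 64, 0.5, rng)    # real effects ("signal")
p_null_2 = simulate_p_values(2000, 64, 0.0, rng)  # second null set, for a chance baseline

# Smaller p values indicate stronger evidence, so negate them to use as scores.
auc = auc_from_scores(-p_real, -p_null)
auc_chance = auc_from_scores(-p_null_2, -p_null)

print(f"AUC, real vs. null effects: {auc:.3f}")
print(f"AUC, null vs. null effects: {auc_chance:.3f}")
```

Because the AUC depends only on how the p values are ordered, it describes discriminability separately from any particular threshold such as .05 or .005. Comparing two sets of null effects gives a value near 0.5, the chance baseline.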
 
 

Highlights

  • The author would like to thank Anne Cleary, John Wixted, Mark Prince, Susan Wagner Cook, Mike Dodd, Art Glenberg, Jim Nairne, Jeremy Wolfe, and Ben Prytherch for useful discussions and feedback on an earlier draft

  • The signal detection theory measure of area under the curve (AUC) is offered as a tool to quantify the effectiveness of various measures of statistical effects

  • Type I errors and Type II errors are sometimes considered separately, with Type I errors being a function of the alpha level and Type II errors being a function of power


Summary

Colorado State University

What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. Various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. Recommended changes to how statistical significance is determined include using a stricter criterion (e.g., p < .005; Benjamin et al., 2017) and minimizing flexibility in decisions around data collection and analysis (e.g., Simmons, Nelson, & Simonsohn, 2011). These recommendations were designed to increase replicability by decreasing the false alarm rate, which is the rate at which null effects are incorrectly labeled as significant.
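
The trade-off behind a stricter criterion can be sketched in a few lines. The settings below (two-group t tests, d = 0.5, 50 participants per group, 5000 simulated studies of each kind) and the helper name simulate_p_values are assumptions for illustration, not the simulation parameters used in the paper. The sketch applies p < .05 and p < .005 to the same simulated studies and reports the false alarm rate (null effects labeled significant) alongside the miss rate (real effects labeled nonsignificant).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)


def simulate_p_values(n_studies, n_per_group, effect_size):
    """Return one two-sided t-test p value per simulated two-group study."""
    control = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    treatment = rng.normal(effect_size, 1.0, size=(n_studies, n_per_group))
    return stats.ttest_ind(treatment, control, axis=1).pvalue


# Assumed illustrative settings, not values taken from the paper.
p_null = simulate_p_values(5000, 50, 0.0)  # null effects
p_real = simulate_p_values(5000, 50, 0.5)  # real effects

rates = {}
for alpha in (0.05, 0.005):
    false_alarm_rate = float(np.mean(p_null < alpha))  # Type I error rate
    miss_rate = float(np.mean(p_real >= alpha))        # Type II error rate
    rates[alpha] = (false_alarm_rate, miss_rate)
    print(f"alpha = {alpha}: false alarms = {false_alarm_rate:.3f}, "
          f"misses = {miss_rate:.3f}")
```

Because both thresholds are applied to the same p values, moving from .05 to .005 can only lower the false alarm rate and can only raise the miss rate. This is why a criterion is better evaluated on both kinds of error than on false alarms alone.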

Correct Rejection
Questionable Research Practices
Bayes Factor Versus p values
No evidence
False Alarm
Including Priors
Bayes Factor and Bias
Discriminability with Effect Size
Conclusion
Findings
Open Science Practices
