Abstract

P values linked to null hypothesis significance testing (NHST) are the most widely (mis)used method of statistical inference. Empirical data suggest that across the biomedical literature (1990–2015), when abstracts use P values, 96% of them report P values of 0.05 or less. The same percentage (96%) applies to full-text articles. In a sample of 100 articles from PubMed, 55 report P values, while only 4 present confidence intervals for all reported effect sizes; none use Bayesian methods and none use false discovery rates. Over 25 years (1990–2015), the use of P values in abstracts has doubled for all of PubMed and tripled for meta-analyses, while for some study designs, such as randomized trials, the majority of abstracts report P values. There is major selective reporting of P values. Abstracts tend to highlight the most favorable P values, and stated inferences add further spin to reach exaggerated, unreliable conclusions. The availability of large-scale data on P values from many papers has allowed the development and application of methods that try to detect and model selection biases, for example p-hacking, that cause patterns of excess significance. Inferences need to be cautious, as they depend on the assumptions made by these models and can be affected by the presence of other biases (e.g., confounding in observational studies). While much of the unreliability of past and present research is driven by small, underpowered studies, NHST with P values may also be particularly problematic in the era of overpowered big data. NHST and P values are optimal only in a minority of current research. Using a more stringent threshold, as in the recently proposed shift from P < 0.05 to P < 0.005, is a temporizing measure to contain the flood and death-by-significance. NHST and P values may be replaced in many fields by other, more fit-for-purpose inferential methods. However, curtailing selection biases requires additional measures beyond changes in inferential methods, in particular reproducible research practices.
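The selective-reporting pattern described above is easy to reproduce in simulation. The sketch below is illustrative only; the effect size, sample size, and study count are invented, not drawn from the paper's data. It runs many small, underpowered studies and "publishes" only those reaching P ≤ 0.05: nearly every published P value is then significant by construction, and the published effect sizes are exaggerated.

```python
# Illustrative simulation of selective reporting (parameters are invented).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n_per_study, n_studies = 0.2, 25, 10_000  # small, underpowered studies

published_p, published_effects = [], []
for _ in range(n_studies):
    sample = rng.normal(true_effect, 1.0, n_per_study)
    t, p = stats.ttest_1samp(sample, 0.0)
    if p <= 0.05:  # selection: only "significant" results are reported
        published_p.append(p)
        published_effects.append(sample.mean())

print(f"studies reaching significance: {len(published_p) / n_studies:.0%}")
print(f"true effect: {true_effect:.2f}; mean published effect: "
      f"{np.mean(published_effects):.2f}")  # inflated relative to 0.2
```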

Highlights

  • Null hypothesis significance testing (NHST) and P value thresholds such as 0.05 have long been a mainstay of empirical work in the sciences

  • Null hypothesis significance testing (NHST) coupled with the use of P value thresholds dominates most fields in the biomedical and life sciences, social sciences, and physical sciences

  • We consider only broad-brush changes: greater use of effect sizes and confidence intervals, methods based on false discovery rates, Bayesian methods, and more stringent thresholds for declaring a result significant (a minimal sketch of the false-discovery-rate idea follows this list)
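The false-discovery-rate approach named in the last highlight can be made concrete with a short sketch. This is an illustrative Python implementation of the standard Benjamini-Hochberg step-up procedure, run on invented P values; it is not code or data from the paper.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask marking discoveries at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q,
    # then reject the k smallest P values.
    thresholds = (np.arange(1, m + 1) / m) * q
    passed = np.nonzero(p[order] <= thresholds)[0]
    keep = np.zeros(m, dtype=bool)
    if passed.size:
        keep[order[: passed.max() + 1]] = True
    return keep

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
mask = benjamini_hochberg(pvals, q=0.05)
print("naive P < 0.05 rejections:", sum(p < 0.05 for p in pvals))  # 5
print("BH discoveries at FDR 0.05:", int(mask.sum()))              # 2
```

Against a flat P < 0.05 rule, the step-up rule rejects only hypotheses whose ordered P values stay below a rank-scaled cutoff, which is what keeps the expected share of false discoveries near q.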

Introduction

Null hypothesis significance testing (NHST) and P value thresholds such as 0.05 have long been a mainstay of empirical work in the sciences. This paper, based on an invited plenary address to a recent ASA-sponsored workshop on statistical inference, summarizes recent empirical work on the use and misuse of P values and places in context what we have learnt toward resolving this conundrum. The topics covered include: (4.1) alternative approaches to inference (effect sizes and confidence intervals, Bayesian methods, and changing the P value threshold); (4.2) attempts to model the selection process (the P value curve and meta-analysis of publication selection); (4.3) examples of alternatives based on context and goals; and (4.4) how reproducible research practices might offer the best solution.
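As an aside on the threshold change mentioned above, the well-known Sellke-Bayarri-Berger bound makes it easy to see how little evidence a P value near 0.05 can carry. The sketch below is a standard textbook calculation, not a computation from this paper.

```python
# Upper bound on the evidence against H0 implied by a two-sided P value,
# using the Sellke-Bayarri-Berger bound 1 / (-e * p * ln p), valid for p < 1/e.
import math

def max_bf_against_null(p: float) -> float:
    """Maximum Bayes factor against H0 that a P value p < 1/e can support."""
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005):
    print(f"P = {p}: odds against H0 are at most {max_bf_against_null(p):.1f}:1")
# P = 0.05 supports at most ~2.5:1 odds against the null, which is why a
# shift to P < 0.005 (at most ~14:1) has been proposed as a stopgap.
```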

Empirical Results
Selection Effects
Layers of Selection for P Values
Selection Within Sections of a Paper
Cherry Picking in the More Competitive Basic Science Journals
Alternative Approaches to Inference and Complements to P Values
Specific Examples Based on Context and Goals
Attempts to Model the Selection Process
Reproducible Research is Key to Addressing Selection Effects Head-On
Concluding Thoughts