Abstract

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

Highlights

  • This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 to 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period.

  • If the reported p-value is inconsistent with the p-value recalculated from the test statistic and degrees of freedom, and the inconsistency changes the statistical conclusion, the result is marked as a gross inconsistency.

  • In this paper we investigated the prevalence of reporting errors in eight major journals in psychology using the automated R package statcheck (Epskamp & Nuijten, 2015).
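The distinction the highlights draw between an ordinary inconsistency and a gross inconsistency can be sketched as a simple classification rule. This is a minimal illustration, not statcheck's actual implementation; the function name, the alpha = .05 threshold, and the rounding tolerance are assumptions for the example.

```python
ALPHA = 0.05
TOLERANCE = 0.0005  # allow for p-values rounded to three decimals

def classify(reported_p: float, computed_p: float,
             alpha: float = ALPHA, tol: float = TOLERANCE) -> str:
    """Compare a reported p-value with one recomputed from the test
    statistic and degrees of freedom, and label the result."""
    if abs(reported_p - computed_p) <= tol:
        return "consistent"
    # Inconsistent: does the mismatch also flip the significance decision?
    if (reported_p < alpha) != (computed_p < alpha):
        return "gross inconsistency"
    return "inconsistency"

print(classify(0.036, 0.0364))  # rounding only -> consistent
print(classify(0.04, 0.06))     # flips the conclusion -> gross inconsistency
print(classify(0.20, 0.30))     # wrong, but same conclusion -> inconsistency
```

The key design point mirrors the definition above: an error only counts as gross when it moves the p-value across the significance threshold, changing the statistical conclusion.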


Summary

Introduction

There is evidence that many reported p-values do not match their accompanying test statistic and degrees of freedom (Bakker & Wicherts, 2011; Bakker & Wicherts, 2014; Berle & Starcevic, 2007; Caperos & Pardo, 2013; Garcia-Berthou & Alcaraz, 2004; Veldkamp, Nuijten, Dominguez-Alvarez, Van Assen, & Wicherts, 2014; Wicherts, Bakker, & Molenaar, 2011). These studies highlighted that roughly half of all published empirical psychology articles using NHST contained at least one inconsistent p-value and that around one in seven articles contained a gross inconsistency, in which the reported p-value was significant and the computed p-value was not, or vice versa. Contrary to many other questionable research practices (QRPs) in John et al.’s list, misreported p-values that bear on significance can be readily detected on the basis of the articles’ text.
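Detection "on the basis of the articles' text" works because NHST results in psychology follow a standardized APA reporting format that a pattern matcher can pick out. The regex and field names below are illustrative assumptions for a t-test only; statcheck's real parser handles more test types (F, chi-square, r, z) and more notation variants.

```python
import re

# Matches APA-style t-test reports, e.g. "t(28) = 2.20, p = .036"
# or "t(28) = 2.20, p < .05"
NHST_PATTERN = re.compile(
    r"t\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<stat>-?\d+\.\d+)\s*,"
    r"\s*p\s*(?P<comparison>[<>=])\s*(?P<p>\.\d+)"
)

def extract_t_tests(text: str) -> list[dict]:
    """Pull every APA-style t-test report out of a chunk of article text."""
    return [m.groupdict() for m in NHST_PATTERN.finditer(text)]

sample = "The effect was significant, t(28) = 2.20, p = .036, as predicted."
print(extract_t_tests(sample))
# [{'df': '28', 'stat': '2.20', 'comparison': '=', 'p': '.036'}]
```

Once the degrees of freedom and test statistic are captured, the p-value can be recomputed from the relevant distribution and compared against the reported value, which is what enables automated checks at the scale of 250,000 p-values.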

