Abstract

The need for replication of initial results has been rediscovered only recently in many fields of research. In preclinical biomedical research, it is common practice to conduct exact replications with the same sample sizes as those used in the initial experiments. Such replication attempts, however, have a lower probability of success than is generally appreciated. Indeed, in the common scenario of an effect just reaching statistical significance, the statistical power of the replication experiment assuming the same effect size is approximately 50%—in essence, a coin toss. Accordingly, we use the provocative analogy of “replicating” a neuroprotective drug animal study with a coin flip to highlight the need for larger sample sizes in replication experiments. Additionally, we provide detailed background for the probability of obtaining a significant p value in a replication experiment and discuss the variability of p values as well as pitfalls of simple binary significance testing in both initial preclinical experiments and replication studies with small sample sizes. We conclude that power analysis for determining the sample size for a replication study is obligatory within the currently dominant hypothesis testing framework. Moreover, publications should include effect size point estimates and corresponding measures of precision, e.g., confidence intervals, to allow readers to assess the magnitude and direction of reported effects and to later combine the results of the initial and replication studies through Bayesian or meta-analytic approaches.

Highlights

  • “Non-reproducible single occurrences are of no significance to science.” [1]

  • Replication of results has been considered an integral part of the scientific process, at least since Karl Popper’s famous declaration [2], and has again taken center stage in discussions about current research and publication practices.

  • The valproic acid (VPA)-treated group displayed significantly lower infarct volumes (−37%) compared with the vehicle-treated group (mean: 39.4 mm3, standard deviation [SD]: 27.6 mm3 versus 63.6 mm3, SD: 22.7 mm3; n = 10 per group; mean difference: 24.2 mm3 with 95% confidence interval [CI]: 0.3–48.0 mm3; standardized effect size of 0.96; t = 2.136; p = 0.047; see S1 Fig).
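The summary statistics in this bullet can be checked directly. The sketch below (an illustrative recomputation, not the authors' code) derives the t statistic, standardized effect size, and 95% CI from the reported group means, SDs, and n = 10 per group; small discrepancies versus the published values (t = 2.136, CI 0.3–48.0) arise from rounding in the reported summaries:

```python
import math

# Reported summary statistics (infarct volume, mm^3), n = 10 per group
mean_vpa, sd_vpa = 39.4, 27.6
mean_veh, sd_veh = 63.6, 22.7
n = 10

# Pooled SD and standard error of the mean difference (equal group sizes)
pooled_sd = math.sqrt((sd_vpa**2 + sd_veh**2) / 2)
se_diff = pooled_sd * math.sqrt(2 / n)

diff = mean_veh - mean_vpa      # mean difference: 24.2 mm^3
t = diff / se_diff              # ~2.14 (reported: 2.136)
d = diff / pooled_sd            # standardized effect size ~0.96

# 95% CI using the two-sided critical t for df = 18 (t_crit ~ 2.101)
t_crit = 2.101
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(round(t, 2), round(d, 2), [round(x, 1) for x in ci])
# -> 2.14 0.96 [0.5, 47.9]
```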


Introduction

“Non-reproducible single occurrences are of no significance to science.” [1]. In modern times, replication of results has been considered an integral part of the scientific process, at least since Karl Popper’s famous declaration [2], and has again taken center stage in discussions about current research and publication practices. Psychology was the first field to attempt large-scale replications of key research findings [3,4,5], with discouraging results. We use an empirical example from our own research to highlight the generally low statistical power of same sample-size exact replications, with emphasis on the common scenario of a barely significant initial finding. To this end, we conduct a coin flip experiment in an attempt to “replicate” an animal experiment that found a small neuroprotective effect of valproic acid (VPA).
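The roughly 50% power of a same sample-size replication of a barely significant result can be illustrated with a short Monte Carlo sketch (an illustrative simulation under assumed parameters, not the authors' analysis). Assume the replication's true standardized effect equals the initially observed d ≈ 0.96, redraw two groups of n = 10 from normal distributions, and count how often a two-sample t-test again reaches p < 0.05 (the two-sided critical t for df = 18 is about 2.101):

```python
import math
import random

random.seed(1)

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance, equal group sizes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (n - 1)
    return (mx - my) / math.sqrt((vx + vy) / 2 * 2 / n)

N, REPS, D = 10, 20000, 0.96   # group size, simulations, assumed true effect
T_CRIT = 2.101                 # two-sided critical t, df = 18, alpha = 0.05

hits = sum(
    abs(pooled_t([random.gauss(D, 1) for _ in range(N)],
                 [random.gauss(0, 1) for _ in range(N)])) > T_CRIT
    for _ in range(REPS)
)
power = hits / REPS
print(f"Estimated replication power: {power:.2f}")  # roughly 0.5
```

Even though the initial experiment was "significant," an exact replication with the same n is about as likely to fail as to succeed under these assumptions.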

