Abstract

To assess the accuracy of principal investigators' (PIs) predictions about three events for their own clinical trials: positivity on trial primary outcomes, successful recruitment, and timely trial completion. A short electronic survey was used to elicit subjective probabilities within seven months of trial registration. When trial results became available, prediction skill was calculated using Brier scores (BS) and compared against uninformative prediction (i.e., predicting 50% all of the time). 740 PIs returned surveys (16.7% response rate). Predictions for all three events tended to exceed observed event frequencies. Mean PI skill did not surpass that of uninformative prediction (BS = 0.25) for primary outcomes (BS = 0.25, 95% CI 0.20, 0.30) and was significantly worse for recruitment and timeline predictions (BS = 0.38, 95% CI 0.33, 0.42; BS = 0.52, 95% CI 0.50, 0.55, respectively). PIs showed poor calibration for primary outcome, recruitment, and timeline predictions (calibration index = 0.064, 0.150, and 0.406, respectively), modest discrimination in primary outcome predictions (AUC = 0.76, 95% CI 0.65, 0.85), but minimal discrimination for the other two outcomes (AUC = 0.64, 95% CI 0.57, 0.70; and AUC = 0.55, 95% CI 0.47, 0.62, respectively). PIs were overconfident in favorable outcomes and showed limited skill in predicting scientific or operational outcomes for their own trials, though they retained a modest ability to discriminate between positive and non-positive trial outcomes. Low survey response rates may limit generalizability.
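The Brier score used above is the mean squared difference between forecast probabilities and binary outcomes, so an uninformative forecaster who always predicts 50% scores exactly 0.25 regardless of what happens. A minimal sketch with made-up forecast numbers (purely illustrative, not data from the study):

```python
def brier_score(forecasts, outcomes):
    """Mean of (p - o)^2 over forecast/outcome pairs.

    forecasts: predicted probabilities in [0, 1]
    outcomes:  observed results, 1 = event occurred, 0 = did not
    Lower scores indicate better prediction skill.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical PI forecasts for five trials (probability of a
# positive primary outcome) and the observed results.
forecasts = [0.8, 0.7, 0.9, 0.6, 0.75]
outcomes = [1, 0, 1, 0, 0]
print(brier_score(forecasts, outcomes))

# A constant 50% forecast gives (0.5 - o)^2 = 0.25 for every trial,
# whatever the outcome, hence the BS = 0.25 benchmark.
print(brier_score([0.5] * 5, outcomes))  # → 0.25
```

A score below 0.25 therefore indicates skill beyond chance, which is why the PIs' recruitment (0.38) and timeline (0.52) scores are worse than guessing.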

Highlights

  • Clinical trials aim to generate evidence for clinical decision-making

  • Mean principal investigators’ (PIs) skill did not surpass that of uninformative prediction (Brier score (BS) = 0.25) for primary outcomes (BS = 0.25, 95% CI 0.20, 0.30) and was significantly worse for recruitment and timeline predictions (BS = 0.38, 95% CI 0.33, 0.42; BS = 0.52, 95% CI 0.50, 0.55, respectively)

  • PIs showed poor calibration for primary outcome, recruitment, and timelines, modest discrimination in primary outcome predictions (AUC = 0.76, 95% CI 0.65, 0.85) but minimal discrimination in the other two outcomes (AUC = 0.64, 95% CI 0.57, 0.70; and 0.55, 95% CI 0.47, 0.62, respectively)



Introduction

Clinical trials aim to generate evidence for clinical decision-making. Many trials fail to generate clinically relevant information because they posit a poorly justified hypothesis, deploy suboptimal design and reporting, or founder on operational issues like recruitment [1]. Such failures have many causes; one likely factor is that investigators misjudge the viability of their clinical hypotheses or operational aspects of the trial itself. The recurrent use of unrealistically large effect sizes in power calculations is sometimes invoked to suggest that investigators harbor excess optimism about clinical hypotheses [8, 9]. The resulting underpowered designs might instead reflect investigators’ realism about their inability to recruit enough patients for larger studies aimed at detecting smaller effects.

Methods
Results
Conclusion
