Abstract

Automated approaches to improve the efficiency of systematic reviews are greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human reviewers. Yet, human reviewers make errors in the inclusion and exclusion of references. Our objective was to determine citation false inclusion and false exclusion rates during abstract screening by pairs of independent reviewers. These rates can help in designing, testing, and implementing automated approaches. We identified all systematic reviews conducted between 2010 and 2017 by an evidence-based practice center in the United States. Eligible reviews had to follow standard systematic review procedures with dual independent screening of abstracts and full texts, in which citation inclusion by one reviewer prompted automatic inclusion through the next level of screening. Disagreements between reviewers during full-text screening were reconciled via consensus or arbitration by a third reviewer. A false inclusion or exclusion was defined as a decision made by a single reviewer that was inconsistent with the final included list of studies. We analyzed a total of 139,467 citations that underwent 329,332 inclusion and exclusion decisions from 86 unique reviewers. The final systematic reviews included 5.48% of the potential references identified through bibliographic database searches (95% confidence interval (CI): 2.38% to 8.58%). After abstract screening, the total error rate (false inclusion plus false exclusion) was 10.76% (95% CI: 7.43% to 14.09%). This study suggests substantial false inclusion and exclusion rates by human reviewers. When evaluating the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect, and that achieving error rates similar to those of humans may be adequate and can save resources and time.
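
To make the error definition above concrete, the sketch below shows how such rates could be computed against a final included-study list. This is a minimal illustration with hypothetical data, function names, and identifiers; it is not the paper's actual analysis code.

    # Hedged sketch: scoring single-reviewer abstract-screening decisions
    # against the final included-study list. All data are hypothetical.

    def error_rates(decisions, final_included):
        """decisions: list of (citation_id, reviewer_included) pairs, one per
        reviewer per citation; final_included: set of citation ids that
        appear in the finished review. Returns (false inclusion rate,
        false exclusion rate) over all single-reviewer decisions."""
        false_incl = sum(1 for cid, inc in decisions
                         if inc and cid not in final_included)
        false_excl = sum(1 for cid, inc in decisions
                         if not inc and cid in final_included)
        total = len(decisions)
        return false_incl / total, false_excl / total

    # Toy example: citations 1 and 2 end up in the final review.
    decisions = [(1, True), (1, False),   # reviewer 2 falsely excludes 1
                 (2, True), (2, True),
                 (3, True), (3, False)]   # reviewer 1 falsely includes 3
    fi, fe = error_rates(decisions, final_included={1, 2})
    print(f"false inclusion: {fi:.2%}, false exclusion: {fe:.2%}")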

Highlights

  • Systematic review is a process to identify, select, synthesize and appraise all empirical evidence that fits pre-specified criteria to answer a specific research question

  • The total error rate (false inclusion plus false exclusion) after abstract screening was 10.76%

  • This study suggests substantial false inclusion and exclusion rates by human reviewers

Introduction

Systematic review is a process to identify, select, synthesize, and appraise all empirical evidence that fits pre-specified criteria to answer a specific research question. It is estimated that annual publications of systematic reviews increased 2,728%, from 1,024 in 1991 to 28,959 in 2014.[2] Despite the surging number of published systematic reviews in recent years, many employ suboptimal methodological approaches.[2,3,4] Rigorous systematic reviews require strict procedures with at least eight time-consuming steps.[5,6] Significant time and resources are needed, with an estimated 0.9 minutes, 7 minutes, and 53 minutes spent per reference per reviewer on abstract screening, full-text screening, and data extraction, respectively.[7,8] A review with one thousand potential studies retrieved from the literature search required 952 hours to complete.[9] Methods to improve the efficiency of systematic reviews without jeopardizing their validity, including automated approaches, are therefore greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human reviewers. Yet, human reviewers make errors in the inclusion and exclusion of references.
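
As a rough illustration of how these per-reference times add up, the Python sketch below estimates the dual-review screening workload for a given search yield. The pass-rate fractions are assumptions introduced here for illustration, and the 952-hour figure in [9] covers the entire review process, not only the steps modeled below.

    # Hypothetical workload estimate built from the per-reference times
    # cited above. Pass-rate fractions are illustrative assumptions,
    # not figures reported in the paper.
    ABSTRACT_MIN = 0.9   # minutes per reference per reviewer (abstract screening)
    FULLTEXT_MIN = 7.0   # minutes per reference per reviewer (full-text screening)
    EXTRACT_MIN = 53.0   # minutes per included study (data extraction)

    def screening_hours(n_refs, fulltext_frac=0.1, included_frac=0.05, reviewers=2):
        """Rough dual-review workload in hours for n_refs retrieved references;
        fulltext_frac and included_frac are assumed pass rates."""
        minutes = (n_refs * ABSTRACT_MIN * reviewers
                   + n_refs * fulltext_frac * FULLTEXT_MIN * reviewers
                   + n_refs * included_frac * EXTRACT_MIN)
        return minutes / 60

    print(f"{screening_hours(1000):.0f} hours")  # prints '98 hours' for 1,000 references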
