Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews.

Kentaro Matsui, Tomohiro Utsumi, Yumi Aoki, Taku Maruki, Masahiro Takeshima, Yoshikazu Takaesu

doi:10.2196/52758

Open Access

https://doi.org/10.2196/52758

Journal: Journal of Medical Internet Research
Publication Date: Aug 16, 2024
Citations: 2
License type: CC BY

Abstract

The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records.

We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across 3 layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were judged as included.

On each layer, both GPT-3.5 and GPT-4 processed about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both GPT-3.5 and GPT-4 judged all 9 records used for the meta-analysis as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.

Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
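
The decision rule described in the abstract (a record is included only if it meets the criteria at all 3 layers) and the sensitivity/specificity figures can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Record` class, the `screen` and `sensitivity_specificity` functions, the keyword-based `toy_judge`, and the 4 toy records are all hypothetical. In the study the per-layer judgment was made by GPT-3.5 or GPT-4 with study-specific prompts; here a trivial keyword check stands in for that call.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple

# The three layers named in the abstract.
LAYERS = ("research design", "target patients", "interventions and controls")

# Hypothetical judge signature: (layer name, title+abstract text) -> True when
# the record meets that layer's inclusion criteria. An LLM call would go here.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str  # title and abstract


def screen(record: Record, judge: Judge) -> bool:
    """Include a record only if it passes every layer.

    all() stops at the first failing layer; that short-circuit is a choice
    made for this sketch, not something the abstract specifies.
    """
    return all(judge(layer, record.text) for layer in LAYERS)


def sensitivity_specificity(
    included: Set[str], relevant: Set[str], all_ids: Set[str]
) -> Tuple[float, float]:
    """Sensitivity = TP / relevant, specificity = TN / irrelevant.

    Assumes both the relevant and the irrelevant sets are non-empty.
    """
    irrelevant = all_ids - relevant
    true_pos = len(included & relevant)
    true_neg = len(irrelevant - included)
    return true_pos / len(relevant), true_neg / len(irrelevant)


def toy_judge(layer: str, text: str) -> bool:
    """Stand-in for the LLM: one invented keyword per layer."""
    keywords = {
        "research design": "randomized",
        "target patients": "bipolar",
        "interventions and controls": "placebo",
    }
    return keywords[layer] in text.lower()


# Invented records and labels, for illustration only (not study data).
records = [
    Record("r1", "Randomized placebo-controlled trial in bipolar disorder"),
    Record("r2", "Case report of a patient with bipolar disorder"),
    Record("r3", "Randomized trial of add-on therapy in bipolar depression"),
    Record("r4", "Randomized placebo-controlled trial in schizophrenia"),
]
relevant = {"r1", "r3"}  # hypothetical human reference labels

included = {r.record_id for r in records if screen(r, toy_judge)}
sens, spec = sensitivity_specificity(
    included, relevant, {r.record_id for r in records}
)
print(sorted(included), sens, spec)
```

In this toy run the record `r3` is relevant but fails the third layer, so it is wrongly excluded and lowers sensitivity. This is the kind of error a sensitivity-focused screening method aims to minimize.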
