Abstract
Background
Systematic reviews sit at the top of the evidence pyramid, but the process is labor-intensive, especially the title and abstract screening of retrieved records. Large language models (LLMs), such as ChatGPT, are proficient in natural language processing tasks and could accelerate this phase. This study assesses ChatGPT’s performance in screening records for an already published systematic review (doi: 10.1080/21645515.2023.2300848).
Methods
A total of 1601 records with title and abstract were evaluated with ChatGPT 4 using two different prompts: one asking the model to classify each record as included or excluded (Prompt A), and one asking the model to rate each record from 1 to 5 based on inclusion confidence (Prompt B). Performance metrics were then computed from the model’s outputs.
Results
The review included 64 records after title and abstract screening, of which 18 remained after full-text screening. Using records included by title and abstract as the reference, Prompt B with a rating cut-off of 3 provided 82% sensitivity, 88% specificity and 99% negative predictive value (NPV). Using articles included by full text as the reference, it provided 100% sensitivity, 86% specificity and 100% NPV. An 85% workload saving was reached. Prompt A showed higher workload savings (∼93%), high NPVs (∼100%) and high specificity (∼96%), but lower sensitivity (62-72%).
Conclusions
Compared with binary classification prompts, prompts with a rating cut-off of 3 achieved better performance while still delivering relevant workload savings. These findings can inform prompt-engineering strategies for instructing LLMs to reach both high sensitivity and workload savings.
Key messages
• Systematic reviews provide evidence synthesis to guide policy and decision-making in healthcare and public health. The process can be time-consuming, while the need for timely evidence is often pressing.
• LLMs can help reduce workload and speed up the title and abstract screening phase of a systematic review, which is considered a major bottleneck in the review process.
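The metrics reported above follow from a standard confusion matrix once the 1-to-5 ratings are thresholded. The sketch below is a minimal illustration, not the authors' code. It assumes that a record counts as included by the model when its rating is at or above the cut-off, and that workload saving is the share of records the model excludes. The function name `evaluate_ratings` and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float
    specificity: float
    npv: float
    workload_saving: float


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def evaluate_ratings(ratings, reference, cutoff=3):
    """Compare LLM ratings (1-5) with the reviewers' screening decisions.

    ratings   -- one integer rating per record
    reference -- one boolean per record, True if reviewers included it
    cutoff    -- ASSUMPTION: rating >= cutoff means "included by the model"
    """
    if len(ratings) != len(reference):
        raise ValueError("ratings and reference must have the same length")

    tp = fp = tn = fn = 0
    for rating, included in zip(ratings, reference):
        predicted = rating >= cutoff
        if predicted and included:
            tp += 1
        elif predicted and not included:
            fp += 1
        elif not predicted and included:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    return ScreeningMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        npv=_ratio(tn, tn + fn),
        # ASSUMPTION: saving = fraction of records the model screens out.
        workload_saving=_ratio(tn + fn, total),
    )


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    toy_ratings = [5, 4, 3, 2, 1, 1, 2, 3, 1, 2]
    toy_reference = [True, True, False, True, False,
                     False, False, False, False, False]
    print(evaluate_ratings(toy_ratings, toy_reference, cutoff=3))
```

Raising the cut-off in such a sketch excludes more records, which increases workload saving and specificity at the cost of sensitivity. That is the same trade-off the abstract reports between Prompt B and the binary Prompt A.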