Abstract

Introduction: The pace of emerging scientific literature far outstrips that of guideline updates, yet clinical practice may require early implementation after careful review. Recently developed large language model (LLM)-based systems are potential assistive tools for identifying novel clinical practices or research avenues as they emerge. Here we use GPT-3.5, an LLM developed by OpenAI, to evaluate three interventions for aneurysmal subarachnoid hemorrhage (aSAH), comparing its conclusions with those of a systematic review, the 2023 AHA Guideline on Management of aSAH.

Methods: Articles and their abstracts were identified for each of the terms “nimodipine” (N), “magnesium” (M), and “induced hypertension” (IH) via PubMed search, with results filtered by topic using the MeSH term “Subarachnoid Hemorrhage.” Review articles and meta-analyses were excluded. Using custom code calling the GPT-3.5 application programming interface, the LLM was asked to read each abstract, decide whether the intervention was discussed, and, if so, rate it as effective, neutral, or ineffective, corresponding to scores of 1, 0, or -1.

Results: Mean efficacy scores for N, M, and IH were 0.67, 0.32, and 0.15 after automated review of 136, 47, and 33 abstracts, respectively. The corresponding guidelines report strong benefit (Class 1) for N, no benefit (Class 3) for M, weak benefit (Class 2b) for therapeutic use of IH, and harm (Class 3) for prophylactic use of IH. While N had consistently positive automated scores, manual review of these abstracts showed that it was often used as a standard-of-care comparator (possible confirmation bias). Although M scored consistently poorly in clinical trials, the mostly positive case reports and retrospective studies raised its score (possible publication bias). IH combined all of these aspects, with case reports of harm and mixed trial scores, yet also widespread use as standard of care.
Conclusion: Although GPT-3.5's results roughly parallel human-curated ones, this method is not meant to replace the rigorous process of systematic review, but could potentially assist as a “first-pass” evaluator of new literature in a real-time, continuous manner to improve efficiency. Optimizing LLM prompts may improve specificity, and weighting articles by type or perceived quality may improve accuracy.
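The abstract-scoring pipeline described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' published code: the model identifier, prompt wording, and response parsing are all assumptions, and only the rating-to-score mapping (effective = 1, neutral = 0, ineffective = -1) and the mean efficacy score come directly from the abstract.

```python
# Hedged sketch of the described pipeline: rate each abstract via an LLM,
# map ratings to scores, and average them. Prompt text, model name, and
# parsing are hypothetical; only the scoring scheme is from the paper.

# Mapping stated in the Methods: effective/neutral/ineffective -> 1/0/-1.
RATING_TO_SCORE = {"effective": 1, "neutral": 0, "ineffective": -1}


def rate_abstract(abstract: str, intervention: str):
    """Ask the model to rate one abstract; return 1/0/-1, or None if the
    intervention is not discussed or the reply cannot be parsed."""
    # Import here so the scoring helpers below work without the SDK installed.
    from openai import OpenAI  # assumes the openai Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        f"Read the abstract below. Is the intervention '{intervention}' "
        "discussed? If not, answer 'not discussed'. If it is, rate it with "
        "exactly one word: effective, neutral, or ineffective.\n\n" + abstract
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-leaning output for parsing
    )
    answer = resp.choices[0].message.content.strip().lower()
    return RATING_TO_SCORE.get(answer)  # None when not discussed/unparseable


def mean_efficacy(scores):
    """Mean of per-abstract scores; lands in [-1, 1]."""
    return sum(scores) / len(scores)
```

For example, two “effective” ratings and one “neutral” rating give `mean_efficacy([1, 1, 0]) ≈ 0.67`, the same scale on which the reported N, M, and IH scores (0.67, 0.32, 0.15) are expressed.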
