Performance Of Language Model Research Articles

Large language models (LLMs) are promising as tools for citation screening in systematic reviews. However, their applicability has not yet been determined. This study aimed to evaluate the accuracy and efficiency of an LLM in title and abstract literature screening. This prospective diagnostic study used the data from the title and abstract screening process for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided to include or exclude citations based on the inclusion and exclusion criteria in terms of patient, population, problem; intervention; comparison; and study design of the selected CQ and was compared with the conventional method for title and abstract screening. This study was conducted from January 7 to 15, 2024. The exposure was LLM (GPT-4 Turbo)-assisted citation screening or the conventional method. The sensitivity and specificity of the LLM-assisted screening process were calculated, with the full-text screening result using the conventional method set as the reference standard in the primary analysis. Pooled sensitivity and specificity were also estimated, and screening times of the 2 methods were compared. In the conventional citation screening process, 8 of 5634 publications in CQ 1, 4 of 3418 in CQ 2, 4 of 1038 in CQ 3, 17 of 4326 in CQ 4, and 8 of 2253 in CQ 5 were selected. In the primary analysis of 5 CQs, LLM-assisted citation screening demonstrated an integrated sensitivity of 0.75 (95% CI, 0.43 to 0.92) and specificity of 0.99 (95% CI, 0.99 to 0.99). Post hoc modifications to the command prompt improved the integrated sensitivity to 0.91 (95% CI, 0.77 to 0.97) without substantially compromising specificity (0.98 [95% CI, 0.96 to 0.99]). Additionally, LLM-assisted screening was associated with reduced time for processing 100 studies (1.3 minutes vs 17.2 minutes for conventional screening methods; mean difference, -15.25 minutes [95% CI, -17.70 to -12.79 minutes]).
In this prospective diagnostic study investigating the performance of LLM-assisted citation screening, the model demonstrated acceptable sensitivity and reasonably high specificity with reduced processing time. This novel method could potentially enhance efficiency and reduce workload in systematic reviews.
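The sensitivity and specificity reported above follow from a 2x2 comparison of the LLM's decisions against the reference standard. The following is a minimal sketch of that arithmetic only, not the study's analysis code. The class and function names are hypothetical, and the example counts are illustrative: the abstract gives 8 of 5634 publications selected for CQ 1 but does not report how the LLM's decisions split per CQ, so the split used here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """2x2 counts of LLM decisions against the reference standard (hypothetical type)."""
    true_positive: int   # relevant citations the LLM included
    false_negative: int  # relevant citations the LLM excluded
    true_negative: int   # irrelevant citations the LLM excluded
    false_positive: int  # irrelevant citations the LLM included


def sensitivity(counts: ScreeningCounts) -> float:
    """Share of reference-standard inclusions that the LLM also included."""
    relevant = counts.true_positive + counts.false_negative
    if relevant == 0:
        raise ValueError("no relevant citations in the reference standard")
    return counts.true_positive / relevant


def specificity(counts: ScreeningCounts) -> float:
    """Share of reference-standard exclusions that the LLM also excluded."""
    irrelevant = counts.true_negative + counts.false_positive
    if irrelevant == 0:
        raise ValueError("no irrelevant citations in the reference standard")
    return counts.true_negative / irrelevant


if __name__ == "__main__":
    # Illustrative counts only (NOT from the study): 8 relevant citations out of
    # 5634, with an assumed split of the LLM's decisions.
    example = ScreeningCounts(
        true_positive=6, false_negative=2,
        true_negative=5570, false_positive=56,
    )
    print(f"sensitivity = {sensitivity(example):.2f}")
    print(f"specificity = {specificity(example):.2f}")
```

Because relevant citations are so rare in screening (a handful out of thousands), sensitivity rests on very few records, which is consistent with the wide confidence interval the abstract reports for it.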

Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. This study aimed to evaluate the accuracy and safety of LLM answers on medical oncology examination questions. This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
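The accuracy figures above are proportions with 95% confidence intervals, such as 125 of 147 correct answers. The abstract does not say which interval method was used. The sketch below assumes an exact (Clopper-Pearson) binomial interval, solved by bisection using only the standard library. The function names are hypothetical, and this illustrates one common method rather than the authors' analysis.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(func, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Find p in [lo, hi] with func(p) = 0, assuming func is increasing in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


if __name__ == "__main__":
    correct, total = 125, 147  # counts reported in the abstract
    low, high = clopper_pearson(correct, total)
    print(f"accuracy = {correct / total:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Other interval methods (for example Wilson or normal approximation) give slightly different bounds, so small differences from the published intervals would not by themselves indicate an error.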

Articles published on Performance Of Language Model

Comparison of Performance of Large Language Models on Lung-RADS Related Questions.

Performance and Biases of Large Language Models in Public Opinion Simulation

Clinical application potential of large language model: a study based on thyroid nodules.

CardioCanon: A Customised Chatbot for Cardiology Inquiry With Retrieval Augmented Generation to Reduce Hallucinations and Improve Performance of Large Language Models

Theory of mind performance of large language models: A comparative analysis of Turkish and English

Integrating chemistry knowledge in large language models via prompt engineering

Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems.

Performance of a Large Language Model in Screening Citations

Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning

KnowledgeNavigator: leveraging large language models for enhanced reasoning over knowledge graph

Fine-Tuned Large Language Model for Extracting Patients on Pretreatment for Lung Cancer from a Picture Archiving and Communication System Based on Radiological Reports.

Contrastive learning based on linguistic knowledge and adaptive augmentation for text classification

Project-specific code summarization with in-context learning

Language models, like humans, show content effects on reasoning tasks.

Comparative Analysis of Performance of Large Language Models in Urogynecology.

The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries.

Assessing the performance of large language models in literature screening for pharmacovigilance: a comparative study

Performance of Large Language Models on Medical Oncology Examination Questions

Fuzzing JavaScript engines with a syntax-aware neural program model

BioInstruct: instruction tuning of large language models for biomedical natural language processing.
