Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.

Seyma Handan Akyon, Fatih Cagatay Akyon, Ahmet Sefa Camyar, Fatih Hızlı, Talha Sari, Şamil Hızlı

doi:10.2196/59258

Open Access

https://doi.org/10.2196/59258

Journal: JMIR Medical Informatics
Publication Date: Sep 4, 2024
Citations: 1
License type: CC BY

Abstract

Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed.

This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies.

This is a methodological study evaluating how well new generative artificial intelligence tools understand medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed the LLMs' understanding of different sections of a research paper.

The LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. The LLMs performed differently on each question across different parts of a scholarly paper, with certain models such as PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding.

This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models.
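The abstract reports each model's correct-answer count and percentage but not the number of graded answers behind each percentage. As a rough reading aid, the sketch below back-calculates that implied total from the reported figures alone. The function name `implied_total` and the rounding choice are illustrative assumptions, not part of the study's pipeline, and the results are only as precise as the one-decimal percentages allow.

```python
# Correct-answer counts and percentages exactly as reported in the abstract.
reported = {
    "GPT-3.5-Turbo": (3916, 66.9),
    "GPT-4-1106": (3837, 65.6),
    "PaLM 2": (3632, 62.1),
    "Claude v1": (2887, 58.3),
    "Gemini Pro": (2878, 49.2),
    "GPT-4-0613": (2580, 44.1),
}


def implied_total(correct: int, percent: float) -> int:
    """Approximate number of graded answers implied by a count and a percentage.

    Hypothetical helper for illustration; the study does not describe this step.
    """
    return round(correct / (percent / 100))


# Implied denominator for every model.
totals = {model: implied_total(n, pct) for model, (n, pct) in reported.items()}

# Models ordered by reported percentage, highest first.
ranking = sorted(reported, key=lambda m: reported[m][1], reverse=True)

for model in ranking:
    n, pct = reported[model]
    print(f"{model:14s} correct={n:5d}  reported={pct:4.1f}%  implied total ~{totals[model]}")
```

Under these assumptions, five of the models come out at roughly 5850 graded answers each, while Claude v1 comes out at roughly 4950. That is why Claude v1 ranks above Gemini Pro by percentage even though their correct-answer counts are nearly equal. The abstract does not state the reason for the smaller total.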
