Recall-Oriented Understudy For Gisting Evaluation Research Articles

The integration of deep learning into radiology has the potential to enhance diagnostic processes, yet its acceptance in clinical practice remains limited due to various challenges. This study aimed to develop and evaluate a fine-tuned large language model (LLM), based on Llama 3-8B, to automate the generation of accurate and concise conclusions in magnetic resonance imaging (MRI) and computed tomography (CT) radiology reports, thereby assisting radiologists and improving reporting efficiency. A dataset comprising 15,000 radiology reports was collected from the University of Medicine and Pharmacy of Craiova's Imaging Center, covering a diverse range of MRI and CT examinations made by four experienced radiologists. The Llama 3-8B model was fine-tuned using transfer-learning techniques, incorporating parameter quantization to 4-bit precision and low-rank adaptation (LoRA) with a rank of 16 to optimize computational efficiency on consumer-grade GPUs. The model was trained over five epochs using an NVIDIA RTX 3090 GPU, with intermediary checkpoints saved for monitoring. Performance was evaluated quantitatively using Bidirectional Encoder Representations from Transformers Score (BERTScore), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics on a held-out test set. Additionally, a qualitative assessment was conducted, involving 13 independent radiologists who participated in a Turing-like test and provided ratings for the AI-generated conclusions. The fine-tuned model demonstrated strong quantitative performance, achieving a BERTScore F1 of 0.8054, a ROUGE-1 F1 of 0.4998, a ROUGE-L F1 of 0.4628, and a METEOR score of 0.4282. 
In the human evaluation, the artificial intelligence (AI)-generated conclusions were preferred over human-written ones in approximately 21.8% of cases, indicating that the model's outputs were competitive with those of experienced radiologists. The average rating of the AI-generated conclusions was 3.65 out of 5, reflecting a generally favorable assessment. Notably, the model maintained its consistency across various types of reports and demonstrated the ability to generalize to unseen data. The fine-tuned Llama 3-8B model effectively generates accurate and coherent conclusions for MRI and CT radiology reports. By automating the conclusion-writing process, this approach can assist radiologists in reducing their workload and enhancing report consistency, potentially addressing some barriers to the adoption of deep learning in clinical practice. The positive evaluations from independent radiologists underscore the model's potential utility. While the model demonstrated strong performance, limitations such as dataset bias, limited sample diversity, a lack of clinical judgment, and the need for large computational resources require further refinement and real-world validation. Future work should explore the integration of such models into clinical workflows, address ethical and legal considerations, and extend this approach to generate complete radiology reports.
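
To make the ROUGE-1 F1 figure above concrete, here is a minimal sketch of how unigram-overlap F1 can be computed. It assumes lowercasing and whitespace tokenization, with no stemming or stopword handling, so it will not reproduce the reported 0.4998 exactly. The function name `rouge1_f1` and the example sentences are hypothetical and are not taken from the study.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical radiologist-written and model-generated conclusions.
reference = "no acute intracranial abnormality"
candidate = "no acute abnormality"
score = rouge1_f1(reference, candidate)
```

In this example three of the four reference words are matched and every candidate word is matched, so recall is 0.75, precision is 1.0, and F1 is 6/7.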

Background: The impression section integrates key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source Large Language Model (LLM) in automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals. Methods: In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an automatic natural language evaluation metric that measures word overlap, was used for automatic natural language evaluation. A reader study with five cardiothoracic radiologists was performed to more strictly evaluate the model’s performance on a specific modality (CT chest exams) with a radiologist subspecialist baseline. We stratified the results of the reader performance study based on the diagnosis category and the original impression length to gauge case complexity. Results: The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on UCSFMC and upon external validation, ROUGE-L scores of 40.74, 37.89, and 24.61 on ZSFG across the CT, US, and MRI modalities respectively, implying a substantial degree of overlap between the model-generated impressions and impressions written by the subspecialist attending radiologists, but with a degree of degradation upon external validation. In our reader study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, 12.32 words, and 84 while the original impression written by a subspecialist radiologist achieved overall mean scores of 3.75/4, 3.87/4, 3.54/4, 12.2 s, 5.74 words, and 89 for clinical accuracy, grammatical accuracy, stylistic quality, edit time, edit distance, and ROUGE-L score respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and on shorter impressions. Conclusions: An open-source fine-tuned LLM can generate impressions to a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists’ workflows.
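
The ROUGE-L scores above are based on the longest common subsequence (LCS) of words shared by a generated impression and its reference. The following is a minimal sketch of that idea, assuming lowercased whitespace tokens and a plain F1 of LCS-based precision and recall. Published implementations may add stemming or weight recall differently, so this is not necessarily the exact scoring used in the study. The names `lcs_length` and `rouge_l_f1` and the example impressions are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall (a sketch)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0

    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0

    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical impressions: same words, two of them swapped.
reference = "small right pleural effusion"
candidate = "right small pleural effusion"
score = rouge_l_f1(reference, candidate)
```

Because the LCS must preserve word order, swapping two words lowers the score to 0.75 here even though every word is shared. Multiplying by 100 gives the scale used for the ROUGE-L values reported above.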

Articles published on Recall-Oriented Understudy For Gisting Evaluation

Targeting COVID-19 and Human Resources for Health News Information Extraction: Algorithm Development and Validation.

GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3.

HybridEval: An Improved Novel Hybrid Metric for Evaluation of Text Summarization

A Transformer-Based Yoruba to English Machine Translation (TYEMT) System with Rouge Score

An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

An indicator-based multi-objective variable neighborhood search approach for query-focused summarization

Beyond ROUGE: A Comprehensive Evaluation Metric for Abstractive Summarization Leveraging Similarity, Entailment, and Acceptability

Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.

Hybrid model for extractive single document summarization: utilizing BERTopic and BERT model

Bidirectional recommendation in HR analytics through text summarization

Video Transcripts Summarization using OpenAI Whisper and GPT Model

Exploring the potential of data augmentation in poetry generation with small-scale corpora

An Abstractive Text Summarization using Decoder Attention with Pointer Network

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study.

Deep sequential pattern mining for readability enhancement of Indonesian summarization

Ar-CM-ViMETA: Arabic Image Captioning based on Concept Model and Vision-based Multi-Encoder Transformer Architecture

Fully automatic summarization of radiology reports using natural language processing with large language models

CANBLWO: A Novel Hybrid Approach for Semantic Text Generation

Automatic Update Summarization by a Multiobjective Number-One-Selection Genetic Approach.

KurdSum: A new benchmark dataset for the Kurdish text summarization
