Differentially Private Low-Rank Adaptation of Large Language Model Using Federated Learning
The surge in interest and application of large language models (LLMs) has sparked a drive to fine-tune these models to suit specific applications, such as finance and medical science. However, concerns regarding data privacy have emerged, especially when multiple stakeholders aim to collaboratively enhance LLMs using sensitive data. In this scenario, federated learning becomes a natural choice, allowing decentralized fine-tuning without exposing raw data to central servers. Motivated by this, we investigate how data privacy can be ensured in LLM fine-tuning through practical federated learning approaches, enabling secure contributions from multiple parties to enhance LLMs. Yet, challenges arise: (1) despite avoiding raw data exposure, there is a risk of inferring sensitive information from model outputs, and (2) federated learning for LLMs incurs notable communication overhead. To address these challenges, this article introduces DP-LoRA, a novel federated learning algorithm tailored for LLMs. DP-LoRA preserves data privacy by employing a Gaussian mechanism that adds noise to weight updates, maintaining individual data privacy while facilitating collaborative model training. Moreover, DP-LoRA optimizes communication efficiency via low-rank adaptation, minimizing the transmission of updated weights during distributed training. The experimental results across medical, financial, and general datasets using various LLMs demonstrate that DP-LoRA satisfies strict privacy constraints while minimizing communication overhead.
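The per-client update the abstract describes lends itself to a short sketch: clip the low-rank (LoRA) factor updates to bound sensitivity, add Gaussian noise, and transmit only the small rank-r matrices. The function name, shapes, and hyperparameters below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a DP-LoRA-style client update: clip the low-rank
# factors, add Gaussian noise, and send only the small rank-r matrices.
import numpy as np

def privatize_lora_update(delta_A, delta_B, clip_norm=1.0,
                          noise_multiplier=0.5, rng=None):
    """Clip a client's LoRA factor updates and add Gaussian noise."""
    rng = rng or np.random.default_rng()
    private = []
    for delta in (delta_A, delta_B):
        # Clip to bound each client's contribution (sensitivity).
        norm = np.linalg.norm(delta)
        delta = delta * min(1.0, clip_norm / (norm + 1e-12))
        # Gaussian mechanism: noise scaled to the clipping bound.
        delta = delta + rng.normal(0.0, noise_multiplier * clip_norm,
                                   size=delta.shape)
        private.append(delta)
    return private

# Only the rank-r factors (r x d and d x r) cross the network, not the
# full d x d weight delta -- the communication saving from low-rank adaptation.
d, r = 4096, 8
delta_A = 0.01 * np.random.randn(r, d)
delta_B = 0.01 * np.random.randn(d, r)
noisy_A, noisy_B = privatize_lora_update(delta_A, delta_B)
```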
- Research Article
8
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
- 10.55041/ijsrem36608
- Aug 10, 2024
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
This research paper delves into the inherent vulnerabilities and potential threats posed by large language models (LLMs), focusing on their implications across diverse applications such as natural language processing and data privacy. The study aims to identify and analyze these risks comprehensively, emphasizing the importance of mitigation strategies to prevent exploitation and misuse in LLM deployments. In recent years, LLMs have revolutionized fields like automated content generation, sentiment analysis, and conversational agents, yet their immense capabilities also raise significant security concerns. Vulnerabilities such as bias amplification, adversarial attacks, and unintended data leakage can undermine trust and compromise user privacy. Through a systematic examination of these challenges, this paper proposes safeguarding measures crucial for responsibly harnessing the potential of LLMs while minimizing associated risks. It underscores the necessity of rigorous security protocols, including robust encryption methods, enhanced authentication mechanisms, and continuous monitoring frameworks. Furthermore, the research discusses regulatory implications and ethical considerations surrounding LLM usage, advocating for transparency, accountability, and stakeholder engagement in policymaking and deployment practices. By synthesizing insights from current literature and real-world case studies, this study provides a comprehensive framework for stakeholders—developers, policymakers, and users—to navigate the complex landscape of LLM security effectively. Ultimately, this research aims to inform future advancements in LLM technology, ensuring its safe and beneficial integration into various domains while mitigating potential risks to individuals and society as a whole.
Keywords: Adversarial attacks on LLMs, Bias in LLMs, Data privacy in LLMs, Ethical considerations LLMs, Exploitation of LLMs, Large Language Models (LLMs), Misuse of LLMs, Mitigation strategies for LLMs, Natural Language Processing (NLP), Regulatory frameworks LLMs, Responsible deployment of LLMs, Risks of LLMs, Security implications of LLMs, Threats to LLMs, Vulnerabilities in LLMs.
- Research Article
13
- 10.1108/jebde-08-2023-0015
- Dec 19, 2023
- Journal of Electronic Business & Digital Economics
Purpose: The rapid rise of large language models (LLMs) has propelled them to the forefront of applications in natural language processing (NLP). This paper presents a comprehensive examination of the research landscape in LLMs, providing an overview of the prevailing themes and topics within this dynamic domain.
Design/methodology/approach: Drawing from an extensive corpus of 198 records published between 1996 and 2023, sourced from a relevant academic database and encompassing journal articles, books, book chapters, conference papers, and selected working papers, this study delves into the multifaceted world of LLM research. The authors employed the BERTopic algorithm, a recent advancement in topic modeling, to analyze the data after it had been meticulously cleaned and preprocessed. BERTopic leverages transformer-based language models such as bidirectional encoder representations from transformers (BERT) to generate more meaningful and coherent topics, facilitating the identification of hidden patterns within the data and uncovering insights that might otherwise remain obscure.
Findings: The analysis revealed four distinct clusters of topics in LLM research: "language and NLP", "education and teaching", "clinical and medical applications", and "speech and recognition techniques". Each cluster embodies a unique aspect of LLM application and showcases the breadth of possibilities that LLM technology has to offer. In addition to presenting the research findings, the paper identifies key challenges and opportunities in the realm of LLMs, underscoring the necessity for further investigation in specific areas, including the paramount importance of addressing potential biases, transparency and explainability, data privacy and security, and responsible deployment of LLM technology.
Practical implications: This classification offers practical guidance for researchers, developers, educators, and policymakers to focus efforts and resources. The study underscores the importance of addressing challenges in LLMs, including potential biases, transparency, data privacy, and responsible deployment. Policymakers can utilize this information to shape regulations, while developers can tailor technology development based on the diverse applications identified. The findings also emphasize the need for interdisciplinary collaboration and highlight ethical considerations, providing a roadmap for navigating the complex landscape of LLM research and applications.
Originality/value: This study stands out as the first to examine the evolution of LLMs across such a long time frame and across such diversified disciplines. It provides a unique perspective on the key areas of LLM research, highlighting the breadth and depth of LLMs' evolution.
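As a concrete illustration of the pipeline, a minimal BERTopic run looks like the sketch below. The corpus here is a stand-in (the study fit the model on its own 198 cleaned records), and the parameters are library defaults rather than the authors' settings.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; the study used its own cleaned and preprocessed records.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

# BERTopic embeds documents with a transformer model, clusters the
# embeddings, and extracts keyword-based topic representations.
topic_model = BERTopic(language="english", min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)

# Inspect the discovered clusters (the study surfaced four, e.g. "language and NLP").
print(topic_model.get_topic_info().head())
```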
- Research Article
3
- 10.18653/v1/2024.emnlp-main.1244
- Jan 1, 2024
- Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains challenging due to their immense size and privacy concerns. In this study, we propose MedAdapter, a unified post-hoc adapter for test-time adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments on four biomedical tasks across eight datasets demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning, achieving average performance improvements of 18.24% and 10.96%, respectively, without requiring extensive computational resources or sharing data with third parties. MedAdapter also yields enhanced performance when combined with train-time adaptation, highlighting a flexible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs to the biomedical domain.
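The ranking step MedAdapter performs can be pictured with a generic best-of-N reranking sketch: a frozen LLM proposes several candidate answers, and a small BERT-sized scorer picks the best one. The scorer model, question, and candidates below are stand-ins for illustration of the general pattern, not MedAdapter's exact adapter or training recipe.

```python
# Generic post-hoc reranking sketch: score LLM-generated candidates with a
# small cross-encoder and keep the top one. Model choice is a stand-in.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

scorer_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # small BERT-sized scorer
tokenizer = AutoTokenizer.from_pretrained(scorer_name)
scorer = AutoModelForSequenceClassification.from_pretrained(scorer_name)

def rank_candidates(question: str, candidates: list[str]) -> str:
    """Return the candidate the small scorer rates most relevant."""
    inputs = tokenizer([question] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = scorer(**inputs).logits.squeeze(-1)
    return candidates[int(scores.argmax())]

# Toy candidates, as might come from sampling a frozen LLM several times.
candidates = ["Answer A: lifestyle changes plus metformin.",
              "Answer B: immediate insulin for all patients."]
print(rank_candidates("What is first-line therapy for type 2 diabetes?", candidates))
```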
- Research Article
356
- 10.1016/j.hcc.2024.100211
- Mar 1, 2024
- High-Confidence Computing
A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly
- Conference Article
- 10.54941/ahfe1006669
- Jan 1, 2025
Thematic Analysis (TA) is a powerful tool for human factors, HCI, and UX researchers to gather system usability insights from qualitative data like open-ended survey questions. However, TA is both time-consuming and difficult, requiring researchers to review and compare hundreds, thousands, or even millions of pieces of text. Recently, this has driven many to explore using Large Language Models (LLMs) to support such an analysis. However, LLMs have their own processing limitations and usability challenges when implementing them reliably as part of a research process, especially when working with a large corpus of data that exceeds LLM context windows. These challenges are compounded when using locally hosted LLMs, which may be necessary to analyze sensitive and/or proprietary data. However, little human factors research has rigorously examined how various prompt engineering techniques can augment an LLM to overcome these limitations and improve usability. Accordingly, in the present paper, we investigate the impact of several prompt engineering techniques on the quality of LLM-mediated TA. Using a local LLM (Llama 3.1 8b) to ensure data privacy, we developed four LLM variants with progressively complex prompt engineering techniques and used them to extract themes from user feedback regarding the usability of a novel knowledge management system prototype. The LLM variants were as follows:
1. A "baseline" variant with no prompt engineering or scalability
2. A "naïve batch processing" variant that sequentially analyzed small batches of the user feedback to generate a single list of themes
3. An "advanced batch processing" variant that built upon the naïve variant by adding prompt engineering techniques (e.g., chain-of-thought prompting)
4. A "cognition-inspired" variant that incorporated advanced prompt engineering techniques and kept a working memory-like log of themes and their frequency
Contrary to conventional approaches to studying LLMs, which largely rely upon descriptive statistics (e.g., % improvement), we systematically applied a set of evaluation methods from behavioral science and human factors. We performed three stages of evaluation of the outputs of each LLM variant: we compared the LLM outputs to our team's original TA, we had human factors professionals (N = 4) rate the quality and usefulness of the outputs, and we compared the Inter-Rater Reliability (IRR) of other human factors professionals (N = 2) attempting to code the original data with the outputs generated by each variant. Results demonstrate that even small, locally deployed LLMs can produce high-quality TA when guided by appropriate prompts. While the "baseline" variant performed surprisingly well for small datasets, we found that the other, scalable methods were dependent upon advanced prompt engineering techniques to be successful. Only our novel "cognition-inspired" approach performed as well as the "baseline" variant in qualitative and quantitative comparisons of ratings and coding IRR. This research provides practical guidance for human factors researchers looking to integrate LLMs into their qualitative analysis workflows, disentangling the importance of context window limitations, batch processing strategies, and advanced prompt engineering techniques. The findings suggest that local LLMs can serve as valuable and scalable tools in thematic analysis.
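A minimal sketch of the "naïve batch processing" variant might look like the following: feedback is chunked into batches small enough for the context window, each batch is condensed into themes, and the per-batch themes are pooled for a final merge pass. Serving Llama 3.1 8B through the Ollama Python client is an assumption for illustration; the paper does not specify its serving stack or prompts.

```python
# Sketch of naive batch processing for LLM-mediated thematic analysis,
# assuming a local Llama 3.1 8B served via Ollama (an assumption).
import ollama

def extract_themes(feedback: list[str], batch_size: int = 20) -> list[str]:
    all_themes = []
    for i in range(0, len(feedback), batch_size):
        # Keep each batch well under the model's context window.
        batch = "\n".join(f"- {item}" for item in feedback[i:i + batch_size])
        reply = ollama.chat(
            model="llama3.1:8b",
            messages=[{"role": "user",
                       "content": "List the usability themes in this feedback, "
                                  "one per line:\n" + batch}],
        )
        all_themes.extend(line.strip("- ").strip()
                          for line in reply["message"]["content"].splitlines()
                          if line.strip())
    return all_themes  # a dedup/merge pass over pooled themes would follow
```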
- Preprint Article
- 10.2196/preprints.71916
- Jan 29, 2025
BACKGROUND: Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work.
OBJECTIVE: We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and to facilitate the integration of LLMs into health care.
METHODS: We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies.
RESULTS: We collected 330 LLMs and recorded their application frequency in clinical tasks and the frequency with which they performed best in their context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are key stages in the clinical workflow, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering methods or fine-tuning on specific datasets to improve performance. Most LLMs with multimodal abilities are closed-source models and therefore lack transparency, model customization, and fine-tuning for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings.
CONCLUSIONS: In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. Written from a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference for successfully applying LLMs in clinical settings.
- Research Article
6
- 10.3390/cancers16162830
- Aug 12, 2024
- Cancers
Large Language Models (LLMs), such as the GPT model family from OpenAI, have demonstrated transformative potential across various fields, especially in medicine. These models can understand and generate contextual text, adapting to new tasks without specific training. This versatility can revolutionize clinical practices by enhancing documentation, patient interaction, and decision-making processes. In oncology, LLMs offer the potential to significantly improve patient care through the continuous monitoring of chemotherapy-induced toxicities, a task that is often unmanageable for human resources alone. However, existing research has not sufficiently explored the accuracy of LLMs in identifying and assessing subjective toxicities based on patient descriptions. This study aims to fill this gap by evaluating the ability of LLMs to accurately classify these toxicities, facilitating personalized and continuous patient care.
In this comparative pilot study, thirteen oncologists evaluated 30 fictitious cases created using expert knowledge and OpenAI's GPT-4. These evaluations, based on the CTCAE v.5 criteria, were compared to those of a contextualized LLM. Metrics such as the mode and mean of responses were used to gauge consensus, and the accuracy of the LLM was analyzed in both general and specific toxicity categories, considering types of errors and false alarms. The study's results are intended to justify further research involving real patients.
The study revealed significant variability in oncologists' evaluations due to the lack of interaction with fictitious patients. Using mean evaluations, the LLM achieved an accuracy of 85.7% in general categories and 64.6% in specific categories, with mild errors at 96.4% and severe errors at 3.6%; false alarms occurred in 3% of cases. By comparison, individual oncologists' accuracy ranged from 66.7% to 89.2% for general categories and from 57.0% to 76.0% for specific categories, and the 95% confidence intervals for the median accuracy of oncologists were 81.9% to 86.9% (general) and 67.6% to 75.6% (specific). These benchmarks indicate that the LLM's general-category performance falls within the expert range, while its specific-category accuracy requires improvement.
The findings demonstrate that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to expert oncologists. The study's limitations include the use of fictitious cases, the lack of patient interaction, and reliance on audio transcriptions. Nevertheless, LLMs show significant potential to enhance patient monitoring, enable early interventions, reduce severe complications and oncologists' workload, and improve care quality and efficiency. Future research should focus on the specific training of LLMs for medical tasks, validation with real patients, interactive evaluations for real-time patient interactions, larger sample sizes, and robustness and generalization in diverse clinical settings. Ethical considerations, including data accuracy, transparency, and privacy, are crucial for the safe integration of LLMs into clinical practice.
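The consensus comparison reduces to a small computation: take each case's modal oncologist grade as the reference and score the LLM against it. The sketch below illustrates that logic with fabricated toy grades; the study's actual case data is not reproduced here.

```python
# Toy illustration of scoring an LLM's CTCAE grades against the
# oncologists' modal (consensus) grade per case; all values are fabricated.
from statistics import mode

oncologist_grades = {            # case -> grades from 13 oncologists
    "case_01": [2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2],
    "case_02": [1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1],
    "case_03": [3, 3, 2, 3, 4, 3, 3, 3, 2, 3, 3, 3, 3],
}
llm_grades = {"case_01": 2, "case_02": 2, "case_03": 3}

consensus = {case: mode(grades) for case, grades in oncologist_grades.items()}
accuracy = sum(llm_grades[c] == g for c, g in consensus.items()) / len(consensus)
print(f"LLM accuracy vs. modal consensus: {accuracy:.0%}")  # 67% on this toy set
```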
- Research Article
- 10.1093/ofid/ofae631.609
- Jan 29, 2025
- Open Forum Infectious Diseases
Background: Central line-associated bloodstream infection (CLABSI) surveillance can be subjective and time-consuming. Large language models (LLMs) are advanced artificial intelligence systems with potential to assist healthcare professionals in classification tasks. Stanford Health Care recently implemented one of the first secure LLMs, powered by OpenAI's GPT-4.0, cleared for sensitive health data. We assessed its performance in classifying CLABSI cases.
Figure 1: Confusion matrix of LLM performance in CLABSI classification.
Methods: We selected 40 patients flagged by our surveillance system for CLABSI review from November 2023 to March 2024: 20 CLABSIs, consecutively identified, and 20 not-CLABSIs (randomly sampled). We prompted the LLM to determine whether patients met the NHSN definition for CLABSI and provided the blood culture results that triggered the alert and the last 2 progress notes from the primary care team at the end of the infection window (within 3 days after the first positive test). We compared the secure LLM's determinations with those of infection preventionists.
Table 1: Cases in which the LLM did not agree with IP assessment for CLABSI. *Community-onset: blood cultures obtained within 2 days of admission. +NHSN guidelines list Fusobacterium nucleatum as an MBI organism (https://www.cdc.gov/nhsn/pdfs/pscmanual/17pscnosinfdef_current.pdf). Abbreviations: BSI, bloodstream infection; CLABSI, central line-associated bloodstream infection; CoNS, coagulase-negative Staphylococci; ESBL, extended-spectrum beta-lactamase; HIDA scan; IP, infection preventionist; LLM, large language model; MBI, mucosal barrier injury; MSSA, methicillin-susceptible Staphylococcus aureus; NHSN, National Healthcare Safety Network.
Results: Across 20 CLABSI-positive and 20 CLABSI-negative cases reviewed, the LLM accurately identified 16 of 20 CLABSIs and 7 of 20 not-CLABSIs. The sensitivity was 80% (95% CI 57.6%–92.9%), specificity was 35% (95% CI 33.3%–86.5%), and the agreement rate was 57.5% (95% CI 41.2%–73.3%). Among 17 discordant cases, 11 involved clinical data available in the chart but unavailable to the LLM: admission information (4 false positives), matching organisms (4 false positives), and central line or symptom status (2 false negatives, 1 false positive). Had this information been available to the LLM, we would expect an adjusted sensitivity of 90% (18/20) and an adjusted specificity of 80% (16/20). The remaining discordant cases involved misclassifications of organisms and incorrect identification of infection sources by the LLM. The mean review time by infection preventionists was 75 minutes (SD 48.7 minutes), compared to 5 minutes using the LLM.
Conclusion: An LLM not specifically trained for CLABSI classification showed high sensitivity using limited patient data. LLM case review required 5 minutes, versus about 1 hour for traditional review. These results suggest LLMs could serve as a "first-pass" screening tool for CLABSI detection, helping infection preventionists narrow the records needing human review.
Disclosures: All authors: no reported disclosures.
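The headline metrics follow directly from the reported counts (16 of 20 CLABSIs and 7 of 20 not-CLABSIs correctly classified), as the short check below shows; the confidence intervals would require an interval method (e.g., Wilson) and are omitted here.

```python
# Reproducing the reported metrics from the confusion-matrix counts above.
tp, fn = 16, 4    # CLABSI cases: correctly flagged / missed
tn, fp = 7, 13    # not-CLABSI cases: correctly cleared / false alarms

sensitivity = tp / (tp + fn)                  # 16/20 = 80.0%
specificity = tn / (tn + fp)                  # 7/20  = 35.0%
agreement = (tp + tn) / (tp + fn + tn + fp)   # 23/40 = 57.5%

print(f"sensitivity={sensitivity:.1%}, specificity={specificity:.1%}, "
      f"agreement={agreement:.1%}")
```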
- Research Article
- 10.1108/mlag-01-2025-0001
- Nov 28, 2025
- Machine Learning and Data Science in Geotechnics
Purpose: The purpose of this study is to explore the application of automated grading systems in geotechnics using large language models (LLMs) and cosine similarity for enhanced assessment and educational content generation. By training and testing LLMs on synthetic and real student data, the study seeks to develop robust systems for grading technical reports and open-ended questions, aligned with industry standards. Additionally, it aims to enhance student learning through auto-grading, immediate feedback and content generation, while addressing ethical considerations such as data privacy and fairness. Ultimately, the study strives to demonstrate the potential of LLMs to improve consistency, efficiency and educational outcomes.
Design/methodology/approach: The study employs a mixed-methods approach to develop and validate automated grading systems in geotechnics. Initially, correct answers were generated manually and synthetically using a generative pre-trained transformer model, with synthetic answers compared to correct ones via cosine similarity. Real student answers underwent similar evaluation. A Web-based tool was created to assess responses in real-time, providing dynamic feedback. Additionally, LLMs were fine-tuned on geotechnics textbooks and validated using synthetic and real student data. Anonymized student project reports were graded automatically, showcasing the potential and limitations of LLMs in consistent grading and educational content generation. Ethical considerations were addressed throughout.
Findings: The study demonstrated the potential of LLMs in geotechnics education by developing ML-driven systems for grading and content generation. The grading system, using cosine similarity and LLMs, provided consistent and objective assessments comparable to human graders. Immediate feedback on open-ended questions enhanced learning outcomes, enabling students to address knowledge gaps effectively. Fine-tuning LLMs with geotechnics textbooks and industry standards facilitated the generation of accurate, relevant questions and answers, further improved by retrieval-augmented generation (RAG). Data augmentation techniques enhanced model robustness, while ethical considerations, including data privacy, fairness, and transparency, ensured responsible deployment and fostered trust among stakeholders.
Originality/value: This study offers originality and value by pioneering the application of LLMs and cosine similarity for automated grading in geotechnics education, a domain with limited exploration in educational technology. By integrating RAG and fine-tuning LLMs with domain-specific textbooks, it bridges the gap between advanced machine learning techniques and practical applications in engineering education. The development of real-time feedback tools and robust grading systems enhances both student learning and instructional efficiency. Furthermore, addressing ethical considerations such as fairness and data privacy sets a precedent for responsible artificial intelligence (AI) deployment, contributing to the broader adoption of AI in academia.
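The grading core the abstract describes (embedding a reference answer and a student answer, then comparing them by cosine similarity) can be sketched in a few lines. The embedding model, example answers, and grade thresholds below are illustrative assumptions, not the study's configuration.

```python
# Minimal sketch of cosine-similarity grading; model and thresholds are
# illustrative stand-ins, and the answers are toy geotechnics examples.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Effective stress equals total stress minus pore water pressure."
student = "The effective stress is the total stress reduced by the pore pressure."

emb = model.encode([reference, student])
score = cosine_similarity(emb[0:1], emb[1:2])[0, 0]

# Map similarity to a grade band (thresholds are an assumption).
grade = "full credit" if score > 0.85 else "partial credit" if score > 0.6 else "review"
print(f"similarity={score:.2f} -> {grade}")
```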
- Research Article
- 10.1016/j.jdent.2025.106187
- Oct 1, 2025
- Journal of dentistry
The ethics and governance of large language models in dentistry: A framework for research and clinical implementation.
- Research Article
- 10.1093/bjrai/ubae019
- Dec 20, 2024
- BJR artificial intelligence
This review examines the use of large language models (LLMs) in cancer, analysing articles sourced from PubMed, Embase, and Ovid Medline published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. A total of 59 articles were included in the review, categorized into 3 segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions of LLMs in cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs into cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
- Research Article
185
- 10.1038/s41368-023-00239-y
- Jul 28, 2023
- International Journal of Oral Science
ChatGPT, a lightweight, conversational variant of the Generative Pretrained Transformer 4 (GPT-4) developed by OpenAI, is one of the milestone Large Language Models (LLMs), with billions of parameters. LLMs have stirred up much interest among researchers and practitioners with their impressive skills in natural language processing tasks, which profoundly impact various fields. This paper mainly discusses the future applications of LLMs in dentistry. We introduce two primary LLM deployment methods in dentistry, automated dental diagnosis and cross-modal dental diagnosis, and examine their potential applications. In particular, equipped with a cross-modal encoder, a single LLM can manage multi-source data and conduct advanced natural language reasoning to perform complex clinical operations. We also present cases to demonstrate the potential of a fully automatic multi-modal LLM AI system for clinical application in dentistry. While LLMs offer significant potential benefits, challenges such as data privacy, data quality, and model bias need further study. Overall, LLMs have the potential to revolutionize dental diagnosis and treatment, indicating a promising avenue for clinical application and research in dentistry.
- Research Article
- 10.30574/wjarr.2024.21.3.0891
- Mar 30, 2024
- World Journal of Advanced Research and Reviews
In cybersecurity, large language models are a double-edged sword: they offer new opportunities for data protection, threat mitigation, and privacy preservation, while simultaneously posing new threats to those same goals. The present paper discusses the changing roles of large language models in cybersecurity and innovative security concepts such as Profit Protection 2.0, AI-based encryption, and automated threat response. It goes on to discuss how large language models have been integrated into existing security technologies, and how emergent technologies such as blockchain, federated learning, and decentralized AI can considerably strengthen data security. It highlights the risks large language models pose, such as privacy leakage and vulnerability to adversarial attacks. Anticipating advancements in AI-powered cybersecurity, this research singles out predictive security, adaptive defense systems, and regulation for companies as well as policymakers. Drawing on insights from the latest literature, the paper makes recommendations for practical measures to safeguard these systems and for their ethical implementation. The findings enrich the debate around AI security with proposals for organizations to adopt measures that shield sensitive data while welcoming LLMs into cybersecurity innovation.
- Research Article
- 10.1002/jmri.29807
- May 4, 2025
- Journal of magnetic resonance imaging : JMRI
This narrative review focuses on the integration of large language models (LLMs), such as GPT-4 and Gemini, into breast imaging. LLMs excel in understanding, processing, and generating human-like text, with potential applications ranging widely from decision-making to radiology reporting support. LLMs show promise in addressing current critical challenges, including rising demands for imaging services concurrent with an increasing shortage in the radiologist workforce. Their ability to integrate clinical guidelines and generate standardized, evidence-based reports has the potential to improve diagnostic consistency and reduce inter-reader variability. Emerging multimodal capabilities further extend their utility, enabling the integration of textual and visual data for tasks such as tumor classification and decision-making. Despite these advancements, significant challenges remain. LLMs often suffer from limitations such as hallucinations, biases in training datasets, and domain-specific knowledge gaps. These issues can affect their reliability, particularly in nuanced tasks like Breast Imaging Reporting and Data System categorization and multimodal image assessment. Moreover, ethical concerns about data privacy, biased outputs, and regulatory compliance must be addressed before effective deployment in the clinical setting. Current studies suggest that while LLMs can complement human expertise, their performance still lags behind that of radiologists in key areas, particularly in tasks requiring complex medical reasoning or direct image analysis. Looking ahead, LLMs are poised to play a crucial role in breast imaging by optimizing workflows, supporting multidisciplinary meetings, and improving patient education. However, their successful integration will depend on proper context training, robust validation, and ethical oversight, with human supervision as a crucial safeguard. EVIDENCE LEVEL: 5. TECHNICAL EFFICACY: Stage 2.