Large User Research Articles

As the health care industry increasingly embraces large language models (LLMs), understanding the consequence of this integration becomes crucial for maximizing benefits while mitigating potential pitfalls. This paper explores the evolving relationship among clinician trust in LLMs, the transition of data sources from predominantly human-generated to artificial intelligence (AI)-generated content, and the subsequent impact on the performance of LLMs and clinician competence. One of the primary concerns identified in this paper is the LLMs' self-referential learning loops, where AI-generated content feeds into the learning algorithms, threatening the diversity of the data pool, potentially entrenching biases, and reducing the efficacy of LLMs. While theoretical at this stage, this feedback loop poses a significant challenge as the integration of LLMs in health care deepens, emphasizing the need for proactive dialogue and strategic measures to ensure the safe and effective use of LLM technology. Another key takeaway from our investigation is the role of user expertise and the necessity for a discerning approach to trusting and validating LLM outputs. The paper highlights how expert users, particularly clinicians, can leverage LLMs to enhance productivity by off-loading routine tasks while maintaining a critical oversight to identify and correct potential inaccuracies in AI-generated content. This balance of trust and skepticism is vital for ensuring that LLMs augment rather than undermine the quality of patient care. We also discuss the risks associated with the deskilling of health care professionals. Frequent reliance on LLMs for critical tasks could result in a decline in health care providers' diagnostic and thinking skills, particularly affecting the training and development of future professionals. The legal and ethical considerations surrounding the deployment of LLMs in health care are also examined. 
We discuss the medicolegal challenges, including liability in cases of erroneous diagnoses or treatment advice generated by LLMs. The paper references recent legislative efforts, such as the Algorithmic Accountability Act of 2023, as crucial steps toward establishing a framework for the ethical and responsible use of AI-based technologies in health care. In conclusion, this paper advocates for a strategic approach to integrating LLMs into health care. By emphasizing the importance of maintaining clinician expertise, fostering critical engagement with LLM outputs, and navigating the legal and ethical landscape, we can ensure that LLMs serve as valuable tools in enhancing patient care and supporting health care professionals. This approach addresses the immediate challenges posed by integrating LLMs and sets a foundation for their sustainable and responsible use in the future.


Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered. We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and harmless responses to laboratory test-related questions asked by patients and to identify potential issues that can be mitigated using augmentation approaches. We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation With Explicit Ordering (METEOR), and Bidirectional Encoder Representations from Transformers Score (BERTScore). We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects. Regarding the similarity of the responses from the other 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca.
Human answers from the Yahoo! Answers data scored the lowest and were thus the least similar to the GPT-4-generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses also occasionally suffered from a lack of interpretation in the patient's medical context, incorrect statements, and a lack of references. By evaluating LLMs in generating responses to patients' laboratory test result-related questions, we found that, compared to the other 4 LLMs and human answers from a Q&A website, GPT-4's responses were more accurate, helpful, and relevant, as well as safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
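
The two kinds of automatic scoring mentioned in this abstract, reference-based similarity and a pairwise win rate, can be illustrated with a minimal sketch. It is not the study's implementation, which is not given here. The function names rouge1 and win_rate are hypothetical, the tokenizer is a plain whitespace split, only the unigram (ROUGE-1) variant is shown, and counting a tie as half a win is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, List


def rouge1(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram-overlap ROUGE-1 between a reference answer and a candidate.

    Assumption: naive lowercase whitespace tokenization, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def win_rate(judgments: List[str]) -> float:
    """Share of pairwise comparisons won by the target model.

    Each judgment is "win", "tie", or "loss" from the target's point of view.
    Assumption: a tie counts as half a win.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[j] for j in judgments) / len(judgments)


if __name__ == "__main__":
    # Toy strings only; these are not data from the study.
    reference = "a high white cell count can indicate infection"
    candidate = "a high white cell count may signal an infection"
    print(rouge1(reference, candidate))
    print(win_rate(["win", "win", "tie", "loss"]))
```

In this setup the reference string plays the role that the GPT-4 answer plays in the abstract, and each judgment would come from the LLM-based evaluator comparing a target model with the baseline.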

Articles published on Large User

The Subscription Management System

Click-and-Trade Structured Products for Wealth Management

Responsible Crowdsourcing for Responsible Generative AI: Engaging Crowds in AI Auditing and Evaluation

A Beam Hopping Scheme Based on Adaptive Beam Radius for LEO Satellites.

Reinventing BrED: A Practical Construction

Reactive Load Pre-measurement for Large Industrial Users Based on Improved GBDT Regression Algorithm

Analysis of Profiles of Supporters of Conspiracy Narratives about Vaccination Against COVID-19 on a Social Network

Researching hate speech online: Exploring the potential and limitations of Facebook as a survey tool in Africa

Pengembangan Website Desa Wisata Sebagai Media Informasi Wisatawan Pada Desa Temesi

A User-Centered Framework for Data Privacy Protection Using Large Language Models and Attention Mechanisms

Impact of Short Video Marketing on Film Promotion: A Case Study of Douyin Platform

Techno-Economic Assessment of Electricity Generation From a Medium-Scale CSP-PV Hybrid Plant Using Long-Duration Storage

Research on the Development Strategy of Chinese Second-Dimensional Mobile Games - A Case Study Based on Genshin Impact

IDify - Distributed Database System for Digital Identification

An effective attention and residual network for malware detection

Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals.

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.

Detecting multilingual hate speech targeting immigrants and women on Twitter

Unveiling the Synergistic Relationship between Distributed Systems and Cloud Computing: A Review of Architectural Trends

A Transfer Learning-based Method for the Daily Electricity Consumption Forecasting of Large Industrial Users after Business Expansion
