🧜Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Abstract

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this article, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

Similar Papers
  • Research Article
  • Cited by 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • 10.1093/sleep/zsaf090.1401
1401 Bringing Medicine Expertise to Your Screen: A New Frontier in Curbside Sleep Consultation Leveraging Large Language Models?
  • May 19, 2025
  • SLEEP
  • Nina Kuei + 4 more

Introduction Advances in large language models (LLMs) have opened new avenues for healthcare applications. Recently, ChatGPT-4 successfully achieved the pass mark (>80%) in 5 of 10 sleep medicine examination domains, indicating a strong foundational knowledge of sleep medicine. However, the ability to answer USMLE-type multiple-choice questions may not equate to the capacity to offer accurate and comprehensive answers to clinical queries or real-world case scenarios. Current literature exhibits a significant gap in validation research examining the clinical utility of LLMs in evidence-based medicine practice. This investigation aims to evaluate the potential usefulness and reliability of LLMs as adjunctive clinical decision-support tools (namely, "curbside consultants") in sleep medicine. Methods Six clinical sleep queries and six case scenarios were presented to 6 LLMs: 3 general-purpose LLMs (ChatGPT-4o, Gemini-1.5-Pro, and Llama-3.1-405B) and 3 medical-specialized LLMs (OpenEvidence, Clara AI, and MediGPT). Performance assessment was conducted independently utilizing two independently developed 5-point Likert scales evaluating two primary domains: answer content (accuracy, relevance, comprehensiveness/depth, clarity/coherence, and unique insightfulness) and reference quality (accuracy, relevance, currentness, comprehensiveness/depth, and searchability). Benchmark answers were established through consensus among four sleep medicine specialists. Analysis of variance was used to compare the performance of the LLMs. Results The medical LLMs demonstrated superior overall performance compared to the general LLMs (P < 0.001). The primary distinction was observed in reference quality metrics, where medical LLMs significantly outperformed general LLMs across all parameters: accuracy, relevance, and searchability (p < 0.001), currentness (p = 0.001), and comprehensiveness/depth (p = 0.010).
Notably, OpenEvidence achieved the highest reference quality (p < 0.001). In contrast, the answer content analysis revealed no significant overall differences between medical and general LLMs (p = 0.659). Most answer content-related metrics, including accuracy, relevance, and unique insightfulness, did not differ significantly (p > 0.050). An exception was clarity/coherence, where medical LLMs were superior to general LLMs (p = 0.030). Furthermore, MediGPT and ChatGPT-4o displayed better content comprehensiveness/depth relative to other LLMs (p < 0.001). These two LLMs exhibited comparable performance in overall metrics (p = 0.615), answer contents (p = 0.922), and reference quality (p = 0.621). Conclusion Both medical-specialized and general-purpose LLMs show promise as adjunctive decision-support tools in clinical practice. However, substantial improvements in reference quality are critically needed across most LLM platforms.
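
The group comparison described above relies on one-way analysis of variance over 5-point Likert ratings. As a minimal sketch of that analysis, the following computes the F statistic for two hypothetical rating groups; the rating values are invented for illustration, and the p-value lookup against the F distribution is omitted.

```python
# Illustrative only: one-way ANOVA comparing hypothetical 5-point Likert
# ratings for medical-specialized vs. general-purpose LLMs. The data are
# invented; a large F statistic indicates the group means differ.
def one_way_anova(*groups):
    """Return the F statistic for a one-way ANOVA over the given groups."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

medical = [5, 4, 5, 4, 5, 4, 5, 5]   # hypothetical reference-quality ratings
general = [3, 2, 3, 3, 2, 3, 4, 2]   # hypothetical reference-quality ratings

print(f"F = {one_way_anova(medical, general):.2f}")
```

In practice the F statistic would then be compared against the F distribution with (df_between, df_within) degrees of freedom to obtain the p-values reported in the abstract.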

  • Research Article
  • 10.1108/ir-02-2025-0074
Large language and vision-language models for robot: safety challenges, mitigation strategies and future directions
  • Jul 29, 2025
  • Industrial Robot: the international journal of robotics research and application
  • Xiangyu Hu + 1 more

Purpose This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications. Design/methodology/approach This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions. Findings The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards. Originality/value This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.

  • Research Article
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. In addition, LLMs and LVLMs showed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • 10.1145/3728971
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Xiaohan Yuan + 11 more

Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy to systematically reflect LLM content safety, as well as automated safety assessment techniques to explore the potential risks efficiently. To bridge this striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM (M_t) and a novel safety critique LLM (M_c). The expert testing LLM M_t is responsible for automatically generating test cases in accordance with the proposed risk taxonomy (including 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM M_c can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval differs in significant ways: (i) efficient – we construct a multi-dimensional and open-ended benchmark comprising 220,000 test cases across 102 risks utilizing M_t and conduct safety evaluations for 21 influential LLMs via M_c on our benchmark. The entire process is fully automated and requires no human involvement.
(ii) effective – extensive validations show S-Eval facilitates a more thorough assessment and better perception of potential LLM risks, and M_c not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insights into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and accompanying new safety threats, test generation methods and safety critique methods thanks to the LLM-based architecture. We further study the impact of hyper-parameters and language environments on model safety, which may lead to promising directions for future research. S-Eval has been deployed in our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
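
The two-stage design described above, a testing model that generates risk-specific prompts and a critique model that scores the responses, can be sketched with plain stand-in functions. This is an illustrative outline only, not S-Eval's actual implementation; the taxonomy entries, the two stage functions, and the scoring rule are all hypothetical stand-ins for the LLMs M_t and M_c.

```python
# Illustrative sketch, not S-Eval's implementation: a "testing" stage
# produces risk-specific test cases and a "critique" stage assigns each
# response a risk score, aggregated per risk dimension.
risk_taxonomy = {
    "privacy": ["data leakage", "deanonymization"],
    "ethics": ["bias", "discrimination"],
}

def generate_test_case(risk):           # stand-in for the testing LLM M_t
    return f"Prompt probing the model for: {risk}"

def critique(response):                 # stand-in for the critique LLM M_c
    return 0.0 if "refuse" in response else 1.0   # 0 = safe, 1 = risky

def evaluate(model, taxonomy):
    """Return the mean risk score per dimension for a model under test."""
    report = {}
    for dimension, risks in taxonomy.items():
        scores = [critique(model(generate_test_case(r))) for r in risks]
        report[dimension] = sum(scores) / len(scores)
    return report

# A toy "model" that refuses everything is scored fully safe:
print(evaluate(lambda prompt: "I refuse to answer.", risk_taxonomy))
# prints {'privacy': 0.0, 'ethics': 0.0}
```

In the real framework both stages are themselves LLMs, which is what makes the pipeline fully automated yet adaptable to new risks.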

  • Research Article
  • Cited by 53
  • 10.1001/jamanetworkopen.2023.46721
Performance of Large Language Models on a Neurology Board–Style Examination
  • Dec 7, 2023
  • JAMA network open
  • Marc Cicero Schubert + 2 more

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored. This study assessed the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each of the 2 LLMs. LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.

  • Research Article
  • 10.1080/13658816.2025.2577252
Extraction of geoprocessing modeling knowledge from crowdsourced Google Earth Engine scripts by coordinating large and small language models
  • Nov 1, 2025
  • International Journal of Geographical Information Science
  • Anqi Zhao + 7 more

The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).
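
The standardization step described above, mapping free-text entity descriptions produced by an LLM onto canonical terms from a geoscience knowledge base, can be sketched with a simple string matcher standing in for the SLM embedding model. This is an assumption-laden illustration, not Geo-CLASS's implementation; the knowledge-base terms and descriptions are hypothetical.

```python
# Illustrative only: map an LLM-generated entity description onto the
# closest canonical knowledge-base term. A stdlib fuzzy string matcher
# stands in for the SLM; all terms here are hypothetical.
from difflib import get_close_matches

knowledge_base = ["cloud masking", "ndvi computation", "image compositing"]

def standardize(description, kb, cutoff=0.4):
    """Return the closest canonical term, or None if nothing is similar."""
    matches = get_close_matches(description.lower(), kb, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(standardize("NDVI computation step", knowledge_base))
# prints: ndvi computation
```

In the actual framework this matching is done in an embedding space learned by the small language model, which tolerates paraphrases that a surface string matcher would miss.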

  • Research Article
  • Cited by 3
  • 10.1016/j.joms.2024.11.007
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
  • Mar 1, 2025
  • Journal of Oral and Maxillofacial Surgery
  • Reema Mahmoud + 5 more


  • Research Article
  • Cited by 4
  • 10.2196/59641
Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.
  • Aug 29, 2024
  • JMIR infodemiology
  • Michael S Deiner + 5 more

Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. 
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
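
One simple way to quantify the human-versus-LLM theme agreement discussed above is set overlap (Jaccard similarity) over normalized theme labels. Both the metric and the theme names below are illustrative assumptions, not the study's actual evaluation protocol.

```python
# Illustrative only: measure agreement between human- and LLM-identified
# theme sets with Jaccard similarity. The theme labels are hypothetical.
def jaccard(a, b):
    """Jaccard similarity of two label sets, after case normalization."""
    a, b = {t.lower() for t in a}, {t.lower() for t in b}
    return len(a & b) / len(a | b)

human_themes = {"vaccine safety", "mandates", "misinformation", "side effects"}
llm_themes   = {"Vaccine Safety", "Mandates", "Conspiracy", "Side Effects"}

print(f"{jaccard(human_themes, llm_themes):.2f}")
# prints: 0.60
```

A metric like this captures exact-label overlap only; the study's expert judgment of whether an LLM theme is "reasonable" is a looser, semantic criterion.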

  • Conference Article
  • Cited by 100
  • 10.1145/3510003.3510203
Jigsaw: Large Language Models Meet Program Synthesis
  • May 21, 2022
  • Naman Jain + 6 more

Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code that uses the Python Pandas API from multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
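
The post-processing idea described above, validating model-suggested code before surfacing it, can be sketched as a syntactic check plus a check against a user-supplied input/output example. This is a minimal illustration of the general technique, not Jigsaw's implementation; the candidate snippets are hypothetical model outputs.

```python
# Minimal sketch of LLM-output post-processing (not Jigsaw's actual code):
# (1) check the candidate parses, (2) check it against a user-supplied
# input/output example before surfacing it to the programmer.
import ast

def passes_checks(candidate_src, func_name, example_in, expected_out):
    try:
        ast.parse(candidate_src)              # syntactic validation
    except SyntaxError:
        return False
    namespace = {}
    exec(candidate_src, namespace)            # load the candidate
    try:
        return namespace[func_name](example_in) == expected_out
    except Exception:
        return False

good = "def dedupe(xs):\n    return sorted(set(xs))"   # hypothetical output
bad  = "def dedupe(xs):\n    return xs"                # hypothetical output

print(passes_checks(good, "dedupe", [3, 1, 3], [1, 3]))  # True
print(passes_checks(bad,  "dedupe", [3, 1, 3], [1, 3]))  # False
```

Jigsaw goes much further, using program analysis and synthesis to repair near-miss candidates rather than merely filtering them, but the filter above conveys why semantics-aware post-processing helps.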

  • Research Article
  • 10.1186/s40862-025-00334-z
Exploring sentence-level revision capabilities of large language models in English for academic purposes writing assistance
  • May 29, 2025
  • Asian-Pacific Journal of Second and Foreign Language Education
  • Zhendong Du + 1 more

English for Academic Purposes (EAP) is pivotal for scholarly communication; however, it poses significant challenges for non-native English speakers. Recently, Large Language Models (LLMs) have been extensively utilized in EAP to assist with writing tasks. EAP writing assistance typically encompasses several downstream tasks in natural language processing, such as Grammatical Error Correction (GEC). Nonetheless, some studies have revealed that the performance of LLMs in GEC tasks is inferior to traditional GEC solutions. To explore the capabilities of LLMs more thoroughly in aspects like deep semantic and syntactic structures, this study aims to rigorously assess the performance of LLMs in the Sentence-level Revision (SentRev) task. We designed three sets of meticulous experiments to evaluate the efficacy of different LLMs. The first experiment assessed LLMs using prompts in ten different languages, finding that the SentRev performance of LLMs was heavily influenced by the language of the prompt and the quality of the input text. The second experiment investigated the performance of English LLMs with minimal prompting in the SentRev task, yet the results showed no significant changes, contradicting some prior studies. In the third experiment, we devised an innovative and straightforward method that significantly enhanced the performance of multiple LLMs by integrating academic phrases from the Formulaic Language Academic Phrasebank (https://www.phrasebank.manchester.ac.uk/), thus overcoming the performance limitations imposed by different languages on LLMs. Additionally, our study highlights the deficiencies in existing evaluation benchmarks and suggests that higher-level, discourse-based EAP text evaluation benchmarks merit deeper exploration.

  • Research Article
  • Cited by 7
  • 10.1016/j.procs.2023.09.086
A Large and Diverse Arabic Corpus for Language Modeling
  • Jan 1, 2023
  • Procedia Computer Science
  • Abbas Raza Ali + 3 more


  • Research Article
  • Cited by 3
  • 10.1109/embc53108.2024.10782119
High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP Models.
  • Jul 15, 2024
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society
  • Syed I Munzir + 2 more

Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high-throughput methods. Over the past 30 years, progress has been made toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high-throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high-throughput deep phenotyping of physician notes. Clinical relevance: Large language models will likely emerge as the dominant method for the high-throughput phenotyping of signs and symptoms in physician notes.

  • Research Article
  • Cited by 50
  • 10.1038/s41746-024-01024-9
CancerGPT for few shot drug pair synergy prediction using large pretrained language models
  • Feb 19, 2024
  • NPJ Digital Medicine
  • Tianhao Li + 6 more

Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (~124M parameters), is comparable to the larger fine-tuned GPT-3 model (~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and also advances the use of LLMs for biological and medical inference tasks.

  • Research Article
  • 10.70891/jair.2025.040011
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models
  • Aug 2, 2025
  • Journal of Artificial Intelligence Research
  • Qiyao Wang + 7 more

With the rapid development of Large Language Models (LLMs) in vertical domains, attempts have been made to apply them to the field of intellectual property (IP). However, there is currently no evaluation benchmark specifically for assessing the understanding, application, and reasoning abilities of LLMs in the IP domain. To address this issue, we introduce IPEval, the first capability evaluation benchmark designed for IP agency and consulting tasks. IPEval consists of 2657 multiple-choice questions, divided into four major capability dimensions: creation, application, protection, and management. These questions cover eight areas: patent rights (including inventions, utility models, and designs), trademarks, copyrights, trade secrets, integrated circuit layout design rights, geographical indications, and related laws. We designed three evaluation methods: zero-shot, five-shot, and Chain of Thought (CoT), applied to seven LLMs of varying parameter sizes, primarily using either English or Chinese. The study results indicate that the GPT series and Qwen series models demonstrate stronger performance in English tests, while Chinese-focused LLMs, such as the Qwen series, outperform GPT-4 in Chinese tests. Specialized legal-domain LLMs, such as fuzi-mingcha and MoZi, still significantly lag behind general-purpose LLMs of comparable parameter sizes in IP performance. This highlights the necessity and substantial potential for developing more specialized LLMs with stronger IP abilities. We also analyze the models' capabilities in terms of the regional and temporal aspects of IP, emphasizing that IP-domain LLMs need to clearly understand the differences in IP laws across different regions and their dynamic changes over time. We hope IPEval can provide an accurate assessment of LLM capabilities in the IP domain and encourage researchers interested in IP to develop LLMs with richer IP knowledge.
