An Improved Autoregressive Evaluation Paradigm for Large Language Models
The AI community has witnessed the emergence of various chat-style Large Language Models (LLMs) since the advent of ChatGPT. Despite significant progress in this area, evaluating these models remains a substantial challenge. The evaluations provided by humans or GPT-4 oracles are often taken as the gold standard, but they are neither automatic nor scalable. More recently, a series of (open-source) LLM-based judge models have been introduced, yet they often exhibit model-specific biases, e.g., a LLaMA-family judge favors a LLaMA-family model. On the other hand, autoregressive evaluation metrics, which hold the potential to address the aforementioned issues, remain underexplored. Among them, likelihood-based metrics such as perplexity and negative log-likelihood (NLL) are widely adopted and have proven effective in tracking the pretraining progress of LLMs. However, they struggle to evaluate the generation capabilities of fine-tuned models due to exposure bias, a phenomenon where the distribution of the model’s output gradually deviates from the ground truth during inference. To address this key issue, in this paper we propose a novel autoregressive metric, Normalized Discounted Cumulative Gain (NDCG), to improve the evaluation of fine-tuned LLMs. Our experimental results demonstrate that NDCG significantly outperforms likelihood-based metrics: it shows over 45% improvement in both Spearman and Kendall’s tau correlation coefficients for commonsense QA tasks, and aligns more closely with GPT-4 Elo rankings for instruction-tuned models.
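For reference, the likelihood-based baselines named in this abstract can be sketched in a few lines. This is a minimal illustration only: the per-token probabilities are invented, and the paper's NDCG formulation is not reproduced here.

```python
import math

def nll_and_perplexity(token_probs):
    """Sequence-level negative log-likelihood and perplexity from the
    model's probabilities p(x_t | x_<t) for each ground-truth token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return nll, math.exp(nll)

# A model assigning probability 0.5 to each of four gold tokens:
nll, ppl = nll_and_perplexity([0.5, 0.5, 0.5, 0.5])
# nll == ln 2 ≈ 0.693, ppl == 2.0
```

Perplexity is simply exp(NLL), which is why the two metrics track pretraining progress identically; the exposure-bias argument above is precisely that neither reflects what happens once the model conditions on its own sampled outputs.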
- Research Article
8
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Conference Article
1
- 10.54941/ahfe1004478
- Jan 1, 2024
Large Language Models (LLMs) have emerged as a driving force in the field of Natural Language Processing (NLP) with applications spanning various domains, including the development of Autonomous Generative Agents. Generative Agents are computational software programs designed to believably simulate human behavior by harnessing the capabilities of large language models. Through repeated prompts to the large language model, these agents operate based on a system architecture consisting of memory streams, reflection, and planning, allowing them to store experiences, learn from them, and translate insights into high-level action plans to interact with their environment. This paper discusses the current landscape of language models and autonomous agents, their advantages and challenges, and the current state of evaluation, and proposes an innovative evaluation benchmark designed to provide a holistic perspective on their performance. Additionally, we examine the impact of fine-tuning such an LLM, evaluate it using our benchmark, and then propose a framework for evaluating both the agents and their underlying LLMs. The existing frameworks for evaluating LLMs and autonomous agents focus on single tasks and are limited in capturing their capabilities. We outline the methodology for evaluating autonomous agents' performance in responding to single and multi-step prompts. The process consists of three key stages: Preparation of the Data, Preparation of the Gold Answers, and Evaluations. We use 20 meticulously crafted unique prompts to challenge the agents, covering simple and complex questions. Using GPT-4, a state-of-the-art model, we generate the initial responses, which undergo rigorous verification to produce gold answers, indicating correctness and revealing the minimum steps required for task completion.
Our evaluation framework relies on two critical metrics: effort metrics, which quantify the steps taken by autonomous agents, and the success rate, which measures their accuracy in achieving task objectives while also tracking model hallucinations. We conduct experiments with ten different models, representing the current landscape of natural language processing models, presenting each with 20 unique prompts. Their responses are meticulously compared to our gold answers and gold steps (optimal number of steps) to generate the evaluation metrics. Similarly, a fine-tuned model is also evaluated with ten different questions, which test the agent's decision-making in selecting the correct tool and the model's ability to reach the correct conclusion to the user's question. This comprehensive approach ensures a thorough assessment of autonomous agents' capabilities. It demonstrates the utility of these metrics, revealing how they can shed light on the strengths and weaknesses of various autonomous agents. As a step toward standardization, we propose transforming the evaluation process of LLMs into an automated framework that accommodates all types of language models, agents, and LLM-based applications. Such an approach promises to establish a unified and comprehensive evaluation methodology, empowering users to make informed decisions when selecting, fine-tuning, and assessing the accuracy of underlying language models and their applications for different domains. In summary, this paper contributes to the ongoing research on evaluating LLMs and autonomous agents by introducing a novel benchmark and proposing a framework, focusing on evaluating the language models while keeping different knowledge domains in mind.
Our framework will enhance our understanding of these technologies and serve as a valuable resource for researchers, engineers, and practitioners working in the ever-evolving landscape of NLP and autonomous systems.
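The success-rate and effort metrics described above can be sketched as follows. The abstract does not give the authors' exact effort definition, so this sketch assumes effort is measured as steps taken beyond the gold (minimum) step count; the result dictionaries are invented for illustration.

```python
def evaluate_agent(results, gold_steps):
    """results: one dict per prompt with 'correct' (bool) and 'steps' (int);
    gold_steps: the verified minimum step count per prompt."""
    n = len(results)
    success_rate = sum(r["correct"] for r in results) / n
    # Assumed effort metric: mean extra steps beyond the gold minimum.
    extra_effort = sum(max(0, r["steps"] - g)
                       for r, g in zip(results, gold_steps)) / n
    return success_rate, extra_effort

sr, effort = evaluate_agent(
    [{"correct": True, "steps": 3}, {"correct": False, "steps": 5}],
    gold_steps=[2, 4],
)
# sr == 0.5, effort == 1.0
```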
- Research Article
4
- 10.1609/aaai.v37i13.26879
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.
- Research Article
- 10.1093/ndt/gfae069.792
- May 23, 2024
- Nephrology Dialysis Transplantation
Background and Aims Large language models (LLMs) have gained significant attention in the field of natural language processing (NLP), marking a shift from traditional techniques like Term Frequency-Inverse Document Frequency (TF-IDF). We developed a traditional NLP model to predict arteriovenous fistula (AVF) failure within the next 30 days using clinical notes. The goal of this analysis was to investigate whether LLMs would outperform traditional NLP techniques, specifically in the context of predicting AVF failure within the next 30 days using clinical notes. Method We defined AVF failure as the change in status from active to permanently unusable or temporarily unusable status. We used data from a large kidney care network from January 2021 to December 2021. Two models were created using LLMs and the traditional TF-IDF technique. We used “distilbert-base-uncased”, a distilled version of the BERT base model [1], and compared its performance with traditional TF-IDF-based NLP techniques. The dataset was randomly divided into 60% training, 20% validation, and 20% test datasets. The test data, comprising unseen patients’ data, was used to evaluate the performance of the model. Both models were evaluated using metrics such as area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, and specificity. Results The incidence of 30-day AVF failure was 2.3% in the population. Both the LLM and the traditional model showed similar overall performance, as summarized in Table 1. Notably, the LLM showed marginally better performance in certain evaluation metrics. Both models had the same AUROC of 0.64 on test data. The accuracy and balanced accuracy for the LLM were 72.9% and 59.7%, respectively, compared to 70.9% and 59.6% for the traditional TF-IDF approach. In terms of specificity, the LLM scored 73.2%, slightly higher than the 71.2% observed for traditional NLP methods. However, the LLM had a lower sensitivity of 46.1% compared to 48% for traditional NLP.
However, it is worth noting that training the LLM took considerably longer than TF-IDF. Moreover, it also required greater computational resources, such as graphics processing unit (GPU) instances in cloud-based services, leading to higher cost. Conclusion In our study, we discovered that advanced LLMs perform comparably to traditional TF-IDF modeling techniques in predicting AVF failure. Both models demonstrated identical AUROC. While specificity was higher for the LLM than for traditional NLP, sensitivity was higher for traditional NLP than for the LLM. The LLM was fine-tuned with a limited dataset, which could have influenced its performance to be similar to that of traditional NLP methods. This finding suggests that while LLMs may excel in certain scenarios, such as performing in-depth sentiment analysis of patient data for complex tasks, their effectiveness is highly dependent on the specific use case. It is crucial to weigh the benefits against the resources required for LLMs, as they can be significantly more resource-intensive and costly compared to traditional TF-IDF methods. This highlights the importance of a use-case-driven approach in selecting the appropriate NLP technique for healthcare applications.
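As a rough illustration of the TF-IDF representation the traditional baseline relies on, here is a minimal from-scratch sketch. The clinical-note snippets are invented, and the idf variant shown (idf = ln(N/df) + 1) is one common smoothing choice; real pipelines typically use a library implementation such as scikit-learn's TfidfVectorizer, whose exact variant differs slightly.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF: tf = count / doc length, idf = ln(N / df) + 1."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many docs each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    vocab = sorted(df)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append([(tf[t] / len(doc)) * (math.log(n / df[t]) + 1)
                     for t in vocab])
    return vocab, vecs

# Hypothetical clinical-note fragments:
vocab, vecs = tfidf_vectors([
    "no thrill palpable",
    "strong bruit present",
    "no bruit no thrill",
])
```

A downstream classifier (e.g. logistic regression over these vectors) is then what gets compared against the fine-tuned DistilBERT model.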
- Research Article
- 10.54254/2755-2721/2025.22701
- May 15, 2025
- Applied and Computational Engineering
Understanding and interpreting code is a crucial task in intelligent software engineering, aiding developers and users in adjusting code for correctness and robustness. The emergence of large language models (LLMs) provides new perspectives for code interpretation tasks. However, current LLM-based code interpretation remains restricted to limited dimensions, lacks a unified evaluation standard, and is missing a comprehensive and systematic assessment methodology. To address this issue, this paper proposes an LLM code understanding evaluation method based on a multi-granularity voting mechanism, aiming to systematically investigate and analyze LLMs' performance in code interpretation tasks. First, we carefully select code snippets from open-source GitHub projects and preprocess them for LLM analysis. Second, we use identical prompts and inputs to test three popular LLMs, recording their output. During this process, we apply prompt engineering techniques to specific target code snippets and conduct repeated experiments to explore the impact of prompt engineering on LLM-generated code explanations. Next, we design evaluation metrics to quantify the LLM outputs and assess their effectiveness based on the obtained scores. Experimental results demonstrate significant differences in code analysis and generation capabilities among the evaluated general-purpose LLMs from different vendors when given identical prompts and inputs. When multiple dimensions are considered in evaluating the generated content, different LLMs exhibit varying strengths in different aspects. Additionally, applying specific prompt engineering techniques can moderate the discrepancies in code analysis and generation capabilities among different LLMs.
- Research Article
6
- 10.1016/j.jbi.2024.104707
- Aug 13, 2024
- Journal of Biomedical Informatics
On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models
- Supplementary Content
30
- 10.2196/52597
- Dec 11, 2024
- Journal of Medical Internet Research
Background Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being’s emotions within oneself. In health care, empathy is fundamental to interactions between health care professionals and patients. It is considered a uniquely human quality that large language models (LLMs) are believed to lack. Objective We aimed to review the literature on the capacity of LLMs to demonstrate empathy. Methods We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv between December 2022 and February 2024. We included English-language full-length publications that evaluated empathy in LLMs’ outputs. We excluded papers evaluating other topics related to emotional intelligence that were not specifically empathy. The included studies’ results, including the LLMs used, performance in empathy tasks, and limitations of the models, along with the studies’ metadata, were summarized. Results A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported LLMs to exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation metrics included automatic metrics such as Recall-Oriented Understudy for Gisting Evaluation and Bilingual Evaluation Understudy, as well as human subjective evaluation. Some studies compared performance on empathy with humans, while others compared between different models. In some cases, LLMs were observed to outperform humans in empathy-related tasks. For example, ChatGPT-3.5 was evaluated for its responses to patients’ questions from social media, where ChatGPT’s responses were preferred over those of humans in 78.6% of cases. Other studies used subjective readers’ assigned scores.
One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for their fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the Levels of Emotional Awareness Scale, which was reported to be higher for ChatGPT-3.5 than for humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions in the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator’s background. Conclusions LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills.
- Research Article
1
- 10.1145/3715109
- Jan 27, 2025
- ACM Transactions on Software Engineering and Methodology
Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is using LLM-based agents to iterate the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should apply to LLMs for code generation tasks. For this purpose, we define the communication skills of LLMs as “being able to ask clarifying questions when the description of the code generation problem has issues”. In this study, we restrict these issues to three matters from the software requirements engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, challenges of programming with LLMs, such as unclear intent specification, may be alleviated, resulting in correct code in the initial iterations. In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to the three issues mentioned above: Inconsistency, Ambiguity, and Incompleteness. We then experimented on HumanEvalComm with different Code LLMs and a new LLM agent approach, Code Clarification and Generation Agent (Okanagan), to identify and ask questions about ambiguous parts of code and descriptions for further refining the generated code. In the evaluation, we introduced an LLM-based evaluator and created Communication Rate and Good Question Rate as the evaluation metrics, representing the ratio of questions asked and of good-quality questions in responses.
We found that more than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. The Pass@1 and Test Pass Rate of most Code LLMs drop by 35% to 52% and by 17% to 35%, respectively, with statistical significance in each category for over 75% of the measurements. Okanagan, as an LLM agent approach that uses an LLM such as ChatGPT 3.5, effectively increases the Communication Rate and Good Question Rate by an absolute 58% and 38%, respectively. Thus, Okanagan boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%, respectively, when the problem descriptions are modified based on the given clarification categories. This result indicates the potential for achieving more effective communication capability using LLM agents. Our benchmark and full code are publicly available at https://github.com/jie-jw-wu/human-eval-comm .
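The Communication Rate and Good Question Rate above can be sketched as simple ratios. One detail is an assumption here: whether Good Question Rate is computed over all responses or only over responses that actually ask a question; this sketch uses the latter.

```python
def communication_metrics(responses):
    """responses: one dict per benchmark problem with
    'asked_question' (bool) and 'good_question' (bool, as judged
    by an LLM-based evaluator)."""
    n = len(responses)
    comm_rate = sum(r["asked_question"] for r in responses) / n
    askers = [r for r in responses if r["asked_question"]]
    good_rate = (sum(r["good_question"] for r in askers) / len(askers)
                 if askers else 0.0)
    return comm_rate, good_rate

cr, gq = communication_metrics([
    {"asked_question": True,  "good_question": True},
    {"asked_question": True,  "good_question": False},
    {"asked_question": False, "good_question": False},
    {"asked_question": False, "good_question": False},
])
# cr == 0.5, gq == 0.5
```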
- Research Article
- 10.1002/jac5.70115
- Sep 18, 2025
- JACCP: JOURNAL OF THE AMERICAN COLLEGE OF CLINICAL PHARMACY
Introduction The emerging capabilities of large language models (LLMs) have drawn increasing attention across various fields, including pharmacy practice in Thailand. Given the extensive number of available medications and the complex, often perplexing prescribing patterns, medication review remains a critical responsibility for pharmacists. This study explored the potential role of LLMs in supporting the medication review process within the Thai health care context, specifically focusing on their ability to detect drug interactions (DIs) and suggest context‐sensitive management strategies. Methods Ten clinical vignettes were constructed, each depicting a patient with a specific drug regimen seeking assistance from a pharmacist. These cases were tailored to reflect the Thai context and represent commonly encountered DIs in Thailand. Each vignette was submitted to a set of LLMs—ChatGPT‐4, ChatGPT‐4o, ChatGPT‐4o mini, Gemini 1.5, Claude 3.5, Microsoft Copilot, and Alisa 3.0—in both English and Thai. Evaluation metrics were developed and validated using the Index of Item‐Objective Congruence. Two independent evaluators assessed all responses, and inter‐rater reliability was measured using weighted Cohen's κ . LLM performance was scored based on percentage ranges, and cumulative scores were reported across evaluation domains. Results The weighted Cohen's κ values across six domains—(A) ability to identify DIs, (B) completeness, (C) clarity, (D) citation reliability, (E) usefulness, and (F) ability to assess harm—exceeded 0.6, indicating substantial inter‐rater agreement. All LLMs showed clinically acceptable performance in both languages. Citation reliability was limited in Alisa 3.0 and Gemini 1.5, while ChatGPT‐4o demonstrated the most consistent and well‐rounded performance. 
Conclusion The selected LLMs demonstrated their potential as capable digital assistants in medication reviews, although some models require further improvement and careful consideration when applied in real‐world settings. Nevertheless, human oversight remains essential; when used in parallel, LLMs and health professionals can work synergistically to enhance patient outcomes.
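The weighted Cohen's κ used above for inter-rater reliability can be sketched directly from its definition. This version assumes linear weights over an ordinal 0..k-1 scale; the study's exact weighting scheme (linear vs. quadratic) is not stated in the abstract.

```python
from collections import Counter

def weighted_kappa(r1, r2, k):
    """Linearly weighted Cohen's kappa for two raters scoring on 0..k-1:
    kappa_w = 1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    n = len(r1)
    obs = Counter(zip(r1, r2))          # joint rating counts
    m1, m2 = Counter(r1), Counter(r2)   # marginal counts per rater
    w = lambda i, j: abs(i - j) / (k - 1)  # linear disagreement weight
    observed = sum(w(i, j) * obs[(i, j)] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * m1[i] * m2[j] / n for i in range(k) for j in range(k))
    return 1 - observed / expected

kappa = weighted_kappa([2, 1, 0, 2], [2, 0, 0, 1], k=3)  # 0.5
```

Values above 0.6, as reported across the six domains, are conventionally read as substantial agreement.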
- Research Article
1
- 10.3897/biss.8.136735
- Sep 10, 2024
- Biodiversity Information Science and Standards
Recently, Large Language Models (LLMs) have transformed information retrieval, becoming widely adopted across various domains due to their ability to process extensive textual data and generate diverse insights. Biodiversity literature, with its broad range of topics, is no exception to this trend (Boyko et al. 2023, Castro et al. 2024). LLMs can help in information extraction and synthesis, text annotation and classification, and many other natural language processing tasks. We leverage LLMs to automate the information retrieval task from biodiversity publications, building upon data sourced from our previous work (Ahmed et al. 2024). In our previous work (Ahmed et al. 2023, Ahmed et al. 2024), we assessed the reproducibility of deep learning (DL) methods used in biodiversity research. We developed a manual pipeline to extract key information on DL pipelines—dataset, source code, open-source frameworks, model architecture, hyperparameters, software and hardware specs, randomness, averaging result and evaluation metrics from 61 publications (Ahmed et al. 2024). While this allowed analysis, it required extensive manual effort by domain experts, limiting scalability. To address this, we propose an automatic information extraction pipeline using LLMs with the Retrieval Augmented Generation (RAG) technique. RAG combines the retrieval of relevant documents with the generative capabilities of LLMs to enhance the quality and relevance of the extracted information. We employed an open-source LLM, Hugging Face implementation of Mixtral 8x7B (Jiang et al. 2024), a mixture of expert models in our pipeline (Fig. 1) and adapted the RAG pipeline from earlier work (Kommineni et al. 2024). The pipeline was run on a single NVIDIA A100 40GB graphics processing unit with 4-bit quantization. To evaluate our pipeline, we compared the expert-assisted manual approach with the LLM-assisted automatic approach. 
We measured their consistency using the inter-annotator agreement (IAA) and quantified it with the Cohen Kappa score (Pedregosa et al. 2011), where a higher score indicates more reliable and aligned outputs (1: maximum agreement, -1: no agreement). The Kappa score among human experts (annotators 1 and 2) was 0.54 (moderate agreement), while the scores comparing human experts with the LLM were 0.16 and 0.12 (slight agreement). The difference is partly due to human annotators having access to more information (including code, dataset, figures, tables and supplementary materials) than the LLM, which was restricted to the text itself. Given these restrictions, the results are promising but also show the potential to improve them by adding further modalities to the LLM inputs. Future work will involve several key improvements to our LLM-assisted information retrieval pipeline: Incorporating multimodal data (e.g., figures, tables, code, etc.) as input to the LLM, alongside text, to enhance the accuracy and comprehensiveness of the information retrieved from publications. Optimizing the retrieval component of the RAG framework with advanced techniques like semantic search, hybrid search or relevance feedback can improve the quality of outputs. Expanding the evaluation to a larger corpus of biodiversity literature could provide a more comprehensive understanding of pipeline capabilities, and this paves the way for pipeline optimization. A human-in-the-loop approach for evaluating the LLM-generated outputs by matching the ground truth values from the respective publications, will increase the quality of the overall pipeline. Employing more metrics for the evaluation beyond the Cohen Kappa score to better understand the LLM-assisted outputs. Leveraging LLMs to automate information retrieval from biodiversity publications signifies a notable advancement in the scalable and efficient analysis of biodiversity literature. Initial results show promise, yet there is substantial potential for enhancement through the integration of multimodal data, optimized retrieval mechanisms, and comprehensive evaluation. By addressing these areas, we aim to improve the accuracy and utility of our pipeline, ultimately enabling broader and more in-depth analysis of biodiversity literature.
- Conference Article
- 10.54941/ahfe1006042
- Jan 1, 2025
Cognitive styles, which shape designers’ thinking, problem-solving, and decision-making, influence strategies and preferences in design tasks. In team collaboration, diverse cognitive styles enhance problem-solving efficiency, foster creativity, and improve team performance. The ‘Co-evolution of problem–solution’ model serves as a key theoretical framework for understanding differences in designers’ cognitive styles. Based on this model, designers can be categorized into two cognitive styles: problem-driven and solution-driven. Problem-driven designers prioritize structuring the problem before developing solutions, while solution-driven designers generate solutions while design problems are still ill-defined, and then work backward to define the problem. Designers with different expertise and disciplinary backgrounds exhibit distinct cognitive style tendencies. Different cognitive styles also adapt differently to design tasks, excelling in some more than others. As a rapidly advancing technology, large language models (LLMs) have shown considerable potential in the field of design. Their powerful generative capabilities position them as potential collaborators in design teams, emulating different cognitive styles. These emulations aim to bridge cognitive differences among team members, enable designers to leverage their individual strengths, and ultimately produce more feasible and higher-quality design solutions. However, previous studies have been limited to leveraging LLMs to directly generate design outcomes based on different cognitive styles, neglecting the emulation of the design process itself. In fact, the evolutionary development between problem and solution spaces better reflects the core differences in cognitive styles. Moreover, communication and collaboration within design teams extend beyond simply exchanging solutions, spanning multiple stages of the design process, from problem analysis and idea generation to evaluation.
To better integrate LLMs into design teams, it is necessary to consider emulation of the design cognition process. To this end, our study, based on the cognitive style taxonomy proposed by Dorst and Cross (2001), explores how LLMs can be used to emulate the design processes of problem-driven and solution-driven designers. We develop a zero-shot chain-of-thought (CoT)-based prompting strategy that enables LLMs to emulate the step-by-step cognitive flow of both design styles. The prompt design is inspired by Jiang et al. (2014) and Chen et al. (2023), who analyzed cognitive differences in the conceptual design process using the FBS ontology model. Furthermore, to evaluate the effectiveness of LLMs in emulating cognitive styles, this study establishes three-dimensional evaluation metrics: static distribution (the proportion and preference of cognitive issues), dynamic transformation (behavioral transition patterns), and the creativity of the design outcomes. Using human design behaviours identified in previous studies as a benchmark, we compare the cognitive styles emulated by LLMs under different design constraints against human performance to assess their alignment and differences. The results show that LLM-generated design processes align well with human cognitive styles, effectively emulating static cognitive characteristics, enhancing novelty and integrity in solutions, and demonstrating superior creativity compared to baseline methods. However, LLMs lack the fully complex nonlinear transitions between problem and solution spaces observed in human designers. This process-based emulation has the potential to enhance the application of LLMs in design teams, enabling them not only to serve as tools for generating solutions but also to provide support for collaboration during key stages of the design process.
Future research should enhance LLMs' reasoning flexibility through fine-tuning or the GoT approach and explore their impact on human-AI collaboration across diverse design tasks to refine their role in design teams.
- Research Article
- 10.1097/js9.0000000000003631
- Oct 15, 2025
- International journal of surgery (London, England)
How does AI compare to the experts in a Delphi setting: simulating medical consensus with large language models.
- Preprint Article
- 10.2196/preprints.68320
- Nov 3, 2024
BACKGROUND Medical question answering (QA) is essential for various medical applications. While small-scale pre-trained language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA. OBJECTIVE KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. The retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA. METHODS In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge.
Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese). RESULTS RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively. CONCLUSIONS Experimental results demonstrate that RAG fine-tuning method can improve the model performance in medical QA. RAG-mQA achieves greater improvements compared to other knowledge-enhanced methods. CLINICALTRIAL This study does not involve trial registration.
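The abstract reports F1 and exact match (EM) as its evaluation metrics but does not spell out their computation. A minimal sketch of the standard SQuAD-style definitions (lowercasing, stripping punctuation and English articles, then comparing token sets) is below; the function names and normalization details are illustrative, not taken from the RAG-mQA paper, and the Chinese MedicalQA dataset would need a different tokenizer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are then the mean of these per-example values, which is how percentage-point differences such as the reported 2.96% EM gain would be computed.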
- Research Article
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications.
Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions.
Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards.
Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
- 10.1038/s41698-025-00916-7
- May 23, 2025
- npj Precision Oncology
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning ability, and their performance has been evaluated in several healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% on close-ended questions and an average expert evaluation score of 6.9/10 on open-ended questions. On the VQA dataset, Gemini achieved the highest close-ended accuracy at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 on open-ended questions. In addition, LLMs and LVLMs exhibited varying abilities across question topics and difficulty levels. However, their performance remains inferior to the expertise of cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
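The CCBench abstract reports close-ended accuracy broken down by topic and difficulty level. As a rough sketch of that kind of grouped scoring (the record schema and function name below are illustrative assumptions, not from the paper):

```python
from collections import defaultdict

def accuracy_by_group(records, key="topic"):
    """Per-group accuracy on close-ended questions.

    records: iterable of dicts with the grouping key (e.g. 'topic' or
    'difficulty'), a model 'prediction', and the 'gold' answer label.
    Returns {group: fraction of exact label matches}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["gold"])
    return {g: correct[g] / total[g] for g in total}
```

Open-ended questions, by contrast, cannot be scored by label matching, which is why the study pairs this style of accuracy with GPT-assisted and expert ratings on a 0-10 scale.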