The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior

Similar Papers
  • Research Article
  • Cited by 44
  • 10.1111/epi.17570
Are AI language models such as ChatGPT ready to improve the care of individuals with epilepsy?
  • Mar 13, 2023
  • Epilepsia
  • Christian M Boßelmann + 2 more

Epilepsy is a neurological disorder characterized by recurrent seizures, which can significantly impact the quality of life of affected individuals. Fortunately, advances in artificial intelligence (AI) are providing new opportunities to improve the diagnosis and treatment of epilepsy. Briefly, examples of ongoing epilepsy-related AI research include (1) algorithms that can analyze large amounts of electroencephalography (EEG) time-series data to label interictal epileptiform discharges both independently and with human supervision,1, 2 (2) diagnostic biomedical imaging with automated magnetic resonance imaging (MRI)–based lesion detection, surgical decision-making support, and outcome prediction,3, 4 and (3) Clinical Decision Support Systems (CDSS) that use patient data to provide physicians with recommendations based on up-to-date evidence and guidelines for overall improved diagnostic and therapeutic accuracy.5, 6 Language models are often used in chatbots and other conversational systems to generate context-aware, human-like text in response to an input prompt from a user. Such models are trained on large data sets of human conversations using machine learning (ML) techniques to learn the patterns and structure of natural language. Various AI language models have been developed since the 1950s, but significant advances have only been made in recent years due to improved ML models paired with an increased availability of large amounts of data and computational resources. Some of the earliest examples of such models include ELIZA, developed in the 1960s (one of the first programs to simulate a patient-doctor relationship), and SHRDLU from the 1970s (a program able to emulate dialogue around a simplified world with a limited number of objects, the "blocks world").7, 8 However, these early language models were inherently limited in their capabilities and could perform only a narrow range of tasks.
In recent years, more complex, large language models have led to significant progress in natural language processing. Several of these AI language models can be used for dialogue, for example, (1) GPT-3 (Generative Pre-trained Transformer 3), a state-of-the-art language model developed by OpenAI that can generate contextual human-like text for a wide range of applications, including dialogues9; (2) DialoGPT, a language model developed by Microsoft that is trained on a large data set of social media comment chains and can generate responses in single-turn conversations10; (3) Meena, a sensible and specific language model developed by Google that is trained on human–human conversations from public-domain social media and can generate responses that are coherent and contextually appropriate11; and (4) XLNet, a language model developed by Google and Carnegie Mellon University that is capable of several language modeling tasks including question answering, natural language inference, sentiment analysis, and document ranking; and many others.12 Mainly such algorithms enable the analysis of free-text electronic medical records and other written materials (e.g., test results and treatment plans) that are otherwise inaccessible without preprocessing and standardization. By analyzing large amounts of free-text medical records, language models can learn to identify and summarize relevant patterns. Possible outcomes are information on identified hierarchical patient subgroups based on seizure patterns, documented treatment options, and outcome parameters.13-15 This structured information could be queried to provide personalized treatment recommendations based on medical history and other relevant factors. 
For example, by identifying early candidates for epilepsy surgery, language models can help minimize treatment delays and improve patient outcomes.16, 17 Another example of how language models can improve health care are Clinical Decision Support Systems (CDSS) trained to understand and offer natural responses to queries from health care providers. CDSS can provide medical or surgical treatment recommendations, suggest relevant clinical guidelines or protocols, and alert health care providers to potential errors or risks. Similar methods may be used to create virtual assistants for individuals with epilepsy to answer questions and provide easy access to information about their condition, treatment options, and other related topics, including driving, causes of premature death (including sudden unexpected death in epilepsy [SUDEP]), and status epilepticus.18, 19 Overall, AI language models have the future potential to significantly improve the care and management of individuals with epilepsy by providing natural conversational interfaces to both patients and physicians, allowing for easy access to structured information. We tested ChatGPT (ChatGPT Dec 15 Version, available at chat.openai.com, last accessed 01/07/2023 at 9:30 p.m.) for some of the use cases outlined above and provided the prompts used and model responses in Figure 1. First, we assumed the role of an individual with epilepsy taking levetiracetam. The model correctly responded that aggression is a possible side effect and recommended follow-up with the prescribing physician (Figure 1A).20 We then requested an Acute Seizure Action Plan (ASAP), a structured treatment plan used to guide patients and caregivers in the event of an epileptic seizure. 
The model provided a reasonable first draft in line with expert recommendations (Figure 1B).21 We found this useful to quickly generate general patient-facing informational content, but note that each ASAP should be subject to human review to screen for misinformation, and to personalize the draft to include additional information from the individual's medical history and seizure types. We proceeded to present the model with a short, simplified case study of an individual with treatment-resistant left mesial temporal lobe epilepsy. Of interest, the model correctly integrated the medical history and diagnostic findings, noting that hippocampal sclerosis presents an epileptogenic lesion before proceeding to recommend epilepsy surgery. Although this assessment represents a simplification of phase I presurgical evaluation findings and surgical strategies, the overall recommendation is sound.22 However, limitations became apparent when we informed the model that the previously discussed patient now had additional evidence of right temporal lobe seizure onset. Although the initial response is still appropriate, the following advice is actively harmful (Figure 1D). The model confidently states that the patient's health care team may consider bilateral temporal lobectomy or removal of both temporal lobes and the adjacent frontal and parietal lobes (a procedure incorrectly defined as "hemispherotomy" by the model). Finally, even simple queries for structured information may fail if it concerns particularly specialized or disputed areas of knowledge. In Figure 1E, we queried if there is a relationship between variants in SCN9A and autosomal dominant epilepsy. The positive response was incorrect, likely due to misinformation in the academic literature present in the model's training data. 
Any relationship between variants in SCN9A and epilepsy has been refuted.23, 24 Previous research, as outlined above, has focused on language models trained on large amounts of public-domain data of general human conversations, commonly involving text messages from social media sites (Twitter, Reddit, Facebook, etc.) and some additional training data from books or academic literature. Indeed, the use cases shown above do not accurately represent the limits of this tool, as it was likely not trained on a sufficiently extensive, high-quality, domain-specific data set. It is important to note that language models cannot easily deal with disputed areas of knowledge and may not provide correct answers when contradictions are present in the input data. In light of these general considerations and the specific use cases outlined above, we argue that oversight from medical professionals will be needed to distill training information, and that all current AI applications need to be utilized in combination with human expertise. This is made immediately relevant by the fact that the broad ethical and legal implications of generative models are subjects of ongoing debate, with developers denying liability that may then fall onto the clinician user. Another important limitation of language models is an issue coined "hallucination," which describes confidently formulated answers with incorrect or nonsensical content.25 This misinformation is a result of biased training data or mismatches between token encoding and concept representation, and it is particularly difficult to identify. Finally, users should be aware that language models show bias against individuals based on gender, race, or disability.26 This issue is particularly sensitive in epilepsy, where stigma is still prevalent.27 Extraction of structured information from electronic medical records and assistance with simple human-supervised tasks are feasible use-case scenarios. 
However, these systems will need to be thoroughly tested and rigorously validated before they can be used in clinical care, in line with existing regulations on Software as a Medical Device or AI/ML-Enabled Medical Devices.28 Ultimately, the use of AI language models in epilepsy care will depend on developing robust and reliable systems as per the Ethics Guidelines for Trustworthy Artificial Intelligence,29 driven by community-based data sharing and epilepsy-specific AI research. Outside of the clinical care of patients, several successful applications of language models (e.g., smart data processing, content generation, and sentiment analysis) provide a promising perspective of AI-augmented future clinical practice. To achieve similar success stories with AI language models in epilepsy and general clinical practice, we will need to develop protocols for applying decentralized language learning models (i.e., using federated learning) on distributed identifiable patient data from multiple institutions. These coordinated decentralized language models will take advantage of the collective knowledge and insights of multiple sources, including specialty fields like epilepsy, while protecting patient privacy. We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines. Author contributions: Christian M Boßelmann: Conceptualization, Writing – original draft; Costin Leu: Writing – review & editing; Dennis Lal: Writing – review & editing, Supervision. The authors report no conflicts of interest.

  • Preprint Article
  • 10.31234/osf.io/p7hvw_v2
Modelling Implicit Bias in Gender–Career Associations: A systematic comparison of language models
  • May 23, 2025
  • Alexander Porshnev + 5 more

Biases in language and their reflection in language models have attracted researchers' attention, particularly with the growth of large language models (LLMs). However, many questions on the links between language models and people’s biased attitudes remain unanswered. In the current study we focus on gender–career bias to examine the extent to which language models can be used to model behavioural responses in the Gender–Career Implicit Association Test (IAT). We provide a systematic evaluation of a range of language models, including n-gram, count vector, predict (word2vec), and Large Language Models (LLMs), to determine how well they capture people’s behaviour in the IAT. We compared response time data from over 800,000 participants against 25 language models, with a total of 675 model variants. We find that many language models, including large language models (LLMs), correlated well with human behavior. While results support previous findings for both predict and count model families, we observed that performance of LLMs was consistently different from that of simpler predict models, particularly in terms of the direction and strength of correlations with reaction time and bias. This divergence may indicate successful attempts to mitigate bias in LLMs while preserving other aspects of linguistic information. Our findings reinforce the idea that societal biases are generally encoded in language, but that large language models can exhibit behaviors different to classical language models.
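The embedding-model side of such an evaluation can be illustrated with a WEAT-style effect size, a standard statistic for measuring differential association between target and attribute word sets in vector-space models. This is a generic sketch with toy two-dimensional vectors, not the study's actual pipeline (which additionally models reaction times); all names here are illustrative:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): how much closer w sits to attribute set A than to set B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # WEAT effect size: difference in mean association of target sets X and Y
    # with attribute sets A and B, scaled by the pooled standard deviation.
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=0)
```

With X and Y as, say, career and family target vectors and A and B as male and female attribute vectors, a positive effect size indicates the stereotypical association; an IAT comparison like the one in the paper then relates such scores to human response times.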

  • Discussion
  • 10.1111/cogs.13430
Large Language Models: A Historical and Sociocultural Perspective.
  • Mar 1, 2024
  • Cognitive science
  • Eugene Yu Ji

This letter explores the intricate historical and contemporary links between large language models (LLMs) and cognitive science through the lens of information theory, statistical language models, and socioanthropological linguistic theories. The emergence of LLMs highlights the enduring significance of information-based and statistical learning theories in understanding human communication. These theories, initially proposed in the mid-20th century, offered a visionary framework for integrating computational science, social sciences, and humanities, which nonetheless was not fully fulfilled at that time. The subsequent development of sociolinguistics and linguistic anthropology, especially since the 1970s, provided critical perspectives and empirical methods that both challenged and enriched this framework. This letter proposes that two pivotal concepts derived from this development, metapragmatic function and indexicality, offer a fruitful theoretical perspective for integrating the semantic, textual, and pragmatic, contextual dimensions of communication, an amalgamation that contemporary LLMs have yet to fully achieve. The author believes that contemporary cognitive science is at a crucial crossroads, where fostering interdisciplinary dialogues among computational linguistics, social linguistics and linguistic anthropology, and cognitive and social psychology is in particular imperative. Such collaboration is vital to bridge the computational, cognitive, and sociocultural aspects of human communication and human-AI interaction, especially in the era of large language and multimodal models and human-centric Artificial Intelligence (AI).

  • Front Matter
  • 10.1162/artl_e_00409
Editorial: What Have Large-Language Models and Generative AI Got to Do With Artificial Life?
  • May 1, 2023
  • Artificial life
  • Alan Dorin + 1 more

Accessible generative artificial intelligence (AI) tools like large-language models (LLMs) (e.g., ChatGPT, Minerva) are raising a flurry of questions about the potential and implications of generative algorithms and the ethical use of AI-generated text in a variety of contexts, including open science (Bugbee & Ramachandran, 2023), student assessment (Heidt, 2023), and medicine (Harrer, 2023). Similarly, among the graphic and visual arts communities, the use of generative image synthesis algorithms (e.g., DALL-E, Midjourney, Stable Diffusion) that take text prompts as input and produce works in the style of a particular human artist, or no artist who ever lived, is causing consternation and posing challenging questions (Murphy, 2022; Plunkett, 2022). The use of generative AI to create deep fakes has also been in the spotlight (Ruiter, 2021), as has its role in answering scientific research questions directly (Castelvecchi, 2023). To our minds, the questions these technologies are raising do not seem to be of a fundamentally different character to questions asked about AI for many years. They largely concern (a) what is possible, (b) what is right, and (c) the implications of the technology's use. For instance,

  • Research Article
  • Cited by 12
  • 10.1080/02763869.2023.2194149
ChatGPT, an Opportunity to Understand More About Language Models
  • Apr 3, 2023
  • Medical Reference Services Quarterly
  • Borui Zhang

ChatGPT, a leading large language model, has achieved some success beyond previous language models and caught the world’s attention since its release in late 2022. Businesses and healthcare professional fields have raised strong interests in investing in large language models to assist various kinds of information searching in their domain of expertise. Under the influence of ChatGPT, searched information may be received in a new personalized chat format, in contrast to the traditional search engines with pages of results for users to evaluate and open. Large language models and generative AI present new opportunities for librarians to understand more about language models’ development as well as the future directions of the language models that are developed behind the user interfaces. Being aware of how language models impact the communication of information will enrich librarians’ abilities to examine the quality of AI outputs and awareness of users’ rights and data curation policies, to better assist patrons’ research activities that involve using language models in the foreseeable future.

  • Research Article
  • 10.9781/ijimai.2025.09.005
Blending Language Models and Domain-Specific Languages in Computer Science Education. A Case Study on API RESTFul
  • Oct 3, 2025
  • International Journal of Interactive Multimedia and Artificial Intelligence
  • Francisco Jurado + 3 more

Since Computer Science students are used to applying both General Purpose Programming Languages (GPPLs) and Domain-Specific Languages (DSLs), Generative Artificial Intelligence based on Language Models (LMs) can help them with automatic tasks, allowing them to focus on more creative tasks and higher skills. However, the teaching and evaluation of technical tasks in Computer Science can be inefficient and prone to errors. Thus, the main objective of this article is to explore the performance of LMs compared to that of undergraduate Computer Science students in a specific case study: designing and implementing RESTful API DSLs. This research aims to determine if LMs can enhance the efficiency and accuracy of these processes. Our case study involved 39 students and 5 different LMs that must use the two DSLs we also designed for their task assignment. To evaluate performance, we applied uniform criteria to student and LM-generated solutions, enabling a comparative analysis of accuracy and effectiveness. With a case study comparing performance between students and LMs, this article contributes to checking to what extent LMs are able to carry out software development tasks involving the use of new DSLs specially designed for highly specific settings in a similar way to well-qualified Computer Science students. The results underscore the importance of well-defined DSLs and effective prompting processes for optimal LM performance. Specifically, LMs demonstrated high variability in task execution, with two GPT-based LMs achieving grades similar to those scored by the best of the students for every task, obtaining 0.78 and 0.92 on a normalized scale [0, 1], with standard deviations of 0.23 and 0.14 for ChatGPT-4 and ChatGPT-4o respectively. After the experience, we can conclude that a well-defined DSL and a proper prompting process, providing the LM with metadata, persistent prompts, and a good knowledge base, are crucial for good LM performance.
When LMs receive the right prompts, both large and small LMs can achieve excellent results depending on the task.

  • Conference Article
  • Cited by 1
  • 10.54941/ahfe1004478
Magenta: Metrics and Evaluation Framework for Generative Agents based on LLMs
  • Jan 1, 2024
  • Sudarshan Kamath Barkur + 3 more

Large Language Models (LLMs) have emerged as a driving force in the field of Natural Language Processing (NLP) with applications spanning various domains, including the development of Autonomous Generative Agents. Generative Agents are computational software programs designed to believably simulate human behavior by harnessing the capabilities of large language models. Through repetitive prompts against the large language model, these agents operate based on a system architecture consisting of memory streams, reflection, and planning, allowing them to store experiences, learn from them, and translate insights into high-level action plans to interact with their environment. This paper discusses the current landscape of language models and autonomous agents, their advantages and challenges, and the current state of evaluation, and proposes an innovative evaluation benchmark designed to provide a holistic perspective on their performance. Additionally, we see the impact of fine-tuning such an LLM, evaluate using our benchmark, and then propose a framework for evaluation of both the agents and their underlying LLMs. The existing frameworks for evaluating LLMs and autonomous agents focus on single tasks and are limited in capturing their capabilities. We outline the methodology for evaluating autonomous agents' performance in responding to single and multi-step prompts. The process consists of three key stages: Preparation of the data, Preparation of the Gold Answers, and Evaluations. We use the meticulously crafted 20 unique prompts to challenge the agents, covering simple and complex questions. Using GPT-4, a state-of-the-art model, we generate the initial responses, which undergo rigorous verification to produce gold answers, indicating correctness and revealing the minimum steps required for task completion. 
Our evaluation framework relies on two critical metrics: the effort metrics, quantifying the steps taken by autonomous agents, and the success rate, measuring their accuracy in achieving task objectives and also keeping track of hallucinations of the model. We conduct experiments with ten different models, representing the current landscape of natural language processing models, presenting each with 20 unique prompts. Their responses are meticulously compared to our gold answers and gold steps (optimal number of steps) to generate the evaluation metrics. Similarly, a fine-tuned model is also evaluated with ten different questions, which test the agent's decision-making process by selecting the correct tool and then the ability of the model to reach the correct conclusion to the question asked by the user in this process. This comprehensive approach ensures a thorough assessment of autonomous agents' capabilities. It demonstrates the utility of these metrics, revealing how they can shed light on the strengths and weaknesses of various autonomous agents. As a step toward standardization, we propose transforming the evaluation process of LLMs into an automated framework that accommodates all types of language models, agents, and LLM-based applications. Such an approach promises to establish a unified and comprehensive evaluation methodology, empowering users to make informed decisions when selecting, fine-tuning, and assessing the accuracy of underlying language models and their applications for different domains. In summary, this paper contributes to the ongoing research on evaluating LLMs and autonomous agents by introducing a novel benchmark and proposing a framework, focusing on evaluating the language models while keeping different knowledge domains in mind.
Our framework will enhance our understanding of these technologies and serve as a valuable resource for researchers, engineers, and practitioners working in the ever-evolving landscape of NLP and autonomous systems.
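The two headline metrics above, success rate against gold answers and effort against gold steps, can be made concrete with a small sketch. The exact scoring rules here (case-insensitive exact answer match, effort as the ratio of steps taken to the gold minimum over successful runs) are illustrative assumptions, not the paper's published definitions:

```python
def evaluate_agent(runs, gold):
    """Score agent runs against gold answers and gold (minimal) step counts.

    runs: {prompt_id: {"answer": str, "steps": int}}
    gold: {prompt_id: {"answer": str, "min_steps": int}}
    Returns (success_rate, mean_effort_ratio); effort is averaged over
    successful runs only, and is None if no run succeeded.
    """
    successes, effort_ratios = 0, []
    for pid, g in gold.items():
        r = runs.get(pid)
        if r is None:
            continue  # unanswered prompts count as failures
        if r["answer"].strip().lower() == g["answer"].strip().lower():
            successes += 1
            effort_ratios.append(r["steps"] / g["min_steps"])
    success_rate = successes / len(gold)
    mean_effort = sum(effort_ratios) / len(effort_ratios) if effort_ratios else None
    return success_rate, mean_effort
```

An effort ratio above 1.0 means the agent needed more steps than the gold-standard minimum, which is exactly the kind of inefficiency such a benchmark is designed to surface.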

  • Research Article
  • Cited by 46
  • 10.1145/3591300
Prompting Is Programming: A Query Language for Large Language Models
  • Jun 6, 2023
  • Proceedings of the ACM on Programming Languages
  • Luca Beurer-Kellner + 2 more

Large language models have demonstrated outstanding performance on a wide range of tasks such as question answering and code generation. On a high level, given an input, a language model can be used to automatically complete the sequence in a statistically likely way. Based on this, users prompt these models with language instructions or examples, to implement a variety of downstream tasks. Advanced prompting methods can even imply interaction between the language model, a user, and external tools such as calculators. However, to obtain state-of-the-art performance or adapt language models for specific tasks, complex task- and model-specific programs have to be implemented, which may still require ad-hoc interaction. Based on this, we present the novel idea of Language Model Programming (LMP). LMP generalizes language model prompting from pure text prompts to an intuitive combination of text prompting and scripting. Additionally, LMP allows constraints to be specified over the language model output. This enables easy adaptation to many tasks while abstracting language model internals and providing high-level semantics. To enable LMP, we implement LMQL (short for Language Model Query Language), which leverages the constraints and control flow from an LMP prompt to generate an efficient inference procedure that minimizes the number of expensive calls to the underlying language model. We show that LMQL can capture a wide range of state-of-the-art prompting methods in an intuitive way, especially facilitating interactive flows that are challenging to implement with existing high-level APIs. Our evaluation shows that we retain or increase the accuracy on several downstream tasks, while also significantly reducing the required amount of computation or cost in the case of pay-to-use APIs (26-85% cost savings).
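The core idea of constraining model output during decoding can be sketched independently of LMQL's actual syntax: at each step, candidate tokens that would violate the constraint are masked out before the highest-scoring remaining token is chosen. Here `score_fn` is a hypothetical stand-in for language-model logits, not a real model API:

```python
def constrained_decode(score_fn, vocab, constraint, max_tokens=5):
    """Greedy decoding with output constraints, in the spirit of LMP.

    At each step, tokens whose addition would violate the constraint are
    removed before the highest-scoring remaining token is selected.
    score_fn(prefix, token) -> float stands in for language-model logits;
    constraint(partial_output) -> bool must accept valid prefixes.
    """
    out = []
    for _ in range(max_tokens):
        allowed = [t for t in vocab if constraint(out + [t])]
        if not allowed:
            break  # no token can extend the output without violating the constraint
        out.append(max(allowed, key=lambda t: score_fn(out, t)))
    return out
```

For example, a constraint restricting the answer to a fixed set ("yes"/"no") overrides an otherwise higher-scoring token; LMQL compiles such declarative constraints into token masks so that invalid continuations never consume expensive model calls.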

  • Research Article
  • Cited by 3
  • 10.1609/aaai.v38i17.29948
Quantifying and Analyzing Entity-Level Memorization in Large Language Models
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Zhenhong Zhou + 3 more

Large language models (LLMs) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts. As the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate potential privacy risks. However, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making it difficult for applications in real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. In addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed definition, probing language models' ability to reconstruct sensitive entities under different settings. We find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakages. The results demonstrate that LLMs not only memorize their training data but also understand associations between entities. These findings necessitate that trainers of LLMs exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations.
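A minimal version of the measurement reads as follows; scoring by verbatim containment of the entity in the model's continuation is a simplification of the paper's fine-grained metrics, and `model_complete` is a hypothetical stand-in for autoregressive generation:

```python
def entity_memorization_rate(model_complete, records):
    """Fraction of (prefix, entity) pairs for which the model's continuation
    of the prefix reproduces the sensitive entity verbatim.

    model_complete(prefix) -> str is a stand-in for autoregressive generation;
    records is a list of (prompt_prefix, sensitive_entity) pairs.
    """
    hits = sum(1 for prefix, entity in records if entity in model_complete(prefix))
    return hits / len(records)
```

A high rate on held-out prefixes would indicate that the model can be prompted into leaking entities it saw during training, which is the privacy risk the paper quantifies.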

  • Research Article
  • 10.59573/emsj.9(5).2025.25
Choosing the Right Model for Enterprise AI Applications: A Comprehensive Analysis of Small Language Models versus Large Language Models in Enterprise Architecture
  • Sep 11, 2025
  • European Modern Studies Journal
  • Venkata Kiran Chand Vemulapalli

The enterprise deployment of artificial intelligence language models presents organizations with critical architectural decisions between Small Language Models and Large Language Models, each offering distinct advantages and operational considerations. Small Language Models, characterized by parameter counts ranging from millions to several billion, provide computational efficiency, rapid deployment capabilities, and cost-effective solutions for real-time applications requiring millisecond response times. Large Language Models, featuring billions to trillions of parameters, deliver sophisticated contextual understanding, complex reasoning abilities, and comprehensive knowledge bases suitable for advanced content generation and analytical tasks. Enterprise environments must evaluate infrastructure requirements, with Small Language Models operating effectively on standard CPU configurations and minimal memory footprints, while Large Language Models demand GPU clusters and substantial computational resources. The architectural choice significantly impacts system performance, operational costs, scalability potential, and long-term strategic positioning. Contemporary enterprise implementations increasingly recognize hybrid deployment strategies that leverage the complementary strengths of both model categories, enabling organizations to optimize resource utilization while addressing diverse application requirements. Future developments in neural architecture search, hardware-software co-design methodologies, and federated learning frameworks promise to reshape enterprise AI deployment strategies, creating opportunities for more efficient and scalable artificial intelligence solutions.

  • Research Article
  • Cited by 2
  • 10.1186/s12967-025-06871-y
ECBT-I dialogue system: a comparative evaluation of large language models and adaptation strategies for insomnia treatment
  • Aug 5, 2025
  • Journal of Translational Medicine
  • Xueying Bao + 16 more

Background: Traditional face-to-face mental health treatments are often limited by time and space. Thanks to advances in large language models (LLMs), digital mental health treatments can provide personalized advice to patients and improve compliance. However, in the field of CBT-I, specialized real-time interactive dialogue platforms have not been fully developed.

Methods: Our research team constructed an eCBT-I intelligent dialogue system based on the RAG architecture, aiming to provide an example of the deep integration of CBT-I knowledge graphs and large language models. To optimize the performance of the system's core language generation module on the insomnia dialogue dataset, we systematically included eight mainstream large language models (ChatGLM2-6b, ChatGLM3-6b, Baichuan-7b, Baichuan-13b, Qwen-7b, Qwen2-7b, Llama-2-7b-chat-hf, and Llama-2-13b-chat-hf) and three adaptation strategies (LoRA, QLoRA, and Freeze). We screened the suitability of each adaptation strategy for the eight language models, determining the best adaptation method for each model to maximize its performance improvement. The eight best-adapted models were then evaluated along three dimensions to compare their performance on the small-sample sleep dialogue dataset and the C-Eval dataset. All subjects evaluated under experimental conditions were drawn from historical medical records of patients who did not exhibit delirium and had normal language expression abilities.

Results: By matching model characteristics to adaptation strategies and evaluating the models side by side, we compared the contribution of the different fine-tuning strategies to performance improvement on the small insomnia dialogue dataset, and determined that Qwen2-7b (Freeze) performed best on that dataset.

Conclusions: This study effectively integrates the CBT-I knowledge graph with a large language model through the RAG architecture, improving the professionalism of the eCBT-I intelligent dialogue system. The systematic fine-tuning selection process and confirmation of the optimal model not only improve the adaptability of large language models to the CBT-I task, but also provide a useful paradigm for AI applications in medical subfields constrained by resources and data collection, laying a solid foundation for more accurate and efficient digital CBT-I clinical practice.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12967-025-06871-y.
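The LoRA adaptation strategy compared in this study can be illustrated in a few lines: rather than updating a full weight matrix W, training learns only a low-rank product BA that is added to the frozen W. The following numpy sketch shows the idea with arbitrarily chosen dimensions; it is a toy illustration, not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4          # layer dimensions and a low rank r << d
W = rng.normal(size=(d, k))  # frozen pretrained weight (never updated)

# LoRA trains only the low-rank factors A (r x k) and B (d x r).
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))         # B starts at zero so the adapted layer equals W

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass through the adapted layer: x @ (W + B @ A).T"""
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))
# With B == 0, the adapter is a no-op: outputs match the frozen layer.
print(np.allclose(adapted_forward(x), x @ W.T))  # True

# Trainable parameters drop from d*k to r*(d + k): 4096 -> 512 here.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

QLoRA follows the same scheme with the frozen W held in quantized form, and the Freeze strategy instead unfreezes a subset of the original layers; the common thread is that only a small fraction of parameters is updated.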

  • Research Article
  • Cite Count Icon 77
  • 10.3390/app14052074
A Review of Current Trends, Techniques, and Challenges in Large Language Models (LLMs)
  • Mar 1, 2024
  • Applied Sciences
  • Rajvardhan Patil + 1 more

Natural language processing (NLP) has transformed significantly in the last decade, especially in the field of language modeling. Large language models (LLMs) have achieved SOTA performances on natural language understanding (NLU) and natural language generation (NLG) tasks by learning language representation in self-supervised ways. This paper provides a comprehensive survey to capture the progression of advances in language models. In this paper, we examine the different aspects of language models, which started with a few million parameters but have reached the size of a trillion in a very short time. We also look at how these LLMs transitioned from task-specific to task-independent to task-and-language-independent architectures. This paper extensively discusses different pretraining objectives, benchmarks, and transfer learning methods used in LLMs. It also examines different finetuning and in-context learning techniques used in downstream tasks. Moreover, it explores how LLMs can perform well across many domains and datasets if sufficiently trained on a large and diverse dataset. Next, it discusses how, over time, the availability of cheap computational power and large datasets has improved LLMs' capabilities and raised new challenges. As part of our study, we also inspect LLMs from the perspective of scalability to see how their performance is affected by the model's depth, width, and data size. Lastly, we provide an empirical comparison of existing trends and techniques and a comprehensive analysis of where the field of LLMs currently stands.

  • Research Article
  • 10.1145/3709155
Clinical Analogy Resolution Performance for Foundation Language Models
  • Apr 25, 2025
  • ACM Transactions on Computing for Healthcare
  • Fabián Villena + 2 more

Using extensive data sources to create foundation language models has revolutionized the performance of deep learning-based architectures. This remarkable improvement has led to state-of-the-art results for various downstream NLP tasks, including clinical tasks. However, more research is needed to measure model performance intrinsically, especially in the clinical domain. We revisit the use of analogy questions as an effective method to measure the intrinsic performance of language models for the clinical domain in English. We tested multiple Transformer-based language models over analogy questions constructed from the Unified Medical Language System (UMLS), a massive knowledge graph of clinical concepts. Our results show that large language models are significantly more performant for analogy resolution than small language models. Similarly, domain-specific language models perform better than general-domain language models. We also found a correlation between intrinsic and extrinsic performance, validated through the PubMedQA extrinsic task. Creating clinical-specific and language-specific language models is essential for advancing biomedical and clinical NLP and will ensure a valid application in clinical practice. Finally, given that our proposed intrinsic test is based on a term graph available in multiple languages, the dataset can be built to measure the performance of models in languages other than English.
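Analogy questions of the form a : b :: c : ? are conventionally scored with embedding arithmetic: the candidate whose vector lies closest to vec(b) - vec(a) + vec(c) wins. The numpy sketch below shows the mechanics with hand-made three-dimensional vectors; the embeddings and the clinical word pairs are fabricated for illustration, whereas a real evaluation would use vectors taken from the model under test.

```python
import numpy as np

# Toy embeddings; a real evaluation extracts these from the language model.
emb = {
    "fever":     np.array([1.0, 0.0, 1.0]),
    "infection": np.array([1.0, 1.0, 1.0]),
    "tremor":    np.array([0.0, 0.0, 1.0]),
    "parkinson": np.array([0.0, 1.0, 1.0]),
    "fracture":  np.array([0.0, 1.0, 0.0]),
}

def solve_analogy(a: str, b: str, c: str, candidates: list) -> str:
    """Return the candidate whose vector is closest (cosine) to b - a + c."""
    target = emb[b] - emb[a] + emb[c]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    return max(candidates, key=lambda w: cos(emb[w], target))

# fever : infection :: tremor : ?  (symptom-to-cause relation)
print(solve_analogy("fever", "infection", "tremor", ["parkinson", "fracture"]))
```

Here the target vector is [0, 1, 1], which matches "parkinson" exactly; scoring every UMLS-derived question this way yields the intrinsic accuracy the paper reports.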

  • Conference Article
  • Cite Count Icon 105
  • 10.1145/3510003.3510203
Jigsaw
  • May 21, 2022
  • Naman Jain + 6 more

Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for the Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of such systems.
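Jigsaw's core idea of post-processing model suggestions with checks that understand program syntax and semantics can be sketched as a filter over candidate completions: keep only candidates that parse and pass the user's test. The candidate strings below stand in for model outputs, and the helper names are ours, not Jigsaw's API.

```python
import ast

def passes_checks(code: str, test: str) -> bool:
    """Syntactic check (ast.parse) followed by a semantic check (run the test)."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    env: dict = {}
    try:
        exec(code, env)
        exec(test, env)   # the test raises AssertionError on failure
        return True
    except Exception:
        return False

def select_candidate(candidates: list, test: str):
    """Return the first model suggestion that survives both checks, else None."""
    for code in candidates:
        if passes_checks(code, test):
            return code
    return None

# Stand-ins for three model completions of "double every element":
candidates = [
    "def f(xs): return [x * x for x in xs]",   # parses, but semantically wrong
    "def f(xs) return xs + xs",                # rejected: syntax error
    "def f(xs): return [2 * x for x in xs]",   # parses and passes the test
]
chosen = select_candidate(candidates, "assert f([1, 2, 3]) == [2, 4, 6]")
print(chosen)
```

Jigsaw goes further than this sketch, repairing near-miss candidates with synthesis rather than merely filtering them, but the validate-before-trust loop is the same.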
