Articles published on Language Models
37,182 search results
Sort by Recency
- New
- Research Article
- 10.1080/17521882.2026.2640932
- Mar 6, 2026
- Coaching: An International Journal of Theory, Research and Practice
- Abongile Sipondo + 1 more
ABSTRACT This scoping review examines the design and effectiveness of AI coaching chatbots in light of recent advances in generative artificial intelligence. Following the emergence of large language models in 2022–2023, AI coaching has gained traction as a scalable and cost-effective intervention, yet evidence guiding effective design remains limited. Using a five-stage PRISMA-ScR methodology, 17 empirical studies were analysed through thematic synthesis. Three overarching themes emerged: chatbot design considerations, determinants of adoption, and the operationalisation of AI coaching. Findings indicate that generative AI enhances interaction quality, usability, and engagement, particularly for structured tasks such as goal setting, reflection, feedback, and intersessional support. However, limitations in relational depth, cultural sensitivity, and psychological nuance persist, positioning AI coaches as complements to rather than substitutes for human coaching.
- New
- Discussion
- 10.1080/08820538.2026.2638803
- Mar 6, 2026
- Seminars in Ophthalmology
- Sowmya V Kothandan + 2 more
Reply: Large Language Models Use in Dry Eye Disease: Bots and Brains
- New
- Research Article
- 10.1097/cm9.0000000000004000
- Mar 5, 2026
- Chinese medical journal
- Conghui Zhang + 2 more
Heart failure (HF) is a chronic condition characterized by high morbidity and mortality worldwide, imposing a substantial burden on healthcare systems. In recent years, artificial intelligence (AI) technologies, including machine learning, deep learning, and large language models, have demonstrated great potential in HF management. By integrating multimodal data, such as electronic health records and medical imaging, AI models address limitations in risk prediction, phenotyping, diagnosis, treatment, and prognosis, offering novel insights to improve the quality of life for HF patients. However, several challenges remain before AI can be reliably implemented in clinical practice, including model selection, model generalization, interpretability, and limited reliability in real-world settings. In this review, we systematically summarize recent advances in the application of AI to HF management across multiple domains, including inspection, monitoring, treatment, and integration. We further discuss key real-world challenges to implementation and outline future directions for the development of intelligent HF management. In addition, representative application cases are presented to illustrate how AI technologies can be developed and translated into clinical practice, with the aim of providing practical insights and methodological guidance for researchers.
- New
- Research Article
- 10.1038/s41746-026-02503-x
- Mar 4, 2026
- NPJ digital medicine
- Junmo Kim + 10 more
Cutaneous adverse drug reactions (CADRs) are the most common form of adverse drug reactions, ranging from mild rashes to life-threatening diseases such as Stevens-Johnson syndrome and toxic epidermal necrolysis. However, there is no effective tool to predict antibiotic-associated CADRs. In this study, we propose an antibiotic-associated CADR prediction model using electronic health record (EHR) foundation models (FMs). EHR FMs follow the pretraining-finetuning paradigm of language models, mapping medical codes and their sequences to words and sentences. We included 802,131 inpatients across three tertiary hospitals in Korea, combining EHR data with nursing statements and reports to extract skin rash records. Our approach achieved the best predictive performance among all baseline models across all datasets. To enhance clinical relevance, we classified CADRs into immediate and delayed types and conducted a detailed sub-analysis. Finally, we found that properly configured EHR FMs can effectively predict the risk of developing antibiotic-associated CADRs, particularly for delayed-type reactions, where predictive testing options are limited.
- New
- Research Article
- 10.3390/robotics15030055
- Mar 4, 2026
- Robotics
- Matthew Lisondra + 2 more
Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real-world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment. In this paper, we present the first systematic review of foundation models in mobile service robotics, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Using an OpenAlex literature search, we considered 7506 papers spanning the years 1968–2025. Our detailed analysis examined these four challenges in depth and identified how recent advances in foundation models have addressed each of them. We further examine real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors.
Beyond technical considerations, we discuss ethical, societal, human-interaction, and physical design and ergonomic implications associated with deploying foundation-model-enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, as well as the governance and human-in-the-loop frameworks required for safe, scalable, and trustworthy mobile service robotics.
- New
- Research Article
- 10.3389/frai.2026.1701665
- Mar 4, 2026
- Frontiers in Artificial Intelligence
- Farhad Abtahi + 2 more
Bias in medical artificial intelligence is conventionally viewed as a defect that requires elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator for differential diagnosis was developed using over 30 large language models, preserving both consensus and minority views, rendering diagnostic uncertainty and latent biases transparent to support clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under the supervision of clinicians. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.
- New
- Research Article
- 10.1186/s12873-026-01511-0
- Mar 4, 2026
- BMC emergency medicine
- Linfang Deng + 9 more
Evaluation of large language models in emergency medicine scenarios: a comparative analysis of ChatGPT-4o, ChatGPT-o3mini, Gemini 2.0-pro, and DeepSeek-R1.
- New
- Research Article
- 10.3390/app16052464
- Mar 4, 2026
- Applied Sciences
- Haytham Younus + 4 more
This article presents a state-of-the-art review of recent advances aimed at transforming traditional Failure Mode and Effects Analysis (FMEA) into a more intelligent, data-driven, and semantically enriched process. As engineered systems grow in complexity, conventional FMEA methods, which are largely manual, document-centric, and expert-dependent, have become increasingly inadequate for addressing the demands of modern systems engineering. We examine how techniques from Artificial Intelligence (AI), including machine learning and natural language processing, can transform FMEA into a more dynamic, data-driven, intelligent, and model-integrated process by automating failure prediction, prioritisation, and knowledge extraction from operational data. In parallel, we explore the role of ontologies in formalising system knowledge, supporting semantic reasoning, improving traceability, and enabling cross-domain interoperability. The review also synthesises emerging hybrid approaches, such as ontology-informed learning and large language model integration, which further enhance explainability and automation. These developments are discussed within the broader context of Model-Based Systems Engineering (MBSE) and function modelling, showing how AI and ontologies can support more adaptive and resilient FMEA workflows. We critically analyse a range of tools, case studies, and integration strategies, while identifying key challenges related to data quality, explainability, standardisation, and interdisciplinary adoption. By leveraging AI, systems engineering, and knowledge representation using ontologies, this review offers a structured roadmap for embedding FMEA within intelligent, knowledge-rich engineering environments.
- New
- Research Article
- 10.3390/app16052440
- Mar 3, 2026
- Applied Sciences
- Lei Zhang + 3 more
Chinese Spelling Correction (CSC) aims to identify and correct character-level errors in Chinese text, where mistakes are predominantly caused by phonetic similarity and complex semantic ambiguity. Existing CSC approaches typically model phonetic and semantic information separately, which limits their ability to resolve errors requiring joint reasoning over pronunciation, tone, and global sentence meaning. In this paper, we propose a Phonetic–Semantic and Long–Short Information Fusion (PSIF) framework that explicitly integrates transliteration knowledge with sentence-level semantic representations. By incorporating tone-aware pinyin embeddings and fusing short-range phonetic features with long-range contextual semantics, PSIF effectively captures both local and global cues necessary for accurate correction. Extensive experiments on multiple CSC benchmarks demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly on homophonic and context-sensitive errors. Furthermore, to investigate CSC under noisy input conditions in large language models (LLMs), we introduce UCMMLU, a novel benchmark constructed by injecting erroneous Chinese characters into CMMLU questions. Results show that applying PSIF as a preprocessing module significantly enhances LLM robustness and question-answering performance in zero-shot settings. These findings suggest that phonetic–semantic fusion not only advances CSC accuracy but also provides an effective pathway for improving the reliability of language models when handling misspelled or noisy Chinese text.
- New
- Research Article
- 10.31436/imjm.v25i01.3212
- Mar 3, 2026
- IIUM Medical Journal Malaysia
- Rekha Prabhu + 3 more
INTRODUCTION: ChatGPT, a language model, is well-known for its capacity to generate human-like responses, but its use in medical education, particularly in assessment contexts, is underexplored. The aim of this study was to evaluate the efficiency of ChatGPT as an assessment tool in medical physiology examinations by comparing its performance in answering MCQs and SAQs. The findings of this study may inform the use of AI in medical education in an increasingly digitised academic environment. MATERIALS AND METHODS: The study evaluated the performance of ChatGPT in answering 30 multiple-choice questions (MCQs) and 12 short-answer questions (SAQs) from each of the four physiology blocks. The questions were chosen from previous block exams to ensure consistency. Two independent evaluators assessed the correctness and relevance of ChatGPT's responses using the answer key. The mean marks obtained by first-year medical students for 120 MCQs and 48 SAQs were compared with those of ChatGPT. RESULTS: ChatGPT performed better than first-year medical students in MCQs in all block exams, and the difference in marks was statistically significant in blocks 1, 2, and 3. In SAQs, ChatGPT also performed better than the students on most questions. Students scored better on SAQ 11 in block 2, SAQ 12 in block 3, and SAQs 1, 2, and 5 in block 3. CONCLUSION: ChatGPT is an effective AI tool for answering medical physiology questions. However, its performance varies across some MCQs and SAQs, indicating potential limitations in reasoning, contextual interpretation, and application-based problem-solving.
- New
- Research Article
- 10.1038/s44259-026-00187-7
- Mar 3, 2026
- npj antimicrobials and resistance
- William J Waldock + 4 more
We aimed to assess prescribing accuracy, error reduction, usability, and clinician confidence with Ask Eolas (a retrieval-augmented generation-enhanced AI-CDSS) compared to existing antimicrobial guidance tools. We conducted a single-site structured simulation study evaluating Ask Eolas across 45 prescribing cases with healthcare professionals to assess prescribing accuracy. Among 45 participants, Ask Eolas achieved zero prescribing errors versus six and eight documented errors in the two comparator groups (Eolas App and PDF Guidelines), respectively (p < 0.001). The number needed to treat was 1.9 for Ask Eolas versus traditional guidelines, indicating one additional error-free prescription for every two clinicians switching to Ask Eolas. Ask Eolas significantly improved prescribing accuracy while enhancing usability, clinician confidence, and system transparency compared to existing tools. These findings align with TRUST-AI framework principles for safe AI-CDSS deployment, supporting further investigation through real-world implementation studies incorporating live data integration, confidence calibration systems, and comprehensive auditability features in antimicrobial stewardship programmes.
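As a reader's sanity check on the reported number needed to treat, the standard formula NNT = 1 / (absolute risk reduction) reproduces the figure of 1.9. The arm sizes below are an assumption (45 participants split into three equal groups of 15), since the abstract reports only the totals of 0, 6, and 8 errors:

```python
def number_needed_to_treat(control_error_rate: float, treatment_error_rate: float) -> float:
    """NNT = 1 / absolute risk reduction (ARR)."""
    arr = control_error_rate - treatment_error_rate
    return 1.0 / arr

# Assumed: 15 clinicians per arm (a guess; the abstract gives only totals).
# Worst comparator arm: 8 errors of 15; Ask Eolas: 0 of 15.
nnt = number_needed_to_treat(8 / 15, 0 / 15)
print(round(nnt, 1))  # 1.9, matching the reported value
```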
- New
- Research Article
- 10.1038/s44387-026-00081-7
- Mar 3, 2026
- npj Artificial Intelligence
- Jio Oh + 3 more
Classroom AI: large language models as grade-specific teachers
- New
- Research Article
- 10.55041/isjem05582
- Mar 3, 2026
- International Scientific Journal of Engineering and Management
- Sreeshylam Rasula
Multimodal interfaces combine two or more input and/or output channels such as text, speech, vision, touch, gaze, gesture, haptics, and physiological sensing to create interaction styles that are more natural, accessible, and context-aware than single-modality systems. In the last few years, progress in multimodal machine learning, foundation models, wearable sensing, and spatial computing has reshaped how humans communicate intent to machines and how machines return feedback in real time. This article reviews recent advances in multimodal human–computer interaction (HCI) with an emphasis on: (i) multimodal fusion and alignment methods, (ii) multimodal large language models that connect language with perception, (iii) emerging sensing modalities (e.g., wrist sEMG) and XR interaction patterns (e.g., gaze + pinch), and (iv) evaluation practices that capture accuracy, latency, cognitive load, and user trust. A research methodology is presented for building and assessing multimodal interfaces, including dataset selection, signal preprocessing, fusion design, and user study protocols. Tables summarize modalities, fusion strategies, benchmark datasets, and evaluation metrics. Mathematical formulations cover fusion operators, attention-based alignment, and HCI performance laws. Finally, the paper consolidates findings, practical suggestions, and a forward-looking agenda addressing robustness under distribution shift, privacy, safety, and inclusive design.
- New
- Research Article
- 10.31436/imjm.v25i01.3213
- Mar 3, 2026
- IIUM Medical Journal Malaysia
- Rekha Prabhu + 2 more
INTRODUCTION: Large language models (LLMs) are increasingly used by MBBS students as supplementary resources for exam preparation. The objective of this study was to evaluate the performance of ChatGPT and Microsoft Copilot in answering clinical vignette-style physiology MCQs from widely used resources for the United States Medical Licensing Examination (USMLE). MATERIALS AND METHODS: Fifty clinical vignette-style physiology multiple choice questions (MCQs) from various USMLE question banks were submitted to ChatGPT and Microsoft Copilot to choose the correct option. The performance of ChatGPT and Microsoft Copilot was assessed against the answers provided in the question bank. Two experienced physiologists independently reviewed the explanations provided by ChatGPT and Microsoft Copilot for each MCQ. The explanations were rated from one to three points according to whether the answers were completely incorrect, partially correct with inaccurate information, or correct with adequate information. RESULTS: ChatGPT correctly answered 48 of 50 questions and Microsoft Copilot 47 of 50, reflecting accuracy rates of 96% and 94%, respectively. One MCQ each on hypothyroidism and arrhythmia was incorrectly answered by both ChatGPT and Microsoft Copilot. ChatGPT provided inaccurate explanations for two MCQs, and Microsoft Copilot for four. CONCLUSION: ChatGPT and Microsoft Copilot both demonstrated more than 90% accuracy in answering case-based MCQs from USMLE Step 1 resources. Their incorrect answers to MCQs on hypothyroidism and arrhythmia and inaccurate explanations for some MCQs highlight the need for cautious use of AI by students.
- New
- Research Article
- 10.3390/electronics15051052
- Mar 3, 2026
- Electronics
- Thiago Cormie Monteiro + 1 more
Artificial Intelligence (AI) has emerged as a transformative force, increasingly integrated into diverse aspects of modern society, from healthcare and education to business and entertainment. Among the most influential AI technologies are large language models (LLMs), such as generative pretrained transformers (GPTs). These models are designed to process vast amounts of data and perform complex computations, enabling advanced capabilities in natural language understanding and generation. However, the deployment and operation of such systems require significant computational resources, leading to substantial energy consumption. While general-purpose hardware such as GPUs is limited by fixed-precision architectures, field-programmable gate arrays (FPGAs) offer the bit-level reconfigurability needed to exploit ultra-low-bitwidth representations. This allows power-intensive multiplications to be replaced by streamlined logic-based accumulations, maximizing the energy benefits of model quantization. This paper addresses the problem of the energy impact of LLMs by leveraging innovative FPGA-based heterogeneous computing platforms. Results demonstrate that ternary matrix multiplication (MatMul) achieves a 23% speedup and a remarkable 96% reduction in digital signal processor (DSP) utilization. Furthermore, the final optimized design shows a 52% reduction in total energy consumption compared to the baseline, making heterogeneous computing a compelling solution for power- and resource-constrained embedded applications.
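The abstract's key idea, replacing multiplications with accumulations, follows directly from restricting weights to the ternary set {-1, 0, +1}: each product then reduces to an add, a subtract, or a skip. The NumPy sketch below illustrates the principle only; it is not the paper's FPGA design:

```python
import numpy as np

def ternary_matmul(x: np.ndarray, w_ternary: np.ndarray) -> np.ndarray:
    """Compute x @ w without multiplications: for each output column,
    accumulate +x where the ternary weight is +1 and -x where it is -1;
    zero weights are simply skipped."""
    out = np.zeros((x.shape[0], w_ternary.shape[1]))
    for j in range(w_ternary.shape[1]):
        plus = w_ternary[:, j] == 1
        minus = w_ternary[:, j] == -1
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out

x = np.array([[1.0, 2.0, 3.0]])
w = np.array([[1, -1], [0, 1], [-1, 0]])   # weights restricted to {-1, 0, +1}
assert np.allclose(ternary_matmul(x, w), x @ w)  # matches ordinary matmul
```

On an FPGA, the boolean masks correspond to hard-wired routing, which is why the paper can retire DSP multiplier blocks almost entirely.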
- New
- Research Article
- 10.1038/s41597-026-06783-6
- Mar 3, 2026
- Scientific data
- Tao Jin + 9 more
Personality, as a stable and coherent set of behavioral and cognitive patterns, significantly influences linguistic expression, emotional regulation, and cognitive functioning. The Big Five personality traits (neuroticism, extraversion, openness, agreeableness, and conscientiousness) are especially relevant for understanding language use and social interaction, making them foundational for the development of personality-informed natural language processing (NLP) systems. Despite this, existing personality lexicons often lack rigorous validation, show weak alignment between linguistic features and personality traits, and fail to adapt to dynamic language environments such as social media. This study presents the construction and empirical validation of a personality lexicon derived from established psychological scales, dictionaries, and literature. Validation using real-world participant data yielded high hit rates across all Big Five dimensions (all > 0.70; mean = 0.787) and their 30 corresponding facets (all > 0.60; mean = 0.768). This lexicon provides a robust foundation for advancing computational personality assessment and supports applications in personalized NLP, large language models, and mental health prediction.
- New
- Research Article
- 10.3390/diagnostics16050749
- Mar 3, 2026
- Diagnostics
- Christian Nelles + 8 more
Background/Objectives: To evaluate the diagnostic accuracy of two visual large language models (vLLMs), GPT-4o (OpenAI) and Claude Sonnet 3.5 (Anthropic), for detecting brain metastases in routine MRI using combined imaging and textual input. Methods: This retrospective study included 31 patients with and 46 without brain metastases with underlying melanoma (n = 24), lung cancer (n = 23), breast cancer (n = 17), or renal cell carcinoma (n = 13). In total, 100 MRI examinations (50 with, 50 without metastases) were provided to both vLLMs using a single representative slice per sequence, together with clinical history and the referring question. The generated free-text reports were evaluated for detection accuracy, overdiagnosis, correct sequence recognition, anatomical localization, lesion laterality, and lesion size estimation. Results: Both vLLMs showed perfect sensitivity (100% for both) but very low specificity (GPT-4o: 8%, Sonnet 3.5: 4%; p = 0.625), resulting in low diagnostic accuracy (GPT-4o: 54%, Sonnet 3.5: 52%; p = 0.625). Sequence identification was highly accurate in both models, with GPT-4o performing significantly better (100% vs. 93%; p < 0.05). Identification of the anatomical brain region (70% vs. 72%; p = 1.00) and lesion laterality (62% vs. 76%; p = 0.189) was comparable. Both models hallucinated additional lesions in 12% of cases. Lesion size measurements showed no significant differences between the models or in comparison with the radiologist. Conclusions: GPT-4o and Claude Sonnet 3.5 can generate radiological reports and detect brain metastases with excellent sensitivity, but their very low specificity, frequent hallucinations, and limited spatial reliability currently preclude clinical application. Future work should address how the balance between visual and textual input influences diagnostic behavior in vLLMs.
- New
- Research Article
- 10.3389/fdgth.2026.1767648
- Mar 3, 2026
- Frontiers in Digital Health
- Noppawit Aiumtrakul + 5 more
Background Prior authorization (PA) is a major source of administrative burden, treatment delay, and clinician burnout. Artificial intelligence (AI), particularly large language models (LLMs), is increasingly used to assist with clinical documentation, yet its reliability for payer-facing administrative tasks remains uncertain. Objective To evaluate the quality of PA letters drafted by ChatGPT-5 for commonly used medications requiring PA in nephrology. Quality was evaluated based on correctness and strength of clinical reasoning. Methods We created a single standardized prompt and applied it across 29 nephrology scenarios to generate PA letters. Each PA letter was reviewed against four criteria: 1) absence of false statements or hallucinations, 2) correctness of ICD-10 coding, 3) presence and validity of citations, and 4) clinical reasoning, rated on a 4-point Likert scale (illogical, weak, adequate, and strong). FDA drug labels, KDIGO guidelines, and related randomized controlled trials were used as reference standards. Results Of the 29 letters, one (3.5%) contained false statements, citing an irrelevant clinical trial. The ICD-10 diagnosis code was correct in 23 letters (79.3%); most errors were related to chronic kidney disease (CKD) staging or internal diagnostic inconsistencies. Twenty-seven letters (93.1%) cited valid references, with one letter citing an incorrect trial and another citing a correct KDIGO guideline with an inaccessible link. Twenty-six letters (89.7%) demonstrated strong clinical reasoning, supported by guideline-oriented or FDA label–aligned justification. The remaining three letters were rated as showing adequate reasoning. The main areas for improvement involved citing relevant references and emphasizing special considerations, for example Risk Evaluation and Mitigation Strategy (REMS) compliance for eculizumab.
Conclusions ChatGPT-5 can generate clinically coherent PA drafts for nephrology medications, but limitations in coding precision and citation reliability persist. With appropriate oversight, AI-assisted documentation may reduce administrative burden while maintaining safety and accuracy.
- New
- Research Article
- 10.1088/1674-4527/ae4d1f
- Mar 3, 2026
- Research in Astronomy and Astrophysics
- Cunshi Wang + 14 more
Abstract To validate key technologies for wide field-of-view (FOV) X-ray polarization measurements, the Cosmic X-ray Polarization Detector (CXPD) CubeSat series has been developed as a prototype platform for the Low-Energy X-ray Polarization Detector (LPD) onboard the POLAR-2 mission. The wide-FOV design significantly increases the complexity of the background environment, posing notable challenges for real-time gamma-ray burst (GRB) identification. In this work, we propose an in-orbit GRB identification method based on machine learning, using simulated spectral data as input. A training dataset was constructed using a Geant4-based simulator, incorporating in-orbit background and GRB events modeled within the 2–10 keV energy range. To meet the computational constraints of onboard processing, we employ a multimodal large language model (MLLM), which is fine-tuned using low-rank adaptation (LoRA) based on miniCPM-V2.6 and quantized to 4-bit precision. The model achieves perfect classification accuracy on validation data and demonstrates strong regression performance in estimating GRB spectral indices, with an RMSE of 0.118. Furthermore, we validate the feasibility of onboard deployment through a simulated satellite data processing pipeline, highlighting the potential of our approach to enable future real-time GRB detection and spectral analysis in orbit.
- New
- Research Article
- 10.1038/s41598-026-41862-z
- Mar 3, 2026
- Scientific reports
- Yihan Dong + 1 more
Fact-checking is crucial as rumours and misinformation negatively impact social networking services (SNS) and online discussions. Meanwhile, fact-checking with large language models (LLMs) is becoming increasingly popular as the performance of LLMs improves. However, previous works have issues, including overconfidence in the judgment results of LLMs and the insufficiency of binary fact-checking given the complexity of real-world text. Moreover, using multiple information sources to make judgments reveals another obstacle: the lack of proper scoring mechanisms. We therefore propose a framework called multi-agent fact-checking (MAFC), in which multiple agents with distinct information sources measure the text's credibility. A new scoring mechanism calculates credibility from each agent's judgment results and confidence. We tested the proposed method through several comparative experiments. The results show that the proposed method outperforms other baselines on both the binary fact-checking task and the multi-label fact-checking task. Finally, we discuss the challenges remaining in the fact-checking field, such as definition standards and dataset creation.
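The core aggregation idea the abstract describes, combining per-agent judgments weighted by confidence, can be sketched generically. This is a plausible illustration of the concept, not the paper's actual MAFC scoring mechanism: each agent is assumed to return a verdict in [0, 1] (0 = false, 1 = true) and a confidence, and the overall credibility is the confidence-weighted mean of the verdicts:

```python
def aggregate_credibility(judgments: list[tuple[float, float]]) -> float:
    """judgments: (verdict, confidence) pairs, one per agent.
    Returns a credibility score in [0, 1]; 0.5 when no agent reports
    any confidence (no evidence either way)."""
    total_conf = sum(conf for _, conf in judgments)
    if total_conf == 0:
        return 0.5
    return sum(v * c for v, c in judgments) / total_conf

# Three agents with distinct sources: two lean true, one is confidently false.
score = aggregate_credibility([(1.0, 0.6), (0.8, 0.5), (0.0, 0.9)])
print(round(score, 2))  # 0.5 — the confident dissent pulls the score down
```

A weighted mean like this also extends naturally to the multi-label setting by aggregating per-label verdicts separately.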