MoChat: Joints-Grouped Spatio-Temporal Grounding Multimodal Large Language Model for Multi-Turn Motion Comprehension and Description.

Abstract

Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, and they typically support only single-round interaction. This limitation is particularly pronounced in home exercise monitoring, neurological disorder assessment, and rehabilitation, where precise motion analysis is crucial for ensuring exercise efficacy, detecting early signs of neurological conditions, and guiding personalized recovery programs. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and multi-turn dialogue understanding. To achieve this, we first group spatial features in skeleton frames according to human anatomical structure and process them through a Joints-Grouped Skeleton Encoder. The encoder's outputs are fused with large language model embeddings to generate spatially aware representations. A cross-attention-based Regression Head module is then designed to align hidden-layer embeddings with skeletal sequence embeddings, enabling precise temporal grounding. Furthermore, we develop a pipeline for the temporal grounding task that extracts timestamps from skeleton-text pairs, and we construct multi-turn instruction dialogues for the spatial grounding task. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.
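
To make the two named components concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: a Joints-Grouped Skeleton Encoder that encodes anatomically grouped joints separately, and a cross-attention Regression Head that aligns LLM hidden states with skeleton-sequence embeddings to regress start/end timestamps. The joint partition, dimensions, and module internals are all assumptions; the paper's actual configuration is not given here.

```python
# Minimal sketch of the two MoChat components named in the abstract.
# The anatomical grouping, dimensions, and module names are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

# Hypothetical partition of 22 skeleton joints into anatomical groups
# (torso, left/right arm, left/right leg); the paper's grouping may differ.
JOINT_GROUPS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

class JointsGroupedSkeletonEncoder(nn.Module):
    """Encodes each anatomical joint group separately so downstream layers
    keep a per-body-part (spatial) structure -- one plausible reading of
    'joints-grouped'."""
    def __init__(self, joint_dim=3, d_model=256):
        super().__init__()
        self.groups = list(JOINT_GROUPS.values())
        self.proj = nn.ModuleList(
            nn.Linear(len(g) * joint_dim, d_model) for g in self.groups
        )
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, skeleton):  # skeleton: (B, T, J, 3) joint positions
        B, T, _, _ = skeleton.shape
        tokens = []
        for idx, proj in zip(self.groups, self.proj):
            part = skeleton[:, :, idx, :].reshape(B, T, -1)  # flatten group
            tokens.append(self.temporal(proj(part)))          # (B, T, D)
        return torch.stack(tokens, dim=2)  # (B, T, n_groups, D)

class CrossAttentionRegressionHead(nn.Module):
    """Aligns LLM hidden states (queries) with skeleton-sequence embeddings
    (keys/values), then regresses normalized (start, end) timestamps."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.regress = nn.Linear(d_model, 2)

    def forward(self, llm_hidden, skel_embed):  # (B, L, D), (B, T, D)
        aligned, _ = self.attn(llm_hidden, skel_embed, skel_embed)
        return self.regress(aligned.mean(dim=1)).sigmoid()  # (B, 2)
```

In this reading, temporal grounding falls out of the cross-attention step: the skeleton sequence keeps its frame axis, so a query token attending to the relevant frames yields embeddings from which a small linear head can regress timestamps.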

Similar Papers
  • Research Article
  • 10.34133/icomputing.0110
Deep Learning and Methods Based on Large Language Models Applied to Stellar Light Curve Classification
  • Jan 1, 2025
  • Intelligent Computing
  • Yu-Yang Li + 6 more

Light curves serve as a valuable source of information on stellar formation and evolution. With the rapid advancement of machine learning techniques, they can be effectively processed to extract astronomical patterns and information. In this study, we present a comprehensive evaluation of models based on deep learning and large language models (LLMs) for the automatic classification of variable star light curves, using large datasets from the Kepler and K2 missions. Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision. Employing automated deep learning optimization, we achieve striking performance using 2 architectures: one that combines one-dimensional convolution (Conv1D) with bidirectional long short-term memory (BiLSTM) and another called the Swin Transformer. These achieved accuracies of 94% and 99%, respectively, with the latter demonstrating a notable 83% accuracy in discerning the elusive type II Cepheids that comprise merely 0.02% of the total dataset. We unveil StarWhisper LightCurve (LC), a series of 3 models based on an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM). Each model is fine-tuned with strategic prompt engineering and customized training methods to explore the emergent abilities of these models for astronomical data. Remarkably, StarWhisper LC series models exhibit high accuracies of around 90%, considerably reducing the need for explicit feature engineering, thereby paving the way for streamlined parallel data processing and the progression of multifaceted multimodal models in astronomical applications. The study furnishes 2 detailed catalogs illustrating the impacts of phase and sampling intervals on deep learning classification accuracy, showing that a substantial decrease of up to 14% in observation duration and 21% in sampling points can be realized without compromising accuracy by more than 10%.
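
As a concrete illustration of the first architecture, here is a minimal PyTorch sketch of a Conv1D + BiLSTM light-curve classifier. All hyperparameters (channel widths, hidden size, the three-class output for Cepheids, RR Lyrae, and eclipsing binaries) are illustrative assumptions, not the paper's tuned configuration.

```python
# Minimal sketch of the Conv1D + BiLSTM combination the abstract pairs
# against the Swin Transformer; hyperparameters are assumptions.
import torch
import torch.nn as nn

class Conv1DBiLSTMClassifier(nn.Module):
    def __init__(self, n_classes=3, hidden=64):
        super().__init__()
        # 1D convolutions extract local flux patterns from the light curve.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # A bidirectional LSTM models longer-range periodic structure.
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, flux):              # flux: (batch, seq_len) magnitudes
        x = self.conv(flux.unsqueeze(1))  # -> (batch, 64, seq_len // 4)
        x = x.transpose(1, 2)             # -> (batch, seq_len // 4, 64)
        _, (h, _) = self.lstm(x)          # h: (2, batch, hidden)
        return self.head(torch.cat([h[0], h[1]], dim=1))

logits = Conv1DBiLSTMClassifier()(torch.randn(8, 512))  # 8 curves, 512 points
```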

  • Research Article
  • Cited by 1
  • 10.3390/bdcc9050132
Comparative Evaluation of Multimodal Large Language Models for No-Reference Image Quality Assessment with Authentic Distortions: A Study of OpenAI and Claude.AI Models
  • May 16, 2025
  • Big Data and Cognitive Computing
  • Domonkos Varga

This study presents a comparative analysis of several multimodal large language models (LLMs) for no-reference image quality assessment, with a particular focus on images containing authentic distortions. We evaluate three models developed by OpenAI and three models from Claude.AI, comparing their performance in estimating image quality without reference images. Our results demonstrate that these LLMs outperform traditional methods based on hand-crafted features. However, more advanced deep learning models, especially those based on deep convolutional networks, surpass LLMs in performance. Notably, we make a unique contribution by publishing the processed outputs of the LLMs, providing a transparent and direct comparison of their quality assessments based solely on the predicted quality scores. This work underscores the potential of multimodal LLMs in image quality evaluation, while also highlighting the continuing advantages of specialized deep learning approaches.

  • Research Article
  • 10.1145/3732784
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs
  • Apr 29, 2025
  • ACM Transactions on Intelligent Systems and Technology
  • Shuyi Xie + 15 more

Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multi-turn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of the evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, TencentLLMEval dataset, and evaluation methodology, which have been demonstrated as effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.

  • Research Article
  • 10.1007/s10266-025-01283-2
Diagnostic accuracy of large language models in the classification of superior labial frenulum attachments.
  • Dec 10, 2025
  • Odontology
  • Mehmet Gümüş Kanmaz + 1 more

This study aimed to evaluate the diagnostic accuracy of multimodal large language models in classifying superior labial frenulum attachments from intraoral photographs using expert consensus as the reference standard. Five experts (two periodontists and three orthodontists) established the consensus standard by classifying frenulum attachments in 117 intraoral images as mucosal, gingival, papillary, and papilla penetrating. The same photographs were then presented to three multimodal large language models (ChatGPT 4o, Gemini 2.5 Pro, and Microsoft Copilot GPT-4), and their diagnostic performance was evaluated using accuracy, sensitivity, specificity, and F1 score. Reliability was assessed using Fleiss' and Cohen's Kappa, and diagnostic performances were compared using Cochran's Q test. Human raters demonstrated almost perfect agreement (κ = 0.838, p < 0.001), whereas large language models showed poor inter-model agreement (κ = -0.124, p < 0.001). ChatGPT achieved slight but significant agreement with the consensus (κ = 0.114, p = 0.019), although its clinical relevance was negligible. Gemini (κ = 0.099) and Copilot (κ = 0.027) showed no significant agreement (p > 0.05). Copilot yielded the highest overall accuracy (46.2%), followed by Gemini (44.5%) and ChatGPT (35.0%). The performance of the large language models varied across frenulum types. Current multimodal large language models demonstrate inconsistent and clinically insufficient accuracy in the classification of superior labial frenulum attachments from photographs. Domain-specific training is essential before large language models can be considered reliable diagnostic tools in dentistry.
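
For readers unfamiliar with the agreement statistic used throughout this abstract, the sketch below shows how Fleiss' kappa can be computed with statsmodels. The ratings are invented toy values purely for illustration, not the study's data.

```python
# Illustrative Fleiss' kappa computation with statsmodels, mirroring the
# multi-rater agreement analysis described above. Toy data, NOT the study's.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = images, columns = raters; values encode the four frenulum classes
# (0=mucosal, 1=gingival, 2=papillary, 3=papilla penetrating).
ratings = np.array([
    [0, 0, 0, 0, 1],
    [2, 2, 2, 2, 2],
    [1, 1, 0, 1, 1],
    [3, 3, 3, 2, 3],
])

# aggregate_raters converts per-rater labels into per-category counts,
# the table format fleiss_kappa expects.
counts, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```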

  • Research Article
  • Cited by 124
  • 10.1097/corr.0000000000002704
Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT.
  • May 23, 2023
  • Clinical orthopaedics and related research
  • Zachary C Lum

Neural networks, deep learning, and artificial intelligence (AI) have advanced rapidly in recent years. Previous deep learning AI has been structured around domain-specific models trained on datasets of interest, yielding high accuracy and precision. A new AI model using large language models (LLMs) and nonspecific domain areas, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and if scoring lower than the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, as were five questions the LLM could not answer, resulting in 207 questions administered, with raw scores recorded. The LLM's answers were compared with the Orthopaedic In-Training Examination rankings of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Questions answered were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; the LLM's performance across taxonomic levels was compared and analyzed using a chi-square test. ChatGPT selected the correct answer 47% (97 of 207) of the time, and 53% (110 of 207) of the time it answered incorrectly. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1s, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, its testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in implementing knowledge. Current AI appears to perform better at knowledge- and interpretation-based inquiries, and based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.

  • Research Article
  • Cited by 2
  • 10.1145/3732786
Graph Machine Learning in the Era of Large Language Models (LLMs)
  • May 6, 2025
  • ACM Transactions on Intelligent Systems and Technology
  • Shijie Wang + 10 more

Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graphs. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterophily and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.

  • Research Article
  • Cited by 2
  • 10.2214/ajr.25.32729
Multimodal Large Language Model With Knowledge Retrieval Using Flowchart Embedding for Forming Follow-Up Recommendations for Pancreatic Cystic Lesions.
  • Jul 1, 2025
  • AJR. American journal of roentgenology
  • Zheren Zhu + 5 more

BACKGROUND. The American College of Radiology (ACR) Incidental Findings Committee (IFC) algorithm provides guidance for pancreatic cystic lesion (PCL) management. Its implementation using plain-text large language model (LLM) solutions is challenging given that key components include multimodal data (e.g., figures and tables). OBJECTIVE. The purpose of the study is to evaluate a multimodal LLM approach incorporating knowledge retrieval using flowchart embedding for forming follow-up recommendations for PCL management. METHODS. This retrospective study included patients who underwent abdominal CT or MRI from September 1, 2023, to September 1, 2024, and whose report mentioned a PCL. The reports' Findings sections were inputted to a multimodal LLM (GPT-4o). For task 1 (198 patients: mean age, 69.0 ± 13.0 [SD] years; 110 women, 88 men), the LLM assessed PCL features (presence of PCL, PCL size and location, presence of main pancreatic duct communication, presence of worrisome features or high-risk stigmata) and formed a follow-up recommendation using three knowledge retrieval methods (default knowledge, plain-text retrieval-augmented generation [RAG] from the ACR IFC algorithm PDF document, and flowchart embedding using the LLM's image-to-text conversion for in-context integration of the document's flowcharts and tables). For task 2 (85 patients: mean initial age, 69.2 ± 10.8 years; 48 women, 37 men), an additional relevant prior report was inputted; the LLM assessed for interval PCL change and provided an adjusted follow-up schedule accounting for prior imaging using flowchart embedding. Three radiologists assessed LLM accuracy in task 1 for PCL findings in consensus and follow-up recommendations independently; one radiologist assessed accuracy in task 2. RESULTS. For task 1, the LLM with flowchart embedding had accuracy for PCL features of 98.0-99.0%. The accuracy of the LLM follow-up recommendations based on default knowledge, plain-text RAG, and flowchart embedding for radiologist 1 was 42.4%, 23.7%, and 89.9% (p < .001), respectively; radiologist 2 was 39.9%, 24.2%, and 91.9% (p < .001); and radiologist 3 was 40.9%, 25.3%, and 91.9% (p < .001). For task 2, the LLM using flowchart embedding showed an accuracy for interval PCL change of 96.5% and for adjusted follow-up schedules of 81.2%. CONCLUSION. Multimodal flowchart embedding aided the LLM's automated provision of follow-up recommendations adherent to a clinical guidance document. CLINICAL IMPACT. The framework could be extended to other incidental findings through the use of other clinical guidance documents as the model input.
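
The flowchart-embedding step can be pictured as a two-stage prompt pipeline, sketched below with the OpenAI Python client: the model first transcribes the guideline flowchart image into text, and that text is then placed in context alongside the report's Findings section. The prompts, file name, and example findings are assumptions based only on this abstract, not the study's actual implementation.

```python
# Sketch of the flowchart-embedding idea: (1) image-to-text conversion of
# the guideline flowchart, (2) in-context integration with report findings.
# Prompts, file names, and example text are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()

def image_to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Stage 1: transcribe the ACR IFC flowchart image (hypothetical file name).
flowchart_text = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Transcribe this management flowchart as "
                                 "step-by-step text, preserving all branches."},
        {"type": "image_url",
         "image_url": {"url": image_to_data_url("acr_ifc_pcl_flowchart.png")}},
    ]}],
).choices[0].message.content

# Stage 2: place the transcribed guideline in context with the Findings text.
recommendation = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Follow this guideline exactly:\n" + flowchart_text},
        {"role": "user",
         "content": "Findings: 1.2-cm cyst in the pancreatic body, no main "
                    "duct communication. Recommend a follow-up schedule."},
    ],
).choices[0].message.content
print(recommendation)
```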

  • Research Article
  • 10.1371/journal.pone.0329590
EIM: An effective solution for improving multi-modal large language models.
  • Aug 11, 2025
  • PloS one
  • Yuting Bai + 2 more

Enabling large language models (LLMs) to have multi-modal capabilities, such as vision-language learning, has become a current research hotspot and the next milestone in LLM development with the advent of models like GPT-4. The basic structure of current multi-modal LLMs usually includes three parts: an image encoder for extracting visual features, a semantic space transformation network (ST) for aligning the multi-modal semantic spaces, and an LLM for generating text. Current work on multi-modal LLMs primarily focuses on enhancing performance by utilizing larger image encoders and LLMs and designing more complex fine-tuning methods and STs, which results in an escalation of model parameters. In this paper, we propose EIM, a novel and effective solution for improving the performance of multi-modal large language models from the perspective of the training process, which reduces the need to introduce new parameters or modify the model structure and has been largely overlooked in current research. EIM includes corresponding improvement measures for the image encoder, the ST, and the LLM. To validate EIM, we first apply it to ClipCap and conduct experiments on the COCO Caption dataset. Second, we extend EIM to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset. Finally, we also conduct multi-modal chatbot experiments with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. The COCO Caption experimental results of [Formula: see text], a model that applies EIM to [Formula: see text], show a 1.75% performance improvement over [Formula: see text], which has 3.13 times as many parameters as [Formula: see text]. The experimental results on the ScienceQA dataset and the MME benchmark show that EIM can achieve competitive performance with 7B model parameters compared to 13B multi-modal LLMs, confirming EIM's effective performance improvement for multi-modal LLMs.
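
The three-part structure this abstract describes (image encoder, semantic space transformation network ST, LLM) can be sketched schematically as below. The modules are simplified stand-ins, with a ClipCap-style linear prefix mapping as the ST; EIM itself modifies the training process, not this architecture.

```python
# Schematic of the image-encoder -> ST -> LLM structure the abstract
# describes. All modules here are simplified, hypothetical stand-ins.
import torch
import torch.nn as nn

class MultimodalLLMSkeleton(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096, n_prefix=10):
        super().__init__()
        self.image_encoder = nn.Identity()  # placeholder for a frozen ViT
        # ST maps visual features to n_prefix tokens in the LLM embedding
        # space (ClipCap-style prefix mapping).
        self.st = nn.Linear(vision_dim, n_prefix * llm_dim)
        self.n_prefix, self.llm_dim = n_prefix, llm_dim

    def forward(self, image_feat, text_embeds):
        # image_feat: (B, vision_dim); text_embeds: (B, L, llm_dim)
        prefix = self.st(self.image_encoder(image_feat))
        prefix = prefix.view(-1, self.n_prefix, self.llm_dim)
        # The concatenated sequence is what the LLM would consume.
        return torch.cat([prefix, text_embeds], dim=1)
```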

  • Research Article
  • 10.56025/ijaresm.2025.1302252386
Conversational Information Retrieval by Leveraging Llms to Enhance User Experience
  • Jan 1, 2025
  • International Journal of All Research Education and Scientific Methods
  • Amit Raj + 2 more

A Conversational Information Retrieval (CIR) system can be defined as an information retrieval (IR) system characterized by a conversational interface that facilitates user interaction with the system to obtain information through multi-turn dialogues in natural language, whether in spoken or written modalities. Information retrieval has undergone considerable transformation, transcending conventional search methodologies to address a wide array of user information requirements. The combination of IR models, LLMs, and human users establishes a novel technical paradigm that is significantly more effective for information seeking: IR models deliver timely and pertinent information, LLMs supply intrinsic knowledge, and humans assume a pivotal role as both demanders and assessors of the reliability of information services. Large Language Models (LLMs) have exhibited remarkable proficiency in text comprehension, generation, and knowledge inference, thereby unveiling promising prospects for research within the field of IR. LLMs not only enhance the process of generative retrieval but also provide superior frameworks for user comprehension, model assessment, and user-system engagement. However, substantial challenges persist, encompassing computational expenses, issues of credibility, limitations specific to certain domains, and ethical implications. The rapid and extraordinary progress made in the field of LLMs has dramatically redefined the framework of natural language processing, paving the way for ever more elaborate and sophisticated conversational search functionalities that had previously seemed impossible. The effective capture and interpretation of user intent in complex contextual search scenarios remains a substantial challenge that must be addressed to optimize these interactions. Through extensive experimentation and evaluation, we demonstrate the effectiveness of the proposed framework in improving search relevance, user satisfaction, and interaction efficiency.

  • Research Article
  • Cited by 2
  • 10.1016/j.csbj.2024.12.019
Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis.
  • Jan 1, 2025
  • Computational and structural biotechnology journal
  • Reem Agbareia + 5 more

Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes. We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases. LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, Physicians: 39.5%, p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p < 0.001; Claude Sonnet 3.5: 67.3%, p = 0.060; Physicians: 78.8%, p < 0.001, all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45-60% of cases when images were provided. Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.

  • Research Article
  • 10.36001/phmconf.2025.v17i1.4407
Evaluating Large Language Models for Turboshaft Engine Torque Prediction
  • Oct 26, 2025
  • Annual Conference of the PHM Society
  • Alessandro Tronconi + 2 more

Recent advancements in deep learning have introduced new opportunities for quality management in manufacturing, particularly through transformer-based architectures capable of learning from limited datasets and handling complex, multimodal inputs. Among these, Large Language Models (LLMs) have emerged as a significant innovation, demonstrating strong capabilities in forecasting and representing the cutting edge of artificial intelligence (AI). Through transfer learning, LLMs effectively process and generate extended text sequences, and recent developments show their potential for multimodal integration, including text, images, audio, and video data. Quality management is a critical area for industrial innovation, rapidly evolving as manufacturers seek to close the quality-manufacturing loop and achieve zero-defect production goals. While computer vision techniques based on deep learning have been widely implemented for visual inspection tasks, integrating multiple heterogeneous data sources offers the possibility for even greater improvements. Despite the success of LLMs in language tasks, their application to time series data remains relatively unexplored. Alternative statistical approaches and deep learning models have proven effective for time series forecasting. Nevertheless, LLMs could provide additional advantages in industrial contexts, offering opportunities to enhance in-line quality control, defect prevention, and predictive discarding strategies across various sectors. This paper investigates the potential of applying LLMs to time series analysis by comparing the performance of an LLM (GPT-2), originally trained on textual data, with a model specifically designed for time series data (TimeGPT), and a more conventional transformer-based architecture. Our study includes a dedicated time series GPT model and a general-purpose LLM in a comparative evaluation. Through this analysis, we aim to better understand how language models can be effectively adapted to time series forecasting tasks and explore their transfer learning potential for enhancing quality management in manufacturing.

  • Research Article
  • Cited by 8
  • 10.1007/s00261-024-04708-8
Multi-modal large language models in radiology: principles, applications, and potential.
  • Dec 2, 2024
  • Abdominal radiology (New York)
  • Yiqiu Shen + 6 more

Large language models (LLMs) and multi-modal large language models (MLLMs) represent the cutting-edge in artificial intelligence. This review provides a comprehensive overview of their capabilities and potential impact on radiology. Unlike most existing literature reviews focusing solely on LLMs, this work examines both LLMs and MLLMs, highlighting their potential to support radiology workflows such as report generation, image interpretation, EHR summarization, differential diagnosis generation, and patient education. By streamlining these tasks, LLMs and MLLMs could reduce radiologist workload, improve diagnostic accuracy, support interdisciplinary collaboration, and ultimately enhance patient care. We also discuss key limitations, such as the limited capacity of current MLLMs to interpret 3D medical images and to integrate information from both image and text data, as well as the lack of effective evaluation methods. Ongoing efforts to address these challenges are introduced.

  • Research Article
  • Cited by 13
  • 10.1016/j.media.2024.103279
Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
  • Jul 20, 2024
  • Medical Image Analysis
  • Xinyue Hu + 7 more

  • Research Article
  • 10.1016/j.dld.2025.11.009
Performance of gastroenterologists and multimodal LLMs in endoscopic EREFS scoring of Eosinophilic Esophagitis.
  • Dec 1, 2025
  • Digestive and liver disease : official journal of the Italian Society of Gastroenterology and the Italian Association for the Study of the Liver
  • Asaf Levartovsky + 8 more

  • Research Article
  • Cited by 195
  • 10.1038/s41368-023-00239-y
ChatGPT for shaping the future of dentistry: the potential of multi-modal large language model
  • Jul 28, 2023
  • International Journal of Oral Science
  • Hanyao Huang + 10 more

ChatGPT, a lightweight, conversational variant of the Generative Pretrained Transformer 4 (GPT-4) developed by OpenAI, is one of the milestone Large Language Models (LLMs) with billions of parameters. LLMs have stirred up much interest among researchers and practitioners with their impressive skills in natural language processing tasks, which profoundly impact various fields. This paper mainly discusses the future applications of LLMs in dentistry. We introduce two primary LLM deployment methods in dentistry, automated dental diagnosis and cross-modal dental diagnosis, and examine their potential applications. In particular, equipped with a cross-modal encoder, a single LLM can manage multi-source data and conduct advanced natural language reasoning to perform complex clinical operations. We also present cases to demonstrate the potential of a fully automatic multi-modal LLM AI system for dental clinical application. While LLMs offer significant potential benefits, challenges such as data privacy, data quality, and model bias need further study. Overall, LLMs have the potential to revolutionize dental diagnosis and treatment, indicating a promising avenue for clinical application and research in dentistry.
