Improved Classification of Large Language Models
"Currently, large language models can generate text in response to input data. They are even starting to show good performance in other tasks. In addition, large language models can be components of models that do more than just generate text. There are well-known projects in which large language models were used to create sentiment detectors, toxicity classifiers, and image captions. The above has led to the interest of various companies in creating large language models, which has contributed to the creation of a significant number of large language models. In this regard, it is very difficult for an ordinary user to navigate the existing variety of large language models. Analysis of recent studies and publications on large language models has shown that, as a rule, they concern one large language model, or a comparative analysis of two large language models, and less often a comparative analysis of several large language models. Among the recent publications devoted to the study of large language models, one can note a publication that groups large language models according to their ease of use by end users. However, the above-mentioned work did not study large language models with which the user cannot interact via a chatbot and which are not available to ordinary users. It should be noted that users of large language models are not only physical users but also companies for which large language models with which the user cannot interact via a chatbot and which are not available to ordinary users, but may be available to the company, may also be interesting and in demand. As a result of the research, the classification of large language models was improved, which will allow different users to better navigate large language models and facilitate the search for the necessary language model. It should be noted that existing large language models are constantly being developed and improved by their developers. 
In addition, many large well-known companies and their separate divisions are working on the development of new large language models. In this regard, there is a constant need to track these processes and improve the classification of large language models in accordance with their current state."
- Conference Article
- 10.1145/3711875.3729128
- Jun 23, 2025
While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
3
- 10.1109/access.2024.3419079
- Jan 1, 2024
- IEEE Access
Large language models' exceptional all-purpose abilities have made human-computer conversation commonplace, but in particular industries and verticals they fall short in domain expertise and in the timeliness of information. To give current information and provide improved search capabilities, large language models increasingly need to incorporate specialist resources and databases. In this research, a model for intelligent assisted decision-making is proposed that incorporates knowledge from domain-specific databases and real-time data and uses large language models to offer expert tax guidance. The research aims to overcome the limits of general-purpose language models and deliver specialized advice for tax-related inquiries by complementing large language models with domain-specific information. The results demonstrate that, by offering tax advice tailored to a given situation, the proposed model surpasses general-purpose large language models in the validity of its answers. Our contribution is not only exploring the combination of the tax domain and large language models, but also proposing a new, effective model for government tax departments to use in practice. This study highlights the potential of large language models for use in real-world professional domains and advances the field of domain-specific human-computer interaction.
- Research Article
- 10.28945/5693
- Jan 1, 2026
- Journal of Information Technology Education: Research
Aim/Purpose: The study investigates the factors influencing the acceptance and utilisation of large language models (LLMs) (predictor variables of LLM usage), such as ChatGPT, in Learning design by instructional designers and university-teaching academics from various countries. Background: Large language models (LLMs) have exploded onto the scene, transforming the landscape of learning design. Instructional designers and university teaching academics have been overburdened with content creation for their teaching programmes, and the arrival of LLM models will help in this regard by developing more interactive content that drives student engagement and, in turn, contributes to student success. Since LLMs are a relatively new phenomenon, little is known about the factors influencing their acceptance in learning design; therefore, this research is needed, as learning design principles are the bedrock of student engagement and success. Methodology: A cross-sectional correlational quantitative study was employed. Data was collected using an online questionnaire posted on social media, including LinkedIn, from 203 instructional designers and university teaching academics. Purposive and snowball sampling methods were used to target instructional designers and university teaching academics at colleges and universities worldwide. Participants were asked to share the survey link with fellow instructional designers and university-teaching academics in their communities. The factor structure of the data was determined using exploratory factor analysis. Nonetheless, the factor structure derived from the LLMs did not entirely reflect the original configuration of the Unified Theory of Acceptance and Use of Technology (UTAUT3), as certain predictors appeared to coalesce, indicating LLMs’ unique nature in learning design. Confirmatory factor analysis was used to verify the fit of the data on the measurement model. 
First-order and second-order structural modelling were used to identify the structural relationships among the variables. Contribution: The study determines significant factors for the acceptance of LLMs by instructional designers and academic teaching staff in learning design, enabling possible opportunities for best practices in the field through interventions to optimize LLM usage. The study applies the technology acceptance model to the emerging LLM technology and extends the technology acceptance model by adding the trust construct as a predictor variable. Findings: The structural analysis results indicated that the ingrained LLM practices, LLM peer-driven expectations, innovative propensity towards LLM adoption, reliability and provider trust in LLMs, and ease of use and support influenced perceived LLM benefits and usage, but community standards and infrastructure had no influence. The second-order structural equation modelling indicated that perceived LLM benefits and usage and ingrained LLM habits contributed most to the learning design. Recommendations for Practitioners: Teaching academics and instructional designers must use LLMs in designing content, assessments, and interactive learning activities, and attend LLM training workshops on prompting and best practices in integrating LLMs into learning and teaching to see their benefits; hence, regular use of LLMs will then lead to trust and innovation in LLMs usage, enhancing learning design and improving student learning outcomes. Recommendation for Researchers: Researchers must use mixed methods approaches to have a deeper understanding of the factors influencing LLMs. Since habit and perceived LLM benefits and usage contributed the most variance to learning design, researchers must investigate strategies that optimise these factors in learning design, such as effective intervention strategies that can help form positive LLM habits. 
In addition, the findings provide researchers with a starting point for future research. Further researchers must investigate interventions that optimise the influence of personal innovativeness and trust that contributed the least variance to learning design, hence unlocking the potential of LLMs in learning design through innovation, responsible, and ethical use. Impact on Society: The use of LLMs in learning design has a high possibility of transforming education, specifically the learning design landscape. Using LLMs will free up more time for teaching academics and instructional designers so that they spend more time on higher-order thinking skill demands. Consequently, the students will be exposed to more engaging and interactive content, resulting in improved learning outcomes. Future Research: Future research must include context-derived external variables in technology acceptance models, such as levels of prompting competencies, to provide a deeper understanding of LLMs. In addition, future research must be based on the application and impact of LLMs on student engagement and success, and their attainment of 21st-century skills.
- Research Article
- 10.3348/kjr.2025.1045
- Jan 1, 2026
- Korean journal of radiology
This study aimed to evaluate the accuracy and reasoning capabilities of large multimodal language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. This experimental study used 401 custom-made radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and to describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was set to 0 and to 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). 
LLMs remarkably underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.
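The top-3 accuracy metric used in this abstract can be illustrated with a short calculation. This is a minimal sketch, not the study's evaluation code: the names `top_k_accuracy` and `percent` and the three toy quizzes are assumptions made for illustration, while the counts 120, 198 and 401 are the ones stated in the abstract.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 3) -> float:
    """Fraction of cases whose correct diagnosis appears among the first k ranked answers."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for answers, correct in zip(ranked, truth) if correct in answers[:k])
    return hits / len(truth)


def percent(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, the precision used in the abstract."""
    return round(100 * numerator / denominator, 1)


# Hypothetical toy data: three quizzes, each with a ranked differential diagnosis.
toy_ranked = [
    ["meningioma", "schwannoma", "metastasis"],
    ["glioblastoma", "lymphoma", "abscess"],
    ["cavernoma", "metastasis", "hemorrhage"],
]
toy_truth = ["schwannoma", "metastasis", "cavernoma"]
toy_score = top_k_accuracy(toy_ranked, toy_truth)  # quizzes 1 and 3 are hits

# Counts taken from the abstract (lowest and highest LLM results).
llm_low = percent(120, 401)
llm_high = percent(198, 401)
```

Under this definition an answer counts as correct whenever the true diagnosis is anywhere in the top three, so the metric is more lenient than top-1 accuracy.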
- Research Article
4
- 10.1038/s41698-025-00916-7
- May 23, 2025
- npj Precision Oncology
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
- Supplementary Content
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications. Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions. Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards. Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
1
- 10.1080/13658816.2025.2577252
- Nov 1, 2025
- International Journal of Geographical Information Science
The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).
- Research Article
37
- 10.2196/64290
- Feb 13, 2025
- Journal of medical Internet research
Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople's questions. We aimed to compare the frequency of use and attitudes toward LLMs and search engines as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. In total, 2002 US participants recruited on Prolific participated in the screening survey about the use of LLMs and search engines. Of them, 52% (n=1045) of the participants were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) were of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) for LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). 
In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. Search engines are the primary source of health information; yet, positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs' health output.
- Preprint Article
- 10.2196/preprints.64290
- Jul 14, 2024
BACKGROUND Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople’s questions. OBJECTIVE We aimed to compare the frequency of use and attitudes toward LLMs and search engines as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. METHODS We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. RESULTS In total, 2002 US participants recruited on Prolific participated in the screening survey about the use of LLMs and search engines. Of them, 52% (n=1045) of the participants were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) were of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) for LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. 
Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. CONCLUSIONS Search engines are the primary source of health information; yet, positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs’ health output.
- Conference Article
135
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for the Python Pandas API from multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
- Research Article
12
- 10.1016/j.procs.2023.09.086
- Jan 1, 2023
- Procedia Computer Science
A Large and Diverse Arabic Corpus for Language Modeling
- Discussion
- 10.1111/jgs.70177
- Oct 24, 2025
- Journal of the American Geriatrics Society
We thank Üçdal et al. for their thoughtful letter [1] advocating for the development and use of domain-specific large language models (LLMs) in healthcare in reference to our recent publication on the use of LLMs for identifying preoperative frailty among older adults using clinical notes [2]. They make interesting and valid points on how to ensure the use of artificial intelligence (AI) in medicine is accurate, applicable, and transparent, just like any clinical tools that are developed and become widely used to evaluate and treat patients. We agree that identifying or building tools that specifically excel in clinical applications will be key in the future of AI as clinical tools. General-purpose LLMs, while powerful, may fall short in the contextual understanding of medical text and handling of unique clinical language used in medicine. These categories of language models are typically trained on broad internet corpora that include only a small fraction of biomedical literature, electronic health records, and guideline-based knowledge. As a result, they may generate fluent but factually incorrect answers (e.g., hallucinations)—a phenomenon that is particularly problematic when applied to high-stakes clinical settings. In contrast, domain-specific models are specifically pre-trained on curated biomedical text, peer-reviewed literature, and structured health data and may potentially reduce hallucination rates, increase the precision of medical terminology, and align more closely with established standards of care. Our study is one example that shows how general-purpose models compared to specialized models tailored to clinical contexts may likely underperform in healthcare-related tasks. However, this is not always the case, as demonstrated by another study that showed similar performances between domain-specific and general-purpose language models for identifying the need for preoperative cardiac evaluations [3]. 
Furthermore, general-purpose language models may be further fine-tuned with clinical notes or with optimized prompt engineering to improve performance for healthcare-related tasks [4]. Regardless, the concept is the same, in that leveraging LLMs for clinical tasks must take into consideration the knowledge base of its underlying foundation model for developing accurate AI-based tools for medicine. We agree with the authors' point that we need to ensure international relevance when using LLMs as clinical tools. Even within healthcare itself, AI models trained on certain subpopulations may still not be accurate and exhibit bias when used on another patient population [5]. It follows that a model that performs well within one country, trained on one patient population, may not generalize globally, particularly when guidelines, documentation styles, and patient demographics vary. Just as we validate clinical guidelines across populations, so too must we evaluate LLMs to ensure safe, equitable application [6]. In order to properly use models, clinicians need to understand not only what the result is but why. In order to properly develop and use clinical tools, we must understand them—the ethical principle behind explainability. Moving forward, it is important if we use AI models that we maintain transparency, as Üçdal et al. point out, with interpretability mechanisms to understand what models learn and why. We will hold LLMs and other AI tools to the same standard as all clinical tools used in medicine. As with any medical advancement, those developing and implementing the tool have responsibility for clinical validation, usability testing, post-deployment monitoring, and ongoing iteration based on real-world data. As healthcare moves toward an increased demand and utilization, AI technologies such as LLMs have come into play to streamline and improve care in a world of increasing workload and decreasing resources. 
As we have seen in various aspects of healthcare, including our study using LLMs to identify a difficult-to-quantify state such as frailty, LLMs and other aspects of AI increasingly show great potential in improving our ability to care for patients. Like with any clinical tool we use, it must be proven to improve and not compromise care. Along those lines, it is also critical to apply tools that are relevant and designed to perform well. Careful steps forward to make sure only AI technologies appropriate to the proposed usage, such as domain-specific LLMs, careful testing, validation, and transparency of models will ensure we are improving care and not causing harm to our patients. In this way, clinicians can learn about and lead healthcare toward the best direction forward using a complex but powerful technology. Y.Q.Z. contributed to the concept design and preparation of the manuscript. R.A.G. contributed to the concept design and preparation of the manuscript. The authors have nothing to report. The authors declare no conflicts of interest. This publication is linked to a related Letter to the Editor article by Üçdal et al. To view this article, visit https://doi.org/10.1111/jgs.70171.
- Research Article
25
- 10.1200/jco.24.00326
- Sep 3, 2024
- Journal of clinical oncology : official journal of the American Society of Clinical Oncology
Current approaches to accurately identify immune-related adverse events (irAEs) in large retrospective studies are limited. Large language models (LLMs) offer a potential solution to this challenge, given their high performance in natural language comprehension tasks. Therefore, we investigated the use of an LLM to identify irAEs among hospitalized patients, comparing its performance with manual adjudication and International Classification of Disease (ICD) codes. Hospital admissions of patients receiving immune checkpoint inhibitor (ICI) therapy at a single institution from February 5, 2011, to September 5, 2023, were individually reviewed and adjudicated for the presence of irAEs. ICD codes and an LLM with retrieval-augmented generation were applied to detect frequent irAEs (ICI-induced colitis, hepatitis, and pneumonitis) and the most fatal irAE (ICI-myocarditis) from electronic health records. The performance between ICD codes and LLM was compared via sensitivity and specificity with an α = .05, relative to the gold standard of manual adjudication. External validation was performed using a data set of hospital admissions from June 1, 2018, to May 31, 2019, from a second institution. Of the 7,555 admissions for patients on ICI therapy in the initial cohort, 2.0% were adjudicated to be due to ICI-colitis, 1.1% ICI-hepatitis, 0.7% ICI-pneumonitis, and 0.8% ICI-myocarditis. The LLM demonstrated higher sensitivity than ICD codes (94.7% v 68.7%), achieving significance for ICI-hepatitis (P < .001), myocarditis (P < .001), and pneumonitis (P = .003) while yielding similar specificities (93.7% v 92.4%). The LLM spent an average of 9.53 seconds/chart in comparison with an estimated 15 minutes for adjudication. In the validation cohort (N = 1,270), the mean LLM sensitivity and specificity were 98.1% and 95.7%, respectively. LLMs are a useful tool for the detection of irAEs, outperforming ICD codes in sensitivity and adjudication in efficiency.
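Sensitivity and specificity, the two measures used above to compare the LLM and ICD codes against manual adjudication, follow directly from confusion-matrix counts. The abstract does not report per-cell counts, so the numbers below are hypothetical and only illustrate the formulas; the `ConfusionCounts` class and the function names are likewise assumptions, not part of the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Detector output versus the gold standard (manual adjudication)."""
    true_positive: int
    false_negative: int
    true_negative: int
    false_positive: int


def sensitivity(c: ConfusionCounts) -> float:
    """Share of adjudicated-positive admissions that the detector flagged."""
    positives = c.true_positive + c.false_negative
    if positives == 0:
        raise ValueError("no positive cases in the gold standard")
    return c.true_positive / positives


def specificity(c: ConfusionCounts) -> float:
    """Share of adjudicated-negative admissions that the detector left unflagged."""
    negatives = c.true_negative + c.false_positive
    if negatives == 0:
        raise ValueError("no negative cases in the gold standard")
    return c.true_negative / negatives


# Hypothetical counts, NOT taken from the study.
example = ConfusionCounts(true_positive=18, false_negative=1,
                          true_negative=90, false_positive=10)
example_sensitivity = sensitivity(example)  # 18 / 19
example_specificity = specificity(example)  # 90 / 100
```

Because the events of interest are rare (each accounts for about 2% or less of admissions in the initial cohort), sensitivity is the measure on which a detector can differ most visibly while specificity stays similar.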
- Research Article
9
- 10.1016/j.wneu.2024.11.114
- Feb 1, 2025
- World Neurosurgery
Aim: This study aimed to investigate the accuracy of large language models (LLMs), specifically ChatGPT and Claude, in surgical decision-making and radiological assessment for spine pathologies compared to experienced spine surgeons. Methods: The study employed a comparative analysis between the LLMs and a panel of attending spine surgeons. Five written clinical scenarios encompassing various spine pathologies were presented to the LLMs and surgeons, who provided recommended surgical treatment plans. Additionally, MRI images depicting spine pathologies were analyzed by the LLMs and surgeons to assess their radiological interpretation abilities. Spino-pelvic parameters were estimated from a scoliosis radiograph by the LLMs. Results: Qualitative content analysis revealed limitations in the LLMs' consideration of patient-specific factors and the breadth of treatment options. Both ChatGPT and Claude provided detailed descriptions of MRI findings but differed from the surgeons in terms of specific levels and severity of pathologies. The LLMs acknowledged the limitations of accurately measuring spino-pelvic parameters without specialized tools. The accuracy of surgical decision-making for the LLMs (20%) was lower than that of the attending surgeons (100%). Statistical analysis showed no significant differences in accuracy between the groups. Conclusion: The study highlights the potential of LLMs in assisting with radiological interpretation and surgical decision-making in spine surgery. However, the current limitations, such as the lack of consideration for patient-specific factors and inaccuracies in treatment recommendations, emphasize the need for further refinement and validation of these AI models. Continued collaboration between AI researchers and clinical experts is crucial to address these challenges and realize the full potential of AI in spine surgery.