Text clustering with large language model embeddings


Similar Papers
  • Research Article
  • 10.1609/aies.v8i2.36682
What Are Chatbots’ Stereotypes About? A Data-Driven Analysis of Large Language Models’ Content Associations with Social Categories
  • Oct 15, 2025
  • Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
  • Gandalf Nicolas + 1 more

This study introduces a data-driven taxonomy of stereotype content in contemporary large language models (LLMs). We prompt ChatGPT 4.5, ChatGPT 3.5, Llama 3, and Mixtral 8x7B, four recent and powerful LLMs, for the characteristics associated with 87 social categories (e.g., gender, race, occupations). We show that these prompts are reliable and valid, predicting unrelated tasks such as storytelling about the targets. Using text embeddings and cluster analyses, we identify 14 dimensions (Ability, Appearance, Assertiveness, Beliefs, Deviance, Emotion, Family, Geography, Health, Morality, Occupations, Social categories, Sociability, and Status) in LLMs' stereotypes. This high-dimensional taxonomy reveals both similarities (e.g., the same set of dimensions) and differences (e.g., variation in the prevalence of content) with human stereotypes. In addition, we find that highly overlapping taxonomies emerge from analyses of personal and cultural stereotypes, as well as across various LLMs. However, again, some prompts and LLMs differ in how frequently specific dimensions appear in association with social categories. Our findings suggest that LLMs' stereotypes are high-dimensional, and that auditing and debiasing efforts would benefit from considering this complexity to minimize the unidentified harm of relying on low-dimensional views of bias in LLMs.
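The embed-then-cluster pipeline this paper describes can be illustrated with a toy k-means over hand-made 2-D vectors. This is a minimal sketch, not the paper's implementation: real inputs would be high-dimensional LLM embeddings of the generated characteristics, the paper does not say it used k-means specifically, and the deterministic "first k points" initialization is only for readability.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: partition embedding vectors into k clusters."""
    # Toy deterministic initialization: the first k points become centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centers

# Two well-separated toy "embedding" clouds stand in for real embeddings.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
labels, centers = kmeans(pts, 2)
```

With real data, each resulting cluster would then be inspected (or named) to yield a dimension such as "Morality" or "Status".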

  • Research Article
  • Cited by 1
  • 10.1186/s13326-025-00331-8
Unveiling differential adverse event profiles in vaccines via LLM text embeddings and ontology semantic analysis
  • May 23, 2025
  • Journal of Biomedical Semantics
  • Zhigang Wang + 3 more

Background: Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and assignment of AEs to terms in a terminology or ontology, which is time-consuming and constrained in scope. This study explores the potential of using Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis. Results: We used the Llama-3 LLM to extract AE information from FDA-approved vaccine package inserts for 111 licensed vaccines, including 15 influenza vaccines. Text embeddings were then generated for each vaccine's AEs using the nomic-embed-text and mxbai-embed-large models. Llama-3 achieved over 80% accuracy in extracting AE text from vaccine package inserts. To further evaluate the text embeddings, the vaccines were clustered using two methods: (1) LLM text-embedding-based clustering and (2) ontology-based semantic similarity analysis. The ontology-based method mapped AEs to the Human Phenotype Ontology (HPO) and the Ontology of Adverse Events (OAE), with semantic similarity analyzed using Lin's method. Compared to the semantic similarity analysis, the LLM approach was able to capture more differential AE profiles. Furthermore, LLM-derived text embeddings were used to develop a Lasso logistic regression model to predict whether a vaccine is "Live" or "Non-Live"; "Non-Live" refers to all vaccines that do not contain live organisms, including inactivated and mRNA vaccines. A comparative analysis showed that, despite similar clustering patterns, the nomic-embed-text model outperformed the other, achieving 80.00% sensitivity, 83.06% specificity, and 81.89% accuracy in 10-fold cross-validation. Many AE patterns, with examples demonstrated, were identified from our analysis of the AE LLM embeddings. Conclusion: This study demonstrates the effectiveness of LLMs for automated AE extraction and analysis; LLM text embeddings capture latent information about AEs, enabling more comprehensive knowledge discovery. Our findings suggest that LLMs hold substantial potential for improving vaccine safety and public health research.
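The sensitivity, specificity, and accuracy figures quoted for the Live vs. Non-Live classifier are standard confusion-matrix quantities. A minimal sketch of their computation (labels below are illustrative, with 1 standing in for "Live"):

```python
def sens_spec_acc(y_true, y_pred):
    """Confusion-matrix metrics for a binary classifier (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)   # recall on the positive ("Live") class
    specificity = tn / (tn + fp)   # recall on the negative ("Non-Live") class
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy
```

In the study, `y_pred` would come from the Lasso logistic regression over the vaccine AE embeddings, averaged over the 10 cross-validation folds.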

  • Research Article
  • 10.1001/jamanetworkopen.2025.49963
Vulnerability of Large Language Models to Prompt Injection When Providing Medical Advice
  • Dec 19, 2025
  • JAMA Network Open
  • Ro Woon Lee + 5 more

Large language models (LLMs) are increasingly integrated into health care applications; however, their vulnerability to prompt-injection attacks (ie, maliciously crafted inputs that manipulate an LLM's behavior) capable of altering medical recommendations has not been systematically evaluated. The objective was to evaluate the susceptibility of commercial LLMs to prompt-injection attacks that may induce unsafe clinical advice and to validate man-in-the-middle, client-side injection as a realistic attack vector. This quality improvement study used a controlled simulation design and was conducted between January and October 2025 using standardized patient-LLM dialogues. The main experiment evaluated 3 lightweight models (GPT-4o-mini [LLM 1], Gemini-2.0-flash-lite [LLM 2], and Claude-3-haiku [LLM 3]) across 12 clinical scenarios under controlled conditions. The 12 scenarios were stratified by harm level across 4 categories: supplement recommendations, opioid prescriptions, pregnancy contraindications, and central-nervous-system toxic effects. A proof-of-concept experiment tested 3 flagship models (GPT-5 [LLM 4], Gemini 2.5 Pro [LLM 5], and Claude 4.5 Sonnet [LLM 6]) using client-side injection in a high-risk pregnancy scenario. Two prompt-injection strategies were used: (1) context-aware injection for moderate- and high-risk scenarios and (2) evidence-fabrication injection for extremely high-harm scenarios. Injections were programmatically inserted into user queries within a multiturn dialogue framework. The primary outcome was injection success at the primary decision turn; secondary outcomes included persistence across dialogue turns and model-specific success rates by harm level. Across 216 evaluations (108 injection vs 108 control), attacks achieved 94.4% (102 of 108 evaluations) success at turn 4 and persisted in 69.4% (75 of 108 evaluations) of follow-ups.
LLM 1 and LLM 2 were completely susceptible (36 of 36 dialogues [100%] each), and LLM 3 remained vulnerable in 83.3% of dialogues (30 of 36 dialogues). Attacks in extremely high-harm scenarios, including US Food and Drug Administration Category X pregnancy drugs (eg, thalidomide), succeeded in 91.7% of dialogues (33 of 36 dialogues). The proof-of-concept experiment demonstrated 100% vulnerability for LLM 4 and LLM 5 (5 of 5 dialogues each) and 80.0% (4 of 5 dialogues) for LLM 6. In this quality improvement study using a controlled simulation, commercial LLMs demonstrated substantial vulnerability to prompt-injection attacks that could generate clinically dangerous recommendations; even flagship models with advanced safety mechanisms showed high susceptibility. These findings underscore the need for adversarial robustness testing, system-level safeguards, and regulatory oversight before clinical deployment.

  • Research Article
  • 10.4258/hir.2025.31.3.263
Public Perceptions and Barriers to Tuberculosis Treatment in Korea: A Large Language Model-Based Analysis of Naver Knowledge-iN Data from 2002 to 2024
  • Jul 1, 2025
  • Healthcare Informatics Research
  • Hyewon Park + 5 more

Objectives: This study was conducted to investigate public perceptions and concerns surrounding tuberculosis (TB) treatment in Korea through an analysis of online queries about antitubercular medications. Additionally, it evaluated the effectiveness of large language models (LLMs) as analytical tools for processing unstructured healthcare data. Methods: Using LLMs, this study analyzed 44,174 questions that mentioned TB from Naver Knowledge-iN (2002–2024). Questions referencing antitubercular medications were extracted and thematically categorized. Side effects were analyzed through parallel approaches examining general and medication-specific effects. Questions about infectivity and social implications were further analyzed using text embedding, dimensionality reduction, and clustering. The performance of LLMs was evaluated against human researchers and traditional methods. Results: Among questions mentioning specific medications (n = 919), rifampin (31.8%) and isoniazid (31.6%) were most frequently referenced. Of the 10,044 questions regarding antitubercular medication, management challenges represented the largest category (44.8%). Analysis of infectivity and social implications (n = 583) revealed previously unidentified concerns about blood donation and immigration eligibility. Employment-related concerns constituted the largest distinct subgroup (20.6%). Hepatotoxicity, dermatosis, and vomiting were the most frequently reported side effects. LLMs outperformed keyword matching in data processing and offered cost advantages over human analysis, with fine-tuning further reducing processing costs. Conclusions: This study produced novel insights into public concerns regarding TB treatment and demonstrated the effectiveness of combining social media platform data with LLM-based analysis, providing a systematic framework for future healthcare research using unstructured public data and LLMs.
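The keyword-matching baseline that the LLM pipeline outperformed can be sketched as a simple category-by-keyword lookup. The category names and keywords below are invented for illustration; they are not the study's actual categorization scheme:

```python
# Hypothetical categories and keywords, not the study's taxonomy.
CATEGORY_KEYWORDS = {
    "side effects": ["side effect", "nausea", "rash", "hepatotoxicity"],
    "infectivity": ["contagious", "infect", "transmission"],
    "social implications": ["employment", "blood donation", "immigration"],
}

def categorize(question):
    """Assign a question to every category whose keywords appear in it."""
    q = question.lower()
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items()
                  if any(kw in q for kw in kws))

cats = categorize("Is TB contagious after starting rifampin? I worry about my employment.")
```

Such fixed keyword lists miss paraphrases and context ("can I still donate blood?"), which is the gap the abstract reports LLM-based processing closing.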

  • Research Article
  • 10.1038/s44220-025-00527-y
The empirical structure of psychopathology is represented in large language models
  • Nov 18, 2025
  • Nature Mental Health
  • Joseph Kambeitz + 5 more

Clinical assessment and scientific research in psychiatry are largely based on questionnaires that are used to assess psychopathology. The development of large language models (LLMs) offers a new perspective for analysis of the language and terminology on which these questionnaires are based. We used state-of-the-art LLMs to derive numerical representations ("text embeddings") of the semantic and sentiment content of items from established questionnaires for the assessment of psychopathology. We compared the pairwise associations between empirical data from cross-sectional studies and text embeddings to test whether the empirical structure of psychopathology can be reconstructed by LLMs. Across four large-scale datasets (n = 1,555; n = 1,099; n = 11,807; and n = 39,755), we found a range of significant correlations between empirical item-pair associations and associations derived from text embeddings (r = 0.18 to r = 0.57, all P < 0.05). Random forest regression models based on semantic or sentiment embeddings predicted empirical item-pair associations with moderate to high accuracy (r = 0.33 to r = 0.81, all P < 0.05). Similarly, empirical clustering of items and grouping to established subdomain scores could be partly reconstructed by text embeddings. Our results demonstrate that LLMs are able to represent substantial components of the empirical structure of psychopathology. Consequently, the integration of LLMs into mental health research has the potential to unlock numerous promising avenues. These may encompass improving the process of developing questionnaires, optimizing generalizability and reducing the redundancy of existing questionnaires, or facilitating the development of new conceptualizations of mental disorders.
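Comparing embedding-derived item-pair associations with empirical ones reduces to two computations: cosine similarity between item embeddings, and a correlation across the resulting pair scores. A sketch under toy inputs (the vectors stand in for questionnaire-item embeddings; the paper's exact association measures may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of pair scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In the study's setup, `xs` would hold the cosine similarities of all item pairs and `ys` the corresponding empirical inter-item associations; `pearson(xs, ys)` then yields the reported r values.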

  • Conference Article
  • Cited by 3
  • 10.2991/icetis-13.2013.235
A Semi-Supervised Text Clustering Algorithm with Word Distribution Weights
  • Jan 1, 2013
  • Ping Zhou + 2 more

Semi-supervised text clustering, a research branch of text clustering, aims to employ limited prior knowledge to aid the unsupervised clustering process and help users obtain improved clustering results. Because labeled data are difficult, expensive, and time-consuming to obtain, it is important to use the supervised information effectively to improve clustering performance significantly. This paper proposes a semi-supervised LDA text clustering algorithm based on the weights of word distribution (WWDLDA). By introducing word-distribution coefficients obtained from labeled data, the LDA model can be used for semi-supervised clustering. During clustering, the coefficients continually adjust the word distribution to change the clustering results. Our experimental results on real data sets show that the proposed semi-supervised text clustering algorithm obtains better clustering results than constrained mixmnl, where mixmnl stands for the multinomial model-based EM algorithm.

Introduction

Text clustering, an important method of knowledge discovery, is an unsupervised procedure for automatic text classification. By analyzing the relationships between documents, text clustering groups articles on the same theme into one class. Requiring neither a training process nor prior category labels, text clustering offers high automation and flexibility, and is widely used in data mining, information retrieval, and topic detection. Research on text clustering is demonstrated in [1-3]. Traditional document clustering algorithms are unsupervised learning methods that process unlabeled documents. In practical applications, however, people can obtain limited prior knowledge of the data, including class labels and constraints on document partitioning (such as pairwise constraints) [4]. Semi-supervised text clustering is a research branch of text clustering. It utilizes prior labeled data to guide the unsupervised clustering process on the basis of traditional text clustering methods, and obtains better clustering results. Semi-supervised text clustering has recently become a topic of significant interest. The complexity of document corpora has led to considerable interest in applying hierarchical statistical models known as topic models. Topic models can reduce data dimensionality by changing the document representation from words to topics, yielding a new document representation. Among topic models, Latent Dirichlet Allocation (LDA) [5] is one of the simplest and most popular, and arguably the most important probabilistic model in widespread use today. When documents are clustered by topic, the inferred topic distributions yield the clustering results; therefore LDA can be applied to text clustering. LDA is an unsupervised learning algorithm. This paper puts forward a new semi-supervised text clustering algorithm that embeds word-distribution weights into LDA. The coefficients guide the clustering process by updating the word distributions, thereby enhancing clustering performance. The semi-supervised LDA text clustering algorithm based on the weights of word distribution (WWDLDA) is evaluated on real data sets. The experimental results show that WWDLDA outperforms the constrained mixmnl algorithm [6].

International Conference on Education Technology and Information System (ICETIS 2013). © 2013 the authors. Published by Atlantis Press.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA), presented by Blei, is a topic model and a generative probabilistic model of a corpus. A document consisting of a large number of words can be concisely modeled as deriving from a smaller number of topics, where a topic is a probability distribution over words. The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

[Fig. 1: Graphical model representation of LDA.]

According to the graphical model representation shown in Figure 1, LDA assumes the following generative process for a document: first, choose θ, the parameter of a multinomial over topics, where θ follows a Dirichlet distribution; second, choose a topic z_n and then choose a word w_n from a multinomial probability conditioned on the topic z_n; finally, repeat the topic and word choices N times. The probability of a corpus is then

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, d\theta_d
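The generative story can be simulated directly, which makes it concrete: θ is drawn from a Dirichlet (here via normalized Gamma draws), then each word is produced by first sampling a topic and then a word from that topic. The two-topic, three-word vocabulary below is made up for illustration:

```python
import random

def generate_document(alpha, beta, n_words, seed=0):
    """Sample one document from LDA's generative process.
    alpha: Dirichlet parameters over topics; beta: per-topic word distributions."""
    rng = random.Random(seed)
    # theta ~ Dirichlet(alpha), sampled as normalized Gamma(a, 1) draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]      # z_n ~ Mult(theta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # w_n ~ Mult(beta_z)
        doc.append(w)
    return theta, doc

# Two toy topics over a three-word vocabulary.
theta, doc = generate_document([1.0, 1.0], [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]], 20)
```

Inference in LDA runs this story in reverse: given documents, recover the per-document θ, which WWDLDA then uses as the clustering signal.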

  • Research Article
  • 10.2196/77561
Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset.
  • Jan 15, 2026
  • JMIR medical informatics
  • Hyun-Ji Shin + 3 more

Parkinson disease (PD) presents diagnostic challenges due to its heterogeneous motor and nonmotor manifestations. Traditional machine learning (ML) approaches have been evaluated on structured clinical variables. However, the diagnostic utility of large language models (LLMs) using natural language representations of structured clinical data remains underexplored. This study aimed to evaluate the diagnostic classification performance of multiple LLMs using natural language prompts derived from structured clinical data and to compare their performance with traditional ML baselines. We reformatted structured clinical variables from the Parkinson's Progression Markers Initiative (PPMI) dataset into natural language prompts and used them as inputs for several LLMs. Variables with high multicollinearity were removed, and the top 10 features were selected using Shapley additive explanations (SHAP)-based feature ranking. LLM performance was examined across few-shot prompting, dual-output prompting that additionally generated post hoc explanatory text as an exploratory component, and supervised fine-tuning. Logistic regression (LR) and support vector machine (SVM) classifiers served as ML baselines. Model performance was evaluated using F1-scores on both the test set and a temporally independent validation set (temporal validation set) of limited size, and repeated output generation was carried out to assess stability. On the test set of 122 participants, LR and SVM trained on the 10 SHAP-selected clinical variables each achieved a macro-averaged F1-score of 0.960 (accuracy 0.975). LLMs receiving natural language prompts derived from the same variables reached comparable performance, with the best few-shot configurations achieving macro-averaged F1-scores of 0.987 (accuracy 0.992). In the temporal validation set of 31 participants, LR maintained a macro-averaged F1-score of 0.903, whereas SVM showed substantial performance degradation. 
In contrast, multiple LLMs sustained high diagnostic performance, reaching macro-averaged F1-scores up to 0.968 and high recall for PD. Repeated output generation across LLM conditions produced generally stable predictions, with rare variability observed across runs. Under dual-output prompting, diagnostic performance showed a reduction relative to few-shot prompting while remaining generally stable. Supervised fine-tuning of lightweight models improved stability and enabled GPT-4o-mini to achieve a macro-averaged F1-score of 0.987 on the test set, with uniformly correct predictions observed in the small temporal validation set, which should be interpreted cautiously given the limited sample size and exploratory nature of the evaluation. This study provides an exploratory benchmark of how modern LLMs process structured clinical variables in natural language form. While several models achieved diagnostic performance comparable to LR across both the test and temporal validation datasets, their outputs were sensitive to prompting formats, model choice, and class distributions. Occasional variability across repeated output generations reflected the stochastic nature of LLMs, and lightweight models required supervised fine-tuning for stable generalization. These findings highlight the capabilities and limitations of current LLMs in handling tabular clinical information and underscore the need for cautious application and further investigation.
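The core preprocessing step here, rendering structured variables as a natural-language prompt, can be sketched as below. The field names and wording are hypothetical; they are not the PPMI schema or the study's actual prompt template:

```python
def row_to_prompt(row):
    """Render one structured clinical record as a natural-language prompt.
    (Field names in `row` are illustrative, not the PPMI variables.)"""
    findings = "; ".join(f"{name.replace('_', ' ')}: {value}"
                         for name, value in row.items())
    return (f"Patient record. {findings}. "
            "Based on these findings, does this patient have Parkinson disease? "
            "Answer 'PD' or 'non-PD'.")

prompt = row_to_prompt({"tremor_score": 3, "smell_test": "reduced"})
```

In a few-shot configuration, several such rendered records with known labels would precede the query record in the same prompt.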

  • Research Article
  • Cited by 1
  • 10.1001/jamanetworkopen.2025.11922
Large Language Models and Text Embeddings for Detecting Depression and Suicide in Patient Narratives
  • May 23, 2025
  • JAMA Network Open
  • Silvia Kyungjin Lho + 9 more

Large language models (LLMs) and text-embedding models have shown potential in assessing mental health risks based on narrative data from psychiatric patients. The objective was to assess whether LLMs and text-embedding models can identify depression and suicide risk based on sentence completion test (SCT) narratives of psychiatric patients. This cross-sectional study, conducted at Seoul Metropolitan Government-Seoul National University Boramae Medical Center, analyzed SCT data collected from April 1, 2016, to September 30, 2021. Participants included psychiatric patients aged 18 to 39 years who completed SCT and self-assessments for depression (Beck Depression Inventory-II or Zung Self-Rating Depression Scale) and/or suicide (Beck Scale for Suicidal Ideation). Patients confirmed to have an IQ below 70 were excluded, leaving 1064 eligible SCT datasets (52 627 completed responses). Data processing with LLMs (GPT-4o, May 13, 2024, version, OpenAI [hereafter, LLM1]; gemini-1.0-pro, February 2024 version, Google DeepMind [hereafter, LLM2]; and GPT-3.5-turbo-16k, January 25, 2024, version, OpenAI) and text-embedding models (text-embedding-3-large, OpenAI [hereafter, text-embedding 1]; text-embedding-3-small, OpenAI; and text-embedding-ada-002, OpenAI) was performed between July 4 and September 30, 2024. Outcomes included the performance of LLMs and text-embedding models in detecting depression and suicide, as measured by the area under the receiver operating characteristic curve (AUROC), balanced accuracy, and macro F1-score. Performance was evaluated across concatenated narratives of SCT, including self-concept, family, gender perception, and interpersonal relations narratives. Based on SCT narratives from 1064 patients (mean [SD] age, 25.4 [5.5] years; 673 men [63.3%]), LLM1 showed strong performance in zero-shot learning, with an AUROC of 0.720 (95% CI, 0.689-0.752) for depression and 0.731 (95% CI, 0.704-0.762) for suicide risk using self-concept narratives.
Few-shot learning for depression further improved the performance of LLM1 (AUROC, 0.754 [95% CI, 0.721-0.784]) and LLM2 (AUROC, 0.736 [95% CI, 0.704-0.770]). The text-embedding 1 model paired with extreme gradient boosting outperformed other models, achieving an AUROC of 0.841 (95% CI, 0.783-0.897) for depression and 0.724 (95% CI, 0.650-0.795) for suicide risk. Overall, self-concept narratives showed the most accurate detections across all models. This cross-sectional study of SCT narratives from psychiatric patients suggests that LLMs and text-embedding models may effectively detect depression and suicide risk, particularly using self-concept narratives. However, while these models demonstrated potential for detecting mental health risks, further improvements in performance and safety are essential before clinical application.
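AUROC, the headline metric in these results, has a simple rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties counted as half). A minimal sketch over toy scores:

```python
def auroc(y_true, scores):
    """Area under the ROC curve via the pairwise-ranking formulation."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the study, `scores` would be the model's risk outputs (e.g., the gradient-boosting probability over text embeddings) and `y_true` the questionnaire-derived depression or suicide-risk labels.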

  • Research Article
  • 10.1016/j.ejrad.2025.112316
Comparing large language models and text embedding models for automated classification of textual, semantic, and critical changes in radiology reports.
  • Oct 1, 2025
  • European journal of radiology
  • Maximilian Lindholz + 11 more


  • Research Article
  • 10.1609/icwsm.v18i1.31419
TeC: A Novel Method for Text Clustering with Large Language Models Guidance and Weakly-Supervised Contrastive Learning
  • May 28, 2024
  • Proceedings of the International AAAI Conference on Web and Social Media
  • Chen Yang + 2 more

Text clustering has become an important branch of unsupervised learning methods and has been widely used in social media. Large Language Models (LLMs) represent a significant recent advancement in the field of AI, and some works have accordingly been dedicated to improving the clustering performance of embedding models with feedback from LLMs. However, current approaches hardly take cluster label information between text instances into consideration when fine-tuning embedding models, leading to the problem of cluster collision. To tackle this issue, this paper proposes TeC, a novel method operating through teaching and correcting phases. In these phases, LLMs take on the role of teachers, guiding embedding models as students to enhance their clustering performance. The teaching phase imparts cluster label information to embedding models by querying LLMs in a batch-wise manner and uses a proposed weakly-supervised contrastive learning loss to fine-tune embedding models based on the provided cluster label information. Subsequently, the correcting phase refines the clustering outcomes of the teaching phase by instructing LLMs to correct the cluster assignments of low-confidence samples. An extensive experimental evaluation on six text datasets across three different clustering tasks shows the superior performance of our proposed method over existing state-of-the-art approaches.
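The weak supervision here is pairwise: the LLM judges whether two texts belong to the same cluster, and a contrastive loss pulls same-cluster embeddings together while pushing different-cluster ones apart. One simple cosine-based form of such a loss (an illustration, not TeC's exact formulation) can be sketched as:

```python
import math

def _cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(pairs, margin=0.2):
    """pairs: (u, v, same) triples, where `same` is the LLM's same-cluster verdict.
    Same-cluster pairs are penalized for low similarity; different-cluster
    pairs are penalized only when their similarity exceeds the margin."""
    total = 0.0
    for u, v, same in pairs:
        s = _cos(u, v)
        total += (1.0 - s) ** 2 if same else max(0.0, s - margin) ** 2
    return total / len(pairs)
```

In training, this scalar would be minimized by gradient descent over the embedding model's parameters; here the vectors are fixed toy inputs.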

  • Research Article
  • Cited by 3
  • 10.1007/s11227-025-07414-4
Optimizing SBERT for long text clustering: two novel approaches with empirical insights
  • Jun 2, 2025
  • The Journal of Supercomputing
  • Yasin Ortakci + 1 more

Transformer-based Large Language Models (LLMs), which have recently gained popularity, have significantly impacted various fields of natural language processing. One of them is text clustering, which involves categorizing the huge volumes of text produced by today's digital world into meaningful groups. LLMs enable text clustering with a more semantic and contextualized approach than traditional methods. One such model is Sentence-BERT (SBERT), which has been modified to detect semantic similarity between texts. Before being clustered, texts need to be transformed into numerical text embeddings. SBERT-based models have shown promise in generating meaningful sentence embeddings. However, they face limitations when dealing with long texts that exceed their maximum token limit. In this context, this study proposes two distinct methods to overcome these limitations and enhance the performance of SBERT models for clustering long texts. The proposed methods are combined with various SBERT models, and their combinations are compared to the existing default method on three datasets containing lengthy texts. This study evaluates the impact of these methods on the models and their contributions to clustering performance. The findings indicate that the proposed methods achieve up to 14% higher clustering performance than the default method in text clustering. Additionally, this study provides valuable insights into the text clustering performance of SBERT models, offering practical implications for further research and applications.
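One common way around a sentence encoder's token limit, and a plausible reading of chunk-based strategies in general, is to split the text, embed each chunk, and mean-pool the chunk vectors. This is a generic sketch, not the paper's two specific methods; `embed_fn` stands in for an SBERT encoder, and the whitespace "tokenization" is a simplification:

```python
def embed_long_text(text, embed_fn, max_tokens=128):
    """Chunk a long text to fit the encoder's limit, embed each chunk,
    and average the chunk embeddings into one document vector."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_tokens])
              for i in range(0, len(words), max_tokens)]
    vectors = [embed_fn(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
```

The default SBERT behavior the paper compares against simply truncates past the token limit, discarding the tail of each long document; chunk-and-pool keeps that information at the cost of blurring chunk order.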

  • Research Article
  • Cited by 5
  • 10.1038/s41598-024-75331-2
Distilling the knowledge from large-language model for health event prediction
  • Dec 28, 2024
  • Scientific Reports
  • Sirui Ding + 3 more

Health event prediction is empowered by the rapid and wide adoption of electronic health records (EHR). In the Intensive Care Unit (ICU), precisely predicting health-related events in advance is essential for providing treatment and intervention to improve patients' outcomes. EHR is a kind of multi-modal data containing clinical text, time series, structured data, etc. Most health event prediction works focus on a single modality, e.g., text or tabular EHR. How to effectively learn from multi-modal EHR for health event prediction remains a challenge. Inspired by the strong text-processing capability of large language models (LLMs), we propose the framework CKLE for health event prediction by distilling knowledge from an LLM and learning from multi-modal EHR. There are two challenges in applying LLMs to health event prediction. The first is that most LLMs can only handle text data rather than other modalities, e.g., structured data. The second is that the privacy requirements of health applications demand that the LLM be locally deployed, which may be limited by computational resources. CKLE addresses the scalability and portability challenges of LLMs in the healthcare domain by distilling cross-modality knowledge from the LLM into the health event predictive model. To take full advantage of the power of the LLM, the raw clinical text is refined and augmented with prompt learning, and embeddings of the clinical text are generated by the LLM. To effectively distill the knowledge of the LLM into the predictive model, we design a cross-modality knowledge distillation (KD) method. A specially designed training objective is used for the KD process, taking into account multiple modalities and patient similarity. The KD loss function consists of two parts. The first is a cross-modality contrastive loss function, which models the correlation of different modalities from the same patient.
The second is a patient similarity learning loss function that models the correlations between similar patients. Cross-modality knowledge distillation can distill the rich information in clinical text and the knowledge of the LLM into a predictive model on structured EHR data. To demonstrate the effectiveness of CKLE, we evaluate it on two health event prediction tasks in the field of cardiology: heart failure prediction and hypertension prediction. We select 7,125 patients from the MIMIC-III dataset and split them into train/validation/test sets. CKLE achieves a maximum 4.48% improvement in accuracy compared to a state-of-the-art predictive model designed for health event prediction. The results demonstrate that CKLE surpasses the baseline prediction models significantly in both normal and limited-label settings. We also conduct a case study on cardiology disease analysis in heart failure and hypertension prediction. Through feature importance calculation, we analyze the salient features related to cardiology disease, which correspond to medical domain knowledge. The superior performance and interpretability of CKLE pave a promising way to leverage the power and knowledge of LLMs for health event prediction in real-world clinical settings.

  • Research Article
  • 10.6339/24-jds1149
Evaluation of Text Cluster Naming with Generative Large Language Models
  • Jan 1, 2024
  • Journal of Data Science
  • Alexander J Preiss + 8 more

Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option for automating the process of naming text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt used to get LLM responses). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider automated cluster naming to avoid bottlenecks, or when the scale of the effort is large enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. However, to get the best performance, it is vital to test a variety of prompting strategies and to run a small pilot to identify which one performs best on each project's unique data.
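A prompting strategy for cluster naming can be as simple as packing a sample of the cluster's texts and its frequent terms into one instruction. The template below is a hypothetical composite for illustration, not one of the paper's four strategies:

```python
from collections import Counter

def naming_prompt(cluster_texts, n_examples=5, n_terms=8):
    """Build an LLM prompt asking for a short name for one text cluster."""
    # Crude frequent-term extraction: lowercase words longer than 4 characters.
    counts = Counter(w.lower().strip(".,;:") for t in cluster_texts
                     for w in t.split() if len(w) > 4)
    terms = ", ".join(w for w, _ in counts.most_common(n_terms))
    examples = "\n".join(f"- {t}" for t in cluster_texts[:n_examples])
    return ("Propose a name of at most five words for this text cluster.\n"
            f"Frequent terms: {terms}\n"
            f"Example texts:\n{examples}\n"
            "Cluster name:")

prompt = naming_prompt(["Chest pain after exercise", "Severe chest tightness"])
```

The paper's finding that name quality varies by strategy and dataset suggests testing variants of such templates (examples only, terms only, both, with or without instructions) on a small sample before committing to one.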

  • Research Article
  • 10.1158/1538-7445.pancreatic25-b073
Abstract B073: Predicting pancreatic cancer risk from clinical notes using large language models
  • Sep 28, 2025
  • Cancer Research
  • Daniel Mau + 10 more

Background: Pancreatic cancer is typically detected at an incurable stage because symptoms are often absent during the early stages, and the current screening guidelines lack sensitivity and specificity. Existing risk prediction tools require structured data, which hinders deployment. We evaluated whether large language models (LLMs) could predict pancreatic cancer risk using only free-text clinical notes. Methods: We used routine free-text general practitioner clinical notes from individuals in Ontario aged >18 years, collected through ICES between 2010 and 2016. Pancreatic cancer patients were matched with controls using a nested case–control design with metrics adjusted using inverse probability weighting (IPW). Two approaches were explored: (1) Reasoning-based LLM prediction: source-available reasoning LLMs (DeepSeek-R1, QwQ) were prompted to simulate step-by-step clinical reasoning using raw clinical notes. (2) Ensemble prediction: to minimize computational requirements at deployment, we tested several lightweight LLMs using different ensembling techniques, such as sampling with various decoding parameters (min-p, top-k, top-p) and LLMs, with samples aggregated using different strategies. We developed both methods using a development cohort of 200 patients (1:1 cases to controls) in Southwestern Ontario and subsequently tested them with a cohort of 750 patients (1:5) in Toronto. Look-ahead windows of five years were evaluated with a one-year exclusion period preceding diagnosis to focus on future risk and exclude patients undergoing a diagnostic work-up. Results: The median (range) number of characters per note was 390 (20-8000), and the median number of notes per patient was 20. In the reasoning-based approach, the best-performing model in the development cohort achieved an area under the receiver operating characteristic curve (AUROC) of 0.77 (95% CI: 0.70–0.84) for predicting pancreatic cancer in the five years after each clinical note in the test cohort. 
Lightweight models in ensembles exhibited variable performance, depending on the strategy used. When sampling a single model with different decoding parameters, performance reached an AUROC of 0.70. Ensembling multiple models and selecting the minimal predicted score across samples yielded an AUROC of 0.75. Using the most frequently predicted score across models with different decoding parameters improved the AUROC to 0.77. A simulated screening strategy that selected the top 0.5% highest-risk individuals resulted in a relative risk of 28.1×, a specificity of 0.991, a sensitivity of 0.192, a positive predictive value of 0.025, and a negative predictive value of 0.999. Conclusions: LLMs can predict pancreatic cancer risk directly from clinical notes years before diagnosis, without structured inputs or pre-processing. This approach provides a scalable, generalizable, and interpretable framework for future risk prediction, potentially supporting novel population-based approaches to pancreatic cancer screening. Citation Format: Daniel Mau, Karl Everett, Ning Liu, Jason Chai-Onn, Liisa Jaakkimainen, Anna Dodd, Spring Holter, Steven Gallinger, Rahul G. Krishnan, Kelvin Chan, Robert Grant. Predicting pancreatic cancer risk from clinical notes using large language models [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Advances in Pancreatic Cancer Research—Emerging Science Driving Transformative Solutions; 2025 Sep 28-Oct 1; Boston, MA. Philadelphia (PA): AACR; Cancer Res 2025;85(18_Suppl_3):Abstract nr B073.
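The two sample-aggregation strategies reported above (taking the minimal predicted score, and taking the most frequently predicted score across decoding settings) reduce to simple functions over the list of sampled risk scores. The sketch below is a minimal, assumed implementation of those two aggregators; the function names are illustrative, not from the abstract.

```python
from collections import Counter

def aggregate_min(scores):
    """Conservative aggregation: keep the minimal predicted risk across
    samples, flagging a patient only when every sample agrees on risk."""
    return min(scores)

def aggregate_mode(scores):
    """Majority aggregation: the most frequently predicted score across
    samples drawn with different decoding parameters (min-p, top-k, top-p)."""
    return Counter(scores).most_common(1)[0][0]
```

In the study, the mode-style aggregation gave the best AUROC (0.77), matching the heavier reasoning-based approach, which suggests that agreement across cheap samples can recover much of the signal of a single expensive model.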

  • Research Article
  • 10.1109/jbhi.2025.3561214
MIFlu: Large Language Model-Based Multimodal Influenza Forecasting Scheme.
  • Oct 1, 2025
  • IEEE journal of biomedical and health informatics
  • Jaeuk Moon + 3 more

In order to minimize the impact of influenza on public health, accurate early forecasting is essential. Various deep-learning-based models have been proposed to predict future influenza occurrences by capturing temporal/regional patterns from past occurrence time-series data. However, the prediction performance of these unimodal approaches is limited because they extract knowledge only from the collected data, and users cannot feed contextual information and domain knowledge into them. Recently, large language models (LLMs) have demonstrated the potential to improve prediction accuracy by linking contextual text information to time-series predictions. In this paper, we propose MIFlu, a multimodal influenza forecasting scheme that fuses contextual text information with time-series influenza occurrences using two LLMs. It first extracts text embeddings from the user's text prompts containing contextual information using a text-embedding LLM. Then, MIFlu fuses the text embeddings with time-series embeddings and uses the fused embeddings to predict future occurrences via a forecasting LLM. In extensive experiments on public national/regional influenza datasets, MIFlu outperforms other predictive models, improving prediction performance by up to 26.2% over state-of-the-art models. We also analyze the effect of various textual input embedders, hyperparameters, and the amount of training data on forecasting accuracy.
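The core multimodal step in MIFlu is fusing a text embedding with a time-series embedding before forecasting. The abstract does not state the fusion operator, so the sketch below shows two common, minimal options (concatenation, and a weighted element-wise sum when dimensions match); the function name and the `alpha` weight are assumptions for illustration only.

```python
def fuse_embeddings(text_emb, ts_emb, alpha=0.5):
    """Fuse a text embedding with a time-series embedding.

    If the two vectors have the same dimension, blend them with a weighted
    element-wise sum (alpha weights the text modality); otherwise fall back
    to concatenation, which preserves both modalities unchanged.
    """
    if len(text_emb) == len(ts_emb):
        return [alpha * a + (1 - alpha) * b for a, b in zip(text_emb, ts_emb)]
    return list(text_emb) + list(ts_emb)
```

Concatenation leaves the downstream forecasting model free to learn its own modality weighting, while the weighted sum keeps the fused vector's dimension fixed, a trade-off any multimodal pipeline of this shape has to make.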
