An Agent‑Based Simulation of Politicized Topics Using Large Language Models: Algorithmic Personalization and Polarization on Social Media
Abstract: Digital platforms now act as the primary environments for public discourse, where recommender systems shape visibility, emotion, and interpretation. This study introduces the Recommender Systems LLMs Playground (RecSysLLMsP), a simulation framework designed to examine how algorithmic personalization interacts with language generation to influence engagement and polarization. The research provides a reproducible and transparent environment for testing algorithmic effects on collective reasoning, an issue central to democratic communication. The study employs a one‑hundred‑agent simulation grounded in psychometric and demographic data from Serbian social media users. Agents interact through five stages of progressively personalized content feeds mediated by LLM‑generated posts. Quantitative metrics (engagement intensity, network modularity, and sentiment variance), together with qualitative linguistic validation, are used to assess behavioral and structural change. Results reveal that moderate personalization maximizes engagement, while full personalization reduces diversity and amplifies both structural and affective polarization (Q = 0.22 → 0.68). LLM‑based agents successfully reproduce realistic patterns of emotional contagion and ideological clustering. The implications extend to computational social science and policy: simulation‑based experimentation can inform ethical recommender design and algorithmic governance. The main limitation is the absence of genuine human cognition, so findings indicate systemic tendencies rather than behavioral predictions. Future research should integrate real‑world datasets, multilingual testing, and policy‑driven intervention modeling to further calibrate this digital “laboratory” for exploring AI‑mediated communication.
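The structural-polarization result above is reported as a rise in network modularity (Q = 0.22 → 0.68). As an illustration of what that metric measures, here is a minimal stdlib-only sketch of Newman modularity, which compares the fraction of within-community edges to the expectation under a random graph with the same degrees. The toy edge list and two-community partition are invented for the example, not taken from the simulation.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman modularity Q for an undirected graph.

    edges: list of (u, v) pairs; communities: dict node -> community id.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges fully inside each community
    degree = defaultdict(int)  # summed degree per community
    for u, v in edges:
        degree[communities[u]] += 1
        degree[communities[v]] += 1
        if communities[u] == communities[v]:
            intra[communities[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# Invented toy graph: two tight clusters joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, communities), 3))
```

A Q near 0 means the partition explains no more structure than chance, while values approaching 1 indicate tightly clustered, weakly connected communities, which is how structural polarization is operationalized here.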
- Research Article
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
- 10.1093/jamia/ocae312
- Dec 30, 2024
- Journal of the American Medical Informatics Association: JAMIA
Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) have shown remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as synthesizing BHCs from clinical notes have not been demonstrated. We introduce a novel preprocessed dataset, the MIMIC-IV-BHC, encapsulating clinical note and BHC pairs to adapt LLMs for BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of 2 general-purpose LLMs and 3 healthcare-adapted LLMs. Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to 3 open-source LLMs (Clinical-T5-Large, Llama2-13B, and FLAN-UL2) and 2 proprietary LLMs (Generative Pre-trained Transformer [GPT]-3.5 and GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with 5 clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We compare reader preferences for the original and LLM-generated summaries using Wilcoxon signed-rank tests. We further request optional qualitative feedback from clinicians to gain deeper insights into their preferences, and we present the frequency of common themes arising from these comments. The Llama2-13B fine-tuned LLM outperforms other domain-adapted models on the quantitative evaluation metrics of Bilingual Evaluation Understudy (BLEU) and Bidirectional Encoder Representations from Transformers (BERT)-Score. GPT-4 with in-context learning shows more robustness to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B.
Despite comparable quantitative metrics, the reader study reveals a significant preference for summaries generated by GPT-4 with in-context learning over both Llama2-13B fine-tuned summaries and the original summaries (P<.001), highlighting the need for qualitative clinical evaluation. We release a foundational clinically relevant dataset, the MIMIC-IV-BHC, and present an open-source benchmark of LLM performance in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. Our research effectively integrates elements of the data assimilation pipeline: our methods cover (1) integration of clinical data sources, (2) data translation, and (3) knowledge creation, while our evaluation strategy paves the way for (4) deployment.
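The quantitative comparison above relies on BLEU, which scores a generated summary by modified n-gram precision against a reference, discounted by a brevity penalty. A toy, stdlib-only sketch of the core idea follows; real evaluations use full BLEU-4 with smoothing (e.g. via sacrebleu), and the clinical-sounding token lists are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(candidate, reference):
    """Toy BLEU: unigram+bigram modified precision with a brevity penalty.

    Illustrative only; published scores use BLEU-4 with smoothing.
    """
    precisions = []
    for n in (1, 2):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if 0 in precisions:
        return 0.0
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

# Invented candidate/reference pair, not real MIMIC-IV-BHC text.
cand = "patient stable discharged home with follow up".split()
ref = "patient stable and discharged home with follow up".split()
print(round(bleu2(cand, ref), 3))
```

The clipping (`min(c, ref[g])`) is what makes the precision "modified": a candidate cannot inflate its score by repeating a matching n-gram more often than the reference contains it.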
- Research Article
- 10.2196/59641
- Aug 29, 2024
- JMIR infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large collections of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed whether multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject the null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance.
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
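The chance comparison above (whether LLMs place the human-rated top five content areas in their own top rankings more often than random) can be framed as a tail test on top-k overlap under a hypergeometric null. A stdlib sketch of that framing follows; the topic names and the assumed pool of 20 candidate topics are invented, and this is not the study's actual statistical test.

```python
from math import comb

def overlap_at_k(llm_ranking, human_top, k=5):
    """Count how many human-chosen topics appear in the LLM's top-k."""
    return len(set(llm_ranking[:k]) & set(human_top))

def p_overlap_at_least(n_topics, k, observed):
    """Hypergeometric P(overlap >= observed) when a top-k set of size k
    is drawn uniformly at random from n_topics candidates."""
    total = comb(n_topics, k)
    return sum(comb(k, i) * comb(n_topics - k, k - i)
               for i in range(observed, k + 1)) / total

# Invented topic labels for illustration only.
llm = ["safety", "mandates", "side-effects", "efficacy", "conspiracy", "travel", "cost"]
human_top5 = ["safety", "efficacy", "mandates", "side-effects", "religion"]
obs = overlap_at_k(llm, human_top5, k=5)
print(obs, round(p_overlap_at_least(20, 5, obs), 4))
```

With 4 of 5 human-chosen topics recovered out of a 20-topic pool, the chance probability is well under 1%, which is the intuition behind rejecting the null of random ranking.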
- Research Article
- 10.2196/65226
- Aug 9, 2024
- Journal of medical Internet research
The use of web-based search and social media can help identify epidemics, potentially earlier than clinical methods, and may even surface otherwise unreported outbreaks. Monitoring for eye-related epidemics, such as conjunctivitis outbreaks, can facilitate early public health intervention to reduce transmission and ocular comorbidities. However, monitoring social media content for conjunctivitis outbreaks is costly and laborious. Large language models (LLMs) could overcome these barriers by assessing the likelihood that real-world outbreaks are being described. However, public health actions for likely outbreaks could benefit more from knowing additional epidemiological characteristics, such as outbreak type, size, and severity. We aimed to assess whether and how well LLMs can classify epidemiological features from social media posts beyond conjunctivitis outbreak probability, including outbreak type, size, severity, etiology, and community setting. We used a validation framework comparing LLM classifications to those of other LLMs and human experts. We wrote code to generate synthetic conjunctivitis outbreak social media posts, embedded with specific preclassified epidemiological features to simulate various infectious eye disease outbreak and control scenarios. We used these posts to develop effective LLM prompts and test the capabilities of multiple LLMs. For top-performing LLMs, we gauged their practical utility in real-world epidemiological surveillance by comparing their assessments of Twitter/X, forum, and YouTube conjunctivitis posts. Finally, human raters also classified the posts, and we compared their classifications to those of a leading LLM for validation. Comparisons used correlation statistics or sensitivity and specificity. We assessed 7 LLMs for effectively classifying epidemiological data from 1152 synthetic posts, 370 Twitter/X posts, 290 forum posts, and 956 YouTube posts.
Despite some discrepancies, the LLMs demonstrated a reliable capacity for nuanced epidemiological analysis across various data sources, whether compared with human raters or with one another. Notably, GPT-4 and Mixtral 8x22b exhibited high performance, predicting conjunctivitis outbreak characteristics such as probability (GPT-4: correlation=0.73), size (Mixtral 8x22b: correlation=0.82), and type (infectious, allergic, or environmentally caused); however, there were notable exceptions. Assessing synthetic and real-world posts for etiological factors, infectious eye disease specialist validations revealed that GPT-4 had high specificity (0.83-1.00) but variable sensitivity (0.32-0.71). Interrater reliability analyses showed that LLM-expert agreement exceeded expert-expert agreement for severity assessment (intraclass correlation coefficient=0.69 vs 0.38), while agreement varied by condition type (κ=0.37-0.94). This investigation into the potential of LLMs for public health infoveillance suggests effectiveness in classifying key epidemiological characteristics from social media content about conjunctivitis outbreaks. Future studies should further explore LLMs' potential to support public health monitoring through the automated assessment and classification of potential infectious eye disease or other outbreaks. Their optimal role may be to act as a first line of documentation, alerting public health organizations for the follow-up of LLM-detected and -classified small, early outbreaks, with a focus on the most severe ones.
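The validation statistics quoted above (sensitivity, specificity, and chance-corrected agreement) are straightforward to compute from paired labels. A stdlib sketch with invented expert and LLM labels follows; the numbers do not reproduce the study's values.

```python
def sens_spec(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = outbreak described)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n                      # observed
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # expected
    return (po - pe) / (1 - pe)

# Invented expert vs. LLM labels for 10 posts, for illustration only.
expert = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
llm    = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
print(sens_spec(expert, llm), round(cohen_kappa(expert, llm), 2))
```

The intraclass correlation used for severity is a different (continuous-rating) statistic, but κ as computed here is the agreement measure behind the κ=0.37-0.94 range reported by condition type.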
- Research Article
- 10.1093/jamia/ocaf023
- Mar 10, 2025
- Journal of the American Medical Informatics Association: JAMIA
Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation. We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects, such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness, to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses an LLM to generate patient message replies. The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians' manual decision to use AI-generated drafts correlates strongly with quantitative metrics, suggesting that quantitative metrics have the potential to reduce human effort in the evaluation process and make it more scalable. Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance.
Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. While automated quantitative evaluations are not ready to fully replace qualitative human evaluations, they can be used to enhance the process and, with relevant benchmarks derived from the unified framework proposed here, they can be applied to LLM monitoring and evaluation of updated versions of the original technology evaluated using qualitative human standards.
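The claim that clinicians' accept/discard decisions correlate strongly with automated metrics reduces to a correlation between a per-draft metric and a binary usage decision (a point-biserial correlation, i.e. Pearson r with one binary variable). A stdlib sketch with invented scores; the metric name and numbers are illustrative, not from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between an automated metric and paired decisions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example: per-draft automated coherence scores vs. whether the
# clinician chose to use the AI draft (1) or discard it (0).
coherence = [0.9, 0.8, 0.85, 0.4, 0.3, 0.55, 0.7, 0.2]
used      = [1,   1,   1,    0,   0,   1,    1,   0]
print(round(pearson_r(coherence, used), 2))
```

A strong r of this kind is what would justify using the automated metric as a scalable proxy for the manual accept/discard judgment.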
- Research Article
- 10.1145/3731446
- Jun 16, 2025
- ACM Transactions on Information Systems
Recommender models capture ever-changing user preferences by training with in-domain user behavior data. These models are typically lightweight, facilitating real-time and large-scale online services. However, these models often falter when tasked with providing more sophisticated functionalities, such as offering explanations or engaging in conversations. Recently, large language models (LLMs) have emerged as a significant advancement towards artificial general intelligence, demonstrating impressive capabilities in instruction comprehension, reasoning, and human interaction. Unfortunately, LLMs lack an understanding of domain-specific item catalogs and behavioral patterns, especially in areas that deviate from general world knowledge, such as online e-commerce. This limitation makes them unsuitable to function as recommender models directly. In this article, we bridge the gap between recommender models and LLMs, combining their respective strengths to create an interactive recommender system. We present an efficient framework, termed InteRecAgent, which utilizes LLMs as the brain and recommender models as instrumental tools. We first outline a minimal set of essential tools required to transform LLMs into InteRecAgent. To overcome specific challenges associated with LLM-based agents for recommender systems, we enhance three core components, covering the memory mechanism, task planning, and tool learning abilities. InteRecAgent empowers traditional recommender systems, like ID-based matrix factorization models, to evolve into versatile and interactive systems with a natural language interface through the integration of LLMs. Experimental results derived from three public datasets demonstrate that InteRecAgent delivers strong performance as a conversational recommender system, surpassing general LLMs such as GPT-4.
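At its core, the LLM-as-brain design routes a dialogue turn through a small toolbox of recommender components. The sketch below is a deliberately minimal stand-in, not InteRecAgent's actual API: the tool names, catalog, and fixed retrieve-then-rank plan are all invented, and the "brain" is hard-coded where a real agent would have the LLM emit the tool plan from the conversation.

```python
from typing import Callable

# Toy "tools": stand-ins for the retrieval/ranking models wired behind the LLM.
def retrieve_candidates(query: str) -> list[str]:
    catalog = {"running": ["shoe-a", "shoe-b", "watch-c"], "camping": ["tent-x"]}
    return catalog.get(query, [])

def rank_items(items: list[str], user_history: list[str]) -> list[str]:
    # Trivial ranker: items sharing a category prefix with history come first.
    seen_prefixes = {h.split("-")[0] for h in user_history}
    return sorted(items, key=lambda i: i.split("-")[0] not in seen_prefixes)

TOOLS: dict[str, Callable] = {"retrieve": retrieve_candidates, "rank": rank_items}

def plan_and_run(user_msg: str, user_history: list[str]) -> list[str]:
    """Stand-in for the LLM 'brain': a fixed retrieve-then-rank plan.
    A real agent would plan, consult memory, and pick tools dynamically."""
    candidates = TOOLS["retrieve"](user_msg)
    return TOOLS["rank"](candidates, user_history)

print(plan_and_run("running", ["watch-old"]))
```

The division of labor is the point: the lightweight tools stay authoritative over the item catalog, while the language model only decides which tool to call with what arguments.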
- Research Article
- 10.1007/s44163-025-00334-5
- Aug 8, 2025
- Discover Artificial Intelligence
Recommender systems are now ubiquitous across the internet, from streaming services to online shopping to social media. Traditional systems operate behind the scenes, often invisible to the end user. While these systems have enjoyed prolific success, they have limitations, namely that their mechanical interactions lack contextual awareness. A promising area of research is the combination of large language models (LLMs) with traditional recommendation methods to increase flexibility and performance. We discuss prominent examples, including conversational recommender systems, LLMs as end-to-end recommenders, and LLMs as encoders for recommendation. Of particular importance is the transformer neural network architecture, which underpins these LLMs, has proven remarkably powerful in natural language processing, and has now been adapted to serve recommendation tasks. This review offers a unique perspective on the evolving role of data in recommender systems, tracing data requirements from traditional matrix-based data and knowledge-based data to the adoption of transformers with web-scale data. We detail how these shifting data paradigms have shaped the field and the integration of transformer architecture, large language models, and chatbots in modern recommender systems. This paper is intended for readers interested in the intersection of recommender systems and transformers (and LLMs), those tracking the evolution of data used in such systems, and newcomers seeking an introduction to these topics.
- Research Article
- 10.1145/3678004
- Jan 18, 2025
- ACM Transactions on Information Systems
With the rapid development of online services and web applications, recommender systems (RS) have become increasingly indispensable for mitigating information overload and matching users’ information needs by providing personalized suggestions over items. Although the RS research community has made remarkable progress over the past decades, conventional recommendation models (CRM) still have some limitations, e.g., lacking open-domain world knowledge, and difficulties in comprehending users’ underlying preferences and motivations. Meanwhile, large language models (LLM) have shown impressive general intelligence and human-like capabilities for various natural language processing (NLP) tasks, which mainly stem from their extensive open-world knowledge, logical and commonsense reasoning abilities, as well as their comprehension of human culture and society. Consequently, the emergence of LLM is inspiring the design of RS and pointing out a promising research direction, i.e., whether we can incorporate LLM and benefit from their common knowledge and capabilities to compensate for the limitations of CRM. In this article, we conduct a comprehensive survey on this research direction, and draw a bird’s-eye view from the perspective of the whole pipeline in real-world RS. Specifically, we summarize existing research works from two orthogonal aspects: where and how to adapt LLM to RS. For the “WHERE” question, we discuss the roles that LLM could play in different stages of the recommendation pipeline, i.e., feature engineering, feature encoder, scoring/ranking function, user interaction, and pipeline controller. For the “HOW” question, we investigate the training and inference strategies, resulting in two fine-grained taxonomy criteria, i.e., whether to tune LLM or not during training, and whether to involve CRM for inference. Detailed analysis and general development paths are provided for both “WHERE” and “HOW” questions, respectively.
Then, we highlight the key challenges in adapting LLM to RS from three aspects, i.e., efficiency, effectiveness, and ethics. Finally, we summarize the survey and discuss the future prospects.
- Research Article
- 10.1093/humrep/deae108.088
- Jul 3, 2024
- Human Reproduction
Study question: Can large language models be used to understand patient needs from conversational data? Summary answer: Large language models can provide significant assistance for identifying and summarizing patients' queries. What is known already: Traditionally, clinics have relied on techniques such as patient surveys, reviews, and complaints procedures in order to understand shortcomings in the patient experience. As many clinics adopt digital communication platforms with patients, they have collected a wealth of conversational data that may shed light on the patient experience. However, the volume of data in clinic chat apps is often so great that analyzing it becomes challenging. Recently, advances in large language models (LLMs) have enabled the automated analysis of text at almost human-level performance. This study therefore investigates whether LLMs can be used to extract insights from conversational data. Study design, size, duration: This study is a retrospective analysis of 132,596 messages sent by patients to fertility advisors, representing 40,853 questions asked. These conversations took place on a single centre’s patient communication app from 01/01/2021 to 09/06/2023. All patient types at all treatment stages were included. A private instance of the open-source Mistral-7B-Instruct-v0.1 LLM running on a single NVIDIA Titan X GPU was used for text analysis. Participants/materials, setting, methods: Conversations were broken down into sentences and then categorized as either questions or non-questions by the LLM. Next, the LLM categorized the individual questions, returning a category, a subcategory, and a question summary. These summaries were then embedded and clustered using the K-means algorithm (with k chosen by the elbow method). The LLM was then used to summarize the content of each cluster as five questions. Thematic analysis was then conducted by a patient experience expert.
Main results and the role of chance: In the initial phase of the study, the LLM classified 34,222 questions into 6,177 categories and 13,533 subcategories. These were subsequently consolidated into 145 distinct clusters. Each cluster, on average, comprised 215±108 (M±SD) inquiries (excluding a notably larger outlier cluster that functioned as a catch-all for roughly 3,300 inquiries that did not fit neatly into any other cluster). Examination of cluster centroids identified seven predominant themes: legal/financial (N = 1,218), general fertility advice (N = 1,430), patient administration (N = 8,734), medical tests (N = 4,820), medical procedures (N = 1,575), appointment scheduling (N = 4,320), and specific treatment/medication information (N = 9,125). Of the 145 clusters, 64 clusters comprising 12,313 inquiries were identified by the expert as highlighting points for improvement in the patient experience. These broadly encompassed operational bottlenecks/weak points, issues with patient guidance, and issues with communication; for instance, “Are there any late afternoon or evening appointment options available?” or “How long do eggs remain viable after they have been frozen?”. Overall, our use of LLMs enabled the analysis of a large number of queries that would previously have proven prohibitively expensive, time-consuming, and labor-intensive. Limitations, reasons for caution: Splitting conversations into sentences meant that context could not be taken into account and multi-message questions were hard to identify. Additionally, the interpretation of messages by LLMs and humans may not be aligned. Finally, the technical expertise required to execute this style of analysis may prove a barrier for clinics. Wider implications of the findings: We demonstrate that LLMs can be used to draw insights from the wealth of digital communications held by modern IVF clinics.
LLMs may thus enable clinics to collect data on the patient experience in a faster, more reliable way than traditional approaches such as patient surveys and complaints. Trial registration number: Not applicable.
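The clustering step above chooses k by the elbow method: run K-means for a range of k, plot inertia against k, and pick the bend where further clusters stop paying off. One common way to automate the bend-finding, sketched here with stdlib only and an invented inertia curve (a real pipeline would first compute inertias with an actual K-means implementation, e.g. scikit-learn's):

```python
import math

def elbow_k(inertias):
    """Pick k by the elbow heuristic: the point on the inertia curve
    farthest from the straight line joining its endpoints.
    inertias[i] is the K-means inertia for k = i + 1."""
    n = len(inertias)
    x1, y1, x2, y2 = 1, inertias[0], n, inertias[-1]
    denom = math.hypot(x2 - x1, y2 - y1)
    best_k, best_d = 1, -1.0
    for i, y in enumerate(inertias):
        k = i + 1
        # Perpendicular distance from (k, y) to the endpoint chord.
        d = abs((y2 - y1) * k - (x2 - x1) * y + x2 * y1 - y2 * x1) / denom
        if d > best_d:
            best_k, best_d = k, d
    return best_k

# Invented inertia curve with a clear bend at k = 3, for illustration only.
inertias = [1000, 420, 150, 120, 100, 90, 85]
print(elbow_k(inertias))
```

The perpendicular-distance ("knee") rule is just one formalization of eyeballing the bend; with 145 clusters as in the study, the curve would be far longer but the selection logic is the same.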
- Research Article
- 10.1053/j.ackd.2013.04.001
- Jun 26, 2013
- Advances in Chronic Kidney Disease
Using Digital Media to Promote Kidney Disease Education
- Research Article
- 10.1145/3721299
- Nov 21, 2025
- ACM Transactions on Recommender Systems
Recommender systems have become pivotal in today’s digital landscape, shaping user experiences across diverse online platforms. Recent advances in Large Language Models (LLMs) such as T5, GPT, LLaMA, and their variants have introduced transformative possibilities for recommender systems. LLMs excel in processing and generating natural language text, offering a unique opportunity to reshape the design and elevate the effectiveness of recommendation algorithms. This special issue explores the integration of Large Language Models and Recommender Systems, encompassing various facets including model architectures, recommendation algorithms, evaluation methods, and real-world applications. It provides a dedicated platform for researchers and practitioners to share their insights, innovations, and empirical findings in the realm of LLMs for recommender systems, promoting knowledge exchange and leading to best practices and guidelines for integrating LLMs and recommender systems. Towards this goal, the five articles in this collection span trustworthiness concerns, such as recommendation fairness and diversity with LLMs, as well as classic recommendation problems, including sequential recommendation, click-through rate prediction, and bundle recommendation with LLMs. By fostering interdisciplinary collaboration between the natural language processing and recommendation communities, the special issue aspires to advance the state of the art in this evolving field.
- Research Article
- 10.1016/j.mlwa.2024.100545
- Mar 11, 2024
- Machine Learning with Applications
The Dark Side of Language Models: Exploring the Potential of LLMs in Multimedia Disinformation Generation and Dissemination
- Research Article
- 10.1055/s-0044-1800750
- Aug 1, 2024
- Yearbook of Medical Informatics
Summary. Objectives: Large language models (LLMs) are revolutionizing the natural language processing (NLP) landscape within healthcare, prompting the need to synthesize the latest advancements and their diverse medical applications. We attempt to summarize the current state of research in this rapidly evolving space. Methods: We conducted a review of the most recent studies on biomedical NLP facilitated by LLMs, sourcing literature from PubMed, the Association for Computational Linguistics Anthology, IEEE Xplore, and Google Scholar (the latter particularly for preprints). Given the ongoing exponential growth in LLM-related publications, our survey was inherently selective. We attempted to abstract key findings in terms of (i) LLMs customized for medical texts, and (ii) the type of medical text being leveraged by LLMs, namely medical literature, electronic health records (EHRs), and social media. In addition to technical details, we touch upon topics such as privacy, bias, interpretability, and equitability. Results: We observed that while general-purpose LLMs (e.g., GPT-4) are most popular, there is a growing trend in training or customizing open-source LLMs for specific biomedical texts and tasks. Several promising open-source LLMs are currently available, and applications involving EHRs and biomedical literature are more prominent relative to noisier data sources such as social media. For supervised classification and named entity recognition tasks, traditional (encoder-only) transformer-based models still outperform new-age LLMs, and the latter are typically suited for few-shot settings and generative tasks such as summarization. There is still a paucity of research on evaluation, bias, privacy, reproducibility, and equitability of LLMs. Conclusions: LLMs have the potential to transform NLP tasks within the broader medical domain.
While technical progress continues, biomedical application-focused research must prioritize aspects not necessarily related to performance, such as task-oriented evaluation, bias, and equitable use.
- Research Article
- 10.1038/s41598-025-89965-3
- Feb 14, 2025
- Scientific Reports
Fairness in recommendation systems is crucial for ensuring equitable treatment of all users. Inspired by research on human-like behavior in large language models (LLMs), we investigate whether LLMs can serve as “fairness recognizers” in recommendation systems and explore harnessing the inherent fairness awareness in LLMs to construct fair recommendations. Using the MovieLens and LastFM datasets, we compare recommendations produced by Variational Autoencoders (VAE) with and without fairness strategies, and use ChatGLM3-6B and Llama2-13B to judge the fairness of VAE-generated results. Evaluation reveals that LLMs can indeed identify fair recommendations by recognizing the correlation between users’ sensitive attributes and their recommendation results. We then propose a method for incorporating LLMs into the recommendation process by replacing recommendations that the LLMs identify as unfair with those generated by a fair VAE. Our evaluation demonstrates that this approach improves fairness significantly with minimal loss in utility. For instance, the fairness-to-utility ratio for gender-based groups shows that VAEgan’s results are 6.0159 and 5.0658, while ChatGLM’s results achieve 30.9289 and 50.4312, respectively. These findings demonstrate that integrating LLMs’ fairness recognition capabilities leads to a more favorable trade-off between fairness and utility.
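The proposed method is essentially a gate: serve the standard VAE's list unless the LLM judge flags it as unfair, then fall back to the fairness-constrained VAE. A minimal sketch of that control flow, with toy stand-ins for all three components; the genre-diversity "judge" below is invented and far cruder than the paper's LLM-based fairness recognition.

```python
def fairness_gated_recommend(user, vae_recommend, fair_vae_recommend, llm_is_fair):
    """Serve the standard VAE list unless the judge flags it as unfair,
    in which case fall back to the fairness-constrained VAE's list.
    All three callables are stand-ins for the paper's components."""
    recs = vae_recommend(user)
    if llm_is_fair(user, recs):
        return recs
    return fair_vae_recommend(user)

# Toy stand-ins, for illustration only.
def vae_recommend(user):
    return ["action-1", "action-2", "action-3"]

def fair_vae_recommend(user):
    return ["action-1", "drama-1", "comedy-1"]

def llm_is_fair(user, recs):
    # Crude proxy judge: flag lists that are all one genre as unfair.
    genres = {r.split("-")[0] for r in recs}
    return len(genres) > 1

print(fairness_gated_recommend("u1", vae_recommend, fair_vae_recommend, llm_is_fair))
```

Because the fair VAE is only consulted when the judge objects, utility loss is confined to the flagged lists, which is the mechanism behind the favorable fairness-to-utility trade-off reported above.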