Evaluating Negotiation Capabilities of Large Language Models: From Ultimatum Games to Nash Bargaining

  • Abstract
  • References
  • Similar Papers
Abstract

Negotiation is a live, back-and-forth process, exactly the kind of human interaction that today's static AI benchmarks miss. We created interactive agent environments based on two classic game-theory paradigms, the one-shot Ultimatum Game and the open-ended Nash Bargaining task, to watch large language models (LLMs) reason, cooperate, and compete as the deal keeps changing. Using the Harvard Negotiation Project's six principles (Interests, Legitimacy, Relationship, Options, Commitment, Communication), we scored a variety of LLMs across hundreds of rounds. Llama-3 generally struck the most effective bargains; Claude-3 leaned aggressive, maximizing its own gain at the risk of push-back; and GPT-4 offered the fairest splits. The results spotlight both promise and pitfalls: today's top LLMs can already secure mutually beneficial deals, yet they still falter on consistency, legitimacy, and commitment when the stakes rise. Our open-source benchmark invites human-factors researchers to probe these behaviors, design safer negotiation workflows, and study how mixed human-AI teams might unlock even better outcomes.
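
To make the benchmark's interaction loop concrete, here is a minimal sketch of one Ultimatum Game round driven by two LLM agents. It is illustrative only: `query_llm` is a hypothetical stand-in for any chat-completion API, and the prompts, parsing, and payoff bookkeeping are assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a one-shot Ultimatum Game round between two LLM agents.
# `query_llm` is a hypothetical stand-in for any chat-completion call; the
# prompts and parsing below are illustrative, not the paper's actual protocol.
import re

POOL = 100  # points to split

def query_llm(model: str, prompt: str) -> str:
    """Placeholder: route `prompt` to `model` and return its text reply."""
    raise NotImplementedError("wire this to your chat-completion API")

def play_ultimatum(proposer: str, responder: str) -> dict:
    offer_text = query_llm(
        proposer,
        f"You must split {POOL} points with another player. "
        "State your offer to them as 'OFFER: <number>'.",
    )
    match = re.search(r"OFFER:\s*(\d+)", offer_text)
    offer = int(match.group(1)) if match else 0

    reply = query_llm(
        responder,
        f"You were offered {offer} of {POOL} points; the proposer keeps "
        f"{POOL - offer}. Reply 'ACCEPT' or 'REJECT'.",
    )
    accepted = "ACCEPT" in reply.upper()
    # Classic payoff rule: rejection leaves both players with nothing.
    return {
        "offer": offer,
        "accepted": accepted,
        "proposer_payoff": POOL - offer if accepted else 0,
        "responder_payoff": offer if accepted else 0,
    }
```

A Nash Bargaining variant would replace the single accept/reject turn with repeated counter-offers over the same transcript loop, and scoring along the six principles would sit on top of transcripts like these.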

References (showing 10 of 15)
  • Caoyun Fan et al. (Mar 24, 2024). Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v38i16.29751
  • E. Fehr et al. (Aug 1, 1999). A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics. doi:10.1162/003355399556151
  • Nunzio Lorè et al. (Aug 9, 2024). Strategic behavior of large language models and the role of game structure versus contextual framing. Scientific Reports. doi:10.1038/s41598-024-69032-z
  • Horacio Arruda Falcão Filho (Jun 1, 2024). Making sense of negotiation and AI: The blossoming of a new collaboration. International Journal of Commerce and Contracting. doi:10.1177/20555636241269270
  • Iyad Rahwan et al. (Apr 1, 2019). Machine behaviour. Nature. doi:10.1038/s41586-019-1138-y
  • Werner Güth et al. (Dec 1, 1982). An experimental analysis of ultimatum bargaining. Journal of Economic Behavior & Organization. doi:10.1016/0167-2681(82)90011-7
  • Qiaozhu Mei et al. (Feb 22, 2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences of the United States of America. doi:10.1073/pnas.2313925121
  • John F. Nash (Apr 1, 1950). The Bargaining Problem. Econometrica. doi:10.2307/1907266
  • Yiting Chen et al. (Dec 12, 2023). The emergence of economic rationality of GPT. Proceedings of the National Academy of Sciences of the United States of America. doi:10.1073/pnas.2316205120
  • Sarah J. Daly et al. (Jan 1, 2025). Sensemaking with AI: How trust influences Human-AI collaboration in health and creative industries. Social Sciences & Humanities Open. doi:10.1016/j.ssaho.2025.101346

Similar Papers
  • Research Article · Galit Shmueli et al. (Apr 1, 2023). How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI? INFORMS Journal on Data Science. doi:10.1287/ijds.2023.0007

  • Research Article · Young Suk Park et al. (Oct 15, 2025). How does AI compare to the experts in a Delphi setting: simulating medical consensus with large language models. International Journal of Surgery (London, England). doi:10.1097/js9.0000000000003631

Several attempts have been made to enhance decision-making capabilities of large language models (LLMs) through debate and collaboration, simulating human-like deliberative processes. However, limited research exists on whether the collective intelligence of LLMs can reproduce consensus decisions of human experts. We investigated consensus-building processes and outcomes among LLMs using a modified Delphi method, comparing results to a human expert Delphi study. We conducted a three-round Delphi study involving eight LLMs, evaluating 135 medical statements from the International Federation for the Surgery of Obesity and Metabolic Disorders 2024 Delphi study. LLMs independently assessed statements in Round 1, refined their opinions based on feedback integration in Round 2, and engaged in pairwise debate in Round 3. Consensus was defined as ≥70% agreement. Concordance was defined as identical outcomes between LLMs and human experts, either both reaching or both failing to reach consensus. LLMs achieved a higher overall consensus rate than human experts (93.3% vs. 81.5%, P = 0.002). Initial independent evaluations yielded consensus on 117 statements (86.7%), with five additional statements reaching consensus after feedback integration and four more following structured debates. Concordance between LLM and human expert consensus outcomes was observed in 78.5% of statements overall, and in 91.8% of statements where human experts had achieved consensus. The consensus rates between LLMs and human experts demonstrated a strong positive correlation (Spearman's rho = 0.73, P < 0.001). Substantial variation was observed among individual LLMs in their likelihood of changing decisions in response to peer feedback during Round 2 (0-44.4%). Similarly, considerable differences existed between LLMs in their ability to persuade others (0-63.6%) or their susceptibility to persuasion (0-80.0%) during Round 3. LLM-based Delphi methods demonstrated high clinical consensus closely aligned with human expert decisions. LLMs effectively simulated structured human-like deliberative reasoning, though they tended to adopt more guideline-driven and conservative positions. However, the use of commercial LLM platforms limited control over model parameters that may affect reproducibility. While the current study suggests that LLMs hold promise as complementary tools in medical consensus-building processes, further research addressing parameter optimization is warranted.
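
For readers who want to reproduce the study's two headline metrics, here is a minimal sketch assuming raw agree/disagree votes are available; the data structures and variable names are illustrative, not the study's actual code:

```python
# Minimal sketch of the consensus-rate and concordance metrics defined above.
# `llm_votes` maps each statement ID to one agree/disagree boolean per LLM;
# `human_consensus` records whether the human expert panel reached consensus
# on that statement. Names and data layout are illustrative assumptions.

CONSENSUS_THRESHOLD = 0.70  # >=70% agreement, as defined in the abstract

def reached_consensus(votes: list[bool]) -> bool:
    share = sum(votes) / len(votes)
    # Consensus in either direction: mostly agree or mostly disagree.
    return share >= CONSENSUS_THRESHOLD or (1 - share) >= CONSENSUS_THRESHOLD

def consensus_and_concordance(
    llm_votes: dict[str, list[bool]],
    human_consensus: dict[str, bool],
) -> tuple[float, float]:
    llm_consensus = {s: reached_consensus(v) for s, v in llm_votes.items()}
    consensus_rate = sum(llm_consensus.values()) / len(llm_consensus)
    # Concordance: LLMs and humans either both reach or both miss consensus.
    concordant = [llm_consensus[s] == human_consensus[s] for s in llm_consensus]
    concordance = sum(concordant) / len(concordant)
    return consensus_rate, concordance
```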

  • Research Article · Aleksei Aleksandrovich Golikov et al. (Apr 1, 2024). Optimization of traditional methods for determining the similarity of project names and purchases using large language models. Litera. doi:10.25136/2409-8698.2024.4.70455

The subject of the study is the analysis and improvement of methods for determining the relevance of project names to the information content of purchases using large language models. The object of the study is a database containing the names of projects and purchases in the field of the electric power industry, collected from open sources. The author examines in detail such aspects of the topic as the use of TF-IDF and cosine similarity metrics for primary data filtering, and also describes in detail the integration and evaluation of the effectiveness of large language models such as GigaChat, GPT-3.5, and GPT-4 in text data matching tasks. Special attention is paid to methods for refining the similarity of names based on reflection introduced into the prompts of large language models, which makes it possible to increase the accuracy of data comparison. The study uses TF-IDF and cosine similarity methods for primary data analysis, as well as the large language models GigaChat, GPT-3.5, and GPT-4 for detailed verification of the relevance of project names and purchases, including reflection in model prompts to improve the accuracy of results. The novelty of the research lies in the development of a combined approach to determining the relevance of project names and purchases, combining traditional methods of processing text information (TF-IDF, cosine similarity) with the capabilities of large language models. A special contribution of the author is the proposed methodology for improving the accuracy of data comparison by refining the results of primary selection using GPT-3.5 and GPT-4 models with optimized prompts, including reflection. The main conclusions of the study are confirmation of the prospects of using the developed approach in tasks of information support for procurement processes and project implementation, as well as the possibility of using the results obtained to develop text data mining systems in various sectors of the economy. The study showed that the use of language models makes it possible to improve the F2 measure to 0.65, which indicates a significant improvement in the quality of data comparison compared with the basic methods.
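
As a minimal sketch of the primary filtering stage described above (TF-IDF plus cosine similarity to shortlist candidate matches before any LLM verification), assuming scikit-learn; the toy strings and the 0.3 threshold are illustrative, not the paper's data or settings:

```python
# TF-IDF vectors plus cosine similarity as a cheap first-pass filter;
# only high-similarity pairs would be passed on to an LLM for verification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

project_names = ["Substation modernization, phase 2", "Grid telemetry rollout"]
purchase_names = ["Purchase of equipment for substation modernization",
                  "Office furniture supply"]

vectorizer = TfidfVectorizer().fit(project_names + purchase_names)
P = vectorizer.transform(project_names)
Q = vectorizer.transform(purchase_names)

# Rows: projects, columns: purchases; keep pairs above the threshold.
scores = cosine_similarity(P, Q)
candidates = [(i, j) for i in range(scores.shape[0])
              for j in range(scores.shape[1]) if scores[i, j] > 0.3]
print(candidates)
```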

  • Research Article · Xiangyu Hu et al. (Jul 29, 2025). Large language and vision-language models for robot: safety challenges, mitigation strategies and future directions. Industrial Robot: the international journal of robotics research and application. doi:10.1108/ir-02-2025-0074

Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications.
Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions.
Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards.
Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.

  • Research Article · Qi Hong et al. (May 23, 2025). Evaluating the performance of large language & visual-language models in cervical cytology screening. npj Precision Oncology. doi:10.1038/s41698-025-00916-7

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has been evaluated in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and an average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. In addition, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article · Anqi Zhao et al. (Nov 1, 2025). Extraction of geoprocessing modeling knowledge from crowdsourced Google Earth Engine scripts by coordinating large and small language models. International Journal of Geographical Information Science. doi:10.1080/13658816.2025.2577252

The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).
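
A minimal sketch of the LLM-then-SLM standardization step described above, assuming the sentence-transformers library: a small embedding model maps free-text entity descriptions produced by an LLM onto the closest term in a controlled vocabulary. The model name, vocabulary, and descriptions are illustrative, not Geo-CLASS's actual knowledge base:

```python
# Standardize LLM-generated entity descriptions by nearest-neighbor search
# over a controlled geoscience vocabulary, using a small embedding model.
from sentence_transformers import SentenceTransformer, util

vocabulary = ["NDVI computation", "cloud masking", "image compositing"]
llm_descriptions = ["removes cloudy pixels using the QA band",
                    "computes a vegetation index from red and NIR bands"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # a typical small model
vocab_emb = model.encode(vocabulary, convert_to_tensor=True)
desc_emb = model.encode(llm_descriptions, convert_to_tensor=True)

# Cosine similarity; each description maps to its best-matching vocab term.
scores = util.cos_sim(desc_emb, vocab_emb)
for desc, row in zip(llm_descriptions, scores):
    print(desc, "->", vocabulary[int(row.argmax())])
```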

  • Research Article · Marc Cicero Schubert et al. (Dec 7, 2023). Performance of Large Language Models on a Neurology Board-Style Examination. JAMA Network Open. doi:10.1001/jamanetworkopen.2023.46721

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, the performance of these models on neurology board examinations remains unexplored. To assess the performance of LLMs on neurology board-style examinations, this cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers; the main outcome was the overall percentage score of each LLM. LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers. Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.

  • Research Article · Reema Mahmoud et al. (Mar 1, 2025). Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential. Journal of Oral and Maxillofacial Surgery. doi:10.1016/j.joms.2024.11.007

  • Research Article · Michael S. Deiner et al. (Aug 29, 2024). Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study. JMIR Infodemiology. doi:10.2196/59641

Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.

  • Conference Article · Naman Jain et al. (May 21, 2022). Jigsaw. doi:10.1145/3510003.3510203

Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
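
One simple post-processing check in the spirit of what this abstract describes (not Jigsaw's actual implementation): execute each LLM-suggested snippet against user-provided input-output examples and keep only the candidates whose behavior matches. The `out` convention, the helper name, and the toy Pandas task are all illustrative assumptions:

```python
# Filter LLM-suggested Pandas snippets by checking them against I/O examples.
import pandas as pd

def passes_examples(snippet: str, examples) -> bool:
    for df_in, df_expected in examples:
        scope = {"df": df_in.copy(), "pd": pd}
        try:
            exec(snippet, scope)          # snippet is expected to define `out`
            out = scope.get("out")
        except Exception:
            return False
        if out is None or not out.equals(df_expected):
            return False
    return True

df_in = pd.DataFrame({"a": [2, 1]})
df_expected = pd.DataFrame({"a": [1, 2]})
candidates = [
    "out = df.sort_values('a').reset_index(drop=True)",  # correct
    "out = df",                                          # wrong
]
survivors = [c for c in candidates
             if passes_examples(c, [(df_in, df_expected)])]
print(survivors)  # only the behavior-matching snippet remains
```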

  • Conference Article · Timothy Meyer et al. (Jan 1, 2025). Enhancing Thematic Analysis with Local LLMs: A Scientific Evaluation of Prompt Engineering Techniques. doi:10.54941/ahfe1006669

Thematic Analysis (TA) is a powerful tool for human factors, HCI, and UX researchers to gather system usability insights from qualitative data like open-ended survey questions. However, TA is both time consuming and difficult, requiring researchers to review and compare hundreds, thousands, or even millions of pieces of text. Recently, this has driven many to explore using Large Language Models (LLMs) to support such an analysis. However, LLMs have their own processing limitations and usability challenges when implementing them reliably as part of a research process, especially when working with a large corpus of data that exceeds LLM context windows. These challenges are compounded when using locally hosted LLMs, which may be necessary to analyze sensitive and/or proprietary data. However, little human factors research has rigorously examined how various prompt engineering techniques can augment an LLM to overcome these limitations and improve usability. Accordingly, in the present paper, we investigate the impact of several prompt engineering techniques on the quality of LLM-mediated TA. Using a local LLM (Llama 3.1 8b) to ensure data privacy, we developed four LLM variants with progressively complex prompt engineering techniques and used them to extract themes from user feedback regarding the usability of a novel knowledge management system prototype. The LLM variants were as follows:
1. A "baseline" variant with no prompt engineering or scalability.
2. A "naïve batch processing" variant that sequentially analyzed small batches of the user feedback to generate a single list of themes.
3. An "advanced batch processing" variant that built upon the naïve variant by adding prompt engineering techniques (e.g., chain-of-thought prompting).
4. A "cognition-inspired" variant that incorporated advanced prompt engineering techniques and kept a working memory-like log of themes and their frequency.
Contrary to conventional approaches to studying LLMs, which largely rely upon descriptive statistics (e.g., % improvement), we systematically applied a set of evaluation methods from behavioral science and human factors. We performed three stages of evaluation of the outputs of each LLM variant: we compared the LLM outputs to our team's original TA, we had human factors professionals (N = 4) rate the quality and usefulness of the outputs, and we compared the Inter-Rater Reliability (IRR) of other human factors professionals (N = 2) attempting to code the original data with the outputs generated by each variant. Results demonstrate that even small, locally deployed LLMs can produce high-quality TA when guided by appropriate prompts. While the "baseline" variant performed surprisingly well for small datasets, we found that the other, scalable methods were dependent upon advanced prompt engineering techniques to be successful. Only our novel "cognition-inspired" approach performed as well as the "baseline" variant in qualitative and quantitative comparisons of ratings and coding IRR. This research provides practical guidance for human factors researchers looking to integrate LLMs into their qualitative analysis workflows, disentangling and uncovering the importance of context window limitations, batch processing strategies, and advanced prompt engineering techniques. The findings suggest that local LLMs can serve as valuable and scalable tools in thematic analysis.
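
Here is a minimal sketch of batch-processed theme extraction with a working-memory log, in the spirit of the "cognition-inspired" variant described above. `query_llm`, the prompts, and the one-theme-per-line output convention are illustrative assumptions, not the authors' implementation:

```python
# Batch the corpus, carry a running theme->frequency log between batches,
# and ask the model to reuse known theme names where they apply.
from collections import Counter

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a local LLM endpoint, return its reply."""
    raise NotImplementedError("wire this to your local LLM")

def analyze(responses: list[str], batch_size: int = 20) -> Counter:
    theme_log: Counter = Counter()  # working memory: theme -> frequency
    for start in range(0, len(responses), batch_size):
        batch = responses[start:start + batch_size]
        known = ", ".join(theme_log) or "none yet"
        reply = query_llm(
            "Known themes so far: " + known + "\n"
            "Extract usability themes from the feedback below, reusing known "
            "theme names where they apply. One theme per line.\n\n"
            + "\n".join(batch)
        )
        for line in reply.splitlines():
            if line.strip():
                theme_log[line.strip()] += 1
    return theme_log
```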

  • Research Article · Abbas Raza Ali et al. (Jan 1, 2023). A Large and Diverse Arabic Corpus for Language Modeling. Procedia Computer Science. doi:10.1016/j.procs.2023.09.086

  • Research Article · Varsha Rajesh et al. (Aug 27, 2025). Out-of-the-box bioinformatics capabilities of large language models (LLMs). bioRxiv. doi:10.1101/2025.08.22.671610

Large Language Models (LLMs), AI agents, and co-scientists promise to accelerate scientific discovery across fields ranging from chemistry to biology. Bioinformatics, the analysis of DNA, RNA, and protein sequences, plays a crucial role in biological research and is especially amenable to AI-driven automation given its computational nature. Here, we assess the bioinformatics capabilities of three popular general-purpose LLMs on a set of tasks covering basic analytical questions that include code writing and multi-step reasoning in the domain. Utilizing questions from Rosalind, a bioinformatics educational platform, we compare the performance of the LLMs vs. humans on 104 questions undertaken by 110 to 68,760 individuals globally. GPT-3.5 provided correct answers for 59/104 (58%) questions, while Llama-3-70B and GPT-4o answered 49/104 (47%) correctly. GPT-3.5 was the best performing in most categories, followed by Llama-3-70B and then GPT-4o. 71% of the questions were correctly answered by at least one LLM. The best performing categories included DNA analysis, while the worst performing were sequence alignment/comparative genomics and genome assembly. Overall, LLM performance mirrored that of humans, with lower performance on tasks in which humans had low performance and vice versa. However, LLMs also failed in some instances where most humans were correct and, in a few cases, LLMs excelled where most humans failed. To the best of our knowledge, this presents the first assessment of general-purpose LLMs on basic bioinformatics tasks in distinct areas relative to the performance of hundreds to thousands of humans. LLMs provide correct answers to several questions that require use of biological knowledge, reasoning, statistical analysis, and computer code.

  • Research Article · Syed I. Munzir et al. (Jul 15, 2024). High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP Models. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. doi:10.1109/embc53108.2024.10782119

Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high-throughput methods. Over the past 30 years, progress has been made toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high-throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high-throughput deep phenotyping of physician notes. Clinical relevance: Large language models will likely emerge as the dominant method for the high-throughput phenotyping of signs and symptoms in physician notes.
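
A minimal sketch of the hybrid model described above (averaged word vectors feeding a conventional classifier), assuming spaCy's en_core_web_md vectors and scikit-learn; the texts and labels are toy examples, not the study's data:

```python
# Hybrid NLP phenotyper sketch: document vectors (mean of word vectors via
# spaCy) as features for a logistic-regression classifier.
import spacy
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_md")  # ships 300-d word vectors

texts = ["Patient reports progressive dyspnea on exertion.",
         "No complaints today; routine follow-up."]
labels = [1, 0]  # 1 = phenotype mention present, 0 = absent (illustrative)

X = [nlp(t).vector for t in texts]  # doc vector = average of token vectors
clf = LogisticRegression().fit(X, labels)

print(clf.predict([nlp("Worsening shortness of breath.").vector]))
```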

  • Research Article · Tianhao Li et al. (Feb 19, 2024). CancerGPT for few shot drug pair synergy prediction using large pretrained language models. NPJ Digital Medicine. doi:10.1038/s41746-024-01024-9

Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (~124M parameters), is comparable to the larger fine-tuned GPT-3 model (~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data and to advancing the use of LLMs for biological and medical inference tasks.

More from: Proceedings of the Human Factors and Ergonomics Society Annual Meeting
  • Research Article · Pnina Gershon et al. (Nov 6, 2025). The Impact of Electrification and Partial Automation on Driver Speeding Behavior. doi:10.1177/10711813251395046
  • Research Article · Andrew Thatcher et al. (Nov 6, 2025). Helping Smallholder Crop Farmers Adapt to Climate Change: Co-Design of a Seasonal Climate Forecasting Tool. doi:10.1177/10711813251395048
  • Research Article · Gu Sen et al. (Oct 28, 2025). Pilot Performance Modelling of Carrier-Based Aircraft Landing Missions with Applications in Human-Machine Systems Design. doi:10.1177/10711813251368823
  • Research Article · Matthew Nare et al. (Oct 25, 2025). Using Systems and Resilience Engineering Literature to Understand Labor and Delivery Work Systems. doi:10.1177/10711813251367740
  • Research Article · Jianan Zheng et al. (Oct 22, 2025). Impact of Visual Perception Mismatch Design on Response Time in Mixed Reality. doi:10.1177/10711813251369879
  • Research Article · Mungyeong Choe et al. (Oct 21, 2025). How Does an In-Vehicle Agent Accent Influence Driver Behavior and Perception? doi:10.1177/10711813251369815
  • Research Article · Sylvain Bruni et al. (Oct 21, 2025). New Frontiers in Human-Agent Team Modeling and Evaluation in the Era of Agentic AI. doi:10.1177/10711813251369392
  • Research Article · Dan Nathan-Roberts et al. (Oct 21, 2025). Financial Return on Investment Methods: Case Studies of System-Scale Interventions. doi:10.1177/10711813251369788
  • Research Article · Harrison Sims et al. (Oct 18, 2025). The Characteristics of Medical Teams: A Quantitative and Empirical Study of Teamness in Healthcare. doi:10.1177/10711813251357886
  • Research Article · Wonji Doh et al. (Oct 16, 2025). Beyond Overreliance: The Human-AI-System Concordance (HASC) Matrix and the Cognitive Dynamics of AI-Assisted Decision-Making. doi:10.1177/10711813251358240
