Evaluating Negotiation Capabilities of Large Language Models: From Ultimatum Games to Nash Bargaining
Negotiation is a live, back-and-forth process, exactly the kind of human interaction that today's static AI benchmarks miss. We built interactive agent environments around two classic game-theory paradigms, the one-shot Ultimatum Game and the open-ended Nash Bargaining task, to observe how large language models (LLMs) reason, cooperate, and compete as a deal evolves. Using the Harvard Negotiation Project's six principles (Interests, Legitimacy, Relationship, Options, Commitment, Communication), we scored a range of LLMs across hundreds of rounds. Llama-3 generally struck the most effective bargains; Claude-3 leaned aggressive, maximizing its own gain at the risk of push-back; GPT-4 offered the fairest splits. The results spotlight both promise and pitfalls: today's top LLMs can already secure mutually beneficial deals, yet they still falter on consistency, legitimacy, and commitment when the stakes rise. Our open-source benchmark invites human-factors researchers to probe these behaviors, design safer negotiation workflows, and study how mixed human-AI teams might unlock even better outcomes.
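To make the two paradigms concrete, the sketch below is a minimal illustration, not the paper's released benchmark: the helper names `ultimatum_round` and `nash_bargaining_split` are hypothetical, and the payoff conventions are the standard textbook ones. It simulates a single Ultimatum Game round and computes the Nash bargaining solution over a discrete pie.

```python
# Minimal, self-contained sketch of the two game-theory paradigms.
# Hypothetical helper names; not the paper's released benchmark code.

def ultimatum_round(pie: int, offer: int, accept_threshold: int) -> tuple[int, int]:
    """One-shot Ultimatum Game: the proposer offers `offer` out of `pie`;
    the responder accepts iff the offer meets their threshold, otherwise
    both players get nothing."""
    if offer >= accept_threshold:
        return pie - offer, offer  # (proposer payoff, responder payoff)
    return 0, 0


def nash_bargaining_split(pie: int, d1: float = 0.0, d2: float = 0.0) -> tuple[int, int]:
    """Nash bargaining solution over integer splits of `pie`: pick the split
    that maximizes the product of each player's gain over their disagreement
    payoff (d1, d2)."""
    best = max(range(pie + 1), key=lambda x: (x - d1) * ((pie - x) - d2))
    return best, pie - best


if __name__ == "__main__":
    # A 60/40 proposal on a pie of 100, facing a responder who rejects offers below 30.
    print(ultimatum_round(100, offer=40, accept_threshold=30))  # (60, 40)
    # With symmetric (zero) disagreement points, the Nash solution is the even split.
    print(nash_bargaining_split(100))  # (50, 50)
```

With zero disagreement payoffs the Nash solution is the even split, which gives a simple fairness baseline for reading the "fairest splits" versus "aggressive" behaviors described above.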