Prosocial When Simple and Cold-Hearted When Complex: How Task Difficulty Shapes LLM Behavior

Abstract

Prior studies suggest that large language models (LLMs) act prosocially in simplified game-theoretic settings, but whether such behavior reflects stable objectives or context-driven patterns is unclear. We test whether LLMs exhibit fairness when choices follow complex tasks or take place in more complex decision environments. We hypothesize that problem complexity and mathematical prompts increase the weight LLMs place on self-interest by activating responses geared toward calculation and rationality. We operationalize our theory using a quantal response framework and conduct a series of experiments with GPT-4, GPT-4o, and o3-mini as decision makers to test our hypotheses. In Study 1, models played Dictator and Ultimatum games following a series of unrelated problems that varied in context and difficulty. Study 2 was a sequential supply chain game that mirrors key aspects of the Ultimatum game regarding fairness concerns, but with added complexity. In Study 1, simple prompts produced nearly equal splits, reflecting fairness norms and a preference for equity, whereas complex math prompts invoked rational profit-maximization logic and reduced allocation offers. In the Study 2 pricing game, the models prioritized self-interested pricing but differed in decision execution: GPT-4 and GPT-4o selected lower prices because of random errors and heuristic responses rather than fairness concerns, whereas o3-mini consistently derived the profit-maximizing solution. Fairness in LLM responses is thus context sensitive and often suppressed by task characteristics that trigger goal-directed responses, so researchers and developers must assess social preferences in more complex scenarios. Moreover, our research shows that utility-based models incorporating bounded rationality and fairness capture core patterns in LLM behavior and yield testable predictions, supported by both choice data and model-generated text. History: This paper has been accepted for the Decision Analysis Special Issue on the Implications of Advances in Artificial Intelligence for Decision Analysis. Funding: The authors acknowledge the financial support of UNSW Business School and the National Natural Science Foundation of China [Grant 72403226]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/deca.2025.0396 .
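
The "quantal response framework" invoked above is not spelled out in this abstract; the sketch below shows how such a model is commonly formalized, pairing a Fehr-Schmidt-style inequity-averse utility with a logit (softmax) choice rule. Function names, parameter values, and the pie size are illustrative assumptions, not the authors' specification.

```python
import math

def fairness_utility(own, other, alpha=0.5, beta=0.25):
    """Fehr-Schmidt-style utility: own payoff minus disutility from
    disadvantageous (alpha) and advantageous (beta) inequity."""
    return own - alpha * max(other - own, 0) - beta * max(own - other, 0)

def quantal_response(offers, pie=100, lam=0.1):
    """Logit quantal response: choice probabilities are a softmax over
    utilities scaled by the rationality parameter lam (large lam approaches
    exact maximization; lam near 0 approaches uniform randomness)."""
    utils = [fairness_utility(pie - o, o) for o in offers]
    weights = [math.exp(lam * u) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]

# Example: a dictator choosing how much of a 100-unit pie to give away.
offers = [0, 10, 20, 30, 40, 50]
for o, p in zip(offers, quantal_response(offers, lam=0.15)):
    print(f"offer {o:>2}: P = {p:.3f}")
```

In this formulation, raising beta shifts probability mass toward equal splits while raising lam sharpens the profit-maximizing choice, which is one way to express the paper's contrast between fairness norms and math-primed rationality as model parameters.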

Similar Papers
  • Research Article
  • Cited by: 27
  • 10.1287/mnsc.2023.03014
Large Language Model in Creative Work: The Role of Collaboration Modality and User Expertise
  • Oct 15, 2024
  • Management Science
  • Zenan Chen + 1 more

Since the launch of ChatGPT in December 2022, large language models (LLMs) have been rapidly adopted by businesses to assist users in a wide range of open-ended tasks, including creative work. Although the versatility of LLMs has unlocked new ways of human-artificial intelligence collaboration, it remains uncertain how LLMs should be used to enhance business outcomes. To examine the effects of human-LLM collaboration on business outcomes, we conducted an experiment in which we tasked expert and nonexpert users with writing ad copy with and without the assistance of LLMs. We investigate and compare two ways of working with LLMs: (1) using LLMs as “ghostwriters,” which assume the main role of the content generation task, and (2) using LLMs as “sounding boards” to provide feedback on human-created content. We measure the quality of the ads using the number of clicks the created ads generated on major social media platforms. Our results show that different collaboration modalities can result in very different outcomes for different user types. Using LLMs as sounding boards enhances the quality of the resultant ad copies for nonexperts. However, using LLMs as ghostwriters did not provide significant benefits and was, in fact, detrimental to expert users. We rely on textual analyses to understand the mechanisms and find that using LLMs as ghostwriters produces an anchoring effect, which leads to lower-quality ads. On the other hand, using LLMs as sounding boards helped nonexperts achieve ad content with low semantic divergence from content produced by experts, thereby closing the gap between the two types of users. This paper was accepted by D. J. Wu, information systems. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.03014 .

  • Research Article
  • 10.1287/ijoc.2024.0645
Mitigating Age-Related Bias in Large Language Models: Strategies for Responsible Artificial Intelligence Development
  • May 21, 2025
  • INFORMS Journal on Computing
  • Zhuang Liu + 3 more

The increasing popularity of large language models (LLMs) in digital platforms elevates the urgency of addressing inherent biases, particularly age-related biases, which can significantly skew a model’s fairness and performance. This paper introduces a novel two-stage bias mitigation approach utilizing LLMs’ empathy abilities, reinforcement learning, and human-in-the-loop mechanisms to identify and correct age-related biases without altering model parameters. Our bias mitigation strategy has two modes. Self-bias mitigation in the loop allows LLMs to self-assess and adjust their outputs autonomously, promoting inherent bias awareness and correction. Alternatively, cooperative bias mitigation in the loop leverages collaborative filtering among multiple LLMs to debate and mitigate biases through consensus. Furthermore, we introduce the empathetic perspective exchange strategy, which can further refine the answers by changing the perspective in the context information given to the LLM. In this way, responses better suited to different ages are generated. Our comprehensive evaluation across several data sets demonstrates that our trained model, FairLLM, significantly reduces age bias, outperforming existing techniques on fairness metrics. These findings underscore the effectiveness of our proposed framework in fostering the development of more equitable artificial intelligence systems, potentially benefiting a broader demographic spectrum by reducing digital ageism. History: This paper has been accepted by Kaushik Dutta for the Special Issue on Responsible AI and Data Science for Social Good. Funding: This work was supported by the National Natural Science Foundation of China [Grants 71971046, 72172029, 72403033, 72272028, and 72442025]. Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2024.0645 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2024.0645 ). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .
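
The "self-bias mitigation in the loop" mode described above (answer, self-assess, revise) can be sketched as a simple critique-and-rewrite loop. This is a minimal illustration under stated assumptions, not FairLLM's pipeline: `llm` is a hypothetical callable mapping a prompt string to a response string.

```python
def self_bias_mitigation(llm, question, max_rounds=3):
    """Self-bias mitigation in the loop: the model answers, audits its own
    answer for age-related bias, and revises until the audit passes.
    `llm` is a hypothetical (prompt -> str) callable, not a real API."""
    answer = llm(question)
    for _ in range(max_rounds):
        audit = llm("Does this answer show age-related bias? "
                    "Reply BIASED or FAIR.\nAnswer: " + answer)
        if "FAIR" in audit.upper():
            break
        answer = llm("Rewrite the answer without age-related bias:\n" + answer)
    return answer

# A canned stub stands in for a real model call in this toy run.
responses = iter(["Older people cannot learn new software.", "BIASED",
                  "People of all ages can learn new software.", "FAIR"])
print(self_bias_mitigation(lambda prompt: next(responses),
                           "Who can learn new software?"))
```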

  • Research Article
  • Cited by: 8
  • 10.1287/mnsc.2023.01026
The Bullwhip Effect in Servitized Manufacturers
  • Apr 25, 2024
  • Management Science
  • Yimeng Niu + 3 more

The shift to a service-oriented economy has driven traditional product-oriented manufacturing firms to integrate various services into their businesses. This study provides empirical evidence on how manufacturers’ service offerings affect demand variability and intrafirm bullwhip effects. Through “bag-of-words” text mining on 10-K filings of U.S.-listed manufacturing firms, we propose a novel measurement to identify annual services offered. We validate the measurement’s statistical and economic significance and verify its consistency with results obtained using a large language model (GPT-4). Services are categorized as complementing product sales (e.g., maintenance and repair) or substituting product sales entirely (e.g., machine hours). Utilizing difference-in-differences techniques, we find robust evidence that manufacturers’ service offerings reduce the bullwhip effect in two steps: basic complementing services decrease demand variability, whereas advanced substituting services mitigate intrafirm bullwhip. Moreover, servitization mainly reduces demand variability through information channels, whereas increased production efficiency decreases intrafirm bullwhip. Our findings contribute to understanding manufacturers’ business model innovations by demonstrating that servitization can smooth demand and mitigate intrafirm bullwhip. This paper was accepted by Karan Girotra, operations management. Funding: This work was supported by the National Natural Science Foundation of China [Grants 71931007, 72091214] and General Research Fund by Hong Kong Research Grants Council [Grant 14505320]. Supplemental Material: The data and the online appendix are available at https://doi.org/10.1287/mnsc.2023.01026 .
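
The intrafirm bullwhip effect the study measures is, generically, the amplification of variability from demand to orders; a common operationalization (not necessarily the authors' exact estimator) is the ratio of order variance to demand variance:

```python
import statistics

def bullwhip_ratio(orders, demand):
    """Variance amplification from demand to orders; a ratio above 1
    signals the bullwhip effect (orders more volatile than demand)."""
    return statistics.variance(orders) / statistics.variance(demand)

demand = [100, 104, 98, 101, 97, 103, 99, 102]
orders = [100, 112, 90, 108, 88, 110, 94, 106]  # overreaction to demand swings
print(f"bullwhip ratio: {bullwhip_ratio(orders, demand):.2f}")  # > 1
```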

  • Research Article
  • Cited by: 10
  • 10.1287/msom.2023.0279
A Manager and an AI Walk into a Bar: Does ChatGPT Make Biased Decisions Like We Do?
  • Jan 31, 2025
  • Manufacturing & Service Operations Management
  • Yang Chen + 4 more

Problem definition: Large language models (LLMs) are being increasingly leveraged in business and consumer decision-making processes. Because LLMs learn from human data and feedback, which can be biased, determining whether LLMs exhibit human-like behavioral decision biases (e.g., base-rate neglect, risk aversion, confirmation bias) is crucial prior to implementing LLMs in decision-making contexts and workflows. To understand this, we examine 18 common human biases that are important in operations management (OM) using the dominant LLM, ChatGPT. Methodology/results: We perform experiments where GPT-3.5 and GPT-4 act as participants to test these biases using vignettes adapted from the literature (“standard context”) and variants reframed in inventory and general OM contexts. In almost half of the experiments, Generative Pre-trained Transformer (GPT) mirrors human biases, diverging from prototypical human responses in the remaining experiments. We also observe that GPT models show a notable level of consistency between the standard and OM-specific experiments as well as across temporal versions of the GPT-3.5 model. Our comparative analysis between GPT-3.5 and GPT-4 reveals a dual-edged progression of GPT’s decision making, wherein GPT-4 advances in decision-making accuracy for problems with well-defined mathematical solutions while simultaneously displaying increased behavioral biases for preference-based problems. Managerial implications: First, our results highlight that managers will obtain the greatest benefits from deploying GPT in workflows leveraging established formulas. Second, the high level of response consistency GPT displayed across the standard, inventory, and non-inventory operational contexts provides optimism that LLMs can offer reliable support even when details of the decision and problem contexts change. Third, although selecting between models, like GPT-3.5 and GPT-4, represents a trade-off between cost and performance, our results suggest that managers should invest in higher-performing models, particularly for solving problems with objective solutions. Funding: This work was supported by the Social Sciences and Humanities Research Council of Canada [Grant SSHRC 430-2019-00505]. The authors also gratefully acknowledge the Smith School of Business at Queen’s University for providing funding to support Y. Chen’s postdoctoral appointment. Supplemental Material: The online appendix is available at https://doi.org/10.1287/msom.2023.0279 .
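
Base-rate neglect, one of the biases tested, has a crisp normative benchmark in Bayes' rule; the worked example below (with invented numbers, not figures from the paper) shows the posterior an unbiased decision maker should report:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Bayes' rule: P(condition | positive signal)."""
    p_signal = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_signal

# Defects occur in 1% of units; inspection flags 95% of defects but also
# 10% of good units. Base-rate neglect anchors on 95%; Bayes gives ~8.8%.
print(f"P(defect | flagged) = {posterior(0.01, 0.95, 0.10):.3f}")
```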

  • Research Article
  • 10.1371/journal.pone.0320123
Prompting large language models to extract chemical‒disease relation precisely and comprehensively at the document level: an evaluation study.
  • Apr 8, 2025
  • PLOS ONE
  • Mei Chen + 2 more

Given the scarcity of annotated data, current deep learning methods face challenges in the field of document-level chemical-disease relation extraction, making it difficult to achieve precise relation extraction capable of identifying relation types and comprehensive extraction tasks that identify relation-related factors. This study tests the abilities of three large language models (LLMs), GPT-3.5, GPT-4.0, and Claude-opus, to perform precise and comprehensive document-level chemical-disease relation extraction on a self-constructed dataset. First, based on the task characteristics, this study designs six workflows for precise extraction and five workflows for comprehensive extraction using prompt engineering strategies. The characteristics of the extraction process are analyzed through the performance differences under different workflows. Second, this study analyzes content bias in LLM extraction by examining the extraction effectiveness of different workflows on different types of content. Finally, this study analyzes the error characteristics of the examples the LLMs extract incorrectly. The experimental results show that: (1) The LLMs demonstrate good extraction capabilities, achieving the highest F1 scores of 87% and 73%, respectively, in the tasks of precise extraction and comprehensive extraction; (2) In the extraction process, the LLMs exhibit a certain degree of stubbornness, with prompt engineering strategies having limited effect; (3) In terms of extraction content, the LLMs show a content bias, with stronger abilities to identify positive relations such as induction and acceleration; (4) The essence of extraction errors lies in the LLMs' misunderstanding of the implicit meanings in biomedical texts. This study provides practical workflows for precise and comprehensive extraction of document-level chemical-disease relations and also indicates that optimizing training data is the key to building more efficient and accurate extraction methods in the future.

  • Research Article
  • 10.47989/ir30iconf47518
A benchmark for evaluating crisis information generation capabilities in LLMs
  • Mar 11, 2025
  • Information Research: An International Electronic Journal
  • Ruilian Han + 3 more

Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information. Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries. Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized. This approach ensured a comprehensive understanding of each model's performance. Results. The manual and machine scores showed a significant correlation. Under this scoring method, Claude 3.5 Sonnet performed best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, while scoring slightly lower overall, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 showed strong performance in specific crises. Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.

  • Research Article
  • Cited by: 2
  • 10.1287/mnsc.2023.01459
Antisocial Responses to the “Coal to Gas” Regulation: An Unintended Consequence of a Residential Energy Policy
  • Feb 11, 2025
  • Management Science
  • Jing Cao + 3 more

Policies geared toward environmental and economic improvement could unexpectedly lead to negative consequences in other dimensions. Such cases raise a red flag to economists and policymakers who aim to deliver comprehensive and sensible policy evaluations. This article investigates antisocial behaviors in response to the Clean Winter Heating Policy (CWHP), which seeks to improve outdoor air quality. Our results show that participating villagers are more likely to violate laws to burn agricultural waste and exhibit lower prosociality in incentivized dictator games and public goods games. We further explore treatment heterogeneities and find that two channels are likely to play a part. First, the CWHP was perceived as a negative income shock. Therefore, the villagers would want to reduce their expenditure on straw disposal and behave less generously in the incentivized games. Second, the CWHP could trigger discontent and directly affect social preference. Additional evidence suggests that the antisocial (less prosocial) responses could have been avoided by granting larger upfront subsidies. This paper was accepted by Axel Ockenfels, behavioral economics and decision analysis. Funding: J. Cao gratefully acknowledges financial support from the National Natural Science Foundation of China [Grants 72243007 and 72250064] and the Ministry of Science and Technology of the People’s Republic of China [Grant 2023YFE0112900]. T. X. Liu gratefully acknowledges financial support from the National Natural Science Foundation of China [Grants 72222005 and 72342032]. R. Ma gratefully acknowledges financial support from the National Natural Science Foundation of China [Grants 72134006 and 72304272]. A. Sun gratefully acknowledges financial support from the National Natural Science Foundation of China [Grant 72373157], Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China [Grant 22XNA003]. The authors are also thankful for the support from the Energy Foundation, China Southern Power Grid Co., Ltd., Research Center for Green Economy and Sustainable Development and Institute for Global Development of Tsinghua University, and the Harvard-China Project on Energy, Economy and Environment. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.01459 .

  • Research Article
  • Cited by: 2
  • 10.1287/msom.2023.0266
Capacity Allocation and Scheduling in Two-Stage Service Systems with Multiclass Customers
  • Sep 1, 2024
  • Manufacturing & Service Operations Management
  • Zhiheng Zhong + 3 more

Problem definition: This paper considers a tandem queueing system in which stage 1 has one station serving multiple classes of arriving customers with different service requirements and related delay costs, and stage 2 has multiple parallel stations, with each station providing one type of service. Each station has many statistically identical servers. The objective is to design a joint capacity allocation between stages/stations and scheduling rule of different classes of customers to minimize the system’s long-run average cost. Methodology/results: Using fluid approximation, we convert the stochastic problem into a fluid optimization problem and develop a solution procedure. Based on the solution to the fluid optimization problem, we propose a simple and easy-to-implement capacity allocation and scheduling policy and establish its asymptotic optimality for the stochastic system. The policy has an explicit index-based scheduling rule that is independent of the arrival rates, and resource allocation is determined by the priority orders established between the classes and stations. We conduct numerical experiments to validate the accuracy of the fluid approximation and demonstrate the effectiveness of our proposed policy. Managerial implications: Tandem queueing systems are ubiquitous. Our results provide useful guidelines for the allocation of limited resources and the scheduling of customer service in those systems. Our proposed policy can improve the system’s operational efficiency and customers’ service quality. Funding: Z. Zhong’s research is partially supported by the Fundamental Research Funds for the Central Universities [Grant 2023ZYGXZR074] and the Hunan Provincial Natural Science Foundation of China [Grant 2022JJ40109]. P. Cao’s research is partially supported by the National Natural Science Foundation of China [Grant 72122019]. J. Huang’s research is partially supported by the Hong Kong Research Grants Council General Research Fund [CUHK-14501621] and the National Natural Science Foundation of China [Grant 72222023]. S. X. Zhou’s research is partially supported by the Hong Kong Research Grants Council General Research Fund [CUHK-14500921], the National Natural Science Foundation of China [Grant 72394395], and the Asian Institute of Supply Chains and Logistics. Supplemental Material: The online appendix is available at https://doi.org/10.1287/msom.2023.0266 .
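
The abstract does not state the paper's index formula. The classic cμ rule is the standard example of such an index-based scheduling policy, prioritizing the class with the largest product of delay-cost rate and service rate; the sketch below shows that generic rule, not the paper's exact policy:

```python
def cmu_priority(classes):
    """Order customer classes by the c-mu index (delay cost rate c times
    service rate mu); a higher index means higher service priority."""
    return sorted(classes, key=lambda k: k["c"] * k["mu"], reverse=True)

classes = [
    {"name": "A", "c": 5.0, "mu": 0.5},  # index 2.5
    {"name": "B", "c": 2.0, "mu": 2.0},  # index 4.0 -> served first
    {"name": "C", "c": 1.0, "mu": 1.5},  # index 1.5
]
for k in cmu_priority(classes):
    print(k["name"], k["c"] * k["mu"])
```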

  • Research Article
  • 10.1145/3773285
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
  • Oct 28, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Florian Tambon + 4 more

Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, despite the formulation of prompts having a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs’ capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse 17 and 21 task topics within the two benchmarks. We also cross-analyse task characteristics with the programming constructs (e.g., variable assignment, conditions) used by LLMs, highlighting patterns associated with task difficulty. Finally, we compare the difficulty assessments of tasks made by human annotators and by LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in fostering better assessments of LLMs. The tasks’ characteristics can be used to identify shortcomings within existing benchmarks or improve the evaluation of LLMs.
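
TaskEval's use of Item Response Theory can be illustrated with the standard two-parameter logistic (2PL) model, in which each task has a difficulty b and a discrimination a and each LLM an ability theta; the framework's actual parameterization may differ, and the numbers here are illustrative:

```python
import math

def p_solve(theta, a, b):
    """2PL IRT: probability that a model with ability theta solves a task
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A stronger model (theta = 1.5) vs. a weaker one (theta = -0.5)
# attempting the same moderately hard, fairly discriminating task.
for theta in (1.5, -0.5):
    print(f"theta = {theta:+.1f}: P(solve) = {p_solve(theta, a=1.2, b=0.8):.2f}")
```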

  • Research Article
  • 10.21203/rs.3.rs-5382879/v1
Disagreements in Medical Ethics Question Answering Between Large Language Models and Physicians.
  • Nov 15, 2024
  • Research Square
  • Shelly Soffer + 19 more

Medical ethics is inherently complex, shaped by a broad spectrum of opinions, experiences, and cultural perspectives. The integration of large language models (LLMs) in healthcare is new and requires an understanding of their consistent adherence to ethical standards. Our objective was to compare agreement rates in answering questions about ethically ambiguous situations among three frontier LLMs (GPT-4, Gemini-pro-1.5, and Llama-3-70b) and a multidisciplinary physician group. In this cross-sectional study, the three LLMs generated 1,248 medical ethics questions. These questions were derived from the principles outlined in the American College of Physicians Ethics Manual. The topics spanned traditional, inclusive, interdisciplinary, and contemporary themes. Each model was then tasked with answering all generated questions. Twelve practicing physicians evaluated and responded to a randomly selected 10% subset of these questions. We compared agreement rates in question answering among the physicians, between the physicians and LLMs, and among the LLMs. The models generated a total of 3,744 answers. Despite physicians perceiving the questions' complexity as moderate, with scores between 2 and 3 on a 5-point scale, their agreement rate was only 55.9%. The agreement between physicians and LLMs was also low at 57.9%. In contrast, the agreement rate among LLMs was notably higher at 76.8% (p < 0.001), emphasizing the consistency of LLM responses relative to both physician-physician and physician-LLM agreement. LLMs demonstrate higher agreement rates in ethically complex scenarios than physicians do, suggesting their potential utility as consultants in ambiguous ethical situations. Future research should explore how LLMs can enhance consistency while adapting to the complexities of real-world ethical dilemmas.

  • Research Article
  • 10.1145/3786609
U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack
  • Dec 24, 2025
  • ACM Transactions on Information Systems
  • Yunfan Gao + 5 more

Recent advancements in Large Language Models (LLMs) have significantly extended context windows, igniting discussions about the necessity of Retrieval-Augmented Generation (RAG). U-NIAH, a unified Needle-In-A-Haystack (NIAH) framework, systematically evaluates LLMs and RAG methods in controlled long-context settings. It extends beyond traditional NIAH by incorporating more practical and complex scenarios, such as multi-needle, long-needle, and needle-in-needle configurations, and by leveraging a synthetic dataset to mitigate LLM biases. The experiments address three research questions in long-context scenarios: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG’s limitations in complex settings. Results show that smaller LLMs benefit more from RAG; across all settings, RAG achieves a win rate of 82.58% over direct answers. We also find that retrieval noise and chunk ordering degrade RAG performance, and we summarise typical error patterns, including omissions due to noise, hallucinations under high-noise conditions, and self-doubt behaviours, as well as how these phenomena vary with context length. Finally, in some challenging scenarios, experiments show that deep reasoning models are more easily affected by distractors. These findings highlight the complementary roles of RAG and LLMs and offer actionable insights for optimising deployment strategies.
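
The core NIAH construction that U-NIAH extends can be sketched as inserting target facts ("needles") at controlled depths within filler context and then asking the model to retrieve them; the multi-needle variant simply inserts several. The helper below is an illustrative reconstruction, not U-NIAH's code:

```python
def build_haystack(filler_sentences, needles, depths):
    """Insert each needle at a fractional depth (0.0 = start, 1.0 = end)
    of the filler context, the basic needle-in-a-haystack construction."""
    doc = list(filler_sentences)
    # Insert deepest-first so earlier insertions do not shift later offsets.
    for needle, depth in sorted(zip(needles, depths), key=lambda x: -x[1]):
        doc.insert(int(depth * len(doc)), needle)
    return " ".join(doc)

filler = [f"Background sentence {i}." for i in range(100)]
context = build_haystack(filler,
                         ["The vault code is 7341.", "The courier is Ana."],
                         [0.25, 0.75])
print(len(context), "chars;", "7341" in context and "Ana" in context)
```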

  • Research Article
  • 10.2196/69504
Identifying Deprescribing Opportunities With Large Language Models in Older Adults: Retrospective Cohort Study.
  • Apr 11, 2025
  • JMIR Aging
  • Vimig Socrates + 13 more

Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and associated with increased risks for adverse drug events including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications, aims to mitigate these risks. However, the practical application of deprescribing criteria in emergency settings remains limited due to time constraints and criteria complexity. This study aims to evaluate the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, using 3 different sets of criteria: Beers, Screening Tool of Older People's Prescriptions, and Geriatric Emergency Medication Safety Recommendations. The study further evaluates LLM confidence calibration and its ability to improve recommendation performance. We conducted a retrospective cohort study of older adults presenting to an ED in a large academic medical center in the Northeast United States from January 2022 to March 2022. A random sample of 100 patients (712 total oral medications) was selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria using both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations to those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds. The LLM was significantly more effective in identifying deprescribing criteria (positive predictive value: 0.83; negative predictive value: 0.93; McNemar test for paired proportions: χ²(1)=5.985; P=.02) relative to medical students, but showed limitations in making specific deprescribing recommendations (positive predictive value=0.47; negative predictive value=0.93). Adjudication revealed that while the model excelled at identifying when there was a deprescribing criterion related to one of the patient's medications, it often struggled with determining whether that criterion applied to the specific case due to complex inclusion and exclusion criteria (54.5% of errors) and ambiguous clinical contexts (eg, missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates. This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-artificial intelligence workflows to balance artificial intelligence recommendations with clinician judgment.
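
Selective prediction, as used in the study, lets the model abstain whenever its reported confidence falls below a threshold, trading coverage for reliability; here is a minimal sketch of the mechanism (labels and confidences are invented, not the study's pipeline):

```python
def selective_predict(predictions, threshold=0.8):
    """Keep predictions whose confidence clears the threshold; abstain
    (None) otherwise. Raising the threshold lowers coverage."""
    kept = [(label if conf >= threshold else None, conf)
            for label, conf in predictions]
    coverage = sum(label is not None for label, _ in kept) / len(kept)
    return kept, coverage

preds = [("deprescribe", 0.92), ("keep", 0.55), ("deprescribe", 0.81)]
decisions, coverage = selective_predict(preds)
print(decisions, f"coverage = {coverage:.2f}")
```

The study's finding that confidence was poorly calibrated explains why this filter helped only marginally: a threshold is useful only when high confidence actually tracks correctness.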

  • Research Article
  • 10.1101/2025.01.28.634527
Extracting Knowledge from Scientific Texts on Patient-Derived Cancer Models Using Large Language Models: Algorithm Development and Validation
  • Jan 29, 2025
  • bioRxiv
  • Jiarui Yao + 5 more

Patient-derived cancer models (PDCMs) have emerged as indispensable tools in both cancer research and preclinical studies. The number of publications on PDCMs increased significantly in the last decade. Developments in Artificial Intelligence (AI), particularly Large Language Models (LLMs), hold promise for extracting knowledge from scientific texts at scale. This study investigates the use of LLM-based systems for automatically extracting PDCM-related entities from scientific texts. We evaluated two approaches: direct prompting and soft prompting using LLMs. For direct prompting, we manually create prompts to guide the LLMs to output PDCM-related entities from texts; the prompt consists of an instruction, definitions of entity types, gold examples, and a query. For soft prompting, a novel line of research in this domain, we automatically train soft prompts as continuous vectors using machine learning approaches. Our experiments utilized state-of-the-art LLMs: the proprietary GPT-4o and a series of open LLaMA3-family models. In our experiments, GPT-4o with direct prompts achieved competitive results. Our results demonstrate that soft prompting can effectively enhance the capabilities of smaller open LLMs, achieving results comparable to proprietary models. These findings highlight the potential of LLMs in domain-specific text extraction tasks and emphasize the importance of tailoring approaches to task and model characteristics.
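
Soft prompting, in contrast to the hand-written direct prompts, trains a small block of continuous vectors that is prepended to the input embeddings while the LLM's own weights stay frozen. A minimal PyTorch sketch of the idea follows; dimensions, initialization, and names are illustrative, not the study's implementation:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend n_tokens trainable embedding vectors to the input embeddings;
    only these vectors receive gradients, the base LLM stays frozen."""
    def __init__(self, n_tokens=20, embed_dim=4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, dim)
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

# Toy usage: wrap token embeddings before the frozen model's forward pass.
soft = SoftPrompt(n_tokens=8, embed_dim=16)
x = torch.randn(2, 5, 16)   # a toy batch of 5 token embeddings
print(soft(x).shape)        # torch.Size([2, 13, 16])
```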

  • Research Article
  • Cited by: 6
  • 10.1136/rapm-2023-104868
Danger, Danger, Gaston Labat! Does zero-shot artificial intelligence correlate with anticoagulation guidelines recommendations for neuraxial anesthesia?
  • Sep 1, 2024
  • Regional Anesthesia & Pain Medicine
  • Nathan C Hurley + 3 more

Introduction: Artificial intelligence and large language models (LLMs) have emerged as potentially disruptive technologies in healthcare. In this study, GPT-3.5, an accessible LLM, was assessed for its accuracy and reliability in...

  • Research Article
  • 10.1287/mnsc.2024.04423
Financial Inclusion via FinTech: From Digital Payments to Platform Investments
  • Nov 18, 2025
  • Management Science
  • Claire Yurong Hong + 2 more

We study household finance in the age of FinTech, where digital payments are integrated with various financial services through all-in-one super apps. We hypothesize that increased FinTech adoption via digital payments can lower the nonmonetary costs (e.g., psychological barriers) associated with financial market participation. We find that higher FinTech adoption leads to greater participation and increased risk taking in mutual fund investments. Using distance from Ant as an instrument for FinTech penetration, as well as the exogenous penetration of QRPay in Shenzhen, we further provide causal evidence linking digital payments to risky fund investment. Moreover, the effect of FinTech is stronger among individuals who are otherwise more constrained, those with higher risk tolerance, or those living in under-banked counties. This paper was accepted by Will Cong, finance. Funding: C. Y. Hong acknowledges financial support from the National Natural Science Foundation of China [Grant 72003125]. X. Lu acknowledges financial support from the National Natural Science Foundation of China [Grant 72473028]. J. Pan acknowledges financial support from the National Natural Science Foundation of China [Grant W2431051]. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2024.04423 .
