
Divergent creativity in humans and large language models.

TL;DR

This study evaluates semantic diversity in large language models (LLMs) versus humans using computational creativity measures and the Divergent Association Task, finding LLMs can surpass average humans but remain below highly creative individuals, with performance improved through prompt and parameter adjustments.

Abstract

The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs' semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. These divergence-based measures index associative thinking (the ability to access and combine remote concepts in semantic space), an established facet of creative cognition. We benchmark performance on the Divergent Association Task (DAT) and across multiple creative-writing tasks (haiku, story synopses, and flash fiction), using identical, objective scoring. We found evidence that LLMs can surpass average human performance on the DAT, and approach human creative writing abilities, yet they remain below the mean creativity scores observed among the more creative segment of human participants. Notably, even the top-performing LLMs are still largely surpassed by the aggregated top half of human participants, underscoring a ceiling that current LLMs still fail to surpass. We also systematically varied linguistic strategy prompts and temperature, observing reliable gains in semantic divergence for several models. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labor by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve LLM outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.
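To make the idea of a divergence-based measure concrete, here is a minimal sketch of one common way to score a word list: the mean pairwise cosine distance between word embeddings, scaled to a 0 to 100 range. The abstract does not specify the embedding model or the exact scoring details used in the study, so the function names and the three-dimensional toy vectors below are hypothetical and serve only as an illustration.

```python
import itertools
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def divergence_score(words, embeddings):
    """Mean pairwise cosine distance between the words' vectors, times 100.

    Words missing from `embeddings` are skipped (an assumption of this sketch).
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if len(vectors) < 2:
        raise ValueError("need at least two known words")
    pairs = list(itertools.combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings; a real setup would load pretrained word vectors.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "volcano": [0.0, 1.0, 0.0],
    "justice": [0.0, 0.0, 1.0],
}

related = divergence_score(["cat", "dog"], TOY_EMBEDDINGS)
unrelated = divergence_score(["cat", "volcano", "justice"], TOY_EMBEDDINGS)
print(f"related words:   {related:.2f}")
print(f"unrelated words: {unrelated:.2f}")
```

Under this kind of measure, a list of semantically distant words scores higher than a list of near-synonyms, which is why the same scoring function can be applied unchanged to human and LLM responses.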

Similar Papers
  • Research Article
  • 10.1200/jco.2025.43.16_suppl.e22603
The accuracy and efficiency of large language models for chart review in cancer genetics.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • James Dickerson + 6 more

e22603 Background: Constructing databases is crucial for answering clinical questions but is time-consuming and error-prone. Our institution has maintained a REDCap database of cancer genetics encounters since 2002, manually curated by research assistants. We explored automating some data entry using a HIPAA-compliant, commercially available large language model (LLM). Methods: We randomly selected 100 patients from our database since 2017; a board-certified oncologist reviewed each chart to establish a gold standard. We examined variable abstraction for (1) whether genetic testing was ordered, (2) whether genetic testing results were obtained, (3) whether a variant was identified and, if so, the (4) gene and (5) variant status (benign, uncertain significance, or pathogenic). For the LLM input, we provided every Epic note and letter from January 2017 to January 2025 from the Cancer Genetics group (n = 308) for the 100 patients. For patients with multiple notes, we took (1) concordant values from ≥ 2 notes or (2) a non-benign variant as the true LLM result. We made two API calls per note using Stanford Healthcare Secure GPT with OpenAI’s gpt-4o model. The code is available at https://github.com/MrJimb0/ASCO2025 . We calculated summary statistics for time, token use, accuracy, and sensitivity/specificity, with the oncologist chart review as the reference. Results: The LLM accurately categorized 88% of the 100 patients compared to 87% by research assistants in REDCap. LLM errors that occurred in more than one patient were from information being outside of the provided notes (n = 4), information being in an image never converted to text (n = 2), and incorrectly interpreting a familial variant as being the patients’ (n = 2). In contrast, errors in REDCap were from new results returning after the date the research assistant did data entry (n = 7) and typos (n = 5). 29% of the cohort had a pathogenic variant. 
The LLM had a sensitivity of 83% and specificity of 96% for pathogenic variant detection, compared to 76% and 100% for REDCap. The LLM processed an average of 9,801 input tokens and 372 output tokens per patient, processing each patient in approximately 24 seconds. For a research assistant, the average time was 6 minutes per patient. Assuming 2,500 patients in a year, typical for this clinic, the LLM would take 16.5 hours of work at around $72 compared to 250 hours, or $7,500 of effort, for a research assistant. Conclusions: Compared to abstraction by a research assistant, the LLM was quicker and had similar sensitivity and specificity for these five variables. We obtained these results without hyperparameter tuning, vectorization, note standardization, model retraining, or the development of a foundational model. These results suggest that commercial LLMs with limited prompt engineering and post-LLM processing can support chart review in cancer genetics, potentially reducing costs and improving the efficiency of database construction.
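The workload comparison in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the hourly rate is merely what the reported totals imply, not a number stated by the authors.

```python
# Figures quoted in the abstract.
PATIENTS_PER_YEAR = 2500
LLM_SECONDS_PER_PATIENT = 24      # reported as "approximately 24 seconds"
RA_MINUTES_PER_PATIENT = 6
LLM_TOTAL_COST_USD = 72
RA_TOTAL_COST_USD = 7500

# Total processing time per year, in hours.
llm_hours = PATIENTS_PER_YEAR * LLM_SECONDS_PER_PATIENT / 3600
ra_hours = PATIENTS_PER_YEAR * RA_MINUTES_PER_PATIENT / 60

# Quantities implied by the reported totals (derived, not stated in the abstract).
implied_ra_hourly_rate = RA_TOTAL_COST_USD / ra_hours
llm_cost_per_patient = LLM_TOTAL_COST_USD / PATIENTS_PER_YEAR

print(f"LLM hours per year: {llm_hours:.1f}")
print(f"Research assistant hours per year: {ra_hours:.0f}")
print(f"Implied research assistant rate: ${implied_ra_hourly_rate:.2f}/hour")
print(f"LLM cost per patient: ${llm_cost_per_patient:.4f}")
```

With exactly 24 seconds per patient the LLM total comes to about 16.7 hours, slightly above the 16.5 hours reported; the gap is consistent with the per-patient time having been rounded.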

  • Research Article
  • Citations: 39
  • 10.2196/64290
Laypeople's Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study.
  • Feb 13, 2025
  • Journal of Medical Internet Research
  • Tamir Mendel + 4 more

Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople's questions. We aimed to compare the frequency of use and attitudes toward LLMs and search engines as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. In total, 2002 US participants recruited on Prolific participated in the screening survey about the use of LLMs and search engines. Of them, 52% (n=1045) of the participants were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) were of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) for LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). 
In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. Search engines are the primary source of health information; yet, positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs' health output.

  • Research Article
  • Citations: 17
  • 10.1115/1.4066730
Evaluating Large Language Models for Material Selection
  • Nov 14, 2024
  • Journal of Computing and Information Science in Engineering
  • Daniele Grandi + 4 more

Material selection is a crucial step in conceptual design due to its significant impact on the functionality, aesthetics, manufacturability, and sustainability impact of the final product. This study investigates the use of large language models (LLMs) for material selection in the product design process and compares the performance of LLMs against expert choices for various design scenarios. By collecting a dataset of expert material preferences, the study provides a basis for evaluating how well LLMs can align with expert recommendations through prompt engineering and hyperparameter tuning. The divergence between LLM and expert recommendations is measured across different model configurations, prompt strategies, and temperature settings. This approach allows for a detailed analysis of factors influencing the LLMs' effectiveness in recommending materials. The results from this study highlight two failure modes: the low variance of recommendations across different design scenarios and the tendency toward overestimating material appropriateness. Parallel prompting is identified as a useful prompt-engineering method when using LLMs for material selection. The findings further suggest that, while LLMs can provide valuable assistance, their recommendations often vary significantly from those of human experts. This discrepancy underscores the need for further research into how LLMs can be better tailored to replicate expert decision-making in material selection. This work contributes to the growing body of knowledge on how LLMs can be integrated into the design process, offering insights into their current limitations and potential for future improvements.

  • Research Article
  • Citations: 34
  • 10.1038/s41598-024-76682-6
Reconciling the contrasting narratives on the environmental impact of large language models
  • Nov 1, 2024
  • Scientific Reports
  • Shaolei Ren + 3 more

The recent proliferation of large language models (LLMs) has led to divergent narratives about their environmental impacts. Some studies highlight the substantial carbon footprint of training and using LLMs, while others argue that LLMs can lead to more sustainable alternatives to current practices. We reconcile these narratives by presenting a comparative assessment of the environmental impact of LLMs vs. human labor, examining their relative efficiency across energy consumption, carbon emissions, water usage, and cost. Our findings reveal that, while LLMs have substantial environmental impacts, their relative impacts can be dramatically lower than human labor in the U.S. for the same output, with human-to-LLM ratios ranging from 40 to 150 for a typical LLM (Llama-3-70B) and from 1200 to 4400 for a lightweight LLM (Gemma-2B-it). While the human-to-LLM ratios are smaller with regard to human labor in India, these ratios are still between 3.4 and 16 for a typical LLM and between 130 and 1100 for a lightweight LLM. Despite the potential benefit of switching from humans to LLMs, economic factors may cause widespread adoption to lead to a new combination of human and LLM-driven work, rather than a simple substitution. Moreover, the growing size of LLMs may substantially increase their energy consumption and lower the human-to-LLM ratios, highlighting the need for further research to ensure the sustainability and efficiency of LLMs.

  • Research Article
  • Citations: 25
  • 10.1007/s11633-025-1546-4
Assessing and Understanding Creativity in Large Language Models
  • Apr 28, 2025
  • Machine Intelligence Research
  • Yunpu Zhao + 3 more

In the field of natural language processing, the rapid development of large language models (LLMs) has attracted increasing attention. LLMs have shown a high level of creativity in various tasks, but the methods for assessing such creativity are inadequate. Assessment of LLM creativity needs to consider differences from humans, requiring multidimensional measurement while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting the modified Torrance tests of creative thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria: fluency, flexibility, originality, and elaboration. In this context, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. In addition, this study presents a novel analysis of LLMs’ responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration. The use of prompts and role-play settings also significantly influences creativity, and the experimental results indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs’ creativity and potential applications.

  • Research Article
  • Citations: 26
  • 10.1111/2041-210x.14325
Harnessing large language models for coding, teaching and inclusion to empower research in ecology and evolution
  • May 2, 2024
  • Methods in Ecology and Evolution
  • Natalie Cooper + 4 more

Large language models (LLMs) are a type of artificial intelligence (AI) that can perform various natural language processing tasks. The adoption of LLMs has become increasingly prominent in scientific writing and analyses because of the availability of free applications such as ChatGPT. This increased use of LLMs not only raises concerns about academic integrity but also presents opportunities for the research community. Here we focus on the opportunities for using LLMs for coding in ecology and evolution. We discuss how LLMs can be used to generate, explain, comment, translate, debug, optimise and test code. We also highlight the importance of writing effective prompts and carefully evaluating the outputs of LLMs. In addition, we draft a possible road map for using such models inclusively and with integrity. LLMs can accelerate the coding process, especially for unfamiliar tasks, and free up time for higher level tasks and creative thinking while increasing efficiency and creative output. LLMs also enhance inclusion by accommodating individuals without coding skills, with limited access to education in coding, or for whom English is not their primary written or spoken language. However, code generated by LLMs is of variable quality and has issues related to mathematics, logic, non‐reproducibility and intellectual property; it can also include mistakes and approximations, especially in novel methods. We highlight the benefits of using LLMs to teach and learn coding, and advocate for guiding students in the appropriate use of AI tools for coding. Despite the ability to assign many coding tasks to LLMs, we also reaffirm the continued importance of teaching coding skills for interpreting LLM‐generated code and to develop critical thinking skills. As editors of MEE, we support—to a limited extent—the transparent, accountable and acknowledged use of LLMs and other AI tools in publications. 
If LLMs or comparable AI tools (excluding commonly used aids like spell‐checkers, Grammarly and Writefull) are used to produce the work described in a manuscript, there must be a clear statement to that effect in its Methods section, and the corresponding or senior author must take responsibility for any code (or text) generated by the AI platform.

  • Research Article
  • Citations: 3
  • 10.1038/s41598-025-20496-7
Reasoning-based LLMs surpass average human performance on medical social skills
  • Oct 17, 2025
  • Scientific Reports
  • Khalid Ibraheem Alohali + 4 more

A significant portion of medical licensing examinations assesses key social skills such as communication, ethics, and professionalism, which are vital for quality patient care. Artificial intelligence (AI) has been increasingly integrated into healthcare systems in recent years, raising concerns among regulators, providers, and patients regarding AI’s capacity to handle complex, human-centered scenarios. Previous work has shown that large language models (LLMs) like GPT-3.5 and GPT-4 perform well on social skills questions from the United States Medical Licensing Examination (USMLE). However, newer models like GPT-4o, Gemini 1.5 Pro, and o1 have been introduced, with the latter designed to mimic human thinking through “chain of thought” reasoning, unlike other LLMs that provide instantaneous answers. The impact of reasoning on LLMs’ ability to navigate scenarios requiring social skills remains unclear. Here, we evaluate five LLMs (GPT-4, GPT-4o, Gemini 1.5 Pro, o1-preview, and its full version, o1) using forty USMLE-style social skills questions from the UWORLD question bank covering several categories: communication & interpersonal skills, healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. After each LLM answered, it was subjected to an “Are you sure?” follow-up prompt to test consistency. Our results show that o1, the reasoning model, came out on top with 39 out of 40 correct final answers (97.5%). GPT-4o and Gemini 1.5 Pro (87.5%) tied in second place, followed by o1-preview (77.5%) and lastly GPT-4 (75%). All LLMs surpassed the UWORLD question bank’s 64% average. Domain-specific analysis revealed that despite having equal overall scores, GPT-4o and Gemini 1.5 Pro, developed by two different companies, had varying strengths.
GPT-4o demonstrated its greatest strengths in communication & interpersonal skills and patient safety, while Gemini 1.5 Pro achieved perfect scores in healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. Although o1-preview demonstrated strong initial performance, its inconsistency under skepticism (changing answers frequently, primarily to incorrect ones) reduced its overall ranking from second to fourth. This phenomenon was not observed in any other model, including the final o1 release, which maintained consistent, high-level performance. These findings, along with prior work, highlight the potential of LLMs to demonstrate effectiveness at answering knowledge-based social skills questions in a medical context, sometimes surpassing average human performance. As LLMs continue to grow in size and sophistication, their performance is expected to improve further. In particular, the strong performance of reasoning-based LLMs suggests that such architectures hold significant promise for advancing AI’s role in socially oriented tasks. These results demonstrate the growing potential for reasoning-based LLMs to complement and enhance clinical training, medical education, and patient care. Supplementary Information: The online version contains supplementary material available at 10.1038/s41598-025-20496-7.

  • Research Article
  • Citations: 1
  • 10.3390/app152010968
Large Language Models for Machine Learning Design Assistance: Prompt-Driven Algorithm Selection and Optimization in Diverse Supervised Learning Tasks
  • Oct 13, 2025
  • Applied Sciences
  • Fidan Kaya Gülağız

Large language models (LLMs) are playing an increasingly important role in data science applications. In this study, the performance of LLMs in generating code and designing solutions for data science tasks is systematically evaluated based on different real-world tasks from the Kaggle platform. Models from different LLM families were tested under both default settings and configurations with hyperparameter tuning (HPT) applied. In addition, the effects of few-shot prompting (FSP) and Tree of Thought (ToT) strategies on code generation were compared. Alongside technical metrics such as accuracy, F1 score, Root Mean Squared Error (RMSE), execution time, and peak memory consumption, LLM outputs were also evaluated against Kaggle user-submitted solutions, leaderboard scores, and two established AutoML frameworks (auto-sklearn and AutoGluon). The findings suggest that, with effective prompting strategies and HPT, models can deliver competitive results on certain tasks. The ability of some LLMs to suggest appropriate algorithms reveals that LLMs can be seen not only as code generators, but also as systems capable of designing machine learning (ML) solutions. This study presents a comprehensive analysis of how strategic decisions such as prompting methods, tuning approaches, and algorithm selection affect the design of LLM-based data science systems, offering insights for future hybrid human–LLM systems.

  • Conference Article
  • 10.1109/iccp68926.2025.11427136
Comparative Analysis of LSTM Models and Large Language Models for Stock Trend Forecasting with Sentiment Analysis and Google Trends
  • Oct 16, 2025
  • Răzvan-Andrei Moga

This research examines and compares the effectiveness of Long Short-Term Memory (LSTM) models and Large Language Models (LLMs) for stock trend forecasting by integrating sentiment analysis of financial news and Google Trends data. Advanced machine learning techniques are required to process large datasets and uncover complex patterns, and recent developments in LLMs present new opportunities for financial prediction. An LSTM neural network was trained on historical stock data, sentiment scores, and Google Trends data, with its performance compared against four state-of-the-art LLMs: Phi4, Llama-4 Maverick, Gemini 2.5 Flash, and DeepSeek R1, as well as a Vector AutoRegressive model. The study evaluates performance across four major stock tickers using comprehensive metrics. The LSTM model showed significantly lower error metrics compared to the VAR baseline, while LLMs demonstrated superior performance when provided with detailed prompts, achieving 0.32% sMAPE with perfect directional accuracy. The study used hyperparameter tuning, early stopping, and model checkpointing to enhance LSTM performance, while prompt engineering proved critical for LLM success. The findings suggest that both classical LSTM approaches and modern LLMs can effectively improve stock market predictions, with LLMs showing particular strength in processing multi-modal financial data.

  • Research Article
  • Citations: 8
  • 10.1136/bmjment-2025-301787
Role of large language models in mental health research: an international survey of researchers' practices and perspectives.
  • Jun 1, 2025
  • BMJ Mental Health
  • Jake Linardon + 7 more

Large language models (LLMs) offer significant potential to streamline research workflows and enhance productivity. However, limited data exist on the extent of their adoption within the mental health research community. We examined how LLMs are being used in mental health research, the types of tasks they support, barriers to their adoption and broader attitudes towards their integration. 714 mental health researchers from 42 countries and various career stages (from PhD student, to early career researcher, to Professor) completed a survey assessing LLM-related practices and perspectives. 496 (69.5%) reported using LLMs to assist with research, with 94% indicating use of ChatGPT. The most common applications were for proofreading written work (69%) and refining or generating code (49%). LLM use was more prevalent among early career researchers. Common challenges reported by users included inaccurate responses (78%), ethical concerns (48%) and biased outputs (27%). However, many users indicated that LLMs improved efficiency (73%) and output quality (44%). Reasons for non-use were concerns with ethical issues (53%) and accuracy of outputs (50%). Most agreed that they wanted more training on responsible use (77%), that researchers should be required to disclose use of LLMs in manuscripts (79%) and that they were concerned about LLMs affecting how their work is evaluated (60%). While LLM use is widespread in mental health research, key barriers and implementation challenges remain. LLMs may streamline mental health research processes, but clear guidelines are needed to support their ethical and transparent use across the research lifecycle.

  • Conference Article
  • Citations: 4
  • 10.18653/v1/2024.emnlp-main.1071
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
  • Jan 1, 2024
  • Abhishek Divekar + 1 more

It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SYNTHESIZRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SYNTHESIZRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches.

  • Research Article
  • Citations: 1
  • 10.29119/1641-3466.2024.210.39
Applying generative artificial intelligence to support invention processes: an analysis of the Systematic Inventive Thinking (SIT) methodology
  • Jan 1, 2024
  • Scientific Papers of Silesian University of Technology. Organization and Management Series
  • Paweł Wawrzała

Purpose: This paper aims to explore the integration of Systematic Inventive Thinking (SIT) methodology with Large Language Models (LLMs) to enhance innovative processes. It seeks to assess how LLMs can support analytical and creative processes in design teams and how hybrid human-LLM collaboration can contribute to more dynamic and unconventional problem-solving approaches. Design/methodology/approach: The study employs a theoretical analysis of SIT methodology and LLM capabilities, synthesizing existing literature on both topics. It proposes a framework for integrating SIT with LLMs, including structured prompt patterns for each stage of the SIT process. The approach includes a comparative analysis of human and LLM capabilities in inventive processes. Findings: Research reveals that LLMs can significantly enhance the SIT process by providing rapid information synthesis, generating diverse ideas, and systematically applying SIT principles. However, human creativity, intuition, and holistic assessment remain crucial for breakthrough innovations. The study identifies specific prompt patterns and techniques for effective human-LLM collaboration within the SIT framework. Research limitations/implications: As this is an initial theoretical framework, empirical validation through case studies or experimental research is needed to assess its practical effectiveness. Practical implications: The proposed framework offers practitioners in the fields of innovation and design a structured approach to integrating AI into their creative processes. It provides specific guidelines for the use of LLMs to enhance each stage of the SIT methodology, which could lead to more efficient and innovative outcomes. Social implications: Integration of SIT with LLMs could significantly influence public attitudes toward AI, potentially increasing its acceptance as a collaborative tool in creative and problem-solving processes.
This approach may lead to more efficient and sustainable innovation practices in various industries, potentially addressing social challenges more effectively. However, it may also raise concerns about job displacement in creative fields, necessitating a focus on reskilling and education to prepare the workforce for collaboration with AI systems. Originality/value: This paper presents a novel approach to integrating SIT methodology with state-of-the-art AI technology, offering new perspectives on increasing human creativity with machine capabilities in structured innovation processes. It contributes to the emerging field of AI-assisted design thinking and provides a foundation for further research in this area. Keywords: Systematic Inventive Thinking, Large Language Models, Innovation, Human-AI Collaboration. Category of the paper: Conceptual paper, Research paper.

  • Research Article
  • Citations: 3
  • 10.2196/67469
Large Language Models in Randomized Controlled Trials Design: Observational Study
  • Sep 3, 2025
  • Journal of Medical Internet Research
  • Liyuan Jin + 6 more

Background: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored. Objective: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability, recruitment diversity, and reduce failure rates, while maintaining clinical safety and ethical standards. Methods: We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts by comparing them to clinically validated ground truth data from ClinicalTrials.gov. We have conducted statistical analysis using natural language processing–based methods, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR), for objective scoring on corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity. Results: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%).
Natural language processing statistical analysis reported BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18 on average objective scoring of LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs in scores across all domains, indicating strong clinical alignment. Specifically, both original and LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs. Moreover, LLM-based design ranked noninferior to original designs in registered RCTs in multiple domains. In particular, LLMs enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates.ConclusionsLLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.

  • Research Article
  • Cite Count Icon 114
  • 10.1287/mnsc.2023.03014
Large Language Model in Creative Work: The Role of Collaboration Modality and User Expertise
  • Oct 15, 2024
  • Management Science
  • Zenan Chen + 1 more

Since the launch of ChatGPT in December 2022, large language models (LLMs) have been rapidly adopted by businesses to assist users in a wide range of open-ended tasks, including creative work. Although the versatility of LLM has unlocked new ways of human-artificial intelligence collaboration, it remains uncertain how LLMs should be used to enhance business outcomes. To examine the effects of human-LLM collaboration on business outcomes, we conducted an experiment where we tasked expert and nonexpert users to write an ad copy with and without the assistance of LLMs. Here, we investigate and compare two ways of working with LLMs: (1) using LLMs as “ghostwriters,” which assume the main role of the content generation task, and (2) using LLMs as “sounding boards” to provide feedback on human-created content. We measure the quality of the ads using the number of clicks generated by the created ads on major social media platforms. Our results show that different collaboration modalities can result in very different outcomes for different user types. Using LLMs as sounding boards enhances the quality of the resultant ad copies for nonexperts. However, using LLMs as ghostwriters did not provide significant benefits and is, in fact, detrimental to expert users. We rely on textual analyses to understand the mechanisms, and we learned that using LLMs as ghostwriters produces an anchoring effect, which leads to lower-quality ads. On the other hand, using LLMs as sounding boards helped nonexperts achieve ad content with low semantic divergence to content produced by experts, thereby closing the gap between the two types of users. This paper was accepted by D. J. Wu, information systems. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.03014.

  • Research Article
  • Cite Count Icon 2
  • 10.3390/app14167118
Targeted Training Data Extraction—Neighborhood Comparison-Based Membership Inference Attacks in Large Language Models
  • Aug 14, 2024
  • Applied Sciences
  • Huan Xu + 8 more

A large language model refers to a deep learning model characterized by extensive parameters and pretraining on a large-scale corpus, utilized for processing natural language text and generating high-quality text output. The increasing deployment of large language models has brought significant attention to their associated privacy and security issues. Recent experiments have demonstrated that training data can be extracted from these models due to their memory effect. Initially, research on large language model training data extraction focused primarily on non-targeted methods. However, following the introduction of targeted training data extraction by Carlini et al., prefix-based extraction methods to generate suffixes have garnered considerable interest, although current extraction precision remains low. This paper focuses on the targeted extraction of training data, employing various methods to enhance the precision and speed of the extraction process. Building on the work of Yu et al., we conduct a comprehensive analysis of the impact of different suffix generation methods on the precision of suffix generation. Additionally, we examine the quality and diversity of text generated by various suffix generation strategies. The study also applies membership inference attacks based on neighborhood comparison to the extraction of training data in large language models, conducting thorough evaluations and comparisons. The effectiveness of membership inference attacks in extracting training data from large language models is assessed, and the performance of different membership inference attacks is compared. Hyperparameter tuning is performed on multiple parameters to enhance the extraction of training data. Experimental results indicate that the proposed method significantly improves extraction precision compared to previous approaches.
