The creative musical achievement of AI systems compared to music students: A replication of the study by Schreiber et al. (2024)
Although the last two years have seen AI systems progress significantly when it comes to generating cultural products like literature, poems, or music, the jury is still out on whether the aesthetic quality of these products increases in tandem with the performance enhancements of the underlying large language models (LLMs). We replicated the study by Schreiber et al. (2024) to test whether the creative performance of selected LLMs had improved over the past two years in the musical domain. In an online rating experiment based on a melody continuation paradigm, 75 melodic continuations generated by the AI systems Qwen 2 (Version 72B Instruct), Llama 3 (Version 70B Instruct), and ChatGPT (Version 4) were compared to 23 solutions composed by humans. The aesthetic quality of the sound examples was then evaluated by N = 54 listeners (music students) using four criteria (convincing, logical and meaningful, interesting, and liking). As the first main finding, human-based creative solutions outperformed all three AI systems on all four dependent variables (large effect sizes 1.11 ≤ dz ≤ 2.51), thus confirming the finding by Schreiber et al. (2024). The second main finding revealed a mean (and meaningful) discrimination sensitivity of d’ = 1.09 for AI- and human-based solutions. We conclude that merely boosting the volume of training of the AI systems does not guarantee a corresponding improvement in the creative musical output produced under controlled conditions.
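For readers unfamiliar with the two statistics reported above, the following is a minimal sketch of how a paired-samples effect size (dz) and a signal-detection sensitivity (d') are conventionally computed. The abstract does not say how the authors computed them, so the function names, the choice of which response counts as a hit, and the ratings below are all hypothetical illustrations, not the study's data or procedure.

```python
from statistics import NormalDist, mean, stdev


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Conventional sensitivity index: z(hit rate) - z(false-alarm rate).

    Assumption: a "hit" is a human solution judged as human, and a
    "false alarm" is an AI solution judged as human.
    Rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def cohens_dz(condition_a, condition_b) -> float:
    """Paired-samples effect size: mean difference / SD of the differences."""
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical per-listener mean ratings (not taken from the study).
human_ratings = [3.0, 4.0, 5.0]
ai_ratings = [1.0, 3.0, 2.0]

print(round(cohens_dz(human_ratings, ai_ratings), 2))
print(round(d_prime(0.70, 0.30), 2))
```

A d' of zero means listeners cannot tell the two sources apart; larger values mean the sources are easier to tell apart.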
- Research Article
- 10.1111/bjet.70071
- Apr 20, 2026
- British Journal of Educational Technology
This study explores the impact of robot–LLM (Large Language Model) integration on collaborative creative writing, focusing on how embodiment and AI creativity influence various aspects of creative output. A total of 150 undergraduate students participated in a structured experimental design with five collaboration conditions: Human–Human (HH), Human–Computer with High‐Creativity LLM (HC), Human–Robot with High‐Creativity LLM (HR), Human–Robot with Low‐Creativity LLM (RL) and Human–Computer with Low‐Creativity LLM (CL). Creativity was assessed through expert ratings and computational analysis of originality, imagery, voice and semantic flow. The results revealed that while the Human–Robot (High‐Creativity LLM) condition significantly enhanced originality, Human–Human and Human–LLM (text‐based) collaborations excelled in imagery and voice. The study identified an ‘embodiment paradox’, where robot embodiment amplified creativity in high‐creativity AI conditions, yet human collaboration remained superior in stylistic expression. Mediation analysis revealed that user engagement acted as a mediator, with embodiment compensating for low‐creativity AI and amplifying the creative process with high‐creativity AI. The findings have important implications for the design of collaborative AI systems, highlighting the need for a balanced integration of embodiment and AI creativity to optimize creative outcomes. This research contributes to our understanding of how human–robot–LLM collaborations can expand creative potential in writing, offering insights for future AI applications in educational and creative industries.
Practitioner notes
What is already known about this topic?
Previous studies have explored the impact of AI in creative collaborations, with a focus on text‐based models like LLMs enhancing writing quality. Embodiment in AI systems, such as humanoid robots, has been shown to affect user engagement and emotional responses, influencing creativity. Human collaboration has traditionally been seen as superior in generating stylistic elements like imagery and voice, while AI excels in originality and idea generation.
What this paper adds?
This research demonstrates that robot–LLM collaboration significantly boosts originality, particularly when high‐creativity AI is used. The study uncovers the ‘embodiment paradox’, where embodied robots enhance creativity in high‐creativity AI conditions but human collaboration remains superior in stylistic expression. The mediation role of user engagement is explored, showing how embodiment can enhance creative outcomes when AI creativity is low and amplify them when AI creativity is high.
Implications for practice and/or policy
Educators and trainers can utilize embodied AI systems in creative tasks to increase student or participant engagement and foster more original outputs. Training programmes can be structured to leverage the strengths of both human collaboration and AI, tailoring tasks based on AI's creativity levels for optimal outcomes. Policy around the integration of AI in educational and creative settings should encourage balanced AI systems that combine embodiment and creativity for enhanced collaborative work.
- Research Article
- 10.30574/wjarr.2025.26.1.1268
- Apr 30, 2025
- World Journal of Advanced Research and Reviews
With enterprises rapidly integrating Large Language Models (LLMs) into their core operations, from customer service to finance, healthcare, and e-commerce, there is an urgent need to attend to the scalability and robustness of quality assurance (QA) pipelines. Because LLMs are probabilistic, context-sensitive, and non-deterministic, traditional QA methods fail them. In this article, we look at what organizations can do to build scalable QA frameworks that address the particular requirements and possibilities of AI systems built on LLMs. We first look at what sets LLM-specific QA apart from conventional software QA, ranging from output unpredictability to hallucination hazards and the need to ensure bias mitigation and fairness. After that, the article specifies the core components of a modern QA pipeline (automation, reproducibility, observability, and continuous integration) and shares best practices for each. The paper goes in depth into the technical architecture, data quality validation, synthetic testing strategies, and how human-in-the-loop processes can be used to provide nuanced evaluation. Real-world case studies from JPMorgan Chase, Amazon, and the healthcare industry show how leading enterprises moved quickly to deploy rigorous QA frameworks, gaining reliability from their LLMs and compliance and trust from their users. Tools and technologies for QA are discussed, ranging from open-source testing frameworks to MLOps stacks and NLP validation platforms. Finally, we examine future relationships between self-healing AI systems, autonomous QA agents, and multimodal validation pipelines in the context of adaptive intelligent QA strategies that define the enterprise AI of the future. The article discusses ideas for building responsible, scalable, enterprise-ready AI systems.
- Research Article
18
- 10.1073/pnas.2426153122
- Jun 13, 2025
- Proceedings of the National Academy of Sciences
AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased, shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
- Discussion
2
- 10.1111/ans.18720
- Oct 2, 2023
- ANZ Journal of Surgery
We are grateful for the thoughtful commentary provided by Kleebayoon and Wiwanitkit on our recent article published in the ANZ Journal of Surgery.1 While we welcome this scholarly engagement, we find it imperative to address their concerns in order to elucidate the robustness of our study's design, methodology, and findings. First, the concern regarding the sample size and scope appears to overlook our study's qualitative nature, aimed at understanding the foundational capabilities of large language models (LLMs) in a controlled medical environment. It is important to underscore that our study serves as an exploratory assessment, where a large sample size is not the primary focus. Second, our study deliberately restricts its application to controlled settings as a foundational step, with future work intending to address real-world efficacy, as explicitly stated in our conclusions. Third, although our evaluation criteria focus on readability, reliability, and consistency with clinical guidelines, these are integral components that inherently contribute to clinical accuracy and patient safety, with ethical considerations forming an underlying theme. Fourth, the critique regarding potential bias in the evaluation process seems to underestimate the diversity of expertise among our panel of evaluators, which included three plastic surgeons and two junior doctors, thereby bringing multiple perspectives to the assessment. Fifth, while the study does not compare LLM performance to human expertise, it is not designed to propose LLMs as substitutes for human clinicians but rather as supplementary tools. Lastly, we acknowledge the ethical dimensions surrounding AI deployment in clinical settings and emphasize in our study the need for ongoing human supervision and algorithmic auditing to mitigate risks and biases. 
While LLMs have existed in the academic and scientific space for some time, it was the introduction of ChatGPT that captured the public's attention and imagination.2, 3 Since then, multiple papers have been published exploring the role of ChatGPT and other LLMs in the academic and clinical setting. Initial inquiries into the utility of LLMs have included topics such as medical education, clinical management, and scientific research.2-4 As interest in LLMs continued to grow, various other models have arisen, each with its advantages and drawbacks when compared with ChatGPT. Current literature demonstrates that LLMs show significant deficits in referencing, a tangible information inaccuracy rate, and are susceptible to bias. We agree with our readers that our study only reinforces these concerns, while the true extent to which LLMs can be developed into safe, ethical and clinically viable tools remains to be seen. However, only by asking the right questions and raising the relevant issues can we further drive research and lead to a greater understanding of this field. Consequently, we believe that our study offers valuable preliminary insights into the role of LLMs in medical education and clinical assistance. We advocate for a responsible integration of AI into clinical practice that adheres to stringent ethical and safety standards, and our paper mentions the need for future studies that scrutinize these ethical concerns in greater detail. Moreover, our study underscores the importance of interdisciplinary collaborations involving clinicians, data scientists, and ethicists to ensure that AI systems are both effective and ethical.
Yi Xie: Conceptualization; project administration; writing – original draft; writing – review and editing. Ishith Seth: Conceptualization; investigation; writing – original draft; writing – review and editing. David J. Hunter-Smith: Supervision; writing – original draft; writing – review and editing. Marc A. Seifman: Supervision; writing – original draft; writing – review and editing. Warren M. Rozen: Supervision; writing – original draft; writing – review and editing.
- Research Article
8
- 10.1609/aaaiss.v3i1.31183
- May 20, 2024
- Proceedings of the AAAI Symposium Series
Large language models (LLMs) have revolutionized the way humans interact with AI systems, transforming a wide range of fields and disciplines. In this talk, I share two distinct approaches to empowering human-AI interaction using LLMs. The first explores how LLMs transform computational social science, and how human-AI collaboration can reduce costs and improve the efficiency of social science research. The second looks at social skill learning via LLMs by empowering therapists and learners with LLM-empowered feedback and deliberative practices. These two works demonstrate how human-AI collaboration via LLMs can empower individuals and foster positive change. We conclude by discussing how LLMs enable collaborative intelligence by redefining the interactions between humans and AI systems.
- Dissertation
- 10.32657/10356/184392
- Jan 1, 2025
This thesis addresses the critical challenge of developing trustworthy and reliable Natural Language Processing (NLP) systems, specifically the newly emerged Large Language Models (LLMs). As LLMs become increasingly prevalent in various domains, the need for transparent, interpretable, and controllable AI systems has never been more pressing. However, the complexity of LLMs, the compositional nature of language, and the potential for hallucinations pose significant obstacles to achieving these goals. To increase user trust in AI systems in real-life deployment, we hope to enhance the trustworthiness and reliability of LLMs without requiring model revisions or compromising performance. Motivated by this overarching goal, we delve into two main goals that enhance trustworthiness: providing user-friendly explanations of the LLM’s decisions and controlling the LLM’s behaviors. Specifically, we raise three main research questions: How can we disentangle the true reasons behind LLM decisions from the complex architecture and vast number of parameters? How can we provide user-friendly explanations for LLM generations? How can we increase LLM controllability with minimal interventions?
- Research Article
1
- 10.29119/1641-3466.2024.210.39
- Jan 1, 2024
- Scientific Papers of Silesian University of Technology. Organization and Management Series
Purpose: This paper aims to explore the integration of Systematic Inventive Thinking (SIT) methodology with Large Language Models (LLMs) to enhance innovative processes. It seeks to assess how LLMs can support analytical and creative processes in design teams and how hybrid human-LLM collaboration can contribute to more dynamic and unconventional problem-solving approaches.
Design/methodology/approach: The study employs a theoretical analysis of SIT methodology and LLM capabilities, synthesizing existing literature on both topics. It proposes a framework for integrating SIT with LLMs, including structured prompt patterns for each stage of the SIT process. The approach includes a comparative analysis of human and LLM capabilities in inventive processes.
Findings: Research reveals that LLMs can significantly enhance the SIT process by providing rapid information synthesis, generating diverse ideas, and systematically applying SIT principles. However, human creativity, intuition, and holistic assessment remain crucial for breakthrough innovations. The study identifies specific prompt patterns and techniques for effective human-LLM collaboration within the SIT framework.
Research limitations/implications: As this is an initial theoretical framework, empirical validation through case studies or experimental research is needed to assess its practical effectiveness.
Practical implications: The proposed framework offers practitioners in the fields of innovation and design a structured approach to integrating AI into their creative processes. It provides specific guidelines for the use of LLMs to enhance each stage of the SIT methodology, which could lead to more efficient and innovative outcomes.
Social implications: Integration of SIT with LLMs could significantly influence public attitudes toward AI, potentially increasing its acceptance as a collaborative tool in creative and problem-solving processes. This approach may lead to more efficient and sustainable innovation practices in various industries, potentially addressing social challenges more effectively. However, it may also raise concerns about job displacement in creative fields, necessitating a focus on reskilling and education to prepare the workforce for collaboration with AI systems.
Originality/value: This paper presents a novel approach to integrating SIT methodology with state-of-the-art AI technology, offering new perspectives on increasing human creativity with machine capabilities in structured innovation processes. It contributes to the emerging field of AI-assisted design thinking and provides a foundation for further research in this area.
Keywords: Systematic Inventive Thinking, Large Language Models, Innovation, Human-AI Collaboration.
Category of the paper: Conceptual paper, Research paper.
- Research Article
27
- 10.1162/daed_e_01897
- May 1, 2022
- Daedalus
This dialogue is from an early scene in the 2014 film Ex Machina, in which Nathan has invited Caleb to determine whether Nathan has succeeded in creating artificial intelligence.1 The achievement of powerful artificial general intelligence has long held a grip on our imagination not only for its exciting as well as worrisome possibilities, but also for its suggestion of a new, uncharted era for humanity. In opening his 2021 BBC Reith Lectures, titled "Living with Artificial Intelligence," Stuart Russell states that "the eventual emergence of general-purpose artificial intelligence [will be] the biggest event in human history."2 Over the last decade, a rapid succession of impressive results has brought wider public attention to the possibilities of powerful artificial intelligence. In machine vision, researchers demonstrated systems that could recognize objects as well as, if not better than, humans in some situations. Then came the games. Complex games of strategy have long been associated with superior intelligence, and so when AI systems beat the best human players at chess, Atari games, Go, shogi, StarCraft, and Dota, the world took notice. It was not just that AIs beat humans (although that was astounding when it first happened), but the escalating progression of how they did it: initially by learning from expert human play, then from self-play, then by teaching themselves the principles of the games from the ground up, eventually yielding single systems that could learn, play, and win at several structurally different games, hinting at the possibility of generally intelligent systems.3 Speech recognition and natural language processing have also seen rapid and headline-grabbing advances. Most impressive has been the recent emergence of large language models capable of generating human-like outputs. Progress in language is of particular significance given the role language has always played in human notions of intelligence, reasoning, and understanding.
While the advances mentioned thus far may seem abstract, those in driverless cars and robots have been more tangible given their embodied and often biomorphic forms. Demonstrations of such embodied systems exhibiting increasingly complex and autonomous behaviors in our physical world have captured public attention. Also in the headlines have been results in various branches of science in which AI and its related techniques have been used as tools to advance research from materials and environmental sciences to high energy physics and astronomy.4 A few highlights, such as the spectacular results on the fifty-year-old protein-folding problem by AlphaFold, suggest the possibility that AI could soon help tackle science's hardest problems, such as in health and the life sciences.5 While the headlines tend to feature results and demonstrations of a future to come, AI and its associated technologies are already here and pervade our daily lives more than many realize. Examples include recommendation systems, search, language translators (now covering more than one hundred languages), facial recognition, speech to text (and back), digital assistants, chatbots for customer service, fraud detection, decision support systems, energy management systems, and tools for scientific research, to name a few. In all these examples and others, AI-related techniques have become components of other software and hardware systems as methods for learning from and incorporating messy real-world inputs into inferences, predictions, and, in some cases, actions.
As director of the Future of Humanity Institute at the University of Oxford, Nick Bostrom noted back in 2006, "A lot of cutting-edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it's not labeled AI anymore."6 As the scope, use, and usefulness of these systems have grown for individual users, researchers in various fields, companies and other types of organizations, and governments, so too have concerns when the systems have not worked well (such as bias in facial recognition systems), or have been misused (as in deepfakes), or have resulted in harms to some (in predicting crime, for example), or have been associated with accidents (such as fatalities from self-driving cars).7 Dædalus last devoted a volume to the topic of artificial intelligence in 1988, with contributions from several of the founders of the field, among others. Much of that issue was concerned with questions of whether research in AI was making progress, of whether AI was at a turning point, and of its foundations, mathematical, technical, and philosophical, with much disagreement. However, in that volume there was also a recognition, or perhaps a rediscovery, of an alternative path toward AI (the connectionist learning approach and the notion of neural nets) and a burgeoning optimism for this approach's potential. Since the 1960s, the learning approach had been relegated to the fringes in favor of the symbolic formalism for representing the world, our knowledge of it, and how machines can reason about it. Yet no essay captured some of the mood at the time better than Hilary Putnam's "Much Ado About Not Very Much." Putnam questioned the Dædalus issue itself: "Why a whole issue of Dædalus? Why don't we wait until AI achieves something and then have an issue?" He concluded:
This volume of Dædalus is indeed the first since 1988 to be devoted to artificial intelligence.
This volume does not rehash the same debates; much else has happened since, mostly as a result of the success of the machine learning approach that was being rediscovered and reimagined, as discussed in the 1988 volume. This issue aims to capture where we are in AI's development and how its growing uses impact society. The themes and concerns herein are colored by my own involvement with AI. Besides the television, films, and books that I grew up with, my interest in AI began in earnest in 1989 when, as an undergraduate at the University of Zimbabwe, I undertook a research project to model and train a neural network.9 I went on to do research on AI and robotics at Oxford. Over the years, I have been involved with researchers in academia and labs developing AI systems, studying AI's impact on the economy, tracking AI's progress, and working with others in business, policy, and labor grappling with its opportunities and challenges for society.10 The authors of the twenty-five essays in this volume range from AI scientists and technologists at the frontier of many of AI's developments to social scientists at the forefront of analyzing AI's impacts on society. The volume is organized into ten sections. Half of the sections are focused on AI's development, the other half on its intersections with various aspects of society. In addition to the diversity in their topics, expertise, and vantage points, the authors bring a range of views on the possibilities, benefits, and concerns for society. I am grateful to the authors for accepting my invitation to write these essays. Before proceeding further, it may be useful to say what we mean by artificial intelligence. The headlines and increasing pervasiveness of AI and its associated technologies have led to some conflation and confusion about what exactly counts as AI.
This has not been helped by the current trend, among researchers in science and the humanities, startups, established companies, and even governments, to associate anything involving not only machine learning, but data science, algorithms, robots, and automation of all sorts with AI. This could simply reflect the hype now associated with AI, but it could also be an acknowledgment of the success of the current wave of AI and its related techniques and their wide-ranging use and usefulness. I think both are true; but it has not always been like this. In the period now referred to as the AI winter, during which progress in AI did not live up to expectations, there was a reticence to associate most of what we now call AI with AI. Two types of definitions are typically given for AI. The first are those that suggest that it is the ability to artificially do what intelligent beings, usually human, can do. For example, artificial intelligence is:
The human abilities invoked in such definitions include visual perception, speech recognition, the capacity to reason, solve problems, discover meaning, generalize, and learn from experience. Definitions of this type are considered by some to be limiting in their human-centricity as to what counts as intelligence and in the benchmarks for success they set for the development of AI (more on this later).
The second type of definition tries to be free of human-centricity and defines an intelligent agent or system, whatever its origin, makeup, or method, as:
This type of definition also suggests the pursuit of goals, which could be given to the system, self-generated, or learned.13 That both types of definitions are employed throughout this volume yields insights of its own. These definitional distinctions notwithstanding, the term AI, much to the chagrin of some in the field, has come to be what cognitive and computer scientist Marvin Minsky called a "suitcase word."14 It is packed variously, depending on who you ask, with approaches for achieving intelligence, including those based on logic, probability, information and control theory, neural networks, and various other learning, inference, and planning methods, as well as their instantiations in software, hardware, and, in the case of embodied intelligence, systems that can perceive, move, and manipulate objects. Three questions cut through the discussions in this volume: 1) Where are we in AI's development? 2) What opportunities and challenges does AI pose for society? 3) How much about AI is really about us? Notions of intelligent machines date all the way back to antiquity.15 Philosophers, too, among them Hobbes, Leibniz, and Descartes, have been dreaming about AI for a long time; Daniel Dennett suggests that Descartes may have even anticipated the Turing Test.16 The idea of computation-based machine intelligence traces to Alan Turing's invention of the universal Turing machine in the 1930s, and to the ideas of several of his contemporaries in the mid-twentieth century. But the birth of artificial intelligence as we know it and the use of the term is generally attributed to the now famed Dartmouth summer workshop of 1956.
The workshop was the result of a proposal for a two-month summer project by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon whereby "An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves."17 In their respective contributions to this volume, "From So Simple a Beginning: Species of Artificial Intelligence" and "If We Succeed," and in different but complementary ways, Nigel Shadbolt and Stuart Russell chart the key ideas and developments in AI, its periods of excitement as well as the aforementioned AI winters. The current AI spring has been underway since the 1990s, with headline-grabbing breakthroughs appearing in rapid succession over the last ten years or so: a period that Jeffrey Dean describes in the title of his essay as a "golden decade," not only for the pace of AI development but also its use in a wide range of sectors of society, as well as areas of scientific research.18 This period is best characterized by the approach to achieve artificial intelligence through learning from experience, and by the success of neural networks, deep learning, and reinforcement learning, together with methods from probability theory, as ways for machines to learn.19 A brief history may be useful here: In the 1950s, there were two dominant visions of how to achieve machine intelligence. One vision was to use computers to create a logic and symbolic representation of the world and our knowledge of it and, from there, create systems that could reason about the world, thus exhibiting intelligence akin to the mind. This vision was most espoused by Allen Newell and Herbert Simon, along with Marvin Minsky and others. Closely associated with it was the "heuristic search" approach that supposed intelligence was essentially a problem of exploring a space of possibilities for answers.
The second vision was inspired by the brain, rather than the mind, and sought to achieve intelligence by learning. In what became known as the connectionist approach, units called perceptrons were connected in ways inspired by the connection of neurons in the brain. At the time, this approach was most associated with Frank Rosenblatt. While there was initial excitement about both visions, the first came to dominate, and did so for decades, with some successes, including so-called expert systems. Not only did this approach benefit from championing by its advocates and plentiful funding, it came with the suggested weight of a long intellectual tradition (exemplified by Descartes, Boole, Frege, Russell, and Church, among others) that sought to manipulate symbols and to formalize and axiomatize knowledge and reasoning. It was only in the late 1980s that interest began to grow again in the second vision, largely through the work of David Rumelhart, Geoffrey Hinton, James McClelland, and others. The history of these two visions and the associated philosophical ideas are discussed in Hubert Dreyfus and Stuart Dreyfus's 1988 Dædalus essay "Making a Mind Versus Modeling the Brain: Artificial Intelligence Back at a Branchpoint."20 Since then, the approach to intelligence based on learning, the use of statistical methods, back-propagation, and training (supervised and unsupervised) has come to characterize the current dominant approach. Kevin Scott, in his essay "I Do Not Think It Means What You Think It Means: Artificial Intelligence, Cognitive Work & Scale," reminds us of the work of Ray Solomonoff and others linking information and probability theory with the idea of machines that can not only learn, but compress and potentially generalize what they learn, and the emerging realization of this in the systems now being built and those to come.
The success of the machine learning approach has benefited from the boon in the availability of data to train the algorithms thanks to the growth in the use of the Internet and other applications and services. In research, the data explosion has been the result of new scientific instruments and observation platforms and data-generating breakthroughs, for example, in astronomy and in genomics. Equally important has been the co-evolution of the software and hardware used, especially chip architectures better suited to the parallel computations involved in data- and compute-intensive neural networks and other machine learning approaches, as Dean discusses. Several authors delve into progress in key subfields of AI.21 In their essay, "Searching for Computer Vision North Stars," Fei-Fei Li and Ranjay Krishna chart developments in machine vision and the creation of standard data sets such as ImageNet that could be used for benchmarking performance. In their respective essays "Human Language Understanding & Reasoning" and "The Curious Case of Commonsense Intelligence," Chris Manning and Yejin Choi discuss different eras and ideas in natural language processing, including the recent emergence of large language models that comprise hundreds of billions of parameters and use transformer architectures and self-supervised learning on vast amounts of data.22 The resulting pretrained models are impressive in their capacity to take natural language prompts for which they have not been trained specifically and generate human-like outputs, not only in natural language, but also images, software code, and more, as Mira Murati discusses and illustrates in "Language & Coding Creativity."
Some have started to refer to these large language models as foundational models in that once they are trained, they are adaptable to a wide range of tasks and outputs.23 But despite their unexpected performance, these large language models are still early in their development and have many shortcomings and limitations that are highlighted in this volume and elsewhere, including by some of their developers.24 In "The Machines from Our Future," Daniela Rus discusses the progress in robotic systems, including advances in the underlying technologies, as well as in their integrated design that enables them to operate in the physical world. She highlights the limitations in the "industrial" approaches used thus far and suggests new ways of conceptualizing robots that draw on insights from biological systems. In robotics, as in AI more generally, there has always been a tension as to whether to copy or simply draw inspiration from how humans and other biological organisms achieve intelligent behavior.
Elsewhere, AI researcher Demis Hassabis and colleagues have explored how neuroscience and AI learn from and inspire each other, although so far more in one than the other, as and have the success of the current approaches to AI, there are still many shortcomings and as well as problems in It is useful to on one such as when AI does not as or or or that can to or when it on or information about the world, or when it has such as of all of which can to a of public shortcomings have captured the attention of the wider public and as well as among there is an on AI and In recent years, there has been a of to principles and approaches to AI, as well as involving and such as the on AI, that to best important has been the of with to and - in the and developing AI in both and as has been well in recent This is an important in its own but also with to the of the resulting AI and, in its intersections with more the other there are limitations and problems associated with the that AI is not capable of if could to more more or more general AI. 
In their Turing deep learning and Geoffrey took of where deep learning and highlighted its current such as the with In the case of natural language processing, Manning and Choi the challenges in and despite the of large language Elsewhere, and have the notion that large language models do anything learning, or In & of in a and discuss the problems in systems, the as how to reason about other their systems, and well as challenges in both and especially when the include both humans and Elsewhere, and others a useful of the problems in there is a growing among many that we do not have for the of AI systems, especially as they become more capable and the of use although AI and its related techniques are to be powerful tools for research in science, as examples in this volume and recent examples in which AI not only help results but also by design and become what some have AI to science and and to and challenges for the possibility that more powerful AI could to new in science, as well as progress in some of challenges and has long been a key for many at the frontier of AI research to more capable the of each of AI, the of more general problems that to the possibility of more capable AI learning, reasoning, of and and of these and other problems that could to more capable systems the of whether current characterized by deep learning, the of and and more foundational and and reinforcement or whether different approaches are in such as cognitive agent approaches or or based on logic and probability theory, to name a few. 
whether and what of approaches be the AI is but many the current along with of and learning architectures have to their about the of the current approaches is associated with the of whether artificial general intelligence can be and if how and Artificial general intelligence is in to what is called that AI and for tasks and goals, such as The development of on the other aims for more powerful AI - at as powerful as is generally to problem or and, in some the capacity to and improve as well as set and its own and the of and when will be is a for most that its achievement have and as is often in and such as A through and The to Ex and it is or there is growing among many at the frontier of AI research that we for the possibility of powerful with to and and with humans, its and use, and the possibility that of could and that we these into how we approach the development of of the research and development, and in AI is of the AI and in its what Nigel Shadbolt the of AI. This is given the for useful and applications and the for in sectors of the However, a few have made the development of their the most of these are and each of which has demonstrated results of increasing still a long way from the most discussed impact of AI and automation is on and the future of This is not In in the of the excitement about AI and and concerns about their impact on a on and the was that such technologies were important for growth and and "the that but not Most recent of this including those I have been involved have and that over time, more are than are that it is the and the and the of will the In their essay AI & and John discuss these for work and further, in & the of & to discuss the with to and and as well as the opportunities that are especially in developing In "The Turing The & of Artificial Intelligence," discusses how the use of human benchmarks in the development of AI the of AI that rather than human He that the AI's development will take in this and resulting for will on 
the for companies, and a that the that more will be than too much from of the and does not far enough into the future and at what AI will be capable The for AI could from of that in the is and labor and ability to are and and until automation has mostly physical and but that AI will be on more cognitive and tasks based on and, if early examples are even tasks are not of the In other are now in the world machines that that learn and that their ability to do these is to a range of problems they can will be with the range to which the human has been This was and Allen Newell in that this time could be different usually two that new labor will in which will by other humans for their own even when machines may be capable of these as well as or even better than The other is that AI will create so much and all without the for human and the of will be to for when that will the that once the first time since his creation will be with his his to use his from how to the which science and interest will have for to live and and However, most researchers that we are not to a future in which the of will and that until then, there are other and that be in the labor now and in the such as and other and how humans work increasingly capable that and John and discuss in this are not the only of the by AI. Russell a of the potentially from artificial general intelligence, once a of or ten But even we to general-purpose AI, the opportunities for companies and, for the and growth as well as from AI and its related technologies are more than to pursuit and by companies and in the development, and use of AI. At the many the is it is generally that is a in AI, as by its growth in AI research, and as highlighted in several will have for companies and given the of such technologies as discussed by and others the may in the way of approaches to AI and (such as whether they are companies or as and have have the to to in AI. 
The role of AI in intelligence, systems, autonomous even and other of increasingly In &
- Research Article
- 10.1088/1402-4896/ad7a27
- Oct 1, 2024
- Physica Scripta
Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute ∼50 original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more ‘educational’ Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about 40% of the solutions could plausibly get a passing grade. About 70%–90% of the code lines produced are necessary, sufficient and correct (coding & physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain, such as poor physical units handling, poor code versioning, tendency to hallucinate plausible sub-modules, lack of physical justification for global run parameters (e.g., simulation time, or upper-lower bounds for parametric exploration) and inability to define steady-state or stopping conditions reliably.
Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.
- Research Article
- 10.1609/aaai.v39i24.34760
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.
- Research Article
- 10.55041/ijsrem54274
- Nov 21, 2025
- International Journal of Scientific Research in Engineering and Management
Large Language Models (LLMs) like ChatGPT, Gemini, and Claude are transforming teaching and learning practices worldwide. These AI systems can generate human-like text, assist in assignments, support research, and produce learning materials. The adoption of LLMs is rapidly increasing among students and teachers, offering numerous advantages such as increased efficiency, personalized learning, and improved understanding. However, LLMs also present certain challenges such as accuracy issues, misinformation, plagiarism risks, reduced creativity, and over-dependence. Thus, understanding the perceptions of both students and teachers is essential for integrating LLMs responsibly into the education system. This study examines data from 100 students and 100 teachers collected through structured questionnaires. The findings show that students commonly use LLMs for research, writing assignments, coding help, and presentations, whereas teachers primarily use them for idea generation, content development, coding assistance, and personalizing instruction. Both groups express mixed views regarding accuracy and ethical use. The study concludes that LLMs can significantly enhance education when used with proper guidelines and AI literacy. Recommendations include institutional policies, training programs, and curriculum integration. The study contributes to a balanced understanding of LLM use in academic settings.
- Research Article
- 10.3390/a19030170
- Feb 25, 2026
- Algorithms
Recent studies investigating the diagnostic capabilities of large language models (LLMs) have attracted significant media attention, often resulting in headlines claiming that AI systems can match or even outperform physicians. As LLMs have rapidly proliferated, this has fueled a widespread misconception that they represent the cutting edge of artificial intelligence in all contexts. This narrative tends to overshadow the continued importance of task-specific machine learning models, which were developed and validated for particular diagnostic applications well before the rise of LLMs. This single-case study evaluated the reliability of five leading multimodal LLMs (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended) for radiological image interpretation by presenting each model with an identical non-contrast head CT demonstrating intracranial pathology, complemented by a novel cross-evaluation protocol wherein each model graded all responses. The deliberate use of a straightforward case (rather than diagnostically challenging pathology) aimed to establish minimum competency thresholds; if LLMs cannot reliably interpret obvious pathology, their deployment on ambiguous cases becomes indefensible. The study intentionally excluded human radiologist ground truth to avoid generating comparative accuracy metrics that could be selectively cited for commercial purposes, focusing instead on demonstrating class-wide limitations rather than ranking individual products. Results revealed a 20% rate of fundamental diagnostic error, with one model misidentifying ischemic stroke as intracerebral hemorrhage with incorrect lateralization. Even among concordant models, clinically meaningful variability persisted in acuity characterization, anatomical localization, and differential diagnoses. Cross-evaluation exposed ground truth disagreement between models, self-evaluation bias, inconsistent grading stringency, and divergent evaluation philosophies. 
Only one model included appropriate safety disclaimers. These findings demonstrate that current multimodal LLMs exhibit unacceptable diagnostic variability and evaluative inconsistency for autonomous clinical deployment. The appropriate clinical role for LLMs should be distinguished by deployment context: autonomous diagnosis requires validated task-specific models; decision support applications demand rigorous radiologist oversight protocols; and educational summarization represents the most appropriate current use case, with mandatory disclaimers. Healthcare applications requiring reliable image interpretation should prioritize validated, task-specific machine learning systems over general-purpose language models.
- Research Article
- 10.1007/s11098-025-02347-3
- May 27, 2025
- Philosophical Studies
The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans’ ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This “shallow alignment” problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.
- Research Article
- 10.1093/pnasnexus/pgaf317
- Dec 1, 2025
- PNAS Nexus
Social networks shape how humans form opinions, exchange information, and organize collectively. As large language models (LLMs) become embedded in social and professional environments, it is critical to understand whether their interactions resemble human network dynamics. We introduce a framework to study the network formation behaviors of multiple LLM agents and benchmark them against human decisions. Across synthetic and real-world settings, including friendship, telecommunication, and employment networks, LLMs reproduce core microlevel principles (preferential attachment, triadic closure, and homophily), and macrolevel properties (community structure, small-world effects). Their emphasis on these principles adapts to context: for example, LLMs favor homophily in friendship networks but heterophily in organizational settings, mirroring patterns of social mobility. A controlled survey shows strong alignment between LLM and human link-formation decisions. These results highlight LLMs’ potential as tools for social simulation and synthetic data generation, while underscoring risks of bias and fairness in AI systems that interact with human networks.
- Research Article
- 10.7759/cureus.100476
- Dec 31, 2025
- Cureus
AI systems are increasingly being evaluated for their potential role in medical decision-making. Pulmonary thromboembolism (PTE) represents an ideal test domain for evaluating AI clinical reasoning capabilities due to its high prevalence, significant mortality risk, and clinical complexity requiring integration of validated risk stratification tools, multiple imaging modalities, and nuanced treatment algorithms across diverse patient populations, including pregnancy, malignancy, and renal impairment. We compared the performance of large language models (LLMs) with specialist physicians on PTE knowledge assessment. We administered 25 multiple-choice questions covering the diagnosis, treatment, complications, and management of PTE to 17 physicians (seven emergency medicine, five internal medicine, and five pulmonary specialists) and three AI systems: ChatGPT-4 (OpenAI, San Francisco, CA, USA), Claude 2 (Anthropic, San Francisco, CA, USA), and Google Med-PaLM (Google Research, Mountain View, CA, USA). Questions were categorized into four domains: diagnosis, treatment, complications, and management/ICU. We calculated overall accuracy and domain-specific performance. We applied a pre-specified non-inferiority margin of 10 percentage points, a threshold consistent with FDA guidance for medical device comparison studies and prior AI-physician trials, representing the maximum clinically acceptable performance gap that would still support practical utility in adjunctive clinical decision support while maintaining appropriate safety standards. Internal medicine and pulmonary specialists achieved the highest scores (80% each), matched by Claude 2 (80%). ChatGPT-4 and Med-PaLM scored 72% each, while emergency medicine specialists averaged 64.6%. Claude 2 significantly outperformed emergency medicine physicians (+15.4 percentage points, p<0.05).
ChatGPT-4 and Med-PaLM demonstrated non-inferiority to internal medicine and pulmonary specialists (-8 percentage points, within the 10-percentage-point margin). All groups performed well on diagnostic questions but struggled with nuanced treatment and management scenarios. AI systems showed particular difficulty with guideline-based edge cases and cancer-associated thromboembolism management. Advanced AI systems can achieve specialist-level performance on structured medical knowledge assessments. Claude 2 matched top specialists and exceeded emergency medicine performance, while other AI systems were non-inferior to domain experts. These findings support the potential utility of AI in medical education and clinical decision support while highlighting areas requiring further development.
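The margin comparison reported in this abstract can be retraced with simple arithmetic. The following is a minimal sketch, assuming only the accuracies quoted above; the names `reported_accuracy`, `gap`, and `within_margin` are hypothetical and are not taken from the study. It compares point estimates only, whereas a formal non-inferiority test would normally compare a confidence bound against the margin, and the abstract does not report those bounds.

```python
# Accuracies (percent) as quoted in the abstract; nothing else is assumed.
reported_accuracy = {
    "Internal medicine": 80.0,
    "Pulmonary": 80.0,
    "Emergency medicine": 64.6,
    "Claude 2": 80.0,
    "ChatGPT-4": 72.0,
    "Med-PaLM": 72.0,
}

# Pre-specified non-inferiority margin, in percentage points.
MARGIN = 10.0


def gap(candidate: str, reference: str) -> float:
    """Accuracy difference (candidate minus reference) in percentage points."""
    difference = reported_accuracy[candidate] - reported_accuracy[reference]
    return round(difference, 1)


def within_margin(candidate: str, reference: str, margin: float = MARGIN) -> bool:
    """True if the candidate trails the reference by no more than the margin.

    Point-estimate check only (an illustrative simplification).
    """
    return gap(candidate, reference) >= -margin


if __name__ == "__main__":
    for model in ("Claude 2", "ChatGPT-4", "Med-PaLM"):
        for group in ("Internal medicine", "Pulmonary", "Emergency medicine"):
            print(model, "vs", group, gap(model, group), within_margin(model, group))
```

On these figures, ChatGPT-4 and Med-PaLM trail the internal medicine and pulmonary specialists by 8 percentage points, inside the 10-point margin, and Claude 2 leads the emergency medicine group by 15.4 points, matching the differences stated in the abstract.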