Metacognition and Uncertainty Communication in Humans and Large Language Models
Metacognition—the capacity to monitor and evaluate one’s own knowledge and performance—is foundational to human decision-making, learning, and communication. As large language models (LLMs) become increasingly embedded in both high-stakes and widespread low-stakes contexts, it is important to assess whether, how, and to what extent they exhibit metacognitive abilities. Here, we provide an overview of the current knowledge of LLMs’ metacognitive capacities, how they might be studied, and how they relate to our knowledge of metacognition in humans. We show that although humans and LLMs can sometimes appear quite aligned in their metacognitive capacities and behaviors, it is clear many differences remain; attending to these differences is important for enhancing the collaboration between humans and artificial intelligence. Last, we discuss how endowing future LLMs with more sensitive and more calibrated metacognition may also help them develop new capacities such as more efficient learning, self-direction, and curiosity.
- Research Article
8
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can <i>IJDS</i> Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
1
- 10.1038/s41598-025-22290-x
- Oct 7, 2025
- Scientific Reports
Large language models (LLMs) increasingly mimic human cognition in various language-based tasks. However, their capacity for metacognition—particularly in predicting memory performance—remains unexplored. Here, we introduce a cross-agent prediction model to assess whether ChatGPT-based LLMs align with human judgments of learning (JOL), a metacognitive measure where individuals predict their own future memory performance. We tested humans and LLMs on pairs of sentences, one of which was a garden-path sentence—a sentence that initially misleads the reader toward an incorrect interpretation before requiring reanalysis. By manipulating contextual fit (fitting vs. unfitting sentences), we probed how intrinsic cues (i.e., relatedness) affect both LLM and human JOL. Our results revealed that while human JOL reliably predicted actual memory performance, none of the tested LLMs (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) demonstrated comparable predictive accuracy. This discrepancy emerged regardless of whether sentences appeared in fitting or unfitting contexts. These findings indicate that, despite LLMs’ demonstrated capacity to model human cognition at the object-level, they struggle at the meta-level, failing to capture the variability in individual memory predictions. By identifying this shortcoming, our study underscores the need for further refinements in LLMs’ self-monitoring abilities, which could enhance their utility in educational settings, personalized learning, and human–AI interactions. Strengthening LLMs’ metacognitive performance may reduce the reliance on human oversight, paving the way for more autonomous and seamless integration of AI into tasks requiring deeper cognitive awareness.
- Discussion
- 10.1111/cogs.13430
- Mar 1, 2024
- Cognitive science
This letter explores the intricate historical and contemporary links between large language models (LLMs) and cognitive science through the lens of information theory, statistical language models, and socioanthropological linguistic theories. The emergence of LLMs highlights the enduring significance of information-based and statistical learning theories in understanding human communication. These theories, initially proposed in the mid-20th century, offered a visionary framework for integrating computational science, social sciences, and humanities, which nonetheless was not fully fulfilled at that time. The subsequent development of sociolinguistics and linguistic anthropology, especially since the 1970s, provided critical perspectives and empirical methods that both challenged and enriched this framework. This letter proposes that two pivotal concepts derived from this development, metapragmatic function and indexicality, offer a fruitful theoretical perspective for integrating the semantic, textual, and pragmatic, contextual dimensions of communication, an amalgamation that contemporary LLMs have yet to fully achieve. The author believes that contemporary cognitive science is at a crucial crossroads, where fostering interdisciplinary dialogues among computational linguistics, social linguistics and linguistic anthropology, and cognitive and social psychology is in particular imperative. Such collaboration is vital to bridge the computational, cognitive, and sociocultural aspects of human communication and human-AI interaction, especially in the era of large language and multimodal models and human-centric Artificial Intelligence (AI).
- Research Article
5
- 10.1016/j.joms.2024.11.007
- Mar 1, 2025
- Journal of Oral and Maxillofacial Surgery
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
- Research Article
23
- 10.1016/j.cpa.2024.102722
- Feb 22, 2024
- Critical Perspectives on Accounting
New large language models (LLMs) like ChatGPT have the potential to change qualitative research by contributing to every stage of the research process from generating interview questions to structuring research publications. However, it is far from clear whether such ‘assistance’ will enable or deskill and eventually displace the qualitative researcher. This paper sets out to explore the implications for qualitative research of the recently emerged capabilities of LLMs; how they have acquired their seemingly ‘human-like’ capabilities to ‘converse’ with us humans, and in what ways these capabilities are deceptive or misleading. Building on a comparison of the different ‘trainings’ of humans and LLMs, the paper first traces the seemingly human-like qualities of the LLM to the human proclivity to project communicative intent into or onto LLMs’ purely imitative capacity to predict the structure of human communication. It then goes on to detail the ways in which such human-like communication is deceptive and misleading in relation to the absolute ‘certainty’ with which LLMs ‘converse’, their intrinsic tendencies to ‘hallucination’ and ‘sycophancy’, the narrow conception of ‘artificial intelligence’, LLMs’ complete lack of ethical sensibility or capacity for responsibility, and finally the feared danger of an ‘emergence’ of ‘human-competitive’ or ‘superhuman’ LLM capabilities. The paper concludes by noting the potential dangers of the widespread use of LLMs as ‘mediators’ of human self-understanding and culture. A postscript offers a brief reflection on what only humans can do as qualitative researchers.
- Research Article
- 10.1182/blood-2025-6214
- Nov 3, 2025
- Blood
Evaluating artificial intelligence (AI) as a clinical decision support tool for AML patients
- Research Article
- 10.1007/s44202-025-00463-z
- Dec 17, 2025
- Discover Psychology
Background & problem statement Human metacognition, defined as the awareness and regulation of one’s cognitive processes, plays a critical role in effective learning and decision-making. With the advancement of Artificial Intelligence (AI), educational technologies now offer personalized support to human being to accomplish certain tasks. It may indirectly, enhance or hinder human metacognitive skills. However, despite growing research interest, a comprehensive understanding of global research trends in understanding how AI intermediates with human metacognition remains lacking. Purpose This study aims to systematically map the scientific landscape of how AI intermediates with human metacognition through a bibliometric analysis, identifying key contributors, influential publications, and emerging research themes. Methodology A total of 144 articles published between 1985 and 2024 were retrieved from the Scopus database. Using bibliometric tools such as VOSviewer and Bibliometrix (R-package), the study employed performance analysis, co-authorship analysis, and co-word analysis to examine publication trends, leading countries, collaboration networks, prolific journals, and thematic clusters within the field. Findings & contributions The analysis reveals a gradual increase in publications with AI and human metacognition as common point of study from 1985, with a sharp rise post-2009 and a peak in 2023. The United States leads in research output and the most cited work is by Graesser (2005), focusing on metacognitive scaffolding through intelligent tutoring systems such as iSTART and AutoTutor. The study also highlighted the limited number of empirical studies discussing the negative effects of AI on human metacognitive abilities although few studies discussed about overreliance on AI, fear of failure and other aspects. Theoretically, this research repositions AI not only as a facilitator of technology but as a cognitive collaborator that directly influences metacognitive activities such as planning, monitoring, and reflective judgment. This theoretical framework helps to develop metacognitive theory in the digital era by situating AI as a collaborator in, not a substitute for, human cognitive control.
- Research Article
124
- 10.1097/corr.0000000000002704
- May 23, 2023
- Clinical orthopaedics and related research
Advances in neural networks, deep learning, and artificial intelligence (AI) have progressed recently. Previous deep learning AI has been structured around domain-specific areas that are trained on dataset-specific areas of interest that yield high accuracy and precision. A new AI model using large language models (LLM) and nonspecific domain areas, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and if scoring lower than the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, including five questions the LLM could not provide an answer for, resulting in 207 questions administered with raw score recorded. The LLM's answer results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Questions answered were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; comparison was made of the LLM's performance across taxonomic levels and was analyzed using a chi-square test. ChatGPT selected the correct answer 47% (97 of 207) of the time, and 53% (110 of 207) of the time it answered incorrectly. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1s, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, testing performance and knowledge are comparable to that of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in implementing knowledge. Current AI appears to perform better at knowledge and interpretation-based inquires, and based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
- Research Article
16
- 10.1162/daed_e_01897
- May 1, 2022
- Daedalus
This dialogue is from an early scene in the 2014 film Ex Machina, in which Nathan has invited Caleb to determine whether Nathan has succeeded in creating artificial intelligence.1 The achievement of powerful artificial general intelligence has long held a grip on our imagination not only for its exciting as well as worrisome possibilities, but also for its suggestion of a new, uncharted era for humanity. In opening his 2021 BBC Reith Lectures, titled "Living with Artificial Intelligence," Stuart Russell states that "the eventual emergence of general-purpose artificial intelligence [will be] the biggest event in human history."2Over the last decade, a rapid succession of impressive results has brought wider public attention to the possibilities of powerful artificial intelligence. In machine vision, researchers demonstrated systems that could recognize objects as well as, if not better than, humans in some situations. Then came the games. Complex games of strategy have long been associated with superior intelligence, and so when AI systems beat the best human players at chess, Atari games, Go, shogi, StarCraft, and Dota, the world took notice. It was not just that Als beat humans (although that was astounding when it first happened), but the escalating progression of how they did it: initially by learning from expert human play, then from self-play, then by teaching themselves the principles of the games from the ground up, eventually yielding single systems that could learn, play, and win at several structurally different games, hinting at the possibility of generally intelligent systems.3Speech recognition and natural language processing have also seen rapid and headline-grabbing advances. Most impressive has been the emergence recently of large language models capable of generating human-like outputs. Progress in language is of particular significance given the role language has always played in human notions of intelligence, reasoning, and understanding. While the advances mentioned thus far may seem abstract, those in driverless cars and robots have been more tangible given their embodied and often biomorphic forms. Demonstrations of such embodied systems exhibiting increasingly complex and autonomous behaviors in our physical world have captured public attention.Also in the headlines have been results in various branches of science in which AI and its related techniques have been used as tools to advance research from materials and environmental sciences to high energy physics and astronomy.4 A few highlights, such as the spectacular results on the fifty-year-old protein-folding problem by AlphaFold, suggest the possibility that AI could soon help tackle science's hardest problems, such as in health and the life sciences.5While the headlines tend to feature results and demonstrations of a future to come, AI and its associated technologies are already here and pervade our daily lives more than many realize. Examples include recommendation systems, search, language translators - now covering more than one hundred languages - facial recognition, speech to text (and back), digital assistants, chatbots for customer service, fraud detection, decision support systems, energy management systems, and tools for scientific research, to name a few. In all these examples and others, AI-related techniques have become components of other software and hardware systems as methods for learning from and incorporating messy real-world inputs into inferences, predictions, and, in some cases, actions. As director of the Future of Humanity Institute at the University of Oxford, Nick Bostrom noted back in 2006, "A lot of cutting-edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it's not labeled AI anymore."6As the scope, use, and usefulness of these systems have grown for individual users, researchers in various fields, companies and other types of organizations, and governments, so too have concerns when the systems have not worked well (such as bias in facial recognition systems), or have been misused (as in deepfakes), or have resulted in harms to some (in predicting crime, for example), or have been associated with accidents (such as fatalities from self-driving cars).7Dædalus last devoted a volume to the topic of artificial intelligence in 1988, with contributions from several of the founders of the field, among others. Much of that issue was concerned with questions of whether research in AI was making progress, of whether AI was at a turning point, and of its foundations, mathematical, technical, and philosophical-with much disagreement. However, in that volume there was also a recognition, or perhaps a rediscovery, of an alternative path toward AI - the connectionist learning approach and the notion of neural nets-and a burgeoning optimism for this approach's potential. Since the 1960s, the learning approach had been relegated to the fringes in favor of the symbolic formalism for representing the world, our knowledge of it, and how machines can reason about it. Yet no essay captured some of the mood at the time better than Hilary Putnam's "Much Ado About Not Very Much." Putnam questioned the Dædalus issue itself: "Why a whole issue of Dædalus? Why don't we wait until AI achieves something and then have an issue?" He concluded:This volume of Dædalus is indeed the first since 1988 to be devoted to artificial intelligence. This volume does not rehash the same debates; much else has happened since, mostly as a result of the success of the machine learning approach that was being rediscovered and reimagined, as discussed in the 1988 volume. This issue aims to capture where we are in AI's development and how its growing uses impact society. The themes and concerns herein are colored by my own involvement with AI. Besides the television, films, and books that I grew up with, my interest in AI began in earnest in 1989 when, as an undergraduate at the University of Zimbabwe, I undertook a research project to model and train a neural network.9 I went on to do research on AI and robotics at Oxford. Over the years, I have been involved with researchers in academia and labs developing AI systems, studying AI's impact on the economy, tracking AI's progress, and working with others in business, policy, and labor grappling with its opportunities and challenges for society.10The authors of the twenty-five essays in this volume range from AI scientists and technologists at the frontier of many of AI's developments to social scientists at the forefront of analyzing AI's impacts on society. The volume is organized into ten sections. Half of the sections are focused on AI's development, the other half on its intersections with various aspects of society. In addition to the diversity in their topics, expertise, and vantage points, the authors bring a range of views on the possibilities, benefits, and concerns for society. I am grateful to the authors for accepting my invitation to write these essays.Before proceeding further, it may be useful to say what we mean by artificial intelligence. The headlines and increasing pervasiveness of AI and its associated technologies have led to some conflation and confusion about what exactly counts as AI. This has not been helped by the current trend-among researchers in science and the humanities, startups, established companies, and even governments-to associate anything involving not only machine learning, but data science, algorithms, robots, and automation of all sorts with AI. This could simply reflect the hype now associated with AI, but it could also be an acknowledgment of the success of the current wave of AI and its related techniques and their wide-ranging use and usefulness. I think both are true; but it has not always been like this. In the period now referred to as the AI winter, during which progress in AI did not live up to expectations, there was a reticence to associate most of what we now call AI with AI.Two types of definitions are typically given for AI. The first are those that suggest that it is the ability to artificially do what intelligent beings, usually human, can do. For example, artificial intelligence is:The human abilities invoked in such definitions include visual perception, speech recognition, the capacity to reason, solve problems, discover meaning, generalize, and learn from experience. Definitions of this type are considered by some to be limiting in their human-centricity as to what counts as intelligence and in the benchmarks for success they set for the development of AI (more on this later). The second type of definitions try to be free of human-centricity and define an intelligent agent or system, whatever its origin, makeup, or method, as:This type of definition also suggests the pursuit of goals, which could be given to the system, self-generated, or learned.13 That both types of definitions are employed throughout this volume yields insights of its own.These definitional distinctions notwithstanding, the term AI, much to the chagrin of some in the field, has come to be what cognitive and computer scientist Marvin Minsky called a "suitcase word."14 It is packed variously, depending on who you ask, with approaches for achieving intelligence, including those based on logic, probability, information and control theory, neural networks, and various other learning, inference, and planning methods, as well as their instantiations in software, hardware, and, in the case of embodied intelligence, systems that can perceive, move, and manipulate objects.Three questions cut through the discussions in this volume: 1) Where are we in AI's development? 2) What opportunities and challenges does AI pose for society? 3) How much about AI is really about us?Notions of intelligent machines date all the way back to antiquity.15 Philosophers, too, among them Hobbes, Leibnitz, and Descartes, have been dreaming about AI for a long time; Daniel Dennett suggests that Descartes may have even anticipated the Turing Test.16 The idea of computation-based machine intelligence traces to Alan Turing's invention of the universal Turing machine in the 1930s, and to the ideas of several of his contemporaries in the mid-twentieth century. But the birth of artificial intelligence as we know it and the use of the term is generally attributed to the now famed Dartmouth summer workshop of 1956. The workshop was the result of a proposal for a two-month summer project by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon whereby "An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves."17In their respective contributions to this volume, "From So Simple a Beginning: Species of Artificial Intelligence" and "If We Succeed," and in different but complementary ways, Nigel Shadbolt and Stuart Russell chart the key ideas and developments in AI, its periods of excitement as well as the aforementioned AI winters. The current AI spring has been underway since the 1990s, with headline-grabbing breakthroughs appearing in rapid succession over the last ten years or so: a period that Jeffrey Dean describes in the title of his essay as a "golden decade," not only for the pace of AI development but also its use in a wide range of sectors of society, as well as areas of scientific research.18 This period is best characterized by the approach to achieve artificial intelligence through learning from experience, and by the success of neural networks, deep learning, and reinforcement learning, together with methods from probability theory, as ways for machines to learn.19A brief history may be useful here: In the 1950s, there were two dominant visions of how to achieve machine intelligence. One vision was to use computers to create a logic and symbolic representation of the world and our knowledge of it and, from there, create systems that could reason about the world, thus exhibiting intelligence akin to the mind. This vision was most espoused by Allen Newell and Hebert Simon, along with Marvin Minsky and others. Closely associated with it was the "heuristic search" approach that supposed intelligence was essentially a problem of exploring a space of possibilities for answers. The second vision was inspired by the brain, rather than the mind, and sought to achieve intelligence by learning. In what became known as the connectionist approach, units called perceptrons were connected in ways inspired by the connection of neurons in the brain. At the time, this approach was most associated with Frank Rosenblatt. While there was initial excitement about both visions, the first came to dominate, and did so for decades, with some successes, including so-called expert systems.Not only did this approach benefit from championing by its advocates and plentiful funding, it came with the suggested weight of a long intellectual tradition-exemplified by Descartes, Boole, Frege, Russell, and Church, among others-that sought to manipulate symbols and to formalize and axiomatize knowledge and reasoning. It was only in the late 1980s that interest began to grow again in the second vision, largely through the work of David Rumelhart, Geoffrey Hinton, James McClelland, and others. The history of these two visions and the associated philosophical ideas are discussed in Hubert Dreyfus and Stuart Dreyfus's 1988 Dædalus essay "Making a Mind Versus Modeling the Brain: Artificial Intelligence Back at a Branchpoint."20 Since then, the approach to intelligence based on learning, the use of statistical methods, back-propagation, and training (supervised and unsupervised) has come to characterize the current dominant approach.Kevin Scott, in his essay "I Do Not Think It Means What You Think It Means: Artificial Intelligence, Cognitive Work & Scale," reminds us of the work of Ray Solomonoff and others linking information and probability theory with the idea of machines that can not only learn, but compress and potentially generalize what they learn, and the emerging realization of this in the systems now being built and those to come. The success of the machine learning approach has benefited from the boon in the availability of data to train the algorithms thanks to the growth in the use of the Internet and other applications and services. In research, the data explosion has been the result of new scientific instruments and observation platforms and data-generating breakthroughs, for example, in astronomy and in genomics. Equally important has been the co-evolution of the software and hardware used, especially chip architectures better suited to the parallel computations involved in data- and compute-intensive neural networks and other machine learning approaches, as Dean discusses.Several authors delve into progress in key subfields of AI.21 In their essay, "Searching for Computer Vision North Stars," Fei-Fei Li and Ranjay Krishna chart developments in machine vision and the creation of standard data sets such as ImageNet that could be used for benchmarking performance. In their respective essays "Human Language Understanding & Reasoning" and "The Curious Case of Commonsense Intelligence," Chris Manning and Yejin Choi discuss different eras and ideas in natural language processing, including the recent emergence of large language models comprising hundreds of billions of parameters and that use transformer architectures and self-supervised learning on vast amounts of data.22 The resulting pretrained models are impressive in their capacity to take natural language prompts for which they have not been trained specifically and generate human-like outputs, not only in natural language, but also images, software code, and more, as Mira Murati discusses and illustrates in "Language & Coding Creativity." Some have started to refer to these large language models as foundational models in that once they are trained, they are adaptable to a wide range of tasks and outputs.23 But despite their unexpected performance, these large language models are still early in their development and have many shortcomings and limitations that are highlighted in this volume and elsewhere, including by some of their developers.24In "The Machines from Our Future," Daniela Rus discusses the progress in robotic systems, including advances in the underlying technologies, as well as in their integrated design that enables them to operate in the physical world. She highlights the limitations in the "industrial" approaches used thus far and suggests new ways of conceptualizing robots that draw on insights from biological systems. In robotics, as in AI more generally, there has always been a tension as to whether to copy or simply draw inspiration from how humans and other biological organisms achieve intelligent behavior. Elsewhere, AI researcher Demis Hassabis and colleagues have explored how neuroscience and AI learn from and inspire each other, although so far more in one than the other, as and have the success of the current approaches to AI, there are still many shortcomings and as well as problems in It is useful to on one such as when AI does not as or or or that can to or when it on or information about the world, or when it has such as of all of which can to a of public shortcomings have captured the attention of the wider public and as well as among there is an on AI and In recent years, there has been a of to principles and approaches to AI, as well as involving and such as the on AI, that to best important has been the of with to and - in the and developing AI in both and as has been well in recent This is an important in its own but also with to the of the resulting AI and, in its intersections with more the other there are limitations and problems associated with the that AI is not capable of if could to more more or more general AI. In their Turing deep learning and Geoffrey took of where deep learning and highlighted its current such as the with In the case of natural language processing, Manning and Choi the challenges in and despite the of large language Elsewhere, and have the notion that large language models do anything learning, or In & of in a and discuss the problems in systems, the as how to reason about other their systems, and well as challenges in both and especially when the include both humans and Elsewhere, and others a useful of the problems in there is a growing among many that we do not have for the of AI systems, especially as they become more capable and the of use although AI and its related techniques are to be powerful tools for research in science, as examples in this volume and recent examples in which AI not only help results but also by design and become what some have AI to science and and to and challenges for the possibility that more powerful AI could to new in science, as well as progress in some of challenges and has long been a key for many at the frontier of AI research to more capable the of each of AI, the of more general problems that to the possibility of more capable AI learning, reasoning, of and and of these and other problems that could to more capable systems the of whether current characterized by deep learning, the of and and more foundational and and reinforcement or whether different approaches are in such as cognitive agent approaches or or based on logic and probability theory, to name a few. whether and what of approaches be the AI is but many the current along with of and learning architectures have to their about the of the current approaches is associated with the of whether artificial general intelligence can be and if how and Artificial general intelligence is in to what is called that AI and for tasks and goals, such as The development of on the other aims for more powerful AI - at as powerful as is generally to problem or and, in some the capacity to and improve as well as set and its own and the of and when will be is a for most that its achievement have and as is often in and such as A through and The to Ex and it is or there is growing among many at the frontier of AI research that we for the possibility of powerful with to and and with humans, its and use, and the possibility that of could and that we these into how we approach the development of of the research and development, and in AI is of the AI and in its what Nigel Shadbolt the of AI. This is given the for useful and applications and the for in sectors of the However, a few have made the development of their the most of these are and each of which has demonstrated results of increasing still a long way from the most discussed impact of AI and automation is on and the future of This is not In in the of the excitement about AI and and concerns about their impact on a on and the was that such technologies were important for growth and and "the that but not Most recent of this including those I have been involved have and that over time, more are than are that it is the and the and the of will the In their essay AI & and John discuss these for work and further, in & the of & to discuss the with to and and as well as the opportunities that are especially in developing In "The Turing The & of Artificial Intelligence," discusses how the use of human benchmarks in the development of AI the of AI that rather than human He that the AI's development will take in this and resulting for will on the for companies, and a that the that more will be than too much from of the and does not far enough into the future and at what AI will be capable The for AI could from of that in the is and labor and ability to are and and until automation has mostly physical and but that AI will be on more cognitive and tasks based on and, if early examples are even tasks are not of the In other are now in the world machines that that learn and that their ability to do these is to a range of problems they can will be with the range to which the human has been This was and Allen Newell in that this time could be different usually two that new labor will in which will by other humans for their own even when machines may be capable of these as well as or even better than The other is that AI will create so much and all without the for human and the of will be to for when that will the that once the first time since his creation will be with his his to use his from how to the which science and interest will have for to live and and However, most researchers that we are not to a future in which the of will and that until then, there are other and that be in the labor now and in the such as and other and how humans work increasingly capable that and John and discuss in this are not the only of the by AI. Russell a of the potentially from artificial general intelligence, once a of or ten But even we to general-purpose AI, the opportunities for companies and, for the and growth as well as from AI and its related technologies are more than to pursuit and by companies and in the development, and use of AI. At the many the is it is generally that is a in AI, as by its growth in AI research, and as highlighted in several will have for companies and given the of such technologies as discussed by and others the may in the way of approaches to AI and (such as whether they are companies or as and have have the to to in AI. The role of AI in intelligence, systems, autonomous even and other of increasingly In &
- Research Article
- 10.1177/2473011425s00142
- Oct 1, 2025
- Foot & Ankle Orthopaedics
Research Type: Level 3 - Retrospective cohort study, Case-control study, Meta-analysis of Level 3 studies Introduction/Purpose: Identifying and tracking surgical complications is a critical component of maintaining quality registries and improving clinical care. Medical record review to record complications is often performed by a dedicated clinical team and is time-consuming and expensive. In recent years, Large Language Models (LLMs) have emerged as a promising Artificial Intelligence (AI) tool to more efficiently and accurately retrieve clinical information from patient records. However, early literature has shown that without careful prompt design, LLMs are vulnerable to error. The primary purpose of this study is to determine if an LLM platform, compared to traditional clinical chart reviewers, can be used reliably and automatically screen for complications directly from medical notes for patients who underwent total ankle arthroplasty (TAA). Methods: Following IRB approval, patient records were retrospectively identified from an institutional TAA registry with surgeries performed from 2015 to 2024. Patients were manually evaluated by the research team for intraoperative fracture (IOF), deep vein thrombosis (DVT), superficial wound infection (SWI), and deep wound infection (DWI). Patient records were then scrubbed of HIPAA-identifiers, age, and gender, and input into an LLM for analysis. An automated script was developed that assessed each patient visit note for complications and recorded the result in a table format with “yes” or “no” if a complication occurred. Disagreements between reviewer and LLM complications were secondarily reviewed by a blinded investigator to produce a final, gold standard data set. The sensitivity and specificity of both reviewer and LLM chart review are compared. Statistical difference between the groups is determined using a McNemar test and similarity of decisions was evaluated using an Intraclass Correlation Coefficient (ICC). Results: A total of 1952 notes were reviewed for 310 TAA procedures. The final rate of IOF, DVT, SWI, and DWI was found to be 5.2%, 0.0%, 4.2%, and 0.6%, respectively. Chart reviewers had high agreement with the LLM in evaluated DVT (100% match, ICC 1.00), and DWI (99.7% match, ICC 0.88), but significant disagreement in rate of IOF (97.1% match, ICC 0.77, p = 0.008) and SWI (91.9% match, ICC 0.49, p = 0.05). After secondary review by a blinded author, the LLM was found to have a higher sensitivity compared to reviewers for SWI (0.85 vs. 0.69, respectively) and IOF (1.00 vs. 0.50, respectively). However, reviewers had a higher specificity than the LLM for SWI (0.98 vs. 0.95, respectively). Conclusion: Our results support that current LLMs can be applied to screen free text medical records for complications at a sensitivity and specificity comparable to clinical chart reviewers. The LLM is more prone to both true and false positives, indicated by both a higher sensitivity and lower specificity. Importantly, this assessment of complications using the LLM was completely automated and did not require human intervention to run, allowing substantially higher efficiency. LLMs show promise to dramatically scale the size and reliability of clinical outcome and quality registries in coming years. Further refinement of the LLM script may improve accuracy. Comparing identified complications following total ankle arthroplasty between manual chart review and a large language model Rate of intraoperative fracture (IOP), deep vein thrombosis (DVT), superficial wound infection (SWI) and deep wound infection (DWI) identified by manual chart reviewers and a large language model (LLM). Sensitivity and specificity for each are provided relative to a verified gold-standard, a fellowship trained blind investigator.
- Abstract
3
- 10.1182/blood-2023-185854
- Nov 2, 2023
- Blood
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
- Research Article
60
- 10.2196/56764
- Apr 25, 2024
- Journal of Medical Internet Research
As the health care industry increasingly embraces large language models (LLMs), understanding the consequence of this integration becomes crucial for maximizing benefits while mitigating potential pitfalls. This paper explores the evolving relationship among clinician trust in LLMs, the transition of data sources from predominantly human-generated to artificial intelligence (AI)–generated content, and the subsequent impact on the performance of LLMs and clinician competence. One of the primary concerns identified in this paper is the LLMs’ self-referential learning loops, where AI-generated content feeds into the learning algorithms, threatening the diversity of the data pool, potentially entrenching biases, and reducing the efficacy of LLMs. While theoretical at this stage, this feedback loop poses a significant challenge as the integration of LLMs in health care deepens, emphasizing the need for proactive dialogue and strategic measures to ensure the safe and effective use of LLM technology. Another key takeaway from our investigation is the role of user expertise and the necessity for a discerning approach to trusting and validating LLM outputs. The paper highlights how expert users, particularly clinicians, can leverage LLMs to enhance productivity by off-loading routine tasks while maintaining a critical oversight to identify and correct potential inaccuracies in AI-generated content. This balance of trust and skepticism is vital for ensuring that LLMs augment rather than undermine the quality of patient care. We also discuss the risks associated with the deskilling of health care professionals. Frequent reliance on LLMs for critical tasks could result in a decline in health care providers’ diagnostic and thinking skills, particularly affecting the training and development of future professionals. The legal and ethical considerations surrounding the deployment of LLMs in health care are also examined. We discuss the medicolegal challenges, including liability in cases of erroneous diagnoses or treatment advice generated by LLMs. The paper references recent legislative efforts, such as The Algorithmic Accountability Act of 2023, as crucial steps toward establishing a framework for the ethical and responsible use of AI-based technologies in health care. In conclusion, this paper advocates for a strategic approach to integrating LLMs into health care. By emphasizing the importance of maintaining clinician expertise, fostering critical engagement with LLM outputs, and navigating the legal and ethical landscape, we can ensure that LLMs serve as valuable tools in enhancing patient care and supporting health care professionals. This approach addresses the immediate challenges posed by integrating LLMs and sets a foundation for their maintainable and responsible use in the future.
- Abstract
- 10.1182/blood-2024-208513
- Nov 5, 2024
- Blood
Evaluating the Accuracy of Artificial Intelligence(AI)-Generated Synopses for Plasma Cell Disorder Treatment Regimens
- Research Article
13
- 10.1016/j.jclinepi.2025.111746
- May 1, 2025
- Journal of clinical epidemiology
Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct attracted attention. We aimed at providing an overview of LLM applications in SR conduct in health research. We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that has not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, 1 reviewer extracted data, checked by another. Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The mostly recurring LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). In half of the studies, authors evaluated LLM use as promising (n = 20, 54%), one-quarter as neutral (n = 9, 24%) and one-fifth as nonpromising (n = 8, 22%). Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance. Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and finally found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in this context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the increasing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.
- Research Article
4
- 10.3205/zma001702
- Jan 1, 2024
- GMS journal for medical education
The high performance of generative artificial intelligence (AI) and large language models (LLM) in examination contexts has triggered an intense debate about their applications, effects and risks. What legal aspects need to be considered when using LLM in teaching and assessment? What possibilities do language models offer? Statutes and laws are used to assess the use of LLM: - University statutes, state higher education laws, licensing regulations for doctors - Copyright Act (UrhG) - General Data Protection Regulation (DGPR) - AI Regulation (EU AI Act) LLM and AI offer opportunities but require clear university frameworks. These should define legitimate uses and areas where use is prohibited. Cheating and plagiarism violate good scientific practice and copyright laws. Cheating is difficult to detect. Plagiarism by AI is possible. Users of the products are responsible. LLM are effective tools for generating exam questions. Nevertheless, careful review is necessary as even apparently high-quality products may contain errors. However, the risk of copyright infringement with AI-generated exam questions is low, as copyright law allows up to 15% of protected works to be used for teaching and exams. The grading of exam content is subject to higher education laws and regulations and the GDPR. Exclusively computer-based assessment without human review is not permitted. For high-risk applications in education, the EU's AI Regulation will apply in the future. When dealing with LLM in assessments, evaluation criteria for existing assessments can be adapted, as can assessment programmes, e.g. to reduce the motivation to cheat. LLM can also become the subject of the examination themselves. Teachers should undergo further training in AI and consider LLM as an addition.
- Ask R Discovery
- Chat PDF
AI summaries and top papers from 250M+ research sources.