LLM “metalinguistic” capabilities are creating a stir – here's why

Abstract

The latest large language models can now tackle problems stepwise and complete tasks once thought to be a defining feature of human language – Anna Demming asks how, what's next, and why it matters

Similar Papers
  • Research Article
  • Citations: 6
  • 10.1609/aaai.v38i16.29799
Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • James R Kirk + 3 more

Large language models (LLMs) offer significant promise as a knowledge source for task learning. Prompt engineering has been shown to be effective for eliciting knowledge from an LLM, but alone it is insufficient for acquiring relevant, situationally grounded knowledge for an embodied agent learning novel tasks. We describe a cognitive-agent approach, STARS, that extends and complements prompt engineering, mitigating its limitations and thus enabling an agent to acquire new task knowledge matched to its native language capabilities, embodiment, environment, and user preferences. The STARS approach is to increase the response space of LLMs and deploy general strategies, embedded within the autonomous agent, to evaluate, repair, and select among candidate responses produced by the LLM. We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77-94% task completion in one-shot learning without user oversight. The approach achieves 100% task completion when human oversight (such as an indication of preference) is provided. Further, the type of oversight largely shifts from explicit, natural language instruction to simple confirmation/disconfirmation of high-quality responses that have been vetted by the agent before presentation to a user.
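
As a rough illustration of the sample-evaluate-select idea this abstract describes (not the authors' STARS code, and omitting the repair step), the sketch below samples several candidate responses from a stand-in LLM, scores each against a toy grounding check, and keeps the best one; the function names, candidate strings, and object lists are invented.

```python
# Hypothetical sketch: broaden the LLM's response space, then let the agent
# vet candidates before (optionally) asking a user to confirm the top pick.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float = 0.0

def query_llm(prompt: str, n: int) -> list[str]:
    # Stand-in for sampling n completions from a real LLM.
    samples = ["put the mug on the shelf", "stack the books on the desk", "call a cleaning robot"]
    return (samples * n)[:n]

def evaluate(text: str, known_objects: set[str]) -> float:
    # Toy grounding check: prefer responses that mention objects the embodied
    # agent can actually perceive in its environment.
    words = set(text.lower().split())
    return len(words & known_objects) / max(len(words), 1)

def select_task_knowledge(prompt: str, known_objects: set[str], n: int = 6) -> Candidate:
    candidates = [Candidate(t, evaluate(t, known_objects)) for t in query_llm(prompt, n)]
    return max(candidates, key=lambda c: c.score)  # a user may simply confirm/disconfirm this

if __name__ == "__main__":
    print(select_task_knowledge("how do I tidy the desk?", {"mug", "desk", "books", "shelf"}))
```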

  • Research Article
  • Citations: 5
  • 10.1073/pnas.2413443122
Scaling language model size yields diminishing returns for single-message political persuasion
  • Mar 7, 2025
  • Proceedings of the National Academy of Sciences
  • Kobi Hackenburg + 5 more

Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 US political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence that model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are only slightly more persuasive than models smaller in size by an order of magnitude or more. Second, we find that the association between language model size and persuasiveness shrinks toward zero and is no longer statistically significant once we adjust for mere task completion (coherence, staying on topic), a pattern that highlights task completion as a potential mediator of larger models' persuasive advantage. Given that current frontier models are already at ceiling on this task completion metric in our setting, taken together, our results suggest that further scaling model size may not much increase the persuasiveness of static LLM-generated political messages.
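
To illustrate the adjustment step the abstract describes, the hedged sketch below regresses a synthetic persuasiveness score on log model size with and without a task-completion covariate; the data, variable names, and effect sizes are placeholders, not the study's.

```python
# Sketch of mediation-style adjustment: if size influences persuasion only via
# task completion, the size coefficient should shrink once completion is added.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
log_size = rng.uniform(8, 12, 200)                                  # log parameter count (synthetic)
completion = np.clip(0.2 + 0.05 * log_size + rng.normal(0, 0.05, 200), 0, 1)
persuasion = 2.0 * completion + rng.normal(0, 0.2, 200)             # size acts only via completion

unadjusted = sm.OLS(persuasion, sm.add_constant(log_size)).fit()
adjusted = sm.OLS(persuasion, sm.add_constant(np.column_stack([log_size, completion]))).fit()
print("size coefficient, unadjusted:", round(unadjusted.params[1], 3))
print("size coefficient, adjusted:  ", round(adjusted.params[1], 3))
```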

  • Research Article
  • Citations: 16
  • 10.1016/j.ajhg.2024.08.010
Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease
  • Sep 9, 2024
  • The American Journal of Human Genetics
  • Junyoung Kim + 3 more

Phenotype-driven gene prioritization is fundamental to diagnosing rare genetic disorders. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models (LLMs) promise a streamlined text-to-gene solution. In this study, we evaluated five LLMs, including two generative pre-trained transformers (GPT) series and three Llama2 series, assessing their performance across task completeness, gene prediction accuracy, and adherence to required output structures. We conducted experiments, exploring various combinations of models, prompts, phenotypic input types, and task difficulty levels. Our findings revealed that the best-performing LLM, GPT-4, achieved an average accuracy of 17.0% in identifying diagnosed genes within the top 50 predictions, which still falls behind traditional tools. However, accuracy increased with the model size. Consistent results were observed over time, as shown in the dataset curated after 2023. Advanced techniques such as retrieval-augmented generation (RAG) and few-shot learning did not improve the accuracy. Sophisticated prompts were more likely to enhance task completeness, especially in smaller models. Conversely, complicated prompts tended to decrease the output structure compliance rate. LLMs also achieved better-than-random prediction accuracy with free-text input, though performance was slightly lower than with standardized concept input. Bias analysis showed that highly cited genes, such as BRCA1, TP53, and PTEN, are more likely to be predicted. Our study provides valuable insights into integrating LLMs with genomic analysis, contributing to the ongoing discussion on their utilization in clinical workflows.
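
The headline metric here is a top-k hit rate (the diagnosed gene appearing in the model's top 50 predictions); a minimal sketch of that computation, with invented gene rankings and ground truth, might look like this:

```python
# Toy top-k accuracy: fraction of cases whose diagnosed gene appears among the
# model's top k ranked predictions. Gene symbols and rankings are made up.
def top_k_accuracy(predictions: dict[str, list[str]], truth: dict[str, str], k: int = 50) -> float:
    hits = sum(1 for case, ranked in predictions.items() if truth[case] in ranked[:k])
    return hits / len(predictions)

example_predictions = {"case1": ["BRCA1", "TP53", "PTEN"], "case2": ["PTEN", "MLH1"]}
example_truth = {"case1": "TP53", "case2": "BRCA2"}
print(top_k_accuracy(example_predictions, example_truth, k=50))  # 0.5
```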

  • Research Article
  • 10.57028/c58-101-z1096
Evidence from Large Language Models, How AI Vindicates Classical Theories of Meaning: For the Semantics and Pragmatics Distinction; Classical Theories of Grammar: For the Syntax-Semantics Interface; The Alignment of Grammar and Logic: For the Unity of Form.
  • Dec 1, 2025
  • Communication & Cognition
  • J.-M Kuczynski

In section 1 this paper argues that the demonstrated capabilities of large language models (LLMs) provide surprising empirical support for classical theories of meaning, particularly the distinction between semantics and pragmatics and the reality of compositional literal meaning (Partee, 2018). While LLMs employ connectionist architectures rather than classical computational ones, their ability to systematically process novel sentences and distinguish between literal and contextual meaning suggests that key insights of classical semantic theory capture genuine features of linguistic understanding, even if the underlying mechanisms differ from those traditionally posited (Bommasani et al., 2021). In section 2, this paper argues that the demonstrated capabilities of large language models (LLMs) provide surprising empirical support for classical theories of grammar, particularly regarding the relationship between syntax and semantics (Manning et al., 2022). While LLMs employ connectionist architectures rather than classical computational ones, their ability to process structural relationships independently of meaning while maintaining systematic syntax-semantics mappings suggests that key insights of classical grammatical theory capture genuine features of language, even if the underlying mechanisms differ from those traditionally posited (Linzen & Baroni, 2021). In section 3, this paper argues that the demonstrated capabilities of large language models (LLMs) provide surprising empirical support for the alignment of grammatical and logical form (Chowdhury & Linzen, 2021). While philosophers have traditionally posited a divergence between grammatical and logical structure, LLMs' ability to make correct inferences without FOL-style logical forms suggests that grammatical structure itself guides valid reasoning (Manning et al., 2022). This indicates that the perceived misalignment between grammatical and logical form may be an artifact of our chosen formal systems rather than a feature of language itself.

  • Research Article
  • 10.3138/calico-2025-0035
Toward Adaptive Spoken Dialogue Systems for Language Learning: Predicting Task Completion from Learning Process Data
  • Oct 1, 2025
  • CALICO Journal
  • Frederik Cornillie + 4 more

Within the space of dialogue systems for language learning, the rapid advance of conversational artificial intelligence, powered among others by large language models, is currently driving innovations that enable learners of a second or foreign language (L2) to practice dialogic interaction at their own pace, including through functional tasks. Such language practice is particularly relevant when learners have limited opportunities to interact with more proficient speakers of that L2. However, to ensure a meaningful contribution of dialogue systems to L2 development, technology-mediated practice of dialogic interaction needs to be adapted to the needs and proficiency of learners. This requires accurate and transparent assessment of L2 performance that is both driven by theories about L2 acquisition and practically feasible with state-of-the-art technologies. The end goal is that task-based dialogue practice can be scaffolded through individualized feedback and other forms of learning support. This study models task performance in Language Hero, a game-based spoken dialogue system designed for Dutch-speaking learners who want to practice French as an L2. Using explanatory item response analysis, we explored to what extent the completion of functional spoken tasks can be predicted from learning process data, more specifically from fully-automated measures of L2 task performance. Data were drawn from 263 participants who completed a total of 739 tasks in the system, comprising 22,074 spoken responses. The results indicate that 12 fully-automated measures of previous task performance, including complexity, accuracy, fluency, and functional adequacy, as well as two measures of hint use, significantly predicted future task completion. A multilevel model with fixed and random effects accounted for 45% of the variance in task completion. This study demonstrates the potential of data-driven learner models for micro-adaptivity in dialogic technology-mediated practice while simultaneously highlighting the need to include complementary predictors as well as human evaluation.
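
The study fits a multilevel model with fixed and random effects; as a much-simplified stand-in, the sketch below fits a plain logistic regression predicting task completion from invented prior-performance measures (no random effects, synthetic data).

```python
# Simplified stand-in for the prediction task: binary task completion modelled
# from prior-performance measures. Feature values and columns are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: previous-task accuracy, previous-task fluency, hints used
X = np.array([
    [0.80, 0.70, 1],
    [0.40, 0.50, 3],
    [0.90, 0.80, 0],
    [0.55, 0.45, 2],
    [0.30, 0.20, 4],
    [0.85, 0.90, 0],
])
y = np.array([1, 0, 1, 1, 0, 0])  # 1 = task completed

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_)
print("P(completion) for a strong prior performer:", clf.predict_proba([[0.9, 0.85, 0]])[0, 1])
```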

  • Research Article
  • 10.54254/2753-7064/42/20242585
Exploring the Impact of Different Textual Language Features on Large Language Models' Detection of Fake News
  • Nov 15, 2024
  • Communications in Humanities Research
  • Huilin Ouyang + 1 more

Abstract: With the proliferation of social media and online platforms, fake news has become increasingly rampant. This study explores the impact of different textual language features on large language models (such as ChatGPT) in detecting fake news. By extracting extreme emotional vocabulary and exaggerated syntactic words commonly found in fake news and calculating their TF-IDF values, this study analyzes their influence on large language models' ability to assess the veracity of news. The study found that the frequency of extreme emotional words is higher than that of exaggerated syntactic words and has a more significant impact on fake news detection by large language models. Furthermore, this study suggests that by carefully selecting and adjusting language features, the accuracy and stability of fake news detection can be improved, providing new insights for optimizing automated detection systems. These findings provide important references for improving the technology of automatic fake news detection, contributing to the construction of a safer and more reliable online environment.
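
A minimal sketch of the feature extraction described above, assuming a hand-picked lexicon of extreme-emotion and exaggeration words; the word lists and corpus below are invented examples, not the study's data.

```python
# Compute TF-IDF values restricted to a small lexicon of emotional/exaggerated
# words across a toy news corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

emotion_words = ["shocking", "outrageous", "terrifying"]
exaggeration_words = ["unprecedented", "massive", "countless"]
lexicon = emotion_words + exaggeration_words

corpus = [
    "Shocking and outrageous claims spread across countless platforms.",
    "The committee released its annual budget report on Tuesday.",
]

vectorizer = TfidfVectorizer(vocabulary=lexicon, lowercase=True)
tfidf = vectorizer.fit_transform(corpus)
print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0])))
```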

  • Conference Article
  • Citations: 3
  • 10.24963/ijcai.2024/711
ScreenAgent: A Vision Language Model-driven Computer Control Agent
  • Aug 1, 2024
  • Runliang Niu + 8 more

Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various daily digital works. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent.
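
A hypothetical sketch of a planning-acting-reflecting loop in the spirit of the pipeline described above; the VLM call, screenshot capture, and action execution are stubs, not the ScreenAgent implementation.

```python
# Plan once, then repeat: observe the screen, ask the model for an action,
# execute it, and reflect on whether the goal has been reached.
def capture_screenshot() -> bytes:
    return b"<png bytes>"  # stub: grab the current screen

def vlm(prompt: str, image: bytes) -> str:
    return "CLICK 120 340"  # stub: the model proposes a GUI action or a verdict

def execute(action: str) -> None:
    print("executing:", action)  # stub: send mouse/keyboard events

def run_task(goal: str, max_steps: int = 5) -> None:
    plan = vlm(f"Plan sub-steps for: {goal}", capture_screenshot())
    for _ in range(max_steps):
        action = vlm(f"Goal: {goal}\nPlan: {plan}\nNext action?", capture_screenshot())
        execute(action)
        verdict = vlm(f"Did that action complete the goal '{goal}'?", capture_screenshot())
        if "done" in verdict.lower():
            break

run_task("open the settings panel and enable dark mode")
```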

  • Conference Article
  • Citations: 30
  • 10.1109/iccv.2019.00757
Language Features Matter: Effective Language Representations for Vision-Language Tasks
  • Oct 1, 2019
  • Andrea Burns + 4 more

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We conclude that language features deserve more attention, a conclusion informed by experiments comparing different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results: an average embedding language model outperforms an LSTM on retrieval-style tasks, and state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we can propose a set of best practices for incorporating the language component of vision-language tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.

  • Research Article
  • Citations: 1
  • 10.1503/jpn.230026
Characterizing and detecting delirium with clinical and computational measures of speech and language disturbance.
  • Jul 4, 2023
  • Journal of psychiatry & neuroscience : JPN
  • Sunny X Tang + 9 more

Delirium is a critically underdiagnosed syndrome of altered mental status affecting more than 50% of older adults admitted to hospital. Few studies have incorporated speech and language disturbance in delirium detection. We sought to describe speech and language disturbances in delirium, and provide a proof of concept for detecting delirium using computational speech and language features. Participants underwent delirium assessment and completed language tasks. Speech and language disturbances were rated using standardized clinical scales. Recordings and transcripts were processed using an automated pipeline to extract acoustic and textual features. We used binomial elastic net machine learning models to predict delirium status. We included 33 older adults admitted to hospital, of whom 10 met criteria for delirium. The group with delirium scored higher on total language disturbances and incoherence, and lower on category fluency. Both groups scored lower on category fluency than the normative population. Cognitive dysfunction as a continuous measure was correlated with higher total language disturbance, incoherence, loss of goal and lower category fluency. Including computational language features in the model predicting delirium status increased accuracy to 78%. This was a proof-of-concept study with limited sample size, without a set-aside cross-validation sample. Subsequent studies are needed before establishing a generalizable model for detecting delirium. Language impairments were elevated among patients with delirium and may also be used to identify subthreshold cognitive disturbances. Computational speech and language features are promising as accurate, noninvasive and efficient biomarkers of delirium.
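
A hedged sketch of the modelling step described above: a binomial (logistic) elastic net predicting delirium status from speech and language features. The feature matrix below is synthetic, not study data.

```python
# Binomial elastic net via scikit-learn's LogisticRegression (saga solver).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 4))  # e.g. incoherence, category fluency, loss of goal, pause rate
y = (X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=33) > 0).astype(int)  # 1 = delirium (synthetic)

clf = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("coefficients:", clf.coef_)
```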

  • Book Chapter
  • Citations: 2
  • 10.3233/faia240787
Large Language Model Agentic Approach to Fact Checking and Fake News Detection
  • Oct 16, 2024
  • Xinyi Li + 2 more

In the current digital era, the rapid spread of misinformation on online platforms presents significant challenges to societal well-being, public trust, and democratic processes, influencing critical decision making and public opinion. To address these challenges, there is a growing need for automated fake news detection mechanisms. Pre-trained large language models (LLMs) have demonstrated exceptional capabilities across various natural language processing (NLP) tasks, prompting exploration into their potential for verifying news claims. Instead of employing LLMs in a non-agentic way, where LLMs generate responses based on direct prompts in a single shot, our work introduces FactAgent, an agentic approach of utilizing LLMs for fake news detection. FactAgent enables LLMs to emulate human expert behavior in verifying news claims without any model training, following a structured workflow. This workflow breaks down the complex task of news veracity checking into multiple sub-steps, where LLMs complete simple tasks using their internal knowledge or external tools. At the final step of the workflow, LLMs integrate all findings throughout the workflow to determine the news claim’s veracity. Compared to manual human verification, FactAgent offers enhanced efficiency. Experimental studies demonstrate the effectiveness of FactAgent in verifying claims without the need for any training process. Moreover, FactAgent provides transparent explanations at each step of the workflow and during final decision-making, offering insights into the reasoning process of fake news detection for end users. FactAgent is highly adaptable, allowing for straightforward updates to its tools that LLMs can leverage within the workflow, as well as updates to the workflow itself using domain knowledge. This adaptability enables FactAgent’s application to news verification across various domains.
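
A rough sketch (not the authors' FactAgent code) of an agentic workflow that decomposes veracity checking into sub-steps and then aggregates the findings; the llm() function and the sub-step prompts are stand-ins.

```python
# Decompose the veracity check into named sub-steps, collect the model's
# judgement on each, then aggregate into a final verdict.
def llm(prompt: str) -> str:
    return "suspicious"  # stub: a real call would return the model's judgement

def check_claim(claim: str) -> str:
    steps = {
        "language": f"Does the following headline use sensational language? {claim}",
        "commonsense": f"Is this claim consistent with common knowledge? {claim}",
        "source": f"Would a reputable outlet publish this claim as stated? {claim}",
    }
    findings = {name: llm(prompt) for name, prompt in steps.items()}
    return llm(f"Given these findings {findings}, is the claim real or fake?")

print(check_claim("Scientists confirm the moon is hollow."))
```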

  • Research Article
  • Citations: 2
  • 10.3897/biss.7.112926
Using ChatGPT with Confidence for Biodiversity-Related Information Tasks
  • Sep 19, 2023
  • Biodiversity Information Science and Standards
  • Michael Elliott + 1 more

Recent advancements in conversational Artificial Intelligence (AI), such as OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT), present the possibility of using large language models (LLMs) as tools for retrieving, analyzing, and transforming scientific information. We have found that ChatGPT (GPT 3.5) can provide accurate biodiversity knowledge in response to questions about species descriptions, occurrences, and taxonomy, as well as structure information according to data sharing standards such as Darwin Core. A rigorous evaluation of ChatGPT's capabilities in biodiversity-related tasks may help to inform viable use cases for today's LLMs in research and information workflows. In this work, we test the extent of ChatGPT's biodiversity knowledge, characterize its mistakes, and suggest how LLM-based systems might be designed to complete knowledge-based tasks with confidence. To test ChatGPT's biodiversity knowledge, we compiled a question-and-answer test set derived from Darwin Core records available in Integrated Digitized Biocollections (iDigBio). Each question focuses on one or more Darwin Core terms to test the model’s ability to recall species occurrence information and its understanding of the standard. The test set covers a range of locations, taxonomic groups, and both common and rare species (defined by the number of records in iDigBio). The results of the tests will be presented. We also tested ChatGPT on generative tasks, such as creating species occurrence maps. A visual comparison of the maps with iDigBio data shows that for some species, ChatGPT can generate fairly accurate representations of their geographic ranges (Fig. 1). ChatGPT's incorrect responses in our tests show several patterns of mistakes. First, responses can be self-conflicting. For example, when asked "Does Acer saccharum naturally occur in Benton, Oregon?", ChatGPT responded "YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon". ChatGPT can also be misled by semantics in species names. For Rafinesquia neomexicana, the word "neomexicana" leads ChatGPT to believe that the species primarily occurs in New Mexico, USA. ChatGPT may also confuse species, such as when attempting to describe a lesser-known species (e.g., a rare bee) within the same genus as a better-known species. Other causes of mistakes include hallucination (Ji et al. 2023), memorization (Chang and Bergen 2023), and user deception (Li et al. 2023). Some mistakes may be avoided by prompt engineering, e.g., few-shot prompting (Chang and Bergen 2023) and chain-of-thought prompting (Wei et al. 2022). These techniques assist Large Language Models (LLMs) by clarifying expectations or by guiding recollection. However, such methods cannot help when LLMs lack required knowledge. In these cases, alternative approaches are needed. A desired reliability can be theoretically guaranteed if responses that contain mistakes are discarded or corrected. This requires either detecting or predicting mistakes. Sometimes mistakes can be ruled out by verifying responses with a trusted source. For example, a trusted specimen record might be found that corroborates the response. The difficulty, however, is finding such records programmatically; e.g., using iDigBio and Global Biodiversity Information Facility's (GBIF) search Application Programming Interfaces (APIs) requires specifying indexed terms that might not appear in an LLM's response. This presents a secondary problem for which LLMs may be well suited.
Note that with presence-only data, it can be difficult to disprove presence claims or prove absence claims. Besides verification, mistakes may be predicted using probabilistic methods. Formulating mistake probabilities often relies on heuristics. For example, variability in a model’s responses to a repeated query can be a sign of hallucination (Manakul et al. 2023). In practice, both probabilistic and verification methods may be needed to reach a desired reliability. LLM outputs that can be verified may be directly accepted (or discarded), while others are judged by estimating mistake probabilities. We will consider a set of heuristics and verification methods, and report empirical assessments of their impact on ChatGPT’s reliability.
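
A minimal sketch of the variability heuristic mentioned above: sample the same question several times and treat disagreement among the answers as a warning sign of hallucination. The ask() function is a stub, not a real ChatGPT call.

```python
# Higher disagreement across repeated samples -> higher estimated mistake risk.
from collections import Counter

def ask(question: str) -> str:
    return "yes"  # stub: a real implementation would sample an answer from the LLM

def mistake_risk(question: str, n: int = 5) -> float:
    answers = [ask(question).strip().lower() for _ in range(n)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / n  # 0 = unanimous, close to 1 = highly inconsistent

print(mistake_risk("Does Acer saccharum naturally occur in Benton, Oregon?"))
```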

  • Research Article
  • 10.1609/aaai.v39i9.32974
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yunzhe Xu + 3 more

Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for route summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards applications of MLLMs in the field of embodied intelligence.

  • Research Article
  • Citations: 37
  • 10.1109/jbhi.2022.3163751
Vision-Language Transformer for Interpretable Pathology Visual Question Answering.
  • Apr 1, 2023
  • IEEE Journal of Biomedical and Health Informatics
  • Usman Naseem + 2 more

Pathology visual question answering (PathVQA) attempts to answer a medical question posed by pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interactions on both the image (vision) and question (language) to generate an answer. Existing methods treated vision and language features independently and were therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offered no way to interpret the retrieved answers, which remain obscure to humans; the models' ability to justify their answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (images) and language (questions) features for an interpretable PathVQA. We present an interpretable transformer-based Path-VQA (TraP-VQA), where we embed transformers' encoder layers with vision and language features extracted using a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer is then embedded to upsample the encoded features for the final prediction for PathVQA. Our experiments showed that our TraP-VQA outperformed the state-of-the-art comparative methods on the public PathVQA dataset. Our experiments validated the robustness of our model on another medical VQA dataset, and the ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present the visualization results of both text and images, which explain the reason for a retrieved answer in PathVQA.
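
A toy sketch of the fusion idea described above, jointly encoding image and question features with a transformer encoder before a small prediction head; the dimensions, modules, and random inputs are illustrative, not the TraP-VQA architecture.

```python
# Concatenate vision and language features along the sequence dimension and
# pass them through a transformer encoder, then predict over candidate answers.
import torch
import torch.nn as nn

d_model = 256
vision_feats = torch.randn(1, 49, d_model)    # e.g. a 7x7 CNN feature map, flattened
language_feats = torch.randn(1, 20, d_model)  # e.g. token embeddings of the question

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
fused = encoder(torch.cat([vision_feats, language_feats], dim=1))  # (1, 69, 256)
answer_logits = nn.Linear(d_model, 10)(fused.mean(dim=1))          # 10 candidate answers
print(answer_logits.shape)  # torch.Size([1, 10])
```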

  • Research Article
  • Citations: 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • 10.1038/s41467-025-67145-1
Linguistic features of AI mis/disinformation and the detection limits of LLMs
  • Dec 11, 2025
  • Nature Communications
  • Yulong Ma + 5 more

The persuasive capability of large language models (LLMs) in generating mis/disinformation is widely recognized, but the linguistic ambiguity of such content and inconsistent findings on LLM-based detection reveal unresolved risks in information governance. To address the lack of Chinese datasets, this study compiles two datasets of Chinese AI mis/disinformation generated by multi-lingual models involving deepfakes and cheapfakes. Through psycholinguistic and computational linguistic analyses, the quality modulation effects of eight language features (including sentiment, cognition, and personal concerns), along with toxicity scores and syntactic dependency distance differences, were discovered. Furthermore, key factors influencing zero-shot LLMs in comprehending and detecting AI mis/disinformation are examined. The results show that although implicit linguistic distinctions exist, the intrinsic detection capability of LLMs remains limited. Meanwhile, the quality modulation effects of AI mis/disinformation linguistic features may lead to the failure of AI mis/disinformation detectors. These findings highlight the major challenges of applying LLMs in information governance.
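
One of the features mentioned above, syntactic dependency distance, can be computed as the mean token distance between each word and its syntactic head; below is a minimal sketch using a hand-written toy parse rather than a real parser.

```python
# Mean syntactic dependency distance over a single sentence.
def mean_dependency_distance(heads: list[int]) -> float:
    # heads[i] is the 0-based index of token i's head; the root points to itself.
    distances = [abs(i - h) for i, h in enumerate(heads) if h != i]
    return sum(distances) / len(distances)

# "The model generates misleading text": root = "generates" (index 2)
tokens = ["The", "model", "generates", "misleading", "text"]
heads = [1, 2, 2, 4, 2]
print(mean_dependency_distance(heads))  # (1 + 1 + 1 + 2) / 4 = 1.25
```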
