EditCoT: A Stepwise Chain-of-Thought Reasoning Framework for Multi-Intent Text Revision

Abstract

Text revision is necessary to bring written text into line with human-acceptable requirements. Multi-intent text revision, however, requires all potential textual defects to be addressed by the same computational model, which poses a new challenge to traditional single-intent text revision approaches. Conventional approaches often rely on models tailored to specific edit intents, limiting their ability to address diverse or unseen intents. Inspired by the reasoning strengths of Large Language Models (LLMs), we introduce EditCoT, a novel framework for multi-intent text revision. EditCoT breaks down the revision process into sequential reasoning steps, each targeting a specific text defect. This structured approach enhances LLMs’ editing capabilities by enabling precise, intent-specific revisions within a unified model. We evaluate EditCoT on multi- and single-intent text revision tasks. For multi-intent tasks, EditCoT achieves state-of-the-art results, with a SARI score of 65.80 and a BERTScore of 88.27. For single-intent tasks, EditCoT paired with GPT-o1 performs competitively with specifically fine-tuned models. Furthermore, when combined with GPT-o1 or DeepSeek, EditCoT demonstrates impressive transferability to new edit intents via custom edit-chains. Overall, this study offers an effective framework for modeling and resolving text editing tasks, contributing a multi-intent dataset and an augmented single-intent dataset to support the community in advancing text revision research.
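The stepwise idea in the abstract (one reasoning step per text defect) can be sketched as a chain of single-intent passes over the text. The intent labels and the toy `apply_intent` rules below are illustrative assumptions standing in for LLM calls, not EditCoT's actual prompts:

```python
# Sketch of a stepwise edit-chain: each step targets one edit intent.
# `apply_intent` is a stand-in for an LLM revision call; the intents and
# the toy rewrite rules are illustrative assumptions only.

EDIT_CHAIN = ["fluency", "clarity", "coherence"]  # hypothetical intent order

def apply_intent(text: str, intent: str) -> str:
    """Toy single-intent reviser; a real system would prompt an LLM here."""
    if intent == "fluency":
        return text.replace("teh", "the")      # fix a spelling defect
    if intent == "clarity":
        return text.replace("utilize", "use")  # prefer plain wording
    return text                                # coherence: no-op in this toy

def edit_cot(text: str, chain=EDIT_CHAIN) -> str:
    # Revise sequentially, one defect class per step, within a single model.
    for intent in chain:
        text = apply_intent(text, intent)
    return text

print(edit_cot("We utilize teh model."))  # -> We use the model.
```

A custom edit-chain for a new intent would, under this sketch, just be a different list of intents passed as `chain`.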

Similar Papers
  • Research Article
  • Cited by 201
  • 10.1016/j.caeai.2023.100199
Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions
  • Dec 29, 2023
  • Computers and Education: Artificial Intelligence
  • Jennifer Meyer + 6 more

Writing proficiency is an essential skill for upper secondary students that can be enhanced through effective feedback. Creating feedback on writing tasks, however, is time-intensive and presents a challenge for educators, often resulting in students receiving insufficient or no feedback. The advent of text-generating large language models (LLMs) offers a promising solution, namely, automated evidence-based feedback generation. Yet, empirical evidence from randomized controlled studies about the effectiveness of LLM-generated feedback is missing. To address this issue, the current study compared the effectiveness of LLM-generated feedback to no feedback. A sample of N = 459 upper secondary students of English as a foreign language wrote an argumentative essay. Students in the experimental group were asked to revise their text according to feedback that was generated using the LLM GPT-3.5-turbo. The control group revised their essays without receiving feedback. We assessed improvement in the revision using automated essay scoring. The results showed that LLM-generated feedback increased revision performance (d = 0.19) and task motivation (d = 0.36). Moreover, it increased positive emotions (d = 0.34) compared to revising without feedback. The findings highlight that using LLMs makes it possible to create timely feedback that can positively relate to students’ cognitive and affective-motivational outcomes. Future perspectives and the implications for research and practice of using LLM-generated feedback in intelligent tutoring systems are discussed.

  • Research Article
  • 10.5617/dhnbpub.12916
Enriching Cultural Heritage Knowledge Graph Metadata from Finnish Texts with Large Language Models
  • Feb 5, 2026
  • Digital Humanities in the Nordic and Baltic Countries Publications
  • Rafael Leal + 2 more

This paper introduces the Finnish Named Entity Linker (FINEL), a tool that leverages Deep Learning models, including Large Language Models (LLMs), to recognize, disambiguate, and link Named Entities in Cultural Heritage texts. FINEL is designed to enhance the metadata of textual documents by connecting them to Knowledge Graphs (KG). We propose a zero-shot classification method that resembles Retrieval-Augmented Generation (RAG) and discuss a prototype web service with a user interface that enables human intervention for final disambiguation decisions. This editing capability is crucial, particularly when automatic linking may be hindered by errors and hallucinations inherent in LLM-based tools. The paper also reflects on lessons learned from using FINEL in applications targeting Digital Humanities (DH) research. Since the focus is on Finnish texts, our methods accommodate the specific challenges posed by this highly inflectional language and the available processing resources. Preliminary evaluation results underscore the potential of FINEL: our named entity lemmatizer achieved an accuracy of 96.5% on the test dataset, while an LLM from the Llama family reached 97% accuracy for entities with only one candidate. However, accuracy decreased with each additional candidate.

  • Research Article
  • Cited by 11
  • 10.1002/adma.202502771
Empowering Generalist Material Intelligence with Large Language Models.
  • May 12, 2025
  • Advanced materials (Deerfield Beach, Fla.)
  • Wenhao Yuan + 3 more

Large language models (LLMs) are steering the development of generalist materials intelligence (GMI), a unified framework integrating conceptual reasoning, computational modeling, and experimental validation. Central to this framework is the agent-in-the-loop paradigm, where LLM-based agents function as dynamic orchestrators, synthesizing multimodal knowledge, specialized models, and experimental robotics to enable fully autonomous discovery. Drawing from a comprehensive review of LLMs' transformative impact across representative applications in materials science, including data extraction, property prediction, structure generation, synthesis planning, and self-driven labs, this study underscores how LLMs are revolutionizing traditional tasks, catalyzing the agent-in-the-loop paradigm, and bridging the ontology-concept-computation-experiment continuum. Then the unique challenges of scaling up LLM adoption are discussed, particularly those arising from the misalignment of foundation LLMs with materials-specific knowledge, emphasizing the need to enhance adaptability, efficiency, sustainability, interpretability, and trustworthiness in the pursuit of GMI. Nonetheless, it is important to recognize that LLMs are not universally efficient. Their substantial resource demands and inconsistent performance call for careful deployment based on demonstrated task suitability. To address these realities, actionable strategies and a progressive roadmap for equitably and democratically implementing materials-aware LLMs in real-world practices are proposed.

  • Conference Article
  • Cited by 27
  • 10.18653/v1/2022.acl-long.250
Understanding Iterative Revision from Human-Written Text
  • Jan 1, 2022
  • Wanyu Du + 5 more

Writing is, by nature, a strategic, adaptive, and, more importantly, an iterative process. A crucial part of writing is editing and revising the text. Previous works on text revision have focused on defining edit intention taxonomies within a single domain or developing computational models with a single level of edit granularity, such as sentence-level edits, which differ from humans’ revision cycles. This work describes IteraTeR: the first large-scale, multi-domain, edit-intention annotated corpus of iteratively revised text. In particular, IteraTeR is collected based on a new framework to comprehensively model iterative text revisions that generalizes to various domains of formal writing, edit intentions, revision depths, and granularities. When we incorporate our annotated edit intentions, both generative and edit-based text revision models significantly improve on automatic evaluations. Through our work, we better understand the text revision process, making vital connections between edit intentions and writing quality, and enabling the creation of diverse corpora to support computational modeling of iterative text revisions.

  • Video Transcripts
  • 10.48448/n7h6-ar79
Understanding Iterative Revision from Human-Written Text
  • May 7, 2022
  • Underline Science Inc.
  • Wanyu Du + 5 more


  • Research Article
  • Cited by 14
  • 10.3390/sym16111470
Optimizing Microservice Deployment in Edge Computing with Large Language Models: Integrating Retrieval Augmented Generation and Chain of Thought Techniques
  • Nov 5, 2024
  • Symmetry
  • Kan Feng + 8 more

Large Language Models (LLMs) have demonstrated impressive capabilities in autogenerating code based on natural language instructions provided by humans. We observed that in the microservice models of edge computing, the problem of deployment latency optimization can be transformed into an NP-hard mathematical optimization problem. However, in the real world, deployment strategies at the edge often require immediate updates, while human-engineered code tends to be lagging. To bridge this gap, we innovatively integrated LLMs into the decision-making process for microservice deployment. Initially, we constructed a private Retrieval Augmented Generation (RAG) database containing prior knowledge. Subsequently, we employed meticulously designed step-by-step inductive instructions and used the chain of thought (CoT) technique to enable the LLM to learn, reason, reflect, and regenerate. We decomposed the microservice deployment latency optimization problem into a collection of granular sub-problems (described in natural language), progressively providing instructions to the fine-tuned LLM to generate corresponding code blocks. The generated code blocks underwent integration and consistency assessment. Additionally, we prompted the LLM to generate code without the use of the RAG database for comparative analysis. We executed the aforementioned code and comparison algorithm under identical operational environments and simulation parameters, conducting rigorous result analysis. Our fine-tuned model significantly reduced latencies by 22.8% in handling surges in request flows, 37.8% in managing complex microservice types, and 39.5% in processing increased network nodes compared to traditional algorithms. Moreover, our approach demonstrated marked improvements in latency performance over LLMs not utilizing RAG technology and reinforcement learning algorithms reported in other literature. 
The use of LLMs also highlights the concept of symmetry, as the symmetrical structure of input-output relationships in microservice deployment models aligns with the LLM’s inherent ability to process and generate balanced and optimized code. Symmetry in this context allows for more efficient resource allocation and reduces redundant operations, further enhancing the model’s effectiveness. We believe that LLMs hold substantial potential in optimizing microservice deployment models.

  • Research Article
  • Cited by 17
  • 10.1016/j.swevo.2024.101741
Large language models as surrogate models in evolutionary algorithms: A preliminary study
  • Sep 26, 2024
  • Swarm and Evolutionary Computation
  • Hao Hao + 2 more


  • Research Article
  • Cited by 1
  • 10.7498/aps.74.20250497
Material design accelerated by large language models: end-to-end empowerment from knowledge mining to intelligent design
  • Jan 1, 2025
  • Acta Physica Sinica
  • Yudan Huang + 8 more

With the rapid development of artificial intelligence technology, large language models (LLMs) have become the core driving force for the paradigm shift in materials science research. This review explores the comprehensive role of LLMs in accelerating material design throughout the entire research lifecycle, from knowledge mining to intelligent design. This work aims to emphasize how LLMs can leverage their advantages in information retrieval, cross-modal data integration, and intelligent reasoning to address challenges in traditional materials research, such as data fragmentation, high experimental costs, and limited reasoning capabilities.

Key methods include applying LLMs to knowledge discovery through techniques such as retrieval-augmented generation (RAG), multi-modal information retrieval, and knowledge graph construction. These approaches can efficiently extract and structure material data from a vast repository of scientific literature and experimental records. Additionally, LLMs are integrated with automated experimental platforms to optimize workflows, from natural language-driven experiment design to high-throughput iterative testing.

The results demonstrate that LLMs significantly enhance material research efficiency and accuracy. For instance, in knowledge mining, LLMs improve information retrieval accuracy by up to 29.4% in tasks such as predicting material synthesis conditions. In material design, LLMs can accelerate computational modeling, structure and performance prediction, and reverse engineering, reducing experimental trial-and-error cycles. Notably, LLMs perform well in cross-scale knowledge integration, linking material composition, processing parameters, and performance metrics to guide innovative synthesis pathways.

However, challenges remain, including dependence on high-quality data, the “black-box” nature of LLMs, and limitations in handling complex material systems. Future directions emphasize improving data quality through multi-source integration, enhancing model explainability through visualization tools, deepening interdisciplinary collaboration, and bridging the gaps between AI and domain-specific expertise.

In summary, LLMs are reshaping materials science by enabling a data-driven, knowledge-intensive research paradigm. Their ability to integrate vast datasets, predict material properties, and automate experimental workflows makes them indispensable tools for accelerating material discovery and innovation. As LLMs develop, their synergy with physical constraints and experimental platforms is expected to open new directions in material design.

  • Research Article
  • Cited by 7
  • 10.1093/jamia/ocaf023
Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments.
  • Mar 10, 2025
  • Journal of the American Medical Informatics Association : JAMIA
  • Chuan Hong + 13 more


  • Research Article
  • Cited by 3
  • 10.2196/77334
Performance of Large Language Models in Diagnosing Rare Hematologic Diseases and the Impact of Their Diagnostic Outputs on Physicians: Combined Retrospective and Prospective Study
  • Oct 9, 2025
  • Journal of Medical Internet Research
  • Hongbin Yu + 15 more

Background: Rare hematologic diseases are frequently underdiagnosed or misdiagnosed due to their clinical complexity. Whether new-generation large language models (LLMs), particularly those using chain-of-thought reasoning, can improve diagnostic accuracy remains unclear.

Objective: This study aimed to evaluate the diagnostic performance of new-generation commercial LLMs on rare hematologic diseases and to determine whether LLM output enhances physicians’ diagnostic accuracy.

Methods: We conducted a 2-phase study. In the retrospective phase, we evaluated 7 mainstream LLMs on 158 nonpublic real-world admission records covering 9 rare hematologic diseases, assessed diagnostic performance using top-10 accuracy and mean reciprocal rank (MRR), and evaluated ranking stability via Jaccard similarity and entropy. Spearman rank correlation was used to examine the association between physicians’ diagnoses and LLM-generated outputs. In the prospective phase, 28 physicians with varying levels of experience diagnosed 5 cases each, gaining access to LLM-generated diagnoses across 3 sequential steps to assess whether LLMs can improve diagnostic accuracy.

Results: In the retrospective phase, ChatGPT-o1-preview demonstrated the highest top-10 accuracy (70.3%) and MRR (0.577), and DeepSeek-R1 ranked second. Diagnostic performance was low for amyloid light-chain (AL) amyloidosis; Castleman disease; Erdheim-Chester disease; and polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome. Interestingly, higher accuracy often correlated with lower ranking stability across most LLMs. Physician performance showed a strong correlation with both top-10 accuracy (ρ=0.565) and MRR (ρ=0.650). In the prospective phase, LLMs significantly improved the diagnostic accuracy of less-experienced physicians; no significant benefit was observed for specialists. However, when LLMs generated biased responses, physician performance often failed to improve or even declined.

Conclusions: Without fine-tuning, new-generation commercial LLMs, particularly those with chain-of-thought reasoning, can identify diagnoses of rare hematologic diseases with high accuracy and significantly enhance the diagnostic performance of less-experienced physicians. Nevertheless, biased LLM outputs may mislead clinicians, highlighting the need for critical appraisal and cautious clinical integration with appropriate safeguard systems.
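The two retrospective metrics in this abstract, top-10 accuracy and mean reciprocal rank (MRR), have standard definitions that can be computed directly from ranked diagnosis lists; the function names and example data below are illustrative, not the study's code:

```python
# Top-k accuracy and mean reciprocal rank (MRR) over ranked diagnosis lists.

def top_k_accuracy(ranked_lists, truths, k=10):
    # Fraction of cases where the true diagnosis appears in the top k.
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

def mean_reciprocal_rank(ranked_lists, truths):
    # Average of 1/rank of the true diagnosis (0 if it never appears).
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # rank is 1-based
    return total / len(truths)

# Toy example: two cases, each with a model-ranked candidate list.
preds = [["AL amyloidosis", "POEMS"], ["Castleman", "AL amyloidosis"]]
gold = ["POEMS", "AL amyloidosis"]
print(top_k_accuracy(preds, gold, k=10))   # -> 1.0
print(mean_reciprocal_rank(preds, gold))   # -> 0.5
```

An MRR of 0.577, as reported for ChatGPT-o1-preview, roughly corresponds to the correct diagnosis appearing near rank 2 on average when it is found at all.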

  • Research Article
  • Cited by 69
  • 10.1038/s41598-025-98483-1
Industrial applications of large language models
  • Apr 21, 2025
  • Scientific Reports
  • Mubashar Raza + 4 more

Large language models (LLMs) are artificial intelligence (AI)-based computational models designed to understand and generate human-like text. With billions of training parameters, LLMs excel at identifying intricate language patterns, enabling remarkable performance across a variety of natural language processing (NLP) tasks. Since the introduction of transformer architectures, they have been impacting industry with their text generation capabilities. LLMs play an innovative role across various industries by automating NLP tasks. In healthcare, they assist in diagnosing diseases, personalizing treatment plans, and managing patient data. In the automotive industry, LLMs support predictive maintenance. They also power recommendation systems and consumer behavior analysis. In education, LLMs facilitate research and offer personalized learning experiences. In finance and banking, LLMs are used for fraud detection, customer service automation, and risk management. LLMs are driving significant advancements across industries by automating tasks, improving accuracy, and providing deeper insights. Despite these advancements, LLMs face challenges such as ethical concerns, biases in training data, and significant computational resource requirements, which must be addressed to ensure impartial and sustainable deployment. This study provides a comprehensive analysis of LLMs, their evolution, and their diverse applications across industries, offering researchers valuable insights into their transformative potential and the accompanying limitations.

  • Research Article
  • 10.54097/hw85q020
Collaborative Integration of Large Language Models and Computer Algebra Systems for Simulation Verification and Code Generation
  • Jan 29, 2026
  • Academic Journal of Science and Technology
  • Xuhui Shi

This paper explores the transformative role of Large Language Models (LLMs) in scientific computing and research. As scientific datasets increasingly contain unstructured or semi-structured text, preprocessing and structuring this data are crucial challenges. LLMs have shown great potential in automating data extraction, transforming raw text into structured formats for computational models. By utilizing techniques such as Named Entity Recognition (NER), LLMs can extract critical information like chemical names, experimental conditions, and research findings, significantly reducing manual efforts. In addition to data preprocessing, LLMs facilitate literature reviews by rapidly scanning vast amounts of research papers, summarizing key insights, and constructing knowledge graphs to visualize relationships across complex datasets. Furthermore, LLMs contribute to problem-solving by generating theoretical insights and assisting in mathematical and computational tasks. Finally, LLMs can support scientific programming by automating code generation for data analysis and simulations. These capabilities enhance efficiency, foster faster discoveries, and lower the barrier to managing complex scientific data, ultimately accelerating research in multiple scientific domains.

  • Research Article
  • 10.1609/aaai.v39i24.34742
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yutong Wu + 15 more

Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g. CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.
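The Inverse-Instruct idea (generate new instructions for existing code responses, then fine-tune on the enlarged set) can be sketched as a simple augmentation loop. `summarize_code` below is a deterministic stand-in for the fine-tuned LLM, and its rule is a toy assumption rather than the paper's actual prompting setup:

```python
# Sketch of Inverse-Instruct-style augmentation: use the model itself to
# write new instructions for code it already has, then add the pairs back.

def summarize_code(code: str) -> str:
    """Stand-in for the fine-tuned LLM translating code into an instruction."""
    name = code.split("def ")[1].split("(")[0]  # crude: read the function name
    return f"Write a Python function named {name}."

def inverse_instruct(dataset):
    # dataset: list of (instruction, code) pairs from the original training set.
    augmented = list(dataset)
    for _, code in dataset:
        # One code snippet can answer several instructions, so the generated
        # instruction-response pair is added alongside the original.
        augmented.append((summarize_code(code), code))
    return augmented

data = [("Sort a list.", "def sort_list(xs):\n    return sorted(xs)")]
aug = inverse_instruct(data)
print(len(aug))    # -> 2
print(aug[1][0])   # -> Write a Python function named sort_list.
```

The augmented set would then be used to fine-tune the same base model again, which is the self-improvement loop the paper evaluates.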

  • Research Article
  • 10.1162/opmi_a_00209
Relative Value Encoding in Large Language Models: A Multi-Task, Multi-Model Investigation
  • May 9, 2025
  • Open Mind : Discoveries in Cognitive Science
  • William M Hayes + 2 more

In-context learning enables large language models (LLMs) to perform a variety of tasks, including solving reinforcement learning (RL) problems. Given their potential use as (autonomous) decision-making agents, it is important to understand how these models behave in RL tasks and the extent to which they are susceptible to biases. Motivated by the fact that, in humans, it has been widely documented that the value of a choice outcome depends on how it compares to other local outcomes, the present study focuses on whether similar value encoding biases apply to LLMs. Results from experiments with multiple bandit tasks and models show that LLMs exhibit behavioral signatures of relative value encoding. Adding explicit outcome comparisons to the prompt magnifies the bias, impairing the ability of LLMs to generalize from the outcomes presented in-context to new choice problems, similar to effects observed in humans. Computational cognitive modeling reveals that LLM behavior is well-described by a simple RL algorithm that incorporates relative values at the outcome encoding stage. Lastly, we present preliminary evidence that the observed biases are not limited to fine-tuned LLMs, and that relative value processing is detectable in the final hidden layer activations of a raw, pretrained model. These findings have important implications for the use of LLMs in decision-making applications.
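A "simple RL algorithm that incorporates relative values at the outcome encoding stage" can be sketched as a delta rule applied to context-centered rewards. Centering by the mean value of the locally available options and the learning rate are illustrative assumptions here, not the paper's fitted model:

```python
# Sketch of relative outcome encoding in a bandit value learner: the reward
# is re-coded relative to the context's average value before the delta-rule
# update, so the same absolute reward can be encoded as good in one context
# and bad in another.

def update(q, context, chosen, reward, alpha=0.1):
    # Relative encoding: subtract the mean value of options in this context.
    context_mean = sum(q[a] for a in context) / len(context)
    relative_reward = reward - context_mean
    # Standard delta rule on the re-coded outcome.
    q[chosen] += alpha * (relative_reward - q[chosen])
    return q

q = {"A": 0.0, "B": 0.0}
q = update(q, context=["A", "B"], chosen="A", reward=1.0)
print(round(q["A"], 3))  # -> 0.1
```

Because values are learned on a context-relative scale, transferring them to a new choice set mixing options from different contexts can mislead the learner, which mirrors the generalization failure the study reports in both humans and LLMs.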

  • Research Article
  • Cited by 6
  • 10.1177/25152459251357566
Six Fallacies in Substituting Large Language Models for Human Participants
  • Jul 1, 2025
  • Advances in Methods and Practices in Psychological Science
  • Zhicheng Lin

Can artificial-intelligence (AI) systems, such as large language models (LLMs), replace human participants in behavioral and psychological research? Here, I critically evaluate the replacement perspective and identify six interpretive fallacies that undermine its validity. These fallacies are (a) equating token prediction with human intelligence, (b) treating LLMs as the average human, (c) interpreting alignment as explanation, (d) anthropomorphizing AI systems, (e) essentializing identities, and (f) substituting model data for human evidence. Each fallacy represents a potential misunderstanding about what LLMs are and what they can tell researchers about human cognition. In the analysis, I distinguish levels of similarity between LLMs and humans, particularly functional equivalence (outputs) versus mechanistic equivalence (processes), while highlighting both technical limitations (addressable through engineering) and conceptual limitations (arising from fundamental differences between statistical and biological intelligence). For each fallacy, specific safeguards are provided to guide responsible research practices. Ultimately, the analysis supports conceptualizing LLMs as pragmatic simulation tools—useful for role-play, rapid hypothesis testing, and computational modeling (provided their outputs are validated against human data)—rather than as replacements for human participants. This framework enables researchers to leverage language models productively while respecting the fundamental differences between machine intelligence and human thought.
