Enhancing Robot Task Planning and Execution through Multi-Layer Large Language Models
Large language models have found utility in robot task planning and task decomposition. However, directly applying these models to instruct robots in task execution faces several challenges: the models struggle with more intricate tasks, interact poorly with the environment, and often generate machine control instructions that are not practically executable. In response, this research proposes a multi-layer large language model to improve a robot's proficiency in handling complex tasks. The proposed model decomposes tasks layer by layer through the integration of multiple large language models, with the goal of improving the accuracy of task planning. Within the decomposition process, a visual language model is introduced as a sensor for environment perception; its outputs are fed back into the large language model, combining the task objectives with environmental information to generate robot motion plans tailored to the current environment. Furthermore, to improve the executability of the task plans produced by the large language model, a semantic alignment method is introduced that aligns task planning descriptions with the functional requirements of robot motion, refining the compatibility and coherence of the generated instructions. To validate the proposed approach, an experimental platform is established on an intelligent unmanned vehicle and used to empirically verify the proficiency of the multi-layer large language model in both robot task planning and execution.
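The layer-by-layer decomposition loop described above can be sketched as follows. This is a minimal illustration only: `llm_decompose` and `vlm_perceive` are hypothetical stubs standing in for calls to a large language model and a visual language model, and the canned subtask table is invented for the example, not taken from the paper.

```python
def vlm_perceive() -> str:
    """Stub: a visual language model would summarize the camera view here."""
    return "corridor ahead, obstacle on the left"

def llm_decompose(task: str, context: str) -> list[str]:
    """Stub: an LLM would split the task given the perceived scene context.
    The lookup table below is an illustrative stand-in for model output."""
    canned = {
        "deliver the package": ["navigate to the office", "hand over the package"],
        "navigate to the office": ["follow corridor", "avoid obstacle on the left"],
    }
    return canned.get(task, [task])  # tasks with no decomposition are atomic

def plan(task: str, depth: int = 2) -> list[str]:
    """Recursively decompose a task, layer by layer, until it is atomic
    or the layer budget is exhausted, injecting perception at each layer."""
    scene = vlm_perceive()
    subtasks = llm_decompose(task, scene)
    if depth == 0 or subtasks == [task]:
        return [task]
    return [step for sub in subtasks for step in plan(sub, depth - 1)]

print(plan("deliver the package"))
```

Each recursion level plays the role of one layer in the multi-layer model: perception is queried fresh at every layer, so environmental information can reshape the decomposition of each subtask.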
Highlights
Grounded in experiential learning and knowledge accumulation, humans demonstrate a remarkable ability to comprehend intricate tasks through simple communication
Five navigation tasks were described in natural language, and the framework proposed in this paper enabled the robot to accomplish them
This work designs a feedback mechanism within the task decomposition process, solving the problem of aligning task decomposition with robot control instructions
Summary
Grounded in experiential learning and knowledge accumulation, humans demonstrate a remarkable ability to comprehend intricate tasks through simple communication. Exploiting the inherent decomposition capability within large language models enables a task to be broken into multiple subtasks of reduced complexity [1]. This distinctive feature can be harnessed for task planning within robotic systems, leading to more efficient and seamless human–robot interactions. Prevailing studies have predominantly concentrated on tasks of low complexity, such as robotic arm trajectory planning and robotic object handling. While these investigations have contributed significantly to establishing a theoretical framework for large-model-based robot control, they fall short in addressing tasks of elevated complexity. The expansion into more intricate tasks necessitates the robot's ability to adeptly handle and integrate complex environmental information into the task planning process.
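The semantic alignment step, mapping free-text planner output onto a robot's executable skill set, can be illustrated with a toy string-similarity matcher. The skill names and the matching criterion below are illustrative assumptions, not the paper's actual method or control vocabulary.

```python
from difflib import SequenceMatcher

# Hypothetical skill library: these names are invented for illustration.
SKILLS = ["move_forward", "turn_left", "turn_right", "stop", "grasp_object"]

def align(step: str, skills=SKILLS) -> str:
    """Map a free-text planner step to the closest executable skill.
    A toy stand-in for semantic alignment: normalize the text and pick
    the skill with the highest string-similarity ratio."""
    normalized = step.lower().replace(" ", "_")
    return max(skills, key=lambda s: SequenceMatcher(None, normalized, s).ratio())

plan = ["Move forwards", "turn to the left", "stop the vehicle"]
print([align(step) for step in plan])
```

A production system would replace string similarity with embedding-based matching, but the contract is the same: every natural-language step the planner emits must resolve to an instruction the robot can actually execute.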
References
- [1] doi:10.18653/v1/2023.acl-long.754 (2023)
- [2] doi:10.1109/icra48891.2023.10160591 (2023)
- [3] doi:10.1109/iros47612.2022.9982236 (2022)
- [4] doi:10.18653/v1/2020.emnlp-main.373 (2020)
- [5] doi:10.18653/v1/n18-1197 (2018)
- [6] doi:10.1109/icra46639.2022.9811931 (2022)
- [7] doi:10.1109/iccv48922.2021.00180 (2021)
- [8] doi:10.18653/v1/2021.acl-long.159 (2021)
- [9] doi:10.1146/annurev-control-101119-071628 (2020), Annual Review of Control, Robotics, and Autonomous Systems
- [10] doi:10.1145/3568162.3578623 (2023)