HG2P: Hippocampus-inspired high-reward graph and model-free Q-gradient penalty for path planning and motion control.
- Research Article
- 10.1609/aaai.v39i14.33606
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Reinforcement learning (RL) has shown promising performance in tackling robotic manipulation tasks (RMTs), which require learning a prolonged sequence of manipulation actions to control robots efficiently. However, most RL algorithms suffer from two problems when solving RMTs: inefficient exploration due to the extremely large action space, and catastrophic forgetting due to poor sampling efficiency. To alleviate these problems, this paper introduces an Evolutionary Reinforcement Learning algorithm with parameterized Action Primitives, called ERLAP, which combines the advantages of an evolutionary algorithm (EA) and hierarchical RL (HRL) to solve diverse RMTs. A library of heterogeneous action primitives is constructed in the HRL component to enhance the exploration efficiency of robots, and dual populations with new evolutionary operators are run in the EA to optimize these primitive sequences, which diversifies the distribution of the replay buffer and avoids catastrophic forgetting. Experiments show that ERLAP outperforms four state-of-the-art RL algorithms in simulated RMTs with dense rewards and can effectively avoid catastrophic forgetting in a set of more challenging simulated RMTs with sparse rewards.
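The evolutionary optimization of primitive sequences described above can be sketched minimally as follows. This is an illustrative elitist loop over symbolic primitive names, not the paper's dual-population operators; the library contents, fitness function, and `elite` parameter are all assumptions for the example:

```python
import random

def mutate(sequence, primitive_library, rate=0.2, rng=random):
    """Point mutation: each primitive in the sequence is resampled from
    the library with probability `rate`."""
    return [rng.choice(primitive_library) if rng.random() < rate else p
            for p in sequence]

def evolve(population, fitness, primitive_library, elite=2, rng=random):
    """One generation over primitive sequences: keep the `elite` best,
    then refill the population with mutated copies of the elites."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:elite]
    children = [mutate(rng.choice(parents), primitive_library, rng=rng)
                for _ in range(len(population) - elite)]
    return parents + children
```

Because children are drawn from mutated elites rather than fresh random sequences, successive replay buffers stay diverse without discarding high-fitness behavior.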
- Research Article
1
- 10.1109/tnnls.2024.3425809
- May 1, 2025
- IEEE transactions on neural networks and learning systems
Goal-conditioned hierarchical reinforcement learning (HRL) presents a promising approach for enabling effective exploration in complex, long-horizon reinforcement learning (RL) tasks through temporal abstraction. Empirically, heightened interlevel communication and coordination can induce more stable and robust policy improvement in hierarchical systems. Yet most existing goal-conditioned HRL algorithms have focused primarily on subgoal discovery, neglecting interlevel cooperation. Here, we propose a novel goal-conditioned HRL framework named Guided Cooperation via Model-Based Rollout (GCMR; code is available at https://github.com/HaoranWang-TJ/GCMR_ACLG_official), which aims to bridge interlayer information synchronization and cooperation by exploiting forward dynamics. First, GCMR mitigates the state-transition error within off-policy correction via model-based rollout, thereby enhancing sample efficiency. Second, to prevent disruption by unseen subgoals and states, lower-level Q-function gradients are constrained using a gradient penalty with a model-inferred upper bound, leading to a more stable behavioral policy conducive to effective exploration. Third, we propose one-step rollout-based planning, which uses the higher-level critic to guide the lower-level policy: the value of the lower-level policy's future states is estimated with the higher-level critic function, thereby transmitting global task information downward to avoid local pitfalls. Together, these three components are expected to facilitate interlevel cooperation significantly. Experimental results demonstrate that combining the proposed GCMR framework with a disentangled variant of hierarchical reinforcement learning guided by landmarks (HIGL), namely adjacency constraint and landmark-guided planning (ACLG), yields more stable and robust policy improvement than various baselines and significantly outperforms previous state-of-the-art (SOTA) algorithms.
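The Q-gradient penalty idea can be sketched in a few lines of numpy. This is a toy: the critic is a hand-written quadratic so its action-gradient has a closed form, and the upper bound is a fixed constant rather than the model-inferred bound used in GCMR; all names are illustrative:

```python
import numpy as np

def q_value(w, s, a):
    """Toy quadratic critic: Q(s, a) = s.w_s + a.w_a - ||a||^2 (illustrative)."""
    return s @ w["ws"] + a @ w["wa"] - np.sum(a ** 2)

def grad_a_q(w, s, a):
    """Analytic gradient of the toy critic with respect to the action."""
    return w["wa"] - 2.0 * a

def q_gradient_penalty(w, s, a, upper_bound):
    """Penalize the action-gradient norm of Q only when it exceeds the
    upper bound (fixed here; inferred from a learned model in GCMR)."""
    g_norm = np.linalg.norm(grad_a_q(w, s, a))
    return max(0.0, g_norm - upper_bound) ** 2
```

Added to the critic loss, this term leaves well-behaved gradients untouched and pushes down only those that spike near unseen subgoals or states.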
- Research Article
- 10.32628/cseit251112170
- Feb 10, 2025
- International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Recent advancements in reinforcement learning (RL) have marked a significant transformation from academic research to practical industrial applications. This comprehensive article explores how methodological breakthroughs in RL are creating tangible value across various sectors. The article examines three key evolutionary areas: hierarchical reinforcement learning, which enables efficient handling of complex tasks through decomposition; offline reinforcement learning, which facilitates learning from historical data; and model-based approaches that improve sample efficiency. It discusses successful implementations in resource allocation, energy management, and manufacturing, highlighting how RL systems are optimizing operations and improving performance. The integration of domain knowledge through constraint satisfaction and human-in-the-loop learning has further enhanced RL's practical applicability. While celebrating these achievements, the article also addresses critical challenges in scalability, interpretability, and robustness that must be overcome for broader adoption. It encompasses both current capabilities and future directions, providing insights into how RL continues to evolve as a crucial technology for next-generation intelligent systems.
- Research Article
- 10.1038/s41598-025-20653-y
- Oct 21, 2025
- Scientific Reports
Deep reinforcement learning methods have shown promising results in learning specific tasks, but struggle to cope with the challenges of long horizon manipulation tasks. As task complexity increases, the large state space and sparse reward make it difficult to collect effective samples through random exploration. Hierarchical reinforcement learning decomposes complex tasks into subtasks, which can reduce the difficulty of skill learning, but still suffers from limitations such as inefficient training and poor transferability. Recently, large language models (LLMs) have demonstrated the ability to encode vast amounts of knowledge about the world and to excel in context-based learning and reasoning tasks. However, applying LLMs to real-world tasks remains challenging due to their lack of grounding in specific task contexts. In this paper, we leverage the planning capabilities of LLMs alongside reinforcement learning (RL) to facilitate learning from the environment. The proposed approach yields a hierarchical agent that combines LLMs with parameterized action primitives (LARAP) to address long-horizon manipulation tasks. Rather than relying solely on LLMs, the agent uses them to guide a high-level policy, improving sample efficiency during training. Experimental results show that LARAP significantly outperforms baseline methods across various simulated manipulation tasks. The source code is available at: https://github.com/ningzhang-buaa/LARAP-code.
- Conference Article
- 10.1109/iccrd56364.2023.10080162
- Jan 10, 2023
Hierarchical reinforcement learning has achieved good results on complex learning tasks. Hierarchical reinforcement learning methods are mainly divided into on-policy and off-policy approaches. On-policy methods are hard to apply in practical scenarios because of their low data utilization, so off-policy methods have become the main direction of development. However, in the off-policy setting, the data in the replay buffer comes from different policies: the upper-level policy issues the same goal while the lower-level policy is continually updated and drifts toward different states, so the upper-level policy cannot be trained stably. To address this problem, we propose a hierarchical reinforcement learning algorithm that represents the environment using bisimulation metrics. When training the upper-level policy, the lower-level policy is treated as part of the environment, which we call the virtual representation environment. Its output is used for feature extraction, and the extracted features serve as the state of the upper-level policy network. Compared with current mainstream hierarchical reinforcement learning methods on a variety of complex tasks, our method improves both stability and training performance.
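A bisimulation-style representation objective can be sketched as follows. This is a generic scalar surrogate of the standard bisimulation metric (reward gap plus discounted successor distance), not this paper's specific formulation; the function names and the use of a plain Euclidean latent distance are assumptions:

```python
import numpy as np

def bisim_target(r_i, r_j, d_next, gamma=0.99):
    """Bisimulation-style distance target between two states: immediate
    reward gap plus discounted distance between their successors
    (a scalar surrogate for the Wasserstein term)."""
    return abs(r_i - r_j) + gamma * d_next

def embedding_loss(z_i, z_j, r_i, r_j, d_next, gamma=0.99):
    """Squared error pulling the latent distance toward the bisimulation
    target, so states that behave alike embed close together."""
    d_latent = np.linalg.norm(z_i - z_j)
    return (d_latent - bisim_target(r_i, r_j, d_next, gamma)) ** 2
```

Training the encoder on this loss makes the upper-level state abstraction depend on how states behave under the (virtual) lower-level environment rather than on raw observations.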
- Conference Article
- 10.1109/icra.2013.6630737
- May 1, 2013
In this paper we employ probabilistic relational affordance models in a robotic manipulation task. Such affordance models capture the interdependencies between properties of multiple objects, executed actions, and effects of those actions on objects. Recently it was shown how to learn such models from observed video demonstrations of actions manipulating several objects. This paper extends that work and employs those models for sequential tasks. Our approach consists of two parts. First, we employ affordance models sequentially in order to recognize the individual actions making up a demonstrated sequential skill or high level concept. Second, we utilize the models of concepts to plan a suitable course of action to replicate the observed consequences of a demonstration. For this we adopt the framework of relational Markov decision processes. Empirical results show the viability of the affordance models for sequential manipulation skills for object placement.
- Research Article
- 10.1017/s0263574724000389
- May 2, 2024
- Robotica
With the rise of deep reinforcement learning (RL) methods, many complex robotic manipulation tasks are being solved. However, harnessing the full power of deep learning requires large datasets, and online RL does not lend itself readily to this paradigm because agent-environment interaction is costly and time-consuming. Many offline RL algorithms have therefore been proposed recently for learning robotic tasks, but most such methods focus on single-task or multitask learning, which requires retraining whenever a new task must be learned. Continually learning tasks without forgetting previous knowledge, combined with the power of offline deep RL, would allow us to scale the number of tasks by adding them one after another. This paper investigates the effectiveness of regularisation-based methods such as synaptic intelligence for sequentially learning image-based robotic manipulation tasks in an offline-RL setup. We evaluate the performance of this combined framework against the common challenges of sequential learning: catastrophic forgetting and forward knowledge transfer. We performed experiments with different task combinations to analyse the effect of task ordering, and also investigated the effect of the number of object configurations and the density of robot trajectories. We found that learning tasks sequentially helps retain knowledge from previous tasks, thereby reducing the time required to learn a new task. Regularisation-based approaches to continual learning, like the synaptic intelligence method, help mitigate catastrophic forgetting but have shown only limited transfer of knowledge from previous tasks.
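The synaptic-intelligence regulariser mentioned above has a simple closed form that can be sketched in numpy. The damping constant `xi`, the strength `c`, and the flat parameter vectors are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def si_importance(path_integral, theta, theta_prev, xi=1e-3):
    """Per-parameter importance: each parameter's accumulated contribution
    to past loss reduction, divided by the squared distance it traveled
    (damped by xi to avoid division by zero)."""
    return path_integral / ((theta - theta_prev) ** 2 + xi)

def si_penalty(theta, theta_star, omega, c=0.1):
    """Regulariser added to the new task's loss: a quadratic pull toward
    the previous task's solution theta_star, weighted by importance omega."""
    return c * float(np.sum(omega * (theta - theta_star) ** 2))
```

Parameters that mattered for earlier tasks (large `omega`) are anchored, while unimportant ones remain free to learn the new task, which is how the method trades off forgetting against transfer.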
- Conference Article
- 10.1115/detc2025-169527
- Aug 17, 2025
We present a novel, modular Graph-based Vision-Language-Action (VLAG) framework designed for long-horizon robotic manipulation tasks. Our approach integrates a graph-based planner with dedicated vision, language, and action modules, enabling robust and efficient task planning and execution. The graph planner serves as a high-level decision-making entity that interprets visual observations and language instructions to select appropriate task sequences. Specifically, our framework leverages a vision model with a multi-layer perceptron to extract key environmental features from both RGB and depth images. The language model is fine-tuned from a pre-trained model to enhance instruction-to-task pairing accuracy, yielding reliable and robust task recognition. The action model is built on the Action Chunking with Transformers (ACT) architecture, modified to accommodate the vision and language modalities. The graph planner is crucial to the framework's functionality, as it combines the strengths of the vision, language, and action modules, leading to a system that is both adaptable and computationally efficient. Overall, VLAG's modular design enables the flexible integration of its components, providing a scalable solution for robotic manipulation tasks in both seen and unseen environments.
- Research Article
- 10.1109/tai.2022.3222143
- Dec 1, 2023
- IEEE Transactions on Artificial Intelligence
Autonomous control in high-dimensional, continuous state spaces is a persistent and important challenge in the fields of robotics and artificial intelligence. Because of high risk and complexity, the adoption of AI for autonomous combat systems has been a long-standing difficulty. To address these issues, DARPA's AlphaDogfight Trials (ADT) program sought to vet the feasibility of, and increase trust in, AI for autonomously piloting an F-16 in simulated air-to-air combat. Our submission to ADT solves the high-dimensional, continuous control problem using a novel hierarchical deep reinforcement learning approach consisting of a high-level policy selector and a set of separately trained low-level policies specialized for excelling in specific regions of the state space. Both levels of the hierarchy are trained using off-policy, maximum entropy methods with expert knowledge integrated through reward shaping. Our approach outperformed human expert pilots and achieved a second-place rank in the ADT championship event. Impact Statement: Significant performance milestones in reinforcement learning have been achieved in recent years, with autonomous agents demonstrating super-human performance across a wide variety of tasks. Before these algorithms can be extensively deployed in real-world defense applications, a greater level of trust must first be achieved. ADT was an important step towards developing the trust necessary to operationalize these algorithms, by demonstrating their effectiveness on a foundational yet relevant problem in a high-fidelity simulation environment. Developed for the program, our hierarchical reinforcement learning agent was designed alongside, and competed against, active fighter pilots, and ultimately defeated a graduate of the United States Air Force's F-16 Weapons Instructor Course in match play.
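The selector-over-specialists structure described above can be sketched generically. This is a minimal routing example with a linear scorer and callable stand-ins for the low-level policies; the weights and policy names are assumptions, not the ADT agent's architecture:

```python
import numpy as np

def select_policy(state, selector_w):
    """High-level selector: score each specialist for the current state
    and return the index of the highest-scoring one."""
    return int(np.argmax(selector_w @ state))

def act(state, selector_w, specialists):
    """Two-level control: route the state to the chosen low-level policy
    and return that specialist's action."""
    return specialists[select_policy(state, selector_w)](state)
```

The key design choice is that each specialist only ever needs to be competent in its own region of the state space; the selector, not the specialists, carries the burden of knowing which region the agent is in.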
- Conference Article
- 10.1109/isctech58360.2022.00119
- Dec 1, 2022
Deep reinforcement learning (DRL) has become a popular learning paradigm for decision and control and has been widely applied to robot manipulation in recent years. However, due to its "trial and error" learning pattern, DRL still has problems to be refined, such as the exploration dilemma, sample inefficiency, and slow convergence, especially when faced with complex long-horizon tasks. As a solution to these limitations, hierarchical reinforcement learning (HRL) has been proposed and developed: it decomposes challenging tasks into multiple simpler subtasks, efficiently solving the main task in a "divide and conquer" manner. Although there is now a substantial body of HRL methods for robotic manipulation tasks, a review of them is lacking. To help researchers form a general view of this field, we systematically summarize related HRL methods for robotic manipulation. This review organizes the literature under a novel taxonomy of subtask generation, dividing HRL methods into two categories: handcrafted subtask generation and learning-based subtask generation. A large number of representative methods are analyzed in detail. Finally, we present some important future directions for HRL.
- Conference Article
- 10.1109/syscon48628.2021.9447100
- Apr 15, 2021
Slip detection and correction play a very important role in robotic manipulation tasks and have long been a challenging problem in the robotics community. Further, the advantages of using systems engineering tools and frameworks to model and solve robotic tasks are not often pursued. In this paper, we use Model-Based Systems Engineering techniques to verify system requirements and validate stakeholder requirements for the problem of detecting and correcting object slippage within a dexterous five-fingered robotic hand. We discuss how the work accomplished in our laboratory was transferred to a simulated environment and how this simulated environment, built in CoppeliaSim, was connected to a systems engineering software, Cameo Systems Modeler. Measures of effectiveness were created from the stakeholder requirements for the slippage problem, which allowed us to validate the robotic simulation that was built. Structural diagrams of the robotic system and environment were built along with behavioral diagrams of the simulation. Further, we used the connection of Cameo Systems Modeler and CoppeliaSim to track the measures of effectiveness for our robotic task, which provided a complete systems engineering framework for the problem from the requirements phase through the implementation phase. Our main goal is to show the advantages of following a systems engineering framework to complete a robotic task through the connection of Cameo Systems Modeler and CoppeliaSim.
- Research Article
- 10.1162/jocn_a_01869
- Jul 1, 2022
- Journal of Cognitive Neuroscience
To effectively behave within ever-changing environments, biological agents must learn and act at varying hierarchical levels such that a complex task may be broken down into more tractable subtasks. Hierarchical reinforcement learning (HRL) is a computational framework that provides an understanding of this process by combining sequential actions into one temporally extended unit called an option. However, there are still open questions within the HRL framework, including how options are formed and how HRL mechanisms might be realized within the brain. In this review, we propose that the existing human motor sequence literature can aid in understanding both of these questions. We give specific emphasis to visuomotor sequence learning tasks such as the discrete sequence production task and the M × N (M steps × N sets) task to understand how hierarchical learning and behavior manifest across sequential action tasks as well as how the dorsal cortical-subcortical circuitry could support this kind of behavior. This review highlights how motor chunks within a motor sequence can function as HRL options. Furthermore, we aim to merge findings from motor sequence literature with reinforcement learning perspectives to inform experimental design in each respective subfield.
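The option construct at the heart of the review above can be sketched as a small data structure. This is the standard options-framework triple (initiation set, intra-option policy, termination condition) in a toy integer-state environment; all names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A temporally extended action: where it may start, what it does,
    and when it terminates (the 'motor chunk' analogue in the review)."""
    can_start: Callable[[int], bool]
    policy: Callable[[int], int]
    terminates: Callable[[int], bool]

def run_option(option, state, step_fn, max_steps=20):
    """Execute one option until its termination condition fires,
    returning the sequence of visited states."""
    visited = [state]
    for _ in range(max_steps):
        if option.terminates(state):
            break
        state = step_fn(state, option.policy(state))
        visited.append(state)
    return visited
```

Once wrapped this way, a whole practiced sequence of primitive moves executes as a single high-level choice, which is exactly the compression the motor-chunk analogy points at.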
- Conference Article
- 10.1109/iros47612.2022.9981933
- Oct 23, 2022
Hierarchical algorithms have often been used to plan and execute complicated robotic sequential manipulation tasks, where an abstract planner searches for a skill sequence in an abstract space, and each skill generates actual motions on the basis of the planned skill sequence. To generate executable plans, the abstract planner should know the pre-/postconditions of each skill and choose skills so that the generated plan satisfies them. For such hierarchical planning, this paper presents a novel robot skill learning method that learns not only a control policy but also the skill's pre-/postconditions for completing a given task. Our method combines an optimal control method with an active learning approach called level set estimation (LSE) to effectively collect training data for learning control policies and pre-/postconditions. Although an LSE-based policy learning algorithm that identifies preconditions already exists, its performance is limited to cases where the dimension of the search space for pre-/postconditions is low. The main contribution of this paper is a new learning method that can handle tasks with a high-dimensional search space for pre-/postconditions. We demonstrate the proposed method in two robotic tasks. The results show that our method learns a control policy and its pre-/postconditions more effectively than the existing LSE-based method.
- Research Article
- 10.17531/ein/205794
- Jun 19, 2025
- Eksploatacja i Niezawodność – Maintenance and Reliability
This paper studies a neural-network-based path planning and motion control method for a robot arm, aiming to improve path planning efficiency and motion control accuracy in complex environments. By introducing a deep reinforcement learning (DRL) method, specifically proximal policy optimization (PPO), this paper proposes a framework for integrated path planning and motion control. Experimental results show that, in the path planning task, the paths generated by PPO have the highest smoothness, the shortest length, and the strongest obstacle avoidance ability. In the motion control task, PPO exhibits the smallest trajectory error, the highest motion accuracy, and the best stability. Comprehensive experiments further verify the superior performance of PPO on the combined path planning and motion control problem: it generates smooth, short, and safe paths and accurately controls the motion trajectory of the robot arm to ensure high-quality task completion.
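The PPO surrogate objective referred to above is standard and can be sketched in numpy. This is only the clipped policy-loss term (the paper's full framework also involves value and entropy terms); the batch values in the usage below are made up for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: for each sample take the pessimistic minimum
    of the unclipped and clipped policy-ratio terms, then average.
    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` is the estimated A(s, a)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))
```

Clipping removes the incentive to push the ratio beyond `1 ± eps`, which is what keeps each policy update small and the arm's learned trajectories stable.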
- Book Chapter
- 10.1007/978-3-030-04182-3_28
- Jan 1, 2018
We present ASD (Action, Sequence, and Divide), a new framework for Hierarchical Reinforcement Learning (HRL). Existing HRL methods construct task hierarchies but fail to avoid needless exploration when tasks must be performed in a particular sequence, leaving the agent to explore all permutations of the tasks. When the task hierarchies are used within an ASD framework, the RL agent is better constrained, preventing it from pursuing invalid policies and enabling it to reach the optimal policy faster. The hierarchies created using the methods explained in this paper can be used to solve new episodes of the same environment, as well as similar instances of the problem, and can be used to establish an ordering of tasks. The objective is not only to complete the tasks but also to give the agent insight into the sequence of tasks that must be performed to solve a problem correctly. We present an algorithm to generate the hierarchies within an ASD framework. The algorithm has been evaluated on some standard RL domains, namely Taxi and Wargus, and is found to give correct results.