Reinforcement learning-based obstacle avoidance for continuum robots
This work presents a reinforcement learning framework for controlling a planar three-section continuum robot in environments with static obstacles. Assuming constant curvature for each section, the robot is trained to navigate toward a fixed goal while avoiding collisions with multiple static objects. A custom simulation environment was developed to support three levels of scenario difficulty (easy, medium, and hard), each with varying obstacle density and placement. The learning process is driven by the Deep Deterministic Policy Gradient (DDPG) algorithm, which enables smooth and continuous curvature control. Careful attention was paid to the design of the reward function and the network architecture, both of which were critical to achieving stable and reliable policy learning. Performance was evaluated across multiple runs, revealing that the agent successfully generalized its behavior across scenarios of increasing complexity. The proposed framework demonstrates the potential of reinforcement learning as a viable approach to safe and adaptive control in continuum robotic systems, with promising implications for applications such as medical navigation, search and rescue, and inspection in confined environments.
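The constant-curvature assumption mentioned in this abstract can be illustrated with a short sketch. Under that assumption each section is a circular arc, so the planar pose after a section follows from its curvature and arc length in closed form. The function names, the base pose at the origin, and the circular obstacle model below are illustrative assumptions and are not taken from the paper.

```python
import math

def section_step(x, y, theta, kappa, length):
    """Advance a planar pose (x, y, heading) along one constant-curvature arc."""
    if abs(kappa) < 1e-9:
        # Zero curvature: the section degenerates to a straight segment.
        return x + length * math.cos(theta), y + length * math.sin(theta), theta
    new_theta = theta + kappa * length
    # Closed-form integral of (cos, sin) of the heading along the arc.
    x += (math.sin(new_theta) - math.sin(theta)) / kappa
    y += (math.cos(theta) - math.cos(new_theta)) / kappa
    return x, y, new_theta

def backbone_points(kappas, lengths, samples_per_section=10):
    """Sample backbone points section by section, with the base at the origin."""
    x, y, theta = 0.0, 0.0, 0.0
    points = [(x, y)]
    for kappa, length in zip(kappas, lengths):
        ds = length / samples_per_section
        for _ in range(samples_per_section):
            x, y, theta = section_step(x, y, theta, kappa, ds)
            points.append((x, y))
    return points, theta

def min_clearance(points, obstacles):
    """Smallest distance from a sampled backbone point to a circular obstacle surface.

    Each obstacle is a hypothetical (center_x, center_y, radius) tuple.
    """
    return min(
        math.hypot(px - ox, py - oy) - radius
        for px, py in points
        for ox, oy, radius in obstacles
    )
```

A negative value from `min_clearance` would indicate that a sampled point lies inside an obstacle, which is one simple way a simulation could flag a collision.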
- Research Article
6
- 10.3390/machines10070496
- Jun 21, 2022
- Machines
Unmanned aerial vehicle (UAV) trajectory tracking control algorithms based on deep reinforcement learning generally train inefficiently in unknown environments and converge unstably. To address this, a Markov decision process (MDP) model for UAV trajectory tracking is established, and a state-compensated deep deterministic policy gradient (CDDPG) algorithm is proposed. An additional neural network (C-Net), whose input is the compensation state and whose output is the compensation action, is added to the network model of the deep deterministic policy gradient (DDPG) algorithm to assist exploration during training. The action output of the DDPG network is combined with the compensation output of the C-Net to form the action that interacts with the environment, enabling the UAV to track dynamic targets rapidly, accurately, continuously, and smoothly. In addition, random noise is added to the generated action to allow exploration within a certain range and make the action-value estimate more accurate. The OpenAI Gym tool is used to verify the proposed method, and the simulation results show that: (1) the proposed method significantly improves training efficiency by adding a compensation network and effectively improves accuracy and convergence stability; (2) under the same computer configuration, the computational cost of the proposed algorithm is basically the same as that of the QAC algorithm (an actor-critic algorithm based on the action value Q) and the DDPG algorithm; (3) during training, at the same tracking accuracy, the learning efficiency is about 70% higher than that of QAC and DDPG; (4) in the simulated tracking experiment, for the same training time, the tracking error of the proposed method after stabilization is about 50% lower than that of QAC and DDPG.
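The action composition described in this abstract can be sketched as follows. The abstract says only that the DDPG output and the C-Net output are combined and that random noise is added. The element-wise sum, the Gaussian noise, the clipping range, and the names `actor` and `c_net` are assumptions made for illustration.

```python
import random

def cddpg_action(actor, c_net, state, comp_state, noise_std=0.1,
                 low=-1.0, high=1.0, rng=random):
    """Combine actor and compensation outputs, then add exploration noise.

    `actor` and `c_net` are hypothetical callables that map a state to a list
    of action components; summing their outputs is an assumed combination rule.
    """
    base = actor(state)        # action proposed by the DDPG actor
    comp = c_net(comp_state)   # compensation action proposed by the C-Net
    action = []
    for a, c in zip(base, comp):
        noisy = a + c + rng.gauss(0.0, noise_std)   # exploration noise
        action.append(min(high, max(low, noisy)))   # clip to the action range
    return action
```

Setting `noise_std=0.0` gives the deterministic combined action, which is how such a policy would typically be evaluated after training.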
- Research Article
8
- 10.3390/s24175667
- Aug 30, 2024
- Sensors (Basel, Switzerland)
In the traditional Deep Deterministic Policy Gradient (DDPG) algorithm, path planning for mobile robots in mapless environments still encounters challenges regarding learning efficiency and navigation performance, particularly adaptability and robustness to static and dynamic obstacles. To address these issues, in this study, an improved algorithm frame was proposed that designs the state and action spaces, and introduces a multi-step update strategy and a dual-noise mechanism to improve the reward function. These improvements significantly enhance the algorithm's learning efficiency and navigation performance, rendering it more adaptable and robust in complex mapless environments. Compared to the traditional DDPG algorithm, the improved algorithm shows a 20% increase in the stability of the navigation success rate with static obstacles along with a 25% reduction in pathfinding steps for smoother paths. In environments with dynamic obstacles, there is a remarkable 45% improvement in success rate. Real-world mobile robot tests further validated the feasibility and effectiveness of the algorithm in true mapless environments.
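The abstract does not spell out its multi-step update strategy. One common reading is the n-step return, in which the critic target sums several discounted rewards before bootstrapping from the target critic. The sketch below assumes that reading, and the function and parameter names are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, terminal=False):
    """n-step critic target: discounted reward sum plus a discounted bootstrap.

    `rewards` holds the n rewards collected after the sampled transition and
    `bootstrap_value` stands for the target critic's value n steps later.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    if not terminal:
        # Bootstrap only if the episode did not end within the n steps.
        target += (gamma ** len(rewards)) * bootstrap_value
    return target
```

With a single reward this reduces to the one-step target used by standard DDPG.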
- Research Article
- 10.1109/access.2025.3595873
- Jan 1, 2025
- IEEE Access
Piezoelectric inkjet printing has important applications in high-precision manufacturing, but precise control of droplet quality faces challenges such as the nonlinear coupling between drive waveform parameters and droplet characteristics, as well as environmental disturbances. To address these issues, this paper proposes a closed-loop droplet control method based on a curriculum deep deterministic policy gradient (DDPG) algorithm, with the standard deviation of droplet volume and velocity as the main optimization objectives. To tackle the complex mapping relationships and boost training efficiency, we first built a high-fidelity simulation environment using a four-layer deep residual network trained on real-world industrial ink droplet data. Following that, we designed a dynamic weight-adaptive composite reward mechanism and integrated a curriculum learning strategy with the Deep Deterministic Policy Gradient (DDPG) algorithm. Relying on our self-developed droplet observation platform, we achieve closed-loop control of the standard deviation of droplet volume and velocity. Experimental results show that this method can precisely control the droplet volume within ±0.1 pl of the target value, while keeping the velocity standard deviation below 40 mm/s. The adjustment process is stable without severe oscillations, and the method demonstrates strong robustness against environmental disturbances such as ink supply pressure. This provides an efficient and practical solution for the automated control of droplet quality in piezoelectric inkjet printing.
- Book Chapter
2
- 10.1007/978-3-030-93049-3_11
- Jan 1, 2021
Lane changing is a decision-making problem that involves interaction between vehicles, which makes the lane-change state space of an intelligent vehicle difficult to describe with a rule-based decision system. The deep deterministic policy gradient (DDPG) algorithm performs well for autonomous driving decisions, but it still converges slowly and has a high collision probability during learning when applied to lane changes. We therefore propose an improved DDPG algorithm with a barrier function (DDPG-BF) to address these problems. The barrier function is constructed from the safety distance required for lane changes, and DDPG optimization is improved by guiding the vehicle to choose actions within the safety constraints. Simulation results on TORCS confirmed that the proposed method converged within hundreds of training episodes and reduced the unsafe behavior ratio to less than 0.05. Compared with the DDPG and FEC-DDPG algorithms, the proposed method improves the convergence speed of learning and maintains a safe distance between vehicles during lane changes.
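The abstract does not give the form of its barrier function. As a rough illustration only, the sketch below uses a capped logarithmic barrier on the margin between the current gap and the required safety distance, applied as a reward penalty. The functional form, the cap, the weight, and all names are assumptions, not the construction used in DDPG-BF.

```python
import math

def barrier_penalty(gap, safe_gap, cap=10.0):
    """Capped log-barrier penalty that grows as the gap approaches the safety distance."""
    margin = gap - safe_gap
    if margin <= 0.0:
        # Safety distance violated: return the maximum penalty.
        return cap
    # Zero penalty once the margin is at least one safety distance.
    return min(cap, max(0.0, -math.log(margin / safe_gap)))

def shaped_reward(task_reward, gap, safe_gap, weight=1.0):
    """Task reward minus the weighted barrier penalty."""
    return task_reward - weight * barrier_penalty(gap, safe_gap)
```

A penalty of this kind steers the policy away from the constraint boundary before a violation occurs, which matches the stated goal of choosing actions within safety constraints.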
- Research Article
1
- 10.2478/amns-2024-3426
- Jan 1, 2024
- Applied Mathematics and Nonlinear Sciences
Compared with an individual microgrid, a multi-microgrid (MMG) system can enhance the overall utilization of renewable energy, effectively improve the operational stability of local microgrids, and reduce dependence on the main grid. However, energy management of an MMG encounters significant challenges due to the complex interaction between different microgrids. To tackle this issue, this paper introduces a non-cooperative game-based optimal scheduling market trading model for an MMG composed of various renewable energy sources, completing trade decisions while ensuring information independence. Considering the real-time changes in environmental transition functions and complex scheduling scenarios, the multi-agent deep deterministic policy gradient (MADDPG) algorithm is employed, which modifies the experience replay mechanism and Markov process of the basic deep deterministic policy gradient (DDPG) algorithm. Compared to traditional multi-microgrid system scheduling algorithms, the method presented in this paper does not require individual predictions of state variables, achieves end-to-end training from agent states to actions, and ensures the information security and autonomous decision-making of each microgrid.
- Research Article
37
- 10.37965/jait.2021.12003
- Dec 7, 2021
- Journal of Artificial Intelligence and Technology
Aiming at intelligent decision-making for UAVs based on situation information in air combat, this paper proposes a novel maneuvering decision method based on deep reinforcement learning. The autonomous maneuvering model of the UAV is formulated as a Markov Decision Process. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm are used to train the model, and the experimental results of the two algorithms are analyzed and compared. The simulation results show that, compared with the DDPG algorithm, the TD3 algorithm has stronger decision-making performance and faster convergence speed, and is more suitable for solving combat problems. The algorithm proposed in this paper enables UAVs to autonomously make maneuvering decisions based on situation information such as position, speed, and relative azimuth, and to adjust their actions to approach and successfully strike the enemy, providing a new method for UAVs to make intelligent maneuvering decisions during air combat.
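For readers comparing the two algorithms named here, the main difference in the critic target can be shown in a few lines. DDPG bootstraps from a single target critic, while TD3 takes the minimum of two target critics and evaluates them at a target action perturbed by clipped noise. The scalar interface and default values below are simplifying assumptions, and TD3's delayed actor update is not shown.

```python
def ddpg_target(reward, done, q_next, gamma=0.99):
    """DDPG critic target from a single target critic value."""
    return reward + (0.0 if done else gamma * q_next)

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD3 critic target: the smaller of two target critic values limits overestimation."""
    return reward + (0.0 if done else gamma * min(q1_next, q2_next))

def smoothed_target_action(target_action, noise, noise_clip=0.5, low=-1.0, high=1.0):
    """TD3 target policy smoothing: add clipped noise, then clip to the action range."""
    clipped_noise = max(-noise_clip, min(noise_clip, noise))
    return max(low, min(high, target_action + clipped_noise))
```

When the two critics disagree, the TD3 target is the more pessimistic one, which is the mechanism usually credited for its more stable learning.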
- Research Article
7
- 10.3390/sym13061061
- Jun 12, 2021
- Symmetry
Autonomous driving based on deep reinforcement learning algorithms is a research hotspot. Traditional autonomous driving requires human involvement, and autonomous driving algorithms based on supervised learning must be trained in advance using human experience. To deal with autonomous driving problems, this paper proposes an improved end-to-end deep deterministic policy gradient (DDPG) algorithm based on the convolutional block attention mechanism, called the multi-input attention prioritized deep deterministic policy gradient (MAPDDPG) algorithm. The actor network and the critic network of the model share the same symmetric structure. Meanwhile, the attention mechanism is introduced to help the vehicles focus on useful environmental information. The experiments are conducted in The Open Racing Car Simulator (TORCS), and the results of five experiment runs on the test tracks are averaged to obtain the final result. Compared with the state-of-the-art algorithm, the maximum reward increases from 62,207 to 116,347, and the average speed increases from 135 km/h to 193 km/h, while the number of success episodes to complete a circle increases from 96 to 147. Also, the variance of the distance from the vehicle to the center of the road is compared, and the result indicates that the variance of the DDPG is 0.6 m while that of the MAPDDPG is only 0.2 m. The above results indicate that the proposed MAPDDPG achieves excellent performance.
- Research Article
23
- 10.3389/fninf.2023.1096053
- Jan 23, 2023
- Frontiers in Neuroinformatics
Aiming at the poor robustness and adaptability of traditional control methods for different situations, the deep deterministic policy gradient (DDPG) algorithm is improved by designing a hybrid function that includes different rewards superimposed on each other. In addition, the experience replay mechanism of DDPG is also improved by combining priority sampling and uniform sampling to accelerate the DDPG's convergence. Finally, it is verified in the simulation environment that the improved DDPG algorithm can achieve accurate control of the robot arm motion. The experimental results show that the improved DDPG algorithm can converge in a shorter time, and the average success rate in the robotic arm end-reaching task is as high as 91.27%. Compared with the original DDPG algorithm, it has more robust environmental adaptability.
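The combined sampling scheme mentioned in this abstract can be sketched as a batch drawn partly in proportion to priority and partly uniformly. The 50/50 split, the use of raw priorities as sampling weights, and the function name are assumptions, since the abstract does not state how the two sampling modes are mixed.

```python
import random

def mixed_sample(priorities, batch_size, priority_fraction=0.5, rng=random):
    """Return buffer indices: some drawn by priority, the rest uniformly.

    `priorities` holds one non-negative weight per stored transition and must
    contain at least one positive value when priority sampling is used.
    """
    n = len(priorities)
    n_priority = int(round(batch_size * priority_fraction))
    n_uniform = batch_size - n_priority
    prioritized = []
    if n_priority > 0:
        # Sample with replacement, proportionally to priority.
        prioritized = rng.choices(range(n), weights=priorities, k=n_priority)
    uniform = [rng.randrange(n) for _ in range(n_uniform)]
    return prioritized + uniform
```

The uniform share keeps low-priority transitions in circulation, while the prioritized share focuses updates on transitions judged more informative.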
- Conference Article
10
- 10.1109/vppc53923.2021.9699253
- Oct 1, 2021
Deep reinforcement learning-based energy management strategy (EMS) is a state-of-the-art technology for hybrid electric vehicles (HEVs). This paper proposes a novel EMS based on an improved deep deterministic policy gradient (DDPG) algorithm with prioritized replay for a power-split plug-in hybrid electric bus (PHEB) to improve the fuel economy of the PHEB as well as the learning efficiency of DDPG. First, prioritized experience replay is incorporated into DDPG to use samples more efficiently. Second, a real-world speed profile collected from a fixed bus route, rather than short-distance standard driving cycles, is used to train the improved DDPG algorithm until it converges completely. The superiority of the proposed EMS in terms of learning efficiency and fuel economy is validated under another real-world speed profile that differs from the training dataset. Simulation results indicate that the proposed EMS improves fuel economy by 3.22% and significantly improves learning efficiency compared with the DDPG-based EMS.
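Prioritized experience replay is usually implemented in its proportional form, where a transition's sampling probability grows with the magnitude of its TD error and importance-sampling weights correct the resulting bias. The sketch below assumes that widely used form. The abstract does not say which variant or which exponents the authors chose, so `alpha`, `beta`, and `eps` are illustrative defaults.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    scaled = [(abs(error) + eps) ** alpha for error in td_errors]
    total = sum(scaled)
    return [value / total for value in scaled]

def importance_weights(probabilities, beta=0.4):
    """Importance-sampling weights (N * P(i)) ** -beta, normalized by the maximum."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    largest = max(raw)
    return [weight / largest for weight in raw]
```

Setting `alpha=0` recovers uniform sampling, and `beta=1` fully compensates for the non-uniform sampling probabilities.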
- Research Article
34
- 10.3390/en12183461
- Sep 7, 2019
- Energies
Reinforcement learning has potential in the area of intelligent transportation due to its generality and real-time capability. The Q-learning algorithm, one of the earliest reinforcement learning algorithms, has its own merits for solving the train timetable rescheduling (TTR) problem. However, it falls short in two respects: dimensional limits on the action space and a slow convergence rate. In this paper, a deep deterministic policy gradient (DDPG) algorithm is applied to solve the energy-aimed train timetable rescheduling (ETTR) problem. As a reinforcement learning algorithm, it fulfills the real-time requirements of the ETTR problem and adapts to random disturbances. Unlike Q-learning, DDPG operates on continuous state and action spaces. After sufficient training, the DDPG-based learning agent takes proper action by continuously adjusting the cruising speed and the dwelling time of each train in a metro network when random disturbances happen. Although training requires thousands of episodes, the policy decision during each testing episode takes a very short time. Models of the metro network, based on a real case of Shanghai Metro Line 1, are established as the training and testing environment. To validate the energy-saving effect and the real-time capability of the proposed algorithm, four experiments are designed and conducted. Compared with the no-action strategy, the results show that the proposed algorithm performs in real time and saves a significant percentage of energy under random disturbances.
- Research Article
1
- 10.1109/tmech.2024.3419803
- Jun 1, 2025
- IEEE/ASME Transactions on Mechatronics
This article proposes a swing-up control method combining deep deterministic policy gradient (DDPG) algorithm and transfer learning method for vertical n-link underactuated manipulators with two passive joints, which provides an effective control strategy to guarantee control effectiveness and robustness. In the proposed strategy, the n-link manipulator is first reduced to an equivalent three-link manipulator with a single active joint. Next, the deep deterministic policy gradient algorithm is employed to pretrain an agent, which controls all links of the equivalent three-link manipulator to the straight-up position, regardless of their respective angular velocities. Then, policy transfer is implemented using transfer learning method. In this process, we leverage the pretrained agent as the initial agent and train it using the deep deterministic policy gradient algorithm with redesigned reward function. Through this training, the obtained agent can successfully swing all links of the manipulator to approach the straight-up position, with their angular velocities close to zero, realizing the control objective, which is to stabilize the manipulator at the straight-up position. Simulations of different scenarios are carried out to demonstrate the effectiveness of the presented swing-up control method.
- Research Article
5
- 10.3390/en18123046
- Jun 9, 2025
- Energies
The Dual Active Bridge converter (DABC), known for its bidirectional power transfer capability and high efficiency, plays a crucial role in various applications, particularly in electric vehicles (EVs), where it facilitates energy storage, battery charging, and grid integration. When paired with a high-performance CLLC filter, it is valuable across a range of energy applications. While these features make the DABC highly efficient, they also complicate controller design due to nonlinear behavior, fast switching, and sensitivity to component variations. We have used a Fractional-order PID (FOPID) controller to benefit from the simple structure of classical PID controllers with lower complexity and improved flexibility because of additional filtering gains adopted in this method. However, for a FOPID controller to operate effectively under real-time conditions, its parameters must adapt continuously to changes in the system. To achieve this adaptability, a Multi-Agent Reinforcement Learning (MARL) approach is adopted, where each gain of the controller is tuned individually using the Deep Deterministic Policy Gradient (DDPG) algorithm. This structure enhances the controller’s ability to respond to external disturbances with greater robustness and adaptability. Meanwhile, finding the best initial gains in the RL structure can decrease the overall efficiency and tracking performance of the controller. To overcome this issue, the Grey Wolf Optimization (GWO) algorithm is proposed to identify the most suitable initial gains for each agent, providing faster adaptation and consistent performance during the training process. The complete approach is tested using a Hardware-in-the-Loop (HIL) platform, where results confirm accurate voltage control and resilient dynamic behavior under practical conditions.
In addition, the controller’s performance was validated under a battery management scenario where the DAB converter interacts with a nonlinear lithium-ion battery. The controller successfully regulated the State of Charge (SOC) through automated charging and discharging transitions, demonstrating its real-time adaptability for BMS-integrated EV systems. Consequently, the proposed MARL-FOPID controller reported better disturbance-rejection performance in different working cases compared to other conventional methods.
- Research Article
19
- 10.1109/access.2023.3341507
- Jan 1, 2023
- IEEE Access
Deep Reinforcement Learning (DRL) allows agents to make decisions in a specific environment based on a reward function, without prior knowledge. Adapting hyperparameters significantly impacts the learning process and time. Precise estimation of hyperparameters during DRL training poses a major challenge. To tackle this problem, this study utilizes Grey Wolf Optimization (GWO), a metaheuristic algorithm, to optimize the hyperparameters of the Deep Deterministic Policy Gradient (DDPG) algorithm for achieving optimal control strategy in two simulated Gymnasium environments provided by OpenAI. The ability to adapt hyperparameters accurately contributes to faster convergence and enhanced learning, ultimately leading to more efficient control strategies. The proposed DDPG-GWO algorithm is evaluated in the 2DRobot and MountainCarContinuous simulation environments, chosen for their ease of implementation. Our experimental results reveal that optimizing the hyperparameters of the DDPG using the GWO algorithm in the Gymnasium environments maximizes the total rewards during testing episodes while ensuring the stability of the learning policy. This is evident in comparing our proposed DDPG-GWO agent with optimized hyperparameters and the original DDPG. In the 2DRobot environment, the original DDPG had rewards ranging from -150 to -50, whereas, in the proposed DDPG-GWO, they ranged from -100 to 100 with a running average between 1 and 800 across 892 episodes. In the MountainCarContinuous environment, the original DDPG struggled with negative rewards, while the proposed DDPG-GWO achieved rewards between 20 and 80 over 218 episodes with a total of 490 timesteps.
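The search loop behind a GWO-based hyperparameter tuner can be sketched with the standard grey wolf position update, in which each candidate moves toward the three best candidates found in the current population. In the study the objective would involve training and scoring a DDPG agent. Here a cheap quadratic stand-in is used, and the objective, the bounds, and all names are hypothetical.

```python
import random

def grey_wolf_optimize(objective, bounds, n_wolves=8, n_iters=30, seed=0):
    """Minimize `objective` over box `bounds` with the standard GWO update (n_wolves >= 3)."""
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_fit = None, float("inf")
    history = []
    for it in range(n_iters):
        ranked = sorted(wolves, key=objective)
        leaders = [list(w) for w in ranked[:3]]   # alpha, beta, delta (copies)
        fit = objective(leaders[0])
        if fit < best_fit:
            best_pos, best_fit = list(leaders[0]), fit
        history.append(best_fit)
        a = 2.0 * (1.0 - it / n_iters)            # decreases linearly from 2 toward 0
        for wolf in wolves:
            for d in range(dim):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    distance = abs(C * leader[d] - wolf[d])
                    estimate += leader[d] - A * distance
                lo, hi = bounds[d]
                wolf[d] = min(hi, max(lo, estimate / 3.0))   # average, then clamp
    final = min(wolves, key=objective)            # score the last update as well
    if objective(final) < best_fit:
        best_pos, best_fit = list(final), objective(final)
    return best_pos, best_fit, history

def toy_objective(params):
    """Hypothetical stand-in for the negative return of a DDPG run.

    `params` is read as (log10 actor learning rate, log10 critic learning rate).
    """
    return (params[0] + 3.0) ** 2 + (params[1] + 3.0) ** 2
```

A call such as `grey_wolf_optimize(toy_objective, [(-5.0, -1.0), (-5.0, -1.0)])` returns the best parameter vector found, its objective value, and the best-so-far value per iteration.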
- Research Article
101
- 10.1016/j.neucom.2020.03.063
- Apr 8, 2020
- Neurocomputing
Adaptive neuro-fuzzy PID controller based on twin delayed deep deterministic policy gradient algorithm
- Research Article
1
- 10.1088/1742-6596/2637/1/012006
- Nov 1, 2023
- Journal of Physics: Conference Series
With the rapid development of robotics technology, robotic arm grasping has gained significant attention in the fields of automation and artificial intelligence. In this study, we propose a fractional-order deep deterministic policy gradient (DDPG) algorithm for optimizing robotic arm grasping tasks. Traditional machine learning algorithms face challenges in handling continuous action spaces, while the DDPG algorithm effectively addresses this issue. In this research, we first review the background and challenges of robotic arm grasping and provide an overview of the application of traditional reinforcement learning algorithms in grasping tasks. Subsequently, we introduce the principles and fundamental ideas of the DDPG algorithm in detail, discussing its potential for optimizing robotic arm grasping. To further enhance the performance of robotic arm grasping, we propose an improved approach based on fractional-order control. Fractional-order control exhibits unique advantages in environmental dynamics modeling and grasp posture optimization, enhancing the robustness and adaptability of robotic arm grasping. Through a series of experiments, we validate the effectiveness and superiority of the fractional-order DDPG algorithm in robotic arm grasping tasks. Our algorithm achieves significant improvements in grasping success rate and stability compared to traditional methods. The experimental results demonstrate that the fractional-order DDPG algorithm is better equipped to handle control challenges in continuous action spaces and optimize the performance of robotic arm grasping tasks.