For the three-dimensional impact angle-constrained time-coordination guidance problem, a value-policy decomposed multi-agent twin delayed deep deterministic (VPD-MATD3) policy gradient algorithm is proposed in this paper, and a temporal-spatial cooperative guidance law is designed accordingly. The derived cooperative engagement kinematics is formulated as a value-policy decomposed partially observable Markov decision process. Value decomposition and policy decomposition are proposed, respectively, to address the two bottlenecks of reward coupling and policy coupling in 3D guidance, thereby improving the training process. A training environment with multiple uncertainties, including observation noise, maneuverability perturbation, autopilot time lag, switching communication topology, and random initialization, is then constructed. Time-energy reward functions are designed, incorporating energy consumption as a direct optimization indicator via a penalty term. A long short-term memory (LSTM) network layer is inserted into the policy networks to suppress the negative effect of observation noise. The policy training curves verify the effectiveness of the proposed algorithm in improving training convergence. The policy testing results demonstrate the robustness and generalization of the VPD-MATD3 cooperative guidance law and its ability to cope with randomly switching communication topologies. Moreover, comparative testing further reveals its time-energy suboptimal property.
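The LSTM-augmented policy structure mentioned above can be pictured with a minimal sketch. The example below assumes PyTorch and uses illustrative names and layer sizes (obs_dim, act_dim, hidden_dim) that are not taken from the paper; it only shows the general idea of an actor that filters a short history of noisy observations through an LSTM before emitting a bounded guidance command.

```python
# Hypothetical sketch of an LSTM-augmented actor for a TD3-style policy;
# dimensions and layer choices are illustrative, not the paper's design.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Policy network with an LSTM layer to smooth noisy observation sequences."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)                    # per-step observation embedding
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)    # temporal filtering of observation noise
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
            nn.Tanh(),                                                   # bounded acceleration command
        )

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim) sequence of noisy local observations
        z = torch.relu(self.encoder(obs_seq))
        z, hidden = self.lstm(z, hidden)
        action = self.head(z[:, -1])                                     # act on the latest hidden state
        return action, hidden
```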