Jamming decision-making is a pivotal component of modern electromagnetic warfare, and recent years have witnessed the extensive application of deep reinforcement learning to enhance the autonomy and intelligence of wireless communication jamming decisions. However, existing research relies heavily on manually designed, task-customized jamming reward functions, which consume significant human and computational resources. To avoid designing such task-customized reward functions, we propose a jamming policy optimization method that learns from imperfect demonstrations to address the complex, high-dimensional jamming resource allocation problem against frequency hopping spread spectrum (FHSS) communication systems. Specifically, a policy network is designed to determine the jamming scheme for each jamming node in sequence, which constructs the dynamic transition of the underlying Markov decision process. Building on a dual-trust-region concept, we then design a policy improvement phase and a policy adversarial imitation phase: the former refines the policy with trust region policy optimization (TRPO), while the latter employs adversarial training to guide policy exploration using the information embedded in the demonstrations. Extensive simulation results show that, even with a rough binary reward setting, the proposed method approximates the optimal jamming performance obtained by training with customized reward functions, and significantly surpasses the demonstration performance.
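To make the alternation between the two phases concrete, the following is a minimal, illustrative sketch assuming PyTorch. All names (PolicyNet, Discriminator, train_step), dimensions, and hyperparameters are hypothetical and not taken from the paper, and the full TRPO conjugate-gradient step is replaced by a simpler KL-penalized surrogate with a mean baseline standing in for a learned critic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 16, 8  # hypothetical environment dimensions

class PolicyNet(nn.Module):
    """Outputs a categorical jamming scheme for a single jamming node."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, ACTION_DIM))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Discriminator(nn.Module):
    """Scores (state, action) pairs: demonstration-like vs. policy-generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.Tanh(), nn.Linear(64, 1))
    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1))

policy, old_policy, disc = PolicyNet(), PolicyNet(), Discriminator()
old_policy.load_state_dict(policy.state_dict())
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

def train_step(states, demo_states, demo_actions, beta=0.01):
    # Sample on-policy actions from the frozen old policy.
    with torch.no_grad():
        dist_old = old_policy(states)
        actions = dist_old.sample()
        logp_old = dist_old.log_prob(actions)
    a_oh = F.one_hot(actions, ACTION_DIM).float()
    demo_oh = F.one_hot(demo_actions, ACTION_DIM).float()

    # Adversarial imitation phase: the discriminator learns to separate
    # demonstration pairs (label 1) from policy-generated pairs (label 0).
    d_loss = (F.binary_cross_entropy_with_logits(
                  disc(demo_states, demo_oh), torch.ones(len(demo_states), 1))
              + F.binary_cross_entropy_with_logits(
                  disc(states, a_oh), torch.zeros(len(states), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Imitation reward -log(1 - D): larger when the policy fools the
    # discriminator, steering exploration toward demonstration behavior.
    with torch.no_grad():
        reward = -F.logsigmoid(-disc(states, a_oh)).squeeze(-1)
        adv = reward - reward.mean()  # crude baseline in place of a critic

    # Policy improvement phase: a KL-penalized surrogate keeps the update
    # inside a trust region around the old policy (simplified TRPO stand-in).
    dist = policy(states)
    ratio = torch.exp(dist.log_prob(actions) - logp_old)
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()
    pi_loss = -(ratio * adv).mean() + beta * kl
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    old_policy.load_state_dict(policy.state_dict())

# Placeholder tensors standing in for real rollouts and demonstrations.
states = torch.randn(32, STATE_DIM)
demo_states = torch.randn(32, STATE_DIM)
demo_actions = torch.randint(0, ACTION_DIM, (32,))
train_step(states, demo_states, demo_actions)
```

Alternating these two phases lets the demonstration data shape the reward signal without any hand-crafted reward function, while the trust-region penalty keeps each policy update conservative.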