In this paper, we focus on trajectory optimization for multiple unmanned aerial vehicles (UAVs) operating in three-dimensional space and serving as aerial base stations (BSs) to provide wireless coverage for Internet of Things (IoT) devices. The IoT devices are randomly distributed in a three-dimensional region, with unknown locations and channel parameters. By jointly optimizing the UAV trajectories and transmit power, we aim to minimize the total communication time required to serve all IoT devices. However, the unknown UAV flight time and IoT device locations make it difficult to formulate a tractable relationship between the objective and the optimization variables using traditional convex optimization. To deal with this intractable problem, we propose a solution based on Hierarchical Reinforcement Learning (HRL): we model the problem as a Markov Decision Process (MDP) and divide the training process into two stages, each trained with the Proximal Policy Optimization (PPO) algorithm. Our simulation results show that the proposed HRL approach offers promising performance in both the training and testing phases.
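To make the two-stage HRL training concrete, below is a minimal sketch in Python, assuming (hypothetically) a high-level policy that selects which IoT device to serve next and a low-level policy that controls 3D motion and transmit power. The environment dynamics, reward shaping, and dimensions are illustrative placeholders, not the paper's actual formulation; the sketch only shows how each stage could be trained separately with a standard PPO implementation (stable-baselines3 and gymnasium are assumed to be available).

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

N_DEVICES = 5          # assumed number of IoT devices (illustrative)
AREA = 100.0           # assumed side length of the 3D region in meters (illustrative)

class LowLevelEnv(gym.Env):
    """Toy low-level MDP: steer one UAV toward a subgoal while choosing transmit power."""
    def __init__(self):
        super().__init__()
        # observation: UAV position (3) + current subgoal position (3)
        self.observation_space = spaces.Box(-AREA, AREA, shape=(6,), dtype=np.float32)
        # action: 3D velocity command + transmit power level, all scaled to [-1, 1]
        self.action_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.uav = self.np_random.uniform(0, AREA, size=3).astype(np.float32)
        self.goal = self.np_random.uniform(0, AREA, size=3).astype(np.float32)
        self.t = 0
        return np.concatenate([self.uav, self.goal]), {}

    def step(self, action):
        self.uav = np.clip(self.uav + 5.0 * action[:3], 0, AREA)
        dist = np.linalg.norm(self.uav - self.goal)
        power = 0.5 * (action[3] + 1.0)               # map power action to [0, 1]
        # reward: approach the device quickly, penalize elapsed time and power use
        reward = -dist / AREA - 0.1 - 0.05 * power
        self.t += 1
        done = dist < 2.0
        truncated = self.t >= 200
        obs = np.concatenate([self.uav, self.goal]).astype(np.float32)
        return obs, float(reward), bool(done), bool(truncated), {}

class HighLevelEnv(gym.Env):
    """Toy high-level MDP: pick which unserved IoT device to visit next."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(0.0, 1.0, shape=(N_DEVICES,), dtype=np.float32)
        self.action_space = spaces.Discrete(N_DEVICES)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.served = np.zeros(N_DEVICES, dtype=np.float32)
        return self.served.copy(), {}

    def step(self, action):
        # reward encourages covering a not-yet-served device (shortening total service time)
        reward = 1.0 if self.served[action] == 0 else -1.0
        self.served[action] = 1.0
        done = bool(self.served.all())
        return self.served.copy(), reward, done, False, {}

# Stage 1: train the low-level trajectory/power controller with PPO.
low_model = PPO("MlpPolicy", LowLevelEnv(), verbose=0)
low_model.learn(total_timesteps=10_000)

# Stage 2: train the high-level device-selection policy with PPO.
high_model = PPO("MlpPolicy", HighLevelEnv(), verbose=0)
high_model.learn(total_timesteps=10_000)
```

In this sketch the two stages are decoupled for simplicity; in a full HRL pipeline the low-level controller would execute the subgoals emitted by the high-level policy, and the high-level reward would depend on the low-level rollouts.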