Abstract Multimodal perception is crucial for mobile robots to achieve safe autonomous steering in complex, changing environments without colliding with obstacles. Although multimodal data can provide rich environmental information, how to effectively integrate complementary features from different sources remains an open issue. In this paper, a simple yet effective multimodal deep reinforcement learning (DRL) method, CMADRL, is proposed that fuses two input modalities, depth images and pseudo-LiDAR data generated by an RGB-D camera, to improve on single-sensor environmental perception. Convolutional LSTM (ConvLSTM) layers are adopted to extract spatiotemporal features and capture temporal relationships across consecutive time steps, allowing the robot to anticipate dynamic changes in the environment. In addition, a cross-modal fusion module is designed to optimize the multimodal fusion scheme by dynamically weighting and merging features from different modalities, so that the agent focuses more appropriately on the current state and the obstacle-avoidance policy gains in accuracy and efficiency. Experimental results show that the proposed approach performs strongly in terms of cumulative reward, convergence speed, and success rate. Tests in both simulated and real-world environments further verify that collisions are effectively avoided using only a low-cost RGB-D camera, demonstrating the method's strong generalization capability.