Unmanned Aerial Vehicles (UAVs) offer superior maneuverability and large-scale mobility in 3D space, making them ideal platforms for a variety of autonomous applications. However, camera-equipped UAVs struggle with 3D situational awareness, which limits their use in 3D space, particularly in Lake Search and Rescue (LSAR) scenarios. In this paper, we propose a new LSAR Aerial 3D detection dataset (LA3D) and a pioneering end-to-end aerial Monocular-GPS Fusion 3D detector (MGF3D) to improve UAVs’ 3D understanding capability and narrow the gap between research and real-world application. LA3D is collected over lake-surface scenes by a purpose-built hybrid-wing UAV platform and provides high-quality annotated 3D bounding boxes, aligned GPS altitudes, and comprehensive evaluation metrics. MGF3D aggregates multi-modal features carrying rich geometric and semantic information and feeds them into a 3D detection head for the final predictions. To achieve this, we design the Altitude Encoder (AE) module to extract geometric features from GPS altitude and introduce the Image Encoder (IE) to encode semantic image features. We further propose the plug-and-play Monocular-GPS Fusion (MGF) module, which fuses the GPS altitude features with the image features, adaptively learns 3D spatial information, and enhances long-distance depth estimation. Extensive experiments on the LA3D dataset show that MGF3D significantly improves aerial 3D detection accuracy. Furthermore, MGF3D is deployed on a UAV embedded platform to perform a challenging real-world LSAR flight task, demonstrating the effectiveness and practicality of LA3D and MGF3D. We plan to release the dataset and code.