Abstract

Unmanned aerial vehicles (UAVs) offer strong maneuverability and a wide field of view, and their application to real-time terrain perception and target capture is an active frontier topic in UAV cooperative situational awareness. LiDAR-based methods have unique advantages, but data acquisition is difficult and costly. In comparison, multi-view stereo (MVS)-based perception exploits rich image information at low cost and the data are relatively easy to collect; MVS also has potential applications in human–computer interaction and robot positioning (unstructured environment perception, visual servoing). To reduce the dependence on costly ground truth, current self-supervised MVS methods compute the training loss by assuming that the same spatial point projected into multiple views shares the same RGB values. In real scenes, however, interference factors such as specular reflection, illumination changes, and noise cause large color variations across the source images, which makes it difficult to enforce photometric consistency between views with this idealized self-supervised loss. To achieve more accurate scene perception, a novel self-supervised MVS method integrated with a generative adversarial network (GAN) is proposed. The overall framework is stacked twice, feeding real images and synthesized images into the network in turn, and adversarial learning is introduced to jointly discriminate the images. In addition, a consistency loss between the depth maps predicted from real and synthesized images is designed to improve the robustness of feature matching and the resistance to color noise. To the best of our knowledge, the proposed work is the first end-to-end depth inference model in the MVS field to introduce GAN-based adversarial training, which is an effective complement to self-supervised scene reconstruction. Comprehensive experiments on public datasets show that the proposed method is comparable to mainstream self-supervised MVS approaches.
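To make the loss structure described above concrete, the following is a minimal sketch of how such a self-supervised objective is commonly assembled; the symbols, warping operator, and weighting terms are illustrative assumptions, not the paper's exact formulation.

% Illustrative sketch (assumed notation, not the authors' exact definitions).
% Photometric consistency: each source view I_i is warped into the reference
% view via the predicted depth D and known camera parameters, then compared
% with the reference image I_ref under a validity mask M_i.
\mathcal{L}_{\mathrm{photo}} = \sum_{i} \big\| M_i \odot \big( I_{\mathrm{ref}} - \mathcal{W}_i(I_i, D) \big) \big\|_1

% Consistency between depths predicted from real and synthesized inputs.
\mathcal{L}_{\mathrm{dc}} = \big\| D_{\mathrm{real}} - D_{\mathrm{syn}} \big\|_1

% Total objective, with an adversarial term supplied by the GAN discriminator
% and hypothetical balancing weights \lambda_1, \lambda_2.
\mathcal{L} = \mathcal{L}_{\mathrm{photo}} + \lambda_{1}\,\mathcal{L}_{\mathrm{dc}} + \lambda_{2}\,\mathcal{L}_{\mathrm{adv}}

Under this reading, the depth consistency term is what couples the two passes of the stacked framework: if real and synthesized inputs of the same scene yield different depths, the network is penalized even when each photometric term alone is satisfied.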
