Autonomous vehicles and mobile robotic systems are typically equipped with multiple sensors to provide redundancy. By integrating the observations from different sensors, these mobile agents are able to perceive the environment and estimate system states, e.g., locations and orientations. Although deep learning (DL) approaches for multimodal odometry estimation and localization have gained traction, they rarely address robust sensor fusion, a necessary consideration for dealing with noisy or incomplete sensor observations in the real world. Moreover, current deep odometry models suffer from a lack of interpretability. To this end, we propose SelectFusion, an end-to-end selective sensor fusion module that can be applied to useful pairs of sensor modalities, such as monocular images and inertial measurements, depth images, and light detection and ranging (LIDAR) point clouds. Our model is a uniform framework that is not restricted to a specific modality or task. During prediction, the network is able to assess the reliability of the latent features from different sensor modalities and to estimate trajectory at scale as well as global pose. In particular, we propose two fusion modules, a deterministic soft fusion and a stochastic hard fusion, and offer a comprehensive study of these new strategies compared with trivial direct fusion. We extensively evaluate all fusion strategies both on public datasets and on progressively degraded datasets that present synthetic occlusions, noisy and missing data, and temporal misalignment between sensors, and we investigate the effectiveness of the different fusion strategies in attending to the most reliable features, which in itself provides insights into the operation of the various models.
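To make the distinction between the two fusion strategies concrete, the sketch below contrasts a deterministic soft fusion (continuous sigmoid reweighting of latent features) with a stochastic hard fusion (binary keep/drop masks sampled via the Gumbel-softmax trick). This is a minimal illustration assuming a PyTorch implementation; the class names, layer choices, and feature dimensions are hypothetical and not the authors' released code.

```python
# Minimal sketch of soft vs. hard selective fusion of two latent feature vectors.
# Assumes PyTorch; SoftFusion/HardFusion and all dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftFusion(nn.Module):
    """Deterministic fusion: reweight each channel with a continuous sigmoid mask."""

    def __init__(self, dim_a: int, dim_b: int):
        super().__init__()
        self.mask_net = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([feat_a, feat_b], dim=-1)
        mask = torch.sigmoid(self.mask_net(cat))  # weights in (0, 1), one per channel
        return cat * mask                         # deterministic reweighting


class HardFusion(nn.Module):
    """Stochastic fusion: sample a binary keep/drop mask per channel (Gumbel-softmax)."""

    def __init__(self, dim_a: int, dim_b: int):
        super().__init__()
        dim = dim_a + dim_b
        self.logit_net = nn.Linear(dim, dim * 2)  # two logits (keep, drop) per channel

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([feat_a, feat_b], dim=-1)
        logits = self.logit_net(cat).view(*cat.shape, 2)
        # Straight-through Gumbel-softmax: discrete mask in the forward pass,
        # differentiable surrogate gradient in the backward pass.
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]
        return cat * mask


# Usage with hypothetical 256-d visual and 128-d inertial features (batch of 4).
visual, inertial = torch.randn(4, 256), torch.randn(4, 128)
fused_soft = SoftFusion(256, 128)(visual, inertial)
fused_hard = HardFusion(256, 128)(visual, inertial)
```

The intuition carried by this sketch is that soft fusion can attenuate unreliable channels gradually, while hard fusion makes an explicit, interpretable keep-or-drop decision per feature, which is what allows the degraded-sensor studies in the abstract to inspect which modality the network is actually trusting.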