Abstract

Camera-based semantic scene completion (SSC) aims to jointly infer the 3D volumetric occupancy and semantic categories of a scene from a single RGB image. Its main challenge, compared with RGB-D SSC, is the lack of geometric information. Although depth estimated from the RGB image helps to some extent, its quality falls far short of what SSC demands. To address this problem, we propose a NAS-based multi-modal fusion method that incorporates semantic and geometric information from intermediate representations (predicted depth and predicted 2D segmentation) to form a more robust 2D feature representation. A key idea of this design is that explicit 2D semantic information can alleviate the misleading 3D distortions introduced by estimated depth. Specifically, we propose the Confidence-Block, which automatically learns an optimal architecture for routing and for obtaining the depth-prediction confidence, and a two-level fusion search space that decomposes the fusion search into a fusion-stage search space and a fusion-operation search space. Moreover, we propose a confidence-aware 2D–3D projection module to alleviate 3D projection errors. Extensive experiments show that our method outperforms the state-of-the-art method by a large margin using a single RGB image on the NYU and NYUCAD datasets.
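The abstract names a confidence-aware 2D–3D projection module but gives no implementation details. As a rough illustration of the general idea only, the minimal PyTorch sketch below shows one plausible way to scatter 2D features into a voxel grid while down-weighting pixels whose predicted depth is uncertain; the function name `confidence_aware_projection`, its arguments, and the voxelization scheme are hypothetical assumptions for this sketch, not the paper's actual design.

```python
import torch

def confidence_aware_projection(feat2d, depth, conf, K, grid_dims, voxel_size):
    """Hypothetical sketch: project 2D features into a 3D voxel grid,
    weighting each pixel's contribution by its depth-prediction confidence.

    feat2d:    (C, H, W) 2D feature map
    depth:     (H, W) predicted depth in meters
    conf:      (H, W) per-pixel depth confidence in [0, 1]
    K:         (3, 3) camera intrinsics (torch tensor)
    grid_dims: (Gx, Gy, Gz) voxel grid resolution
    """
    C, H, W = feat2d.shape
    Gx, Gy, Gz = grid_dims
    # Back-project every pixel to a 3D point in the camera frame.
    v, u = torch.meshgrid(torch.arange(H, dtype=feat2d.dtype),
                          torch.arange(W, dtype=feat2d.dtype), indexing="ij")
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    # Quantize to voxel indices (x/y centered on the optical axis, z forward).
    ix = (x / voxel_size).long() + Gx // 2
    iy = (y / voxel_size).long() + Gy // 2
    iz = (z / voxel_size).long()
    valid = ((ix >= 0) & (ix < Gx) & (iy >= 0) & (iy < Gy)
             & (iz >= 0) & (iz < Gz))
    flat = ((ix * Gy + iy) * Gz + iz)[valid]   # flattened voxel index per pixel
    w = conf[valid]                            # confidence weights
    f = feat2d[:, valid] * w                   # (C, N) confidence-weighted features
    # Scatter-add weighted features and weights, then normalize so each voxel
    # holds a confidence-weighted mean of the pixels that project into it.
    vol = feat2d.new_zeros(C, Gx * Gy * Gz)
    wsum = feat2d.new_zeros(Gx * Gy * Gz)
    vol.index_add_(1, flat, f)
    wsum.index_add_(0, flat, w)
    return (vol / wsum.clamp(min=1e-6)).view(C, Gx, Gy, Gz)
```

Normalizing by the accumulated confidence rather than by a plain pixel count means that pixels with unreliable depth still contribute, but voxels dominated by low-confidence projections carry correspondingly attenuated features, which is one way the projection error the abstract mentions could be softened.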
