Depth completion aims to reconstruct a dense depth map from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient context to reconstruct focal contexts such as object shapes, while RGB images contain redundant context, including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and subsequent fusion, and thus to the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework, the Context-Aligned Fusion Network (CAFNet). CAFNet comprises two stages: a context-aligned stage and a full-scale stage. In the context-aligned stage, CAFNet downsamples the input RGB-D pairs to a scale at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps carrying the fused multimodal context from the previous stage are upsampled to the original scale and subsequently fused with full-scale depth features by the GF module, which uses a dynamic masked fusion strategy. Finally, accurate dense depth maps are reconstructed from the GF module's resultant features. Experiments on indoor and outdoor benchmark datasets show that CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.
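The sketch below illustrates the two-stage pipeline described above as a minimal PyTorch model: a context-aligned stage that downsamples and fuses RGB and depth features, followed by a full-scale stage that upsamples the fused context and gates it against full-scale depth features before predicting dense depth. All module names (`ContextFusion`, `GatedFusion`, `CAFNetSketch`), channel sizes, and the downsampling factor are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a two-stage context-aligned fusion pipeline (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """Plain conv-BN-ReLU block used as a stand-in for the encoders."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class ContextFusion(nn.Module):
    """Stand-in for a CF module: fuses low-resolution RGB and depth features."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = conv_block(2 * ch, ch)

    def forward(self, rgb_feat, depth_feat):
        return self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))


class GatedFusion(nn.Module):
    """Stand-in for the GF module: a dynamically predicted mask blends the
    upsampled context features with full-scale depth features."""
    def __init__(self, ch):
        super().__init__()
        self.mask_head = nn.Conv2d(2 * ch, 1, 3, padding=1)
        self.out = conv_block(2 * ch, ch)

    def forward(self, context_feat, depth_feat):
        x = torch.cat([context_feat, depth_feat], dim=1)
        mask = torch.sigmoid(self.mask_head(x))           # dynamic fusion mask
        fused = mask * context_feat + (1 - mask) * depth_feat
        return self.out(torch.cat([fused, depth_feat], dim=1))


class CAFNetSketch(nn.Module):
    def __init__(self, ch=32, scale=4):
        super().__init__()
        self.scale = scale
        self.rgb_encoder = conv_block(3, ch)        # low-resolution RGB encoder
        self.depth_encoder = conv_block(1, ch)      # low-resolution depth encoder
        self.cf = ContextFusion(ch)
        self.full_depth_encoder = conv_block(1, ch)  # full-scale depth branch
        self.gf = GatedFusion(ch)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)   # dense depth prediction

    def forward(self, rgb, sparse_depth):
        # Context-aligned stage: downsample both modalities before fusion.
        rgb_lr = F.interpolate(rgb, scale_factor=1 / self.scale,
                               mode="bilinear", align_corners=False)
        depth_lr = F.interpolate(sparse_depth, scale_factor=1 / self.scale,
                                 mode="nearest")
        fused_lr = self.cf(self.rgb_encoder(rgb_lr), self.depth_encoder(depth_lr))

        # Full-scale stage: upsample fused context, gate it with depth features.
        fused_full = F.interpolate(fused_lr, size=sparse_depth.shape[-2:],
                                   mode="bilinear", align_corners=False)
        depth_full = self.full_depth_encoder(sparse_depth)
        return self.head(self.gf(fused_full, depth_full))


if __name__ == "__main__":
    net = CAFNetSketch()
    rgb = torch.randn(1, 3, 64, 64)
    sparse = torch.randn(1, 1, 64, 64)
    print(net(rgb, sparse).shape)  # torch.Size([1, 1, 64, 64])
```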