Abstract

The recent advent of deep convolutional neural networks (CNNs) in stereo matching has led to significant improvements. However, current CNN-based methods still struggle to incorporate hierarchical context information with global dependencies, and their feature representations lack the discriminative ability needed to resolve matching ambiguities in ill-conditioned regions. To address these problems, we propose an improved stereo matching framework that combines a stereo backbone network with an embedded, independent multilevel attention subnetwork in an end-to-end trainable pipeline. The stereo backbone network applies residual atrous spatial pyramid pooling integrated with channelwise attention to capture richer multiscale contextual information and selectively enhance discriminative features. The resulting unary features are then concatenated to construct a cost volume for disparity prediction. To further improve performance, the embedded multilevel attention subnetwork learns globally coherent contextual information and generates three attention streams, which are used to enrich the unary feature representations with spatial encoding, enhance the quality of the cost volume, and refine the disparity map, respectively. We show that appending the proposed multilevel attention subnetwork to the stereo backbone network produces significant improvements in matching accuracy. Experimental results on the Scene Flow and KITTI 2012/2015 datasets demonstrate that our method achieves competitive performance in stereo matching.
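
For intuition, the sketch below illustrates two of the ingredients named in the abstract in PyTorch-style Python: a squeeze-and-excitation-style channelwise attention block and a concatenation-based cost volume over candidate disparities. This is a minimal illustrative sketch, not the authors' implementation; the class and function names, the SE-style formulation of the channel attention, and the GC-Net/PSMNet-style cost-volume convention are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channelwise attention (illustrative assumption)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Global average pool -> per-channel weights -> rescale the feature map.
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w.unsqueeze(-1).unsqueeze(-1)


def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenation-based cost volume over candidate disparities
    (GC-Net/PSMNet convention, assumed here for illustration)."""
    b, c, h, w = left_feat.shape
    cost = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = left_feat
            cost[:, c:, d] = right_feat
        else:
            # Shift the right features by d pixels before concatenation.
            cost[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            cost[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return cost
```

In a pipeline of this kind, the channel attention would typically be applied to the multiscale features produced by the atrous spatial pyramid pooling stage, and the resulting left/right unary features would feed `build_cost_volume` before 3D cost aggregation and disparity regression.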
