Abstract
In recent years, convolutional neural network (CNN) based algorithms have driven great progress in stereo matching, but mismatches still occur in textureless, occluded and reflective regions. During feature extraction and cost aggregation, a CNN can greatly improve matching accuracy by exploiting global context information and high-quality feature representations. In this paper, we design a novel end-to-end stereo matching algorithm named Multi-Attention Network (MAN). To capture global context information in detail at the pixel level, we propose a Multi-Scale Attention Module (MSAM), which combines a spatial pyramid module with an attention mechanism, during feature extraction. In addition, we introduce a feature refinement module (FRM) and a 3D attention aggregation module (3D AAM) during cost aggregation, so that the network can extract informative features with high representational ability and high-quality channel attention vectors. Finally, we obtain the final disparity map through bilinear interpolation and disparity regression. We evaluate our method on the Scene Flow, KITTI 2012 and KITTI 2015 stereo datasets. The experimental results show that our method achieves state-of-the-art performance and that every component of our network is effective.
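The abstract refers to disparity regression on the upsampled cost volume. As a hedged illustration only, the sketch below assumes the commonly used soft-argmin disparity regression (popularized by GC-Net); the function name and tensor layout are assumptions for illustration, and MAN's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost_volume, max_disp):
    """Soft-argmin disparity regression (hypothetical sketch).

    cost_volume: (B, max_disp, H, W) aggregated matching costs.
    Returns the expected disparity per pixel, shape (B, H, W).
    """
    # Lower cost -> higher probability via softmax over the disparity axis.
    prob = F.softmax(-cost_volume, dim=1)
    disp_values = torch.arange(max_disp, dtype=prob.dtype,
                               device=prob.device).view(1, max_disp, 1, 1)
    # Expectation over candidate disparities gives a sub-pixel estimate.
    return torch.sum(prob * disp_values, dim=1)
```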
Highlights
Binocular stereo vision simulates the operating principle of biological vision systems
To make better use of the global context information for stereo matching, we propose a novel convolutional neural network
We introduce an image feature refinement module to enhance the representation of feature maps at each stage
Summary
Binocular stereo vision simulates the operating principle of biological vision systems. The MC-CNN disparity estimation method proposed by Zbontar et al. [24] pioneered the use of a Siamese network to compute the similarity between two image patches for stereo matching. Mayer et al. created a large synthetic dataset to train an end-to-end network called DispNet [26] to estimate disparity; DispNet consists of a set of convolution layers to extract features, a cost volume formed by patch-wise correlation, an encoder-decoder structure for second-stage processing, and a classification layer to estimate disparity. To make better use of the global context information for stereo matching, we propose a novel convolutional neural network. We introduce a 3D attention aggregation module, which uses high-level information to guide low-level texture information and produces high-quality channel attention vectors. Our MAN achieves state-of-the-art performance on the Scene Flow dataset and the KITTI 2012 and KITTI 2015 stereo benchmarks.
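To make the idea of a channel attention vector over a 3D cost volume concrete, here is a minimal, hypothetical squeeze-and-excitation style sketch in PyTorch. The class name, reduction ratio and placement are assumptions for illustration; it does not reproduce the paper's exact 3D AAM design.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Hypothetical channel attention over a 3D cost volume (B, C, D, H, W)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Small bottleneck MLP that produces one attention weight per channel.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        # Squeeze: global average over disparity and spatial dimensions.
        w = self.fc(x.mean(dim=(2, 3, 4)))      # (B, C) channel attention vector
        # Excite: re-weight each channel of the cost volume.
        return x * w.view(b, c, 1, 1, 1)
```

In this sketch, the attention vector is computed from globally pooled (high-level) statistics and then rescales the channels of the aggregated cost volume, which is one common way such channel guidance is realized.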