Abstract

In recent years, convolutional neural network (CNN) algorithms have driven great progress in stereo matching, but mismatches still occur in textureless, occluded and reflective regions. In feature extraction and cost aggregation, CNNs can greatly improve the accuracy of stereo matching by exploiting global context information and high-quality feature representations. In this paper, we design a novel end-to-end stereo matching algorithm named Multi-Attention Network (MAN). To capture global context information in detail at the pixel level, we propose a Multi-Scale Attention Module (MSAM), which combines a spatial pyramid module with an attention mechanism during image feature extraction. In addition, we introduce a feature refinement module (FRM) and a 3D attention aggregation module (3D AAM) during cost aggregation so that the network can extract informative features with high representational ability and high-quality channel attention vectors. Finally, we obtain the final disparity through bilinear interpolation and disparity regression. We evaluate our method on the Scene Flow, KITTI 2012 and KITTI 2015 stereo datasets. The experimental results show that our method achieves state-of-the-art performance and that every component of our network is effective.

Highlights

  • Binocular stereo vision simulates the operating principle of biological vision systems

  • To make better use of the global context information for stereo matching, we propose a novel convolutional neural network

  • We introduce an image feature refinement module to enhance the representation of feature maps at each stage


Summary

INTRODUCTION

Binocular stereo vision simulates the operating principle of biological vision systems. The MC-CNN disparity estimation method proposed by Zbontar et al. [24] pioneered the use of a Siamese network to compute the similarity between two image patches for stereo matching. Mayer et al. created a large synthetic dataset to train an end-to-end network called DispNet [26] to estimate disparity; DispNet consists of a set of convolution layers to extract features, a cost volume formed by patch-wise correlation, an encoder-decoder structure for the second-stage processing, and a classification layer to estimate disparity. To make better use of the global context information for stereo matching, we propose a novel convolutional neural network. We introduce a 3D aggregation attention module, which uses high-level information to guide low-level texture information and produces high-quality channel attention vectors. Our MAN achieves state-of-the-art performance on the Scene Flow dataset and the KITTI stereo 2012 and KITTI stereo 2015 benchmarks.

RELATED WORK
FEATURE EXTRACTION
DISPARITY REGRESSION AND LOSS FUNCTION
EXPERIMENTS AND DISCUSSION
Findings
CONCLUSION

