Abstract

In this article, we propose an end-to-end real-time stereo matching network (RTSMNet). RTSMNet consists of three modules. The global and local feature extraction (GLFE) module captures hierarchical context information and generates a coarse cost volume. The initial disparity estimation module is a compact three-dimensional convolution architecture that rapidly produces a low-resolution (LR) disparity map. The feature-guided spatial attention upsampling module takes the LR disparity map and the shared features from the GLFE module as guidance: it first estimates residual disparity values and then applies an attention mechanism to generate context-aware adaptive kernels for each upsampled pixel. The adaptive kernels place higher attention weights on reliable areas, which significantly reduces blurred edges and recovers thin structures. The proposed networks achieve 66∼175 fps on a 2080 Ti GPU and 11∼42 fps on edge computing devices, with accuracy competitive with state-of-the-art methods on multiple benchmarks.
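To make the upsampling idea concrete, below is a minimal PyTorch sketch of feature-guided, attention-based disparity upsampling as the abstract describes it: a residual head refines the LR disparity, and an attention head predicts context-aware adaptive kernels that blend neighboring disparity hypotheses for each upsampled pixel. The module name, channel counts, window size, and upsampling factor are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch: feature-guided spatial attention upsampling for disparity maps.
# Layer names, channel counts, and the 3x3 window are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureGuidedUpsampler(nn.Module):
    """Upsample an LR disparity map by `scale`, guided by shared GLFE-style features.

    1. Predict a residual disparity from the LR disparity and guidance features.
    2. Predict per-pixel attention weights over a local window (adaptive kernels).
    3. Combine neighboring disparity hypotheses with softmax-normalized weights.
    """

    def __init__(self, feat_channels=32, scale=4, ksize=3):
        super().__init__()
        self.scale, self.ksize = scale, ksize
        in_ch = feat_channels + 1  # guidance features + LR disparity
        self.residual_head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        # One attention weight per (upsampled sub-pixel, window position).
        self.attn_head = nn.Conv2d(in_ch, scale * scale * ksize * ksize, 3, padding=1)

    def forward(self, disp_lr, guide_feat):
        b, _, h, w = disp_lr.shape
        x = torch.cat([disp_lr, guide_feat], dim=1)

        # Step 1: refine the LR disparity with a predicted residual.
        disp_ref = disp_lr + self.residual_head(x)

        # Step 2: context-aware adaptive kernels, softmax over the local window.
        attn = self.attn_head(x).view(b, self.scale ** 2, self.ksize ** 2, h, w)
        attn = F.softmax(attn, dim=2)

        # Gather a ksize*ksize neighborhood of disparity hypotheses per LR pixel.
        neigh = F.unfold(disp_ref, self.ksize, padding=self.ksize // 2)
        neigh = neigh.view(b, 1, self.ksize ** 2, h, w)

        # Step 3: weighted combination, then rearrange sub-pixels into the HR grid.
        up = (attn * neigh).sum(dim=2)                     # (b, scale^2, h, w)
        up = F.pixel_shuffle(up, self.scale) * self.scale  # scale disparity values too
        return up                                          # (b, 1, h*scale, w*scale)


# Usage sketch: upsample a 4x-downsampled disparity map with 32-channel guidance features.
if __name__ == "__main__":
    disp_lr = torch.rand(1, 1, 64, 128) * 48.0
    guide = torch.rand(1, 32, 64, 128)
    disp_hr = FeatureGuidedUpsampler(feat_channels=32, scale=4)(disp_lr, guide)
    print(disp_hr.shape)  # torch.Size([1, 1, 256, 512])
```

Because the softmax weights are predicted from the guidance features, the window positions that lie across an object boundary can be suppressed, which is the mechanism the abstract credits for reducing blurred edges and recovering thin structures.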
