Abstract

Stereo matching is the key technology in stereo vision. Given a pair of rectified images, stereo matching determines correspondences between them and estimates depth from the disparity between corresponding pixels. Recent work has shown that depth estimation from a stereo pair of images can be formulated as a supervised learning task with an end-to-end framework based on convolutional neural networks (CNNs). However, 3D CNNs place a great burden on memory and computation, which in turn significantly increases computation time. To alleviate this issue, atrous convolution was proposed to reduce the number of convolutional operations via a relatively sparse receptive field. However, this sparse receptive field makes it difficult to find reliable corresponding points in fuzzy areas, e.g., occluded and untextured areas, owing to the loss of rich contextual information. To address this problem, we propose Group-based Atrous Convolution Spatial Pyramid Pooling (GASPP) to robustly segment objects at multiple scales with affordable computing resources. The main feature of the GASPP module is that the convolutional layers within each group use consecutive dilation rates, which reduces the impact of the holes introduced by atrous convolution on network performance. Moreover, we introduce a tailored cascade cost volume in pyramid form to reduce memory consumption and meet real-time requirements. The group-based atrous convolution stereo matching network is evaluated on the street-scene benchmark KITTI 2015 and on Scene Flow, and achieves state-of-the-art performance.
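As a rough illustration of the idea behind atrous convolution and the GASPP grouping (a minimal 1-D sketch, not the paper's implementation; the function names and the example dilation rates 2, 3, 4 are our own), the snippet below shows how dilation enlarges the receptive field without adding weights, and how branches with consecutive dilation rates sample interleaved grids so the holes of one branch are covered by another:

```python
def dilated_conv1d(signal, kernel, dilation):
    """1-D atrous (dilated) convolution with 'valid' padding.

    The effective receptive field is (len(kernel) - 1) * dilation + 1,
    so a larger dilation rate sees wider context with the same weights,
    but samples the input sparsely (with holes) along the way.
    """
    k = len(kernel)
    span = (k - 1) * dilation + 1  # effective receptive field
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(len(signal) - span + 1)
    ]


def gaspp_group(signal, kernel, rates=(2, 3, 4)):
    """Sketch of one GASPP group: parallel atrous branches with
    *consecutive* dilation rates, whose sampling grids interleave so
    that positions skipped by one branch are covered by another."""
    return [dilated_conv1d(signal, kernel, r) for r in rates]
```

With dilation 1 this reduces to an ordinary convolution; with dilation 2 a 3-tap kernel already covers a span of 5 input samples.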

Highlights

  • Some complex advanced visual tasks depend on depth perception [1], such as robot control and navigation [2], three-dimensional measurement [3], unmanned aerial vehicles (UAVs), virtual reality, and micro-operating system parameter detection, showing the significance of distance information acquisition for vision tasks

  • To incorporate large context and compute feature maps more densely, whilst reducing the loss of local information and improving the accuracy of the disparity map, we propose the Group-based Atrous Convolution Spatial Pyramid Pooling (GASPP)

  • This paper presents a group-based atrous convolution pyramid pooling module, which uses densely arranged atrous convolutions to form multiscale receptive fields, reduce the loss of local information, and improve matching accuracy

Summary

Introduction

Some complex advanced visual tasks depend on depth perception [1], such as robot control and navigation [2], three-dimensional measurement [3], unmanned aerial vehicles (UAVs), virtual reality, and micro-operating system parameter detection, showing the significance of distance information acquisition for vision tasks. Stereo matching estimates depth by matching pixels from a rectified image pair captured by two cameras; the goal is to obtain distance and contextual information from disparity quickly and accurately. An end-to-end disparity estimation network, as one of the CNN-based algorithms, integrates all steps of the stereo matching pipeline for joint optimization and produces dense disparity maps from stereo images directly. End-to-end stereo matching networks are able to generate highly accurate depth estimates from stereo image pairs, but they require huge memory and computation. Our contributions are twofold: (1) The proposed GASPP module captures multiscale context information from various receptive fields while reducing the network size significantly. (2) The tailored cascade cost volume is constructed by changing the output channels and utilizing a pyramid cascade structure, making disparity calculation efficient; the model is also less dependent on the batch size.
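To make the cost-volume idea concrete (a minimal single-scanline sketch under simplifying assumptions, not the paper's cascaded 3D construction; the function names, the absolute-difference cost, and the winner-take-all readout are illustrative choices of ours), the snippet below builds a matching cost for each candidate disparity on one rectified scanline and reads off the lowest-cost disparity per pixel:

```python
def cost_volume(left, right, max_disp):
    """Per-pixel matching cost for each candidate disparity.

    left, right: 1-D lists of per-pixel feature values on a single
    scanline of a rectified pair (so correspondences share the row).
    Returns volume[d][x] = |left[x] - right[x - d]|, with an infinite
    cost where the shifted pixel falls outside the image.
    """
    width = len(left)
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d >= 0:
                row.append(abs(left[x] - right[x - d]))
            else:
                row.append(float("inf"))  # no valid match at this shift
        volume.append(row)
    return volume


def winner_take_all(volume):
    """Pick the lowest-cost disparity per pixel (argmin over d)."""
    width = len(volume[0])
    return [min(range(len(volume)), key=lambda d: volume[d][x])
            for x in range(width)]
```

A cascade version would repeat this coarse-to-fine: estimate disparity at low resolution, then search only a narrow range around that estimate at the next scale, which is what keeps memory and computation affordable.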

Background
Related Work
Group-Based Atrous Convolution Stereo Network
Experiment
Experiment Details
Conclusions