Even after many years of research, stereo matching remains a challenging task in photogrammetry and computer vision. Recent work has made great progress by formulating dense stereo matching as a pixel-wise learning task resolved with a deep convolutional neural network (CNN). However, most estimation methods, both traditional and deep-learning-based, still struggle with challenging real-world scenarios, especially those involving large depth discontinuities and low-texture areas. To tackle these problems, we investigate a recently proposed end-to-end disparity learning network, DispNet (Mayer et al., 2015), and improve it to yield better results in these problematic areas. The improvements comprise three major contributions. First, we use dilated convolutions to develop a context pyramidal feature extraction module. A dilated convolution expands the receptive field when extracting features and aggregates more contextual information, which makes our network more robust in weakly textured areas. Second, we construct the matching cost volume with patch-based correlation to handle larger disparities; we also modify the basic encoder-decoder module to regress detailed disparity images at full resolution. Third, instead of relying on post-processing steps to impose smoothness in the presence of depth discontinuities, we incorporate disparity gradient information into the loss function as a gradient regularizer, preserving local structural details in areas with large depth discontinuities. We evaluate our model in terms of end-point error on several challenging stereo datasets, including Scene Flow, Sintel, and KITTI. Experimental results demonstrate that our model reduces estimation error compared with DispNet on most datasets (e.g., a 46% improvement on Sintel) and produces disparity maps that better preserve structure. Moreover, our approach achieves competitive performance compared with other methods.
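The receptive-field expansion that motivates the dilated-convolution module can be illustrated with a minimal 1-D sketch (this is illustrative NumPy code, not the paper's network; the function name and shapes are assumptions). With dilation d and kernel size k, each output taps a span of d*(k-1)+1 inputs, so stacking dilations aggregates context without extra parameters.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    # 'valid' 1-D convolution with gaps of `dilation` between kernel taps.
    # A kernel of size k then covers a receptive field of dilation*(k-1)+1
    # input samples, which is how dilation enlarges context cheaply.
    k = len(w)
    span = dilation * (k - 1) + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(16, dtype=float)
w = np.ones(3)
# Same 3-tap kernel, growing receptive field: 3, 5, 9 samples.
outs = {d: dilated_conv1d(x, w, d) for d in (1, 2, 4)}
```

A context pyramid in this spirit would run several such branches with different dilation rates in parallel and concatenate their feature maps.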
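The patch-based correlation used to build the matching cost volume can be sketched as follows (a naive NumPy illustration under assumed names and a single-channel feature map, not the paper's implementation): for each candidate disparity d, a patch around each left-image position is correlated with the patch shifted d pixels in the right image.

```python
import numpy as np

def correlation_volume(left, right, max_disp, patch=3):
    # Naive patch-based correlation cost volume.
    # cost[d, i, j] = mean of elementwise product between the patch at (i, j)
    # in `left` and the patch at (i, j - d) in `right` (edge-padded).
    h, w = left.shape
    r = patch // 2
    cost = np.zeros((max_disp + 1, h, w))
    lp = np.pad(left, r, mode='edge')
    rp = np.pad(right, r, mode='edge')
    for d in range(max_disp + 1):
        for i in range(h):
            for j in range(w):
                jj = max(j - d, 0)  # clamp shifted column at the image border
                a = lp[i:i + patch, j:j + patch]
                b = rp[i:i + patch, jj:jj + patch]
                cost[d, i, j] = (a * b).mean()
    return cost
```

Correlating patches rather than single feature vectors makes the matching signal less ambiguous, which is what allows larger disparity ranges to be handled; a real network would compute this over learned multi-channel features with vectorized shifts.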
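One plausible form of the gradient regularizer described above is an L1 penalty on the difference between predicted and ground-truth disparity gradients, added to the plain disparity error (a hedged sketch: the exact loss, weighting `lam`, and function name here are assumptions, not the paper's formulation).

```python
import numpy as np

def gradient_regularized_loss(pred, gt, lam=0.5):
    # Mean absolute disparity error ...
    l1 = np.abs(pred - gt).mean()
    # ... plus an L1 penalty on the mismatch of horizontal and vertical
    # finite-difference gradients, so sharp (correct) depth edges are kept
    # while spurious gradient differences are penalized.
    gx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    gy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return l1 + lam * (gx + gy)
```

Note that a constant disparity offset incurs only the L1 term, while a smoothed-over depth edge is additionally penalized through the gradient terms; this is how such a regularizer preserves local structure without post-processing.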