Abstract
Recently, great progress has been made in formulating dense disparity estimation as a pixel-wise learning task to be solved by deep convolutional neural networks. However, most of the resulting pixel-wise disparity maps show little detail for small structures. In this paper, we propose a two-stage architecture: we first learn initial disparities using an initial network, and then employ a disparity refinement network, guided by the initial results, which directly learns disparity corrections. Based on the initial disparities, we construct a residual cost volume between shared left and right feature maps within a potential disparity residual interval, which can capture more detailed context information. Then, the right feature map is warped with the initial disparity, and a reconstruction error volume is constructed between the warped right feature map and the original left feature map, which provides a measure of the correctness of the initial disparities. The main contribution of this paper is to combine the residual cost volume and the reconstruction error volume to guide the training of the refinement network. We use a shallow encoder-decoder module in the refinement network and learn from coarse to fine, which simplifies the learning problem. We evaluate our method on several challenging stereo datasets. Experimental results demonstrate that our refinement network significantly improves the overall accuracy, reducing the estimation error by 30% compared with our initial network. Moreover, our network achieves competitive performance compared with other CNN-based methods.
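The warping step and the reconstruction error volume described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear interpolation along the horizontal axis, the border clamping, and the per-channel absolute difference are all assumptions made for illustration, and real feature maps would come from the shared feature extractor rather than from random arrays.

```python
import numpy as np

def warp_right_to_left(right_feat, disparity):
    """Sample right_feat at column x - d(y, x), interpolating linearly along x.

    right_feat: (C, H, W) feature map; disparity: (H, W) map in pixels.
    Hypothetical helper, not taken from the paper.
    """
    C, H, W = right_feat.shape
    xs = np.arange(W, dtype=np.float64)[None, :] - disparity  # source columns, (H, W)
    xs = np.clip(xs, 0.0, W - 1.0)        # assumption: clamp at the image border
    x0 = np.floor(xs).astype(int)         # left neighbour column
    x1 = np.minimum(x0 + 1, W - 1)        # right neighbour column
    w1 = xs - x0                          # interpolation weight of x1
    rows = np.arange(H)[:, None]          # (H, 1), broadcasts against (H, W)
    return (1.0 - w1) * right_feat[:, rows, x0] + w1 * right_feat[:, rows, x1]

def reconstruction_error(left_feat, right_feat, disparity):
    """Per-channel absolute difference between left and warped right features."""
    warped = warp_right_to_left(right_feat, disparity)
    return np.abs(left_feat - warped)     # (C, H, W); small where disparity is right
```

Where the initial disparity is correct, the warped right features line up with the left features and the error is close to zero; large values flag pixels where the initial estimate is likely wrong.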
Highlights
Stereo matching has been investigated for many years and remains a challenging task in photogrammetry and computer vision
We introduce two interpretable inputs, namely the residual cost volume and the reconstruction error volume, as guidance for learning disparity details
In our approach, we propose a residual cost volume and a reconstruction error volume, which we argue can be better interpreted as inputs for residual learning
Summary
Stereo matching has been investigated for many years and remains a challenging task in photogrammetry and computer vision. We train a residual network, guided by the residual cost volume and the reconstruction error volume, to learn disparity residuals and estimate the final depth map by adding the learned residuals to the initial disparity. In this way, the refinement sub-net can concentrate on learning more accurate results, especially in problem areas where the initial network fails. Compared to the initial network, the residual cost volume considers a significantly shorter range of disparity at finer resolution, so the complexity of learning is lower than that of learning the disparity for these pixels directly. For this reason, we can employ a shallow encoder-decoder module in our refinement sub-net, and we learn multiple residuals from coarse to fine, which allows our approach to correct errors and refine details progressively.
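To make the idea of a short, finely sampled residual range concrete, the following NumPy sketch builds a residual cost volume around an initial disparity and adds a residual back to it. It is only an illustration under stated assumptions: the names `sample_right`, `residual_cost_volume`, `max_residual`, and `step` are hypothetical, the matching cost is a mean absolute feature difference chosen for simplicity, and the argmin over candidates merely stands in for the residual that the refinement network learns.

```python
import numpy as np

def sample_right(right_feat, xs):
    """Linearly interpolate right_feat (C, H, W) at fractional columns xs (H, W)."""
    C, H, W = right_feat.shape
    xs = np.clip(xs, 0.0, W - 1.0)        # assumption: clamp at the image border
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = xs - x0
    rows = np.arange(H)[:, None]
    return (1.0 - w1) * right_feat[:, rows, x0] + w1 * right_feat[:, rows, x1]

def residual_cost_volume(left_feat, right_feat, init_disp, max_residual=2.0, step=0.5):
    """Matching cost for each residual candidate r around the initial disparity.

    Returns (volume, candidates): volume has shape (K, H, W), lower is better.
    """
    C, H, W = left_feat.shape
    candidates = np.arange(-max_residual, max_residual + step / 2, step)
    cols = np.arange(W, dtype=np.float64)[None, :]
    volume = np.empty((len(candidates), H, W))
    for k, r in enumerate(candidates):
        shifted = sample_right(right_feat, cols - (init_disp + r))
        volume[k] = np.abs(left_feat - shifted).mean(axis=0)  # assumed L1 cost
    return volume, candidates

def refine(init_disp, volume, candidates):
    """Stand-in for the learned residual: pick the lowest-cost candidate and add it."""
    residual = candidates[volume.argmin(axis=0)]
    return init_disp + residual
```

With `max_residual=2.0` and `step=0.5`, only nine candidates per pixel are evaluated at half-pixel spacing, compared with the full disparity range an initial network has to search.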