The accuracy of stereo matching has been greatly improved with the advent of convolutional neural networks. However, existing methods still perform poorly in low-texture and reflective regions due to insufficient matching information. In this paper, we propose a novel semantic stereo matching network, GSSNet, which combines global context information with high-level semantic cues to further enrich the matching features. Specifically, we employ a Swin Transformer as a joint feature extractor to capture multi-scale global features for both semantic segmentation and disparity estimation. By progressively fusing low-resolution multi-scale feature maps, we construct a Global Context Volume (GCV) representation with a long-range receptive field for initial disparity estimation. The initial disparities are then refined, via residual learning, with a high-level embedding learned from the semantic segmentation branch of the network. To address the lack of joint annotations during training, we also propose a simple and effective pseudo-labeling strategy. Comprehensive experiments on multiple datasets demonstrate the effectiveness of the proposed GSSNet. Among all published methods as of February 9, 2024, our approach ranks 1st on the KITTI 2012 leaderboard and 3rd on the KITTI 2015 leaderboard, and produces competitive results on the Scene Flow and Cityscapes datasets. Code will be available at https://github.com/Twil-7/GSSNet.
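To make the residual refinement step concrete, the following is a minimal PyTorch sketch of refining an initial disparity map with a semantic embedding via residual learning. The module name, channel sizes, and layer layout are illustrative assumptions, not the authors' implementation (see the repository above for the actual code).

```python
import torch
import torch.nn as nn

class ResidualDisparityRefiner(nn.Module):
    """Hypothetical sketch: predict a per-pixel residual from the initial
    disparity and a semantic-branch embedding, then add it back
    (d_refined = d_init + residual). Not the authors' exact design."""

    def __init__(self, sem_channels: int = 32, hidden: int = 64):
        super().__init__()
        # Concatenate the 1-channel disparity with the semantic embedding,
        # then regress a 1-channel disparity correction.
        self.net = nn.Sequential(
            nn.Conv2d(1 + sem_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),  # disparity residual
        )

    def forward(self, init_disp: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([init_disp, sem_feat], dim=1)
        return init_disp + self.net(x)

# Toy usage on random (B, C, H, W) tensors.
refiner = ResidualDisparityRefiner()
d0 = torch.rand(1, 1, 64, 128) * 192   # initial disparities
sem = torch.randn(1, 32, 64, 128)      # semantic-branch embedding
d1 = refiner(d0, sem)                  # refined disparities
```

The residual formulation lets the semantic branch correct the initial estimate in ambiguous regions (e.g., low texture or reflections) without having to re-predict disparity from scratch.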