Abstract

Recent stereo matching networks adopt 4D cost volumes and 3D convolutions for processing those volumes. Although these methods achieve high accuracy, they have an inherent disadvantage: they require a great deal of computing resources and memory. These requirements limit their applicability in mobile environments, which are subject to inherent computing hardware constraints. Both accuracy and the consumption of computing resources are important, and improving both at the same time is a non-trivial task. To address this problem, we propose a simple yet efficient network, called the Sequential Feature Fusion Network (SFFNet), which sequentially generates and processes the cost volume using only 2D convolutions. The main building block of our network is the Sequential Feature Fusion (SFF) module, which generates a 3D cost volume covering part of the disparity range by shifting and concatenating the target features, and then processes that volume using 2D convolutions. A series of SFF modules in our SFFNet is designed to gradually cover the full disparity range. Our method avoids heavy computation and allows for efficient generation of an accurate final disparity map. Various experiments show that our method has an advantage in terms of accuracy versus efficiency compared to other networks.
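The abstract's core construction, building a cost volume by shifting the target features and concatenating them with the reference features, can be sketched in a few lines. The following NumPy snippet is an illustrative sketch only, not the authors' implementation: the function name, array shapes, and the zero-padding at the left border are assumptions for demonstration.

```python
import numpy as np

def shifted_concat_cost_volume(ref_feat, tgt_feat, max_disp):
    """Build a cost volume by shifting the target feature map horizontally
    and concatenating it with the reference feature map, one slice per
    candidate disparity (illustrative sketch, not the authors' code).

    ref_feat, tgt_feat: arrays of shape (C, H, W)
    returns: array of shape (max_disp, 2*C, H, W)
    """
    C, H, W = ref_feat.shape
    volume = np.zeros((max_disp, 2 * C, H, W), dtype=ref_feat.dtype)
    for d in range(max_disp):
        volume[d, :C] = ref_feat
        if d == 0:
            volume[d, C:] = tgt_feat
        else:
            # shift target features right by d pixels; left border stays zero
            volume[d, C:, :, d:] = tgt_feat[:, :, :-d]
    return volume

ref = np.random.rand(8, 4, 16).astype(np.float32)
tgt = np.random.rand(8, 4, 16).astype(np.float32)
vol = shifted_concat_cost_volume(ref, tgt, max_disp=4)
print(vol.shape)  # (4, 16, 4, 16)
```

Because each disparity slice keeps the channel dimension of an ordinary feature map, the volume can be processed slice-by-slice with 2D convolutions, avoiding the 3D convolutions that make 4D cost-volume networks expensive.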

Highlights

  • Stereo matching is a fundamental computer vision problem and has been studied for decades

  • We evaluate our network on several datasets and demonstrate that the proposed Sequential Feature Fusion Network (SFFNet) achieves better results in terms of consumption of computing resources vs. accuracy compared to other methods

  • We propose a simple yet efficient network, called the Sequential Feature Fusion Network (SFFNet), for stereo matching


Summary

Introduction

Stereo matching is a fundamental computer vision problem and has been studied for decades. It aims to estimate the disparity for every pixel in the reference image from a pair of images taken from different points of view. Disparity is the difference in horizontal coordinates between corresponding pixels in the reference and target stereo images: if the pixel (x, y) in the reference left image corresponds to the pixel (x − d, y) in the target right image, the disparity of this pixel is d. Stereo matching allows us to obtain 3D information in a relatively inexpensive manner compared to other methods which leverage active 3D sensors [1] such as LiDAR, ToF, and structured light. The importance of stereo matching has recently been increasing, because 3D information is required in various emerging applications, including autonomous driving [2], augmented reality [3], virtual reality [4], and robot vision [5].
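The disparity definition above, matching pixel (x, y) in the left image against (x − d, y) in the right image, can be illustrated with a naive winner-take-all matcher. This is a minimal sketch for intuition only; the function name, window size, and sum-of-absolute-differences cost are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def disparity_sad(left, right, max_disp, win=2):
    """Naive winner-take-all stereo matching with a sum-of-absolute-differences
    cost: for each pixel (x, y) in the left (reference) image, compare a small
    window against windows centred at (x - d, y) in the right (target) image
    and pick the disparity d with the lowest cost."""
    H, W = left.shape
    disp = np.zeros((H, W), dtype=np.int64)
    for y in range(win, H - win):
        for x in range(win + max_disp, W - win):
            ref_patch = left[y - win:y + win + 1, x - win:x + win + 1]
            costs = []
            for d in range(max_disp + 1):
                tgt_patch = right[y - win:y + win + 1,
                                  x - d - win:x - d + win + 1]
                costs.append(np.abs(ref_patch - tgt_patch).sum())
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic pair: the right image is the left image shifted by 3 pixels,
# so every valid pixel has ground-truth disparity 3.
rng = np.random.default_rng(0)
left = rng.random((10, 30)).astype(np.float32)
right = np.zeros_like(left)
right[:, :-3] = left[:, 3:]          # right[x] = left[x + 3]  =>  d = 3
disp = disparity_sad(left, right, max_disp=5)
print(disp[5, 10])  # 3
```

Learning-based methods replace the hand-crafted SAD cost with learned feature similarities stored in a cost volume, but the underlying correspondence problem is the same one this loop solves.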

