Abstract
The capability to globally model and reason about relations between image regions is crucial for complex scene understanding tasks such as semantic segmentation. Most current semantic segmentation methods rely on deep convolutional neural networks (CNNs), yet convolutions with local receptive fields are typically inefficient at capturing long-range dependencies. Recent works on self-attention mechanisms and relational reasoning networks seek to address this issue by learning pairwise relations between every two entities, and have shown promising results. However, such approaches incur heavy computational and memory overheads, which makes them infeasible for dense prediction tasks, particularly on large images, e.g., aerial imagery. In this work, we propose an efficient method for global context modeling in which, at each position, a sparse set of features over the spatial domain, rather than all features, is adaptively sampled and aggregated. We further devise a highly efficient instantiation of the proposed method, namely learning RANdom walK samplIng aNd feature aGgregation (RANKING). The proposed module is lightweight and general, and can be used in a plug-and-play fashion with existing fully convolutional network (FCN) frameworks. To evaluate RANKING-equipped networks, we conduct experiments on two aerial scene parsing datasets, where the networks achieve competitive results at significantly lower computational and memory costs.
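To make the core idea concrete, the sketch below shows one way adaptive sparse sampling and aggregation can be realized: each position predicts a small number of sampling offsets, gathers features at those locations, and fuses them with learned weights. This is a minimal illustration of the general idea only, not the paper's exact RANKING module; the class name SparseContext and parameters such as num_samples are hypothetical.

```python
# Illustrative sketch: per-position sparse sampling and aggregation.
# NOT the authors' RANKING implementation; names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseContext(nn.Module):
    def __init__(self, channels, num_samples=8):
        super().__init__()
        self.num_samples = num_samples
        # Predict 2D offsets and a scalar weight for each of the K samples.
        self.offset = nn.Conv2d(channels, 2 * num_samples, 3, padding=1)
        self.weight = nn.Conv2d(channels, num_samples, 3, padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.num_samples
        offsets = self.offset(x).view(b, k, 2, h, w)      # (B, K, 2, H, W)
        weights = torch.softmax(self.weight(x), dim=1)    # (B, K, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates (x first,
        # matching grid_sample's convention).
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)              # (H, W, 2)

        agg = 0.0
        for i in range(k):
            # Each offset is a normalized displacement from the base grid.
            grid = base.unsqueeze(0) + offsets[:, i].permute(0, 2, 3, 1)
            sampled = F.grid_sample(x, grid, align_corners=True)
            agg = agg + weights[:, i:i + 1] * sampled     # weighted fusion

        return x + self.proj(agg)  # residual connection onto input features
```

With K samples per position, cost grows as O(HW·K) instead of the O((HW)²) required to relate every pair of positions, which is what makes this style of module tractable on large aerial tiles.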
Highlights
Being able to reason about relations among different regions in an image or video is inherent to humans, but is not easy for convolutional neural networks (CNNs)
Our goal is to explicitly model global context with low computational and memory overheads in a fully convolutional network (FCN) for aerial scene parsing by considering a sparse set of important short- and long-range relations instead of all of them
We can observe that integrating the RANdom walK samplIng aNd feature aGgregation (RANKING) module contributes increments of 2.17% and 2.01% in mean F1 score and overall accuracy, respectively, compared to FCN-dCRF
Summary
Capturing and modeling both short- and long-range relations is of paramount importance for many vision tasks, to name a few, semantic segmentation (Fu et al., 2019; Liu et al., 2017; Bertasius et al., 2017), object detection (Shvets et al., 2019; Hu et al., 2018), action recognition (Wang et al., 2018), and visual question answering (VQA) (Santoro et al., 2017; Lobry et al., 2019). An individual convolution layer can only learn features locally, and even deep CNNs with large receptive fields have proven inefficient at modeling long-range dependencies (Luo et al., 2016; Zhou et al., 2015). To address this issue, many efforts have been made to enhance the capacity of CNNs to capture long-range relations, such as dilated convolutions (Chen et al., 2015; Chen et al., 2018a; Chen et al., 2018b), introducing graphical models into networks (Chen et al., 2018a; Liu et al., 2015; Zheng et al., 2015), and constructing spatial propagation network modules (Bell et al., 2016; Liu et al., 2017). These methods, in one way or another, learn pairwise relations between every two entities.
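The quadratic cost of such dense pairwise modeling is easy to see in code. The sketch below is a minimal non-local / self-attention style block, included only to show where the overhead comes from; the class name DensePairwise is hypothetical and this is a generic illustration, not a specific module from the cited works.

```python
# Minimal dense pairwise-relation (non-local style) block, shown only to
# illustrate the quadratic cost: the affinity matrix has (H*W) x (H*W)
# entries. Class name is hypothetical.
import torch
import torch.nn as nn

class DensePairwise(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n)   # (B, C/2, N)
        k = self.key(x).view(b, -1, n)     # (B, C/2, N)
        v = self.value(x).view(b, c, n)    # (B, C, N)
        # N x N affinity: memory and compute grow quadratically with H*W.
        # For a 512x512 feature map, N^2 is roughly 6.9e10 entries.
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, N, N)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + out
```

It is exactly this N×N affinity matrix that the sparse sampling strategy above avoids, by replacing the full set of pairwise relations with a small, adaptively chosen subset per position.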