Abstract
Given their powerful feature representations for recognition, deep convolutional neural networks (DCNNs) have been driving rapid advances in high-level computer vision tasks. However, their performance in semantic image segmentation is still not satisfactory. Based on an analysis of the visual mechanism, we conclude that bottom-up DCNNs alone are not sufficient, because the semantic image segmentation task requires not only recognition but also visual attention capability. In this study, superpixels carrying visual attention information are introduced in a top-down manner, and an extensible architecture is proposed to improve the segmentation results of current DCNN-based methods. We employ the current state-of-the-art fully convolutional network (FCN) and FCN with a conditional random field (DeepLab-CRF) as baselines to validate our architecture. Experimental results on the PASCAL VOC segmentation task qualitatively show that coarse edges and erroneous segmentations are clearly improved. We also quantitatively obtain an improvement of about 2%-3% in intersection over union (IOU) accuracy on the PASCAL VOC 2011 and 2012 test sets.
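To make the top-down refinement idea concrete, the sketch below snaps each superpixel to the majority label the DCNN predicted inside it. This is a minimal sketch of one common superpixel-refinement strategy, not the paper's exact mechanism; the use of SLIC, its parameters, and the majority-voting rule are assumptions for illustration.

```python
# Minimal sketch: refine a DCNN's per-pixel labels with superpixels
# by majority voting inside each superpixel. Illustrates the general
# refinement idea only; SLIC and the voting rule are assumptions,
# not the paper's architecture.
import numpy as np
from skimage.segmentation import slic

def refine_with_superpixels(image, label_map, n_segments=500):
    """image: HxWx3 float array in [0, 1]; label_map: HxW int array
    of DCNN-predicted class labels. Returns a refined HxW label map."""
    # Over-segment the image into superpixels (a top-down grouping cue).
    segments = slic(image, n_segments=n_segments, compactness=10,
                    start_label=0)
    refined = label_map.copy()
    for sp in np.unique(segments):
        mask = segments == sp
        # Assign every pixel in the superpixel the majority DCNN label,
        # which tends to sharpen coarse object boundaries.
        refined[mask] = np.bincount(label_map[mask]).argmax()
    return refined
```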
Highlights
Semantic image segmentation is one of the central tasks in computer vision
The fully convolutional network (FCN)-8s demonstrates impressive performance on the PASCAL Visual Object Class (VOC) benchmark, achieving a 20% relative improvement to 62.2% mean intersection over union (IOU, computed as in the sketch after this list) on the 2012 test set
To compare the mutual promotion of semantic labels and superpixels for segmentation, we test two architectures: one uses superpixels only to improve semantic labels, denoted deep convolutional neural network (DCNN)-Sp, and the other is the overall architecture that performs mutual promotion of semantic labels and superpixels, denoted DCNN-Sp-v2
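For reference, the mean IOU metric quoted above averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. A minimal sketch follows; the function name and confusion-matrix formulation are our own, but the metric matches the standard PASCAL VOC definition.

```python
# Minimal sketch of the mean intersection-over-union (IOU) metric,
# following the standard PASCAL VOC definition.
import numpy as np

def mean_iou(pred, gt, n_classes):
    """pred, gt: flat int arrays of pixel labels in [0, n_classes).
    Returns the mean IOU across classes."""
    # Confusion matrix: rows are ground truth, columns are predictions.
    cm = np.bincount(n_classes * gt + pred,
                     minlength=n_classes ** 2).reshape(n_classes, n_classes)
    intersection = np.diag(cm)                    # true positives per class
    union = cm.sum(0) + cm.sum(1) - intersection  # TP + FP + FN per class
    # Ignore classes absent from both prediction and ground truth.
    valid = union > 0
    return (intersection[valid] / union[valid]).mean()
```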
Summary
Semantic image segmentation is one of the central tasks in computer vision. Compared with image classification, which labels at the image level, semantic image segmentation must assign a semantic label to each pixel. Classifying region proposals and refining their labels to obtain the final segmentation is a common technique. Carreira et al. [1] used constrained parametric min-cuts [2] to generate 150 region proposals per image and classified each region using variants of the scale-invariant feature transform and local binary patterns. Jimei et al. [3] presented a scalable scene parsing algorithm based on image retrieval and superpixel matching, and obtained good performance. Tighe et al. [4] combined region-level features with per-exemplar sliding window detectors to interpret a scene. Despite being the focus of considerable attention, the task remains challenging.