Abstract

We propose a weakly supervised semantic segmentation method that learns directly from web images crawled from the Internet with text queries, without any explicit user annotation or data filtering. To handle the large amount of label noise in web images, we design a three-stage approach based on curriculum learning. We first generate pixel-level masks for the training images with a popular weakly supervised semantic segmentation framework. We then account for the noise in the web data at two levels. At the image level, the complexity of an image is measured by its distribution density in a classification feature space. At the pixel level, the complexity of a mask is evaluated in an unsupervised manner from the relationship between the saliency map and the segmented image. The key insight behind this design is that common, simple object patterns should be salient to both the saliency detector and the weakly supervised DCNNs, and the resulting responses should be sparse, with high regional consistency between the two. This allows for an efficient implementation of curriculum learning from noisy web images. Experiments on the PASCAL VOC 2012 benchmark show that our method achieves a competitive 64.0% mIoU when trained only on our web dataset, which contains noisy, single-label images. Fine-tuning the CurriculumWebSegNet on the PASCAL VOC dataset, which provides more precise multi-label supervision, further improves performance to 69.7% mIoU.
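
The two complexity measures described above can be illustrated with a minimal sketch. This is not the paper's implementation: the k-nearest-neighbour density estimate, the use of IoU as the mask/saliency consistency measure, the product used to combine the two scores, and all function names (`density_scores`, `mask_saliency_iou`, `curriculum_order`) are assumptions made for illustration only.

```python
import numpy as np


def density_scores(features, k=5):
    """Hypothetical image-level score: inverse mean distance to the k nearest
    neighbours in feature space (higher = denser region = simpler sample)."""
    features = np.asarray(features, dtype=float)
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance to the sample itself.
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    return 1.0 / (nearest.mean(axis=1) + 1e-12)


def mask_saliency_iou(mask, saliency, threshold=0.5):
    """Hypothetical pixel-level score: IoU between a binary segmentation mask
    and a thresholded saliency map (higher = more consistent)."""
    mask = np.asarray(mask, dtype=bool)
    salient = np.asarray(saliency, dtype=float) >= threshold
    union = np.logical_or(mask, salient).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask, salient).sum() / union)


def curriculum_order(features, masks, saliencies, k=5):
    """Return sample indices ordered from simplest to most complex.
    Combining the two scores by a product is an assumption of this sketch."""
    density = density_scores(features, k)
    consistency = np.array(
        [mask_saliency_iou(m, s) for m, s in zip(masks, saliencies)]
    )
    score = (density / density.max()) * consistency
    return np.argsort(-score, kind="stable")
```

Under these assumptions, a curriculum would train on the samples at the front of the returned ordering first and add the later, noisier samples in subsequent stages.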
