Abstract

Deep learning approaches to interactive image segmentation are typically formulated as a binary labeling problem. A model trained to predict within a fixed label set (i.e., foreground and background) cannot directly predict binary masks for multiple objects of interest, which greatly limits its flexibility and adaptability. Instead, different classes of clicks are taken as input, and the first end-to-end learning model for multi-object segmentation, built on a newly designed neural network, is developed. The network consists of a visual feature extractor, a recurrent attention module, and a dynamic segmentation head; it extracts click-adapted appearance embedding features and spatial attention features, and learns to transform this information into a segmentation of multiple objects. It is further proposed to train the network with a joint loss function that incorporates embedding learning into the segmentation objective. Comprehensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. It performs favorably against state-of-the-art approaches on the multi-object segmentation task, for example, running at 0.15 s per image and 0.06 s per object with a mean IoU & F1 score of 84.90% on the Pascal VOC 2012 validation set. It is also shown that the method supports numerous vision applications such as image recoloring and colorization.
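To make the three-stage pipeline named above concrete, the following is a minimal NumPy sketch of the overall data flow: a per-pixel feature extractor, a per-object attention map derived from user clicks, and a head that assigns each pixel to the nearest object embedding. All shapes, the Gaussian click kernel, and the nearest-embedding assignment are illustrative assumptions, not the paper's actual architecture or learned modules.

```python
import numpy as np

def extract_features(image):
    # Stand-in for a learned CNN backbone: a per-pixel embedding of
    # shape (H, W, C). Random here purely for illustration.
    h, w, _ = image.shape
    rng = np.random.default_rng(0)
    return rng.standard_normal((h, w, 8))

def click_attention(features, clicks):
    # One spatial attention map per object, peaked at that object's
    # clicks (a fixed Gaussian stands in for the recurrent attention
    # module described in the abstract).
    h, w, _ = features.shape
    yy, xx = np.mgrid[0:h, 0:w]
    maps = []
    for object_clicks in clicks:  # clicks grouped by object
        m = np.zeros((h, w))
        for cy, cx in object_clicks:
            m += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 50.0)
        maps.append(m / m.max())
    return np.stack(maps)  # (num_objects, H, W)

def dynamic_head(features, attention_maps):
    # Per-object embedding = attention-weighted mean feature; every
    # pixel is labeled with the object whose embedding is closest.
    embeddings = []
    for a in attention_maps:
        w_ = a[..., None]
        embeddings.append((features * w_).sum((0, 1)) / w_.sum())
    dists = np.stack([np.linalg.norm(features - e, axis=-1)
                      for e in embeddings])  # (num_objects, H, W)
    return dists.argmin(0)  # integer label map, one id per object

image = np.zeros((32, 32, 3))
clicks = [[(8, 8)], [(24, 24)]]  # one click per object of interest
feats = extract_features(image)
atts = click_attention(feats, clicks)
labels = dynamic_head(feats, atts)
print(atts.shape, labels.shape)
```

The point of the sketch is the interface, not the modeling: unlike a binary foreground/background model, the head emits one label per clicked object in a single pass, which is what removes the fixed-label-set restriction discussed above.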
