Thanks to its inexpensive annotations and satisfactory performance, weakly-supervised semantic segmentation (WSSS) has been extensively studied. Recently, single-stage WSSS (SS-WSSS) has been revived to alleviate the expensive computational costs and complicated training procedures of multistage WSSS. However, the results of such immature models suffer from background incompleteness and object incompleteness. We empirically find that these problems are caused by insufficient global object context and a lack of local regional content, respectively. Based on these observations, we propose an SS-WSSS model supervised only by image-level class labels, termed the weakly supervised feature coupling network (WS-FCN), which captures the multiscale context formed from adjacent feature grids and encodes fine-grained spatial information from low-level features into high-level ones. Specifically, a flexible context aggregation (FCA) module is proposed to capture the global object context in different granular spaces. In addition, a semantically consistent feature fusion (SF2) module is proposed to aggregate fine-grained local content in a bottom-up, parameter-learnable fashion. Based on these two modules, WS-FCN is trained end-to-end in a self-supervised fashion. Extensive experimental results on the challenging PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate the effectiveness and efficiency of WS-FCN, which achieves state-of-the-art results of 65.02% and 64.22% mIoU on the PASCAL VOC 2012 val and test sets, respectively, and 34.12% mIoU on the MS COCO 2014 val set. The code and weights have been released at: WS-FCN.
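The results above are reported in mean intersection-over-union (mIoU). As a reminder of what that metric measures, the following is a minimal sketch for a single pair of label maps. It is not the authors' evaluation code: the function name `mean_iou`, the `ignore_index` value of 255 (assumed here as the void label), and the choice to average only over classes present in the prediction or ground truth are illustrative assumptions. Benchmark evaluation typically accumulates intersections and unions over the whole dataset before averaging, rather than per image.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
    ignore_index: int = 255,  # assumed void label; pixels with this target are skipped
) -> float:
    """Mean IoU over the classes that occur in either label map."""
    intersection = [0] * num_classes
    union = [0] * num_classes

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                # Correct pixel: counts toward intersection and union of its class.
                intersection[p] += 1
                union[p] += 1
            else:
                # Wrong pixel: enlarges the union of both the predicted and true class.
                union[p] += 1
                union[t] += 1

    # Average only over classes that actually appear (union > 0).
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    print(f"mIoU = {mean_iou(pred, target, num_classes=2):.4f}")
```

In this toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.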