Video object segmentation (VOS) must contend with heavy occlusions, large deformations, and severe motion blur. Although many remarkable convolutional neural networks have been devoted to the VOS task, they often misidentify background noise as the target or produce coarse object boundaries, because they fail to mine detail information and the high-order correlations of pixels across the whole video. In this work, we propose an edge attention gated graph convolutional network (GCN) for VOS. The seed point initialization and graph construction stages build a spatio-temporal graph of the video by exploring the spatial intra-frame and temporal inter-frame correlations of superpixels. The node classification stage identifies foreground superpixels with an edge attention gated GCN that mines higher-order correlations between superpixels and propagates features among nodes. The segmentation optimization stage refines the classification of foreground superpixels and reduces segmentation errors with a global appearance model that captures the long-term stable features of objects. In summary, the key contributions of our framework are twofold: (a) the spatio-temporal graph representation propagates the seed points of the first frame to subsequent frames, enabling our framework to perform semi-supervised VOS; (b) the edge attention gated GCN learns the importance of each node with respect to both its neighboring nodes and the whole task with a small number of layers. Experiments on the DAVIS 2016 and DAVIS 2017 datasets show that our framework achieves excellent performance with only a small training set (45 video sequences).
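To make the core mechanism concrete, the sketch below shows what one edge attention gated GCN layer could look like in PyTorch. This is a minimal illustration, not the authors' implementation: the class name, the exact attention scoring (a softmax over each node's incoming edges), and the sigmoid gating of the aggregated message are all assumptions chosen to match the abstract's description of learning per-node importance with respect to neighbors and gating feature propagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttentionGatedGCNLayer(nn.Module):
    """Hypothetical sketch of one edge attention gated message-passing layer:
    each edge receives an attention weight computed from its endpoint features,
    and a learned gate controls how much of the aggregated neighbor message
    updates each node. Not the paper's exact formulation."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)       # transforms neighbor features into messages
        self.att = nn.Linear(2 * dim, 1)     # scores each edge from its endpoint features
        self.gate = nn.Linear(2 * dim, dim)  # gates the update per node and channel

    def forward(self, x, edge_index):
        # x: (N, dim) superpixel node features; edge_index: (2, E) source/target ids
        src, dst = edge_index
        # Edge attention: score every edge, then normalize over each target's in-edges
        e = self.att(torch.cat([x[src], x[dst]], dim=-1)).squeeze(-1)  # (E,)
        w = torch.exp(e - e.max())                                     # stable softmax numerator
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, w)
        alpha = w / (denom[dst] + 1e-9)                                # (E,) attention weights
        # Aggregate attention-weighted messages per target node
        m = torch.zeros_like(x).index_add_(0, dst, alpha.unsqueeze(-1) * self.msg(x[src]))
        # Gate: decide per channel how much of the message enters the node update
        g = torch.sigmoid(self.gate(torch.cat([x, m], dim=-1)))        # (N, dim)
        return F.relu(x + g * m)

# Example: propagate features over 100 superpixel nodes with 64-dim features
layer = EdgeAttentionGatedGCNLayer(64)
x = torch.randn(100, 64)
edge_index = torch.randint(0, 100, (2, 400))
out = layer(x, edge_index)  # (100, 64)
```

One plausible reading of the abstract's claim about needing only a small number of layers is that the gate lets each node suppress or admit propagated features per channel, so useful long-range information can survive a shallow stack without being washed out by uniform neighborhood averaging.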