Researchers from Nanjing University of Information Science and Technology (NUIST) present an attention-modulating network for video object segmentation, built around an attention modulator that efficiently steers a segmentation model toward a specific object of interest. The group also employ a focal loss that separates simple samples from more difficult ones, accelerating the convergence of network training and helping the model achieve state-of-the-art segmentation performance.

Video object segmentation (VOS) is a fundamental task in computer vision, with important applications in video editing, robotics and self-driving cars. VOS tasks are mainly categorised as unsupervised or semi-supervised. The former seeks to find and segment the salient targets in a video entirely without supervision, with the algorithm itself deciding which objects to segment. The latter aims to segment an object instance throughout the entire video sequence given only the object mask in the first frame, which can be viewed as a pixel-level object-tracking problem. Semi-supervised VOS can be further subdivided into single-object and multi-object segmentation. In their Letter, the team focus on semi-supervised VOS.

Deep learning for VOS has gained attention in the research community in recent years. Existing semi-supervised VOS techniques work by constructing deep networks and fine-tuning a pre-trained classifier on the ground-truth mask of the first frame during online testing. This online fine-tuning during testing has been shown to significantly improve accuracy.

[Figure: Illustrative diagram of the proposed segmentation model and approach.]
[Figure: Segmentation results.]

The team construct an attention-modulating network for the semi-supervised VOS task. Co-author Kaihua Zhang elaborates on the process: “We designed an efficient visual attention modulator and a spatial attention modulator, based respectively on the semantic information of the annotated object in the first frame and on the spatial information of the predicted object mask in the previous frame, to quickly modulate the segmentation model to focus on the specific object of interest. We then designed a SCAM architecture, which includes a channel attention module and a spatial attention module, and injected it into the segmentation model to further refine its feature maps. In addition, we constructed a feature pyramid attention module that mines context information at different scales to address the problem of multi-scale segmentation.”

Most existing methods rely on fine-tuning models on the first-frame annotations, which is time-consuming and makes them unsuitable for many practical applications. To address this issue, the proposed attention-modulating network adapts to the appearance of a specific object instance in a single feed-forward pass, without any fine-tuning. By combining the attention modulators, the feature pyramid attention modules and the focal loss, the method achieves state-of-the-art performance on the DAVIS 2017 dataset.

To overcome the sample-imbalance problem, the team adopt a focal loss, which down-weights easy samples, helps distinguish difficult samples from simple ones and accelerates the convergence of network training.

VOS remains challenging due to occlusions, fast motion, deformation and significant appearance variations over time. The method's visual attention modulator extracts semantic information about the annotated object, such as category, color and shape, from the first frame.
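The Letter's equations are not reproduced in this summary, but the description of the visual modulator matches the channel-wise feature re-weighting used by earlier modulation networks such as OSMN. The sketch below illustrates that pattern under this assumption; the class, method and parameter names are hypothetical.

```python
# Minimal sketch of a visual attention modulator (hypothetical names).
# Assumption: as in OSMN-style modulation, an embedding of the annotated
# first-frame object is mapped to one scale per feature channel, and the
# segmentation network's feature maps are re-weighted channel-wise.
import torch
import torch.nn as nn

class VisualAttentionModulator(nn.Module):
    def __init__(self, embed_dim: int, feat_channels: int):
        super().__init__()
        # Small head that turns the object embedding into per-channel scales.
        self.to_scales = nn.Sequential(
            nn.Linear(embed_dim, feat_channels),
            nn.Sigmoid(),  # keep the scales in (0, 1)
        )

    def forward(self, feats: torch.Tensor, obj_embed: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) features of the current frame
        # obj_embed: (B, D) embedding of the first-frame annotated object
        scales = self.to_scales(obj_embed)        # (B, C)
        return feats * scales[:, :, None, None]   # channel-wise modulation

# Toy usage: modulate a 512-channel feature map with a 256-d object embedding.
mod = VisualAttentionModulator(embed_dim=256, feat_channels=512)
out = mod(torch.randn(2, 512, 32, 32), torch.randn(2, 256))  # (2, 512, 32, 32)
```

Because the scales depend only on the first-frame embedding, they can be computed once per video and reused for every frame, which is what makes this style of modulation cheap compared with online fine-tuning.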
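Zhang's description of SCAM, a channel attention module followed by a spatial attention module injected into the segmentation model, follows the same pattern as the widely used CBAM block. The sketch below shows that general pattern only, as an assumption; the Letter's exact wiring may differ, and all names here are hypothetical.

```python
# Hedged sketch of a SCAM-style refinement block: channel attention first,
# then spatial attention, in the spirit of CBAM. Not the Letter's exact design.
import torch
import torch.nn as nn

class SCAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze the spatial dims, excite each channel.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single-channel weight map over H x W.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_att(x)   # re-weight channels ("what" to attend to)
        x = x * self.spatial_att(x)   # re-weight locations ("where" to attend to)
        return x
```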
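The focal loss mentioned above is the standard formulation of Lin et al. (2017): FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. A minimal binary version for segmentation logits might look as follows; the alpha and gamma values are the usual defaults and should be treated as assumptions, since the Letter's hyper-parameters are not given in this summary.

```python
# Binary focal loss for per-pixel segmentation logits (Lin et al., 2017).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Un-reduced per-pixel cross entropy, so each pixel can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is near 0 for easy pixels and near 1 for hard ones,
    # so easy background pixels stop dominating the gradient.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```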
Complementing the visual modulator, the spatial attention modulator uses the location of the predicted object mask in the previous frame as a spatial prior, guiding the segmentation network toward the regions where the target is most likely to appear in the current frame (a sketch of this idea is given at the end of this article). To handle segmentation targets at multiple scales, the feature pyramid attention modules mine context information at different scales, yielding better pixel-level attention over the high-level feature maps.

The proposed VOS approach is fast, which facilitates many applications such as interactive video editing and augmented reality. It may be applied to video-understanding models in the short term and, after longer-term development, to robotics and self-driving cars.

Kaihua Zhang comments on his group's future work: “Experiments show that our algorithm produces erroneous instance segmentations when similar objects occlude each other. To tackle this problem, we will leverage a position-sensitive embedding that is capable of distinguishing the pixels of similar objects. We have also found that solving VOS with multiple instances requires template matching to deal with occlusion and temporal propagation to ensure temporal continuity; otherwise, segmented instances can be lost. We will therefore use a re-identification module to retrieve lost instances, take the frame in which an instance is recovered as a starting point, and use a mask propagation module to bi-directionally recover the lost instances.”

Over the next decade, VOS is expected to achieve higher precision while meeting real-time application requirements. At present, manual pixel-level annotation of VOS datasets is prohibitively expensive, so cheaper large-scale VOS datasets are anticipated in the future.
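As a closing illustration, here is the spatial-prior idea sketched earlier in the article: the mask predicted for frame t-1 is resampled to the feature resolution and used as a soft spatial weight. The resample-and-reweight scheme and the floor parameter are illustrative assumptions, not the Letter's exact design, and the function names are hypothetical.

```python
# Minimal sketch of a spatial attention modulator (hypothetical names).
import torch
import torch.nn.functional as F

def spatial_modulate(feats, prev_mask, floor=0.5):
    # feats: (B, C, H, W) current-frame features
    # prev_mask: (B, 1, H0, W0) predicted mask from the previous frame, in [0, 1]
    prior = F.interpolate(prev_mask, size=feats.shape[-2:],
                          mode="bilinear", align_corners=False)
    # Keep a floor so regions outside the prior are attenuated rather than
    # erased, which tolerates fast motion between consecutive frames.
    weight = floor + (1.0 - floor) * prior
    return feats * weight
```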