Divided attention

doi:10.1049/el.2019.0992

Abstract

Researchers from Nanjing University of Information Science and Technology (NUIST) present an attention-modulating network for video object segmentation with an advanced attention modulator to efficiently modulate a segmentation model to focus on a specific object of interest. The group employ a focal loss that distinguishes simple samples from more difficult ones to accelerate the convergence of network training to achieve state-of-the-art segmentation performance. Video object segmentation (VOS) is a fundamental task in computer vision, with important applications in video editing, robotics, and self-driving cars. VOS tasks are mainly categorised into unsupervised and semi-supervised classifications. The former seeks to find and segment the salient targets in the videos completely without supervision, with the algorithm itself deciding what the main segmentation is. The latter aims at segmenting an object instance throughout the entire video sequence given only the object mask on the first frame. This can be observed as a pixel-level object tracking problem. Semi-supervised VOS can be subdivided into single-object segmentation and multi-object segmentation. In the team's Letter, they focus on semi-supervised VOS. Deep learning for VOS has gained attention in the research community in recent years. Existing semi-supervised VOS techniques work by constructing deep networks and fine-tuning a pre-trained classifier on a given ground truth in the first frame during online testing. This online fine-tuning of a classifier during testing has been shown to significantly improve accuracy. Illustrative diagram of the proposed segmentation model and approach. Segmentation results. The team conduct an attention-modulating network for the semi-supervised VOS task. Co-author Kaihua Zhang elaborates on the process: “We designed an efficient visual and spatial attention modulator based on the semantic information of the annotated object in the first frame and the spatial information of predicted object mask in the previous frame, respectively, to fast module the segmentation model to focus on the specific object of interest. Then we design a SCAM architecture which includes a channel attention module and a spatial attention module and inject it into segmentation model to further refine its feature maps. In addition, we construct a feature pyramid attention module to mine context information of different scales to solve the problem of multi-scale segmentation. Most existing methods rely on fine-tuning models using first-frame annotations and are time-consuming, making them unsuitable for most practical applications. To address this issue, the proposed approach developed an attention-modulating network to focus on the appearance of a specific object instance in one single feed-forward pass without fine-tuning. Compared with other methods, this method has achieved state-of-art performance on the DAVIS2017 dataset by using attention-modulators, feature attention pyramid modules and focal loss. In order to overcome a sample imbalance problem, reference was made to focal loss which can accelerate the convergence of network training, thus helping to distinguish between difficult and simple samples. VOS remains challenging due to occlusions, fast motion, deformation, and significant appearance variations over time. This method conducts a visual attention modulator to extract semantic information such as category, color and shape from the first frame. The spatial attention modulator fits the predicted location of object masks in the previous frame as a spatial prior to guide the segmentation network to focus on the regions where that target is most likely to appear in the current frame. To solve the multi-scales of segmentation objects, feature pyramid attention modules mined the context information of different scales, achieving better pixel-level attention for the high-level feature maps. The proposed VOS approach is fast, which facilitates many applications, such as interactive video editing and augmented reality. It may be applied to video understanding models in the short term, and after long-term development, it may be applied to robotics, and self-driving cars. Kaihua Zhang notes on his groups future work: “Experiments show that our algorithm performs erroneous instance segmentation when faced with the challenge of occluding each other between similar objects. To tackle this problem, we will leverage a position-sensitive embedding which is capable of distinguishing the pixels of similar objects. We have also found that solving VOS with multiple instances requires template matching to deal with occlusion and temporal propagation to ensure temporal continuity; otherwise the segmentation instance would be lost. Thus, we will use the re-identification module to retrieve lost instances and take its frame as the starting point and use the mask propagation module to bi-directionally recover the lost instances.” The development of VOS in the next decade will achieve higher precision while meeting real-time application requirements. At present, the cost of manual annotation of pixel-level VOS data sets is too expensive, so cheaper large-scale VOS data sets are expected in the future.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Divided attention

Abstract

Talk to us

Similar Papers

More From: Electronics Letters

Lead the way for us

Similar Papers

Semi-supervised video object segmentation via an edge attention gated graph convolutional network
Yuqing Zhang ... Yun Liang
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. -
Yuqing Zhang, et. al.Yuqing Zhang ... Yun Liang
01 Aug 2023
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. -

One Spatio-Temporal Sharpening Attention Mechanism for Light-Weight YOLO Models Based on Sharpening Spatial Attention.
Mengfan Xue ... Minghao Chen
Sensors | VOL. 21
Mengfan Xue, et. al.Mengfan Xue ... Minghao Chen
28 Nov 2021
Sensors | VOL. 21

Semi-supervised Video Object Segmentation with Recurrent Neural Network
Xuanguang Ren ... Zhongliang Jing
-
Xuanguang Ren, et. al.Xuanguang Ren ... Zhongliang Jing
01 Sep 2019
01 Sep 2019

Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels.
Hongje Seong ... Junhyuk Hyun
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 45
Hongje Seong, et. al.Hongje Seong ... Junhyuk Hyun
01 Feb 2023
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 45

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Divided attention

Abstract

Talk to us

Similar Papers

More From: Electronics Letters