Abstract

In this paper, we consider the problem of one-shot object segmentation in videos. Given an input video in which the object mask of the first frame is provided, our goal is to segment each remaining frame into foreground and background. We propose an attention-based knowledge transfer mechanism that transfers object knowledge from the first frame to the other frames in the video. Our model is a Siamese network with two streams: the first stream processes the first frame of the video, and the second stream produces the segmentation mask for any other frame. Each stream is a convolutional neural network (CNN) that produces attention maps at certain layers. Our approach is based on the observation that these attention maps contain valuable information that can boost the performance of CNN architectures. We propose a method for transferring the attention maps from the first stream to the second stream in the Siamese architecture, which allows our model to transfer knowledge from the first frame (with its ground-truth segmentation mask) to the other frames in the video. Experimental results on two benchmark datasets demonstrate that our model outperforms other state-of-the-art approaches.
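To make the two-stream design concrete, the following is a minimal PyTorch sketch of a Siamese network with attention transfer. The encoder architecture, the activation-based spatial attention map, and the multiplicative fusion of the transferred attention into the query stream are all illustrative assumptions for exposition; the abstract does not specify the exact formulation used in the paper.

```python
import torch
import torch.nn as nn


class AttentionTransferSiamese(nn.Module):
    """Sketch of a two-stream Siamese segmentation network that transfers
    spatial attention from the reference stream (first frame + mask) to
    the query stream (any other frame). Illustrative only."""

    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Shared convolutional trunk; weights are tied across both streams.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 head mapping fused features to a per-pixel foreground logit.
        self.head = nn.Conv2d(feat_channels, 1, 1)

    @staticmethod
    def spatial_attention(feat):
        # One common activation-based attention map: sum of squared
        # activations over channels, normalized to [0, 1]. The paper may
        # use a different attention formulation.
        attn = (feat ** 2).sum(dim=1, keepdim=True)
        return attn / (attn.amax(dim=(2, 3), keepdim=True) + 1e-6)

    def forward(self, first_frame, first_mask, query_frame):
        # Reference stream: encode the first frame and restrict its
        # attention to the ground-truth object region.
        ref_feat = self.encoder(first_frame)
        ref_attn = self.spatial_attention(ref_feat) * first_mask

        # Query stream: encode the target frame, then modulate its
        # features with the attention map transferred from the reference.
        qry_feat = self.encoder(query_frame)
        fused = qry_feat * (1.0 + ref_attn)

        return self.head(fused)  # (B, 1, H, W) foreground logits


# Usage with random tensors; first_mask is a binary (B, 1, H, W) mask.
model = AttentionTransferSiamese()
f0 = torch.randn(2, 3, 64, 64)
m0 = (torch.rand(2, 1, 64, 64) > 0.5).float()
fq = torch.randn(2, 3, 64, 64)
logits = model(f0, m0, fq)
print(logits.shape)  # torch.Size([2, 1, 64, 64])
```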
