Abstract

Scene understanding through pixel-level semantic parsing is one of the core problems in computer vision. To date, image-based methods and datasets for scene parsing have been well explored. However, the real world is inherently dynamic rather than static, so learning to perform video scene parsing is more practical for real-world applications. Since few datasets cover an extensive range of scenes and object categories with temporal pixel-level annotations, in this work we present a large-scale video scene parsing dataset, namely VSPW (Video Scene Parsing in the Wild). Specifically, VSPW contains a total of 251,633 frames from 3,536 videos with dense pixel-wise annotations, covering a large variety of 231 scenes and 124 object categories. Moreover, VSPW is densely annotated at a high frame rate of 15 fps, and over 96% of its videos have high spatial resolutions ranging from 720P to 4K. To the best of our knowledge, VSPW is the first attempt to address the challenging video scene parsing task in the wild across diverse scenes. Based on VSPW, we further propose Temporal Attention Blending (TAB) Networks, which harness temporal context information for better pixel-level semantic understanding of videos. Extensive experiments on VSPW demonstrate the superiority of the proposed TAB over baseline approaches. We hope the newly proposed dataset and the explorations in this work can help advance the challenging yet practical video scene parsing task. Both the dataset and the code are available at www.vspwdataset.com.
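The abstract only names Temporal Attention Blending without describing its architecture, so the following is a minimal illustrative sketch of the general idea: attention weights computed across frames blend temporal features into the current frame's representation. The module name, the query/key projections, and all shapes are assumptions for illustration, not the paper's actual TAB design.

```python
# Illustrative sketch of attention-based temporal feature blending (PyTorch).
# NOTE: this is an assumed design, not the VSPW paper's TAB architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionBlend(nn.Module):
    """Blend per-frame features using attention weights over time."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions project features into query/key spaces (assumed).
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) features for T consecutive frames;
        # the last frame is the one being parsed.
        T, C, H, W = feats.shape
        q = self.query(feats[-1:])   # (1, C, H, W): current frame
        k = self.key(feats)          # (T, C, H, W): all frames
        # Per-pixel similarity between the current frame and each frame.
        sim = (q * k).sum(dim=1, keepdim=True) / C ** 0.5  # (T, 1, H, W)
        attn = F.softmax(sim, dim=0)                       # weights over time
        # Attention-weighted blend of temporal features.
        return (attn * feats).sum(dim=0, keepdim=True)     # (1, C, H, W)

# Usage: blend backbone features from a 4-frame clip.
feats = torch.randn(4, 256, 64, 64)
blended = TemporalAttentionBlend(256)(feats)
print(blended.shape)  # torch.Size([1, 256, 64, 64])
```

The per-pixel softmax over the temporal axis lets each spatial location decide how much to borrow from neighboring frames, which is one common way to exploit the temporal context the abstract refers to.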
