Abstract

Semantic scene segmentation has primarily been addressed by forming high-level visual representations of single images. The problem of semantic segmentation in dynamic scenes has begun to receive attention alongside video object segmentation and tracking. While some recent work has attempted to apply deep learning models at the video level, it remains unclear how temporal dynamics information contributes to full scene segmentation. Moreover, most existing datasets provide full-scene annotation only for non-consecutive images in order to ensure scene variability, which makes it even harder to explore novel methods for video-level modeling. To address these issues, our work takes steps to explore the behavior of modern spatiotemporal modeling approaches by: 1) constructing the MIT DriveSeg dataset, a large-scale video driving scene segmentation dataset consisting of 5,000 consecutive video frames densely annotated with pixel-level semantic classes, and 2) proposing a joint-learning framework that reveals the contribution of temporal dynamics information with respect to different semantic classes in the driving scene. This work is intended to help assess current methods and support further exploration of the value of temporal dynamics information in video-level scene segmentation.
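
For readers unfamiliar with how temporal dynamics can enter a segmentation model at all, the following is a minimal, hypothetical sketch in PyTorch. It is not the joint-learning framework proposed in the paper: it simply stacks a few consecutive RGB frames along the channel axis before dense per-pixel classification, which is one generic way to expose a model to short-term motion cues. The clip length, resolution, and class count below are placeholders, not DriveSeg specifications.

# Hypothetical illustration only, not the paper's architecture: fold a short
# clip of consecutive frames into the channel dimension and predict a dense
# per-pixel class map, so temporal context is visible to the spatial network.
import torch
import torch.nn as nn

T, C, H, W = 4, 3, 128, 160   # clip length, channels, spatial size (illustrative)
NUM_CLASSES = 12              # placeholder class count, not the DriveSeg label set

model = nn.Sequential(
    nn.Conv2d(T * C, 64, kernel_size=3, padding=1),  # mixes spatial and temporal channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, NUM_CLASSES, kernel_size=1),       # per-pixel class logits
)

clip = torch.randn(1, T, C, H, W)        # a batch containing one short video clip
x = clip.reshape(1, T * C, H, W)         # stack the T frames along the channel axis
logits = model(x)                        # shape: (1, NUM_CLASSES, H, W)
pred = logits.argmax(dim=1)              # dense prediction for the clip
print(pred.shape)                        # torch.Size([1, 128, 160])

Comparing such a temporally aware variant against a single-frame baseline, class by class, is the kind of analysis the densely annotated consecutive frames in DriveSeg are meant to enable.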
