Abstract

In this paper we propose a method for the automatic structuring of video documents. The video is first segmented into shots using a scale-space filtering graph-partition method. For each detected shot, the associated static summary is built using a leap key-frame extraction method. From the representative images obtained, we then introduce a combined spatial and temporal video attention model that is able to recognize moving salient objects. The proposed approach extends state-of-the-art region-based contrast image saliency with a temporal attention model. The different types of motion present in the current shot are determined using a set of homographic transforms, estimated by recursively applying the RANSAC algorithm to the interest point correspondences. Finally, a decision is taken based on the combined spatial and temporal attention models. The experimental results validate the proposed framework and demonstrate that our approach is effective for various types of video, including noisy and low-resolution data.
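The recursive RANSAC step described above can be illustrated with a minimal sketch: estimate the dominant motion model among the interest point correspondences, remove its inliers, and repeat on the remaining matches until too few points are left. For brevity this sketch uses pure 2-D translations as a stand-in for the paper's homographic transforms, and all function names and parameter values (`ransac_motions`, `thresh`, `min_inliers`) are illustrative assumptions, not the authors' implementation.

```python
import random

def ransac_motions(matches, thresh=2.0, min_inliers=4, iters=200, seed=0):
    """Recursively apply RANSAC to point correspondences.

    matches: list of ((x, y), (x2, y2)) interest point pairs.
    Returns a list of estimated motions, one per detected motion layer
    (simplified here to translation vectors (tx, ty); the paper
    estimates full homographies at this step).
    """
    rng = random.Random(seed)
    remaining = list(matches)
    motions = []
    while len(remaining) >= min_inliers:
        best_inliers, best_t = [], None
        for _ in range(iters):
            # Minimal sample for a translation model: one correspondence.
            p, q = rng.choice(remaining)
            t = (q[0] - p[0], q[1] - p[1])
            # Count correspondences consistent with this motion hypothesis.
            inliers = [m for m in remaining
                       if abs(m[1][0] - m[0][0] - t[0]) <= thresh
                       and abs(m[1][1] - m[0][1] - t[1]) <= thresh]
            if len(inliers) > len(best_inliers):
                best_inliers, best_t = inliers, t
        if len(best_inliers) < min_inliers:
            break
        # Accept the dominant motion, strip its inliers, and recurse
        # on the leftover correspondences to find secondary motions.
        motions.append(best_t)
        accepted = {id(m) for m in best_inliers}
        remaining = [m for m in remaining if id(m) not in accepted]
    return motions
```

A full implementation would replace the one-point translation sample with a four-point homography fit (e.g. via a DLT solve) and a reprojection-error inlier test, but the remove-inliers-and-repeat structure is the same.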
