Abstract
Video summarization has become more prominent during the last decade, due to the massive amount of available digital video content. A video summarization algorithm is typically fed an input video and expected to extract a set of important key-frames which represent the entire content, convey semantic meaning and are significantly more concise than the original input. The most widespread approach relies on clustering the video frames and extracting the frames closest to the cluster centroids as key-frames. Such a process, although efficient, offloads the burden of semantic scene content modelling exclusively to the employed video frame description/representation scheme, while summarization itself is approached simply as a distance-based data partitioning problem. This work focuses on videos depicting human activities (e.g., from surveillance feeds), which display an attractive property: each video frame can be seen as a linear combination of elementary visual words (i.e., basic activity components). This property is exploited to identify the video frames containing only the elementary visual building blocks, which ideally form a set of independent basis vectors that can linearly reconstruct the entire video. In this manner, the semantic content of the scene is considered by the video summarization process itself. The above process is modulated by a traditional distance-based video frame saliency estimation, which biases the selection towards broader content coverage and outlier inclusion, under a joint optimization framework derived from the Column Subset Selection Problem (CSSP). The proposed algorithm results in a final key-frame set which acts as a salient dictionary for the input video. Empirical evaluation conducted on a publicly available dataset suggests that the presented method outperforms both a baseline clustering-based approach and a state-of-the-art sparse dictionary learning-based algorithm.
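To make the idea of saliency-modulated column subset selection concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes the video is given as a matrix with one frame descriptor per column, and it swaps in a simple greedy, deflation-based selection in place of the joint optimization described above. The function names `select_key_frames` and `reconstruction_error`, the mean-distance saliency, and the exponent `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def select_key_frames(X, k, alpha=1.0, tol=1e-10):
    """Greedily pick up to k columns of X (one frame descriptor per column).

    Illustrative only: each step scores a frame by the energy of its residual
    (the part not yet explained by the chosen frames) multiplied by an assumed
    distance-based saliency raised to the power alpha.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[1]

    # Assumed saliency: mean Euclidean distance from each frame to all frames,
    # so frames far from the bulk of the content (outliers) score higher.
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    saliency = np.sqrt(d2).mean(axis=1)
    saliency = saliency / (saliency.max() + 1e-12)

    residual = X.copy()
    selected = []
    for _ in range(min(k, n)):
        score = np.sum(residual ** 2, axis=0) * saliency ** alpha
        if selected:
            score[selected] = -np.inf  # never pick the same frame twice
        j = int(np.argmax(score))
        norm = np.linalg.norm(residual[:, j])
        if norm < tol:
            break  # remaining frames are already reconstructed
        selected.append(j)
        # Remove the newly explained direction from every frame.
        q = residual[:, j] / norm
        residual = residual - np.outer(q, q @ residual)
    return selected


def reconstruction_error(X, idx):
    """Frobenius error of reconstructing X linearly from the columns in idx."""
    C = X[:, idx]
    return float(np.linalg.norm(X - C @ np.linalg.pinv(C) @ X))
```

With `alpha=0` the sketch reduces to a purely reconstruction-driven greedy selection; larger values of `alpha` shift the choice towards frames that lie far from the rest of the content.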