Abstract

The ever-increasing amount of user-generated audiovisual content has raised the demand for easy navigation across content collections and repositories, necessitating detailed yet concise content representations. A typical approach to this goal is to construct a visual summary, which is significantly more expressive than alternatives such as verbal annotations. In this paper, we describe a video summarization technique based on the extraction and fusion of audio and visual features, which generates dynamic video summaries, i.e., summaries that include the most essential segments of the original video while preserving their original temporal order. Based on the extracted features, each video segment is classified as “interesting” or “uninteresting,” and hence included in or excluded from the final summary. The originality of our technique is that, prior to classification, we employ a transfer learning strategy to extract deep features from pre-trained models as input to the classifiers, making them more robust and less subjective. We evaluate our technique on a large dataset of user-generated videos and demonstrate that the addition of deep features improves classification performance, resulting in more representative video summaries compared to the use of hand-crafted features alone.
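To illustrate the general idea only (this is not the authors' pipeline), the sketch below shows how deep features could be extracted from a segment's key frames with a pre-trained CNN and passed to a binary “interesting”/“uninteresting” classifier, keeping the selected segments in their original temporal order. The choice of ResNet-50 via PyTorch/torchvision, the scikit-learn logistic-regression classifier, and the helper names are illustrative assumptions, not details taken from the paper.

    # Illustrative sketch, not the paper's implementation.
    # Assumes frames are PIL images and that labeled training segments exist.
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from sklearn.linear_model import LogisticRegression

    # Pre-trained backbone with the classification head removed -> 2048-d deep features.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def segment_feature(frames):
        """Average the deep features of a segment's key frames (list of PIL images)."""
        batch = torch.stack([preprocess(f) for f in frames])
        return backbone(batch).mean(dim=0).numpy()

    def summarize(segments, features, train_features, train_labels):
        """Keep segments predicted as 'interesting' (label 1), preserving temporal order."""
        clf = LogisticRegression(max_iter=1000).fit(train_features, train_labels)
        keep = clf.predict(features)  # 1 = interesting, 0 = uninteresting
        return [seg for seg, k in zip(segments, keep) if k == 1]

In practice, audio descriptors would be fused with these visual features before classification; the sketch omits that step for brevity.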

