Abstract

The ever-increasing amount of user-generated audiovisual content has raised the demand for easy navigation across content collections and repositories, necessitating detailed yet concise content representations. A typical approach to this goal is to construct a visual summary, which is significantly more expressive than alternatives such as verbal annotations. In this paper, we describe a video summarization technique based on the extraction and fusion of audio and visual features, which generates dynamic video summaries, i.e., summaries that include the most essential segments of the original video while preserving their original temporal order. Based on the extracted features, each video segment is classified as “interesting” or “uninteresting,” and hence included in or excluded from the final summary. The originality of our technique is that, prior to classification, we employ a transfer learning strategy to extract deep features from pre-trained models as input to the classifiers, making them more robust and less subjective. We evaluate our technique on a large dataset of user-generated videos and demonstrate that the addition of deep features improves classification performance, resulting in more representative video summaries compared to the use of hand-crafted features alone.
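To illustrate the general idea only (this is not the authors' pipeline), the sketch below shows how deep features could be extracted from a segment's key frames with a pre-trained CNN and passed to a binary “interesting”/“uninteresting” classifier, keeping the selected segments in their original temporal order. The choice of ResNet-50 via PyTorch/torchvision, the scikit-learn logistic-regression classifier, and the helper names are illustrative assumptions, not details taken from the paper.

    # Illustrative sketch, not the paper's implementation.
    # Assumes frames are PIL images and that labeled training segments exist.
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from sklearn.linear_model import LogisticRegression

    # Pre-trained backbone with the classification head removed -> 2048-d deep features.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def segment_feature(frames):
        """Average the deep features of a segment's key frames (list of PIL images)."""
        batch = torch.stack([preprocess(f) for f in frames])
        return backbone(batch).mean(dim=0).numpy()

    def summarize(segments, features, train_features, train_labels):
        """Keep segments predicted as 'interesting' (label 1), preserving temporal order."""
        clf = LogisticRegression(max_iter=1000).fit(train_features, train_labels)
        keep = clf.predict(features)  # 1 = interesting, 0 = uninteresting
        return [seg for seg, k in zip(segments, keep) if k == 1]

In practice, audio descriptors would be fused with these visual features before classification; the sketch omits that step for brevity.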

