Video Summarization Task Research Articles

The increasing volume of user-generated human-centric video content and its applications, such as video retrieval and browsing, require compact representations, a need addressed by the video summarization literature. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem, and existing solutions often neglect the growing share of human-centric video, which inherently contains affective content. In this study, we investigate the affective-information-enriched supervised video summarization task for human-centric videos. First, we train a state-of-the-art, visual-input-driven continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate activation and valence attributes. Then, we integrate the estimated emotional attributes and their high-level embeddings from the CER-NET with the visual information to define the proposed affective video summarization (AVSUM) architectures. In addition, we investigate the use of attention to improve the AVSUM architectures and propose two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We conduct video summarization experiments on the TVSum and COGNIMUSE datasets. The proposed temporal attention-based TA-AVSUM architecture attains competitive video summarization performance, with strong improvements on human-centric videos compared to the state of the art in terms of F-score, self-defined face recall, and rank correlation metrics.
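The F-score named in the abstract above is usually derived from the overlap between a predicted summary and a reference summary. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch, assuming both summaries are given as per-frame binary selections of equal length; the function name `summary_f_score` and the sample data are hypothetical.

```python
def summary_f_score(predicted, reference):
    """Frame-level F-score between two binary selection lists.

    Assumption: each list holds one 0/1 flag per frame, where 1 means
    the frame is kept in the summary. This is an illustrative protocol,
    not necessarily the one used in the paper.
    """
    if len(predicted) != len(reference):
        raise ValueError("selections must cover the same number of frames")

    # Frames selected by both summaries.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No overlap (or an empty summary) gives a score of zero.
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0

    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: 1 marks a frame kept in the summary.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
score = summary_f_score(predicted, reference)
```

Here two of the three predicted frames also appear in the reference, so precision and recall are both 2/3 and the F-score is 2/3.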

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, most existing approaches exploit only the visual information and neglect the audio information. In this brief, we argue that the audio modality can help the vision modality to better understand the video content and structure, and thereby benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task and develop an audiovisual recurrent network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream long short-term memory (LSTM) network encodes the audio and visual features sequentially by capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; and 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
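The third part of the architecture described above, the self-attention encoder that captures global dependency across frames, can be illustrated roughly as follows. This is a sketch under strong assumptions rather than the AVRN itself: the learned fusion LSTM is replaced by plain per-frame concatenation, the attention uses identity projections (no learned weights), and the names `fuse`, `self_attention` and the toy features are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio, visual):
    """Concatenate per-frame audio and visual feature vectors.

    Assumption: a stand-in for the learned audiovisual fusion LSTM,
    which is not reproduced here.
    """
    return [a + v for a, v in zip(audio, visual)]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all frame vectors, so
    every frame can draw on the whole video (global dependency).
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    contexts = []
    for query in features:
        # Similarity of this frame to every frame in the video.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        weights = softmax(scores)
        # Weighted average of all frame vectors, one dimension at a time.
        context = [sum(w * value[j] for w, value in zip(weights, features))
                   for j in range(dim)]
        contexts.append(context)
    return contexts


# Toy features for three frames: 1-D audio and 2-D visual vectors.
audio = [[0.1], [0.9], [0.2]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fused = fuse(audio, visual)
contexts = self_attention(fused)
```

In a trained model the queries, keys and values would come from learned projections, and the attended features would feed a scoring layer that predicts per-frame importance.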

Articles published on Video Summarization Task

Use of Affective Visual Information for Summarization of Human-Centric Videos

AudioVisual Video Summarization

Topic-aware video summarization using multimodal transformer

An Efficient Method for Underwater Video Summarization and Object Detection Using YoLoV3

Video Summarization Through Reinforcement Learning With a 3D Spatio-Temporal U-Net

Deep hierarchical LSTM networks with attention for video summarization

Video Summarization Using Deep Neural Networks: A Survey

Hierarchical multimodal transformer to summarize videos

Using independently recurrent networks for reinforcement learning based unsupervised video summarization

Dynamic graph convolutional network for multi-video summarization

Meta Learning for Task-Driven Video Summarization

TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization

Deep Reinforcement Learning for Query-Conditioned Video Summarization

Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network

Video summarization via minimum sparse reconstruction

Automatic summarization of rushes video using bipartite graphs
