Deep Semantic and Attentive Network for Unsupervised Video Summarization

  • Abstract
  • Similar Papers
Abstract

With the rapid growth of video data, video summarization is a promising approach to shorten a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotated labels, and such an annotation process is time-consuming and tedious. In this article, we propose a novel deep summarization framework named Deep Semantic and Attentive Network for Video Summarization (DSAVS) that selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. Another challenge in video summarization arises from the difficulty of modeling temporal information over long durations. Long Short-Term Memory (LSTM) models temporal dependencies well but does not cope well with long video clips. We therefore introduce a self-attention mechanism into our summarization framework to capture long-range temporal dependencies among frames. Extensive experiments on two popular benchmark datasets, i.e., SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
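
A minimal PyTorch sketch of the two mechanisms the abstract names: scaled dot-product self-attention over frame features for long-range temporal dependencies, and a cosine-distance objective between a pooled video embedding and a text embedding. All shapes and names (frames, text_embed, frame_scores) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(frames):                      # frames: (T, d)
    d = frames.size(-1)
    scores = frames @ frames.T / d ** 0.5        # (T, T) pairwise affinities
    weights = F.softmax(scores, dim=-1)          # each frame attends to all frames
    return weights @ frames                      # context-enriched features

def semantic_distance(video_feats, text_embed, frame_scores):
    # Importance-weighted pooling of frame features into one video vector.
    weights = F.softmax(frame_scores, dim=0)     # (T,)
    video_embed = (weights.unsqueeze(-1) * video_feats).sum(dim=0)
    # Training would minimize this distance (no frame-level labels needed).
    return 1.0 - F.cosine_similarity(video_embed, text_embed, dim=0)

T, d = 120, 256
frames = torch.randn(T, d)
attended = self_attention(frames)
loss = semantic_distance(attended, torch.randn(d), torch.randn(T))
print(loss.item())
```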

Similar Papers
  • Research Article
  • Cited by 74
  • 10.1016/j.compeleceng.2021.107618
Deep hierarchical LSTM networks with attention for video summarization
  • Dec 8, 2021
  • Computers and Electrical Engineering
  • Jingxu Lin + 2 more

  • Conference Article
  • Cited by 5
  • 10.1109/ictai50040.2020.00176
Bi-Directional Self-Attention with Relative Positional Encoding for Video Summarization
  • Nov 1, 2020
  • Jingxu Lin + 1 more

Video summarization is a promising approach to processing large-scale video by shortening the content into a compact version. Most previous methods used recurrent networks such as Long Short-Term Memory (LSTM), often combined with an attention mechanism, to achieve state-of-the-art results. However, these networks are complex to implement and cannot be easily parallelized. In this paper, we propose a novel deep summarization framework named Bi-Directional Self-Attention with Relative Positional Encoding for Video Summarization (BiDAVS) that can be highly parallelized. Our proposed BiDAVS considers the position information of the input sequence and effectively captures long-range temporal dependencies among sequential frames by computing bi-directional attention. Extensive experiments on two popular benchmark datasets, i.e., SumMe and TVSum, show that our proposed model outperforms state-of-the-art approaches.
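
As a rough illustration of what bi-directional self-attention with relative positional encoding could look like, the sketch below adds a learned per-offset bias to the attention logits. The clipping distance and bias table are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rel_pos_self_attention(x, rel_bias, max_dist=32):   # x: (T, d)
    T, d = x.shape
    scores = x @ x.T / d ** 0.5          # bi-directional: every frame sees all
    idx = torch.arange(T)
    rel = (idx[None, :] - idx[:, None]).clamp(-max_dist, max_dist) + max_dist
    scores = scores + rel_bias[rel]      # add learned relative-position bias
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(100, 64)
rel_bias = torch.zeros(2 * 32 + 1, requires_grad=True)  # one bias per offset
out = rel_pos_self_attention(x, rel_bias)
print(out.shape)  # torch.Size([100, 64])
```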

  • Research Article
  • Cited by 1
  • 10.52783/jes.1072
Attention-Based Multi-Layered Encoder-Decoder Model for Summarizing Non-Interactive User-Based Videos
  • Apr 4, 2024
  • Journal of Electrical Systems
  • Vasudha Tiwari, Charul Bhatnagar

Video summarization extracts the relevant contents from a video and presents the entire content in a compact, summarized form. User-based video summarization tailors the summary to the requirements of the user. In this work, a non-interactive, perception-based video summarization technique is proposed that uses an attention mechanism to capture the user's interest and extract relevant keyshots in temporal sequence from the video content. Video summarization is formulated as a sequence-to-sequence learning problem, and a supervised method is proposed. Adding layers to an existing network makes it deeper, enables a higher level of abstraction, and facilitates better feature extraction. The proposed model therefore uses a multi-layered, deep summarization encoder-decoder network (MLAVS) with an attention mechanism to select final keyshots from the video. The contextual information of the video frames is encoded using a multi-layered Bidirectional Long Short-Term Memory (BiLSTM) network as the encoder. To decode, a multi-layered attention-based Long Short-Term Memory (LSTM) network using a multiplicative score function is employed. Experiments are performed on the benchmark TVSum dataset and the results are compared with recent works. The results show considerable improvement and demonstrate the efficacy of this methodology against most other state-of-the-art methods.
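
The multiplicative (Luong-style) score function the decoder reportedly uses can be sketched as score(s, h) = sᵀ W h, softmaxed over the encoder states. The snippet below is a dimension-level illustration with assumed shapes, not the MLAVS code.

```python
import torch
import torch.nn.functional as F

def multiplicative_attention(decoder_state, encoder_states, W):
    # decoder_state: (d,), encoder_states: (T, d), W: (d, d)
    scores = encoder_states @ (W @ decoder_state)   # (T,) raw scores
    weights = F.softmax(scores, dim=0)              # attention over T frames
    context = weights @ encoder_states              # (d,) weighted summary
    return context, weights

d, T = 128, 50
ctx, w = multiplicative_attention(torch.randn(d), torch.randn(T, d),
                                  torch.randn(d, d))
print(ctx.shape, w.sum())  # torch.Size([128]) and ~1.0
```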

  • Research Article
  • Cited by 9
  • 10.1016/j.patrec.2021.08.017
Learning Video Actions in Two Stream Recurrent Neural Network
  • Nov 1, 2021
  • Pattern Recognition Letters
  • Ehtesham Hassan

  • Book Chapter
  • 10.1201/9781003277460-16
Framework for Video Summarization Using CNN-LSTM Approach in IoT Surveillance Networks
  • May 9, 2022
  • Chaitrali Chaudhari + 1 more

The surveillance industry is rapidly changing and growing. Processing surveillance videos to retain useful information is a challenging and time-consuming task. This work proposes a framework for the effective summarization of surveillance videos by combining the advantages of deep learning with the Internet of Things (IoT). A convolutional neural network (CNN) selects significant video features from the sampled video frames, and a long short-term memory (LSTM) network generates the video summary in a compact form while preserving the salient information. The summary can be sent to the receiver over a network, reducing bandwidth utilization and transmission cost. The proposed framework operates autonomously without human intervention. Since analyzing the summarized content takes considerably less time than analyzing the full video, quicker decision-making and faster action are possible in emergencies, increasing efficiency in crucial situations.
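
A minimal sketch of the CNN-plus-LSTM pipeline described here, assuming a pretrained backbone (ResNet-18 via torchvision) as the CNN and a top-k frame selection rule; both choices are illustrative, not the chapter's implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

cnn = resnet18(weights=None)
cnn.fc = nn.Identity()                      # keep the 512-d pooled features
cnn.eval()
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
scorer = nn.Linear(256, 1)

frames = torch.randn(1, 40, 3, 224, 224)    # (batch, T, C, H, W) sampled frames
with torch.no_grad():
    feats = cnn(frames.flatten(0, 1)).view(1, 40, 512)  # per-frame features
    hidden, _ = lstm(feats)                  # temporal context
    scores = scorer(hidden).squeeze(-1)      # per-frame importance
    keep = scores.topk(k=6, dim=1).indices   # indices of summary frames
print(keep)
```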

  • Conference Article
  • Cited by 62
  • 10.1145/3343031.3350992
Stacked Memory Network for Video Summarization
  • Oct 15, 2019
  • Junbo Wang + 5 more

In recent years, supervised video summarization has achieved promising progress with various recurrent neural network (RNN) based methods, which treat video summarization as a sequence-to-sequence learning problem to exploit temporal dependency among video frames across variable ranges. However, RNNs have limitations in modelling the long-term temporal dependency needed to summarize videos with thousands of frames, due to their restricted memory storage. Therefore, in this paper we propose a stacked memory network, SMN, to explicitly model the long-range dependency among video frames so that redundancy in the produced video summaries is minimized. The proposed SMN consists of two key components, a Long Short-Term Memory (LSTM) layer and a memory layer, where each LSTM layer is augmented with an external memory layer. In particular, we stack multiple LSTM and memory layers hierarchically to integrate the representations learned in prior layers. By combining the hidden states of the LSTM layers with the read representations of the memory layers, SMN derives more accurate summary predictions for individual video frames. Compared with existing RNN-based methods, SMN is particularly good at capturing long-range temporal dependency among frames with few additional training parameters. Experimental results on two widely used public benchmark datasets, SumMe and TVSum, demonstrate that the proposed model clearly outperforms a number of state-of-the-art ones under various settings.
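
A hedged sketch of the core SMN idea: one LSTM layer augmented with an external memory read via attention. The memory size and the hidden-state/read combination rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedLSTM(nn.Module):
    def __init__(self, d, slots=64):
        super().__init__()
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.memory = nn.Parameter(torch.randn(slots, d))  # external memory

    def forward(self, x):                       # x: (B, T, d)
        h, _ = self.lstm(x)                     # LSTM hidden states
        attn = F.softmax(h @ self.memory.T, dim=-1)   # (B, T, slots)
        read = attn @ self.memory               # memory read vectors
        return h + read                         # combine hidden state and read

layer = MemoryAugmentedLSTM(d=128)
out = layer(torch.randn(2, 200, 128))
print(out.shape)  # stacking several such layers gives the "stacked" network
```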

  • Conference Article
  • Cited by 6
  • 10.1109/siu.2019.8806603
Unsupervised Video Summarization with Independently Recurrent Neural Networks
  • Apr 1, 2019
  • Gokhan Yalınız + 1 more

Video summarization, a research topic that has gained significant momentum over the past few years, aims to produce shorter, more concise videos that represent the content of long videos as diversely as possible. The hyperbolic tangent and sigmoid activation functions used in the long short-term memory (LSTM) and gated recurrent unit (GRU) models of recent studies can cause gradient decay across layers. Moreover, the entanglement of neurons in recurrent neural networks (RNNs) can make these networks difficult to interpret and develop. To address these issues, this study proposes a method that uses deep reinforcement learning together with independently recurrent neural networks (IndRNN) for the unsupervised video summarization problem. In this way, the model can be trained over more steps and with more layers without gradient problems. Experiments conducted on two benchmark datasets show that the method obtains better results than state-of-the-art video summarization techniques.
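
The IndRNN recurrence the method builds on is standard: each neuron keeps a scalar recurrent weight (an elementwise product instead of a full recurrent matrix) and uses ReLU instead of tanh/sigmoid, which is what mitigates the gradient decay mentioned above. A minimal cell, with illustrative sizes:

```python
import torch
import torch.nn as nn

class IndRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size)
        self.u = nn.Parameter(torch.ones(hidden_size))  # per-neuron recurrence

    def forward(self, x_t, h_prev):
        # h_t = relu(W x_t + u * h_{t-1}); neurons are independent of each other
        return torch.relu(self.W(x_t) + self.u * h_prev)

cell = IndRNNCell(64, 128)
h = torch.zeros(1, 128)
for t in range(300):                 # long sequences remain trainable
    h = cell(torch.randn(1, 64), h)
print(h.shape)
```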

  • Addendum
  • Cited by 12
  • 10.1007/s12652-020-02025-8
RETRACTED ARTICLE: Multi-edge optimized LSTM RNN for video summarization
  • May 4, 2020
  • Journal of Ambient Intelligence and Humanized Computing
  • N Archana + 1 more

Video summarization is an essential process in today's connected world. Improvements in digital communication and filmless video recording technologies have triggered tremendous growth in the storage and sharing of a wide variety of videos. Video summarization is used to optimize the searching and organizing of different types of videos. Precision, recall, F-score, and processing time are the primary evaluation metrics of a video summarization procedure. A frequency-domain multi-edge detection process and a Multi-Edge optimized Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) are proposed and integrated in this work. The frequency-domain multi-edge detection is introduced to improve precision, recall, and F-score, whereas the Multi-Edge Optimized LSTM is used to reduce the processing time of the summarization. A Discrete Wavelet Transformation based multi-edge detection algorithm is introduced and integrated with the optimized LSTM to improve the summarization process. The proposed method, named Multi-Edge optimized LSTM RNN for Video Summarization (MOLRVS), is intended to perform video summarization in real-time video streaming environments and to reduce a significant amount of manual intervention.
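
A rough sketch of the frequency-domain edge idea using PyWavelets: a single-level 2-D discrete wavelet transform splits a frame into approximation and detail sub-bands, and the detail coefficients serve as horizontal/vertical/diagonal edge maps. The Haar wavelet and the frame-level edge-energy score are assumptions, not the paper's exact algorithm.

```python
import numpy as np
import pywt  # PyWavelets

frame = np.random.rand(240, 320)               # grayscale frame
cA, (cH, cV, cD) = pywt.dwt2(frame, 'haar')    # approximation + 3 detail bands
edge_energy = np.abs(cH) + np.abs(cV) + np.abs(cD)
# A frame-level edge score like this could feed a downstream LSTM selector.
print(edge_energy.mean())
```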

  • Research Article
  • Cited by 37
  • 10.1016/j.bspc.2021.102801
A deep learning spatial-temporal framework for detecting surgical tools in laparoscopic videos
  • May 26, 2021
  • Biomedical Signal Processing and Control
  • Tamer Abdulbaki Alshirbaji + 4 more

  • Research Article
  • Cited by 68
  • 10.1109/tmm.2019.2959451
Unsupervised Video Summarization With Cycle-Consistent Adversarial LSTM Networks
  • Sep 24, 2020
  • IEEE Transactions on Multimedia
  • Li Yuan + 3 more

Video summarization is an important technique for browsing, managing, and retrieving large amounts of video efficiently. Its main objective is to minimize the information loss when selecting a subset of frames from the original video, so that the summary faithfully represents the overall story. Recently developed unsupervised video summarization approaches do not require tedious annotation of important frames to train a model and are thus practically attractive. However, their performance is still limited by the difficulty of minimizing information loss between the summary and the original video. In this paper, we address unsupervised video summarization by developing a novel Cycle-consistent Adversarial LSTM architecture that effectively reduces the information loss in the summary video. The proposed model, named Cycle-SUM, consists of a frame selector and a cycle-consistent learning based evaluator. The selector is a bi-directional LSTM network that captures the long-range relationships between video frames. To overcome the difficulty of specifying a suitable information-preserving metric between the original and summary videos, the evaluator is introduced to “supervise” the selector and improve summarization quality. Specifically, the evaluator is composed of two generative adversarial networks (GANs): the forward GAN learns to reconstruct the original video from the summary video, while the backward GAN learns to invert the process. We establish the relation between mutual information maximization and this cycle learning procedure and further introduce a cycle-consistent loss to regularize the summarization. Extensive experiments on three video summarization benchmark datasets demonstrate state-of-the-art performance and show the superiority of the Cycle-SUM model over other unsupervised approaches.
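
The cycle-consistency idea can be sketched with stand-in generators: a forward model reconstructs the original features from a selector-weighted summary, a backward model inverts it, and an L1 cycle loss ties them together. The MLP generators below replace the paper's GANs purely for illustration.

```python
import torch
import torch.nn as nn

d = 128
forward_g = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
backward_g = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

original = torch.randn(200, d)                 # frame features
scores = torch.sigmoid(torch.randn(200, 1))    # selector's keep-probabilities
summary = scores * original                    # soft summary

reconstructed = forward_g(summary)             # summary -> original
inverted = backward_g(reconstructed)           # original -> summary
cycle_loss = (reconstructed - original).abs().mean() \
           + (inverted - summary).abs().mean()
print(cycle_loss.item())
```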

  • Conference Article
  • 10.1109/i2ct54291.2022.9824044
Usage of Parallelization Techniques for Video Summarisation: State-of-the-art, Open Issues, and Future Research Avenues
  • Apr 7, 2022
  • Sonali Karale + 1 more

Videos are one of the prime and most important sources of information, and they are available in huge numbers. Searching for something in a video requires scanning the entire clip, which is very time-consuming. Search time can be reduced using video summarisation, which extracts the relevant frames or key-frames from the video to give a quick summary of an event. The relevant frames are extracted based on object detection, face recognition, event detection, movement in the frames, etc. Both supervised and unsupervised approaches are used to achieve the desired result. As research in this area deepened, new, more efficient technologies were invented: deep learning feature extraction algorithms were found to be more efficient than conventional feature extraction algorithms. To make the video summarisation process faster, researchers have also used parallel processing. This paper reviews the different video summarisation techniques available in the literature, with a comparative study of techniques implemented using parallel processing and multi-core CPUs. While the state of the art is presented through a comprehensive and systematic study, the paper also presents open issues, challenges, and open research avenues in the area. It concludes that there is clear scope for improvement in processing time and in the use of parallelization at various stages of video summarisation, and that empirical studies in the field should use large video datasets collected from multiple cameras rather than small datasets collected with a single camera.

  • Research Article
  • Cited by 32
  • 10.1016/j.knosys.2021.106971
Recurrent generative adversarial networks for unsupervised WCE video summarization
  • Mar 20, 2021
  • Knowledge-Based Systems
  • Libin Lan + 1 more

  • Research Article
  • Cited by 47
  • 10.1109/tnnls.2021.3119969
Audiovisual Video Summarization
  • Aug 1, 2023
  • IEEE Transactions on Neural Networks and Learning Systems
  • Bin Zhao + 2 more

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. In video summarization, however, most existing approaches exploit only the visual information while neglecting the audio. In this brief, we argue that the audio modality can help the vision modality better understand the video content and structure, and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for video summarization and develop an audiovisual recurrent network (AVRN) to achieve this. Specifically, AVRN can be separated into three parts: 1) a two-stream long short-term memory (LSTM) encodes the audio and visual features sequentially, capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; and 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on the two benchmarks, i.e., SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
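
A minimal sketch of the three-part layout the abstract describes, with separate audio and visual LSTMs, a fusion LSTM, and self-attention for global context. All sizes and the final scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, dv, da, dh = 150, 512, 128, 256
visual_lstm = nn.LSTM(dv, dh, batch_first=True)
audio_lstm = nn.LSTM(da, dh, batch_first=True)
fusion_lstm = nn.LSTM(2 * dh, dh, batch_first=True)
attn = nn.MultiheadAttention(embed_dim=dh, num_heads=4, batch_first=True)
scorer = nn.Linear(dh, 1)

v, _ = visual_lstm(torch.randn(1, T, dv))          # visual stream
a, _ = audio_lstm(torch.randn(1, T, da))           # audio stream
fused, _ = fusion_lstm(torch.cat([v, a], dim=-1))  # latent audiovisual fusion
ctx, _ = attn(fused, fused, fused)                 # global dependency
scores = scorer(fused + ctx).squeeze(-1)           # per-frame summary scores
print(scores.shape)  # torch.Size([1, 150])
```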

  • Research Article
  • Cited by 380
  • 10.1109/tcsvt.2019.2904996
Video Summarization With Attention-Based Encoder–Decoder Networks
  • Jul 30, 2019
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Zhong Ji + 3 more

This paper addresses supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel video summarization framework named attentive encoder-decoder networks for video summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) network to encode the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on two video summarization benchmark datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of the proposed AVS-based approaches over the state of the art, with remarkable improvements on both datasets.
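
The two attention score functions compared in AVS follow the standard additive (Bahdanau-style) vᵀ tanh(W₁h + W₂s) and multiplicative (Luong-style) sᵀWh forms. A dimension-level sketch, with assumed shapes:

```python
import torch
import torch.nn.functional as F

def additive_score(s, H, W1, W2, v):       # s: (d,), H: (T, d)
    return torch.tanh(H @ W1.T + s @ W2.T) @ v          # (T,) raw scores

def multiplicative_score(s, H, W):
    return H @ (W @ s)                                  # (T,) raw scores

d, T = 64, 30
H, s = torch.randn(T, d), torch.randn(d)
W1, W2, W, v = (torch.randn(d, d), torch.randn(d, d),
                torch.randn(d, d), torch.randn(d))
for score in (additive_score(s, H, W1, W2, v), multiplicative_score(s, H, W)):
    print(F.softmax(score, dim=0).shape)   # attention weights over T frames
```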

  • Research Article
  • Cited by 24
  • 10.1038/s41598-022-11726-3
A temporal dependency feature in lower dimension for lung sound signal classification
  • May 12, 2022
  • Scientific Reports
  • Amy M Kwon + 1 more

Respiratory sounds are nonlinear and nonstationary signals whose unpredictability makes it difficult to extract significant features for classification. Static cepstral coefficients, such as Mel-frequency cepstral coefficients (MFCCs), have been used for the classification of lung sound signals; however, they are modeled in a high-dimensional hyperspectral space and lose temporal dependency information. We therefore propose shifted delta-cepstral coefficients in a lower subspace (SDC-L) as a novel feature for lung sound classification. It preserves the temporal dependency information of multiple nearby frames, as in the original SDC, and improves feature extraction by reducing the hyperspectral dimension. We modified the EMD algorithm by adding a stopping rule to objectively select a finite number of intrinsic mode functions (IMFs). The performance of SDC-L was evaluated with three machine learning techniques (support vector machine (SVM), k-nearest neighbor (k-NN), and random forest (RF)), two deep learning algorithms (multilayer perceptron (MLP) and convolutional neural network (CNN)), and one hybrid deep learning algorithm combining a CNN with long short-term memory (LSTM), in terms of accuracy, precision, recall, and F1-score. We found that the first two IMFs were enough to construct our feature. SVM, MLP, and the hybrid algorithm (CNN plus LSTM) performed best with SDC-L, and the other classifiers achieved equivalent results with all features. Our findings show that SDC-L is a promising feature for the classification of lung sound signals.
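
For readers unfamiliar with shifted delta cepstra, the sketch below shows the standard construction: for each frame, k delta vectors taken at shifts of P frames (each computed with spread d) are stacked. The (d, P, k) settings are illustrative, and the further dimension reduction that yields SDC-L is the paper's contribution, not reproduced here.

```python
import numpy as np

def sdc(mfcc, d=1, P=3, k=3):              # mfcc: (T, n_coeffs)
    T = mfcc.shape[0]
    blocks = []
    for t in range(T):
        deltas = []
        for i in range(k):
            lo = min(max(t + i * P - d, 0), T - 1)   # clamp at sequence edges
            hi = min(t + i * P + d, T - 1)
            deltas.append(mfcc[hi] - mfcc[lo])       # delta at shift i*P
        blocks.append(np.concatenate(deltas))
    return np.stack(blocks)                          # (T, k * n_coeffs)

feats = sdc(np.random.randn(100, 13))
print(feats.shape)   # (100, 39); SDC-L would further reduce this dimension
```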
