Abstract

The huge amount of video data available these days requires effective management techniques for storage, indexing, and retrieval. Video summarization, a method to manage video data, provides concise versions of videos for efficient browsing and retrieval. Key frame extraction is a form of video summarization that selects only the most salient frames from a given video. Since automatic semantic understanding of video content is not yet possible, most existing works employ low-level index features to extract key frames. However, the use of low-level features results in a loss of semantic detail, leading to a semantic gap. In this context, saliency-based user attention modeling can be used to bridge this gap. In this paper, a key frame extraction scheme based on a visual attention mechanism is proposed. The proposed scheme builds a static visual attention model based on multi-scale contrast instead of the usual color contrast. The dynamic visual attention model is developed from novel relative motion intensity and relative motion orientation descriptors. An efficient fusion scheme for combining the three visual attention values is then proposed, followed by a flexible key frame extraction technique. The experimental results demonstrate that the proposed mechanism provides excellent results compared with some of the other prominent techniques in the literature.
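To make the static attention step concrete, the following is a minimal sketch of a multi-scale contrast saliency map, assuming one common formulation: per-pixel contrast against the local neighbourhood mean, accumulated over a Gaussian pyramid. The function name `multiscale_contrast` and the parameter choices are illustrative, not taken from the paper.

```python
import cv2
import numpy as np

def multiscale_contrast(frame_bgr, levels=3, window=9):
    """Static saliency map: local contrast accumulated over pyramid levels."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    h, w = gray.shape
    saliency = np.zeros((h, w), np.float32)
    level = gray
    for _ in range(levels):
        # Contrast at this scale: squared difference from the local mean.
        local_mean = cv2.blur(level, (window, window))
        contrast = (level - local_mean) ** 2
        # Accumulate the contribution of each scale at full resolution.
        saliency += cv2.resize(contrast, (w, h))
        level = cv2.pyrDown(level)
    # Normalise to [0, 1] so the map can be fused with the dynamic cues.
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)
```

Maps like this one would then be combined with the dynamic attention maps in the fusion step described above.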

Highlights

  • The amount of video data on the internet is increasing day by day, primarily because of increased processing power, faster networks, cheaper storage devices, and rapid development in digital video capture and editing technologies [1]

  • The use of low-level features is inherently associated with a loss of semantic detail, creating a large semantic gap

  • The basic assumption in attention-based techniques is to extract as key frames those frames that are visually important to humans, according to visual attention models

Summary

Introduction

The amount of video data on the internet is increasing day by day, primarily because of increased processing power, faster networks, cheaper storage devices, and rapid development in digital video capture and editing technologies [1]. Key frame extraction addresses this growth by selecting only the most salient frames of a video. The basic assumption in visual attention-based techniques is to extract as key frames those frames that are visually important to humans, according to visual attention models. In this way, the semantic details of a video can be approximated better than with low-level features. A visual attention-based mechanism for extracting key frames from videos is proposed. The framework develops efficient visual saliency-based static and dynamic attention models and combines them using a proposed non-linear weighted fusion mechanism. In videos, motion is an important factor in building a human attention model. Because of this importance, two descriptors based on relative motion strength and relative motion direction have been proposed for building saliency maps. The resultant motion vectors are used to compute these two descriptors, called relative motion intensity and relative motion consistency, as sketched below.
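As a rough sketch of the dynamic cues described above (the paper's exact formulation may differ), the snippet below derives the two motion descriptors from dense optical flow: intensity is read here as motion magnitude relative to the frame average, and orientation as each vector's deviation from the dominant motion direction. A simple weighted sum stands in for the paper's non-linear fusion; all names (`dynamic_attention`, `fuse`, the weights) are assumptions for illustration.

```python
import cv2
import numpy as np

def dynamic_attention(prev_gray, curr_gray, eps=1e-6):
    """Relative motion intensity and orientation maps from dense optical flow.

    Inputs are consecutive 8-bit grayscale frames.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Relative motion intensity: magnitude relative to the frame average,
    # so object motion stands out against global (camera) motion.
    rel_intensity = mag / (mag.mean() + eps)
    # Relative motion orientation: angular deviation of each flow vector
    # from the dominant (circular-mean) direction, scaled to [0, 1].
    mean_ang = np.arctan2(np.sin(ang).mean(), np.cos(ang).mean())
    rel_orientation = np.abs(np.arctan2(np.sin(ang - mean_ang),
                                        np.cos(ang - mean_ang))) / np.pi
    return rel_intensity, rel_orientation

def fuse(static_map, rel_intensity, rel_orientation,
         w_s=0.4, w_i=0.3, w_o=0.3):
    """Combine the three attention cues into one frame-level value."""
    def norm(x):
        # Min-max normalise each cue before weighting.
        return cv2.normalize(x.astype(np.float32), None, 0.0, 1.0,
                             cv2.NORM_MINMAX)
    fused = (w_s * norm(static_map) + w_i * norm(rel_intensity)
             + w_o * norm(rel_orientation))
    return float(fused.mean())  # frame-level attention score
```

Evaluating `fuse(...)` over consecutive frames yields a per-frame attention curve whose local maxima could then drive the key frame selection step.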
