Abstract

Video segmentation into shots is the first step in video indexing and searching. Video shots are mostly very short in duration and give little meaningful insight into the visual content. Grouping shots by similar visual content gives a better understanding of the video scene; this grouping is known as scene boundary detection, or video segmentation into scenes. In this paper, we propose a model for segmenting video into visual scenes using the bag of visual words (BoVW) model. The video is first divided into shots, each of which is then represented by a set of key frames. Key frames are in turn represented by BoVW feature vectors that are short and compact compared to classical BoVW model implementations. Two variations of the BoVW model are used: (1) the classical BoVW model and (2) the Vector of Locally Aggregated Descriptors (VLAD), an extension of the classical BoVW model. Shot similarity is computed from the distances between the shots' key-frame feature vectors within a sliding window of length L, rather than by comparing each shot against a very long list of shots as in previous practice; the value of L is 4. Experiments on cinematic and drama videos show the effectiveness of the proposed framework. In the proposed model, the BoVW representation is a 25000-dimensional vector and the VLAD representation is only a 2048-dimensional vector. BoVW achieves a segmentation accuracy of 0.90, whereas VLAD achieves 0.83.
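The sliding-window comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the threshold value, the use of Euclidean distance, and the rule that two shots are as close as their closest pair of key frames are all assumptions made for illustration. Only the window length L = 4 comes from the abstract.

```python
import math


def shot_distance(shot_a, shot_b):
    """Smallest Euclidean distance between any pair of key-frame vectors.

    Assumption: a shot is a list of key-frame feature vectors, and two
    shots are as close as their closest pair of key frames.
    """
    return min(math.dist(u, v) for u in shot_a for v in shot_b)


def group_shots_into_scenes(shots, window=4, threshold=0.5):
    """Assign a scene label to every shot using a sliding window.

    Each shot is compared only with the `window` shots that precede it.
    If one of them is within `threshold` (a hypothetical value), the shot
    joins that scene, and every shot in between is pulled into the same
    scene. Otherwise the shot opens a new scene.
    """
    labels = []
    for i, shot in enumerate(shots):
        match = None
        for j in range(max(0, i - window), i):
            if shot_distance(shots[j], shot) <= threshold:
                match = j  # earliest similar shot inside the window
                break
        if match is None:
            # No similar shot in the window: start a new scene.
            labels.append(labels[-1] + 1 if labels else 0)
        else:
            # Merge everything between the matched shot and this one.
            for k in range(match + 1, i):
                labels[k] = labels[match]
            labels.append(labels[match])
    return labels


# Toy 2-D "feature vectors"; real BoVW or VLAD vectors would be much longer.
example_shots = [
    [[0.0, 0.0]],
    [[1.0, 1.0]],
    [[0.1, 0.0]],
    [[5.0, 5.0]],
    [[5.1, 5.0]],
]
print(group_shots_into_scenes(example_shots))  # prints [0, 0, 0, 1, 1]
```

In this toy run the third shot resembles the first, so the second shot between them is absorbed into the same scene. With `window=1` that link would be missed, which shows why the window length matters.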

Highlights

  • The size of video databases is increasing exponentially due to the emergence of cheap and fast Internet

  • Video segmentation is a primary step for video indexing and searching

  • Shot boundary detection divides the videos into small units


Summary

Introduction

The indexing and retrieval of videos are becoming more difficult. Giant video portals, such as YouTube, Dailymotion, and Google, are investing heavily in efficient and smart indexing and retrieval so that their portals remain attractive and addictive to users. To process videos for indexing and searching, the first task is to segment the videos into shots and extract representative frames, known as key frames, from each shot. These key frames are later used for searching, efficient indexing, scene generation, and video classification. Detecting possible objects in a single frame takes 0.5 to 1.5 seconds (a cascade object detector in Matlab is used to identify possible text boards in the frame).

