Abstract

The sheer volume of movies produced these days calls for automated analytics for efficient classification, query-based search, and extraction of desired information. Such tasks can only be performed efficiently by a machine-learning-based algorithm. We address this issue by proposing a deep-learning-based technique for predicting the relevant tags for a movie and segmenting the movie with respect to the predicted tags. We construct a tag vocabulary and create the corresponding dataset to train a deep learning model. We then propose an efficient shot detection algorithm to find the key frames in the movie. The extracted key frames are analyzed by the deep learning model to predict the top three tags for each frame. The tags are then assigned weighted scores and filtered to generate a compact set of the most relevant tags. This process also generates a corpus which is further used to segment a movie based on a selected tag. We present a rigorous analysis of segmentation quality with respect to the number of tags selected for the segmentation. Our detailed experiments demonstrate that the proposed technique is effective not only in predicting the most relevant tags for a movie but also in segmenting the movie with respect to the selected tags with high accuracy.
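The tag-scoring step described above (top-three tags per key frame, weighted and filtered into a compact set) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rank weights, the keep ratio, and the example predictions are all hypothetical, and the paper does not specify its exact weighting scheme.

```python
from collections import defaultdict

def aggregate_tags(frame_predictions, rank_weights=(1.0, 0.6, 0.3), keep_ratio=0.3):
    """Combine per-frame top-3 tag predictions into a compact tag set.

    frame_predictions: one list of (tag, probability) pairs per key frame,
    ordered by descending probability. Each tag accumulates a rank-weighted
    score across frames; tags scoring at least `keep_ratio` of the best
    tag's score are kept, most relevant first.
    """
    scores = defaultdict(float)
    for preds in frame_predictions:
        for rank, (tag, prob) in enumerate(preds[:3]):
            scores[tag] += rank_weights[rank] * prob
    if not scores:
        return []
    best = max(scores.values())
    return sorted((t for t, s in scores.items() if s >= keep_ratio * best),
                  key=lambda t: -scores[t])

# hypothetical top-3 predictions for three key frames
frames = [
    [("car", 0.9), ("road", 0.5), ("tree", 0.2)],
    [("car", 0.8), ("road", 0.6), ("sky", 0.1)],
    [("person", 0.7), ("car", 0.4), ("road", 0.3)],
]
print(aggregate_tags(frames))  # → ['car', 'road', 'person']
```

Rarely occurring, low-probability tags ("tree", "sky") fall below the threshold, which is the filtering behavior the abstract describes.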

Highlights

  • The enormous amount of multimedia data generated these days makes it challenging to devise techniques that can automatically examine the contents of multimedia data to ascertain its authenticity and classify it

  • We transfer the features of a pre-trained Convolutional Neural Network (CNN), Inception-V3 [5], to our training task by modifying and re-training its final layer using transfer learning. We further propose an efficient shot detection technique for determining the key frames in a movie, which are later analyzed by the deep learning model

  • Segmenting a video into constituent topics, which can later be retrieved by a query, requires an intelligent semantic analysis of each shot. This is only efficiently possible with a deep-learning-based algorithm that does not require a priori knowledge of the low-level features. We addressed this issue with a threefold approach: (i) we proposed an efficient shot boundary detection algorithm which finds the representative key frames of all the shots in a movie, (ii) we trained a convolutional neural network on a tag vocabulary to predict the context of each key frame and subsequently generated a compact set of movie tags without requiring a priori information of image features or user-annotated metadata, and (iii) we offered on-demand segmentation of the movie based on its predicted set of tags
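Step (i) above, shot boundary detection followed by key-frame selection, can be sketched with a simple histogram-difference heuristic. This is a hedged illustration only: the paper's actual algorithm is not specified here, and the threshold, bin count, and middle-frame key-frame rule are assumptions.

```python
import numpy as np

def detect_shot_keyframes(frames, threshold=0.5, bins=16):
    """Pick one representative key frame per detected shot.

    A shot boundary is declared wherever the L1 distance between the
    normalised intensity histograms of consecutive frames exceeds
    `threshold`; the middle frame of each shot is kept as its key frame.
    `frames` is a sequence of 2-D uint8 grayscale arrays.
    """
    def hist(frame):
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return h / max(frame.size, 1)

    boundaries = [0]
    prev = hist(frames[0])
    for i in range(1, len(frames)):
        cur = hist(frames[i])
        if 0.5 * np.abs(cur - prev).sum() > threshold:  # half-L1 in [0, 1]
            boundaries.append(i)
        prev = cur
    boundaries.append(len(frames))
    # middle frame of each [start, end) shot interval
    return [(a + b - 1) // 2 for a, b in zip(boundaries, boundaries[1:])]

# two synthetic shots: four dark frames followed by four bright frames
dark = [np.zeros((8, 8), dtype=np.uint8)] * 4
bright = [np.full((8, 8), 200, dtype=np.uint8)] * 4
print(detect_shot_keyframes(dark + bright))  # → [1, 5]
```

Only the key frames, rather than every frame, are then passed to the CNN, which is what makes the per-movie analysis tractable.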

Introduction

The enormous amount of multimedia data generated these days makes it challenging to devise techniques that can automatically examine the contents of multimedia data to ascertain its authenticity and classify it. Our preliminary experiments on this topic further reveal that this ostensibly trivial task entails an intelligent analysis of a video to predict its representative tags without human intervention. This automatically extracted information has immense applications in optimizing video search, automatically retrieving scenes from videos based on a user's query, object detection and localization, automatic text/subtitle generation for videos, detecting specific events in videos, action recognition, behavior recognition, recommendation systems, etc. Among these applications, scene-driven retrieval is important in that it helps in content censorship (e.g., automatically censoring scenes containing nudity, sex, violence, smoking, etc.) and in on-demand retrieval of desired scenes from a given movie (e.g., generating highlights of a soccer match containing all the goal events).
