Abstract

Without doubt, general video and sound, as found in large multimedia archives, carry emotional information. Thus, audio and video retrieval by certain emotional categories or dimensions could play a central role for tomorrow's intelligent systems, enabling search for movies with a particular mood, computer-aided scene and sound design in order to elicit certain emotions in the audience, etc. Yet, the lion's share of research in affective computing focuses exclusively on signals conveyed by humans, such as affective speech. Uniting the fields of multimedia retrieval and affective computing is believed to lead to a multiplicity of interesting retrieval applications, and at the same time to benefit affective computing research by moving its methodology “out of the lab” to real-world, diverse data. In this contribution, we address the problem of finding “disturbing” scenes in movies, a scenario that is highly relevant for computer-aided parental guidance. We apply large-scale segmental feature extraction combined with audio-visual classification to the particular task of detecting violence. Our system performs fully data-driven analysis including automatic segmentation. We evaluate the system in terms of mean average precision (MAP) on the official data set of the MediaEval 2012 evaluation campaign's Affect Task, which consists of 18 original Hollywood movies, achieving up to .398 MAP on unseen test data in full realism. An in-depth analysis of the worth of individual features with respect to the target class and the system errors is carried out and reveals the importance of peak-related audio feature extraction and low-level histogram-based video analysis.
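The abstract reports results as mean average precision (MAP) over the retrieved violent segments. For readers unfamiliar with the metric, a minimal sketch of how MAP could be computed from ranked binary relevance labels is shown below; the function names and the toy data are illustrative and not taken from the paper.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked result list.

    ranked_relevance: relevance labels (1 = relevant, 0 = not)
    in the order the system ranked the segments.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            # Precision at each rank where a relevant item appears
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of per-query average precision, e.g. one run per movie."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Toy example: two movies with hand-made relevance rankings
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```

MAP rewards systems that rank relevant (here, violent) segments near the top of the result list, which matches the retrieval framing of the task better than plain classification accuracy.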

Highlights

  • Affective computing refers to emotional intelligence of technical systems in general, yet so far, research in this domain has mostly focused on aspects of human-machine interaction, such as affect-sensitive dialogue systems [1]

  • Endowing systems with the intelligence to describe general multi-modal signals in affective dimensions is believed to lead to many applications, including computer-aided sound and video design, and summarization and search in large multimedia archives; for example, to let a movie director choose ‘creepy’ sounds from a large library, or to let users browse for music or movies with a certain mood

  • To provide objective metrics of feature relevance and system performance in full realism, we evaluate our system on the official corpus of the MediaEval 2012 campaign (Affect SubTask), consisting of 18 Hollywood movies comprising over 35 hours of audio-visual material in total

Introduction

Affective computing refers to emotional intelligence of technical systems in general, yet so far, research in this domain has mostly focused on aspects of human-machine interaction, such as affect-sensitive dialogue systems [1]. Endowing systems with the intelligence to describe general multi-modal signals in affective dimensions is believed to lead to many applications, including computer-aided sound and video design, and summarization and search in large multimedia archives; for example, to let a movie director choose ‘creepy’ sounds from a large library, or to let users browse for music or movies with a certain mood. Another use case is to aid parental guidance by retrieving the most ‘disturbing’ scenes from a movie, such as those associated with highly negative valence. As a special case, yet one of high practical relevance, automatic classification of violent and non-violent movie scenes has been studied.

