Abstract

Nowadays, camera systems are installed in military areas as well as in public spaces such as schools, shopping malls, airports, and football stadiums. Human operators monitor the screens, looking for any signs of unwanted behavior or negative incidents, a task that requires personnel around the clock. With the ever-increasing number of cameras, surveillance operators become overloaded. The monotony of constantly watching screens and the sparsity of notable events are bound to reduce the operators' focus. Furthermore, some events are hard to distinguish from video alone: severe events such as gunshots and screams are much easier to hear than to see. For these reasons, negative events may go unnoticed, and typically the recorded footage is only inspected after the fact.

A solution to these problems is the development of automatic multimodal (audio-visual) surveillance systems, which was the aim of this research thesis. Such systems should not take over the decisions of the operators, but should assist them in identifying unwanted behavior by notifying them when and where to focus. This is likely to reduce the number of events missed because of screen prioritization or external and internal distractions. It is important to note that such a system should not be limited to recognizing violence. It has been shown that negative emotions and stress may precede aggression, and recognizing them at an early stage is highly relevant, since taking appropriate measures in time can prevent a situation from escalating. Therefore, in this research thesis, besides a variety of manifestations of aggression, we have focused on automatically recognizing stress.

Our aim was to design and implement a surveillance system that is able to emulate human perception. For that reason, we asked people to annotate stress and aggression on audio-visual recordings and investigated several approaches to compute these annotations automatically. Recordings from real surveillance cameras are generally not available for privacy reasons, so we had to construct our own datasets. To ensure a high degree of realism as well as sufficient samples of stress and aggression, we designed scenarios and hired semi-professional actors to play them. The actors were free to improvise after receiving their roles and short scenario descriptions. We recorded stressful scenes at a service desk and aggression-related scenarios in a train and at a train station.

To automatically recognize stress and aggression levels, we extracted acoustic, linguistic and visual features, referred to as low-level features. Using classifiers, we trained models that predict the stress or aggression level of new data samples. One shortcoming of this approach is the semantic gap between the low-level features and the high-level stress and aggression assessment. We have contributed by bridging this semantic gap with semantically meaningful intermediate representations of the stress concept. The intermediate representation of stress consists of the degrees to which stress is conveyed by speech and by gestures, with respect to both the semantic message and the way in which it is expressed (e.g. intonation for speech; speed, rhythm and tension for gestures). Adding such a representation as an intermediate level in the stress recognition architecture improves the stress assessment, especially when the level of stress is high.
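To make the two-level idea concrete, the sketch below shows one possible way such an architecture could be wired up. It is a minimal illustration only, assuming scikit-learn, synthetic data, and multi-output random-forest regressors; the feature names, target scales, and learners are illustrative assumptions, not the models or annotations used in the thesis.

```python
# Minimal sketch of a two-level stress recognition architecture (assumed setup).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Low-level features per clip: e.g. acoustic, linguistic and visual
# descriptors concatenated into one vector (synthetic placeholder data).
X_low = rng.normal(size=(200, 30))

# Hypothetical annotations of the intermediate representation: the degree to
# which stress is conveyed by speech and by gestures (two targets, 0-1 scale).
y_intermediate = rng.uniform(size=(200, 2))

# Hypothetical annotations of the overall stress level (0-1 scale).
y_stress = rng.uniform(size=(200,))

# Level 1: low-level features -> intermediate semantic representation.
level1 = RandomForestRegressor(n_estimators=100, random_state=0)
level1.fit(X_low, y_intermediate)

# Level 2: intermediate representation -> overall stress assessment.
level2 = RandomForestRegressor(n_estimators=100, random_state=0)
level2.fit(level1.predict(X_low), y_stress)

def assess_stress(x_low: np.ndarray) -> float:
    """Predict the stress level of a new clip from its low-level features."""
    intermediate = level1.predict(x_low.reshape(1, -1))
    return float(level2.predict(intermediate)[0])

print(assess_stress(rng.normal(size=30)))
```

The point of the intermediate stage is that its outputs are human-interpretable quantities (how stressed the speech and the gestures sound and look), rather than raw feature values, which is what narrows the semantic gap.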
Having both audio and video makes it possible to construct a more complete representation of the scene. Multimodal fusion is expected to compensate for the shortcomings of the individual modalities (e.g. noise for audio, occlusion for video). Despite the expected benefits, fusing information coming from different modalities is challenging. Typical problems are that some pieces of information are apparent in only one modality (e.g. a verbal fight), and that multiple people in the scene can behave differently, which may lead to different assessments depending on where the focus lies. These problems can result in incongruent, or even contradictory, information from the different modalities, which makes arriving at the correct interpretation hard. To deal with the problem of fusing incongruent information, we have proposed and validated five meta-features: audio-focus, video-focus, context, semantics and history. The meta-features, together with the audio-only and video-only aggression assessments, form the intermediate level of the aggression recognition model. This novel approach significantly improved automatic aggression recognition by multimodal fusion.
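The following sketch illustrates, under stated assumptions, how the unimodal assessments and the five meta-features could be combined at the intermediate level. It uses scikit-learn, synthetic data, a binary aggression label, and a logistic-regression fusion step; all of these are illustrative choices and not the thesis implementation.

```python
# Minimal sketch of meta-feature-based multimodal fusion (assumed setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300

# Unimodal aggression assessments (0-1 scale) produced by separate
# audio-only and video-only models (synthetic placeholder scores).
audio_score = rng.uniform(size=(n, 1))
video_score = rng.uniform(size=(n, 1))

# Five meta-features, hypothetically encoded on a 0-1 scale:
# audio-focus, video-focus, context, semantics, history.
meta = rng.uniform(size=(n, 5))

# Intermediate level of the fusion model: unimodal scores + meta-features.
X_fusion = np.hstack([audio_score, video_score, meta])

# Hypothetical ground-truth labels (binary here for simplicity).
y_aggression = (rng.uniform(size=n) > 0.5).astype(int)

fusion = LogisticRegression(max_iter=1000)
fusion.fit(X_fusion, y_aggression)

# Fuse a new, incongruent observation: audio hears a verbal fight while
# video shows little motion; the meta-features tell the model which
# modality to trust.
x_new = np.array([[0.9, 0.2, 0.8, 0.3, 0.5, 0.7, 0.4]])
print(fusion.predict_proba(x_new)[0, 1])  # fused probability of aggression
```

The design choice this illustrates is that the fusion step does not see the raw audio or video features at all; it only weighs the two unimodal judgements against meta-information about where the relevant evidence is likely to be.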
