Abstract

The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems, and the like.
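
As a rough sketch of two of the building blocks described above (gender-aware UBM-MAP adaptation on the audio side, and combination of the two modalities at the matching-score level), consider the Python fragment below. The means-only Reynolds-style adaptation, the relevance factor, the min-max normalization, and the weighted-sum fusion rule are common choices assumed here for illustration; the paper's exact variants may differ, and all names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, feats: np.ndarray, r: float = 16.0):
    """MAP-adapt the UBM component means to one utterance (means-only,
    Reynolds-style adaptation); `r` is the relevance factor."""
    gamma = ubm.predict_proba(feats)                # (T, K) responsibilities
    n = gamma.sum(axis=0)                           # soft count per component
    e = gamma.T @ feats / np.maximum(n, 1e-10)[:, None]  # first-order stats
    alpha = (n / (n + r))[:, None]                  # data-dependent weights
    return alpha * e + (1.0 - alpha) * ubm.means_   # adapted means (K, D)

def fuse_scores(audio_scores, video_scores, w=0.5):
    """Weighted-sum fusion of min-max-normalized per-class matching scores."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return w * norm(audio_scores) + (1.0 - w) * norm(video_scores)

# Hypothetical usage: pick the UBM matching the speaker's gender, adapt it
# to the test utterance's features, score, and fuse with the video scores.
# ubm = male_ubm if speaker_is_male else female_ubm
# adapted_means = map_adapt_means(ubm, mfcc_features)
```

In a full system, the per-class audio scores would come from emotion-specific adapted GMMs and the video scores from the image-set matching stage; the weighted sum is only one of several standard score-level fusion rules.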

Highlights

  • Augmenting humanoid robotic systems with emotion recognition capabilities has recently attracted a lot of attention from both the speech and computer vision communities

  • In this paper we build upon our work presented in [1, 2] and present a novel multi-modal emotion recognition system exploiting video and audio information

  • In the paper we presented a multi-modal emotion recognition system



Introduction

Augmenting humanoid robotic systems with emotion recognition capabilities has recently attracted a lot of attention from both the speech and computer vision communities. The proposed system processes each source of information separately and combines the results at the matching-score level. Both the video- and audio-processing parts of the system are implemented using novel approaches that improve upon existing methods from the literature. Among existing video-based techniques are region-based approaches, where facial motion is first measured on certain regions of the face, such as the eye or mouth region, and then exploited for emotion recognition. Such methods require the detection and tracking of specific facial landmarks throughout the entire length of the image or video sequence and are, due to the difficulty of this task, prone to error [1]. The proposed procedure, in contrast, relies solely on the facial region as a whole, which can be robustly and efficiently extracted from video data using existing face detection techniques, such as the Viola-Jones face detector [6].
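
As an illustration of how the whole facial region could be extracted without landmark tracking, the sketch below uses OpenCV's Haar-cascade implementation of the Viola-Jones detector [6]; the file name, detector parameters, and crop size are illustrative assumptions rather than the paper's actual settings.

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (Viola-Jones detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

video = cv2.VideoCapture("utterance.avi")  # hypothetical input clip
face_crops = []                            # the image set for set matching
while True:
    ok, frame = video.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces[:1]:         # keep at most one face per frame
        face_crops.append(cv2.resize(gray[y:y + h, x:x + w], (64, 64)))
video.release()
# `face_crops` now holds whole-face crops for the clip; an image-set matching
# method can compare this set against per-emotion galleries, with no landmark
# detection or tracking involved.
```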

