Abstract

The automatic analysis of affect is a relatively new and challenging multidisciplinary research area that has gained considerable interest over the past few years. The research and development of affect recognition systems has opened many opportunities for improving the interaction between man and machine. Although affect can be expressed through multimodal means such as hand gestures, facial expressions, and body postures, this dissertation focuses on speech (i.e., vocal expressions) as the main carrier of affect. Speech carries a great deal of ‘hidden’ information: from the voice alone, humans can often guess who is speaking, what language (or accent or dialect) is being spoken, how old the speaker is, and so on. The goal of automatic speech recognition (ASR) is to recognize what is said; in automatic speech-based emotion recognition, the goal is to recognize how something is said. This work describes several experiments that were carried out to investigate how affect can be automatically recognized in speech.

One of the first steps in developing speech-based affect recognizers is finding a spontaneous speech corpus that is labeled with emotions. The machine learning techniques that are often used to build these recognizers require such data to learn how to associate specific speech features (e.g., pitch, energy) with certain emotions. However, collecting and labeling real affective speech data has proven to be difficult; efforts to collect affective speech data in the field are described in this work. As an alternative, speech corpora that contain acted emotional speech (actors are asked to portray certain emotions) have often been used. The advantages of these corpora are that the recording conditions can be controlled, the portrayed emotions can be clearly associated with an emotion label, the costs and effort required to collect such corpora are relatively low, and the recordings are usually made available to the research community.

In this work, an acted emotional speech corpus (containing basic, universal emotions such as Anger, Boredom, Disgust, Fear, Happiness, Neutral, and Sadness) was used to explore and apply recognition techniques and evaluation frameworks, adopted from similar research areas such as automatic speaker and language recognition, to automatic emotion recognition. Recognizers were evaluated in a detection framework, and an evaluation for handling so-called ‘out-of-set’ emotions (unknown emotions that were not present in the training data but which can occur in real-life situations) was presented. Partly due to a lack of standardization and shared databases, the evaluation of affect recognizers remains somewhat problematic; while evaluation is an important aspect of development, it has been a relatively underexposed topic of investigation in the emotion research community.

The main objections against the use of acted emotional speech corpora are that the expressions are not ‘real’ but rather portrayals of prototypical emotions (and hence expressed rather exaggeratedly), and that the emotions portrayed do not often occur in real-life situations. Therefore, spontaneous data was also used in this work, and methods were developed to recognize spontaneous vocal expressions of affect, such as laughter. The task of the laughter detector was to recognize audible laughter in meeting speech data.
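As a rough, hypothetical illustration of such a detector (not the exact system developed in this dissertation), a minimal GMM-based laughter-versus-speech classifier over MFCC features might be sketched as follows; the feature set, model sizes, and decision rule here are assumptions for the sake of the example.

```python
# Minimal sketch of a GMM-based laughter/speech detector (illustrative only).
# Assumes pre-segmented audio clips labeled as 'laughter' or 'speech'.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Extract frame-level MFCC features from an audio file."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # shape: (frames, n_mfcc)

def train_gmm(paths, n_components=32):
    """Fit one GMM on all frames of one class (laughter or speech)."""
    frames = np.vstack([mfcc_features(p) for p in paths])
    return GaussianMixture(n_components=n_components, covariance_type='diag').fit(frames)

def laughter_score(path, gmm_laughter, gmm_speech):
    """Average log-likelihood ratio; higher values are more laughter-like."""
    x = mfcc_features(path)
    return gmm_laughter.score_samples(x).mean() - gmm_speech.score_samples(x).mean()

# Usage: decide 'laughter' if laughter_score(clip, gmm_l, gmm_s) > threshold,
# where the threshold is tuned on held-out data to trade off misses and false alarms.
```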
Using a combination of Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), and a combination of prosodic and spectral speech features, relatively low error rates between 3% and 12% were achieved. Although the detector did not interpret the affective meaning of the laughter, the detection of laughter alone was informative enough. Part of these findings was used to build a so-called ‘Affective Mirror’ that successfully elicited and recognized laughter with different user groups.

Other speech phenomena related to vocal expressions of affect, also in the context of meeting speech data, are expressions of opinions and sentiments. In this work, it was assumed that opinions are expressed differently from factual statements in terms of tone of voice and the words used. Classification experiments were carried out to find the best combination of lexical and prosodic features for the discrimination between subjective and non-subjective clauses. As lexical features, word-level, phone-level, and character-level n-grams were used. The experiments showed that a combination of all features yields the best performance, and that the prosodic features were the weakest of all features investigated. In addition, a second task was formulated, namely the discrimination between positive and negative subjective clauses, for which similar results were found. The relatively high error rates for both tasks (Cdet = 26%–30%) indicate that these are more difficult recognition problems than laughter detection: the relation between prosodic and lexical features on the one hand, and subjectivity and polarity (i.e., positive vs. negative) on the other, is not as clear as in the case of laughter. A sketch of this kind of lexical-prosodic feature fusion is given further below.

As an intermediate between real affective expressions and acted expressions, elicited affective expressions were employed in this dissertation in several human perception and classification experiments. To this end, a multimodal corpus with elicited affect was recorded. Affective vocal and facial expressions were elicited via a multiplayer first-person shooter video game (Unreal Tournament) that was manipulated by the experimenter. These expressions were captured with close-talk microphones and high-quality webcams, and were afterwards rated by the players themselves on Arousal (active-passive) and Valence (positive-negative) scales.

After post-processing, perception and classification experiments were carried out on these data. The first experiment with this unique kind of data tried to answer the question of how the level of agreement between observers on the perceived emotion is affected when audio-only, video-only, audiovisual, or audiovisual-plus-context clips containing affective expressions are shown. The observers were asked to rate each clip on Arousal and Valence scales. The results showed that the agreement among human observers was highest when audiovisual clips were shown. Furthermore, the observers reached higher agreement on Valence judgments than on Arousal judgments. Additionally, the results indicated that the ‘self’-ratings of the gamers themselves differed somewhat from the ‘observed’-ratings of the human observers. This finding was further investigated in a second experiment, in which six raters re-annotated a substantial part of the corpus. The results confirmed that there is a discrepancy between what the ‘self’-raters (i.e., the gamers themselves) experienced or felt and what observers perceive based on the gamers’ vocal and facial expressions.
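Returning to the subjectivity experiments mentioned above, the following is a minimal, hypothetical sketch of early fusion of lexical n-gram features and prosodic features for subjective versus non-subjective clause classification; the toy clauses, prosodic feature choices, and classifier settings are assumptions, not the dissertation's exact setup, and phone-level n-grams (which would require a phone recognizer) are omitted.

```python
# Sketch of lexical + prosodic feature fusion for subjectivity classification
# (illustrative only; features, data, and settings are assumptions).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

clauses = ["i think this is great", "the meeting starts at nine"]   # toy examples
prosody = np.array([[210.0, 0.71], [180.0, 0.40]])                  # e.g., mean pitch (Hz), mean energy
labels  = [1, 0]                                                    # 1 = subjective, 0 = non-subjective

# Lexical features: word-level and character-level n-grams, concatenated.
word_ngrams = CountVectorizer(analyzer='word', ngram_range=(1, 2))
char_ngrams = CountVectorizer(analyzer='char', ngram_range=(2, 4))
X_lex = np.hstack([word_ngrams.fit_transform(clauses).toarray(),
                   char_ngrams.fit_transform(clauses).toarray()])

# Early fusion: append the prosodic features to the lexical feature vector.
X = np.hstack([X_lex, prosody])

clf = LinearSVC().fit(X, labels)
print(clf.predict(X))   # sanity check on the toy training data
```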
This discrepancy between felt and perceived affect has consequences for the development of automatic affect analyzers that use these ratings: the goal of an affect analyzer can be to recognize ‘felt’ affect, or to recognize ‘observed/perceived’ affect. Two different types of speech-based affect recognizers were therefore developed in parallel to recognize either ‘felt’ or ‘perceived’ affect on continuous Arousal and Valence scales. The results showed that ‘felt’ emotions are much harder to predict than ‘perceived’ emotions. Although these recognizers performed only moderately from a classification perspective, they did not perform badly in comparison to human performance. The recognizers developed depend strongly on how the affect data is rated by humans; if these ratings reflect only moderately consistent human judgments of affect, then it can be difficult for the machine to perform well (in an absolute sense).

The work presented in this dissertation shows that the automatic recognition of affect in speech is complicated by the fact that real affect, as encountered in real-life situations, is a very complex phenomenon that sometimes cannot be described straightforwardly in ways that are useful for computer scientists who would like to build affect recognizers. The use of real affect data has led to the development of recognizers that are targeted more toward affect-related expressions; laughter and subjectivity are examples of such affect-related expressions. The Arousal and Valence descriptors offer a convenient way to describe the meaning of these affective expressions. The relatively high error rates obtained for Arousal and Valence prediction suggest that the acoustic correlates used in this research only partly capture the characteristics of real affective speech. The search for stronger acoustic correlates or vocal profiles for specific emotions continues; this search is partly complicated by the ‘noise’ that comes with real affect, which remains a challenge for the research community working toward automatic affect analyzers.
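To make the Arousal and Valence prediction task concrete, the following is a minimal sketch of a speech-based regressor on continuous Arousal and Valence scales; the utterance-level acoustic statistics, the toy data, and the regression model are illustrative assumptions, not the exact recognizers developed in this work.

```python
# Sketch of a speech-based Arousal/Valence regressor (illustrative assumptions).
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# Utterance-level acoustic statistics, e.g. [pitch mean, pitch std, energy mean, speech rate].
X_train = np.array([[220.0, 35.0, 0.62, 4.1],
                    [150.0, 12.0, 0.31, 2.8],
                    [190.0, 28.0, 0.55, 3.6]])
# Continuous labels on Arousal and Valence scales in [-1, 1], taken either from
# 'self'-ratings (felt affect) or 'observed'-ratings (perceived affect).
y_train = np.array([[ 0.8,  0.6],
                    [-0.4, -0.2],
                    [ 0.3,  0.1]])

model = MultiOutputRegressor(SVR(kernel='rbf', C=1.0)).fit(X_train, y_train)
print(model.predict([[200.0, 30.0, 0.5, 3.9]]))  # predicted [arousal, valence]
```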
