Abstract

As a highly active topic in computational paralinguistics, speech emotion recognition (SER) aims to explore ideal representations for emotional factors in speech. In order to improve the performance of SER, multiple kernel learning (MKL) dimensionality reduction has been utilized to obtain effective information for recognizing emotions. However, the solution of MKL usually provides only one nonnegative mapping direction for multiple kernels, which may lead to the loss of valuable information. To address this issue, we propose a two-dimensional framework for multiple kernel subspace learning. This framework provides more linear combinations on the basis of MKL without nonnegative constraints, preserving more information during the learning procedure. It also leverages both MKL and two-dimensional subspace learning, combining them into a unified structure. To apply the framework to SER, we further propose an algorithm, namely generalised multiple kernel discriminant analysis (GMKDA), by employing discriminant embedding graphs in this framework. GMKDA takes advantage of the additional mapping directions for multiple kernels in the proposed framework. In order to evaluate the performance of the proposed algorithm, a wide range of experiments was carried out on several key emotional corpora. The experimental results demonstrate that the proposed methods can achieve better performance than several conventional and subspace learning methods in dealing with SER.
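To make the multiple-kernel discriminant idea concrete, the sketch below combines two base RBF kernels with fixed weights and then solves a kernel Fisher discriminant on the combined kernel via a generalized eigenproblem. This is an illustrative simplification, not the paper's GMKDA: the kernel weights `betas`, the toy data, and the regularization constant are all assumptions, and the formulation shown is the classical single-direction kernel Fisher discriminant that GMKDA generalises.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, Y, gamma):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# Toy two-class data standing in for utterance-level feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Combine multiple base kernels with fixed (hypothetical) weights;
# in MKL these weights would themselves be learned.
betas = np.array([0.7, 0.3])
kernels = [rbf_kernel(X, X, g) for g in (0.1, 1.0)]
K = sum(b * Km for b, Km in zip(betas, kernels))

# Kernel Fisher discriminant: build between-class (M) and within-class (N)
# scatter matrices in the implicit feature space.
n = len(y)
M = np.zeros((n, n))
N = np.eye(n) * 1e-3          # small ridge keeps N positive definite
m_all = K.mean(axis=1)
for c in (0, 1):
    idx = np.where(y == c)[0]
    l = len(idx)
    m_c = K[:, idx].mean(axis=1)
    M += l * np.outer(m_c - m_all, m_c - m_all)
    Kc = K[:, idx]
    N += Kc @ (np.eye(l) - np.ones((l, l)) / l) @ Kc.T

# The leading generalized eigenvector gives the discriminant coefficients.
vals, vecs = eigh(M, N)
alpha = vecs[:, -1]
proj = K @ alpha              # 1-D discriminant projection of the data
```

A nonnegativity constraint on the mapping direction, as in standard MKL solutions, would restrict `alpha`; dropping it, as the proposed framework does, admits more linear combinations of the kernel columns.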

Highlights

  • Speech Emotion Recognition (SER), a core area of research within computational paralinguistics [1], focuses on exploiting abstract representations of speech for the classification or prediction of a range of human affect behaviours [2]–[8]

  • Results indicate that the supervised methods (LDA / Fisher Discriminant Analysis (FDA), Linear Discriminant Projections (LDP), Locally Discriminant Embedding (LDE), Graph-based Fisher Analysis (GbFA), etc.) generally perform better than the unsupervised methods (PCA, Locality Preserving Projections (LPP), Locally Linear Embedding (LLE), Isomap, etc.)

  • The choice of γ appears to be database dependent; the fluctuations of Unweighted Accuracy (UA) with changes in the parameter γ remain relatively stable on the GEneva Multimodal Emotion Portrayals (GEMEP) and Airplane Behavior Corpus (ABC) corpora


Introduction

Speech Emotion Recognition (SER), a core area of research within computational paralinguistics [1], focuses on exploiting abstract representations of speech for the classification or prediction of a range of human affect behaviours [2]–[8]. The use of functionals, to generate a single high-dimensional representation of an utterance from a set of underlying low-level acoustic descriptors, is generally regarded as the default method for capturing paralinguistic information in speech [1]. Previous investigations consistently demonstrate the usefulness of this technique when applied to a range of different SER problems [14], [16], [17]. An utterance-level feature vector should be representative of all categories of information, both linguistic and paralinguistic, present within the particular utterance being modelled.
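The functional-based representation described above can be sketched in a few lines: statistical functionals collapse each variable-length low-level descriptor (LLD) contour into fixed summary statistics, yielding one utterance-level vector. The frame matrix, the four descriptors, and the particular set of five functionals below are illustrative assumptions, not the feature set used in the paper.

```python
import numpy as np

# Hypothetical frame-level LLDs for one utterance:
# rows = frames, columns = e.g. pitch, energy, and two MFCCs.
rng = np.random.default_rng(1)
lld = rng.normal(size=(120, 4))  # 120 frames, 4 descriptors

def apply_functionals(frames):
    """Collapse a variable-length (n_frames, n_lld) matrix into one
    fixed-length utterance vector by applying statistical functionals
    to each descriptor contour."""
    funcs = [np.mean, np.std, np.min, np.max,
             lambda x, axis: np.percentile(x, 50, axis=axis)]
    return np.concatenate([f(frames, axis=0) for f in funcs])

vec = apply_functionals(lld)  # length = 4 LLDs x 5 functionals = 20
```

Whatever the number of frames, the output length is fixed by the number of descriptors times the number of functionals, which is what makes the representation suitable for standard classifiers.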
