Abstract

Achieving robust cross-context speech emotion recognition (SER) has become a critical next direction of research for the wide adoption of SER technology. The core challenge lies in the large variability of affective speech, which is highly contextualized. Prior work has framed this as a transfer learning problem, focusing mostly on developing domain adaptation strategies. However, many existing speech emotion corpora, even those considered large-scale, are still limited in size, resulting in unsatisfactory transfer performance. On the other hand, directly collecting a context-specific corpus often yields even less data, leading to inevitably non-robust accuracy. To mitigate this issue, we propose enhancing the affect-related variability of the in-context acoustic latent representation by integrating out-of-context emotion data. Specifically, we use an adversarial autoencoder network as our backbone, with multiple out-of-context emotion labels derived for each in-context sample serving as auxiliary constraints on the latent representation. We extensively evaluate our framework on three in-context databases paired with three out-of-context databases. We demonstrate not only improved recognition accuracy but also provide a comprehensive analysis of the effectiveness of this representation learning strategy.
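
To make the architecture concrete, below is a minimal PyTorch sketch of an adversarial autoencoder whose latent code is jointly constrained by the in-context emotion labels and by auxiliary labels derived from out-of-context corpora, as the abstract describes. This is not the authors' released code: the feature and latent dimensions, the number of auxiliary heads, the loss weighting, and the pipeline that produces the out-of-context labels are all illustrative assumptions.

```python
# Hedged sketch of the described framework; all sizes and names are
# hypothetical placeholders, not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, LATENT_DIM, NUM_EMOTIONS = 128, 32, 4   # illustrative sizes
NUM_AUX = 3  # one auxiliary head per out-of-context corpus (assumption)

def mlp(d_in, d_out, d_hid=64):
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                         nn.Linear(d_hid, d_out))

class AAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = mlp(FEAT_DIM, LATENT_DIM)      # acoustic features -> latent z
        self.dec = mlp(LATENT_DIM, FEAT_DIM)      # z -> reconstructed features
        self.disc = mlp(LATENT_DIM, 1)            # prior sample vs. encoded z
        self.clf = mlp(LATENT_DIM, NUM_EMOTIONS)  # in-context emotion classifier
        # one auxiliary classifier per set of out-of-context labels
        self.aux = nn.ModuleList(mlp(LATENT_DIM, NUM_EMOTIONS)
                                 for _ in range(NUM_AUX))

def step(model, x, y, y_aux, lam=0.1):
    """One encoder/decoder training step. y_aux is a list of NUM_AUX label
    tensors derived for each in-context sample by models trained on
    out-of-context corpora (a hypothetical labeling pipeline)."""
    z = model.enc(x)
    rec = F.mse_loss(model.dec(z), x)                   # reconstruction
    cls = F.cross_entropy(model.clf(z), y)              # in-context labels
    aux = sum(F.cross_entropy(h(z), t)                  # auxiliary constraint
              for h, t in zip(model.aux, y_aux)) / NUM_AUX
    # adversarial term: encoder tries to make z indistinguishable from the
    # prior (the discriminator's own update is elided for brevity)
    adv = F.binary_cross_entropy_with_logits(
        model.disc(z), torch.ones(len(x), 1))
    return rec + cls + lam * (aux + adv)

# usage with random stand-in data
model = AAE()
x = torch.randn(8, FEAT_DIM)
y = torch.randint(0, NUM_EMOTIONS, (8,))
y_aux = [torch.randint(0, NUM_EMOTIONS, (8,)) for _ in range(NUM_AUX)]
loss = step(model, x, y, y_aux)
loss.backward()
```

In this reading of the abstract, the auxiliary heads act as soft regularizers: they inject affect-related variability from the out-of-context corpora into the latent space without requiring those corpora to share the in-context label scheme exactly.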
