Abstract

Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
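
The sketch below gives a rough illustration of the cross-modal distillation idea: a frozen face "teacher" produces a soft emotion distribution for each speaking face-track, and a small audio "student" is trained to match it from the synchronised speech alone. The architecture, emotion count and names (SpeechEmotionStudent, distillation_loss, train_step) are illustrative PyTorch assumptions, not the authors' implementation.

    # Rough sketch of cross-modal distillation from a face teacher to a speech
    # student. All names and architectural choices here are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_EMOTIONS = 8  # e.g. the FERPlus emotion categories

    class SpeechEmotionStudent(nn.Module):
        """Toy audio encoder: spectrogram -> emotion logits."""
        def __init__(self, num_emotions=NUM_EMOTIONS):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.classifier = nn.Linear(64, num_emotions)

        def forward(self, spectrogram):  # spectrogram: (batch, 1, freq, time)
            return self.classifier(self.encoder(spectrogram))

    def distillation_loss(student_logits, teacher_probs):
        """Cross-entropy against the face teacher's soft emotion distribution."""
        log_p = F.log_softmax(student_logits, dim=-1)
        return -(teacher_probs * log_p).sum(dim=-1).mean()

    def train_step(student, optimizer, spectrogram, teacher_probs):
        """One update: the frozen face teacher has already produced
        teacher_probs for the face-track; the student sees only audio."""
        optimizer.zero_grad()
        loss = distillation_loss(student(spectrogram), teacher_probs)
        loss.backward()
        optimizer.step()
        return loss.item()

In this setup no manual audio labels are required: the only supervision is the teacher's posterior over emotion classes for the face frames that accompany each speech segment.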

Highlights

  • Despite recent advances in the field of speech emotion recognition, learning representations for natural speech segments that can be used efficiently under noisy and unconstrained conditions still represents a significant challenge

  • We make the following contributions: (i) we develop a strong model for facial emotion recognition, achieving state-of-the-art performance on the FERPlus benchmark; (ii) we use this computer vision model to label face emotions in the VoxCeleb [50] video dataset – a large-scale dataset of emotion-unlabelled speaking face-tracks obtained in the wild; (iii) we transfer supervision across modalities from faces to speech, and train a speech emotion recognition model using speaking face-tracks; and (iv) we demonstrate that the resulting speech model is capable of classifying emotion on two external datasets

  • Challenges associated with emotion distillation: one of the key challenges of the proposed method is obtaining a consistent, high-quality supervisory signal from the teacher network during distillation; one simple way to stabilise this signal is sketched after this list
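
A hedged sketch of one way to make the teacher signal more consistent, assuming the face teacher emits per-frame emotion posteriors for each speaking face-track: average those posteriors over the track and discard tracks where the teacher is not confident. The helper name and confidence threshold below are illustrative assumptions, not necessarily the paper's exact recipe.

    # Stabilising the teacher signal over a face-track (illustrative only).
    import torch
    import torch.nn.functional as F

    def aggregate_teacher_labels(frame_logits, confidence_threshold=0.5):
        """frame_logits: (num_frames, num_emotions) raw face-teacher outputs.
        Returns one soft label for the track, or None if the track is ambiguous."""
        frame_probs = F.softmax(frame_logits, dim=-1)   # per-frame posteriors
        track_probs = frame_probs.mean(dim=0)           # average over the track
        if track_probs.max().item() < confidence_threshold:
            return None                                 # discard ambiguous tracks
        return track_probs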


Summary

Introduction

Despite recent advances in the field of speech emotion recognition, learning representations for natural speech segments that can be used efficiently under noisy and unconstrained conditions still represents a significant challenge. Obtaining large, labelled human emotion datasets ‘in the wild’ is hindered by a number of difficulties. Since labelling naturalistic speech segments is extremely expensive, most datasets consist of elicited or acted speech. As a consequence of the subjective nature of emotions, labelled datasets often suffer from low human annotator agreement, as well as the use of varied labelling schemes (i.e., dimensional or categorical) which can require careful alignment [46]. Cost and time prohibitions often result in datasets with low speaker diversity, making it difficult to avoid speaker adaptation. Supervised techniques trained on such datasets often demonstrate high accuracy for only intra-corpus data, with a natural propensity to overfit [42].


