Abstract

The emotional state of speech has a great impact on the performance of speaker recognition (SR) systems. Many researchers focus on mapping speech with different emotions to an emotion-invariant embedding, which reduces the diversity of the data. This paper proposes a new emotion embedding framework with a self-attention mechanism for speaker recognition. First, several deep neural networks (DNNs) are trained to classify speakers in different emotional states, serving as emotion embedding extractors during the development phase. Then, at the enrollment stage, these pre-trained models are used to extend the neutral enrollment speech into emotion embeddings. To make the final speaker embedding more representative, the classification model is trained with a self-attention mechanism along the emotion dimension, so that the framework can automatically assign weights to the emotion embeddings. Experiments were carried out on both the Mandarin Affective Speech Corpus (MASC) and the Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D). The results show that the proposed method achieves the best Identification Rate (IR) and Equal Error Rate (EER), namely 59.14% and 15.79% on MASC and 75.98% and 8.14% on CREMA-D, compared with state-of-the-art methods. In addition, cross-database experiments further demonstrate the practicality of the method in real-world scenarios.
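To illustrate how emotion embeddings might be weighted and pooled into a single speaker embedding, the following is a minimal sketch in PyTorch. It assumes additive self-attention over the emotion dimension; the module name, layer sizes, and number of emotion extractors are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class EmotionAttentivePooling(nn.Module):
        """Illustrative sketch: attention over per-emotion embeddings,
        producing one pooled speaker embedding."""
        def __init__(self, embed_dim: int, attn_dim: int = 128):
            super().__init__()
            self.proj = nn.Linear(embed_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, emotion_embeddings: torch.Tensor) -> torch.Tensor:
            # emotion_embeddings: (batch, num_emotions, embed_dim),
            # one embedding per pre-trained emotion-specific extractor.
            scores = self.score(torch.tanh(self.proj(emotion_embeddings)))  # (B, E, 1)
            weights = torch.softmax(scores, dim=1)  # attention weights over emotions
            return (weights * emotion_embeddings).sum(dim=1)  # (B, embed_dim)

    # Example usage (hypothetical sizes): 5 emotion extractors, 256-dim embeddings
    # computed from one neutral enrollment utterance.
    pooling = EmotionAttentivePooling(embed_dim=256)
    speaker_embedding = pooling(torch.randn(1, 5, 256))

In such a scheme, the attention weights let the framework decide how much each emotion-specific embedding should contribute to the final speaker representation, rather than averaging them uniformly.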
