Speaker recognition is speaker dependent and susceptible to emotional factors, which typically degrade recognition performance. We propose a framework that combines generative adversarial networks with speaker recognition to generate additional speaker-related emotional training speech features, enhancing robustness under different emotional conditions. In this framework, a new speaker emotion-converted generative adversarial network (SEC-GAN) is developed for speaker recognition. Given the neutral speech of the target speaker, SEC-GAN learns to generate speech features in other emotions from the neutral speech while retaining the speaker's identity. In addition, a new loss function is designed to preserve the speaker's internal information during feature reconstruction, and an emotion discriminator is introduced to classify the emotion of the speech features, improving the quality of emotion generation. Using the original neutral data and the generated training data from native speakers in the Mandarin Affective Speech Corpus (MASC), our framework reduces the negative impact of emotion mismatch between enrollment and test speech. This strategy addresses a common real-world problem: most voice-controlled devices enroll a user's calm speech but fail to recognize the user's identity when the user speaks in another emotional state. On MASC, our framework achieves an accuracy of 57.59%, an improvement of 8.27% over the VGG baseline and 5.62% over x-vector. It also outperforms the existing state-of-the-art ECAPA-TDNN and other comparison methods.
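To make the generator objective concrete, the sketch below shows one plausible, PyTorch-style combination of an adversarial term, an auxiliary emotion-classification term from the emotion discriminator, and a speaker-identity preservation term. The abstract does not specify the actual SEC-GAN architectures, loss weights, or feature dimensions, so every module, dimension, and hyperparameter here (e.g. `feat_dim`, `lambda_emo`, `lambda_id`, the stand-in `spk_encoder`) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    """Maps a neutral speech feature plus a target-emotion code to an
    emotional speech feature of the same speaker (toy MLP stand-in)."""
    def __init__(self, feat_dim=80, num_emotions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_emotions, 256),
            nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, neutral_feat, emotion_onehot):
        return self.net(torch.cat([neutral_feat, emotion_onehot], dim=-1))


class Discriminator(nn.Module):
    """Real/generated discriminator with an auxiliary emotion classifier,
    mirroring the abstract's emotion discriminator in a simplified form."""
    def __init__(self, feat_dim=80, num_emotions=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.adv_head = nn.Linear(256, 1)              # real vs. generated
        self.emo_head = nn.Linear(256, num_emotions)   # emotion class

    def forward(self, feat):
        h = self.backbone(feat)
        return self.adv_head(h), self.emo_head(h)


def generator_loss(disc, spk_encoder, neutral_feat, emotion_label, fake_feat,
                   lambda_emo=1.0, lambda_id=10.0):
    """Adversarial + emotion-classification + identity-preservation loss.
    The identity term is approximated here as an embedding distance between
    the generated and the source neutral feature."""
    adv_logit, emo_logit = disc(fake_feat)
    loss_adv = F.binary_cross_entropy_with_logits(
        adv_logit, torch.ones_like(adv_logit))          # fool the discriminator
    loss_emo = F.cross_entropy(emo_logit, emotion_label)  # hit the target emotion
    loss_id = F.mse_loss(spk_encoder(fake_feat),
                         spk_encoder(neutral_feat))       # keep speaker identity
    return loss_adv + lambda_emo * loss_emo + lambda_id * loss_id


# Toy usage with random features and a stand-in speaker embedding network.
G, D = Generator(), Discriminator()
spk_enc = nn.Sequential(nn.Linear(80, 128))
neutral = torch.randn(4, 80)
target_emotion = torch.tensor([1, 2, 3, 4])
emotion_onehot = F.one_hot(target_emotion, num_classes=5).float()
fake = G(neutral, emotion_onehot)
loss = generator_loss(D, spk_enc, neutral, target_emotion, fake)
```

The generated emotional features would then be pooled with the original neutral features to train the speaker recognition back end, which is how the abstract describes mitigating enrollment/test emotion mismatch.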