Abstract
Emotional conditions cause changes in the speech production system, which alter the acoustic characteristics of speech relative to neutral conditions. The presence of emotion degrades the performance of a speaker verification system. In this paper, we propose a speaker modeling approach that accommodates the presence of emotion in speech segments by extracting a compact speaker representation. The speaker model is estimated following a procedure similar to the i-vector technique, but it treats the emotional effect as the channel variability component. We name this method emotional variability analysis (EVA). EVA represents the emotion subspace separately from the speaker subspace, as in the joint factor analysis (JFA) model. The effectiveness of the proposed system is evaluated by comparing it with the standard i-vector system on the speaker verification task of the Speech Under Simulated and Actual Stress (SUSAS) dataset with three different scoring methods. The evaluation focuses on the equal error rate (EER). In addition, we conducted an ablation study for a more comprehensive analysis of the EVA-based i-vector. Based on the experimental results, the proposed system outperforms the standard i-vector system and achieves state-of-the-art results on the verification task for speakers under stress.
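The exact EVA formulation is not reproduced in this summary. As a rough sketch in the standard supervector notation (the symbols below are assumptions, not taken from the paper), the i-vector and JFA decompositions of an utterance supervector $\mathbf{M}$ about the UBM mean supervector $\mathbf{m}$ are commonly written as

$\mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w}$ (i-vector: a single total-variability subspace $\mathbf{T}$ with latent factor $\mathbf{w}$)

$\mathbf{M} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z}$ (JFA: speaker subspace $\mathbf{V}$, channel subspace $\mathbf{U}$, residual $\mathbf{D}$)

and the abstract suggests an EVA-style reading in which the channel term is replaced by an emotion subspace, e.g.

$\mathbf{M} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}_{e}\mathbf{x}_{e}$ ($\mathbf{U}_{e}$ spans the emotion subspace, $\mathbf{x}_{e}$ the emotion factors),

where the last equation is only an interpretation of the abstract, not the paper's exact model.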
Highlights
Speaker verification is the process of accepting or rejecting the identity claim of a speaker [1]. This system is commonly used in applications that use the voice for identity confirmation, known as biometrics, in natural language technologies [2], or as a pre-processing stage of speaker-dependent systems such as conversational algorithms [3,4]
The Mahalanobis distance scoring (MDS) measures the correlation between variables by assuming an anisotropic Gaussian distribution rather than an isotropic one (see the sketch after these highlights)
Since emotional variability analysis (EVA) compensates for emotion, some correlation exists between the emotion and the speaker supervector
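The paper's exact covariance estimate for MDS is not given in this summary; the following is a minimal Python sketch of Mahalanobis distance scoring between two i-vectors, assuming a within-class covariance matrix estimated from development i-vectors. The function and variable names (e.g. mahalanobis_score, within_cov) are illustrative, not from the paper.

```python
import numpy as np

def mahalanobis_score(w_enroll, w_test, within_cov):
    """Mahalanobis distance scoring (MDS) between two i-vectors.

    Unlike cosine scoring, the difference vector is weighted by the inverse
    of a covariance matrix, i.e. an anisotropic Gaussian assumption.
    """
    diff = w_enroll - w_test
    inv_cov = np.linalg.inv(within_cov)    # precision matrix
    dist = np.sqrt(diff @ inv_cov @ diff)  # Mahalanobis distance
    return -dist                           # higher score means more similar

# Illustrative usage with random vectors standing in for real i-vectors.
rng = np.random.default_rng(0)
dim = 400                                  # typical i-vector dimensionality
dev = rng.standard_normal((1000, dim))     # development-set i-vectors
within_cov = np.cov(dev, rowvar=False)     # covariance estimated from dev data
w_enroll = rng.standard_normal(dim)
w_test = rng.standard_normal(dim)
print(mahalanobis_score(w_enroll, w_test, within_cov))
```

In practice the covariance would be estimated from length-normalized development i-vectors rather than random data, and the resulting score is compared against a threshold to accept or reject the identity claim.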
Summary
Speaker verification is the process of accepting or rejecting the identity claim of a speaker [1]. This system is commonly used in applications that use the voice for identity confirmation, known as biometrics, in natural language technologies [2], or as a pre-processing stage of speaker-dependent systems such as conversational algorithms [3,4]. Many methods have been explored for the verification task [5], but only a little work has examined the effects of emotional conditions on speech characteristics. Emotional conditions (especially stress) are a crucial factor that strongly affects the characteristics of the voice