Abstract

Studies have shown that emotional variability in speech degrades the performance of speaker recognition tasks. Of particular interest is the error produced by the mismatch between training speaker recognition models on neutral speech and testing them on expressive speech. While previous studies have considered categorical emotions, expressive speech during human interaction conveys subtle behaviors that are better characterized with continuous descriptors (e.g., attributes such as arousal, valence, and dominance). As the emotional content becomes more intense, we expect the performance of speaker recognition tasks to drop. Can we define emotional regions for which speaker recognition performance is expected to be reliable? This study focuses on automatically predicting reliable regions for speaker recognition by analyzing and predicting the emotional content. We collected a unique emotional database from 80 speakers. We estimate speaker recognition performance as a function of arousal and valence, creating regions in this space where we can reliably recognize the identity of a speaker. Then, we train speech emotion recognizers designed to predict whether the emotional content in a sentence is within the reliable region. The experimental evaluation demonstrates that sentences classified as reliable for speaker recognition tasks have a lower equal error rate (EER) than sentences considered unreliable.

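To make the evaluation protocol concrete, the sketch below illustrates the general idea of gating verification trials by where a sentence falls in arousal-valence space and comparing EER inside versus outside the reliable region. Everything here is an assumption for illustration: the region boundaries (arousal_max, valence_min), the function names, and the randomly generated scores and emotion attributes are placeholders, not the paper's actual model, thresholds, or data.

```python
import numpy as np


def in_reliable_region(arousal, valence, arousal_max=0.6, valence_min=0.4):
    """Return True if the (arousal, valence) point lies inside an assumed
    'reliable' region for speaker recognition. Thresholds are illustrative
    placeholders, not values reported in the paper."""
    return arousal <= arousal_max and valence >= valence_min


def equal_error_rate(scores, labels):
    """Compute the EER of a verification system from trial scores
    (higher = more likely same speaker) and labels (1 = target, 0 = impostor)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)
    # False acceptance rate: impostor trials accepted at each threshold.
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    # False rejection rate: target trials rejected at each threshold.
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Placeholder trial data; in practice, arousal/valence would come from a
# speech emotion recognizer and scores from a speaker verification backend.
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
labels = rng.integers(0, 2, size=1000)
arousal = rng.random(1000)
valence = rng.random(1000)

reliable = np.array([in_reliable_region(a, v) for a, v in zip(arousal, valence)])
print(f"EER (reliable region):   {equal_error_rate(scores[reliable], labels[reliable]):.3f}")
print(f"EER (unreliable region): {equal_error_rate(scores[~reliable], labels[~reliable]):.3f}")
```

With real verification scores and predicted emotional attributes, the expectation described in the abstract is that the first figure (trials gated into the reliable region) comes out lower than the second.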