Modeling prosodic differences for speaker recognition

André Gustavo Adami

doi:10.1016/j.specom.2007.02.005

Abstract

Prosody plays an important role in discriminating speakers. Because relevant prosodic information is complex to estimate, most recognition systems rely on the notion that the statistics of the fundamental frequency (as a proxy for pitch) and speech energy (as a proxy for loudness/stress) distributions can capture prosodic differences between speakers. However, this simplistic notion disregards the temporal aspects and the relationship between prosodic features that determine certain phenomena, such as intonation and stress. We propose an alternative approach that exploits the dynamics between the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize different intonation, stress, or rhythm patterns produced by the variation in the fundamental frequency and speech energy contours. In our approach, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of the dynamics of the fundamental frequency and energy contours. Using simple statistical models, we show that the statistical dependency between such discrete units can capture speaker-specific information. On the extended-data speaker detection task of the 2001 and 2003 NIST Speaker Recognition Evaluations, this approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of fundamental frequency, speech energy, and their deltas. We also show that the proposed prosodic systems are more robust to communication channel effects than the state-of-the-art speaker recognition system. Since conventional speaker recognition systems do not fully incorporate different levels of information, we show that the prosodic features provide complementary information to conventional systems by fusing the prosodic systems with the state-of-the-art system. The relative performance improvement over the state-of-the-art system is about 42% and 12% for the extended-data task of the 2001 and 2003 NIST Speaker Recognition Evaluations, respectively.
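
The idea of turning the contours into discrete units and modeling their statistical dependency can be illustrated with a small sketch. The abstract does not specify the unit inventory or the model, so everything here is an assumption made for illustration: the symbols ("++", "+-", "-+", "--", "uv"), the frame-to-frame comparison, the add-alpha smoothed bigram model, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical unit inventory: F0 direction + energy direction, plus "uv"
# for unvoiced frames (assumed here to be marked by F0 <= 0).
VOCAB = ["++", "+-", "-+", "--", "uv"]

def contour_to_units(f0, energy):
    """Label each frame transition by the joint direction of F0 and energy."""
    units = []
    for i in range(1, len(f0)):
        if f0[i] <= 0 or f0[i - 1] <= 0:
            units.append("uv")
            continue
        f = "+" if f0[i] >= f0[i - 1] else "-"
        e = "+" if energy[i] >= energy[i - 1] else "-"
        units.append(f + e)
    return units

def collapse(units):
    """Merge runs of identical labels into a single unit."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def bigram_counts(units):
    """Count adjacent unit pairs."""
    return Counter(zip(units, units[1:]))

def bigram_logprob(units, counts, alpha=1.0):
    """Log-probability of a unit sequence under an add-alpha bigram model."""
    context = Counter()
    for (a, _), c in counts.items():
        context[a] += c
    total = 0.0
    for a, b in zip(units, units[1:]):
        num = counts[(a, b)] + alpha
        den = context[a] + alpha * len(VOCAB)
        total += math.log(num / den)
    return total

def detection_score(units, speaker_counts, background_counts):
    """Log-likelihood ratio between a speaker model and a background model."""
    return (bigram_logprob(units, speaker_counts)
            - bigram_logprob(units, background_counts))
```

A positive `detection_score` means the test sequence is better explained by the speaker's unit statistics than by the background's. A real system would derive the units from smoothed, segmented contours rather than raw frame differences; this sketch only shows how sequence statistics can carry information that distribution statistics of the raw features would miss.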
