Abstract

A scream is defined as a sustained, high-energy vocalization that lacks phonological structure. This lack of phonological structure is what distinguishes a scream from other forms of loud vocalization, such as a "yell." This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and the Lombard effect contributes to degraded performance in automatic speech systems (e.g., speech recognition, speaker identification, and diarization). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study presents a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture model-universal background model (GMM-UBM) framework is unreliable when evaluated with screams.

Highlights

  • In this study, scream refers to sustained, loud vocalizations with no phonological structure

  • It is reasonable to assume that a considerable amount of data in forensic trials contains stressed speech or screams, which we argue have not been adequately addressed by the speaker recognition community

  • We evaluated the speaker verification performance for two different types of acoustic features: (i) mel-frequency cepstral coefficients (MFCCs), and (ii) perceptual minimum variance distortionless response (PMVDR)
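As a rough illustration of how a GMM-UBM verification trial is typically scored, the sketch below computes the average per-frame log-likelihood ratio between a speaker model and a universal background model. It is a minimal sketch under stated assumptions, not the configuration used in the study: the diagonal-covariance models, the function names (`log_gauss_diag`, `gmm_log_likelihood`, `llr_score`), and the toy 2-dimensional values are all hypothetical, and real systems would use MFCC or PMVDR feature vectors in their place.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm); gmm is a list of (weight, mean, var) components."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    total = sum(
        gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
        for x in frames
    )
    return total / len(frames)

# Toy 2-dimensional models (made-up values, for illustration only).
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])]
speaker = [(0.5, [1.0, 1.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])]

near_speaker = [[1.0, 1.1], [0.9, 1.0]]        # frames close to the speaker model
far_from_speaker = [[-1.0, -1.0], [-0.8, -1.2]]  # frames better explained by the UBM
```

A trial is accepted when the score exceeds a chosen threshold. Frames that the speaker model explains better than the UBM yield a positive score, while frames that drift away from the enrolled speaker's model (as screams may, relative to neutral speech) push the score toward or below zero.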


Summary

INTRODUCTION

In this study, scream refers to sustained, loud vocalizations with no phonological structure. A wide range of studies have focused on finding robust solutions to mismatch conditions, including speech under stress (Hansen, 1996; Hansen et al., 2000), the Lombard effect (Hansen, 1988), physical task stress (Godin and Hansen, 2008), speech vocal effort (Zhang and Hansen, 2007), speaking styles (Lippmann et al., 1986; Hansen, 1988; Shriberg et al., 2008), long-term aging (Kelly et al., 2014), and, more recently, non-speech sounds (Nandwana and Hansen, 2014). Such mismatch can be introduced by either speaker- or environment-dependent factors.

BACKGROUND
CORPUS DEVELOPMENT
UT-NonSpeech-I
UT-NonSpeech-II
ACOUSTIC ANALYSIS
Fundamental frequency analysis
Frame energy distribution analysis
Spectral slope analysis
IMPACT ON SPEAKER VERIFICATION SYSTEM
Front-end processing
GMM-UBM framework
Speaker verification experiment
HUMAN PERCEPTION TEST
GMM ANALYSIS
CONCLUSION