Relevant spectro‐temporal modulations for robust speech and nonspeech classification.

Sridhar Krishna Nemala, Mounya Elhilali

doi:10.1121/1.3384192

Abstract

Robust speech/non‐speech classification is an important step in a variety of speech processing applications. For example, in speech and speaker recognition systems designed to work in real‐world environments, robust discrimination of speech from other sounds is an essential pre‐processing step. Auditory‐based features at multiple scales of temporal and spectral resolution have been shown to be very useful for the speech/non‐speech classification task [Mesgarani et al., IEEE Trans. Speech Audio Process. 10, 504–516 (2002)]. These features are computed with a biologically inspired auditory model that maps a given sound to a high‐dimensional representation of its spectro‐temporal modulations, mimicking the stages of processing along the auditory pathway from the periphery to the primary auditory cortex. In this work, we analyze the contribution of different temporal and spectral modulations to robust speech/non‐speech classification. The results suggest that temporal modulations in the range 12–22 Hz and spectral modulations in the range 1.5–4 cycles/octave are particularly useful for achieving robustness in highly noisy and reverberant environments.
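As a rough illustration of what restricting a representation to the reported modulation ranges could look like, the sketch below measures the fraction of modulation energy that falls inside the 12–22 Hz and 1.5–4 cycles/octave band. It is not the auditory model used in this work: it assumes a plain 2‐D Fourier transform of a log‐frequency spectrogram as a stand‐in for the cortical filter bank, and the function names `modulation_band_fraction` and `ripple`, together with the parameters `frame_rate_hz` and `chans_per_octave`, are hypothetical.

```python
import numpy as np


def modulation_band_fraction(log_spec, frame_rate_hz, chans_per_octave,
                             rate_band=(12.0, 22.0), scale_band=(1.5, 4.0)):
    """Fraction of 2-D modulation energy inside a rate/scale band.

    log_spec: array of shape (n_channels, n_frames); channels are assumed to be
    uniformly spaced on a log-frequency axis (chans_per_octave per octave).
    A 2-D FFT is used here as a simplified proxy for a cortical filter bank.
    """
    log_spec = np.asarray(log_spec, dtype=float)
    n_chan, n_frames = log_spec.shape

    # Remove the mean so the DC term does not dominate the energy total.
    power = np.abs(np.fft.fft2(log_spec - log_spec.mean())) ** 2

    # Modulation axes: scale in cycles/octave, rate in Hz.
    scales = np.abs(np.fft.fftfreq(n_chan, d=1.0 / chans_per_octave))
    rates = np.abs(np.fft.fftfreq(n_frames, d=1.0 / frame_rate_hz))

    scale_mask = (scales >= scale_band[0]) & (scales <= scale_band[1])
    rate_mask = (rates >= rate_band[0]) & (rates <= rate_band[1])
    in_band = np.outer(scale_mask, rate_mask)

    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0


def ripple(rate_hz, scale_cpo, n_chan=96, n_frames=100,
           frame_rate_hz=100.0, chans_per_octave=24.0):
    """Synthetic moving ripple with a single temporal rate and spectral scale."""
    x = np.arange(n_chan)[:, None] / chans_per_octave   # octaves
    t = np.arange(n_frames)[None, :] / frame_rate_hz    # seconds
    return np.cos(2.0 * np.pi * (rate_hz * t + scale_cpo * x))
```

Under these assumptions, a ripple at 16 Hz and 2 cycles/octave lies almost entirely inside the band, while one at 4 Hz and 0.5 cycles/octave lies outside it. In practice such a measure would be applied to a spectrogram of real audio, and the band edges would be treated as tunable parameters.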
