Abstract

Humans are quite adept at communicating in the presence of noise. Most speech processing systems, however, such as automatic speech and speaker recognition systems, suffer a significant drop in performance when speech signals are corrupted by unseen background distortions. The proposed work explores a biologically motivated multi-resolution spectral analysis for speech representation. This approach focuses on the information-rich spectral attributes of speech and offers a detailed yet computationally efficient analysis of the speech signal through a careful choice of model parameters. Further, the approach takes advantage of an information-theoretic analysis of the message-dominant and speaker-dominant regions in the speech signal, and defines feature representations to address two diverse tasks: speech recognition and speaker recognition. The proposed analysis surpasses standard Mel-Frequency Cepstral Coefficients (MFCC) and their enhanced variants (via mean subtraction, variance normalization and time-sequence filtering), and yields significant improvements over a state-of-the-art noise-robust feature scheme on both speech and speaker recognition tasks.
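The enhanced MFCC baseline referred to above (mean subtraction and variance normalization, often called CMVN) is easy to sketch. The Python snippet below is a minimal illustration, assuming librosa for feature extraction and common 25 ms/10 ms framing; the paper's exact front-end parameters and tooling are not specified here.

```python
import numpy as np
import librosa

def mfcc_cmvn(wav_path, sr=16000, n_mfcc=13):
    """Compute MFCCs and apply per-utterance cepstral mean and
    variance normalization (CMVN)."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop are common ASR defaults (an
    # assumption; the paper's analysis parameters may differ).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    # Normalize each cepstral coefficient over time: zero mean,
    # unit variance. This removes stationary channel effects.
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True) + 1e-8
    return (mfcc - mean) / std
```

Time-sequence filtering (e.g., RASTA-style band-pass filtering of each coefficient trajectory) would be applied on top of this normalization; it is omitted here for brevity.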

Highlights

  • Mel-Frequency Cepstral Coefficients (MFCC) are a classic example of the successful influence of biological intuition on speech technologies, making them a staple in state-of-the-art ASR and ASV systems (Chen and Bilmes 2007; Kinnunen and Li 2010)

  • Under additive noise conditions reflecting a variety of real acoustic scenarios, the auditory multi-resolution spectral (AMRS) features perform substantially better than MFCCs, with an average relative improvement of 38.9% on the ASR task and an average relative error-rate reduction of 31.9% on the ASV task

  • We begin to address the issue of versatile speech representations that could bear relevance to both speaker and speech recognition tasks

Summary

Introduction

Despite the enormous advances in computing technology over the last few decades, progress in the fields of automatic speech and speaker recognition still lags behind human performance, particularly in noisy conditions. MFCCs provide a compact form of representing spectral details in the speech signal that is motivated by both perceptual and computational considerations. They exploit the unique nature of frequency mapping in the auditory system by warping the linear frequency axis into a nonlinear, quasi-logarithmic scale. The proposed analysis further mimics the spectral tuning of neurons in the central auditory pathway, in which individual neurons are tuned to specific tonotopic frequencies (like cochlear filters) and are selective to various spectral shapes, in particular to peaks of various widths on the frequency axis, expanding the cochlear one-dimensional tonotopic axis onto a two-dimensional sheet (Schreiner and Calhoun 1995; Versnel et al. 1995). This analysis provides a more localized mapping of the spectral profile that highlights details of bandwidth and spectral patterns in the signal. In this scheme, Ω_max is the highest spectral modulation frequency, set at 12 cycles per octave (CPO) given the spectral resolution of 24 channels per octave.
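To make the multi-resolution analysis concrete, the sketch below filters each frame of a log-frequency auditory spectrogram (24 channels per octave) with seed functions of several widths along the tonotopic axis; at this resolution, spectral modulations up to Ω_max = 12 CPO are representable. The Gabor-like (Gaussian-windowed cosine) seed and the particular scale set are illustrative assumptions, not the paper's exact filter bank.

```python
import numpy as np

def multiscale_spectral_analysis(aud_spec, scales_cpo=(0.5, 1, 2, 4, 8),
                                 channels_per_octave=24):
    """Filter each spectral frame with seed functions of different
    widths (scales, in cycles per octave), giving a multi-resolution
    view of the spectral profile.

    aud_spec: array of shape (n_frames, n_channels), a log-frequency
    spectrogram with `channels_per_octave` channels per octave.
    """
    n_frames, n_channels = aud_spec.shape
    # Tonotopic axis in octaves, centered at zero.
    x = np.arange(-(n_channels // 2), n_channels // 2) / channels_per_octave
    responses = []
    for omega in scales_cpo:  # each scale must satisfy omega <= 12 CPO
        # Gabor-like seed: a sinusoid at `omega` cycles per octave under
        # a Gaussian envelope whose width shrinks as the scale grows.
        seed = np.exp(-0.5 * (x * omega) ** 2) * np.cos(2 * np.pi * omega * x)
        # Convolve every frame's spectral slice with the seed function.
        filtered = np.array([np.convolve(frame, seed, mode='same')
                             for frame in aud_spec])
        responses.append(filtered)
    # Shape (n_scales, n_frames, n_channels): one filtered
    # spectrogram per scale.
    return np.stack(responses)
```

Stacking the responses across scales expands the one-dimensional spectral profile into the two-dimensional (tonotopic frequency × scale) representation described above; task-specific features can then be drawn from the scales that carry the relevant information.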

Choice of scales
Phoneme recognition setup
Performance of AMRS features
Comparison with state-of-the-art noise robust scheme
Findings
Discussion