Auditory front end in DTW word recognition under noisy, reverberant, and multispeaker conditions.

Kazuaki Obara, Tatsuya Hirahara

doi:10.1121/1.401198

Abstract

In this report, three front ends, a fixed Q cochlear filter (FQF), an adaptive Q cochlear filter (AQF), and a Bark DFT (DFT), are compared for use as the front end of a DTW system. The FQF is a conventional cascade/parallel-type cochlear filter that simulates the asymmetrical filtering characteristics of a basilar membrane system. The AQF is a nonlinear cochlear filter that simulates three level-dependent characteristics of a basilar membrane system [T. Hirahara et al., Proc. ICASSP, 496–499 (1989)]. The DFT front end generates 64-channel Bark scale coefficients based on a 512-point DFT magnitude spectrum. All three front ends have 64 channels covering the frequency range from 1.5 to 19.5 Bark. Recognition performance is compared for clean speech, for speech degraded by added noise and/or reverberation, and under multispeaker conditions. Four signal-to-noise ratios, S/N=∞, 40, 20, and 10 dB, are set by adding different levels of pink noise to the speech data. For reverberant speech, impulse responses obtained in the ATR reverberation room, RT=0.2 and 1.1 s, are convolved with the speech data. The speech data used in the experiments are 216 phoneme-balanced Japanese words uttered by two male and two female speakers. A standard dynamic time warping (DTW) system is used as the back end. The experimental results are as follows: (1) For noisy speech, AQF performs as well as FQF and is more robust than DFT. (2) For reverberant speech, AQF is affected more than DFT but still performs better than FQF. (3) For speaker variation, AQF gives better performance than either FQF or DFT. While the advantage of the AQF front end is small with an HMM back end [T. Hirahara et al., Proc. ICSLP, 381–384 (1990)], these results show that the AQF is an excellent front end for a DTW recognition system.
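The noise-mixing step and the DTW back end described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`mix_at_snr`, `dtw_distance`, `recognize`), the Euclidean frame distance, and the symmetric step pattern without slope constraints are all assumptions, since the abstract specifies none of them. Each frame is taken to be a list of channel values, such as the 64 front-end outputs.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the power ratio equals snr_db (in dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Choose gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(ref, test):
    """Accumulated DTW cost between two sequences of feature frames."""
    n, m = len(ref), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], test[j - 1])
            # Assumed step pattern: vertical, horizontal, or diagonal move.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(templates, test):
    """Return the word whose reference template has the smallest DTW cost."""
    return min(templates, key=lambda word: dtw_distance(templates[word], test))
```

In this sketch the S/N=∞ condition corresponds to skipping `mix_at_snr` entirely. Generating the pink noise and convolving with the room impulse responses are not shown.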
