Hands-free speech recognition by a microphone array and HMM composition

Satoshi Nakamura, Kiyohiro Shikano, Tetsuya Takiguchi, Takesi Yamada

doi:10.1121/1.416498

Abstract

A hands-free speech interface is one of the ultimate goals of human–machine interface research. This paper introduces two methods for distant-talking speech recognition in noisy and reverberant rooms. The first method is speech recognition using a microphone array. A microphone array can enhance a speech signal by exploiting spatial phase differences, even in environments containing nonstationary noise. The proposed method is composed of two modules: (1) retrieval of a high-SNR signal by a delay-and-sum beamformer and (2) localization and tracking of the speaker’s direction by extracting signal power and pitch harmonics. The second method is speech recognition based on HMM composition. The proposed HMM composition extends the HMM composition method for additive noise to handle a convolutional acoustical transfer function. HMMs are prepared beforehand for clean speech, noise, and the acoustical transfer function. The HMM composition is then conducted twice: first for the speech HMM and the acoustical transfer function HMM in the cepstrum domain, and then for the resulting distorted-speech HMMs and the noise HMM in the linear spectral domain. Speaker-dependent and speaker-independent word recognition experiments using tied-mixture monophone HMMs demonstrate the effectiveness of the proposed methods. Furthermore, an effective coupling of the two methods is also discussed.
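
As an illustration of the delay-and-sum beamformer named in module (1), the following is a minimal sketch and not the authors' implementation. It assumes a uniform linear array, a far-field source, and delays rounded to whole samples. The function names `steering_delays` and `delay_and_sum`, and all parameter names and values, are hypothetical. Speaker localization by signal power and pitch harmonics is not covered.

```python
import math


def steering_delays(num_mics, spacing_m, angle_rad, sample_rate,
                    speed_of_sound=343.0):
    """Integer sample delays for a uniform linear array (assumed geometry).

    The far-field arrival delay at microphone m is
    m * spacing * sin(angle) / c seconds, converted here to samples and
    shifted so that the smallest delay is zero.
    """
    raw = [m * spacing_m * math.sin(angle_rad) / speed_of_sound * sample_rate
           for m in range(num_mics)]
    base = min(raw)
    return [round(r - base) for r in raw]


def delay_and_sum(channels, delays):
    """Align each channel by its delay and average across microphones.

    channels: list of equal-length sample lists, one per microphone.
    delays:   non-negative integer arrival delay (in samples) per channel.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        # Advance the channel by d samples to undo its arrival delay.
        for t in range(n - d):
            out[t] += signal[t + d]
    return [v / len(channels) for v in out]


# Usage: three microphones receive the same source with different delays.
source = [math.sin(0.3 * t) for t in range(50)]
delays = [0, 2, 4]
channels = [[0.0] * d + source[:len(source) - d] for d in delays]
enhanced = delay_and_sum(channels, delays)
```

Signals arriving from the steered direction add coherently after alignment, while noise from other directions adds with mismatched phase and is attenuated by the averaging. A practical system would typically need fractional delays, which this sketch omits.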
