Automatic speech recognition (ASR) is essential for human-humanoid communication. One of the main problems with ASR on a humanoid is that the humanoid inevitably generates motor noise. This noise is easily captured by the humanoid's microphones because the noise sources are closer to the microphones than the target speech source, so the signal-to-noise ratio (SNR) of the input speech becomes quite low (sometimes below 0 dB). However, these noises can be estimated by using information on the humanoid's motions and gestures. This paper proposes a method to improve ASR for a humanoid with motor noise by exploiting its motion/gesture information. The method consists of noise suppression and missing-feature-theory-based ASR (MFT-ASR). The proposed noise suppression technique is based on spectral subtraction, and white noise is added to blur the distortion caused by the suppression. MFT-ASR improves recognition by masking unreliable acoustic features in the input sound, and the motion/gesture information is used to identify these unreliable features. Furthermore, we also evaluated the method in combination with acoustic model adaptation by MLLR (Maximum Likelihood Linear Regression); unsupervised MLLR was used for the adaptation. We evaluated the proposed method through recognition of speech recorded with Honda's ASIMO in a reverberant room. The noise data contained 34 kinds of noise: motor noise without motion, gesture noise, walking noise, and other kinds of noise. The experimental results show that the proposed method outperforms the conventional multi-condition training technique.
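The two front-end ideas in the abstract, spectral subtraction with a white-noise floor and a binary reliability mask for MFT-ASR, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, parameter names, and threshold values are all assumptions, and the motion-indexed noise estimate is simply passed in as a spectrogram.

```python
import numpy as np

def suppress_and_mask(speech_spec, noise_spec,
                      alpha=1.0, floor_db=-20.0, mask_thresh=3.0):
    """Spectral subtraction with white-noise flooring, plus a binary
    reliability mask for missing-feature-theory ASR.

    speech_spec, noise_spec: magnitude spectrograms (freq x frames).
    noise_spec stands in for the motor-noise estimate derived from the
    humanoid's motion/gesture information. All names and thresholds
    here are illustrative, not taken from the paper.
    """
    # Subtract the estimated motor-noise spectrum (over-subtraction
    # factor alpha is a common spectral-subtraction parameter).
    clean = speech_spec - alpha * noise_spec

    # Flooring: replace low or negative residuals with low-level white
    # noise, which blurs the "musical noise" distortion that plain
    # subtraction would leave behind.
    floor = 10.0 ** (floor_db / 20.0) * np.max(speech_spec)
    white = floor * np.random.rand(*speech_spec.shape)
    clean = np.where(clean > floor, clean, white)

    # Reliability mask: bins dominated by the estimated noise are
    # marked unreliable (0); an MFT decoder would ignore them.
    mask = (speech_spec > mask_thresh * noise_spec).astype(float)
    return clean, mask
```

In a real system the mask would be computed in the same feature domain as the acoustic model (e.g. mel-filterbank outputs) and the noise template would be selected per frame from the robot's current motion command.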