Noise-robust Automatic Speech Recognition (ASR) is essential for robots that are expected to communicate with humans in everyday environments. In such environments, Voice Activity Detection (VAD) performance degrades, and ASR performance deteriorates due to noise and VAD failures. To cope with these problems, humans are known to improve speech recognition by exploiting visual information such as lip movements. We therefore propose a two-layered audio-visual (AV) integration framework for VAD and ASR. The framework comprises three key methods: Audio-Visual Voice Activity Detection (AV-VAD) based on a Bayesian network, a new lip-derived visual feature that is robust to visual noise, and microphone array processing that improves the Signal-to-Noise Ratio (SNR) of the input signal. We implemented a prototype audio-visual speech recognition system based on the proposed framework using HARK, our robot audition system. Voice activity detection and speech recognition experiments demonstrate the effectiveness of audio-visual integration, microphone array processing, and their combination for VAD and ASR. Preliminary results show that our system improves ASR performance by 20 and 9.7 points with and without microphone array processing, respectively, and also improves robustness under several auditory and visual noise conditions.
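The abstract only names the fusion scheme, so as a rough illustration of the idea behind Bayesian-network-based AV-VAD, the sketch below reduces it to naive-Bayes fusion of one audio and one visual feature per frame. This is not the paper's trained model: the Gaussian likelihoods, feature choices (frame energy, lip motion), and all parameter values are hypothetical assumptions.

```python
import numpy as np

# Minimal naive-Bayes sketch of audio-visual VAD fusion:
#   P(speech | a, v) ∝ P(a | speech) * P(v | speech) * P(speech)
# The Gaussian likelihoods and every parameter below are hypothetical,
# not the paper's Bayesian network or its trained parameters.

def gaussian(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def av_vad(audio_energy, lip_motion,
           prior_speech=0.5,
           audio_params=((3.0, 1.0), (0.5, 0.5)),   # (speech, nonspeech) (mean, var)
           visual_params=((2.0, 1.0), (0.2, 0.3))):
    """Return P(speech) for one frame from an audio and a visual feature."""
    (am_s, av_s), (am_n, av_n) = audio_params
    (vm_s, vv_s), (vm_n, vv_n) = visual_params
    p_speech = (prior_speech
                * gaussian(audio_energy, am_s, av_s)
                * gaussian(lip_motion, vm_s, vv_s))
    p_nonspeech = ((1.0 - prior_speech)
                   * gaussian(audio_energy, am_n, av_n)
                   * gaussian(lip_motion, vm_n, vv_n))
    return p_speech / (p_speech + p_nonspeech)

# Example frame: strong lip motion compensates for a weak, noisy audio cue.
print(av_vad(audio_energy=1.2, lip_motion=1.8) > 0.5)  # True -> voice active
```

The fusion is what makes the detector noise-robust: when one modality is corrupted (acoustic noise or visual noise), its likelihood ratio flattens and the decision leans on the cleaner modality.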