Strategies for distant speech recognition in reverberant environments

Marc Delcroix, Nobutaka Ito, Atsunori Ogawa, Takaaki Hori, Keisuke Kinoshita, Masakiyo Fujimoto, Shoko Araki, Tomohiro Nakatani, Yotaro Kubo, Miquel Espi, Takuya Yoshioka

doi:10.1186/s13634-015-0245-7

Open Access

https://doi.org/10.1186/s13634-015-0245-7

Journal: EURASIP Journal on Advances in Signal Processing
Publication Date: Jul 19, 2015
Citations: 79
License type: CC BY 4.0

Affiliation: NTT (Japan)

Abstract

Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

Highlights

Automatic speech recognition (ASR) is being increasingly used in everyday life
For SimData, we observe that performance continues to improve up to a total of about 40 coefficients for the 1ch and 2ch cases and longer for 8ch. These results suggest that for acoustic environments covered by the REVERB challenge, i.e., an RT60 up to about 700 ms, a filter length of about 300 ms may be sufficient in practice
The system we present was developed for the REVERB challenge

Summary

Introduction

Automatic speech recognition (ASR) is being increasingly used in everyday life. There are two main technical challenges with the REVERB task: the acoustic conditions, which include a large amount of reverberation in addition to a nonnegligible amount of background noise, and the mismatch between the training data obtained from simulation and the RealData set. The REVERB challenge provided a baseline multicondition training data set that consists of simulated reverberant speech with additional noise. Small artifacts or distortions caused by the SE front-end may affect recognition performance. Another cause of the mismatch is the different acoustic conditions seen during training and testing, which is noticeable with the RealData set.

4.2 Preliminary experiments on SE front-end

We first investigate the influence on ASR performance of characteristics of the SE configurations, such as the prediction filter length, the scheme for reverberation reduction (linear filtering or spectral subtraction), and the processing order of WPE and MVDR.

4.3 Preliminary results on ASR back-end

Let us analyze different factors influencing the ASR back-end, such as the number of hidden layers and the size of the input context, the influence of training data, and the choice of adaptation strategy.

Influence of number of hidden layers and input context size

Results using extended multi-condition training data set

Findings

Conclusion