Effectiveness of dereverberation, feature transformation, discriminative training methods, and system combination approach for various reverberant environments

Yuuki Tachioka, Tomohiro Narita, Shinji Watanabe

doi:10.1186/s13634-015-0241-y

Open Access

https://doi.org/10.1186/s13634-015-0241-y

Abstract

The recently released REverberant Voice Enhancement and Recognition Benchmark (REVERB) challenge includes a reverberant automatic speech recognition (ASR) task. This paper describes our proposed system, which is based on multichannel speech enhancement preprocessing and state-of-the-art ASR techniques. For preprocessing, we propose a single-channel dereverberation method with reverberation time estimation, combined with multichannel beamforming that enhances the direct sound relative to the reflected sound. This paper also focuses on state-of-the-art ASR techniques such as discriminative training of acoustic models, including the Gaussian mixture model, subspace Gaussian mixture model, and deep neural networks, as well as various feature transformation techniques. The REVERB challenge requires handling various acoustic environments, but a single ASR system tends to be overly tuned to a specific environment, which degrades performance in mismatched environments. To overcome this mismatch problem, we use a system combination approach that combines multiple ASR systems with different features and different model types, because combining systems that have different error patterns is beneficial. In particular, we use our discriminative training technique for system combination, which achieves better generalization by making the systems complementary through modified discriminative criteria. Experiments show the effectiveness of these approaches, which reach word error rates of 6.76% and 18.60% on the REVERB simulated and real test sets, respectively. These are relative improvements of 68.8% and 61.5% over the baseline.
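The relative improvements quoted in the abstract follow the usual definition of relative word error rate (WER) reduction, (baseline − system) / baseline. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the baseline WERs are not stated in the abstract, so the second helper merely back-calculates the values implied by the rounded figures.

```python
def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


def implied_baseline(system_wer, relative_improvement_pct):
    """Baseline WER implied by a system WER and a relative improvement (percent).

    This inverts relative_improvement(); the result is an estimate only,
    because the published percentages are rounded.
    """
    if not 0.0 <= relative_improvement_pct < 100.0:
        raise ValueError("relative_improvement_pct must be in [0, 100)")
    return system_wer / (1.0 - relative_improvement_pct / 100.0)


if __name__ == "__main__":
    # Figures taken from the abstract: (system WER %, relative improvement %).
    for name, wer, rel in [("simulated", 6.76, 68.8), ("real", 18.60, 61.5)]:
        base = implied_baseline(wer, rel)
        print(f"{name}: implied baseline WER ~ {base:.1f}%")
```

Under these assumptions the implied baselines come out at roughly 21.7% (simulated) and 48.3% (real); these are back-calculated estimates, not numbers reported in the abstract.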

Highlights

Automatic speech recognition (ASR) using distant microphones can overcome application restrictions of places and devices and widen the usage of speech interfaces.
Linear discriminant analysis (LDA) can reduce the influence of reverberation: when reverberation lasts longer than the window size of the short-time Fourier transform (STFT), it distorts speech features across several frames, and long-context input features can handle this distortion [18, 19]. This property is effective for reverberant speech recognition, and this paper investigates the effect of LDA on ASR performance in detail with varying context sizes.
Gaussian mixture model (GMM)-based acoustic models are obtained by using discriminative training techniques [6, 7]. This paper also deals with deep neural networks (DNN) [13], which have recently attracted great attention and for which we have shown promising results in noisy environments [16].
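The long-context input mentioned in the LDA highlight is typically built by splicing each feature frame with its neighbours before the projection is applied. The following is a minimal sketch of that splicing step only, not the paper's implementation: the function name `splice_frames`, the `context` parameter, and the choice to pad the edges by repeating the first and last frames are all assumptions, and the LDA projection itself is not shown.

```python
def splice_frames(frames, context):
    """Stack each frame with `context` neighbouring frames on either side.

    frames:  list of equal-length feature vectors (lists of floats), one per frame.
    context: number of frames taken from each side (assumed parameter name).
    Edges are padded by repeating the first/last frame (an assumption).
    Returns a list of vectors of length (2 * context + 1) * feature_dim.
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    spliced = []
    for t in range(n):
        stacked = []
        for offset in range(-context, context + 1):
            # Clamp the index so frames outside the utterance reuse the edge frame.
            idx = min(max(t + offset, 0), n - 1)
            stacked.extend(frames[idx])
        spliced.append(stacked)
    return spliced


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three toy 2-dimensional frames
    for vec in splice_frames(feats, context=1):
        print(vec)
```

A larger `context` lets the spliced vector span more of the reverberation tail, at the cost of a higher input dimension for the subsequent LDA projection.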

Summary

Introduction

Automatic speech recognition (ASR) using distant microphones can overcome application restrictions of places and devices and widen the usage of speech interfaces. For example, users can control distant home appliances by voice without touching the devices. In such a scenario, it is necessary to address reverberation, which is composed of reflected sounds from walls, ceilings, or furniture, in addition to the direct sound from a sound source. This paper focuses on the speech recognition task, a medium-sized vocabulary continuous speech recognition task, in order to evaluate ASR performance in reverberant environments. In this setting, speech enhancement before ASR is important because it affects ASR performance.

Methods

Results

Conclusion