Abstract

The REVERB challenge provides a common framework for evaluating feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades under realistic acoustic conditions, particularly in real as well as simulated reverberant environments. In this contribution, we employ multiple feature extractors for speech recognition with multi-condition training data: the conventional mel-filterbank, a multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbanks, an iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features. To improve recognition performance, we combine their outputs using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER instead of multi-microphone signal processing, reducing the word error rate (WER) by selecting the best-scoring word across channels. As in previous work, we also apply i-vector-based speaker adaptation, which was found to be effective; in a speech recognition task, speaker adaptation reduces the mismatch between training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer, with both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the WER on SimData from 10.0 to 9.3 % compared with utterance-based batch processing. Using full batch processing, we obtained average WERs of 9.0 and 23.4 % on SimData and RealData, respectively, for the two-channel task, and 8.9 and 21.7 %, respectively, for the eight-channel task.
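
As a rough illustration of the combination step, the sketch below implements word-level voting in the spirit of ROVER. It assumes the recognizer outputs have already been aligned into a word transition network (the alignment itself is omitted), and the weight `alpha`, the confidences, and the example hypotheses are all hypothetical; this is a minimal sketch of the general ROVER idea, not the exact setup used in the paper.

```python
# Minimal sketch of ROVER-style word-level voting (hypothetical example).
# Assumes recognizer outputs are already aligned into a word transition
# network, so each "slot" holds one candidate word (or None for a deletion)
# plus a confidence score from each system/channel.
from collections import defaultdict

def rover_vote(aligned_slots, alpha=0.5):
    """Pick one word per slot by combining vote counts and confidences.

    aligned_slots: list of slots; each slot is a list of
                   (word_or_None, confidence) pairs, one per system.
    alpha: trade-off between frequency of occurrence and confidence.
    """
    output = []
    for slot in aligned_slots:
        n_systems = len(slot)
        scores = defaultdict(lambda: [0, 0.0])  # word -> [count, conf_sum]
        for word, conf in slot:
            scores[word][0] += 1
            scores[word][1] += conf
        best_word, best_score = None, float("-inf")
        for word, (count, conf_sum) in scores.items():
            # ROVER-style score: alpha * relative frequency
            # + (1 - alpha) * average confidence of this word.
            score = alpha * count / n_systems + (1 - alpha) * conf_sum / count
            if score > best_score:
                best_word, best_score = word, score
        if best_word is not None:  # None models a NULL arc (word deletion)
            output.append(best_word)
    return output

# Three hypothetical single-channel hypotheses, already aligned:
slots = [
    [("the", 0.9), ("the", 0.8), ("a", 0.4)],
    [("cat", 0.7), ("cap", 0.5), ("cat", 0.6)],
    [(None, 0.0), ("sat", 0.3), (None, 0.0)],
]
print(rover_vote(slots))  # -> ['the', 'cat']
```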

Highlights

  • Automatic speech recognition (ASR) is a key component of hands-free man-machine interaction

  • There are a few differences between utterance-based batch processing and full batch processing

  • We computed an i-vector for each speaker from multi-taper mel-frequency cepstral coefficient (MFCC) features and reused these i-vectors during training and recognition with the other front-ends (a sketch of the computation follows this list)
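
The i-vector in the last highlight can be pictured as the posterior mean of a total-variability model given Baum-Welch statistics accumulated from the multi-taper MFCC frames. The sketch below assumes a trained UBM and total-variability matrix `T`; here both are random placeholders with toy dimensions, so it only illustrates the computation, not the paper's actual extractor.

```python
# Sketch: i-vector as the posterior mean of the total-variability model,
#   w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 f,
# where N stacks the zeroth-order stats and f the centered first-order stats.
# UBM parameters and T are assumed already trained (random placeholders here).
import numpy as np

C, F, R = 8, 13, 4          # UBM components, feature dim, i-vector dim (toy sizes)
rng = np.random.default_rng(0)

T = rng.standard_normal((C * F, R)) * 0.1        # total-variability matrix
sigma = np.abs(rng.standard_normal(C * F)) + 0.5  # diagonal UBM covariances

def extract_ivector(N_c, f_centered):
    """N_c: (C,) zeroth-order stats; f_centered: (C*F,) centered 1st-order stats."""
    # Expand per-component occupancies to the supervector dimension.
    N_diag = np.repeat(N_c, F)                    # (C*F,)
    Tt_Sinv = T.T / sigma                         # T' Sigma^-1, shape (R, C*F)
    precision = np.eye(R) + Tt_Sinv @ (N_diag[:, None] * T)
    return np.linalg.solve(precision, Tt_Sinv @ f_centered)

# Hypothetical Baum-Welch stats for one speaker's multi-taper MFCC frames:
N_c = rng.uniform(1.0, 50.0, size=C)
f_centered = rng.standard_normal(C * F)
w = extract_ivector(N_c, f_centered)
print(w.shape)  # (4,) -- one fixed-length vector per speaker
```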

Introduction

Automatic speech recognition (ASR) is a key component of hands-free man-machine interaction. We use a hybrid DNN-HMM architecture with several variants of filterbank features and one cepstral feature, the maximum likelihood inverse filtering-based dereverberated (MLIFD) cepstral coefficients, for the REVERB challenge 2014 tasks. For training the DNN-HMM models, we generate 23-dimensional filterbank features per frame for each of the following front-ends: the baseline conventional mel-filterbank (MFB); the multi-taper mel-filterbank with logarithmic nonlinearity (MMFBl) and with power-law nonlinearity (MMFBp); the robust compressive gammachirp filterbank (RCGFB) and robust mel-filterbank (RMFB); and the iterative deconvolution-based dereverberated MFB (ITD-MFB).
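
To make the front-end concrete, here is a minimal sketch of a 23-dimensional log mel-filterbank feature extractor, assuming a 16 kHz sample rate, 25 ms frames, and a 10 ms hop (all assumed values, not taken from the paper). A multi-taper variant (MMFBl/MMFBp) would replace the single Hamming-windowed periodogram with an average over tapered periodograms, and MMFBp would swap the log for a power-law nonlinearity.

```python
# Minimal sketch of a 23-dimensional log mel-filterbank front-end.
# Sample rate, frame length, hop, and FFT size are assumed values.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank_feats(signal, sr=16000, n_filt=23, n_fft=512,
                         frame_len=400, hop=160):
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Single-window power spectrum (a multi-taper variant would average
    # several tapered periodograms here instead).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filters spanning 0 Hz .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log nonlinearity (MMFBp would use a power law here instead).
    eps = np.finfo(float).eps
    return np.log(power @ fbank.T + eps)   # shape (n_frames, 23)

x = np.random.randn(16000)                 # 1 s of hypothetical 16 kHz audio
print(mel_filterbank_feats(x).shape)       # (98, 23)
```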
