Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition

Aditya Arie Nugraha, Kazumasa Yamamoto, Seiichi Nakagawa

doi:10.1186/1687-4722-2014-13

Abstract

We present a feature enhancement method that uses neural networks (NNs) to map a reverberant feature in the log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. We evaluated the method in experiments with speaker identification (SID) and automatic speech recognition (ASR) systems. The SID experiments were conducted using our own simulated and real reverberant datasets, while the ASR system was evaluated using the CENSREC-4 evaluation framework. The proposed method remarkably improved the performance of both systems while using only limited stereo data with low speaker variability as training data. In the SID evaluation, using only one pair of utterances for matched-condition cases, we achieved error rate reductions (ERR) of 26.0% and 34.8% relative to the baseline on simulated and real data, respectively. With a combined dataset containing 15 pairs of utterances by one speaker from three positions in a room, we reached an average identification rate of 93.7% (over three known and two unknown positions), which is an ERR of 42.2% relative to the use of cepstral mean normalization (CMN). In the ASR evaluation, using 40 pairs of utterances as the NN training data, we reached an ERR of 78.4% relative to the baseline with simulated utterances by five speakers. Moreover, we reached ERRs of 75.4% and 71.6% relative to the baseline with real utterances by five speakers and one speaker, respectively.
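The ERR figures above can be read more easily with a small worked calculation. The sketch below assumes the common definition of ERR as the relative reduction in error rate, (baseline error − new error) / baseline error; the abstract does not spell out the formula, so this definition and both function names are illustrative assumptions. Under that assumption, an identification rate of 93.7% with an ERR of 42.2% relative to CMN would imply a CMN identification rate of roughly 89.1%, a value derived here and not reported in the text.

```python
def error_rate_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Relative error rate reduction in percent, from two accuracies in percent.

    Assumed definition (not stated in the abstract):
        ERR = (baseline_error - improved_error) / baseline_error * 100
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * (baseline_err - improved_err) / baseline_err


def implied_baseline_accuracy(improved_acc: float, err_percent: float) -> float:
    """Invert the ERR definition to recover the baseline accuracy in percent."""
    if err_percent >= 100.0:
        raise ValueError("ERR must be below 100%")
    improved_err = 100.0 - improved_acc
    baseline_err = improved_err / (1.0 - err_percent / 100.0)
    return 100.0 - baseline_err


# Figures quoted in the abstract: 93.7% identification rate, 42.2% ERR vs. CMN.
cmn_acc = implied_baseline_accuracy(93.7, 42.2)
print(f"Implied CMN identification rate: {cmn_acc:.1f}%")
```

Because ERR is relative, the same percentage can correspond to very different absolute gains depending on how high the baseline error is.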

Highlights

Using distant-talking microphones for automatic speech recognition (ASR) or automatic speaker identification (SID) systems can improve user convenience.
We focused on developing a dereverberation approach rather than on improving identification accuracy through a discriminative classification approach, so a Gaussian mixture model (GMM) approach should be sufficient for evaluating the proposed dereverberation method.
Note that this baseline can be regarded as the result of enhancement using cepstral mean normalization (CMN), because CMN was used to preprocess the GMM training data.

Summary

Introduction

Using distant-talking microphones for automatic speech recognition (ASR) or automatic speaker identification (SID) systems can improve user convenience. Cascade NNs were trained using the Cascade2 algorithm with the Resilient Backpropagation (RPROP) weight update algorithm, which is a variation of the batch backpropagation algorithm. These two most important parts are most likely the reason why the proposed method could generalize and perform remarkably well with a limited amount of stereo data (one or five pairs of utterances, corresponding to less than 1 min of speech). In [29,30], a denoising autoencoder (DAE), which is one of the deep neural network (DNN) approaches, was used to map coefficient vectors from a sequence of reverberant speech to a sequence of clean speech; those works introduced the use of short and long windows. The method proposed in this work maps an N-frame segment to a one-frame segment of log-melspectral coefficients using cascade NNs and requires only a small amount of training data. An NN is used because it should be able to capture a non-linear relation across frames, which is caused by the analysis window (frame) being too short to capture the reverberation effect and by other complex factors.
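To make the N-frame to one-frame mapping concrete, the sketch below shows one way the network input could be assembled: each output frame is paired with a vector that concatenates the current reverberant frame and the N − 1 frames before it. This is an illustration under assumptions, not the authors' implementation: the function name `stack_context`, the causal (past-only) window, and the padding of the first frames by repeating frame 0 are all hypothetical choices, and the segment-based normalization mentioned in the abstract is left out because the text here does not describe it.

```python
from typing import List

Frame = List[float]  # one frame of log-melspectral coefficients


def stack_context(frames: List[Frame], n_context: int) -> List[List[float]]:
    """Build one input vector per frame from an n_context-frame segment.

    The vector for frame t concatenates frames t-n_context+1 .. t.
    Indices before the start of the utterance reuse frame 0
    (an assumed padding scheme, chosen only for illustration).
    """
    if n_context < 1:
        raise ValueError("n_context must be at least 1")
    stacked: List[List[float]] = []
    for t in range(len(frames)):
        window: List[float] = []
        for k in range(t - n_context + 1, t + 1):
            window.extend(frames[max(k, 0)])
        stacked.append(window)
    return stacked


# Toy utterance: 4 frames with 2 coefficients each (real features have more bands).
reverberant = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
inputs = stack_context(reverberant, n_context=3)

# Each input has n_context * 2 values; the training target for inputs[t]
# would be the single anechoic frame at the same index t.
print(len(inputs), len(inputs[0]))
```

With D coefficients per frame, each input vector has N × D values and each target has D values, so the network learns a many-frames-to-one-frame regression from stereo (reverberant and anechoic) utterance pairs.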

Overview of neural network

The estimation function

Non-causal model

Evaluation using automatic speaker identification system

Method

Evaluation using automatic speech recognition system

Conclusions


Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Apr 10, 2014
Citations: 46
License type: CC BY 2.0
