Abstract

We present a feature enhancement method that uses neural networks (NNs) to map reverberant features in the log-melspectral domain to their corresponding anechoic features. The mapping is performed by cascade NNs trained using the Cascade2 algorithm together with segment-based normalization. We evaluated the method with speaker identification (SID) and automatic speech recognition (ASR) systems. The SID experiments were conducted on our own simulated and real reverberant datasets, while the CENSREC-4 evaluation framework was used for the ASR system. The proposed method remarkably improved the performance of both systems while using only limited stereo data with low speaker variability as training data. In the SID evaluation, we reached error rate reductions (ERR) of 26.0% and 34.8% relative to the baseline on simulated and real data, respectively, using only one pair of utterances for matched-condition cases. Using a combined dataset containing 15 pairs of utterances by one speaker from three positions in a room, we reached an average identification rate of 93.7% (three known and two unknown positions), which is an ERR of 42.2% relative to cepstral mean normalization (CMN). In the ASR evaluation, using 40 pairs of utterances as the NN training data, we reached an ERR of 78.4% relative to the baseline with simulated utterances by five speakers. Moreover, we reached ERRs of 75.4% and 71.6% relative to the baseline with real utterances by five speakers and one speaker, respectively.
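
The ERR figures above compare error rates rather than accuracies. A minimal sketch of that calculation follows, assuming ERR is the usual relative reduction (baseline error minus new error, divided by baseline error); the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def error_rate_reduction(baseline_acc, proposed_acc):
    """Relative error rate reduction in percent.

    Both arguments are accuracies (identification or recognition rates) in percent.
    Assumption: ERR = (baseline error - proposed error) / baseline error.
    """
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    if baseline_err <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Hypothetical accuracies, chosen only to illustrate the formula:
# going from 80% to 90% accuracy halves the error rate.
example_err = error_rate_reduction(80.0, 90.0)
print(example_err)  # 50.0
```

Under this definition, a modest gain in accuracy can correspond to a large ERR when the baseline error is already small.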

Highlights

  • The use of distant-talking microphones for automatic speech recognition (ASR) or automatic speaker identification (SID) systems can improve user convenience

  • We focused on developing a dereverberation approach rather than on improving identification accuracy with a discriminative classification approach, so a Gaussian mixture model (GMM) approach should be sufficient for evaluating the proposed dereverberation method

  • Note that we can regard this baseline as the result of enhancement using cepstral mean normalization (CMN), because CMN was used as preprocessing for the GMM training data


Summary

Introduction

The use of distant-talking microphones for automatic speech recognition (ASR) or automatic speaker identification (SID) systems can improve user convenience. Cascade NNs were used, trained using the Cascade2 algorithm with the Resilient Backpropagation (RPROP) weight update algorithm, a variant of the batch backpropagation algorithm. These two most important parts are most likely the reason why the proposed method could generalize and perform remarkably well with a limited amount of stereo data (one or five pairs of utterances, corresponding to less than 1 min of speech). In [29,30], a denoising autoencoder (DAE), which is one of the deep neural network (DNN) approaches, was used to map coefficient vectors from a sequence of reverberant speech to a sequence of clean speech. That work introduced the use of short and long windows. The method proposed in this work maps an N-frame segment to a one-frame segment of log-melspectral coefficients by using cascade NNs and requires only a small amount of training data. The NN is used because it should be able to capture the non-linear relation across frames, which arises because the analysis window (frame) is too short to capture the reverberation effect, and from other complex factors.
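
The N-frame to one-frame mapping described above implies a particular way of assembling training pairs from stereo (reverberant and clean) features. The sketch below illustrates one such arrangement in plain Python. The function name is hypothetical, and aligning the target with the last frame of each segment is an assumption; the text here does not specify the alignment, and the segment-based normalization and the cascade NN training are not shown.

```python
def make_segment_pairs(reverb, clean, n_frames):
    """Pair each N-frame reverberant segment with one clean target frame.

    reverb, clean: time-aligned lists of frames, each frame a list of
    log-melspectral coefficients of equal length.
    Returns (inputs, targets) with one entry per segment.
    """
    if len(reverb) != len(clean):
        raise ValueError("stereo features must have the same number of frames")
    if not 1 <= n_frames <= len(reverb):
        raise ValueError("n_frames must be between 1 and the number of frames")

    inputs, targets = [], []
    for t in range(len(reverb) - n_frames + 1):
        segment = reverb[t:t + n_frames]
        # Flatten the N frames into a single input vector for the network.
        inputs.append([coef for frame in segment for coef in frame])
        # Assumption: the target is the clean frame aligned with the
        # last frame of the reverberant segment.
        targets.append(list(clean[t + n_frames - 1]))
    return inputs, targets


# Toy features: 6 frames with 2 coefficients each (illustrative values only).
reverb = [[t, t + 0.5] for t in range(6)]
clean = [[10 * t, 10 * t + 1] for t in range(6)]
inputs, targets = make_segment_pairs(reverb, clean, n_frames=3)
print(len(inputs), len(inputs[0]), len(targets[0]))  # 4 6 2
```

With T frames and a segment length of N, this yields T - N + 1 training pairs per utterance, which suggests how even a short stereo recording can supply many input-target examples.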

Overview of neural network
The estimation function
Non-causal model
Evaluation using automatic speaker identification system
Method
Evaluation using automatic speech recognition system
Conclusions
