Abstract

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of the predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve the cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics, while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation significantly improves ASR performance for clean-condition trained acoustic models.
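To make the least squares smoothing step concrete, the following is a minimal NumPy sketch rather than the paper's exact formulation: it recovers a smooth static trajectory from DNN-predicted static coefficients and their dynamic (delta) features. The central-difference delta window and the equal weighting of the static and dynamic terms are illustrative assumptions.

```python
import numpy as np

def ls_smooth(static_pred, delta_pred):
    """Least squares estimate of a smooth trajectory c minimizing
    ||c - static_pred||^2 + ||D c - delta_pred||^2 per coefficient dim."""
    T = static_pred.shape[0]
    # Central-difference delta operator: delta_t = 0.5 * (c[t+1] - c[t-1]).
    D = np.zeros((T, T))
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    # Stack the static (identity) and dynamic (delta) constraints.
    W = np.vstack([np.eye(T), D])             # (2T, T)
    y = np.vstack([static_pred, delta_pred])  # (2T, F)
    # Closed-form LS solution, solved jointly for all F coefficient dims.
    c, *_ = np.linalg.lstsq(W, y, rcond=None)
    return c                                  # (T, F)
```

Because the delta rows couple neighboring frames, the LS solution trades off fidelity to the per-frame DNN outputs against consistency with the predicted derivatives, which is what smooths the trajectory.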

Highlights

  • Automatic speech recognition (ASR) systems and hands-free speech acquisition systems have achieved satisfactory performance with close-talking microphones

  • In the deep neural network (DNN) based speech coefficient mapping, parallel training data of reverberant and clean speech are used to train the DNN to predict the clean speech coefficients. This mapping approach is applied to both speech enhancement and ASR feature enhancement tasks

  • We propose a least squares (LS) postprocessing step and a sequential cost function that incorporate the constraint of dynamic features to improve the smoothness of the enhanced magnitude spectrum; a sketch of the sequential cost function follows this list
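As referenced in the last highlight, here is a hedged PyTorch sketch of the sequential cost function idea: the training loss penalizes both the static prediction error and the error of its time derivative, so the network learns to output smooth trajectories directly rather than relying on postprocessing. The delta definition and the weight `lam` are assumptions for illustration.

```python
import torch

def delta(x):
    """Central-difference delta along time; x has shape (batch, time, freq)."""
    return 0.5 * (x[:, 2:, :] - x[:, :-2, :])

def sequential_loss(pred, target, lam=1.0):
    """MSE on static coefficients plus an MSE on their deltas, so the
    smoothness constraint is imposed during DNN training instead of as a
    separate postprocessing step."""
    static_err = torch.mean((pred - target) ** 2)
    dynamic_err = torch.mean((delta(pred) - delta(target)) ** 2)
    return static_err + lam * dynamic_err
```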

Summary

Introduction

Automatic speech recognition (ASR) systems and hands-free speech acquisition systems have achieved satisfactory performance with close-talking microphones. While the previous two studies focus on predicting low-dimensional feature vectors for ASR, in [27] deep neural networks (DNN) are used to directly estimate the high-dimensional log-magnitude spectrum for speech denoising. This method was later applied as a preprocessor for a robust ASR task [28]. Although DNN and other neural-network-based speech coefficient mapping approaches have the potential to produce an accurate clean speech estimate, they rely on a representative parallel speech corpus for training the neural networks. To address this limitation, we propose a feature adaptation method that only requires clean speech data during training. In the following sections, we describe the three stages in more detail.
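A minimal PyTorch sketch of the DNN spectrum-mapping idea described above follows; the context width, layer sizes, FFT resolution, and optimizer settings are illustrative assumptions, not the configurations used in [27] or [28].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS = 257   # log-magnitude bins per frame (e.g., a 512-point FFT); assumed
CONTEXT = 11   # number of stacked reverberant input frames; assumed

# Feed-forward network mapping a window of reverberant log-magnitude
# frames to the log-magnitude spectrum of the clean center frame.
model = nn.Sequential(
    nn.Linear(N_BINS * CONTEXT, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N_BINS),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(noisy_ctx, clean_center):
    """One supervised step on a parallel reverberant/clean frame pair."""
    loss = F.mse_loss(model(noisy_ctx), clean_center)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Training such a mapping requires the parallel reverberant/clean corpus noted above, which is precisely the requirement the proposed cross transform adaptation avoids.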

Classic approaches
DNN-based speech coefficients mapping with a dynamic feature constraint
Comparison between cross transform and DNN-based feature compensation
Effect of CMN preprocessing on different evaluation metrics
Findings
Conclusions
