Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise

Martin Wöllmer, Gerhard Rigoll, Felix Weninger, Björn Schuller, Zixing Zhang

doi:10.1109/icassp.2013.6638983

Open Access

Publication Date: May 1, 2013
Citations: 57
License type: other-oa

Affiliation: Technical University of Munich

Abstract

The recognition of spontaneous speech in highly variable noise is known to be a challenge, especially at low signal-to-noise ratios (SNR). In this paper, we investigate the effect of applying bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks to speech feature enhancement in noisy conditions. BLSTM networks tend to outperform conventional neural network architectures whenever the recognition or regression task relies on exploiting temporal context information. We show that BLSTM networks are well suited for mapping from noisy to clean speech features and that the resulting gain in recognition performance is partly complementary to improvements from additional techniques such as speech enhancement by non-negative matrix factorization and probabilistic feature generation by Bottleneck-BLSTM networks. Compared to simple multi-condition training or feature enhancement via standard recurrent neural networks, our BLSTM-based feature enhancement approach leads to remarkable gains in word accuracy in a highly challenging task of recognizing spontaneous speech at SNR levels between -6 and 9 dB.
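To make the idea of mapping noisy to clean features concrete, the following is a minimal NumPy sketch of a bidirectional LSTM regression layer. It is not the authors' implementation: the feature dimension (39), the hidden size, the weight initialization, and all function names are assumptions chosen for illustration, and the network is left untrained. In practice such a model would be trained on pairs of noisy and clean feature sequences, for example by minimizing the squared error between its output and the clean features (also an assumption here).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, n_in, n_hidden):
    # One stacked weight matrix for the four gates: input, forget, candidate, output.
    return {
        "W": rng.normal(0.0, 0.1, (4 * n_hidden, n_in + n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }

def init_blstm(rng, n_feat, n_hidden):
    # Hypothetical parameter layout: two LSTM layers plus a linear output layer.
    return {
        "fwd": init_lstm(rng, n_feat, n_hidden),
        "bwd": init_lstm(rng, n_feat, n_hidden),
        "W_out": rng.normal(0.0, 0.1, (n_feat, 2 * n_hidden)),
        "b_out": np.zeros(n_feat),
    }

def lstm_pass(params, x, reverse=False):
    """Run one LSTM layer over x of shape (T, n_in); return hidden states (T, n_hidden)."""
    n_hidden = params["b"].size // 4
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    out = np.zeros((x.shape[0], n_hidden))
    steps = range(x.shape[0] - 1, -1, -1) if reverse else range(x.shape[0])
    for t in steps:
        z = params["W"] @ np.concatenate([x[t], h]) + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the memory cell
        h = sigmoid(o) * np.tanh(c)                   # expose the gated cell state
        out[t] = h
    return out

def blstm_enhance(params, noisy):
    """Map noisy features (T, n_feat) to enhanced features of the same shape."""
    fwd = lstm_pass(params["fwd"], noisy)                # past context
    bwd = lstm_pass(params["bwd"], noisy, reverse=True)  # future context
    context = np.concatenate([fwd, bwd], axis=1)
    return context @ params["W_out"].T + params["b_out"]  # linear regression output

# Synthetic stand-in data; real inputs would be speech feature vectors.
rng = np.random.default_rng(0)
T, n_feat, n_hidden = 20, 39, 16
clean = rng.normal(size=(T, n_feat))
noisy = clean + 0.5 * rng.normal(size=(T, n_feat))
params = init_blstm(rng, n_feat, n_hidden)
enhanced = blstm_enhance(params, noisy)
print(enhanced.shape, float(np.mean((enhanced - clean) ** 2)))
```

Because the backward layer reads the sequence from the last frame to the first, each enhanced frame depends on both earlier and later noisy frames, which is the temporal context the abstract refers to.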