Abstract

Deep neural networks (DNNs) play a significant role in acoustic modeling. Convolutional neural networks (CNNs), a more advanced architecture, achieve a 4–12% relative gain in word error rate (WER) over DNNs. The spectral variations and local correlations present in the speech signal make CNNs well suited to speech recognition. Recently, bidirectional long short-term memory (BLSTM) networks have been shown to produce higher recognition rates in acoustic modeling because they can build higher-level representations of acoustic data. Since both the spatial and the temporal properties of the speech signal are essential for a high recognition rate, combining the two networks is a natural idea. In this paper, a hybrid CNN-BLSTM architecture is proposed to exploit both properties and to improve continuous speech recognition. We further explore design choices for the CNN, such as weight sharing, the appropriate number of hidden units, and the ideal pooling strategy, and we examine how many BLSTM layers are effective. The paper also addresses a shortcoming of CNNs: speaker-adapted features cannot be modeled directly in a CNN. Finally, various non-linearities, with and without dropout, are analyzed for speech tasks. Experiments indicate that the proposed hybrid architecture with speaker-adapted features and maxout non-linearity with dropout achieves 5.8% and 10% relative reductions in WER over the CNN and DNN systems, respectively.

Highlights

  • Deep neural network (DNN)-based acoustic models have almost displaced Gaussian mixture models (GMMs) from automatic speech recognition (ASR) systems [18]

  • The convolutional neural network (CNN)-bidirectional long short-term memory (BLSTM) hybrid, trained with speaker adaptation and maxout + dropout, achieves 24.25%, 10%, and 5.8% relative improvements over the Gaussian-based hidden Markov model (HMM), DNN, and CNN systems, respectively

  • A more powerful hybrid acoustic model is proposed by including the advantages of CNN, BLSTM, and fully connected layers


Introduction

Deep neural network (DNN)-based acoustic models have almost displaced Gaussian mixture models (GMMs) from automatic speech recognition (ASR) systems [18]. Convolutional neural networks (CNNs) successfully model the structural locality of the feature space [24]. They reduce translational variance and handle disturbances and small shifts in the feature space by pooling over local frequency regions. Recurrent neural networks (RNNs) can utilize the long-time dependencies among speech frames by exploiting prior knowledge of the speech signal, but the vanishing and exploding gradient problems limit their ability to learn such dependencies [3]. To tackle these problems, long short-term memory (LSTM) networks were introduced, which control the flow of information through a special unit called the memory block [31]. A special architecture that processes the input sequence in both directions before making decisions, called bidirectional LSTM (BLSTM), followed [33].
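The LSTM gating and bidirectional processing described above can be sketched with the standard LSTM recurrence. The following NumPy code is a minimal illustration of a single BLSTM layer, not the paper's implementation; the function names, gate ordering, and small random parameters are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The memory block controls information flow
    through input (i), forget (f), and output (o) gates.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g       # updated cell (memory) state
    h = o * np.tanh(c)           # hidden state exposed to the next layer
    return h, c

def bilstm(xs, params_fwd, params_bwd, H):
    """Run one LSTM over the sequence left-to-right and another
    right-to-left, then concatenate the per-frame hidden states."""
    def run(seq, params):
        W, U, b = params
        h, c = np.zeros(H), np.zeros(H)
        outs = []
        for x in seq:
            h, c = lstm_step(x, h, c, W, U, b)
            outs.append(h)
        return outs
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]  # re-align backward pass with time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output frame then carries context from both past and future frames, which is why BLSTM layers are a natural fit on top of the CNN's locally pooled features.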
