Abstract

Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition tasks. Among these networks, the convolutional neural network (CNN) is effective at representing the local properties of speech formants. However, the CNN is not well suited to modeling long-term context dependencies between speech signal frames. Recurrent neural networks (RNNs) have recently shown strong abilities for modeling such long-term dependencies. However, RNNs perform poorly on low-resource speech recognition tasks, sometimes even worse than conventional feed-forward neural networks, and they often overfit severely on the training corpus. This paper presents our contributions toward combining the CNN and the conventional RNN with gate, highway, and residual connections to mitigate these problems. The optimal neural network structures and training strategies for the proposed models are explored. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited language packages (10 h) of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for building better-performing acoustic models for low-resource speech recognition tasks.

Highlights

  • Neural network-based deep learning techniques have been the state-of-the-art acoustic modeling approaches since 2011, replacing the conventional Gaussian mixture model (GMM) technique; various researchers have applied them in either hybrid or end-to-end acoustic modeling for developing speech recognition systems

  • We describe the effectiveness of our proposed neural network acoustic models, detailed in Section 3, for low-resource-language speech recognition systems

  • Deep neural network (DNN), convolutional neural network (CNN), conventional bidirectional RNN (BRNN), and bidirectional gated recurrent unit (BGRU) models were developed as baselines against which to compare the proposed advanced neural network acoustic models


Summary

Introduction

Neural network-based deep learning techniques have been the state-of-the-art acoustic modeling approaches since 2011, replacing the conventional Gaussian mixture model (GMM) technique; various researchers have applied them in either hybrid or end-to-end acoustic modeling for developing speech recognition systems (Computers 2020, 9, 36). Deep recurrent neural networks, including the conventional recurrent neural network (RNN) [17], long short-term memory (LSTM) [18,19,20,21,22], and gated recurrent unit (GRU) [23,24,25,26], have been examined for both low- and high-resource-language speech recognition tasks. These neural network models have strengths and weaknesses. The DNN has good discriminative power for classifying features into the target classes, but it has the following limitations: (1) it does not have structures that explicitly model prior knowledge of the speech signal, such as the local properties within speech frames and the long-term dependencies among them. Even if model size regularization methods reduce the shortcomings of the DNN to some extent, its total number of parameters is large, which makes it difficult to train an optimal model for low-resource speech recognition systems.
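To make the gate, highway, and residual connections mentioned above concrete: a highway connection blends a layer's transformed output H(x) with its raw input x through a learned gate T(x), giving y = T(x)·H(x) + (1 − T(x))·x, while a residual connection simply adds the input back, y = H(x) + x. The following NumPy sketch is illustrative only; the weight values, activation choices, and dimensions are our own assumptions for demonstration, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Highway connection: the transform gate T decides how much of the
    transformed signal H(x) vs. the raw input x passes through."""
    H = np.tanh(x @ W_h + b_h)       # transformed representation
    T = sigmoid(x @ W_t + b_t)       # transform gate in (0, 1)
    return T * H + (1.0 - T) * x     # carry gate is (1 - T)

def residual_layer(x, W, b):
    """Residual connection: the input is added to the layer output,
    easing gradient flow through deep stacks."""
    return np.tanh(x @ W + b) + x

# Arbitrary demonstration values (hypothetical, not from the paper).
rng = np.random.default_rng(0)
d = 8                                 # assumed feature dimension
x = rng.standard_normal((1, d))
W_h = rng.standard_normal((d, d)) * 0.1
W_t = rng.standard_normal((d, d)) * 0.1
W   = rng.standard_normal((d, d)) * 0.1
b_h = np.zeros(d); b_t = np.zeros(d); b = np.zeros(d)

y_hw  = highway_layer(x, W_h, b_h, W_t, b_t)
y_res = residual_layer(x, W, b)
assert y_hw.shape == x.shape and y_res.shape == x.shape
```

A useful property of the highway form is that when the transform gate saturates toward zero, the layer reduces to the identity mapping, which is one reason such connections ease the training of very deep acoustic models.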

