Abstract

Multitask learning (MTL) helps improve the performance of related tasks when the training dataset is limited and sparse, especially for low-resource languages. Amharic is a low-resource language and suffers from training data scarcity, sparsity, and unevenness. Consequently, speech recognizers built on fundamental acoustic units perform worse than those of technologically favored languages. This paper presents our contributions to the use of various hybrid acoustic modeling units for the Amharic language. Deep neural network (DNN) models based on the fundamental acoustic units, namely syllable, phone, and rounded-phone units, have been developed. Various hybrid acoustic units have been investigated by jointly training the fundamental acoustic units via the MTL technique. These hybrid units and the fundamental units are discussed and compared. The experimental results demonstrate that all the fundamental-unit-based DNN models outperform the Gaussian mixture models (GMMs), with relative performance improvements of 14.14%-23.31%. All the hybrid units outperform the fundamental acoustic units, with relative performance improvements of 1.33%-4.27%. The syllable and phone units exhibit higher performance under sufficient and limited training datasets, respectively, whereas all the hybrid units are useful with both sufficient and limited training datasets and outperform the fundamental units. Overall, our results show that the DNN is an effective acoustic modeling technique for the Amharic language. The context-dependent (CD) syllable is the more suitable unit when a sufficient training corpus is available and recognizer accuracy is prioritized. The CD phone is the superior unit when the available training dataset is limited, offering the highest accuracy and fast recognition speed. The hybrid acoustic units perform best under both sufficient and limited training datasets and achieve the highest accuracy.

Highlights

  • Deep neural networks (DNNs) were introduced into speech recognition research in 2011 as an acoustic modeling technique in the hybrid DNN-hidden Markov model (HMM) framework and as a feature extractor for the tandem Gaussian mixture model (GMM)-HMM and DNN-HMM models

  • This study proposes hybrid acoustic modeling units by jointly training the fundamental acoustic modeling units so that the training data are shared among them via the multitask learning (MTL)-DNN paradigm (see the sketch after this list), in contrast with Tachbelie et al. [29] and Tachbelie et al. [37], in which hybrid units are obtained by manipulating the training dataset via backoff from sparse syllables to phones

  • According to the experimental results presented in Sections V-B.2 and V-B.3, the performances of the CD-syllable-unit-based GMM and DNN models are affected by the sparsity and unevenness of the syllables in the training dataset, in addition to the scarcity of sufficient training corpora
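
The joint-training idea behind the MTL-DNN paradigm can be sketched as follows: several hidden layers are shared across the fundamental acoustic units, and each unit inventory (phone, syllable, rounded phone) gets its own softmax output layer, so every training frame updates the shared layers regardless of which inventory supplies its label. The code below is a minimal illustrative sketch in PyTorch, not the authors' implementation; the class and function names, layer sizes, state counts, and loss weights are assumptions.

```python
import torch.nn as nn

class MTLAcousticDNN(nn.Module):
    """Shared hidden layers with one output head per acoustic-unit inventory (illustrative)."""
    def __init__(self, feat_dim=440, hidden_dim=1024, n_hidden=4,
                 n_phone_states=2000, n_syllable_states=4000, n_rounded_states=2200):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.shared = nn.Sequential(*layers)                           # layers shared by all tasks
        self.phone_head = nn.Linear(hidden_dim, n_phone_states)        # CD-phone states
        self.syllable_head = nn.Linear(hidden_dim, n_syllable_states)  # CD-syllable states
        self.rounded_head = nn.Linear(hidden_dim, n_rounded_states)    # rounded-phone states

    def forward(self, feats):
        h = self.shared(feats)
        return self.phone_head(h), self.syllable_head(h), self.rounded_head(h)

def mtl_loss(model, feats, phone_tgt, syl_tgt, rnd_tgt, weights=(1.0, 1.0, 1.0)):
    """Joint loss: the same frames supervise all three inventories, pooling the training signal."""
    ce = nn.CrossEntropyLoss()
    p, s, r = model(feats)
    return weights[0] * ce(p, phone_tgt) + weights[1] * ce(s, syl_tgt) + weights[2] * ce(r, rnd_tgt)
```

In a setup like this, decoding would typically use only the output head of the chosen unit inventory; the other heads serve to regularize and enrich the shared layers during training, which is where the data-sharing benefit for sparse units comes from.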


Summary

Introduction

Deep neural networks (DNNs) were introduced into speech recognition research in 2011 as an acoustic modeling technique in the hybrid DNN-hidden Markov model (HMM) framework and as a feature extractor for the tandem GMM-HMM and DNN-HMM models. Comparing the two experimental results, the speaker-adapted and speaker-independent DNN models trained using CD syllable units realize absolute performance improvements of 2.36% and 4.9%, respectively.

