Abstract

Researchers in many countries have developed automatic speech recognition (ASR) systems for their languages as a demonstration of national progress in information and communication technology. This work aims to improve ASR performance for Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. Owing to its locality and pooling operations, a CNN can reduce spectral variation and model the spectral correlations that exist in the speech signal. The impact of these hyperparameters on CNN accuracy in ASR tasks is therefore investigated. A 42-hour data set is used for training, and ASR performance is evaluated on two open test sets: web news and recorded data. Because Myanmar is a syllable-timed language, an ASR system based on syllables is built and compared with one based on words. As a result, the system achieves a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
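
Both metrics reported above are edit-distance-based error rates; WER counts errors over word tokens while SER counts them over syllable tokens. The following is a minimal Python sketch of such a computation (not the authors' evaluation code); the tokenized reference and hypothesis sequences are assumed to be produced elsewhere.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two token sequences."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)]

    def error_rate(ref_tokens, hyp_tokens):
        """Token error rate: WER for word tokens, SER for syllable tokens."""
        return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

    # Hypothetical example: one deletion out of four reference tokens -> 0.25
    print(error_rate(["this", "is", "a", "test"], ["this", "is", "test"]))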

Highlights

  • Automatic speech recognition research has been conducted for more than four decades

  • A second convolutional layer is added on top of the pooling layer, and further experiments investigate the best number of feature maps for this layer (see the sketch after these highlights)

  • Better automatic speech recognition accuracy for Myanmar language is investigated by varying the hyperparameters of a Convolutional Neural Network (CNN)
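
To make the tuned quantities concrete, the following is a minimal sketch of a two-convolutional-layer acoustic model in PyTorch, with the number of feature maps of each convolution ply and the pooling size exposed as constructor arguments. The kernel sizes, default values, output-state count, and the PyTorch framing itself are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn

    class CNNAcousticModel(nn.Module):
        """Two convolution plies with a pooling ply in between; the feature-map
        counts and the pooling size are the hyperparameters investigated."""
        def __init__(self, n_maps1=128, n_maps2=256, pool_size=3, n_states=2000):
            super().__init__()
            self.conv1 = nn.Conv2d(1, n_maps1, kernel_size=(8, 8))        # first convolution ply
            self.pool = nn.MaxPool2d(kernel_size=(pool_size, 1))          # pooling along frequency
            self.conv2 = nn.Conv2d(n_maps1, n_maps2, kernel_size=(4, 4))  # second convolution ply
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(n_maps2, n_states),  # posteriors over acoustic states
            )

        def forward(self, x):                  # x: (batch, 1, frequency, time)
            x = torch.relu(self.conv1(x))
            x = self.pool(x)
            x = torch.relu(self.conv2(x))
            return self.classifier(x)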

Summary

INTRODUCTION

Automatic speech recognition research has been conducted for more than four decades. The purpose of a speech recognition system is to let humans interact conveniently with a computer, robot, or any other machine via speech. In earlier work, multi-CNN acoustic models were combined using a Recognizer Output Voting Error Reduction (ROVER) algorithm for the final speech recognition experiments, and the authors showed that integrating temporal multi-scale features in model training achieved a lower error rate than the best individual system built on a single temporal-scale feature. In [12], Aye Nyein Mon et al. explored the effect of tones on Myanmar speech recognition using a Convolutional Neural Network (CNN). The hyperparameters of the CNN architecture were not changed and the default settings were used for all experiments; the study showed that the CNN model achieves better ASR performance than a Deep Neural Network (DNN) baseline. In this work, CNN hyperparameters such as the number of feature maps and the pooling size are varied to obtain better ASR accuracy for Myanmar language.
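
As a rough illustration of the ROVER combination mentioned above, the sketch below performs per-slot majority voting over hypotheses that are assumed to be already aligned into a common word transition network; full ROVER also performs that alignment and can weight votes by confidence scores. The function and example data are hypothetical.

    from collections import Counter

    NULL = "@"  # placeholder for a system that emits nothing in a slot

    def rover_vote(aligned_hyps):
        """Majority-vote combination of hypotheses already aligned slot by slot."""
        combined = []
        for slot in zip(*aligned_hyps):
            winner, _ = Counter(slot).most_common(1)[0]
            if winner != NULL:
                combined.append(winner)
        return combined

    # Three systems voting over a three-slot alignment.
    hyps = [["this", "is", "speech"],
            ["this", "was", "speech"],
            ["this", "is", NULL]]
    print(rover_vote(hyps))  # ['this', 'is', 'speech']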

MYANMAR LANGUAGE
Convolution Ply
Pooling Ply
EXPERIMENTS
Optimization of CNN Parameters
Number of Feature Maps of First Convolutional Layer
Pooling Size
Number of Feature Maps of Second Convolutional Layer
CONCLUSION
