Abstract

This article presents recent improvements in Serbian speech recognition obtained by training robust acoustic models with contemporary deep neural networks based on sequence-discriminative training. More specifically, several variants of a new large vocabulary continuous speech recognition (LVCSR) system are described, all based on the lattice-free version of the maximum mutual information (LF-MMI) training criterion. The parameters of the system were varied to achieve the best possible word error rate (WER) and character error rate (CER), using the largest existing speech database for Serbian and the best general-purpose n-gram language model. In addition to tuning the neural network itself – its number of layers, complexity, layer splicing and more – other language-specific optimizations were explored, such as the use of accent-specific vowel phoneme models and their combination with pitch features. Finally, speech database tuning was tested as well: artificial database expansion by modifying the speech speed of utterances, as well as volume scaling in an attempt to improve speech variability. The results suggest that an 8-layer deep neural network with moderately sized 625-neuron layers works best in the given environment, without the need for speech database augmentation or volume adjustments, and that pitch features combined with the introduction of accented vowel models provide the best performance across all experiments.
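The database expansion described above changes the playback speed of each utterance and rescales its volume. As a minimal sketch of the idea: the perturbation factors 0.9/1.0/1.1 below follow common practice in LF-MMI recipes and are illustrative, since the abstract does not state the exact values used; the linear-interpolation resampler is a simplification of what speech toolkits actually do.

```python
import numpy as np

def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform to change its speed (and pitch) by `factor`.

    factor > 1.0 shortens (speeds up) the utterance, factor < 1.0
    lengthens it. Linear interpolation is used here for simplicity;
    real implementations use a proper polyphase resampler.
    """
    old_idx = np.arange(len(samples))
    new_idx = np.arange(0, len(samples), factor)
    return np.interp(new_idx, old_idx, samples)

def volume_perturb(samples: np.ndarray, scale: float) -> np.ndarray:
    """Scale the amplitude of a waveform by a constant gain factor."""
    return samples * scale

# Triple the corpus with 0.9x / 1.0x / 1.1x speed copies (illustrative factors)
utterance = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s dummy tone
augmented = [speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)]
```

Each speed-perturbed copy is treated as a new training utterance, so a three-factor scheme triples the effective size of the speech database.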

Highlights

  • The results showed that an 8-layer deep neural network with 625-neuron layers works best in the given environment, without the need for speech database augmentation or volume adjustments, and that pitch features combined with the introduction of accented vowel models provide the best performance across all experiments

  • This paper presents an overview of results and improvements in automatic speech recognition with systems trained on the largest Serbian speech database, using an effective contemporary deep neural network (DNN) architecture

  • There have been several previous experiments with different neural-network-based as well as Gaussian mixture model (GMM) based architectures. These were mostly systems trained on smaller speech databases consisting of telephone recordings with limited spectral range, and they were tested on smaller vocabularies [1,2]
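The WER and CER figures referred to above are both normalized edit distances: the Levenshtein distance between the recognized output and the reference, divided by the reference length, computed over words for WER and over characters for CER. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) via DP."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("the cat sat", "the cat sat down")` gives 1/3, since one insertion is made against a three-word reference.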


Introduction

This paper presents an overview of results and improvements in automatic speech recognition with systems trained on the largest Serbian speech database, using an effective contemporary deep neural network (DNN) architecture. There have been several previous experiments with different neural-network-based as well as Gaussian mixture model (GMM) based architectures. These were mostly systems trained on smaller speech databases consisting of telephone recordings with limited spectral range, and they were tested on smaller vocabularies (up to around 14000 words) [1,2]. Those systems were based on the cross-entropy classification criterion. That system took its input alignments from a speaker adaptive training (SAT) stage [4], and used modified stochastic gradient descent (SGD) optimization with parameter averaging [5] to compute the DNN parameter values over a given number of training epochs.
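The parameter-averaging scheme of [5] trains several copies of the model in parallel on disjoint data shards and periodically averages their weights into a single model. The sketch below is a simplified illustration of the averaging step only; the layer names and toy weights are hypothetical, and the actual scheme additionally uses natural-gradient-modified SGD within each worker.

```python
import numpy as np

def average_parameters(worker_params):
    """Average per-worker parameter sets after a round of parallel SGD.

    worker_params: list of dicts mapping parameter name -> weight array.
    Each worker trains on its own data shard; averaging their models
    approximates the update a single large-batch job would have made.
    """
    keys = worker_params[0].keys()
    return {k: np.mean([p[k] for p in worker_params], axis=0) for k in keys}

# Toy round: two workers whose weights have drifted apart during a round
w1 = {"layer1": np.array([1.0, 2.0])}
w2 = {"layer1": np.array([3.0, 4.0])}
avg = average_parameters([w1, w2])
```

After averaging, the merged model is redistributed to all workers and the next round of parallel training begins from it.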

