Abstract

End-to-end speech recognition addresses a limitation of traditional speech recognition systems, in which each component is built independently and the models cannot be jointly optimized. It folds the acoustic model, language model, and decoding unit of the hybrid model into a single neural network, which avoids the inherent defects of a multi-module pipeline and greatly reduces the complexity of the speech recognition model. In this research, an Amdo-Tibetan speech recognition system is built on the Listen, Attend and Spell (LAS) model using end-to-end speech recognition technology. It converts an Amdo-Tibetan speech sequence directly into the corresponding character sequence, which greatly reduces the difficulty of building an Amdo-Tibetan speech recognition model. To further improve the performance of the proposed system, four improvements are made: first, the Multi-Head Attention mechanism is introduced to improve the alignment accuracy between the state vectors of the decoder and the encoder; second, label smoothing is adopted to mitigate over-fitting; third, an N-gram language model is combined with the LAS model to increase recognition accuracy, and the maximum mutual information (MMI) criterion is employed for discriminative training; and finally, transfer learning is used to compensate for insufficient training data. Experimental results show that the proposed model can significantly enhance the performance of Amdo-Tibetan speech recognition.
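As an illustration of the label smoothing idea mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes the common variant in which a mass `epsilon` is taken from the true class and spread uniformly over the other classes; the function names and the value `epsilon=0.1` are hypothetical choices, not taken from the paper.

```python
import math


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Build a smoothed target distribution (assumed variant): 1 - epsilon on
    the true class, epsilon shared equally by the remaining classes."""
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if i == target_index else off_value
            for i in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and model log-probabilities."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))


# An over-confident prediction is penalised more under smoothed targets.
logits = [10.0, 0.0, 0.0]
hard_loss = cross_entropy([1.0, 0.0, 0.0], log_softmax(logits))
smooth_loss = cross_entropy(smooth_labels(0, 3), log_softmax(logits))
```

Because the smoothed target never reaches probability 1, the loss keeps discouraging extremely peaked output distributions, which is the regularising effect the abstract refers to.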

Highlights

  • The era of artificial neural network research began when American neurophysiologist Warren McCulloch and mathematician Walter Pitts presented the concept of the artificial neural network and its mathematical model in their joint work in 1943

  • The Multi-Head Attention mechanism is introduced to improve the alignment accuracy between the state vectors of the decoder and the encoder; label smoothing and discriminative training are adopted to optimize the training process of the model; an N-gram language model is combined with the LAS model to increase the accuracy of speech recognition; and transfer learning is used to compensate for insufficient training data

  • Experimental results show that the end-to-end model proposed in this work can significantly improve the performance of Amdo-Tibetan speech recognition
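The highlights state that an N-gram language model is combined with the LAS model but do not say how. One common way to do this is shallow fusion, where each hypothesis is scored as the LAS log-probability plus a weighted language model log-probability. The sketch below assumes that approach; the add-one smoothed bigram model, the weight `lm_weight`, and all function names are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams over token lists, with sentence boundary markers."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)


def bigram_log_prob(tokens, model):
    """Log-probability of a token sequence under add-one smoothed bigrams."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def rescore(hypotheses, model, lm_weight=0.3):
    """Pick the hypothesis maximising las_log_prob + lm_weight * lm_log_prob.

    hypotheses: list of (tokens, las_log_prob) pairs, e.g. from beam search.
    """
    return max(
        hypotheses,
        key=lambda h: h[1] + lm_weight * bigram_log_prob(h[0], model),
    )[0]


lm = train_bigram([["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]])
beam = [(["a", "c", "b"], -1.0), (["a", "b", "c"], -1.2)]
best_without_lm = rescore(beam, lm, lm_weight=0.0)
best_with_lm = rescore(beam, lm, lm_weight=0.3)
```

In this toy beam the language model overturns the acoustic ranking because the second hypothesis follows word orders seen in the training text. In practice the weight would be tuned on held-out data.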

Summary

INTRODUCTION

The era of artificial neural network research began when American neurophysiologist Warren McCulloch and mathematician Walter Pitts presented the concept of the artificial neural network and its mathematical model in their joint work in 1943. Zhao et al. [17] established a Tibetan multi-task recognition framework based on WaveNet-CTC. It performs Tibetan speech recognition, speaker recognition, and dialect recognition simultaneously in a single end-to-end network and achieves better performance than the task-specific models. After analyzing the pronunciation characteristics of Amdo-Tibetan and determining its modeling unit, an efficient end-to-end speech recognition system is proposed based on the Listen, Attend and Spell (LAS) model. It converts a speech sequence directly into the corresponding character sequence, and its training process is much more efficient than that of the traditional model.
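The attention step at the heart of an LAS-style decoder can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: it uses scaled dot-product scoring, treats the encoder states as both keys and values, and forms heads by slicing vectors instead of applying learned projections. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one decoder state over encoder states."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context


def multi_head_attend(query, keys, num_heads):
    """Split vectors into equal slices, attend per slice, concatenate contexts."""
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = dim // num_heads
    all_weights, context = [], []
    for h in range(num_heads):
        lo, hi = h * size, (h + 1) * size
        weights, head_context = attend(query[lo:hi], [k[lo:hi] for k in keys])
        all_weights.append(weights)
        context.extend(head_context)
    return all_weights, context


encoder_states = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
decoder_state = [1.0, 0.0, 0.0, 1.0]
head_weights, context = multi_head_attend(decoder_state, encoder_states, num_heads=2)
```

Each head produces its own alignment over the encoder frames, so different heads can attend to different parts of the input before their context vectors are concatenated.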

CHARACTERISTICS OF TIBETAN PRONUNCIATION
TIBETAN MODELING UNIT
LABEL SMOOTHING REGULARIZATION
EXTERNAL LANGUAGE MODEL
DISCRIMINATIVE TRAINING
SUMMARY OF MODEL COMPLEXITY
EXPERIMENT AND DISCUSSION
DATABASE
The experiments are carried out on three corpora
Findings
CONCLUSION