End-to-End Amdo-Tibetan Speech Recognition Based on Knowledge Transfer

Xiaojun Zhu, Heming Huang

doi:10.1109/access.2020.3023783

Open Access

https://doi.org/10.1109/access.2020.3023783

Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 7
License type: CC BY 4.0

Affiliation: Lanzhou City University, Qinghai Normal University

Abstract

End-to-end speech recognition addresses a limitation of the traditional speech recognition pipeline, in which each component is independent and the models cannot be jointly optimized. It merges the acoustic model, language model, and decoding unit of the hybrid model into a single neural network, which avoids the inherent defects of a multi-module design and greatly reduces the complexity of the speech recognition model. In this research, an Amdo-Tibetan speech recognition system is built on the Listen, Attend and Spell (LAS) model using the end-to-end approach. It converts an Amdo-Tibetan speech sequence directly into the corresponding character sequence, which greatly reduces the difficulty of building an Amdo-Tibetan speech recognition model. To further improve the performance of the proposed system, four improvements are made: first, a Multi-Head Attention mechanism is introduced to improve the alignment accuracy between the state vectors of the decoder and the encoder; second, label smoothing is adopted to mitigate over-fitting; third, an N-gram language model is combined with the LAS model to increase recognition accuracy, and the maximum mutual information (MMI) criterion is employed for discriminative training; finally, transfer learning is used to compensate for insufficient training data. Experimental results show that the proposed model significantly improves the performance of Amdo-Tibetan speech recognition.

Highlights

The era of artificial neural network research began in 1943, when American neurophysiologist Warren McCulloch and mathematician Walter Pitts presented the concept of the artificial neural network and its mathematical model in their joint work
A Multi-Head Attention mechanism is introduced to improve the alignment accuracy between the state vectors of the decoder and the encoder; label smoothing and discriminative training are adopted to optimize the training process; an N-gram language model is combined with the LAS model to increase recognition accuracy; and transfer learning is used to compensate for insufficient training data
Experimental results show that the end-to-end model proposed in this work can significantly improve the performance of Amdo-Tibetan speech recognition

Summary

INTRODUCTION

The era of artificial neural network research began in 1943, when American neurophysiologist Warren McCulloch and mathematician Walter Pitts presented the concept of the artificial neural network and its mathematical model in their joint work. Zhao et al. [17] establish a Tibetan multi-task recognition framework based on WaveNet-CTC. It performs Tibetan speech recognition, speaker recognition, and dialect recognition simultaneously in a single end-to-end network and achieves better performance than the task-specific models. After analyzing the pronunciation characteristics of Amdo-Tibetan and determining its modeling unit, this work proposes an efficient end-to-end speech recognition system based on the Listen, Attend and Spell (LAS) model. It converts a speech sequence directly into the corresponding character sequence, and its training process is much more efficient than that of the traditional model.

CHARACTERISTICS OF TIBETAN PRONUNCIATION

TIBETAN MODELING UNIT

LABEL SMOOTHING REGULARIZATION
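
The abstract states that label smoothing is adopted to counter over-fitting, but this page does not give the formulation used in the paper. The following is a minimal sketch of the common uniform variant, assuming a smoothing weight `epsilon` of 0.1 and a toy four-character vocabulary; the function names and all values are illustrative and are not taken from the paper.

```python
import math


def smoothed_targets(true_index, vocab_size, epsilon=0.1):
    """Build a smoothed target distribution (uniform variant, an assumption).

    Every class receives epsilon / vocab_size, and the true class
    additionally receives the remaining 1 - epsilon.
    """
    uniform = epsilon / vocab_size
    targets = [uniform] * vocab_size
    targets[true_index] += 1.0 - epsilon
    return targets


def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)


def label_smoothing_loss(probs, true_index, epsilon=0.1):
    """Loss for one decoder step, given the predicted character probabilities."""
    targets = smoothed_targets(true_index, len(probs), epsilon)
    return cross_entropy(targets, probs)


# Toy decoder output over a 4-character vocabulary (hypothetical values).
step_probs = [0.05, 0.85, 0.05, 0.05]
hard_loss = label_smoothing_loss(step_probs, true_index=1, epsilon=0.0)
soft_loss = label_smoothing_loss(step_probs, true_index=1, epsilon=0.1)
```

With `epsilon=0.0` the loss reduces to the ordinary negative log-likelihood of the true character. With a positive `epsilon`, a small share of the target mass is spread over the other characters, which penalizes over-confident predictions.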

EXTERNAL LANGUAGE MODEL
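
The abstract says an N-gram language model is combined with the LAS model, but this page does not say how the two scores are merged. One widely used option is shallow fusion, in which the language model log-probability is added to the LAS log-probability with an interpolation weight when ranking hypotheses. The sketch below assumes that scheme; the weight `lm_weight`, the function names, and the hypothesis scores are hypothetical.

```python
def fuse_scores(las_log_prob, lm_log_prob, lm_weight=0.3):
    """Shallow fusion (assumed scheme): LAS score plus weighted LM score."""
    return las_log_prob + lm_weight * lm_log_prob


def rescore(hypotheses, lm_weight=0.3):
    """Return the text of the best hypothesis after fusion.

    hypotheses: list of (text, las_log_prob, lm_log_prob) tuples,
    for example the n-best list produced by beam search.
    """
    best = max(hypotheses, key=lambda h: fuse_scores(h[1], h[2], lm_weight))
    return best[0]


# Hypothetical 2-best list: the LAS model slightly prefers "hyp_a",
# while the N-gram model finds "hyp_b" much more plausible.
n_best = [
    ("hyp_a", -1.0, -5.0),
    ("hyp_b", -1.2, -2.0),
]
without_lm = rescore(n_best, lm_weight=0.0)
with_lm = rescore(n_best, lm_weight=0.3)
```

In this toy case the ranking flips once the language model is taken into account: `without_lm` is `"hyp_a"` and `with_lm` is `"hyp_b"`. In practice the weight would be tuned on a development set.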

DISCRIMINATIVE TRAINING

SUMMARY OF MODEL COMPLEXITY

EXPERIMENT AND DISCUSSION

DATABASE

The experiments are carried out on three corpora.

Findings

CONCLUSION
