End-to-end Tibetan Ando dialect speech recognition based on hybrid CTC/attention architecture

Jingwen Sun, Hongwu Yang, Man Wang, Gang Zhou

doi:10.1109/apsipaasc47483.2019.9023130

Abstract

End-to-end automatic speech recognition reduces the difficulty of building a speech recognition system by using a single network architecture. The tokenization, pronunciation dictionary, and phonetic context-dependency trees required in traditional deep learning-based speech recognition are omitted, which simplifies the complex modeling process. This paper proposes a method for Tibetan Ando dialect speech recognition using an end-to-end model based on a hybrid connectionist temporal classification (CTC)/attention architecture. A bidirectional long short-term memory (BLSTM) network is used as the encoder, and it is trained on 83-dimensional acoustic features formed from 80 mel-scale filter-bank coefficients along with pitch features. We compare the proposed method with CTC-only and attention-only models by adjusting the CTC weight of the system. The results show that the hybrid model achieves its highest recognition rate, 64.5%, at a CTC weight of 0.2.
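The role of the CTC weight can be illustrated with a minimal sketch. It assumes the linear interpolation of the two training losses that is commonly used in hybrid CTC/attention systems; the abstract does not state the exact objective, so the function name `hybrid_loss` and the loss values below are hypothetical and serve only as an illustration. A weight of 1.0 corresponds to a CTC-only model and a weight of 0.0 to an attention-only model.

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Interpolate a CTC loss and an attention loss.

    Assumed form (not confirmed by the abstract):
        loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Hypothetical per-utterance loss values, for illustration only.
    example_ctc_loss, example_att_loss = 42.0, 30.0
    # 0.0 = attention only, 1.0 = CTC only, 0.2 = best weight reported in the abstract.
    for weight in (0.0, 0.2, 1.0):
        print(weight, hybrid_loss(example_ctc_loss, example_att_loss, weight))
```

In this sketch, sweeping `ctc_weight` between 0 and 1 mirrors the comparison described in the abstract, where the CTC-only and attention-only systems are the two endpoints of the same hybrid model.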

Full Text