Abstract

Connectionist temporal classification (CTC) is a widely used approach for end-to-end speech recognition. The CTC loss can be computed with artificial neural networks such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Recently, self-attention architectures have been proposed in NLP to replace RNNs because of their parallelism. In this paper, we propose a CNN-Self-Attention-DNN CTC architecture, which uses self-attention in place of the RNN and combines it with a CNN and a deep neural network (DNN). We evaluate it on a Mandarin speech dataset of about 170 hours and achieve a 7% absolute character error rate (CER) reduction compared to a deep CNN, with less training time than a Bi-LSTM. Finally, we analyze the test results to identify the factors that affect CER.
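
For concreteness, the following is a minimal PyTorch sketch of such a CNN-to-self-attention-to-DNN pipeline trained with CTC loss. All layer sizes, head counts, feature dimensions, and the vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CNNSelfAttentionDNN(nn.Module):
    """CNN front end -> self-attention encoder -> DNN classifier, trained with CTC."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4, vocab_size=4000):
        super().__init__()
        # Convolutional front end: two stride-2 layers downsample time and frequency by 4x.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), d_model)
        # Self-attention encoder standing in for the recurrent (Bi-LSTM) layers.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=1024, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # DNN output stage; the extra class is the CTC blank symbol.
        self.dnn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, vocab_size + 1),
        )

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel filterbank features.
        x = self.cnn(feats.unsqueeze(1))                # (batch, 32, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten channels x frequency
        x = self.encoder(self.proj(x))
        return self.dnn(x).log_softmax(dim=-1)          # per-frame log-probs for CTC

# Dummy forward/backward pass with CTC loss.
model = CNNSelfAttentionDNN()
feats = torch.randn(8, 400, 80)                         # 8 utterances, 400 frames each
log_probs = model(feats)                                # (8, 100, 4001)
targets = torch.randint(0, 4000, (8, 30))               # dummy character labels
input_lengths = torch.full((8,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)
ctc = nn.CTCLoss(blank=4000)                            # blank = last class index
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```

Because the self-attention encoder processes all frames in parallel, each training step avoids the sequential dependency of an RNN, which is the source of the training-time advantage over the Bi-LSTM noted above.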
