Abstract

Connectionist temporal classification (CTC) is a widely used approach for end-to-end speech recognition. The CTC loss can be computed with artificial neural networks such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Recently, self-attention architectures have been proposed in NLP to replace RNNs because of their parallelism. In this paper, we propose a CNN-Self-Attention-DNN CTC architecture, which uses self-attention in place of the RNN and combines it with a CNN and a deep neural network (DNN). We evaluate it on a Mandarin speech dataset of about 170 hours and achieve a 7% absolute character error rate (CER) reduction compared to a deep CNN, with less training time than a Bi-LSTM. Finally, we analyze the test results to identify the factors that affect CER.
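
For concreteness, the following is a minimal PyTorch sketch of such a CNN-to-self-attention-to-DNN pipeline trained with CTC loss. All layer sizes, head counts, feature dimensions, and the vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CNNSelfAttentionDNN(nn.Module):
    """CNN front end -> self-attention encoder -> DNN classifier, trained with CTC."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4, vocab_size=4000):
        super().__init__()
        # Convolutional front end: two stride-2 layers downsample time and frequency by 4x.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), d_model)
        # Self-attention encoder standing in for the recurrent (Bi-LSTM) layers.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=1024, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # DNN output stage; the extra class is the CTC blank symbol.
        self.dnn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, vocab_size + 1),
        )

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel filterbank features.
        x = self.cnn(feats.unsqueeze(1))                # (batch, 32, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten channels x frequency
        x = self.encoder(self.proj(x))
        return self.dnn(x).log_softmax(dim=-1)          # per-frame log-probs for CTC

# Dummy forward/backward pass with CTC loss.
model = CNNSelfAttentionDNN()
feats = torch.randn(8, 400, 80)                         # 8 utterances, 400 frames each
log_probs = model(feats)                                # (8, 100, 4001)
targets = torch.randint(0, 4000, (8, 30))               # dummy character labels
input_lengths = torch.full((8,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)
ctc = nn.CTCLoss(blank=4000)                            # blank = last class index
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```

Because the self-attention encoder processes all frames in parallel, each training step avoids the sequential dependency of an RNN, which is the source of the training-time advantage over the Bi-LSTM noted above.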
