Abstract

The performance of large-vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains challenging because of the difficulty of data collection and the lack of linguistic knowledge for the low-resource language. In this paper, we propose an end-to-end low-resource speech recognition method and validate it using the Tibetan language as an example. We first designed a Tibetan text corpus and recorded a corresponding Tibetan speech corpus. We then extracted the spectrogram of each utterance as its feature. The encoder is either a single structure, a deep convolutional neural network (CNN) based on the VGG network, or a hybrid network combining a deep CNN with a long short-term memory (LSTM) network. A connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between the speech and label sequences. The experimental results show that the single-structure encoder achieved a word error rate of 36.85%, while the hybrid-structure encoder achieved a 6% word error rate reduction compared with the single-structure encoder. Moreover, increasing the number of CNN layers in the encoder further reduced the word error rate.
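The spectrogram feature mentioned above is typically computed with a framed short-time Fourier transform. A minimal sketch is given below; the frame length, hop size, and FFT size are common defaults (25 ms / 10 ms at 16 kHz), not values stated in the abstract.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram via a framed FFT.
    frame_len/hop correspond to 25 ms / 10 ms windows at 16 kHz
    (assumed settings; the paper's exact parameters are not given here)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(mag + 1e-8)  # shape: (n_frames, n_fft // 2 + 1)

# One second of synthetic 16 kHz audio yields a (98, 257) feature matrix.
audio = np.random.randn(16000)
feat = spectrogram(audio)
print(feat.shape)  # (98, 257)
```

Each row of the result is one time frame; the encoder network consumes this time-frequency matrix much like an image.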
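The CTC layer on top of the encoder scores a label sequence by summing over all frame-level alignments that collapse to it. The following is a didactic re-implementation of the CTC forward pass in plain NumPy; in practice one would use a framework's built-in CTC layer, and this sketch is not the paper's code.

```python
import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC.
    log_probs: (T, V) per-frame log posteriors; labels: ints without blanks.
    Implements the standard CTC forward recursion over the label sequence
    extended with blanks: [blank, l1, blank, l2, ..., blank]."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = log_probs.shape[0], len(ext)
    NEG = -np.inf
    alpha = np.full((T, S), NEG)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                 # stay in the same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])     # advance by one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])     # skip a blank
            m = max(cands)
            if m == NEG:
                continue
            lse = m + np.log(sum(np.exp(c - m) for c in cands))
            alpha[t, s] = lse + log_probs[t, ext[s]]
    # A valid path ends in the final blank or the final label.
    m = max(alpha[-1, -1], alpha[-1, -2])
    if m == NEG:
        return np.inf
    total = m + np.log(np.exp(alpha[-1, -1] - m) + np.exp(alpha[-1, -2] - m))
    return -total

# Toy example: one frame, vocabulary {blank, 1}, P(label 1) = 0.8.
lp = np.log(np.array([[0.2, 0.8]]))
print(ctc_loss(lp, [1]))  # -log(0.8) ≈ 0.2231
```

Training minimizes this loss over the encoder outputs, so no frame-level annotation of the speech is needed.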
