Abstract

Automatic speech recognition (ASR) is a fundamental technology in the field of artificial intelligence. End-to-end (E2E) ASR is favored for its state-of-the-art performance. However, E2E speech recognition still suffers from the loss of spatial information in speech and of local information in text, leading to increased deletion and substitution errors during inference. To overcome this challenge, we propose a novel Enhancing Hybrid Architecture with Fast Attention and Capsule Network (termed En-HACN), which models the positional relationships between different acoustic unit features to improve the discriminability of speech features while providing text local information during inference. First, a new CNN-Capsule Network (CNN-Caps) module is proposed to capture the spatial information in the spectrogram through capsule outputs and a dynamic routing mechanism. Then, we design a novel hybrid structure, the LocalGRU Augmented Decoder (LA-decoder), which generates text hidden representations to capture the local information of the target sequences. Finally, we replace self-attention with fast attention in En-HACN, which improves the model's generalization ability and efficiency on long utterances. Experiments on the Aishell-1 and Librispeech corpora demonstrate that En-HACN achieves state-of-the-art performance compared with existing works. In addition, experiments on a long-utterance dataset derived from Aishell-1 (Aishell-1-long) show that our model has high generalization ability and efficiency.
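The dynamic routing mechanism mentioned for the CNN-Caps module is not specified in the abstract; as background, a minimal sketch of standard routing-by-agreement between capsule layers (in the style of Sabour et al.'s capsule networks, not necessarily the exact variant used in En-HACN) could look like the following. All array shapes and the `num_iters` parameter here are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash non-linearity: shrinks the vector norm into [0, 1)
    # while preserving direction, so the norm can act as a probability.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between two capsule layers (illustrative).

    u_hat: prediction vectors from lower capsules,
           shape (num_in, num_out, dim_out).
    Returns the output capsules, shape (num_out, dim_out).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, updated each iteration
    for _ in range(num_iters):
        # Coupling coefficients: softmax over output capsules for each input.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Weighted sum of predictions, then squash to get output capsules.
        s = (c[..., None] * u_hat).sum(axis=0)
        v = squash(s)
        # Increase logits where predictions agree with the current output.
        b = b + np.einsum('ijk,jk->ij', u_hat, v)
    return v
```

The agreement update (dot product between each prediction and the current output capsule) is what lets the routing iteratively concentrate coupling weight on consistent part-whole relationships, which is the property the abstract credits for capturing spatial information in the spectrogram.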
