Abstract

The rise of end-to-end (E2E) speech recognition in recent years has overturned the classical design of cascading multiple subtasks and achieved a direct mapping from the input speech signal to text labels. In this study, a new E2E framework, ResNet–GAU–CTC, is proposed for Mandarin speech recognition in air traffic control (ATC). A deep residual network (ResNet) exploits the translation invariance and local correlation of convolutional neural networks (CNNs) to extract time-frequency information from the speech signal. A gated attention unit (GAU) uses a gated single-head attention mechanism to better capture long-range dependencies in the sequence, yielding a larger receptive field, richer contextual information, and faster training convergence. The connectionist temporal classification (CTC) criterion eliminates the need for forced frame-level alignments. To address the scarcity of data resources and the distinctive pronunciation norms and contexts of the ATC domain, transfer learning and data augmentation were applied to enhance the robustness of the network and improve the generalization ability of the model. The character error rate (CER) of our model was 11.1% on the expanded Aishell corpus and decreased to 8.0% on the ATC corpus.
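
The pipeline summarized above can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: a small two-layer convolutional frontend stands in for the deep ResNet, the GAU block is a simplified single-head squared-ReLU variant, and all layer sizes and the token-set size (4,232 Mandarin characters plus a CTC blank) are assumptions made for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GAU(nn.Module):
        """Simplified gated attention unit: gated single-head attention over the sequence."""
        def __init__(self, dim, expansion=2, query_dim=128):
            super().__init__()
            hidden = dim * expansion
            self.norm = nn.LayerNorm(dim)
            self.to_uv = nn.Linear(dim, 2 * hidden)        # gating (u) and value (v) branches
            self.to_qk = nn.Linear(dim, query_dim)         # shared low-dimensional base for q and k
            self.q_scale = nn.Parameter(torch.ones(query_dim))
            self.k_scale = nn.Parameter(torch.ones(query_dim))
            self.out = nn.Linear(hidden, dim)

        def forward(self, x):                              # x: (batch, time, dim)
            n = x.shape[1]
            h = self.norm(x)
            u, v = self.to_uv(h).chunk(2, dim=-1)
            u, v = F.silu(u), F.silu(v)
            base = F.silu(self.to_qk(h))
            q, k = base * self.q_scale, base * self.k_scale
            attn = F.relu(q @ k.transpose(-1, -2) / n) ** 2   # single-head squared-ReLU attention
            return x + self.out(u * (attn @ v))               # gated output plus residual connection

    class ResNetGAUCTC(nn.Module):
        """Toy conv frontend (standing in for the ResNet) + GAU encoder + CTC output head."""
        def __init__(self, n_mels=80, dim=256, n_layers=4, n_tokens=4232):
            super().__init__()
            self.frontend = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(32 * (n_mels // 4), dim)
            self.encoder = nn.Sequential(*[GAU(dim) for _ in range(n_layers)])
            self.head = nn.Linear(dim, n_tokens)           # token set includes the CTC blank (index 0)

        def forward(self, mels):                           # mels: (batch, time, n_mels)
            x = self.frontend(mels.unsqueeze(1))           # (batch, 32, time/4, n_mels/4)
            x = x.permute(0, 2, 1, 3).flatten(2)           # (batch, time/4, 32 * n_mels/4)
            x = self.encoder(self.proj(x))
            return self.head(x).log_softmax(-1)            # frame-level log-probabilities for CTC

    # CTC training step: only target character sequences are needed, no frame-level alignment.
    model = ResNetGAUCTC()
    mels = torch.randn(2, 200, 80)                         # two utterances of 200 frames, 80 mel bins
    log_probs = model(mels)                                # (2, 50, 4232)
    targets = torch.randint(1, 4232, (2, 12))              # dummy character labels (0 is the blank)
    loss = F.ctc_loss(log_probs.transpose(0, 1),           # CTC expects (time, batch, classes)
                      targets,
                      input_lengths=torch.full((2,), 50, dtype=torch.long),
                      target_lengths=torch.full((2,), 12, dtype=torch.long),
                      blank=0)

In this sketch the frontend downsamples the spectrogram four-fold in time before the GAU layers, and CTC is applied directly to the per-frame character distribution, mirroring the alignment-free training described in the abstract.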
