This study addresses the challenges of low accuracy and high computational demands in Tibetan speech recognition by investigating end-to-end networks. We propose a decoding strategy that integrates Connectionist Temporal Classification (CTC) and Attention mechanisms, capitalizing on CTC's automatic alignment and the attention mechanism's learned weighting. The Conformer architecture serves as the encoder, yielding the Conformer-CTC/Attention model. The model first extracts global features from the speech signal with the Conformer encoder and then decodes these features jointly through the CTC and Attention branches. To mitigate convergence issues during training, particularly with longer input feature sequences, we introduce a Probabilistic Sparse Attention mechanism within the joint CTC/Attention framework. We also apply a maximum entropy optimization algorithm to CTC, addressing challenges such as increased path counts, spike distributions, and local optima during training. We designate the proposed method the MaxEnt-Optimized Probabilistic Sparse Attention Conformer-CTC/Attention model (MPSA-Conformer-CTC/Attention). Experimental results show that the improved model achieves word error rate reductions of 10.68% and 9.57% on self-constructed and open-source Tibetan datasets, respectively, compared to the baseline model. The enhanced model also reduces memory consumption and training time while improving generalization capability and accuracy.
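For readers unfamiliar with joint CTC/Attention training, the sketch below illustrates the standard weighted multi-task objective on which hybrid CTC/Attention models are typically trained. It is a minimal PyTorch illustration, not the authors' implementation; the interpolation weight `ctc_weight`, the function name, and all tensor shapes are assumptions for the example.

```python
# Minimal sketch (NOT the authors' implementation) of the weighted joint
# CTC/Attention training objective used in hybrid CTC/Attention ASR models:
#   L = lambda * L_CTC + (1 - lambda) * L_Attention
# All names, shapes, and the value of `ctc_weight` are illustrative assumptions.
import torch
import torch.nn as nn

ctc_weight = 0.3  # lambda: assumed interpolation weight between the two losses
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padded decoder targets


def joint_ctc_attention_loss(encoder_log_probs, decoder_logits,
                             targets, input_lengths, target_lengths,
                             padded_targets):
    """Weighted multi-task loss over the CTC branch and the attention decoder."""
    # encoder_log_probs: (T, N, C) log-softmax outputs over the encoder frames
    loss_ctc = ctc_loss_fn(encoder_log_probs, targets,
                           input_lengths, target_lengths)
    # decoder_logits: (N, L, C) attention-decoder logits; padded_targets: (N, L)
    loss_att = att_loss_fn(decoder_logits.transpose(1, 2), padded_targets)
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


# Toy usage with random tensors (T=50 frames, N=2 utterances, C=30 symbols, L=8 labels)
T, N, C, L = 50, 2, 30, 8
enc = torch.randn(T, N, C).log_softmax(dim=-1)
dec = torch.randn(N, L, C)
tgt = torch.randint(1, C, (N, L))  # dense CTC targets (label 0 is the blank)
loss = joint_ctc_attention_loss(enc, dec, tgt,
                                input_lengths=torch.full((N,), T),
                                target_lengths=torch.full((N,), L),
                                padded_targets=tgt)
```

The usual rationale for this combination, which motivates the model described above, is that the CTC branch enforces a monotonic input-output alignment while the attention decoder captures label dependencies, so the weighted sum regularizes training without sacrificing decoder flexibility.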