Low-resource automatic speech recognition (ASR) remains a challenging task. Traditional low-resource ASR methods fail to capture pronunciation variation and lack sufficient phone frame alignment capability. Studies have found that pronunciation variation is mainly reflected in the distribution of resonance peaks of vowels and compound vowels and is particularly prominent in spectrograms. Inspired by this observation, we combine it with deep learning techniques and propose a hybrid acoustic model to address the difficulty of capturing pronunciation variation in low-resource ASR. We introduce a pronunciation difference processing (PDP) block to capture resonance peak variation, and we add an improved GRU network at the back end of the model to strengthen the alignment of phone frame states. We also introduce multi-head attention to combine coarse- and fine-grained features of the audio and spectrogram, highlighting differences in resonance peaks. Finally, we analyze the effect of different structural parameters and encoding positions on the results. Our method was evaluated on the Aidatatang and IBAN datasets. On these two datasets, adding the PDP block reduces WER by 1.84% and 0.26% and SER by 5.2% and 4.3%, respectively, compared with the mainstream baseline model. After further adding the improved GRU, the reductions reach 1.92% and 0.38% WER and 5.6% and 4.4% SER. With multi-head attention also introduced, the reductions reach 2.33% and 0.45% WER and 6.0% and 4.8% SER.
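To make the fusion idea concrete, the following is a minimal sketch, assuming standard PyTorch modules, of how multi-head attention can combine fine-grained frame features with a coarse (downsampled) view before a GRU models frame-state sequences. The module name, dimensions, and pooling factor are illustrative assumptions and do not reproduce the paper's PDP block or its improved GRU.

```python
# Hedged sketch: cross-attention between fine- and coarse-grained spectral
# features, followed by a plain bidirectional GRU for frame-level modelling.
# CoarseFineFusion, d_model, n_heads, n_phones are hypothetical choices.
import torch
import torch.nn as nn

class CoarseFineFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_phones=100):
        super().__init__()
        # Fine-grained frames attend to a coarse view of the same utterance.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gru = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_phones)

    def forward(self, fine, coarse):
        # fine:   (batch, T, d_model)  frame-level spectral features
        # coarse: (batch, T', d_model) temporally pooled features
        fused, _ = self.attn(query=fine, key=coarse, value=coarse)
        fused = fused + fine          # residual keeps fine-grained detail
        out, _ = self.gru(fused)      # sequence model over frame states
        return self.classifier(out)   # per-frame phone logits

x_fine = torch.randn(8, 200, 256)     # e.g. 200 frames per utterance
x_coarse = torch.nn.functional.avg_pool1d(
    x_fine.transpose(1, 2), kernel_size=4).transpose(1, 2)  # 4x pooling
logits = CoarseFineFusion()(x_fine, x_coarse)  # shape (8, 200, 100)
```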