Abstract

Low-resource automatic speech recognition (ASR) remains a challenging task. Traditional low-resource ASR methods fail to capture pronunciation variations and have insufficient phone-frame alignment capability. Some studies have found that pronunciation variations are mainly reflected in the distribution of resonance peaks for vowels and compound vowels, and that they are particularly prominent in spectrograms. Inspired by this finding, we combine it with deep learning techniques and propose a hybrid acoustic model to address the difficulty of capturing pronunciation variation in low-resource ASR. We introduce a pronunciation difference processing (PDP) block to capture resonance peak variations, and we add an improved GRU network at the back end of the model to enhance the alignment of phone frame states. We also introduce multi-head attention to combine coarse-grained and fine-grained features of the audio and spectrum and to highlight differences in resonance peaks. Finally, we analyze the effect of different structure parameters and coding positions on the results. Our method was evaluated on the Aidatatang and IBAN datasets. Compared with the mainstream baseline model, adding the PDP block reduces WER by 1.84% and 0.26% and SER by 5.2% and 4.3% on the two datasets, respectively. After the improved GRU is added, the reductions are 1.92% and 0.38% in WER and 5.6% and 4.4% in SER. After multi-head attention is introduced, the reductions are 2.33% and 0.45% in WER and 6.0% and 4.8% in SER.
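
The results above are reported as WER and SER. As a rough illustration of how such metrics are commonly computed, the following minimal sketch assumes that WER is the word-level edit distance divided by the number of reference words and that SER is the fraction of sentences containing at least one error. The abstract does not define either metric, so both definitions, the whitespace tokenization, and the function names `edit_distance`, `wer`, and `ser` are assumptions made here for illustration, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs, hyps):
    """Word error rate over a corpus (assumed whitespace tokenization)."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def ser(refs, hyps):
    """Sentence error rate: share of sentences with at least one word error."""
    wrong = sum(r.split() != h.split() for r, h in zip(refs, hyps))
    return wrong / len(refs)


# Hypothetical toy data, not taken from the Aidatatang or IBAN datasets.
refs = ["the cat sat", "hello world"]
hyps = ["the cat sit", "hello world"]
print(wer(refs, hyps), ser(refs, hyps))
```

For a language written without spaces between words, the tokenization step would need to change (for example, to characters), which is another reason to treat this only as a sketch.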
