Abstract

As an endangered language, Tujia is a non-renewable intangible cultural resource. Automatic speech recognition (ASR) applies artificial intelligence to assist the manual labeling of Tujia speech and is thus an effective means of protecting the language. However, Tujia has few native speakers, little labeled corpus, and considerable noise in the available recordings, so acoustic models suffer from overfitting and low noise immunity, which seriously harms recognition accuracy. To address these deficiencies, an audio-visual speech recognition (AVSR) approach based on Transformer-CTC is proposed, which reduces the acoustic model's dependence on data quantity and its sensitivity to noise by introducing visual modality information, including lip movements. Specifically, the new approach enriches the representation of the speaker's feature space through the fusion of audio and visual information, thereby alleviating the problem of limited information available from a single modality. Experimental results show that the optimal CER of AVSR is 8.2% lower than that of traditional models and 11.8% lower than that of lip reading. The proposed AVSR tackles the issue of low accuracy in recognizing endangered languages and is therefore of great significance for the protection and preservation of endangered languages through AI.
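To make the fusion idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it shows frame-level (early) fusion, one common way to combine audio and visual lip features before a Transformer encoder with a CTC head. The feature dimensions, frame rates, and nearest-neighbour upsampling below are all assumptions introduced for illustration.

```python
import numpy as np

def fuse_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio features (e.g. filterbanks at ~100 fps)
    with visual lip features (e.g. ~25 fps), after repeating each visual
    frame so both modalities share the audio frame rate.

    Illustrative only; the paper's exact fusion scheme is not specified here.
    """
    t_audio = audio_feats.shape[0]
    # Nearest-neighbour upsampling of the slower visual stream to t_audio frames.
    idx = np.minimum(
        (np.arange(t_audio) * visual_feats.shape[0]) // t_audio,
        visual_feats.shape[0] - 1,
    )
    visual_up = visual_feats[idx]
    # Early fusion: concatenate along the feature axis.
    return np.concatenate([audio_feats, visual_up], axis=1)

# Toy example: 100 audio frames x 80 dims, 25 video frames x 64 dims.
audio = np.random.randn(100, 80)
video = np.random.randn(25, 64)
fused = fuse_audio_visual(audio, video)
print(fused.shape)  # (100, 144): fused frames fed to a Transformer-CTC recognizer
```

The fused sequence would then be consumed by a Transformer encoder trained with the CTC loss, so that the model can fall back on lip-movement cues when the audio channel is noisy.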
