Abstract

Speech recognition technology has many applications in daily life. However, for many low-resource languages without written forms, acquiring sufficient training data remains a significant challenge for building accurate ASR models. The Tu language, spoken by an ethnic minority group in Qinghai Province, China, is one such example. Because of the lack of written records and the great diversity of regional pronunciations, there has been little prior research on Tu-language speech recognition. This work addresses that gap by creating the first speech dataset for the Tu language spoken in Huzhu County, Qinghai. We first formulated pronunciation rules for the Tu language based on linguistic analysis, and then constructed a new speech corpus, named HZ-TuDs, through targeted data collection and annotation. Based on the HZ-TuDs dataset, we designed several baseline sequence-to-sequence deep neural models for end-to-end Tu-language speech recognition. We further proposed a novel SA-conformer model, which combines convolutional and channel-attention modules to better extract speech features. Experiments showed that the proposed SA-conformer reduces the character error rate from 23% to 12%, substantially improving the accuracy of Tu-language recognition over the baseline approaches. These results demonstrate the effectiveness of our dataset construction and model design in advancing speech recognition for this low-resource minority language.
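The abstract describes the SA-conformer only at a high level, as a conformer whose convolution module is augmented with channel attention. The sketch below is an illustrative assumption of how such a combination could look in PyTorch, not the authors' implementation: the module names (ChannelAttention, ConvBlockWithChannelAttention), the squeeze-and-excitation style gating, the placement of the attention step after the depthwise convolution, and all dimensions are hypothetical choices made for clarity.

```python
# Minimal sketch (assumed design, not the paper's code) of a conformer-style
# convolution block with an added channel-attention step.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gating over feature channels (assumed design)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> pool over time, then gate each channel.
        weights = self.gate(x.mean(dim=-1))      # (batch, channels)
        return x * weights.unsqueeze(-1)         # rescale channels


class ConvBlockWithChannelAttention(nn.Module):
    """Conformer-style convolution block with channel attention inserted after
    the depthwise convolution (placement is an assumption)."""

    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.attention = ChannelAttention(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); residual connection around the whole block.
        y = self.norm(x).transpose(1, 2)                  # (batch, dim, time)
        y = nn.functional.glu(self.pointwise_in(y), dim=1)
        y = self.depthwise(y)
        y = self.attention(y)                             # channel re-weighting
        y = self.act(self.bn(y))
        y = self.pointwise_out(y).transpose(1, 2)
        return x + y


if __name__ == "__main__":
    block = ConvBlockWithChannelAttention(dim=144)
    feats = torch.randn(4, 100, 144)    # (batch, frames, feature dim) -- dummy input
    print(block(feats).shape)           # torch.Size([4, 100, 144])
```

In a full encoder this block would sit inside each conformer layer alongside the feed-forward and self-attention modules; the channel gating lets the model re-weight acoustic feature channels, which is the role the abstract attributes to the channel-attention component.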
