Abstract

End-to-end architectures have shown outstanding performance in speech recognition, but achieving that performance typically requires large amounts of annotated data. For resource-rich languages with ample corpora, satisfactory recognition results have been achieved; for low-resource languages, however, the scarcity of training data remains a bottleneck in building speech recognition systems. This paper presents an approach that uses self-supervised feature extraction and transfer learning to improve acoustic models for low-resource languages. The proposed strategy retrains a base acoustic model, originally trained on resource-rich languages, with a limited amount of low-resource speech data within an end-to-end architecture, yielding an improved acoustic model tailored to the target low-resource language. On a Tibetan dataset, the model shows a significant improvement, reducing the word error rate on the test set from 13.8% to 10.2%, a relative reduction of 26%.
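
To make the described strategy concrete, the sketch below shows one common way to realize this kind of transfer learning in PyTorch: a self-supervised wav2vec 2.0-style encoder (here the publicly available facebook/wav2vec2-large-xlsr-53 checkpoint, chosen purely for illustration) is reused as the feature extractor, its convolutional front end is frozen, and a new CTC output head is fine-tuned on a small amount of target-language data. The vocabulary size, tensors, and single training step are hypothetical placeholders, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model

class LowResourceCTCModel(nn.Module):
    """Pretrained self-supervised encoder plus a new CTC head for the target language."""
    def __init__(self, vocab_size, pretrained_name="facebook/wav2vec2-large-xlsr-53"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained_name)
        # Freeze the convolutional feature extractor; only the transformer
        # layers and the new output head are adapted on the small corpus.
        for p in self.encoder.feature_extractor.parameters():
            p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size)

    def forward(self, waveforms):
        hidden = self.encoder(waveforms).last_hidden_state   # (batch, frames, hidden)
        return self.head(hidden).log_softmax(dim=-1)         # CTC log-probabilities

# Hypothetical fine-tuning step on one small labelled batch.
model = LowResourceCTCModel(vocab_size=64)        # 64 = placeholder target vocabulary size
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

waveforms = torch.randn(2, 16000)                 # stand-in for 1 s of 16 kHz audio
targets = torch.randint(1, 64, (2, 12))           # stand-in label sequences
log_probs = model(waveforms)                      # (batch, frames, vocab)
input_lengths = torch.full((2,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((2,), 12, dtype=torch.long)

# CTC loss expects (frames, batch, vocab); blank index 0 is a common convention.
loss = F.ctc_loss(log_probs.transpose(0, 1), targets,
                  input_lengths, target_lengths, blank=0)
loss.backward()
optimizer.step()

Freezing the low-level feature extractor while updating the higher transformer layers is a common design choice in this setting: it keeps the limited target-language data from overwriting the general acoustic representations learned from resource-rich speech.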
