Abstract

End-to-end architectures have shown outstanding performance in speech recognition, but achieving that performance typically requires large amounts of annotated data. Languages with abundant corpora and resources already achieve satisfactory recognition results; for low-resource languages, however, the scarcity of training data remains a bottleneck in building speech recognition systems. This paper presents an approach that uses self-supervised feature extraction and transfer learning to improve acoustic models for low-resource languages. The proposed strategy retrains a base acoustic model, originally trained on resource-rich languages, on a limited amount of low-resource speech data within an end-to-end architecture, yielding an improved acoustic model tailored to the target language. On a Tibetan dataset, the model shows significant improvement, reducing the word error rate on the test set from 13.8% to 10.2%, a relative reduction of 26%.
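The abstract does not specify which self-supervised model or toolkit the authors use, so the following is only a minimal sketch of the general recipe it describes: take a model pretrained with self-supervision on resource-rich speech, attach a fresh end-to-end (CTC) output head sized to the target language's vocabulary, freeze the low-level feature encoder, and fine-tune on a small amount of target-language data. The checkpoint name, the toy character vocabulary, and the dummy batch below are illustrative assumptions, not the paper's actual setup.

```python
import json
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Hypothetical character vocabulary for the target low-resource language;
# in practice it would be derived from the fine-tuning transcripts.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "a": 3, "b": 4}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor, tokenizer=tokenizer
)

# Load a self-supervised model pretrained on resource-rich multilingual
# speech (an assumed stand-in for the paper's base acoustic model) and
# attach a randomly initialized CTC head sized to the new vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
# Freeze the convolutional feature encoder so that only the transformer
# layers and the CTC head adapt to the limited target-language data.
model.freeze_feature_encoder()

# One fine-tuning step on a dummy batch (stand-in for real low-resource
# audio/transcript pairs): 1 second of fake 16 kHz audio.
audio = torch.randn(16_000).numpy()
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
labels = tokenizer("ab a", return_tensors="pt").input_ids
outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # gradients flow only through unfrozen layers
```

Freezing the feature encoder is a common design choice in this setting: the low-level speech representations learned during self-supervised pretraining transfer well across languages, so the scarce target-language data is spent adapting only the higher layers and the new output head.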
