Abstract

Current automatic speech recognition (ASR) systems achieve 90–95% accuracy, depending on the methodology applied and the datasets used. However, accuracy drops significantly when the same ASR system is used by a non-native speaker of the language being recognized. At the same time, labeled datasets of non-native speech are extremely limited, both in size and in the number of languages covered. This scarcity makes it difficult to train sufficiently accurate ASR systems targeted at non-native speakers, which calls for a different approach, one that can exploit the vast amounts of available unlabeled data. In this paper, we address this issue by employing dual supervised learning (DSL) and reinforcement learning with a policy gradient methodology. We tested DSL in a warm-start approach, with both models trained beforehand, and in a semi-warm-start approach with only one of the two models pre-trained. The experiments were conducted on English spoken by Japanese and Polish speakers. The results of our experiments show that ASR systems built with DSL can achieve accuracy comparable to traditional methods while making use of unlabeled data, which is much cheaper to obtain and available in far larger quantities.

Highlights

  • Speech recognition has been the subject of extensive research since the second half of the previous century

  • The scores show the best accuracy of the MSTT model that we managed to obtain during the training process

  • The learning process tries to maximize the long-term reward associated with log P(t | TTS(t); MSTT); in such an event, the MSTT (speech-to-text) model learns the word t as a label for an incorrect text-to-speech sample TTS(t)
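The round-trip reward in the last highlight can be illustrated with a toy sketch. In the example below, `toy_tts` and `toy_stt_prob` are hypothetical stand-ins for the paper's TTS and MSTT models, and a numeric-gradient ascent step stands in for the actual policy-gradient (REINFORCE) update; this is a minimal sketch of the idea, not the authors' implementation.

```python
import math

# Toy round trip: text t -> TTS(t) -> P(t | TTS(t); M_STT).
# `toy_tts` and `toy_stt_prob` are hypothetical stand-ins,
# not the paper's actual models.
VOCAB = ["hello", "world"]

def toy_tts(word):
    # Pretend synthesis: map a word to a 2-dimensional "audio" feature.
    return [float(len(word)), float(ord(word[0])) / 100.0]

def toy_stt_prob(audio, word, weights):
    # Softmax over one linear score per vocabulary word.
    scores = {w: weights[w][0] * audio[0] + weights[w][1] * audio[1]
              for w in VOCAB}
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    return exps[word] / sum(exps.values())

def dual_reward(word, weights):
    # The long-term reward from the highlight: log P(t | TTS(t); M_STT).
    return math.log(toy_stt_prob(toy_tts(word), word, weights))

def policy_gradient_step(word, weights, lr=0.05, eps=1e-5):
    # Numeric-gradient ascent on the reward; a simple stand-in for the
    # policy-gradient (REINFORCE) update.
    new_weights = {w: list(v) for w, v in weights.items()}
    for w in VOCAB:
        for i in range(2):
            weights[w][i] += eps
            up = dual_reward(word, weights)
            weights[w][i] -= 2 * eps
            down = dual_reward(word, weights)
            weights[w][i] += eps  # restore the original weight
            new_weights[w][i] += lr * (up - down) / (2 * eps)
    return new_weights
```

Starting from uniform weights, a single ascent step on the reward for a given word raises log P(t | TTS(t)) above its initial value of log 0.5, which is the direction of improvement the highlight describes.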



Introduction

Speech recognition has been the subject of extensive research since the second half of the previous century. The speech recognition techniques and methodologies developed recently can reach up to 90–95% accuracy, depending on the dataset and benchmark test used [1]. Such accuracy levels are reached only when the system recognizes the speech of native speakers (e.g., English spoken by North Americans). For non-native speakers, even the most advanced speech recognition systems achieve an accuracy of only 50–60%. The main reason for such a drop is that non-native speakers have a mother tongue different from the language being recognized.


