Abstract

High-performance automatic speech recognition (ASR) systems are regularly trained on tens of thousands of hours of annotated speech (i.e., speech paired with correct transcripts). Collecting such an amount of data is prohibitively costly if done manually (i.e., by humans listening to and transcribing audio clips). However, raw speech data (without transcripts) is widely available and easy to collect. This paper proposes an automatic method that uses approximate transcripts of raw speech and an existing ASR system to generate annotations. The method is evaluated in terms of annotation efficiency (i.e., the percentage of the initial raw speech corpus for which it provides annotations) and in terms of the usefulness of the data for further training ASR systems. We show that, although the method produces less data than other methods, the ASR system retrained on the newly created dataset performs significantly better than the baseline. Furthermore, we report ASR results that are 17% to 25% better than those previously reported on Romanian read and spontaneous speech.
