Abstract

As the output quality of neural networks for automatic speech recognition (ASR) and text-to-speech (TTS) continues to improve, new opportunities are emerging to train models in a weakly supervised fashion, minimizing the manual effort required to annotate new audio data for supervised training. While weak supervision has recently shown very promising results for ASR, it has not yet been thoroughly investigated for speech synthesis, even though TTS requires the same training dataset structure of aligned audio-transcript pairs. In this work, we compare the performance of TTS models trained on a well-curated, manually labeled dataset with that of models trained on the same audio data with text labels generated by grapheme- and phoneme-based ASR models. Phoneme-based approaches seem especially promising because, even when a phoneme is predicted incorrectly, the resulting word is more likely to sound similar to the originally spoken word than with grapheme-based predictions. For evaluation and ranking, we synthesize audio with all trained TTS models using input texts sourced from a selection of speech recognition datasets covering a wide range of application domains. These synthesized outputs are then fed into multiple state-of-the-art ASR models, and the predicted transcripts are compared to the original TTS input texts. This comparison enables an objective assessment of the intelligibility of each TTS model's audio output using metrics such as word error rate and character error rate. Our results show that models trained on data generated with weak supervision not only achieve quality comparable to models trained on manually labeled datasets but can outperform them, even for small, well-curated speech datasets. These findings suggest that the future creation of labeled datasets for supervised training of TTS models may not require any manual annotation but can be fully automated.
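
As a rough illustration of the intelligibility metrics mentioned above, the following minimal sketch computes word error rate and character error rate as the Levenshtein distance between a TTS input text and an ASR transcript, divided by the reference length. The function names, the whitespace tokenization, and the absence of any text normalization are assumptions made for this example; the abstract does not specify how the authors compute these metrics.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # assumption: plain whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    tts_input = "the cat sat on the mat"     # text given to the TTS model
    asr_output = "the cat sat on a mat"      # hypothetical ASR transcript
    print(word_error_rate(tts_input, asr_output))
    print(character_error_rate(tts_input, asr_output))
```

Lower values indicate that the ASR model recovered the TTS input text more faithfully, which serves here as a proxy for intelligibility. In practice, both texts would usually be normalized (casing, punctuation) before scoring.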
