Abstract

Due to the lack of a large annotated speech corpus, many low-resource Indian languages struggle to utilize recent advancements in deep neural network architectures for Automatic Speech Recognition (ASR) tasks. Collecting large-scale databases is an expensive and time-consuming task. Current approaches lack extensive traditional expert-based data acquisition guidelines, as they are tedious and complex. In this work, we present the International Institute of Information Technology Hyderabad-Crowd Sourced Telugu Database (IIITH-CSTD), a Telugu corpus collected through crowdsourcing. In particular, our main objective is to mitigate the low-resource problem for Telugu. We also present the sources, crowdsourcing pipeline, and the protocols used to collect the corpus for a low-resource language, namely, Telugu. Data of approximately 2,000 hours of transcribed audio is presented and released in this article, covering three major regional dialects of the Telugu language in three different (i.e., read, conversational and spontaneous) speaking styles on topics such as politics, sports, and arts, science, and so on. 1 We also present the experimental results of the collected corpus on ASR tasks. We hope this work will motivate researchers to curate large-scale annotated speech data for other low-resource Indic languages.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.