Abstract

Code-switching is common in bilingual and multilingual communities. It is especially prevalent in an ethnically diverse country like India, where many languages are spoken within the same society. Demand for automatic speech recognition (ASR) systems that can handle code-switched speech has grown recently, yet relatively few code-switching resources are available to train such systems, so creating new ones is essential. This paper describes the creation of HEBiC, a bilingual Hindi-English corpus developed at VIT Bhopal University. The corpus contains 7.5 hours of read speech totalling 58,245 words, and includes 600 sentences with a vocabulary of 3,575 words. It is highly diverse, with speakers from 27 states of India who exhibit distinct accents and dialects. The paper details the sources and methods used to collect the corpus and presents a statistical analysis of it. We evaluate baseline and end-to-end ASR models on the corpus and report their performance in terms of word error rate (WER).
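For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for HEBiC. The function name `word_error_rate`, the whitespace tokenization, and the romanized example sentence are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: plain whitespace tokenization; real pipelines usually
    # normalize case, punctuation, and script before scoring.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched sentence (not taken from the corpus):
# one substituted word out of four reference words.
print(word_error_rate("mujhe ek coffee chahiye", "mujhe ek copy chahiye"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.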
