Abstract
Code-switching (CS), the practice of alternating between two or more languages in conversations, is pervasive in most multi-lingual communities. CS texts have a complex interplay between languages and occur in informal contexts that make them harder to collect and construct NLP tools for. We approach this problem through Language Modeling (LM) on a new Hindi-English mixed corpus containing 59,189 unique sentences collected from blogging websites. We implement and discuss different Language Models derived from a multi-layered LSTM architecture. We hypothesize that encoding language information strengthens a language model by helping to learn code-switching points. We show that our highest performing model achieves a test perplexity of 19.52 on the CS corpus that we collected and processed. On this data we demonstrate that our performance is an improvement over AWD-LSTM LM (a recent state of the art on monolingual English).
Highlights
Code-switching (CS) is a widely studied linguistic phenomenon where two different languages are interleaved
Language modeling is important to several downstream NLP applications, including speech recognition and machine translation
We address the task of language modeling in CS text with a dual objective: (1) predicting the word, and (2) predicting the language of the word
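The dual objective in the last highlight can be illustrated with a minimal sketch. The text here does not say how the two objectives are combined, so the weighted sum below, the `lang_weight` parameter, and all function names are assumptions made for illustration, not the authors' implementation. The sketch also shows how perplexity, the metric reported in the abstract, follows from the mean word-level negative log-likelihood.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class index.
    return -math.log(softmax(logits)[target])

def dual_objective_loss(word_logits, lang_logits, word_targets, lang_targets,
                        lang_weight=0.5):
    # Assumption: the joint loss is the word loss plus a weighted language loss.
    # lang_weight is a hypothetical hyperparameter, not taken from the paper.
    n = len(word_targets)
    word_nll = sum(cross_entropy(l, t)
                   for l, t in zip(word_logits, word_targets)) / n
    lang_nll = sum(cross_entropy(l, t)
                   for l, t in zip(lang_logits, lang_targets)) / n
    return word_nll + lang_weight * lang_nll, word_nll

def perplexity(mean_word_nll):
    # Perplexity is the exponential of the mean per-word negative log-likelihood.
    return math.exp(mean_word_nll)

# Toy example: two time steps, a 4-word vocabulary, 2 languages (0 = Hindi, 1 = English).
word_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
lang_logits = [[0.0, 0.0], [0.0, 0.0]]
loss, word_nll = dual_objective_loss(word_logits, lang_logits, [1, 3], [0, 1])
print(loss, perplexity(word_nll))
```

With uniform predictions the model is maximally uncertain, so the word perplexity comes out close to the vocabulary size (4 in this toy case). Only the word term enters the perplexity; the language term acts as an auxiliary training signal.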
Summary
Code-switching (CS) is a widely studied linguistic phenomenon where two different languages are interleaved. Data obtained from online sources is often noisy because of spelling, script, morphological, and grammatical variations. These sources of noise make it quite challenging to build robust NLP tools (Cetinoglu et al., 2016). Language modeling is important to several downstream NLP applications, including speech recognition and machine translation. It matters especially in domains that lack annotated data, such as code-switching, where unsupervised techniques are a necessity.