Abstract

Code-switching (CS), the practice of alternating between two or more languages within a conversation, is pervasive in most multilingual communities. CS text involves a complex interplay between languages and occurs mostly in informal contexts, which makes it hard to collect and hard to build NLP tools for. We approach this problem through language modeling (LM) on a new Hindi-English mixed corpus of 59,189 unique sentences collected from blogging websites. We implement and discuss several language models derived from a multi-layered LSTM architecture. We hypothesize that encoding language information strengthens a language model by helping it learn code-switching points. Our highest-performing model achieves a test perplexity of 19.52 on the CS corpus that we collected and processed. On this data, it improves over the AWD-LSTM LM (a recent state of the art on monolingual English).
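For readers unfamiliar with the metric reported above, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `perplexity` and the toy probabilities are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    `token_log_probs` is a hypothetical list holding log p(w_t | w_<t)
    for every token in the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy check: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 words.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # approximately 4
```

Read this way, a test perplexity of 19.52 means the model is, on average, about as uncertain as a uniform choice among roughly 20 words at each position.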

Highlights

  • Code-switching (CS) is a widely studied linguistic phenomenon where two different languages are interleaved

  • The task of language modeling is important to several downstream NLP applications, including speech recognition and machine translation

  • We address the task of language modeling in CS text with a dual objective: (1) predicting the word, and (2) predicting the language of the word
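The dual objective in the last highlight can be illustrated with a small per-token loss that adds a word-prediction term and a language-prediction term. This is a sketch under stated assumptions, not the authors' formulation: the summary does not say how the two objectives are combined, so the weighted sum, the `lang_weight` parameter, and the function names are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(word_logits, word_id, lang_logits, lang_id, lang_weight=0.5):
    """Assumed combination: word loss plus a weighted language loss.

    word_logits: scores over the vocabulary for the next word
    lang_logits: scores over the languages (e.g. Hindi, English)
    lang_weight: hypothetical trade-off between the two objectives
    """
    word_loss = cross_entropy(word_logits, word_id)
    lang_loss = cross_entropy(lang_logits, lang_id)
    return word_loss + lang_weight * lang_loss


# Toy example: 4-word vocabulary, 2 languages, uninformative scores.
loss = joint_loss([0.0, 0.0, 0.0, 0.0], 2, [0.0, 0.0], 1)
print(loss)  # log(4) + 0.5 * log(2)
```

In a neural model, both sets of logits would typically come from separate output layers on top of a shared recurrent state, so that the language signal shapes the same representation used for word prediction.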


Summary

Introduction

Code-switching (CS) is a widely studied linguistic phenomenon in which two different languages are interleaved. Data obtained from online sources is often noisy because of spelling, script, morphological, and grammatical variations. These sources of noise make it challenging to build robust NLP tools (Cetinoglu et al., 2016). Language modeling is important to several downstream NLP applications, including speech recognition and machine translation. It is particularly important in domains that lack annotated data, such as code-switching, where unsupervised techniques must be leveraged.

