Abstract

AbstractWord-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification show two important observations. First, the local context is an important indicator of the language of a word when a word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that makes use of a dynamic switching mechanism for effective language classification of both words that are borrowed or embedded from other languages as well as words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained by the contextual information or that obtained by the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extendible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, as well as an ensemble of the two classifiers.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.