Abstract

With the explosion of multilingual content on Web, particularly in social media platforms, identification of languages present in the text is becoming an important task for various applications. While automatic language identification (ALI) in social media text is considered to be a non-trivial task due to the presence of slang words, misspellings, creative spellings and special elements such as hashtags, user mentions etc., ALI in multilingual environment becomes even more challenging task. In a highly multilingual society, code-mixing without affecting the underlying language sense has become a natural phenomenon. In such a dynamic environment, conversational text alone often fails to identify the underlying languages present in the text. This paper proposes various methods of exploiting social conversational features for enhancing ALI performance. Although social conversational features for ALI have been explored previously using methods like probabilistic language modeling, these models often fail to address issues related to code-mixing, phonetic typing, out-of-vocabulary etc. which are prevalent in a highly multilingual environment. This paper differs in the way the social conversational features are used to propose text refinement strategies that are suitable for ALI in highly multilingual environment. The contributions in this paper therefore includes the following. First, this paper analyzes the characteristics of various social conversational features by exploiting language usage patterns. Second, various methods of text refinement suitable for language identification are proposed. Third, the effects of the proposed refinement methods are investigated using various sentence level language identification frameworks. From various experimental observations over three conversational datasets collected from Facebook, Youtube and Twitter social media platforms, it is evident that our proposed method of ALI using social conversational features outperforms the baseline counterparts.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.