Abstract
Social media has become an important and convenient tool for accessing information that is beneficial in education, marketing, finance and communication. The number of social media users grows significantly every day, producing a massive volume of data that is easily available to Natural Language Processing (NLP) researchers. People, especially in multilingual societies, prefer to write in multiple languages and use code-mixing and code-switching to express their views, which makes NLP tasks more challenging and complex. A language identification system is therefore a necessity for building complex NLP systems on code-mixed data. Although language identification for monolingual text such as English is considered a solved problem in many NLP applications, the noisy nature of code-mixed text makes it a complex task that remains unsolved. In recent years, machine learning approaches have attracted significant attention for classification problems. In this paper, machine learning approaches using Multinomial Naïve Bayes, Decision Tree and Support Vector Machine are used for word-level language identification in English-Hindi and English-Urdu code-mixed social media text. Support Vector Machine, with accuracies of 83.58% and 75.79% for Hindi-English and Urdu-English respectively, performs better than the other two approaches.
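To make the word-level setting concrete, the following is a minimal sketch of a Multinomial Naïve Bayes tagger that assigns a language label to each token. It is not the paper's implementation: the abstract does not state the features used, so character n-grams with boundary markers are an assumption, and the training words, the labels "en" and "hi", and the names char_ngrams, WordLevelNB and tag_sentence are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class WordLevelNB:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing constant
        self.class_counts = Counter()              # training words per language
        self.feature_counts = defaultdict(Counter) # n-gram counts per language
        self.vocab = set()                         # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            self.class_counts[label] += 1
            for gram in char_ngrams(word):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total_words = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(count / total_words)
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for gram in char_ngrams(word):
                score += math.log((self.feature_counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def tag_sentence(model, sentence):
    """Label every whitespace-separated token with a predicted language."""
    return [(token, model.predict(token)) for token in sentence.split()]


# Hypothetical toy data: a handful of English and romanized Hindi words.
train_words = ["the", "is", "what", "good", "very", "my",
               "hai", "nahi", "kya", "bahut", "accha", "mera"]
train_labels = ["en"] * 6 + ["hi"] * 6

model = WordLevelNB().fit(train_words, train_labels)
tags = tag_sentence(model, "movie bahut good hai")
```

Here tags holds one (token, label) pair per token of the code-mixed input. With so little training data the labels are not reliable; the sketch only shows the shape of the task, where each word is classified independently rather than the sentence as a whole.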