Effective Distributed Representation of Code-Mixed Text

Aditya Malte, Sheetal Sonawane

doi:10.1109/indicon47234.2019.9028960

Abstract

As a growing number of people embrace social media, mining the data they generate has become an important task. Possible applications range from opinion mining and sentiment analysis to hate speech detection. In particular, analyzing code-mixed multilingual text has gained popularity because it holds important socio-cultural clues that may be lost in translation. This paper explores methods to effectively analyze code-mixed Hindi/English (Hinglish) text. First, we generate a large-scale code-mixed corpus to aid further research on code-mixed social media text. We then train high-quality word embeddings on this corpus. Finally, we demonstrate the efficacy of the proposed method by training machine learning models that improve upon the previous state of the art while using a much lighter and more explainable architecture. In training the classifier, we aimed not only for high performance but also for good model explainability and speed.

Full Text