Abstract

Social media users have the proclivity to write majority of the data for under resourced languages in code-mixed format. Code-mixing is defined as mixing of two or more languages in a single sentence. Research in code-mixed text helps apprehend security threats, prevalent on social media platforms. In such instances, language identification is an imperative task of code-mixed text. The focus of this paper is to carry out a word-level language identification (WLLI) of Malayalam-English code-mixed data, from social media platforms like YouTube. This study was centered around BERT, a transformer model, along with its variants - CamemBERT, DistilBERT - for intuitive perception of the language at the word-level. The propounded approach entails tagging Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronyms (acr), universal (univ), mixed (mix) and undefined (undef). Newly developed corpus of Malayalam-English was deployed for appraisal of the effectiveness of state-of-the-art models like BERT. Evaluation of the proffered approach, accomplished with other code-mixed language such as Hindi-English, notched a 9% increase in the F1-score.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.