Transformer Based Language Identification for Malayalam-English Code-Mixed Text

S Thara, Prabaharan Poornachandran

doi:10.1109/access.2021.3104106

Open Access

Journal: IEEE Access
Publication Date: Jan 1, 2021
Citations: 26
License type: CC BY-NC-ND 4.0

Affiliation: Amrita Vishwa Vidyapeetham University

Abstract

Social media users tend to write most of the text in under-resourced languages in code-mixed form. Code-mixing is the mixing of two or more languages within a single sentence. Research on code-mixed text helps detect security threats that are prevalent on social media platforms, and language identification is an essential step in processing such text. This paper performs word-level language identification (WLLI) on Malayalam-English code-mixed data drawn from social media platforms such as YouTube. The study centers on BERT, a transformer model, and its variants CamemBERT and DistilBERT to identify the language of each word. The proposed approach tags the Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronyms (acr), universal (univ), mixed (mix), and undefined (undef). A newly developed Malayalam-English corpus was used to assess the effectiveness of state-of-the-art models such as BERT. When the proposed approach was also evaluated on another code-mixed language pair, Hindi-English, it showed a 9% increase in F1-score.
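
To make the six-label word-level tagging concrete, here is a minimal Python sketch of how word-level tags are commonly prepared for a BERT-style token classifier. It is an illustration under stated assumptions, not the authors' pipeline: only the six tag names come from the abstract. The helper names `toy_subword_split` and `align_labels`, the fixed-size splitting rule, and the example words and their tags are all hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, used here so that only the first subword of each word carries a label.

```python
from typing import Callable, List, Tuple

# The six tags named in the abstract.
LABELS = ["mal", "eng", "acr", "univ", "mix", "undef"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

# Positions marked with this value are skipped by the loss
# (default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    if not word:
        return [word]
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str],
    tags: List[str],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Expand word-level tags to subword level.

    The first subword of each word receives the word's label id;
    continuation subwords receive IGNORE_INDEX.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        pieces = split(word)
        tokens.extend(pieces)
        label_ids.append(LABEL2ID[tag])
        label_ids.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


if __name__ == "__main__":
    # Invented example, not taken from the paper's corpus.
    words = ["song", "adipoli", "BGM"]
    tags = ["eng", "mal", "acr"]
    for token, label_id in zip(*align_labels(words, tags)):
        print(token, label_id)
```

In a real setup the toy splitter would be replaced by the tokenizer that matches the chosen model (BERT, CamemBERT, or DistilBERT), since each splits words into subwords differently.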
