Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets.

Ahmad Fathan Hidayatullah, Daphne T.C Lai, Atika Qazi, Rosyzie Anna Apong

doi:10.7717/peerj-cs.1312

Open Access

https://doi.org/10.7717/peerj-cs.1312

Abstract

With the massive use of social media today, mixing languages in social media text is prevalent. In linguistics, this phenomenon is known as code-mixing. The prevalence of code-mixing raises various concerns and challenges in natural language processing (NLP), including for language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and of how the annotation standards were constructed. We also discuss some challenges encountered during corpus creation. Then, we investigate several strategies for developing code-mixed language identification models, including fine-tuned BERT, BLSTM-based models, and CRF. Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques. This is a result of BERT's ability to understand each word's context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
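To make the idea of word-level identification over sub-word units concrete, the following is a minimal sketch of one common way to bridge the two: each word's language tag is attached to its first sub-word piece, the remaining pieces are masked out, and per-piece predictions are collapsed back to one tag per word. It is not the authors' implementation. The tag set (`ID`, `JV`, `EN`), the example words and their tags, and the helper names are all assumptions for illustration, and `toy_subword_split` is a hypothetical stand-in for a real WordPiece tokenizer.

```python
from typing import Dict, List, Tuple

# -100 is the default ignore_index of PyTorch's cross-entropy loss, so pieces
# carrying it would not contribute to a training loss.
IGNORE = -100


def toy_subword_split(word: str, size: int = 4) -> List[str]:
    """Hypothetical tokenizer: fixed-size chunks, WordPiece-style '##' marks.

    Assumes `word` is a non-empty string.
    """
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str], labels: List[str], label2id: Dict[str, int]
) -> Tuple[List[str], List[int], List[int]]:
    """Project word-level tags onto sub-word pieces (first piece keeps the tag)."""
    pieces: List[str] = []
    piece_labels: List[int] = []
    word_index: List[int] = []  # which word each piece came from
    for idx, (word, label) in enumerate(zip(words, labels)):
        for j, piece in enumerate(toy_subword_split(word)):
            pieces.append(piece)
            piece_labels.append(label2id[label] if j == 0 else IGNORE)
            word_index.append(idx)
    return pieces, piece_labels, word_index


def word_level_predictions(
    piece_preds: List[int], word_index: List[int], id2label: Dict[int, str]
) -> List[str]:
    """Collapse per-piece predictions to one tag per word (first piece wins)."""
    seen = set()
    result: List[str] = []
    for pred, idx in zip(piece_preds, word_index):
        if idx not in seen:
            seen.add(idx)
            result.append(id2label[pred])
    return result


# Illustrative tag set and sentence; not taken from the IJELID corpus.
label2id = {"ID": 0, "JV": 1, "EN": 2}
id2label = {v: k for k, v in label2id.items()}
words = ["meeting", "karo", "teman"]
labels = ["EN", "JV", "ID"]

pieces, piece_labels, word_index = align_labels(words, labels, label2id)
```

In a real setup, a fine-tuned token classification model would produce one prediction per piece, and `word_level_predictions` would then recover one tag per original word. Taking the first piece's prediction is only one possible choice; voting across a word's pieces is another.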

Journal: PeerJ Computer Science	Publication Date: Jun 22, 2023
Citations: 4	License type: CC BY 4.0
