A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development

Ahmad Fathan Hidayatullah, Rosyzie Anna Apong, Daphne Teck Ching Lai, Atika Qazi

doi:10.1109/access.2022.3223703


Open Access



Abstract

The mixing of a native language with other languages (code-mixing) on social media has posed a severe challenge for language identification (LID) systems and has encouraged research on code-mixed LID solutions. This study investigated the techniques, challenges, and dataset availability, with corresponding quality criteria, and developed a comprehensive framework for code-mixed LID. It addressed four research issues to identify gaps and future work opportunities in tackling code-mixed LID challenges. Based on our analysis of the reviewed studies, we outlined key points for future research in code-mixed LID. We presented a taxonomy of the techniques applied to code-mixed LID and highlighted the different technique variants. We identified four significant challenges in code-mixed LID tasks: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review identified 32 code-mixed datasets available for LID. We proposed five features to describe dataset quality criteria: the number of instances or sentences, the percentage of code-mixed types in the data, the number of tokens, the number of unique tokens, and the average sentence length. Finally, through our literature analysis, we synthesised the methodologies and proposed a conceptual framework for subsequent studies.
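
To make the five dataset features concrete, the following is a minimal Python sketch of how they might be computed for a token-labelled corpus. It is an illustration under stated assumptions, not the authors' procedure: the function name `dataset_features`, the `(token, language_tag)` representation, the toy sentences, and the tags `id` and `en` are all hypothetical. In particular, the abstract does not define how the "percentage of code-mixed types" is measured, so the sketch assumes it can be approximated as the share of sentences containing more than one language tag.

```python
from typing import Dict, List, Tuple

# Assumption: each sentence is a list of (token, language_tag) pairs.
Sentence = List[Tuple[str, str]]


def dataset_features(sentences: List[Sentence]) -> Dict[str, float]:
    """Compute five descriptive features for a token-labelled corpus."""
    n_sentences = len(sentences)
    # Lowercase tokens so that "Meeting" and "meeting" count as one unique token.
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    # Assumption: a sentence is code-mixed if it carries more than one language tag.
    n_mixed = sum(1 for sent in sentences if len({tag for _, tag in sent}) > 1)
    return {
        "sentences": n_sentences,
        "code_mixed_pct": 100.0 * n_mixed / n_sentences if n_sentences else 0.0,
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "avg_sentence_length": len(tokens) / n_sentences if n_sentences else 0.0,
    }


# Hypothetical toy corpus, made up for illustration only.
toy = [
    [("aku", "id"), ("lagi", "id"), ("meeting", "en")],
    [("see", "en"), ("you", "en")],
    [("makan", "id"), ("dulu", "id"), ("ya", "id")],
]
features = dataset_features(toy)
print(features)
```

Reporting these values side by side for each dataset would allow a like-for-like comparison; a finer-grained definition of code-mixing (for example, separating intra-word from sentence-level mixing) would need additional annotation that this sketch does not assume.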
