Code-mixing is a prevalent phenomenon in contemporary communication, particularly in social media interactions. In Indonesia, speakers frequently mix Indonesian, Javanese, and English in everyday conversation. Accurate language identification at the word level is important for various downstream natural language processing tasks, such as sentiment analysis and translation. This study investigates machine learning techniques for word-level language identification in Indonesian-Javanese-English code-mixed texts. We compare three categories of machine learning models: traditional machine learning, BLSTM-based architectures, and transformer-based pre-trained models. Our experiments demonstrate that fine-tuning pre-trained models achieves strong performance, with XLM-RoBERTa and IndoBERTweet obtaining the best F1 scores of 93.69% and 93.63%, respectively. These results highlight the effectiveness of fine-tuning pre-trained models for language identification, demonstrating their ability to capture sentence context and produce accurate word-level label predictions.
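For readers unfamiliar with the setup, the sketch below illustrates how word-level language identification can be framed as token classification with XLM-RoBERTa in Hugging Face Transformers. This is our illustration, not the paper's code: the label set, the example sentence, and the choice of base checkpoint are assumptions, and the classification head shown here is untrained until fine-tuned on labeled code-mixed data.

```python
# Minimal sketch: word-level language ID as token classification.
# Assumptions: hypothetical label set and example sentence; the
# classification head is randomly initialized and would need to be
# fine-tuned on labeled code-mixed data before predictions are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["ID", "JV", "EN", "MIX", "O"]        # hypothetical tag set
model_name = "xlm-roberta-base"                # base checkpoint to fine-tune

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

# A code-mixed example: Indonesian/Javanese ("aku"), Indonesian ("lagi"),
# English ("meeting", "boss"), Javanese ("karo").
words = ["aku", "lagi", "meeting", "karo", "boss"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits               # (1, seq_len, num_labels)

# Align subword predictions back to words via the first subword of each word.
pred_ids = logits.argmax(-1)[0].tolist()
word_ids = enc.word_ids(0)
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    print(words[wid], "->", labels[pred_ids[idx]])
```

The subword-to-word alignment step matters because tokenizers such as XLM-RoBERTa's split words into multiple pieces, while the task requires exactly one language label per word.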