Identification of 10 Regional Indonesian Languages Using Machine Learning

Azhar Baihaqi Nugraha,Ade Romadhony

doi:10.33395/sinkron.v8i4.12989

Abstract

Language Identification plays a pivotal role in deciphering the rich tapestry of Indonesia's diverse regional languages, encompassing a wide spectrum of scripts, and spoken forms. Language Identification, an integral component of Natural Language Processing, is frequently addressed through Text Classification. In this study, we embark on the task of identifying 10 Indonesian languages, leveraging the NusaX dataset, with the overarching objective of contextual language determination. To achieve this, we harness a diverse array of machine learning techniques, including Support Vector Machine, Naïve Bayes Classifier, Decision Tree, Rocchio Classification, Logistic Regression, and Random Forest. We complement these methods with two distinct feature extraction approaches: N-gram and TF-IDF. This comprehensive approach enables us to construct robust models for language identification. Our findings unveil the strong efficacy of these models in discerning Indonesian languages, with the Naïve Bayes Classifier emerging as the frontrunner, achieving an impressive accuracy rate of 99.2% with TF-IDF and an even more remarkable 99.4% with N-Gram. To gain deeper insights, we delve into error analysis, revealing that misclassifications often stem from shared words across different languages. This research is underpinned by the necessity for a robust language identification model, underscoring its critical role within the complex linguistic landscape of Indonesian regional languages. These results hold great promise for applications in automated language processing and understanding within this diverse and multifaceted linguistic context.

Full Text