Performance of Methods in Identifying Similar Languages Based on String to Word Vector

Herry Sujaini

doi:10.23917/khif.v6i1.8199

Abstract

Indonesia has a large number of local languages, many of which share cognate words and resemble one another. Automatic identification of languages within the same family is difficult, so it is necessary to determine which language identification method performs best on this task. This study identifies Indonesian local languages using the String to Word Vector approach. A string vector is an ordered collection of words: in a string vector a word is an element or value, whereas in a numeric vector each word becomes an attribute or feature. Among the Naive Bayes, SMO, J48, and ZeroR classifiers, SMO is the most accurate, with an accuracy of 95.7% under 10-fold cross-validation and 94.4% under a 60%:40% split. The best tokenizer for this classification is the Character N-Gram Tokenizer. All classifiers except ZeroR show higher accuracy with the Character N-Gram Tokenizer than with the Word Tokenizer. The best features of this system are the TriGram and FourGram character features; the TriGram is preferred because it requires less training data. The highest accuracy in the combination experiment is 0.965, obtained with IDF = FALSE and WC = TRUE, regardless of the TF setting.
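To make the tokenizer and word-count ideas in the abstract concrete, the following is a minimal sketch in plain Python. It is not the study's implementation: the function names (`char_ngrams`, `word_tokens`, `to_vector`), the `word_count` flag standing in for the WC option, and the two toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def word_tokens(text):
    """Return whitespace-separated, lowercased word tokens."""
    return text.lower().split()


def to_vector(tokens, vocabulary, word_count=True):
    """Map a token list onto a fixed vocabulary.

    word_count=True stores how many times each feature occurs
    (an assumed analogue of WC = TRUE); word_count=False stores
    only presence (1) or absence (0).
    """
    counts = Counter(tokens)
    if word_count:
        return [counts[t] for t in vocabulary]
    return [1 if counts[t] else 0 for t in vocabulary]


# Toy sentences, not taken from the paper's dataset.
sentences = ["saya makan nasi", "aku mangan sego"]

# Build a shared trigram vocabulary, then one numeric vector per sentence.
vocabulary = sorted({g for s in sentences for g in char_ngrams(s, 3)})
vectors = [to_vector(char_ngrams(s, 3), vocabulary) for s in sentences]
```

Each sentence becomes a numeric vector whose attributes are the trigrams in the shared vocabulary. Replacing `char_ngrams` with `word_tokens` gives the word-level representation the character-level one is compared against.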

Highlights

Language identification functions to identify or recognize the language of a text
Inverse Document Frequency (IDF) scales a word by how often it appears across different sentences, which means that words appearing in many dialects cannot be used as features [19]
In the 60%:40% experiment on training and test data, all classifiers except ZeroR show increased accuracy when using the Character N-Gram Tokenizer compared to the Word Tokenizer
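The IDF highlight above can be illustrated with a short sketch. It assumes the common textbook form idf(t) = log(N / df(t)), where N is the number of sentences and df(t) is the number of sentences containing token t; the exact formula used in the study is not stated here, and the function name and toy token lists are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Compute idf(t) = log(N / df(t)) for every token in the corpus.

    `documents` is a list of token lists. This is an assumed textbook
    variant of IDF, not necessarily the one used in the study.
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each token at most once per document.
        for t in set(tokens):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}


# Toy token lists, not taken from the paper's dataset.
docs = [["aku", "makan"], ["aku", "mangan"], ["aku", "dahar"]]
idf = inverse_document_frequency(docs)
```

In this sketch a token that occurs in every sentence (`"aku"`) receives weight 0, while a token confined to one sentence receives the largest weight. This is the sense in which a word shared across many dialects carries no discriminating information.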

Summary

Introduction

Language identification is the task of identifying or recognizing the language (or dialect) of a text. In one earlier study, the authors identify features manually, using Tweets as the dataset for the training and testing process. They use three classifiers: Naïve Bayes, Logistic Regression, and Support Vector Machine. They vary the size of the LM to examine its effect on the performance of their approach, and they study the impact of two types of preprocessing techniques. Safitri [8] conducted a study on spoken language identification with phonotactics for the Minangkabau, Sundanese, and Javanese languages, concluding that the PRLM method showed the highest accuracy when using phone recognizers trained for English and Russian, with averages of 77.42% and 75.94%. This paper discusses the performance of language identification methods, based on String to Word Vector, for languages that are similar to one another.

Method

Results and Discussion

Conclusion