Abstract

Sentiment analysis of tweets presents particular challenges: in addition to frequent use of informal language, the vocabulary is often inconsistent, contains abbreviations, and is mixed with local languages. In this study, we combined n-gram feature models, i.e., unigrams, bigrams, and unibigrams (combined 1- and 2-grams), to analyze public opinion in Bahasa Indonesia tweets about the presidential candidates in the 2014 presidential election of the Republic of Indonesia. The experiments were carried out using Naive Bayes classifiers, Maximum Entropy classifiers, and Support Vector Machines, with and without stop word removal and stemming on the pre-processed tweet documents. The results show that the best performance is achieved by Naive Bayes classifiers with unibigram features, without stop word removal or stemming. This configuration reached precision and recall of up to 85.50%, indicating that automatic sentiment analysis of tweet documents using well-known supervised learning methods is feasible for the Indonesian language. More interestingly, stop word removal and stemming made classification performance worse compared with a corpus that underwent cleansing only.
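To make the described setup concrete, the following is a minimal sketch (not the authors' code) of how unigram, bigram, and unibigram feature models can be compared with a Naive Bayes classifier using scikit-learn; the tweet texts, labels, and cross-validation setting are placeholder assumptions for illustration only.

# Illustrative sketch: comparing unigram, bigram, and unibigram (1+2-gram)
# features with a Naive Bayes classifier, loosely mirroring the abstract.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder Indonesian tweet texts, cleansed only (no stop word removal
# or stemming, the configuration reported to perform best).
tweets = [
    "calon presiden ini tegas dan merakyat",
    "kampanye penuh janji kosong",
    "program ekonomi sangat jelas",
    "debat tadi malam mengecewakan",
]
labels = ["positive", "negative", "positive", "negative"]

feature_models = {
    "unigram":   (1, 1),
    "bigram":    (2, 2),
    "unibigram": (1, 2),  # combined 1- and 2-grams
}

for name, ngram_range in feature_models.items():
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range),
        MultinomialNB(),
    )
    # 2-fold cross-validation on this toy data; the study evaluates
    # precision and recall on a real labeled tweet corpus.
    scores = cross_val_score(model, tweets, labels, cv=2)
    print(f"{name}: mean accuracy {scores.mean():.2f}")

The same pipeline structure would accommodate the other classifiers mentioned in the study (e.g., a linear SVM or a logistic-regression-style Maximum Entropy model) by swapping the final estimator.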
