Abstract

The main problems in sentiment analysis of Indonesian YouTube comments are unstructured data and low classification accuracy. Because Indonesian differs from English, sentiment analysis for Indonesian requires suitable preprocessing and classification methods. Previous research has typically used Linear Support Vector Machine (SVM), Naive Bayes, and Decision Tree classifiers. Although SVM is more accurate than the other algorithms, its accuracy still needs to be improved. This study compares the performance of tree-based ensemble methods combined with feature selection to improve sentiment analysis models for Indonesian YouTube comments. We crawled Indonesian YouTube comments from different domains and produced ten datasets. Preprocessing consisted of stopword removal, slang word conversion, and stemming. For feature selection, we tested two vectorizer methods: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). Models were built using six machine learning algorithms: four tree-based ensemble methods intended to improve accuracy, plus Linear SVM and Decision Tree as baselines for comparison. Random Forest and Extra Tree represent bagging ensembles, while AdaBoost and Gradient Boosting represent boosting ensembles. Across all experiments combining feature selection with ensemble learning, the best-performing methods were Extra Tree, with an accuracy of 93.39%, and AdaBoost, with an accuracy of 92.53%. The choice of vectorizer, TF or TF-IDF, did not significantly affect classification accuracy.
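
To make the two vectorizers concrete, the following is a minimal sketch of the preprocessing and weighting steps named in the abstract. It is not the study's implementation: the stopword set, the slang dictionary, the sample comments, and all function names are hypothetical, stemming is omitted, and the unsmoothed weighting idf = ln(N / df) is an assumed variant (libraries such as scikit-learn apply a smoothed formula by default).

```python
import math
from collections import Counter

# Hypothetical toy resources; the study's actual stopword and slang
# lists are not given in the abstract.
STOPWORDS = {"yang", "dan", "di", "ini"}
SLANG = {"gk": "tidak", "bgt": "banget", "bgs": "bagus"}


def preprocess(comment):
    """Lowercase, convert slang words, and remove stopwords.

    Stemming is left out because it needs an Indonesian stemmer.
    """
    tokens = comment.lower().split()
    tokens = [SLANG.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]


def tf_vectors(docs):
    """TF: raw term counts for each tokenized document."""
    return [Counter(doc) for doc in docs]


def tfidf_vectors(docs):
    """TF-IDF: term count times ln(N / df), an assumed unsmoothed variant."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Made-up comments, used only to exercise the functions.
comments = ["Video ini bgs bgt", "Video ini gk bgs", "Lagu yang bgs dan enak"]
docs = [preprocess(comment) for comment in comments]
tf = tf_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In this toy corpus the word "bagus" occurs in every comment, so its TF-IDF weight is zero while its TF count stays at one. That is the only difference between the two representations: TF-IDF down-weights terms that appear in many documents.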
