Abstract

Text mining underlies sentiment analysis. It is a discipline that combines linguistics and computer science with machine learning techniques, and it is used to convert unstructured text into structured data. Machine learning, in turn, focuses on finding and developing algorithms that let a system learn and reproduce patterns from a dataset. This study uses supervised learning, a basic machine learning technique, to compare two Naive Bayes Classifier (NBC) models, Multinomial Naive Bayes and Bernoulli Naive Bayes, on sentiment data from Twitter. Each model is combined with two term weighting techniques, TF-IDF and TF-RF. The study aims to determine the best combination of model and term weighting; to test model accuracy, the researcher uses random and balanced datasets to find out whether the dataset strongly influences the model. The first step is crawling the data with the Twitter API, after which the data are labeled. The labeled data then enter the key steps of the research, preprocessing and term weighting: the data are cleaned and converted into structured data ready for analysis. The preprocessed data are weighted with TF-IDF and TF-RF and then classified with each of the 2 NBC models, giving 4 model schemes: Multinomial with TF-IDF, Bernoulli with TF-IDF, Multinomial with TF-RF, and Bernoulli with TF-RF. In the last stage, the schemes are tested with a Confusion Matrix and validated with K-Fold Cross Validation to find the best-performing of the 4 schemes. Of the 4 schemes, Bernoulli Naive Bayes with TF-IDF and with TF-RF produce the best accuracy in the Confusion Matrix test, 61%, with an average accuracy of 60% in the 5-fold validation. The lowest accuracy belongs to Multinomial Naive Bayes with TF-IDF: 58% in the Confusion Matrix test, with an average of 59% in the 5-fold validation. In several further experiments using balanced data, Bernoulli with both term weightings again had the highest accuracy.
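The two term weighting techniques named above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not give its formulas, so the sketch assumes the plain variant TF-IDF = tf · ln(N / df) and the commonly cited TF-RF definition rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-class and other-class documents containing the term. The function names, the toy tokens, and the labels are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by tf * ln(N / df); docs is a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def tf_rf(docs, labels, positive):
    """Weight each term by tf * log2(2 + a / max(1, c)) (assumed TF-RF form).

    a = documents of the `positive` class containing the term,
    c = documents of any other class containing the term.
    """
    a, c = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (a if label == positive else c).update(set(doc))
    return [
        {
            term: tf * math.log2(2 + a[term] / max(1, c[term]))
            for term, tf in Counter(doc).items()
        }
        for doc in docs
    ]


# Hypothetical, already preprocessed tweets and their sentiment labels.
docs = [
    ["good", "service"],
    ["bad", "service"],
    ["good", "good", "app"],
    ["bad", "app"],
]
labels = ["pos", "neg", "pos", "neg"]

tfidf = tf_idf(docs)
tfrf = tf_rf(docs, labels, positive="pos")
```

Unlike TF-IDF, TF-RF uses the class labels: in this toy corpus "good" appears only in positive documents and therefore gets a larger relevance factor than "service", which appears in both classes. With more than two sentiment classes, the sketch treats one class as positive and all others as negative, which is one possible convention and not necessarily the one used in the study.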
