Detecting spam comments on Indonesia’s Instagram posts

Ali Akbar Septiandri,Okiriza Wibisono

doi:10.1088/1742-6596/801/1/012069

Ali Akbar Septiandri, Okiriza Wibisono

Open Access

https://doi.org/10.1088/1742-6596/801/1/012069

Copy DOI

Abstract

In this paper we experimented with several feature sets for detecting spam comments in social media contents authored by Indonesian public figures. We define spam comments as comments which have promotional purposes (e.g. referring other users to products and services) and thus not related to the content to which the comments are posted. Three sets of features are evaluated for detecting spams: (1) hand-engineered features such as comment length, number of capital letters, and number of emojis, (2) keyword features such as whether the comment contains advertising words or product-related words, and (3) text features, namely, bag-of-words, TF-IDF, and fastText embeddings, each combined with latent semantic analysis. With 24,000 manually-annotated comments scraped from Instagram posts authored by more than 100 Indonesian public figures, we compared the performance of these feature sets and their combinations using 3 popular classification algorithms: Na¨ıve Bayes, SVM, and XGBoost. We find that using all three feature sets (with fastText embedding for the text features) gave the best F1-score of 0.9601 on a holdout dataset. More interestingly, fastText embedding combined with hand-engineered features (i.e. without keyword features) yield similar F1-score of 0.9523, and McNemar’s test failed to reject the hypothesis that the two results are not significantly different. This result is important as keyword features are largely dependent on the dataset and may not be as generalisable as the other feature sets when applied to new data. For future work, we hope to collect bigger and more diverse dataset of Indonesian spam comments, improve our model’s performance and generalisability, and publish a programming package for others to reliably detect spam comments.

Full Text