Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection

Brian Heredia, Michael Crawford, Joseph D. Prusa, Taghi M. Khoshgoftaar

doi:10.1007/s13278-017-0456-z

Abstract

As fake reviews become more prominent on the web, a method to differentiate between untruthful and truthful reviews becomes increasingly necessary. However, detecting false reviews is difficult, as determining the validity of a review based solely on its text can be nearly impossible for a human. In this study, we aim to determine the effectiveness of machine learning techniques, specifically ensemble techniques and the combination of feature selection and ensemble techniques, for the detection of spam reviews. In addition to traditional ensemble techniques, such as Boosting and Bagging, we employ techniques that combine ensemble methods with a form of feature selection: Select-Boost, Select-Bagging, and Random Forest. For Select-Boost and Select-Bagging, we combine the Boosting and Bagging approaches with three different feature rankers. Random Forest was run with 100, 250, and 500 trees. Our results show that a combination of Select-Boost, multinomial naive Bayes, and either Chi-squared or signal-to-noise significantly outperforms all other methods except Random Forest using 500 trees. There is no significant difference between the feature subset sizes tested when using Select-Boost with multinomial naive Bayes, regardless of the feature selection technique employed. To the best of our knowledge, this is the first study to examine the effect of a combination of ensemble techniques and feature selection in the domain of spam review detection.
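To make the idea of pairing an ensemble with a feature ranker concrete, the following is a minimal sketch of a Select-Bagging style learner in plain Python. It is an illustration under stated assumptions, not the implementation used in the study: it assumes that features are re-ranked on each bootstrap sample, it uses a Chi-squared score computed on binary term presence, and the function names, the values of `n_models` and `k`, and the toy reviews are all hypothetical.

```python
import math
import random
from collections import Counter

def chi2_score(docs, labels, term):
    """Chi-squared statistic of a 2x2 table: term present/absent vs. class 1/0."""
    n = len(docs)
    n11 = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
    n10 = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)
    n01 = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == 1)
    n00 = n - n11 - n10 - n01
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom

def top_k_terms(docs, labels, k):
    """Rank the vocabulary by chi-squared score and keep the k best terms."""
    vocab = sorted({t for doc in docs for t in doc})
    return sorted(vocab, key=lambda t: -chi2_score(docs, labels, t))[:k]

def train_mnb(docs, labels, terms):
    """Multinomial naive Bayes with Laplace smoothing, restricted to `terms`."""
    terms = set(terms)
    counts = {c: Counter() for c in set(labels)}
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in doc if t in terms)
    prior = Counter(labels)
    model = {}
    for c, term_counts in counts.items():
        total = sum(term_counts.values()) + len(terms)
        log_like = {t: math.log((term_counts[t] + 1) / total) for t in terms}
        model[c] = (math.log(prior[c] / len(labels)), log_like)
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log posterior for a tokenized document."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)

def select_bagging(docs, labels, n_models=15, k=4, seed=0):
    """Assumed scheme: select features, then train, on each bootstrap sample."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        idx = [rng.randrange(len(docs)) for _ in docs]  # sample with replacement
        sample_docs = [docs[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        terms = top_k_terms(sample_docs, sample_labels, k)
        ensemble.append(train_mnb(sample_docs, sample_labels, terms))
    return ensemble

def predict(ensemble, doc):
    """Majority vote over the ensemble members."""
    votes = Counter(predict_mnb(model, doc) for model in ensemble)
    return votes.most_common(1)[0][0]

# Invented toy data for illustration only (1 = deceptive, 0 = truthful).
TOY_DOCS = [
    ["amazing", "best", "hotel", "ever"],
    ["best", "amazing", "stay", "ever"],
    ["amazing", "perfect", "best", "place"],
    ["room", "was", "clean", "quiet"],
    ["quiet", "room", "near", "station"],
    ["clean", "room", "slow", "checkin"],
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

ensemble = select_bagging(TOY_DOCS, TOY_LABELS)
label = predict(ensemble, ["amazing", "hotel"])
```

A Select-Boost variant would presumably replace the uniform bootstrap with instance weights that are updated after each round, and a different ranker such as signal-to-noise could be substituted for `chi2_score` without changing the rest of the sketch.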

Full Text