Survey of review spam detection using machine learning techniques

Michael Crawford, Hamzah Al Najada, Joseph D Prusa, Aaron N Richter, Taghi M Khoshgoftaar

doi:10.1186/s40537-015-0029-9

Open Access

https://doi.org/10.1186/s40537-015-0029-9

Journal: Journal of Big Data
Publication Date: Oct 5, 2015
Citations: 364
License type: CC BY 4.0

Affiliation: Florida Atlantic University

Abstract

Online reviews are often the primary factor in a customer’s decision to purchase a product or service, and are a valuable source of information that can be used to determine public opinion on these products or services. Because of their impact, manufacturers and retailers are highly concerned with customer feedback and reviews. Reliance on online reviews gives rise to the concern that wrongdoers may create false reviews to artificially promote or devalue products and services. This practice is known as Opinion (Review) Spam, where spammers manipulate and poison reviews (i.e., by writing fake, untruthful, or deceptive reviews) for profit or gain. Since not all online reviews are truthful and trustworthy, it is important to develop techniques for detecting review spam. By extracting meaningful features from the text using Natural Language Processing (NLP), it is possible to conduct review spam detection using various machine learning techniques. Additionally, reviewer information, apart from the text itself, can be used to aid in this process. In this paper, we survey the prominent machine learning techniques that have been proposed to solve the problem of review spam detection and the performance of different approaches for classification and detection of review spam. The majority of current research has focused on supervised learning methods, which require labeled data, a scarce resource when it comes to online review spam. Research on methods for Big Data is of interest, since there are millions of online reviews, with many more being generated daily. To date, we have not found any papers that study the effects of Big Data analytics for review spam detection. The primary goal of this paper is to provide a strong and comprehensive comparative study of current research on detecting review spam using various machine learning techniques and to devise a methodology for conducting further investigation.

Highlights

As the Internet continues to grow in both size and importance, the quantity and impact of online reviews continually increase.
A common approach in text mining is the bag-of-words approach, in which the presence of individual words, or of small groups of words, is used as features; several studies have found that this approach is not sufficient to train a classifier with adequate performance in review spam detection.
In recent years, review spam detection has received significant attention in both business and academia due to the potential impact fake reviews can have on consumer behavior and purchasing decisions.

Summary

Introduction

As the Internet continues to grow in both size and importance, the quantity and impact of online reviews continually increase. A common approach in text mining is the bag-of-words approach, in which the presence of individual words, or of small groups of words, is used as features; several studies have found that this approach is not sufficient to train a classifier with adequate performance in review spam detection. Many studies consider different sets of features for review spam detection, utilizing a variety of machine learning techniques. In this paper we discuss machine learning techniques that have been proposed for the detection of online review spam, with an emphasis on feature engineering and the impact of those features on the performance of the spam detectors. In the bag-of-words representation, each word in the vocabulary is encoded as “1” if it occurs in a given review and “0” otherwise.
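The binary bag-of-words encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any of the surveyed studies: the function names (`tokenize`, `ngrams`, `binary_bag_of_words`), the simple regular-expression tokenizer, and the two sample reviews are all assumptions made for the example. The `max_n` parameter is likewise hypothetical; it extends the vocabulary from single words to small groups of adjacent words.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all runs of n adjacent tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def terms_in(review: str, max_n: int) -> set:
    """Collect every 1-gram up to max_n-gram that occurs in the review."""
    tokens = tokenize(review)
    found = set()
    for n in range(1, max_n + 1):
        found.update(ngrams(tokens, n))
    return found


def binary_bag_of_words(
    reviews: List[str], max_n: int = 1
) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted vocabulary and one 0/1 presence vector per review."""
    per_review = [terms_in(review, max_n) for review in reviews]
    vocabulary = sorted(set().union(*per_review)) if per_review else []
    # 1 if the term occurs in the review at least once, 0 otherwise;
    # repeated occurrences are not counted.
    matrix = [
        [1 if term in present else 0 for term in vocabulary]
        for present in per_review
    ]
    return vocabulary, matrix


if __name__ == "__main__":
    # Hypothetical sample reviews, for illustration only.
    sample = ["Great hotel, great staff", "Terrible hotel"]
    vocab, rows = binary_bag_of_words(sample)
    print(vocab)
    for row in rows:
        print(row)
```

Each row of the resulting matrix is the feature vector for one review and could be passed to a classifier. Because only presence is recorded, a word that appears twice in a review contributes the same "1" as a word that appears once.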

