Abstract

Online reviews play an increasingly important role in the purchase decisions of potential customers. Meanwhile, driven by the desire for profit or publicity, spammers may be hired to write fake reviews that promote or demote the reputation of products or services. Opinion spam detection has accordingly attracted attention from both business and research communities in recent years. However, unlike tasks such as news or blog classification, existing review spam datasets are typically small because human annotation is expensive, which limits detection performance even when excellent classifiers are available. In this paper, we propose a novel approach that boosts opinion spam detection performance by fully exploiting the existing small labelled dataset. We first design an annotation extension scheme that trains multiple extra-trees estimators and then iteratively generates reliably labelled samples from unlabelled ones. We then train neural network models on the extended dataset to learn distributed representations. Experimental results suggest that the proposed approach has better generalization capability and outperforms state-of-the-art methods.
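The annotation-extension loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the function name, thresholds, and ensemble size are assumptions, and scikit-learn's `ExtraTreesClassifier` stands in for the extra-trees estimators the abstract mentions. Unlabelled samples that the ensemble labels with high average confidence are moved into the labelled pool, and the process repeats.

```python
# Hypothetical sketch of the annotation-extension loop: train several
# extra-trees classifiers on the small labelled set, pseudo-label the
# unlabelled pool, and keep only high-confidence predictions.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def extend_annotations(X_lab, y_lab, X_unlab, n_estimators=3,
                       threshold=0.9, max_rounds=5, seed=0):
    """Iteratively pseudo-label unlabelled samples.

    Assumes class labels are integers 0..K-1 (illustrative choice).
    """
    X_lab, y_lab = np.asarray(X_lab, float), np.asarray(y_lab)
    pool = np.asarray(X_unlab, float)
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # Train several estimators that differ only in their random seeds.
        clfs = [ExtraTreesClassifier(n_estimators=50, random_state=seed + i)
                .fit(X_lab, y_lab) for i in range(n_estimators)]
        # Average predicted class probabilities across the ensemble.
        proba = np.mean([c.predict_proba(pool) for c in clfs], axis=0)
        conf = proba.max(axis=1)
        keep = conf >= threshold          # accept only confident pseudo-labels
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        pool = pool[~keep]
    return X_lab, y_lab
```

On a toy dataset with two well-separated clusters, nearby unlabelled points are absorbed into the labelled set within the first round; in practice the reliability threshold and stopping criterion would be tuned on held-out data.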

Highlights

  • Product reviews have played an increasingly important role in the purchase decisions of potential customers

  • We present a brief review of the related work from three perspectives: deceptive opinion spam detection, semi-supervised self-labelled techniques, and neural networks for learning distributed representations

  • This finding suggests that it is feasible to train opinion spam detectors by combining a small amount of labelled spam data, which is expensive to obtain, with a large amount of readily available unlabelled data


Summary

Introduction

Product reviews have played an increasingly important role in the purchase decisions of potential customers. Neural networks have proven highly effective for text classification tasks, but they usually require large datasets to achieve good performance, whereas labelled spam review data are scarce. The main contributions of this study are as follows: (1) we propose a semi-supervised, self-training-based annotation extension scheme that trains multiple classifiers to extend the existing spam review label set with reliably pseudo-labelled samples drawn from unlabelled data; and (2) we train state-of-the-art neural network models on the extended dataset to learn distributed representations and evaluate the resulting classification performance.
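Before the label set can be extended, base classifiers must be fitted on the small labelled set. A minimal sketch of one such base estimator is shown below, assuming a TF-IDF feature encoding and an extra-trees classifier; the toy reviews, labels, and pipeline shape are illustrative assumptions, not the paper's actual features or data.

```python
# Illustrative base estimator (not the paper's exact pipeline): encode
# reviews with TF-IDF, then fit an extra-trees classifier on the small
# labelled set that the self-training loop starts from.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reviews = ["great hotel, amazing staff",       # toy labelled examples
           "best stay ever, truly perfect",
           "room was dirty and noisy",
           "terrible service, never again"]
labels = [1, 1, 0, 0]                          # 1 = spam-like hype, 0 = genuine tone

clf = make_pipeline(TfidfVectorizer(),
                    ExtraTreesClassifier(n_estimators=100, random_state=0))
clf.fit(reviews, labels)

# Class probabilities for an unlabelled review; in self-training, only
# predictions whose maximum probability clears a threshold are accepted.
proba = clf.predict_proba(["amazing perfect hotel"])[0]
```

The probability vector returned by `predict_proba` is what a reliability score would be computed from when deciding whether to pseudo-label an unlabelled review.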

Deceptive Opinion Spam Detection
Semi-supervised Self-Labelled Techniques
Distributed Representation Learning
Overview of the Approach
Annotation Extension
Reliability Score
Feature Encoding
Neural Models
BiLSTM-RNN
TextCNN
Experiment Setup
Results and Analysis
Initial Self-learning Classifiers
Conclusions and Future Work

