Comment spam detection by sequence mining

Ravi Kant,Krishnan S Kumar,Srinivasan H Sengamedu

doi:10.1145/2124295.2124318

Abstract

Comments are supported by several web sites to increase user participation. Users can usually comment on a variety of media types - photos, videos, news articles, blogs, etc. Comment spam is one of the biggest challenges facing this feature. The traditional approach to combat spam is to train classifiers using various machine learning techniques. Since the commonly used classifiers work on the entire comment text, it is easy to mislead them by embedding spam content in good content.In this paper, we make several contributions towards comment spam detection. (1) We propose a new framework for spam detection that is immune to embed attacks. We characterize spam by a set of frequently occurring sequential patterns. (2) We introduce a variant (called min-closed) of the frequent closed sequence mining problem that succinctly captures all the frequently occurring patterns. We prove as well as experimentally show that the set of min-closed sequences is an order of magnitude smaller than the set of closed sequences and yet has exactly the same coverage. (3) We describe MCPRISM, extension of the recently published PRISM algorithm that effectively mines min-closed sequences, using prime encoding. In the process, we solve the open problem of using the prime-encoding technique to speed up traditional closed sequence mining. (4) We finally need to whittle down the set of frequent subsequences to a small set without sacrificing coverage. This problem is NP-Hard but we show that the coverage function is submodular and hence the greedy heuristic gives a fast algorithm that is close to optimal. We then describe the experiments that were carried out on a large real world comment data and the publicly available Gazelle dataset. (1) We show that nearly 80% of spam on real world data can be effectively captured by the mined sequences at very low false positive rates. (2) The sequences mined are highly discriminative. (3) On Gazelle data, the proposed algorithmic enhancements are faster by at least by a factor and by an order of magnitude on the larger comment dataset.

Full Text