Scaling Reuse Detection in the Web through Two-way Boosting with Signatures and LSH

Jong Wook Kim

doi:10.9717/kmms.2013.16.6.735

Abstract

The emergence of Web 2.0 technologies, such as blogs and wikis, enables even naive users to easily create and share content on the Web using freely available content-sharing tools. The wide availability of almost free data and the promiscuous sharing of content through social networking platforms have created a content-borrowing phenomenon, where the same content appears (in many cases in the form of extensive quotations) in different outlets. An immediate side effect of this phenomenon is that identifying which content is reused by whom is becoming a critical tool in social network analysis, including expert identification and analysis of information flow. Internet-scale reuse detection, however, poses extremely challenging scalability issues: given the large volume of user-created data on the Web, content-reuse detection techniques must be fast and scalable. Thus, in this paper, we propose qSignlsh, a mechanism for identifying multi-sentence content reuse among documents by efficiently combining sentence-level evidence. The experimental results show that qSignlsh significantly improves reuse detection speed while providing high recall.
