Abstract

The rapid growth of electronic media has led to a large volume of news articles being published online, which makes duplicate detection necessary. Article duplication is also closely related to article plagiarism. Existing studies on duplicate news detection focus mainly on newspaper articles; we extend this line of work to news articles from recent online We Media data. In this paper, we propose a tool, NDFinder, which uses a fingerprinting technique with a hash index to detect duplicate articles. To validate our approach, we crawled a total of 33,244 news articles. The results show that our tool detects 2,150 duplicate articles with an overall precision of 97%. Moreover, we apply our approach to detect plagiarized articles based on the duplication results, and we identify 64 pairs of plagiarized articles in the collected data. We further conduct an empirical study and summarize the 8 most commonly used plagiarism patterns in these articles.
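
The abstract does not describe NDFinder's internals, so the following is only a minimal sketch of the general idea of fingerprinting with a hash index, not the tool's actual algorithm. It assumes fingerprints are hashed k-word shingles and that two articles count as duplicates when the Jaccard similarity of their fingerprint sets reaches a threshold; the function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprints(text, k=5):
    """Return the set of hashed k-word shingles of a text (assumed scheme)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Truncated SHA-256 digests serve as compact fingerprints.
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in shingles}


def find_duplicates(articles, k=5, threshold=0.8):
    """articles: dict of id -> text. Returns (earlier_id, later_id, similarity) tuples."""
    index = defaultdict(set)  # hash index: fingerprint -> ids of articles containing it
    prints = {}               # id -> fingerprint set
    pairs = []
    for doc_id, text in articles.items():
        fp = fingerprints(text, k)
        prints[doc_id] = fp
        # Use the index to count fingerprints shared with earlier articles,
        # so only articles with at least one common fingerprint are compared.
        shared = defaultdict(int)
        for h in fp:
            for other in index.get(h, ()):
                shared[other] += 1
        for other, n in shared.items():
            union = len(fp) + len(prints[other]) - n
            similarity = n / union  # Jaccard similarity of the two fingerprint sets
            if similarity >= threshold:
                pairs.append((other, doc_id, similarity))
        for h in fp:
            index[h].add(doc_id)
    return pairs
```

The hash index is what keeps this tractable on a collection of tens of thousands of articles: each new article is compared only against earlier articles that share at least one fingerprint, rather than against the whole collection.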
