Retrieving similar documents from the web

Adriano C M Pereira, Nívio Ziviani

doi:10.5555/2017024.2017028

Abstract

This paper presents a mechanism for detecting and retrieving documents from the web that have a similarity relation to a suspicious document. The process is composed of three stages: a) generation of a fingerprint of the suspicious document, b) gathering of candidate documents from the web, and c) comparison of each candidate document with the suspicious document. In the first stage, the fingerprint of the suspicious document is used as its identification. The fingerprint is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are used as queries submitted to a search engine. The documents identified by the URLs returned by the search engine are collected to form a set of similarity candidate documents. In the third stage, the candidate documents are compared to the suspicious document. The comparison uses two different methods: Shingles and Patricia tree. We implemented and evaluated the methods used for generating the document fingerprint and for comparing the suspicious document with the candidate documents. The experiments were performed using a collection of plagiarized documents constructed specially for this work. The best experimental result shows that in 61.53% of the trials all of the source documents used in the composition were retrieved from the web. In this case, in only 5.44% of the executions were fewer than 50% of the source documents used in the composition retrieved from the web. For the best fingerprint implemented, on average 87.06% of the source documents were retrieved.
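The third stage can be illustrated with a minimal sketch of shingle-based comparison. The abstract does not give the authors' parameters or scoring, so everything here is an assumption for illustration: the word-level shingle width `w=3`, the tokenization, the use of Jaccard resemblance and containment as scores, and the hypothetical helper names `shingles`, `resemblance`, `containment`, and `rank_candidates`. The Patricia tree method is not shown.

```python
import re


def shingles(text, w=3):
    """Return the set of contiguous w-word shingles in text (assumed w=3)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Very short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def containment(suspicious, candidate, w=3):
    """Fraction of the suspicious document's shingles found in the candidate."""
    ss, sc = shingles(suspicious, w), shingles(candidate, w)
    return len(ss & sc) / len(ss) if ss else 0.0


def rank_candidates(suspicious, candidates, w=3):
    """Sort (url, text) candidates by containment, highest first (hypothetical helper)."""
    scored = [(url, containment(suspicious, text, w)) for url, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Containment may suit this setting better than symmetric resemblance when the suspicious document is composed from several sources, since each source covers only part of it; that choice is a design assumption of this sketch, not a claim about the paper's method.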

Full Text