Abstract

This paper presents a mechanism for detecting and retrieving documents from the web that are similar to a suspicious document. The process is composed of three stages: a) generation of a fingerprint of the suspicious document, b) gathering of candidate documents from the web, and c) comparison of each candidate document with the suspicious document. In the first stage, a fingerprint is generated to identify the suspicious document. The fingerprint is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are submitted as queries to a search engine. The documents identified by the URLs returned by the search engine are collected to form a set of candidate documents. In the third stage, the candidate documents are compared with the suspicious document using two different methods: shingles and Patricia trees. We implemented and evaluated the methods used for generating the document fingerprint and for comparing the suspicious document with the candidate documents. The experiments were performed using a collection of plagiarized documents constructed specifically for this work. The best experimental result shows that in 61.53% of the trials all of the source documents used in the composition were retrieved from the web. In this case, fewer than 50% of the source documents used in the composition were retrieved in only 5.44% of the executions. For the best fingerprint implemented, on average 87.06% of the source documents were retrieved.
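The shingle-based comparison in the third stage can be illustrated with a minimal sketch. The abstract does not state the shingle size, the tokenization, or the similarity measure, so the word-level shingles of width `w=3`, the whitespace tokenization, and the Jaccard resemblance used here are assumptions, and the function names `shingles` and `resemblance` are hypothetical.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles in text.

    Assumption: lowercase, whitespace-separated words; the paper's
    actual tokenization and shingle size are not given in the abstract.
    """
    words = text.lower().split()
    if len(words) < w:
        # A text shorter than w words yields a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(suspicious, candidate, w=3):
    """Jaccard resemblance between the shingle sets of two texts.

    Assumption: Jaccard is used as the similarity measure.
    """
    sa, sb = shingles(suspicious, w), shingles(candidate, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Under these assumptions, a candidate whose resemblance to the suspicious document exceeds some chosen threshold would be flagged as a likely source; the threshold itself is a tuning choice that the abstract does not specify.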

