Constructing a text corpus for inexact duplicate detection

Jack G Conrad, Cindy P Schriber

doi:10.1145/1008992.1009131

Publication Date: Jul 25, 2004

Citations: 30

Affiliation: Thomson Reuters (United States)

#Professional Searchers #Proprietary Environments

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.
