Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Jack G Conrad, Cindy P Schriber

doi:10.1002/asi.20363

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client‐users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production‐based test collection created by domain experts.

Journal: Journal of the American Society for Information Science and Technology
Publication Date: Mar 22, 2006
Citations: 3

Similar Papers

Online duplicate document detection
Jack G Conrad ... Cindy P Schriber
03 Nov 2003

Constructing a text corpus for inexact duplicate detection
Jack G Conrad ... Cindy P Schriber
25 Jul 2004

Salton and Buckley’s Landmark Research in Experimental Text Information Retrieval
Christine F Marton
Evidence based library and information practice | VOL. 6
15 Dec 2011

A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment
Vaishali Wangikar ... Sachin Deshmukh
The International Journal of Innovative Technology and Exploring Engineering | VOL. 9
30 Jan 2020
