Online duplicate document detection

Jack G Conrad, Cindy P Schriber, Xi S Guo

doi:10.1145/956863.956946

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability. Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
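
The fingerprint idea described in the abstract, and its sensitivity to collection updates, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "rarest features" means the terms with the lowest document frequency in a training collection, and the names `tokenize`, `document_frequencies`, `fingerprint`, and the parameter `k` are hypothetical, as is the toy data.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(tokenize(text)))
    return df


def fingerprint(text, df, k=3):
    """Return the k rarest terms of a document and a hash over them.

    Assumption: rarity is measured by document frequency in the training
    collection; ties are broken alphabetically to keep the result stable.
    """
    terms = set(tokenize(text))
    rarest = sorted(terms, key=lambda t: (df.get(t, 0), t))[:k]
    digest = hashlib.sha1(" ".join(sorted(rarest)).encode("utf-8")).hexdigest()
    return rarest, digest


# Hypothetical toy collection standing in for a frozen training collection.
training = [
    "markets rally as central bank holds rates",
    "central bank holds rates steady amid inflation fears",
    "storm disrupts flights across the region",
]
doc_a = "Central bank holds rates steady amid inflation fears"
doc_b = "central bank holds rates steady, amid inflation fears."

df_before = document_frequencies(training)
terms_before, sig_before = fingerprint(doc_a, df_before)
_, sig_b = fingerprint(doc_b, df_before)

# Simulate a collection update: one new document shifts the term statistics.
df_after = document_frequencies(training + ["inflation fears grow amid weak data"])
terms_after, sig_after = fingerprint(doc_a, df_after)

print(terms_before, sig_before == sig_b)    # near-identical copies share a signature
print(terms_after, sig_after == sig_before)  # the same document after the update
```

In this toy setting, two copies that differ only in case and punctuation receive the same signature, while adding a single document to the collection changes which terms count as rarest and therefore changes the signature of an unchanged document. That is the kind of instability the abstract attributes to evolving collections, shown here at a vastly smaller scale.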

