Abstract

Efficient detection of near-duplicate documents is an indispensable capability for many applications, such as search engines, information retrieval systems, and recommendation systems. In this paper, we propose a novel content representation method for near-duplicate document detection in a large collection of Chinese documents. The proposed method, called multi-aggregation fingerprint (MAF), consists of two stages: sentence-level feature extraction and multi-feature aggregation. Compared with terms, sentences are more representative and carry richer, more complete information, so we extract the crucial information of each sentence to form a sentence feature. To improve both the accuracy and the efficiency of near-duplicate document detection, we exploit the holistic characteristics of sentence features across the dataset as well as the statistical information of the sentence features belonging to a single document. Accordingly, we partition the sentence feature space based on the distribution of features in the dataset. Each sentence feature is assigned to the nearest partition of the feature space, and the multiple sentence features of a document are aggregated into a compact, global fingerprint. Experimental results show that the proposed MAF method produces competitive results on the Chinese document dataset.
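
The assign-and-aggregate step described above can be illustrated with a minimal sketch. It is not the MAF implementation: the abstract does not specify how sentence features are encoded, how the feature space is partitioned, or how features are aggregated. The sketch assumes that sentence features are numeric vectors, that the partitions are represented by precomputed centroids (for example, from a clustering pass over the dataset), and that aggregation is a normalized count of features per partition. The names `nearest_partition`, `aggregate_fingerprint`, and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_partition(feature: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `feature`.

    Assumption: partitions are represented by centroids and "nearest"
    means smallest Euclidean distance.
    """
    return min(range(len(centroids)),
               key=lambda i: math.dist(feature, centroids[i]))


def aggregate_fingerprint(features: Sequence[Vector],
                          centroids: Sequence[Vector]) -> List[float]:
    """Aggregate a document's sentence features into one fixed-length vector.

    Assumption: the fingerprint is the fraction of the document's sentence
    features that fall into each partition (a normalized histogram).
    """
    counts = [0] * len(centroids)
    for feature in features:
        counts[nearest_partition(feature, centroids)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(centroids)
    return [c / total for c in counts]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Compare two fingerprints; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (made up for illustration): three partitions in a 2-D feature space.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
doc_a = [(0.1, 0.2), (9.5, 10.1), (0.3, -0.1)]
doc_b = [(0.2, 0.1), (10.2, 9.8), (-0.1, 0.0)]  # slightly perturbed copy of doc_a

fp_a = aggregate_fingerprint(doc_a, centroids)
fp_b = aggregate_fingerprint(doc_b, centroids)
similarity = cosine_similarity(fp_a, fp_b)
```

Because every document maps to a vector whose length equals the number of partitions, two documents can be compared with a single vector operation regardless of how many sentences each contains, which is the practical appeal of a compact, global fingerprint.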
