Aggregating sentence-level features for Chinese near-duplicate document detection

Yan Liang, Zhenjing Wan, Xue Jiang, Ning Feng, Feng Xu, Yizheng Tao, Shan Gao

doi:10.1109/icnsc.2017.8000087

Abstract

Detecting near-duplicate documents efficiently is an indispensable capability for many applications, such as search engines, information retrieval systems, and recommendation systems. In this paper, we propose a novel content representation method for near-duplicate document detection in a large collection of Chinese documents. The proposed method, called multi-aggregation fingerprint (MAF), consists of sentence-level feature extraction and multi-feature aggregation. Compared with terms, sentences are more representative and carry richer, more complete information. We therefore extract the crucial information of each sentence to form sentence features. To improve the accuracy and efficiency of near-duplicate document detection, we exploit both the holistic characteristics of sentence features across the dataset and the statistical information of the sentence features belonging to each document. Accordingly, we split the sentence feature space based on the distribution of features in the dataset. Each sentence feature is assigned to the nearest partition of the feature space, and the multiple sentence features of a document are aggregated into a compact, global fingerprint. Experimental results show that the proposed MAF method produces competitive results on the Chinese document dataset.
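
The assign-and-aggregate step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how sentence features are encoded, how the feature space is partitioned, or how features are aggregated, so the sketch assumes that sentence features are small numeric vectors, that partitions are represented by precomputed centroids, and that aggregation is a normalized count of features per partition. The names `nearest_partition`, `aggregate_fingerprint`, and `similarity`, and all data values, are hypothetical.

```python
import math

def nearest_partition(feature, centroids):
    """Return the index of the centroid closest to `feature` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )

def aggregate_fingerprint(sentence_features, centroids):
    """Count sentence features per partition, then L2-normalize the counts.

    Assumption: a count histogram stands in for the paper's aggregation,
    which the abstract does not describe in detail.
    """
    counts = [0.0] * len(centroids)
    for feature in sentence_features:
        counts[nearest_partition(feature, centroids)] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts

def similarity(fp_a, fp_b):
    """Cosine similarity of two L2-normalized fingerprints."""
    return sum(a * b for a, b in zip(fp_a, fp_b))

# Hypothetical partition centroids, e.g. obtained by clustering all sentence
# features in the dataset beforehand.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Hypothetical sentence features for three documents.
doc_a = [(0.1, 0.0), (0.9, 1.1), (5.2, 4.9)]
doc_b = [(0.0, 0.2), (1.1, 0.9), (4.8, 5.1)]  # slight perturbation of doc_a
doc_c = [(5.0, 5.1), (4.9, 5.0), (5.1, 4.8)]  # unrelated content

fp_a = aggregate_fingerprint(doc_a, centroids)
fp_b = aggregate_fingerprint(doc_b, centroids)
fp_c = aggregate_fingerprint(doc_c, centroids)

print(similarity(fp_a, fp_b))  # near-duplicate pair scores close to 1
print(similarity(fp_a, fp_c))  # unrelated pair scores lower
```

Under these assumptions each document reduces to one fixed-length vector, so comparing two documents costs a single dot product regardless of how many sentences they contain.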
