Detection and elimination of similar Web pages based on text structure and string of feature code

Zhongyang Xiong, Man Ya, Yufang Zhang

doi:10.3724/sp.j.1087.2013.00554

Journal: Journal of Computer Applications

Publication Date: Sep 24, 2013

#Detection Of Web Pages #Web Pages

Abstract

To reduce interference from duplicated Web pages and to improve the efficiency of detecting and eliminating similar Web pages, a new detection algorithm for large-scale Web page collections is proposed. First, using the Web label values, the algorithm builds text structure trees so that fingerprint similarity can be calculated layer by layer. Second, the head and tail words of sentences in which high-frequency punctuation marks occur are extracted as the feature code. Finally, the fingerprint similarity of the Web page features is determined with a Bloom filter. The experimental results show that the algorithm raises the recall rate to more than 90% and reduces the time complexity to O(n).
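
The feature-code and Bloom filter steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no parameters, so the filter size, the number of hash functions, the punctuation set, the `head|tail` code format, and the names `BloomFilter`, `feature_codes`, and `similarity` are all assumptions made for illustration. The text structure tree step is omitted, and the sketch works on plain text rather than parsed Web pages.

```python
import hashlib
import re
from collections import Counter


class BloomFilter:
    """Minimal Bloom filter: k bit positions per item in an m-bit array.

    m_bits and k_hashes are illustrative defaults, not values from the paper.
    """

    def __init__(self, m_bits=1 << 16, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def feature_codes(text, punctuation=".,;!?"):
    """Return head|tail word pairs of segments ending in the most frequent mark.

    Assumption: "high-frequency punctuation" is read here as the single most
    frequent punctuation mark in the text.
    """
    counts = Counter(ch for ch in text if ch in punctuation)
    if not counts:
        return []
    mark = counts.most_common(1)[0][0]
    codes = []
    # Drop the last piece: it is not terminated by the chosen mark.
    for segment in text.split(mark)[:-1]:
        words = re.findall(r"\w+", segment)
        if words:
            codes.append(f"{words[0].lower()}|{words[-1].lower()}")
    return codes


def similarity(page_a, page_b):
    """Fraction of page_b's feature codes found in a Bloom filter built from page_a."""
    bloom = BloomFilter()
    for code in feature_codes(page_a):
        bloom.add(code)
    codes_b = feature_codes(page_b)
    if not codes_b:
        return 0.0
    # One pass over each page and constant-time lookups: linear in page length.
    return sum(code in bloom for code in codes_b) / len(codes_b)
```

In this sketch, two pages would be treated as near-duplicates when `similarity` exceeds some threshold. The threshold is a tuning choice that the abstract does not specify.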
