Abstract

To reduce interference from duplicated Web pages and to improve the efficiency of detecting and eliminating similar Web pages, a new detection algorithm for large-scale Web page collections is proposed. First, using the Web label values, the algorithm builds text structure trees so that fingerprint similarity can be computed layer by layer. Second, the head and tail words of sentences in which high-frequency punctuation marks occur are extracted as the feature code. Finally, the fingerprint similarity of the Web page features is determined with a Bloom filter. Experimental results show that the algorithm raises the recall rate to more than 90% and reduces the time complexity to O(n).
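
The second and third steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `BloomFilter`, `feature_codes`, and `overlap_ratio` are hypothetical, the punctuation set is a fixed assumption rather than one derived from frequency statistics, and the text structure tree step is omitted. The sketch extracts the head and tail word of each punctuation-delimited segment as a feature code and checks each code against a Bloom filter of previously seen codes. Each lookup takes constant time, so scanning a page is linear in its number of feature codes.

```python
import hashlib
import re


class BloomFilter:
    """Fixed-size Bloom filter; the sizes below are illustrative assumptions."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def feature_codes(text, punctuation=".!?;,"):
    """Return 'head|tail' word pairs for each punctuation-delimited segment."""
    pattern = "[" + re.escape(punctuation) + "]"
    codes = []
    for segment in re.split(pattern, text):
        words = segment.split()
        if words:
            codes.append(f"{words[0].lower()}|{words[-1].lower()}")
    return codes


def overlap_ratio(codes, seen):
    """Fraction of a page's feature codes already present in the filter."""
    if not codes:
        return 0.0
    return sum(1 for code in codes if code in seen) / len(codes)


seen = BloomFilter()
page_a = "The quick fox jumps. A lazy dog sleeps, then wakes."
for code in feature_codes(page_a):
    seen.add(code)

duplicate_score = overlap_ratio(feature_codes(page_a), seen)
```

A page whose ratio exceeds a chosen threshold would be flagged as a likely duplicate; the threshold itself is a tuning parameter that the abstract does not specify.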
