Abstract

Redundant web pages increase the indexing burden on search engines and degrade the user experience, so they need to be detected and removed. Most current near-replica detection algorithms depend on extracting the main content of web pages, but content extraction is costly and difficult, and extracting content correctly is becoming even harder as pages grow noisier. This paper addresses these issues in the following ways: it defines the largest number of common characters as the antisense concept of edit distance; it builds the feature string of a web page from the Chinese character immediately preceding each period in the lightly processed text; and it uses the largest number of common characters to compute an overlap factor between the feature strings of two pages. The method thus achieves near-replica detection in a high-noise environment while avoiding content extraction entirely. Experiments confirm the algorithm is efficient: the recall rate reaches 96.7% and the precision rate reaches 97.8%.
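The abstract only outlines the method, so the following is a minimal sketch of one plausible reading of it: build a feature string from the character before each sentence-ending period, count common characters with a longest-common-subsequence computation (the "antisense" of edit distance), and normalise that count into an overlap factor. The function names, the choice of separator, and the normalisation by the longer string are all assumptions, not the paper's exact definitions.

```python
def feature_string(text, sep="。"):
    # Collect the character immediately preceding each period
    # (assumed interpretation of the paper's feature string).
    chars = []
    for i, ch in enumerate(text):
        if ch == sep and i > 0:
            chars.append(text[i - 1])
    return "".join(chars)

def lcs_length(a, b):
    # Longest common subsequence length: the "largest number of
    # common characters", the antisense counterpart of edit distance.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def overlap_factor(fa, fb):
    # One plausible normalisation: common characters divided by the
    # length of the longer feature string; the paper may differ.
    if not fa or not fb:
        return 0.0
    return lcs_length(fa, fb) / max(len(fa), len(fb))
```

Two pages would then be flagged as near-replicas when their overlap factor exceeds a tuned threshold; the abstract does not state the threshold used in the experiments.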

