The Research of Web Page De-duplication Based on Web Pages Reshipment Statement

Min-Yan Wang,Dong-Sheng Liu

doi:10.1109/dbta.2009.64

Publication Date: Apr 1, 2009

Citations: 7

Affiliation: Zhejiang Gongshang University

#Duplicated Web Pages #Web Pages

Abstract

The web page de-duplication module is an important part of a search engine system: by filtering the pages downloaded by the search engine's crawler and eliminating duplicated pages, it improves the engine's performance and quality. Starting from reshipment (that is, reposting), the source of duplicated web pages, this paper proposes a de-duplication method that extracts information including the original website and the page title, and eliminates duplicated web pages based on feature codes. Experiments show that this method can achieve satisfactory results in eliminating large-scale duplicated web pages.
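
The abstract does not say how the reshipment statement is parsed or how the feature code is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the statement looks like "Source: example.com" or "Reprinted from: example.com", and that the feature code is an MD5 hash of the original website plus a normalized title. The pattern, the hash choice, and every function name here are hypothetical.

```python
import hashlib
import re
from urllib.parse import urlparse

# Assumed form of a reshipment statement; real pages will vary widely.
SOURCE_PATTERN = re.compile(
    r"\b(?:source|reprinted from)\s*[:：]?\s*([\w.-]+\.[a-z]{2,})",
    re.IGNORECASE,
)


def normalize_host(host):
    """Lowercase a host name and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def extract_source(text):
    """Return the original website named in a reshipment statement, or None."""
    match = SOURCE_PATTERN.search(text)
    return normalize_host(match.group(1)) if match else None


def normalize_title(title):
    """Lowercase the title and replace runs of non-word characters with a space."""
    return re.sub(r"\W+", " ", title.lower()).strip()


def feature_code(title, source):
    """Hash the (original website, title) pair into a fixed-length code."""
    key = f"{source}|{normalize_title(title)}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first page seen for each feature code.

    Each page is a dict with 'url', 'title', and 'text' keys. A page without
    a reshipment statement is assumed to be an original on its own host.
    """
    seen = set()
    unique = []
    for page in pages:
        source = extract_source(page["text"])
        if source is None:
            source = normalize_host(urlparse(page["url"]).netloc)
        code = feature_code(page["title"], source)
        if code not in seen:
            seen.add(code)
            unique.append(page)
    return unique
```

Under these assumptions, an original page and a repost that names the original's host and carries the same title (up to case and punctuation) map to the same feature code, so only the first one encountered is kept. Comparing fixed-length codes in a set keeps the per-page cost roughly constant, which is what makes this kind of approach plausible at large scale.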

Full Text