Abstract

Near-duplicate web pages are pages that differ only slightly in content. They arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents. Identifying similar or near-duplicate pages in a large collection is a significant problem with widespread applications. We propose a Term Document Weight (TDW) matrix based algorithm for finding near duplicates of an input web page in a huge repository, organized in four phases: preprocessing, feature weighting, filtering, and verification. In the first phase, the system receives an input web page and a similarity threshold and performs preprocessing operations on the page. In the second phase, feature weights are calculated using Analytic Combination Criteria (ACC). In the third phase, prefix and positional filtering are performed to reduce the number of candidate records. Finally, the verification phase computes similarity using the Minimum Weight Overlapping (MWO) method and returns an optimal set of near-duplicate web pages.
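The abstract names the four phases but does not give the ACC weighting or MWO similarity formulas, so the sketch below is only a minimal illustration of the pipeline's shape. It substitutes normalized term frequency for ACC, a weighted-overlap score for MWO, and a simple high-weight prefix filter in place of the paper's prefix and positional filters; all function names, the stopword list, and the default threshold are illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter

def preprocess(html_text):
    """Phase 1 (sketch): strip markup, lowercase, tokenize, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # remove HTML tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude word tokenizer
    stopwords = {"the", "a", "an", "of", "and", "to", "in", "is"}
    return [t for t in tokens if t not in stopwords]

def feature_weights(tokens):
    """Phase 2 (stand-in for ACC): term frequency normalized to sum to 1.
    The paper's actual ACC formula is not given in the abstract."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def prefix(weights, threshold):
    """Phase 3 helper: keep the highest-weight terms until their cumulative
    weight exceeds 1 - threshold. Pages whose prefixes share no term are
    unlikely to meet the threshold, so the pair can be pruned cheaply.
    (A heuristic; the paper also applies positional filtering, omitted here.)"""
    ordered = sorted(weights.items(), key=lambda kv: -kv[1])
    kept, acc = [], 0.0
    for term, w in ordered:
        kept.append(term)
        acc += w
        if acc > 1.0 - threshold:
            break
    return set(kept)

def mwo_similarity(w1, w2):
    """Phase 4 (stand-in for MWO): sum the smaller of the two weights over
    shared terms, a weighted-overlap style score in [0, 1]."""
    shared = w1.keys() & w2.keys()
    return sum(min(w1[t], w2[t]) for t in shared)

def near_duplicates(query_page, repository, threshold=0.8):
    """Run all four phases for one query page against a page repository."""
    qw = feature_weights(preprocess(query_page))
    qp = prefix(qw, threshold)
    results = []
    for page in repository:
        pw = feature_weights(preprocess(page))
        if not qp & prefix(pw, threshold):  # filtering phase prunes the pair
            continue
        if mwo_similarity(qw, pw) >= threshold:  # verification phase
            results.append(page)
    return results
```

As a usage example, `near_duplicates("<p>cheap flights to paris</p>", ["<p>cheap flights paris</p>", "<p>gardening tips</p>"])` would return only the first repository page, since the filter discards the second pair before any similarity is computed.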
