Clustering of Template-Generated Webpages Using DOM Tree Paths of URLs

Tanveer I Bagban, Prakash Jayant Kulkarni

doi:10.4018/ijsi.297994

Abstract

Web templates are layouts for webpages that enable rapid and easy access to web content. Web data integration solutions use template-based wrapper tools to extract product information from e-commerce websites. Given a collection of webpages, wrapper tools discover the template portion of a webpage and extract data from it. These wrapper-based data extraction techniques require that pages created with the same template belong to the same cluster, so clustering webpages by template is a significant challenge. Although algorithms for clustering webpages by template exist, they are too computationally intensive to be applied at web scale. The proposed work presents a highly scalable methodology for clustering template-generated webpages by examining the DOM tree paths of the URLs on each webpage. Further, the locality sensitive hashing (LSH) technique is used to reduce the cost of clustering. When tested on three separate real-time data sets, the proposed technique is found to be more precise and cost-effective than the existing baseline methods.
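The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how paths are represented or which LSH scheme is used, so the tag-only path format, the MinHash signatures with banding, and all names and parameters below (`link_paths`, `minhash`, `candidate_pairs`, `num_hashes`, `bands`) are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser
from itertools import combinations

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Collects the root-to-node tag path of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root first
        self.paths = set()  # e.g. {"html/body/div/ul/li/a"}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def link_paths(html):
    """Return the set of DOM tag paths that lead to links in the page."""
    parser = LinkPathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard(a, b):
    """Exact set similarity, used here only to check the LSH candidates."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash(paths, num_hashes=32):
    """MinHash signature (assumed scheme): one minimum per seeded hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(f"{seed}:{p}".encode(), digest_size=8).digest(),
                "big")
             for p in paths),
            default=0))  # pages without links all share signature 0
    return tuple(signature)


def candidate_pairs(pages, num_hashes=32, bands=8):
    """Band the signatures; pages sharing any band bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for page_id, html in pages.items():
        sig = minhash(link_paths(html), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(page_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Under these assumptions, `candidate_pairs` takes a dictionary mapping page identifiers to HTML strings and returns the pairs of pages whose signatures collide in at least one band. Only those pairs then need an exact comparison such as `jaccard`, which is how hashing avoids comparing every page with every other page.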

Full Text