Abstract

Web templates are layouts for webpages that enable rapid and easy access to web content. Web data integration solutions use template-based wrapper tools to extract product information from e-commerce websites. Given a collection of webpages, wrapper tools discover the template portion of each page and extract data from it. These wrapper-based data extraction techniques require that pages created with the same template belong to the same cluster, and clustering webpages by template is a significant challenge. Although algorithms for template-based clustering exist, they are too computationally intensive to be applied at web scale. This work presents a highly scalable method for clustering template-generated webpages by examining the DOM tree paths of the URLs on each page. Locality-sensitive hashing (LSH) is further used to reduce the cost of clustering. When tested on three separate real-time data sets, the proposed technique is found to be more precise and cost-effective than the existing baseline methods.
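
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each page is represented by the set of root-to-node tag paths of its `<a href>` elements, and MinHash with banding (one common LSH scheme for Jaccard similarity) groups pages whose path sets look alike. The path representation, the hash family, the parameters `num_hashes` and `bands`, and all function names are hypothetical and may differ from what the authors actually use.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Collects the root-to-node tag path of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def link_paths(html):
    """Return the set of DOM tag paths that lead to links in the page."""
    parser = LinkPathParser()
    parser.feed(html)
    parser.close()
    return parser.paths


def _hash(seed, token):
    # Seeded 64-bit hash; each seed stands in for one MinHash permutation.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(paths, num_hashes=32):
    """MinHash signature of a path set (assumed hash family, see lead-in)."""
    return tuple(
        min((_hash(seed, p) for p in paths), default=0)
        for seed in range(num_hashes)
    )


def lsh_buckets(signatures, bands=8):
    """Split each signature into bands; pages sharing a band share a bucket."""
    buckets = defaultdict(set)
    for page_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(page_id)
    return buckets


def candidate_pairs(buckets):
    """Pairs of pages that collide in at least one band."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs
```

Only pages that collide in at least one band need to be compared directly, which is where the cost saving over all-pairs comparison comes from. A full pipeline would still have to verify candidate pairs and merge them into clusters; that step is omitted here.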

