Improved Web page clustering algorithm based on partial tag tree matching

Rui Li, Si-Wang Zhou, Jun-Yu Zeng

doi:10.3724/sp.j.1087.2010.00818

Abstract

In Web information extraction, the pages of a target website must be clustered so that templates can be detected and generated for extracting the required information. The traditional page clustering algorithm based on Document Object Model (DOM) tree edit distance is not suitable for pages with complex DOM tree structures generated from dynamic templates. This paper proposes an improved Web page clustering algorithm based on partial tag tree matching. The algorithm assigns weights to nodes according to their effect on the page layout and the level difference between template nodes and non-template nodes. The structural similarity between Web pages is then computed efficiently using a partial tree matching approach. Experimental results show that, compared with traditional algorithms, the proposed algorithm achieves higher accuracy in clustering dynamic Web pages and lower computational complexity.
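To make the idea of weighted partial matching concrete, the following is a minimal sketch and not the authors' algorithm, whose details the abstract does not give. It assumes a Simple Tree Matching style dynamic program over ordered children, a hypothetical table of layout weights (`LAYOUT_WEIGHT`), a hypothetical per-level decay (`DEPTH_DECAY`) so that shallow nodes, which are more likely to belong to the template, count more, and a hypothetical `max_depth` cutoff that restricts matching to the upper part of the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tag tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


# Assumed values for illustration only; the paper's actual weights are not given here.
LAYOUT_WEIGHT = {"table": 3.0, "div": 3.0, "ul": 2.0, "tr": 2.0, "td": 1.5, "li": 1.5}
DEPTH_DECAY = 0.5  # each level down halves a node's contribution


def node_weight(node: Node, depth: int) -> float:
    """Weight of a single node: layout importance scaled by its level."""
    return LAYOUT_WEIGHT.get(node.tag, 1.0) * (DEPTH_DECAY ** depth)


def tree_weight(node: Node, depth: int = 0, max_depth: Optional[int] = None) -> float:
    """Total weight of the tree, ignoring nodes deeper than max_depth."""
    if max_depth is not None and depth > max_depth:
        return 0.0
    return node_weight(node, depth) + sum(
        tree_weight(child, depth + 1, max_depth) for child in node.children
    )


def matched_weight(a: Node, b: Node, depth: int = 0,
                   max_depth: Optional[int] = None) -> float:
    """Weight of the best order-preserving match between two subtrees."""
    if max_depth is not None and depth > max_depth:
        return 0.0
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matched weight using the first i children of a
    # and the first j children of b.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = matched_weight(a.children[i - 1], b.children[j - 1],
                                  depth + 1, max_depth)
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + pair)
    return node_weight(a, depth) + table[m][n]


def similarity(a: Node, b: Node, max_depth: Optional[int] = None) -> float:
    """Normalized structural similarity in [0, 1]."""
    total = tree_weight(a, 0, max_depth) + tree_weight(b, 0, max_depth)
    return 2.0 * matched_weight(a, b, 0, max_depth) / total


# Two pages sharing a template but differing in repeated content items.
page_a = Node("html", [Node("body", [Node("div", [Node("p"), Node("p")]), Node("table")])])
page_b = Node("html", [Node("body", [Node("div", [Node("p")] * 3), Node("table")])])

full = similarity(page_a, page_b)                 # whole trees compared
partial = similarity(page_a, page_b, max_depth=2) # only the top three levels
```

Under these assumptions, the two example pages score below 1 when the whole trees are compared, and the `max_depth=2` comparison sees only the shared upper structure. A clustering step could then group pages whose pairwise similarity exceeds a chosen threshold.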
