Abstract

With the exponential growth in both the amount and the diversity of information on the Web, automatic classification of topic-specific Web sites is highly desirable. We propose a novel approach to Web site classification based on the content, structure and context information of Web sites. In our approach, the site structure is represented as a two-layered tree: each page is modeled as a DOM (Document Object Model) tree, and a site tree hierarchically links all pages within the site. Two context models are presented to capture the topic dependencies within the site. The hidden Markov tree (HMT) model is then used as the statistical model of both the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, to reduce the download size of Web sites while still keeping high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On this basis, we employ a two-phase classification system that classifies Web sites through a fine-to-coarse recursion. Experiments show that our approach offers high accuracy and efficient processing.
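To make the two-layered representation and the pruning idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the paper: the names `DomNode`, `PageNode`, `entropy` and `prune` are hypothetical, the per-page topic distribution is assumed to be given, and the pruning rule (drop any child page whose topic distribution has entropy above a threshold, together with its subtree) is a simple stand-in for the entropy-based criterion, which the abstract does not specify.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DomNode:
    """Lower layer: one node of a page's DOM tree (hypothetical, simplified)."""
    tag: str
    text: str = ""
    children: List["DomNode"] = field(default_factory=list)


@dataclass
class PageNode:
    """Upper layer: one page in the site tree, holding its own DOM tree."""
    url: str
    topic_probs: Dict[str, float]      # assumed given, e.g. by a page classifier
    dom: Optional[DomNode] = None
    children: List["PageNode"] = field(default_factory=list)


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy in bits of a discrete topic distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def prune(node: PageNode, threshold: float) -> PageNode:
    """Return a copy of the site tree without high-entropy child subtrees.

    Assumption: a page whose topic distribution is close to uniform is
    treated as uninformative, so it and its descendants are skipped.
    """
    kept = [
        prune(child, threshold)
        for child in node.children
        if entropy(child.topic_probs) <= threshold
    ]
    return PageNode(node.url, node.topic_probs, node.dom, kept)


# Usage: a three-page site where one child page is topically ambiguous.
site = PageNode(
    "/",
    {"sports": 0.9, "news": 0.1},
    dom=DomNode("html", children=[DomNode("body", "Welcome")]),
    children=[
        PageNode("/misc", {"sports": 0.5, "news": 0.5}),
        PageNode("/scores", {"sports": 1.0}),
    ],
)
pruned = prune(site, threshold=0.8)
```

In a crawler, a rule of this kind would be applied before following links out of a page, so that the pruned subtrees are never downloaded. Here it is applied to an already built tree only to keep the sketch short.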
