Abstract

The domain name field in a uniform resource locator (URL) has been viewed as a natural choice for organizing Web pages. For example, Web search results may be grouped by domain and presented to users as clusters for ease of visualization. However, with this approach, large Web sites such as Geocities, W3C, and www.cs.umd.edu tend to yield many matches, which leads to a few large, flat-structured, and unorganized clusters. In fact, many pages in these sites are actually “logical domains” by themselves. For example, Web sites for projects at a university or the XML section at W3C could be viewed as “logical domains”. In this paper, we propose the concept of a logical domain, which is identified by semantic relatedness, as opposed to a physical domain, which is identified simply by domain name. The identification of logical domains is important to many Web applications, such as query result reorganization, site map generation, and topic distillation. We have developed and implemented a set of rules based on link structure, path information, document metadata, and citations to identify logical domain entry pages (i.e., root pages of logical domains). The importance of these rules is automatically adjusted using a novel decision tree algorithm and training data provided by human feedback. We also develop techniques to define the boundary of each logical domain based on the identified logical domain entry pages. We have conducted extensive experiments on real Web sites to evaluate the effectiveness of our proposed techniques. The experimental results show that our techniques perform very well in extracting logical domains in a Web site.
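To make the contrast between physical and logical domains concrete, the following is a minimal sketch, not the method of the paper. It assumes the entry pages have already been identified, and it uses a hypothetical boundary rule chosen only for illustration: a page belongs to the deepest entry page on the same host whose directory contains it. The function names and the `example.edu` URLs are likewise illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def physical_domain(url):
    """Grouping key of the conventional approach: the host name only."""
    return urlsplit(url).netloc.lower()


def logical_domain(url, entry_pages):
    """Return the entry page that 'owns' url, or None if none matches.

    Hypothetical rule (an assumption, not the paper's technique): pick the
    entry page on the same host whose directory is the longest prefix of
    the page's path.
    """
    page = urlsplit(url)
    best, best_len = None, -1
    for entry in entry_pages:
        e = urlsplit(entry)
        if e.netloc.lower() != page.netloc.lower():
            continue
        # Directory containing the entry page, always ending in "/".
        directory = e.path.rsplit("/", 1)[0] + "/"
        if page.path.startswith(directory) and len(directory) > best_len:
            best, best_len = entry, len(directory)
    return best


def group_pages(urls, entry_pages):
    """Group the same pages by physical domain and by logical domain."""
    by_physical = defaultdict(list)
    by_logical = defaultdict(list)
    for url in urls:
        by_physical[physical_domain(url)].append(url)
        by_logical[logical_domain(url, entry_pages)].append(url)
    return dict(by_physical), dict(by_logical)


# Illustrative data only: one site with a project section of its own.
entry_pages = [
    "http://www.example.edu/",
    "http://www.example.edu/projects/alpha/index.html",
]
urls = [
    "http://www.example.edu/people.html",
    "http://www.example.edu/projects/alpha/papers.html",
    "http://www.example.edu/projects/alpha/demo/run.html",
]
by_physical, by_logical = group_pages(urls, entry_pages)
```

With this data, grouping by physical domain puts all three pages into a single cluster, while grouping by logical domain separates the project pages from the rest of the site. A path-prefix rule is only one possible boundary definition; the rules described above also draw on link structure, metadata, and citations, which this sketch does not model.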

