Abstract

Focused crawlers have as their main goal to crawl Web pages that are relevant to a specific topic or user interest, playing an important role for a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed as related to an implicitly declared topic. However, users are often not simply interested in any document about a topic, but instead they may want only documents of a given type or genre on that topic to be retrieved. In this article, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed to address situations in which the specific topic of interest can be expressed by specifying two sets of terms, the first describing genre aspects of the desired pages and the second related to the subject or content of these pages, thus requiring no training or any kind of preprocessing. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipments. These experiments show that focused crawlers constructed according to our genre-aware approach achieve levels of F1 superior to 88%, requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert or a set of terms specified by a typical user familiar with the topic is usually enough to produce good results and that such a semi-automatic strategy is very effective in supporting the task of selecting the sets of terms required to guide a crawling process.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.