Abstract

A focused crawler is topic-specific: it selectively collects web pages from the Internet that are relevant to a given topic. However, the performance of current focused crawlers is easily degraded by the surrounding environment of web pages and by pages that cover multiple topics. During crawling, a highly relevant region of a page may be ignored because the overall relevance of that page is low, and anchor text or link-context may mislead the crawler. To address these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) to identify highly relevant web pages. Second, we introduce a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance of the URLs on a web page to the given topic. The experimental results show that the classifier using ITFIDF outperforms the one using TFIDF, and that our focused crawler outperforms focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies in terms of harvest rate and target recall. We conclude that the proposed methods are effective for focused crawling.

Highlights

  • With the rapid growth of network information, the Internet has become the greatest information base

  • An experiment was designed to show that the proposed web page classification method and the link priority evaluation (LPE) algorithm can improve the performance of focused crawlers

  • We present a novel focused crawler that improves collection performance by using the web page classifier and the link priority evaluation algorithm

Summary

Introduction

With the rapid growth of network information, the Internet has become the greatest information base. The first important task for research that builds on this information is to collect relevant information from the Internet, namely, by crawling web pages. Focused crawlers have become increasingly important for gathering information from web pages with finite resources and have been used in a variety of applications such as search engines, information extraction, digital libraries, and text classification. Classifying web pages and selecting URLs are the two most important steps of a focused crawler. We assign different weights to different sections of a page according to how well they express the page content. Most weighting methods are based on link features [8, 9], which include the current page, anchor text, link-context, and URL string.
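
To make the idea of scoring links by their features concrete, the following is a minimal Python sketch, not the paper's LPE or JFE algorithm. It assumes plain term-frequency vectors and cosine similarity as the relevance measure, and it combines the four link features named above (current page, anchor text, link-context, URL string) with a weighted sum. The weights in FEATURE_WEIGHTS and the names term_vector, cosine, link_priority, and Frontier are hypothetical and chosen only for illustration.

```python
import heapq
import math
import re
from collections import Counter

# Hypothetical weights for the four link features; not taken from the paper.
FEATURE_WEIGHTS = {"page": 0.2, "anchor": 0.4, "context": 0.3, "url": 0.1}


def term_vector(text):
    """Build a lowercase bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def link_priority(topic, features):
    """Weighted sum of topic similarity over the available link features.

    `features` maps a feature name ("page", "anchor", "context", "url")
    to its text; missing features contribute zero.
    """
    topic_vec = term_vector(topic)
    return sum(
        weight * cosine(topic_vec, term_vector(features.get(name, "")))
        for name, weight in FEATURE_WEIGHTS.items()
    )


class Frontier:
    """Priority queue of unvisited URLs; highest priority is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

In such a sketch, a crawler loop would pop the highest-priority URL from the frontier, fetch and classify the page, and push each outgoing link with the score returned by link_priority. A weighted sum is only one simple way to combine the features; the paper's own combination strategy is described in its Link Priority Evaluation section.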

Related Work
Web Page Classification
Link Priority Evaluation
Improved Focused Crawler
Experimental Results and Discussion
Evaluate the Performance of Web Page Classifier
Evaluate the Performance of Focused Crawler
Conclusions
