iSurfer: A Focused Web Crawler Based on Incremental Learning from Positive Samples

Yunming Ye, Yiming Lu, Fanyuan Ma, Matthew Chiu, Joshua Huang

doi:10.1007/978-3-540-24655-8_13

Abstract

This paper presents iSurfer, a focused Web crawling system for information retrieval from the Web. Unlike other focused crawlers, iSurfer learns its page classification model and its link prediction model incrementally. It employs an online sample detector to distill new samples from crawled Web pages and uses them to update the learned models online. Other focused crawling systems use classifiers that are built from initial positive and negative samples and cannot learn incrementally. The performance of these classifiers depends on the topical coverage of the initial positive and negative samples. However, initial samples with good coverage of the target topics, particularly negative ones, are difficult to find. iSurfer's incremental learning strategy therefore has an advantage: it starts from a few positive samples and gains more complete knowledge about the target topics over time. Our experiments on various topics demonstrate that the incremental learning method can improve the harvest rate with only a few initial samples.

Keywords: Focused Crawler, Incremental Learning, Positive Sample Based Learning, Link Prediction, Web Page Classification
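
The idea of an online sample detector that grows a model from a few positive samples can be illustrated with a small sketch. The abstract does not specify the classifier, the detector, or any thresholds, so everything below is an assumption made for illustration: the class name `PositiveCentroidModel`, the bag-of-words cosine score, and the `accept_threshold` parameter are hypothetical and are not taken from iSurfer itself.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class PositiveCentroidModel:
    """Hypothetical positive-only page model (an assumption, not iSurfer's classifier).

    It keeps accumulated term counts of all accepted positive pages and
    scores a new page by cosine similarity against those counts.
    """

    def __init__(self, seed_pages, accept_threshold=0.5):
        self.term_counts = Counter()
        self.accept_threshold = accept_threshold  # assumed detector cutoff
        self.num_samples = 0
        for page in seed_pages:
            self.add_sample(page)

    def add_sample(self, page_text):
        """Incremental update: fold one more positive page into the model."""
        self.term_counts.update(tokenize(page_text))
        self.num_samples += 1

    def score(self, page_text):
        """Cosine similarity between the page and the accumulated positives."""
        page_counts = Counter(tokenize(page_text))
        dot = sum(c * self.term_counts[t] for t, c in page_counts.items())
        norm_page = math.sqrt(sum(c * c for c in page_counts.values()))
        norm_model = math.sqrt(sum(c * c for c in self.term_counts.values()))
        if norm_page == 0 or norm_model == 0:
            return 0.0
        return dot / (norm_page * norm_model)

    def observe(self, page_text):
        """Stand-in for the online sample detector.

        A crawled page that scores at or above the threshold is treated as
        a new positive sample and used to update the model immediately.
        Returns True if the page was accepted.
        """
        if self.score(page_text) >= self.accept_threshold:
            self.add_sample(page_text)
            return True
        return False


# Usage: start from a single seed page and feed crawled pages one at a time.
model = PositiveCentroidModel(
    ["focused web crawler crawls topic pages"], accept_threshold=0.3
)
accepted = model.observe("a focused crawler for web pages")
rejected = model.observe("recipe for chocolate cake")
```

In this sketch no negative samples are ever needed: the model only widens its picture of the target topic as on-topic pages are accepted, which mirrors the motivation given in the abstract. A real system would also need a link prediction model to decide which URLs to fetch next, which is omitted here.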
