Abstract
Topic relevance of pages and hyperlinks is the key issue in focused crawling. This paper proposes an improved topic relevance algorithm for focused crawling. First, we implement a prototype focused crawler, a topic-specific news gathering system, to support comparative experiments on different similarity measures applied to anchor text. Second, experiments on a Chinese text corpus show that LSI (Latent Semantic Indexing) outperforms TF-IDF (term frequency-inverse document frequency) for both hyperlink topic relevance prediction and page topic relevance calculation. Third, in real crawling experiments on the prototype system, the crawler using TF-IDF performs well early on, with accumulated topic relevance increasing quickly at the beginning of the crawl, whereas the crawler using LSI finds more related pages and can tunnel through less relevant pages to reach them. Fourth, combining the advantages of LSI and TF-IDF, we propose the TFIDF+LSI algorithm to guide crawling. Finally, the crawler using TFIDF+LSI performs the same crawl task and demonstrates the combined advantages of TF-IDF and LSI. The experiments suggest that the crawler using TFIDF+LSI greatly outperforms the crawler using either TF-IDF or LSI alone.
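To make the TF-IDF side of the comparison concrete, the following is a minimal sketch of scoring anchor texts against a topic description by cosine similarity of TF-IDF vectors. It is not the paper's implementation: the function names, the smoothed IDF formula, the whitespace tokenizer, and the English sample strings are all assumptions made for illustration (a Chinese corpus would need a word segmenter instead). The LSI step and the TFIDF+LSI combination are not shown, since the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need a segmenter.
    return text.lower().split()


def build_idf(docs):
    # Smoothed inverse document frequency over a small reference collection.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse TF-IDF vector as a dict; terms unknown to the IDF table are dropped.
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic description and anchor texts.
topic = "stock market finance news"
anchors = [
    "latest stock market report",
    "football match results",
    "finance and banking news",
]

idf = build_idf([topic] + anchors)
topic_vec = tfidf_vector(topic, idf)
scores = [cosine(tfidf_vector(a, idf), topic_vec) for a in anchors]

# Links with higher scores would be crawled first.
ranked = sorted(zip(anchors, scores), key=lambda pair: pair[1], reverse=True)
for anchor, score in ranked:
    print(f"{score:.3f}  {anchor}")
```

An anchor text that shares no terms with the topic scores exactly zero under this measure. A pure term-matching crawler would therefore never follow such a link, which is consistent with the abstract's observation that the LSI-based crawler finds more related pages.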