An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

Houqing Lu, Donghui Zhan, Lei Zhou, Dengchao He

doi:10.1155/2016/6406901

Open Access

https://doi.org/10.1155/2016/6406901

Abstract

A focused crawler is topic-specific: it aims to selectively collect web pages from the Internet that are relevant to a given topic. However, the performance of current focused crawlers is easily degraded by the context in which web pages appear and by pages that cover multiple topics. During crawling, a highly relevant region may be ignored because the page as a whole has low relevance, and anchor text or link-context may mislead the crawler. To address these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance of the URLs on a web page to the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms the one using TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
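The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions (harvest rate as the fraction of crawled pages that are relevant, target recall as the fraction of a known target set that the crawl reached); the function names and the `is_relevant` callback are hypothetical and not taken from the paper.

```python
def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages that are judged relevant.

    `is_relevant` is a hypothetical callback (e.g. a page classifier).
    """
    crawled = list(crawled)
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if is_relevant(url)) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known set of target pages that the crawl reached."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)


# Toy data: four crawled pages, two of them relevant.
crawled_pages = ["a", "b", "c", "d"]
relevant_pages = {"a", "c"}
print(harvest_rate(crawled_pages, lambda u: u in relevant_pages))
print(target_recall(crawled_pages, {"a", "x"}))
```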

Highlights

With the rapid growth of network information, the Internet has become the greatest information base
An experiment was designed to indicate that the proposed method of web page classification and the algorithm of link priority evaluation (LPE) can improve the performance of focused crawlers
We presented a novel focused crawler which increases the collection performance by using the web page classifier and the link priority evaluation algorithm

Summary

Introduction

With the rapid growth of network information, the Internet has become the greatest information base. For research that draws on this information, the first important task is to collect relevant information from the Internet, namely, crawling web pages. Focused crawlers have become increasingly important for gathering information from web pages with finite resources and have been used in a variety of applications such as search engines, information extraction, digital libraries, and text classification. Classifying web pages and selecting URLs are the two most important steps of a focused crawler. We assign different weights to different sections of a page according to how well they express the page content. Most link weighting methods are based on link features [8, 9], which include the current page, anchor text, link-context, and URL string.
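As a rough illustration of weighting link features, the sketch below scores a link by a weighted sum of the similarity between the topic and each of the four features listed above. It is not the paper's LPE or JFE algorithm: the weights, the bag-of-words cosine similarity, and all function names are assumptions chosen only to make the idea concrete.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical feature weights; the paper's actual values are not given here.
WEIGHTS = {"page": 0.2, "anchor": 0.4, "context": 0.3, "url": 0.1}


def link_priority(topic, page_text, anchor_text, link_context, url,
                  weights=WEIGHTS):
    """Weighted sum of topic similarity over the four link features."""
    topic_terms = tokenize(topic)
    features = {
        "page": page_text,
        "anchor": anchor_text,
        "context": link_context,
        "url": url,
    }
    return sum(weights[name] * cosine(topic_terms, tokenize(text))
               for name, text in features.items())
```

A crawler could use such a score as the key of a priority queue so that links whose anchor text, context, or URL string match the topic are fetched first, even when the page containing them has low overall relevance.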

Related Work

Web Page Classification

Link Priority Evaluation

Improved Focused Crawler

Experimental Results and Discussion

Evaluate the Performance of Web Page Classifier

Evaluate the Performance of Focused Crawler

Conclusions

Journal: Mathematical Problems in Engineering
Publication Date: Jan 1, 2016
Citations: 39
License type: CC BY 4.0
