Clustering-Based Topical Web Crawling for Topic-Specific Information Retrieval Guided by Incremental Classifier

Tao Peng, Lu Liu

doi:10.1142/s0218194015500011

Abstract

The growing volume of information on the Web makes domain-specific information hard to find, both because of the huge number of data sources and because keyword queries carry few features. Anchor texts, which contain a few features of a specific topic, play an important role in domain-specific information retrieval, especially in Web page classification. However, the features contained in anchor texts are not informative enough, so applying the anchor-text vector directly to a classifier may give poor results. This paper presents a novel incremental method for Web page classification enhanced by link-contexts and clustering. Link-context is first used to obtain the contextual information of the anchor text. A hierarchical clustering method is then introduced to cluster feature vectors and content units, which increases the length of a feature vector belonging to one specific class. Finally, an incremental SVM is proposed to obtain the final classifier and to improve its accuracy and efficiency. Experimental results show that the proposed method outperforms the conventional topical Web crawler in harvest rate and target recall.
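The abstract names harvest rate and target recall without defining them. The sketch below uses the definitions common in the topical-crawling literature, which is an assumption here since the abstract does not spell them out: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a known target set that the crawl has reached. The function names and the toy URLs are hypothetical and are not taken from the paper.

```python
def harvest_rate(crawled, relevant):
    """Fraction of crawled pages judged relevant to the topic.

    Assumed definition: (relevant pages crawled) / (pages crawled).
    """
    crawled = list(crawled)
    if not crawled:
        return 0.0
    hits = sum(1 for url in crawled if url in relevant)
    return hits / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known target set that the crawl has reached.

    Assumed definition: |targets ∩ crawled| / |targets|.
    """
    targets = set(targets)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)


# Toy data (hypothetical): four crawled pages, two of them relevant,
# and a target set of two pages of which one was reached.
crawled_pages = ["a", "b", "c", "d"]
relevant_pages = {"a", "c", "x"}
target_pages = {"a", "x"}

hr = harvest_rate(crawled_pages, relevant_pages)
tr = target_recall(crawled_pages, target_pages)
```

In this toy run both values come out to 0.5. In practice the relevance judgment would come from a classifier or a labeled set rather than a hand-written collection of URLs.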

Journal: International Journal of Software Engineering and Knowledge Engineering
Publication Date: Feb 1, 2015
Citations: 4

Similar Papers

Neural networks for web page classification based on augmented PCA
A Selamat ... S Omatu
20 Jul 2003

Web news classification using neural networks based on PCA
A Selamat ... S Omatu
05 Aug 2002

Clustering-based topical Web crawling using CFu-tree guided by link-context
Lu Liu ... Tao Peng
Frontiers of Computer Science | VOL. 8
26 May 2014

Web page feature selection and classification using neural networks
Ali Selamat
Information Sciences | VOL. 158
04 Sep 2003
