Abstract

Today more and more information on the Web makes it difficult to get domain-specific information due to the huge amount of data sources and the keywords that have few features. Anchor texts, which contain a few features of a specific topic, play an important role in domain-specific information retrieval, especially in Web page classification. However, the features contained in anchor texts are not informative enough. This paper presents a novel incremental method for Web page classification enhanced by link-contexts and clustering. Directly applying the vector of anchor text to a classifier might not get a good result because of the limited amount of features. Link-context is used first to obtain the contextual information of the anchor text. Then, a hierarchical clustering method is introduced to cluster feature vectors and content unit, which increases the length of a feature vector belonging to one specific class. Finally, incremental SVM is proposed to get the final classifier and increase the accuracy and efficiency of a classifier. Experimental results show that the performance of our proposed method outperforms the conventional topical Web crawler in Harvest rate and Target recall.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.