Abstract
The growing volume of information on the Web makes domain-specific information hard to retrieve, both because of the huge number of data sources and because keywords carry few features. Anchor texts, which contain a few features of a specific topic, play an important role in domain-specific information retrieval, especially in Web page classification. However, the features contained in anchor texts are not informative enough on their own, and directly applying the anchor-text vector to a classifier may give poor results because of the limited number of features. This paper presents a novel incremental method for Web page classification enhanced by link-contexts and clustering. First, link-context is used to obtain the contextual information of the anchor text. Then, a hierarchical clustering method is introduced to cluster feature vectors and content units, which increases the length of a feature vector belonging to one specific class. Finally, an incremental SVM is proposed to obtain the final classifier and to improve its accuracy and efficiency. Experimental results show that the proposed method outperforms the conventional topical Web crawler in harvest rate and target recall.
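To make the link-context step concrete, the following is a minimal sketch of how the words surrounding an anchor might be collected. The abstract does not say how link-context is delimited, so the fixed word window used here is an assumption, and the names `LinkContextParser`, `link_contexts`, and `window` are hypothetical, not taken from the paper.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page tokens and records the token span of each anchor text."""

    def __init__(self):
        super().__init__()
        self.tokens = []    # all lower-cased words on the page, in order
        self.anchors = []   # (href, start, end) token spans of anchor texts
        self._open = None   # (href, start) of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        self.tokens.extend(data.lower().split())


def link_contexts(html, window=5):
    """Return, per link, the anchor words plus `window` words on each side.

    Assumption: a fixed-size word window stands in for the link-context;
    the paper may delimit context differently.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.anchors:
        lo = max(0, start - window)
        hi = min(len(parser.tokens), end + window)
        contexts[href] = parser.tokens[lo:hi]
    return contexts
```

Under this assumption, a two-word anchor text with `window=5` yields up to twelve words instead of two, which illustrates why adding context enlarges the otherwise sparse anchor-text feature vector before clustering and classification.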
More From: International Journal of Software Engineering and Knowledge Engineering