Abstract
The Web is a large collection of data that grows daily, so an automatic Web page classification mechanism is needed to reach useful information effectively. Because the majority of Web pages are HTML documents, this study explores the effect of HTML tags on the classification process and tries to determine the most valuable HTML tags for feature extraction in the classification task. To this end, we employ 13 different datasets and 5 popular classifiers: SVM, naive Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that the features extracted by using solely the anchor, or tags can be used as an alternative to the features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using the HTML tags for feature extraction improves classification accuracy.
Highlights
The Web is a large collection of documents of various kinds
The results indicated that naive Bayes (NB) is a good choice for Web page classification when the combination of HTML tag and term is used as a feature
In this study we used both the HTML tags and the stemmed terms that belong to each tag, and all the terms from the Web pages as classification features
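As a rough illustration of the two feature sets mentioned above (terms tied to the HTML tag that encloses them, and all terms from the page), the following minimal sketch uses Python's standard `html.parser`. The names `TagTermExtractor` and `extract_features`, the `tag:term` feature format, and the simple lowercase tokenizer are assumptions made for this example and are not taken from the study; stemming, which the study applies, is omitted here.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTermExtractor(HTMLParser):
    """Collect tag-term features and whole-page term features from one page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []         # stack of currently open tags
        self.tag_terms = Counter()  # e.g. "a:classification" -> count
        self.terms = Counter()      # every term on the page -> count

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Attribute each term to the innermost open tag (assumption of this sketch).
        tag = self.open_tags[-1] if self.open_tags else "body"
        for term in re.findall(r"[a-z]+", data.lower()):
            self.terms[term] += 1
            self.tag_terms[f"{tag}:{term}"] += 1


def extract_features(html):
    """Return (tag_term_counts, term_counts) for one HTML document."""
    parser = TagTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.tag_terms, parser.terms


page = (
    "<html><head><title>Web Mining</title></head><body>"
    "<h1>Page classification</h1>"
    '<p>See <a href="x.html">classification tags</a></p>'
    "</body></html>"
)
tag_terms, terms = extract_features(page)
```

Restricting `tag_terms` to keys with a given prefix (for example `a:` for anchor text) would give a feature set built from a single tag, which could then be compared against `terms` from the whole page with any of the classifiers listed in the abstract.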
Summary
The Web is a large collection of documents of various kinds. Many people use the Internet to find and gather information on certain topics. It is not easy to reach the desired information by using standard search engines. Possible reasons for this problem are [1]: 1. The number of Web pages is increasing exponentially, so it is difficult to keep the index of search engines up to date. 2. When a user seeks information on a search engine, too many irrelevant pages containing the search terms are presented.
More From: Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi