Abstract

The immense number of documents published on the web requires automatic classifiers that allow organizing and retrieving information from these large resources. Typically, automatic web page classifiers handle millions of web pages, tens of thousands of features, and hundreds of categories. Most classifiers use the vector space model to represent the dataset of web pages, with the components of each vector computed using the term frequency-inverse document frequency (TFIDF) scheme. Unfortunately, TFIDF-based classifiers face the problem of large-scale input data, which leads to long processing times and increased resource demands. There is therefore an increasing demand to alleviate these problems by reducing the size of the input data without affecting the classification results. In this paper, we propose a novel approach that improves web page classifiers by reducing the size of the input data (i.e., reducing both web pages and features) using the hypertext induced topic search (HITS) algorithm, and by employing the HITS results to weight the remaining features. We evaluate the performance of the proposed approach by comparing it with a TFIDF-based classifier and demonstrate that our approach significantly reduces the time needed for classification.
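As context for the TFIDF baseline the abstract refers to, the following is a minimal sketch of how TFIDF weights might be computed for a collection of tokenized web pages. The function name and the exact tf/idf normalization are illustrative assumptions, not taken from the paper.

```python
# Minimal TF-IDF sketch (illustrative only; the paper's exact weighting
# variant is not specified here). Each web page is represented as a vector
# whose components are tf * idf values over the feature vocabulary.
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vec = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        vectors.append(vec)
    return vectors

# Example usage on a toy collection of three "pages"
docs = [["web", "page", "classification"],
        ["web", "mining", "hits"],
        ["page", "mining", "classification"]]
print(tfidf_vectors(docs))
```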

Highlights

  • The number of pages published on the World Wide Web is estimated to be in the billions

  • We have found that support vector machine (SVM) web page classifiers that use the term frequency-inverse document frequency (TFIDF) scheme have several disadvantages (see Section 5)

  • In the fifth step of this approach, we propose to calculate the weights using the authority vector, which is the output of the hypertext induced topic search (HITS) algorithm, instead of using the TFIDF scheme (see the sketch after this list)
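As a rough illustration of the HITS step mentioned in the last highlight, the sketch below computes hub and authority scores for a small link graph by power iteration. The adjacency-matrix input, the function name, and the number of iterations are illustrative assumptions; the paper's exact mapping from authority scores to feature weights is not reproduced here.

```python
# Minimal HITS (power-iteration) sketch, assuming web pages and their
# hyperlinks are given as an adjacency matrix. The resulting authority
# scores would then serve as weights in place of TF-IDF, as the highlight
# describes; the concrete weighting scheme follows the paper.
import numpy as np

def hits(adjacency, n_iter=50):
    """adjacency[i][j] = 1 if page i links to page j.
    Returns (authority, hub) score vectors, each normalized to unit L2 norm."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(n_iter):
        authority = A.T @ hub          # pages pointed to by good hubs
        authority /= np.linalg.norm(authority)
        hub = A @ authority            # pages pointing to good authorities
        hub /= np.linalg.norm(hub)
    return authority, hub

# Example usage on a tiny 3-page link graph
adj = [[0, 1, 1],
       [0, 0, 1],
       [1, 0, 0]]
auth, hub = hits(adj)
print("authority:", auth, "hub:", hub)
```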


Summary

Introduction

The number of pages published on the World Wide Web is estimated to be in the billions (http://www.internetlivestats.com/total-number-of-websites/). Mining these pages requires intellectual effort that exceeds human capacity. Web pages as a whole have no unifying structure; the variability of structuring styles and content creation is much greater than in traditional collections of textual documents [1]. Consequently, it is impossible to apply traditional dataset management and information retrieval techniques to web pages for information and knowledge extraction. Web mining addresses this gap: its principal objective is to use data mining methods and techniques to extract the knowledge contained in web pages, taking their unstructured nature into account during the preprocessing and feature extraction phases.

