Abstract

The immense number of documents published on the web requires automatic classifiers that allow organizing and retrieving information from these large resources. Typically, automatic web page classifiers handle millions of web pages, tens of thousands of features, and hundreds of categories. Most classifiers use the vector space model to represent the dataset of web pages, with the components of each vector computed using the term frequency-inverse document frequency (TFIDF) scheme. Unfortunately, TFIDF-based classifiers face the problem of large-scale input data, which leads to long processing times and increased resource demands. There is therefore an increasing demand to alleviate these problems by reducing the size of the input data without affecting the classification results. In this paper, we propose a novel approach that improves web page classifiers by reducing the size of the input data (i.e., reducing both web pages and features) using the hypertext induced topic search (HITS) algorithm, and by employing the HITS results to weight the remaining features. We evaluate the performance of the proposed approach by comparing it with a TFIDF-based classifier and demonstrate that our approach significantly reduces the time needed for classification.
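As context for the TFIDF baseline the abstract refers to, the following is a minimal sketch of how TFIDF weights might be computed for a collection of tokenized web pages. The function name and the exact tf/idf normalization are illustrative assumptions, not taken from the paper.

```python
# Minimal TF-IDF sketch (illustrative only; the paper's exact weighting
# variant is not specified here). Each web page is represented as a vector
# whose components are tf * idf values over the feature vocabulary.
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vec = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        vectors.append(vec)
    return vectors

# Example usage on a toy collection of three "pages"
docs = [["web", "page", "classification"],
        ["web", "mining", "hits"],
        ["page", "mining", "classification"]]
print(tfidf_vectors(docs))
```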

Highlights

  • The number of pages published on the World Wide Web is estimated to be in the billions

  • We have found that support vector machine (SVM) web page classifiers that use the term frequency-inverse document frequency (TFIDF) scheme have several disadvantages (see Section 5)

  • In the fifth step of this approach, we propose to calculate the weights using the authority vector, which is the output of the hypertext induced topic search (HITS) algorithm, instead of using the TFIDF scheme (see the sketch after this list)
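As a rough illustration of the HITS step mentioned in the last highlight, the sketch below computes hub and authority scores for a small link graph by power iteration. The adjacency-matrix input, the function name, and the number of iterations are illustrative assumptions; the paper's exact mapping from authority scores to feature weights is not reproduced here.

```python
# Minimal HITS (power-iteration) sketch, assuming web pages and their
# hyperlinks are given as an adjacency matrix. The resulting authority
# scores would then serve as weights in place of TF-IDF, as the highlight
# describes; the concrete weighting scheme follows the paper.
import numpy as np

def hits(adjacency, n_iter=50):
    """adjacency[i][j] = 1 if page i links to page j.
    Returns (authority, hub) score vectors, each normalized to unit L2 norm."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(n_iter):
        authority = A.T @ hub          # pages pointed to by good hubs
        authority /= np.linalg.norm(authority)
        hub = A @ authority            # pages pointing to good authorities
        hub /= np.linalg.norm(hub)
    return authority, hub

# Example usage on a tiny 3-page link graph
adj = [[0, 1, 1],
       [0, 0, 1],
       [1, 0, 0]]
auth, hub = hits(adj)
print("authority:", auth, "hub:", hub)
```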


Summary

Introduction

The number of pages published on the World Wide Web is estimated to be in the billions (http://www.internetlivestats.com/total-number-of-websites/). Mining these pages requires intellectual effort that exceeds human capacity. Web pages as a whole have no unifying structure; the variability of structuring styles and content creation is much greater than in traditional collections of textual documents [1]. Consequently, it is impossible to apply traditional dataset management and information retrieval techniques to web pages for information and knowledge extraction. Web mining addresses this gap: its principal objective is to use data mining methods and techniques to extract the knowledge contained in web pages, taking their unstructured nature into account during the preprocessing and feature extraction phases.

