A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Eda Baykan, Ingmar Weber, Ludmila Marian, Monika Henzinger

doi:10.1145/1993053.1993057

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further, to over 90, while maintaining a typical level of recall between 30 and 40.
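To make the abstract's two technical ideas concrete, the following is a minimal sketch of how features might be generated from a URL alone and how precision and recall combine into an F-measure. It is an illustration under stated assumptions, not the authors' implementation: the function names (`url_tokens`, `char_ngrams`, `url_features`, `f_measure`), the choice of splitting on non-letter characters, the n-gram length of 4, and the example URL are all hypothetical. The abstract itself does not specify the feature sets used.

```python
import re
from collections import Counter


def url_tokens(url: str) -> list[str]:
    """Lowercase the URL and split it on every run of non-letter characters.

    Assumption: tokens shorter than two letters are dropped as uninformative.
    """
    return [t for t in re.split(r"[^a-z]+", url.lower()) if len(t) > 1]


def char_ngrams(token: str, n: int) -> list[str]:
    """Return all contiguous character n-grams of a token (empty if too short)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url: str, n: int = 4) -> Counter:
    """Build a sparse bag of token and character n-gram features for one URL."""
    feats: Counter = Counter()
    for tok in url_tokens(url):
        feats["tok=" + tok] += 1
        for gram in char_ngrams(tok, n):
            feats[f"{n}gram=" + gram] += 1
    return feats


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example URL, not taken from the paper's corpora.
    feats = url_features("http://www.example.com/sports/football-news.html")
    print(sorted(k for k in feats if k.startswith("tok=")))
    # Trading recall for precision lowers the harmonic mean.
    print(f_measure(90, 30))
```

Even a fairly descriptive URL yields only a handful of tokens, which illustrates the sparsity the abstract mentions. The F-measure function also shows the cost of the high-precision operating point: at a precision of 90 and a recall of 30, the harmonic mean is 45, well below the typical values of 80 to 85 reported for the balanced setting.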

Journal: ACM Transactions on the Web
Publication Date: Jul 1, 2011
Citations: 78
