SUPERVISED TERM WEIGHTING METHODS FOR URL CLASSIFICATION

R. Rajalakshmi

doi:10.3844/jcssp.2014.1969.1976

Abstract

Many term weighting methods have been suggested in the literature for Information Retrieval and Text Categorization. Term weighting, a part of the feature selection process, has not yet been explored for the URL classification problem. We classify a web page using its URL alone, without fetching its content; URL-based classification is therefore faster than methods that rely on page content. In this study, we investigate the use of term weighting methods for selecting relevant URL features and their impact on the performance of URL classification. We propose a New Relevance Factor (NRF) for the supervised term weighting method to compute the URL weights and perform multiclass classification of URLs using a Naive Bayes classifier. To evaluate the proposed method, we conducted experiments on the Open Directory Project (ODP) dataset; the results show that the proposed NRF-based supervised term weighting method is suitable for URL classification. We achieved an 11% improvement in Precision over existing binary classifier methods and a 22% improvement in F1 compared with existing multiclass classifiers.

Highlights

Web page classification is the task of assigning one of a set of predefined category labels to a web page based on its content and the topic it discusses
To evaluate the proposed method, we conducted experiments on the Open Directory Project (ODP) dataset; the results show that the proposed supervised term weighting method based on the New Relevance Factor (NRF) is suitable for URL classification
The proposed NRF-based term weighting method outperforms the RF method, improving accuracy by 3%
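
As a rough illustration of the RF baseline mentioned above, the sketch below tokenizes URLs and computes a per-token weight. It assumes that RF refers to the relevance frequency of Lan et al., rf = log2(2 + a / max(1, c)), where a and c count the positive-class and negative-class URLs containing the token. The NRF formula is not given in this text, so it is not reproduced here. The tokenizer, the stop list, the helper names `url_tokens` and `rf_weights`, and the example URLs are all hypothetical choices made for this sketch, not the paper's preprocessing.

```python
import math
import re
from collections import Counter

# Hypothetical stop list of URL tokens that carry no topical signal.
STOP_TOKENS = {"http", "https", "www", "com", "org", "net", "html", "htm"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (an assumed tokenizer)."""
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOP_TOKENS]


def rf_weights(urls, labels, positive):
    """Relevance frequency per token for one target category.

    a = number of positive-class URLs containing the token
    c = number of negative-class URLs containing the token
    rf = log2(2 + a / max(1, c))
    """
    pos_df, neg_df = Counter(), Counter()
    for url, label in zip(urls, labels):
        counts = pos_df if label == positive else neg_df
        counts.update(set(url_tokens(url)))  # count each token once per URL
    vocab = set(pos_df) | set(neg_df)
    return {t: math.log2(2 + pos_df[t] / max(1, neg_df[t])) for t in vocab}


# Made-up URLs for illustration only; not taken from the ODP dataset.
urls = [
    "http://www.example.com/sports/football/scores.html",
    "http://www.example.org/football/league",
    "http://www.example.net/health/nutrition/diet.html",
    "http://www.example.com/health/football-injuries",
]
labels = ["Sports", "Sports", "Health", "Health"]

weights = rf_weights(urls, labels, positive="Sports")
for token in sorted(weights, key=weights.get, reverse=True):
    print(f"{token}: {weights[token]:.3f}")
```

In this toy data, a token concentrated in the target category (such as `football`) receives a higher weight than one spread evenly across categories (such as `example`). The resulting weights could then scale the token features passed to a Naive Bayes classifier.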

Summary

Introduction

Web page classification is the task of assigning one of a set of predefined category labels to a web page based on its content and the topic it discusses. It resembles text categorization, but poses additional challenges due to the presence of hyperlinks, images and multimedia content. Content-based classification systems have several drawbacks: (i) the page must be downloaded so that features can be extracted from its content; (ii) bandwidth is wasted on unnecessary downloads; and (iii) classification is slowed down because an excessive number of features must be extracted. Beyond this additional burden, content-based classification systems cannot adequately handle web pages that (i) contain only images, (ii) hide their content behind images, or (iii) contain dynamic content.

Methods

Results

Discussion

Conclusion