Classification of web pages on attractiveness: A supervised learning approach

Ganesh Khade, Samit Bhattacharya, Sudhakar Kumar

doi:10.1109/ihci.2012.6481867

Abstract

Random surfers spend very little time on a web page. If the most important content of a web page fails to attract a surfer's attention within this short time span, the surfer will move on to some other page, defeating the purpose of the web page designer. To predict whether the contents of a web page will catch a random surfer's attention, we propose a machine learning based approach that classifies web pages into “bad” and “not bad” classes, where the “bad” class implies poor attention-drawing ability. To develop the classifier, we divide web page contents into “objects”, which are coherent regions of a web page that convey the same information. We surveyed 100 web pages sampled from the Internet to identify the types and frequencies of objects used in web page design. From this survey, we identified the six types of objects that are most important in determining the class of a web page in terms of its attention-drawing capability. We used the WEKA tool to implement the machine learning approach. Two percentage-split strategies and three cross-validation strategies were used to check the accuracy of the classifier. We experimented with 65 algorithms supported by WEKA and found that, among the 65, the RBF network and Random subspace algorithms give the best performance, with about 83% accuracy.
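The two evaluation protocols named in the abstract, percentage split and cross-validation, can be illustrated with a minimal Python sketch. This is not the authors' WEKA setup: the function names, the dataset size of 100, the 66% split, and the 10 folds are assumptions chosen only for illustration, since the abstract does not state the actual split percentages or fold counts.

```python
import random


def percentage_split(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Assumed values for illustration only (not taken from the paper).
train, test = percentage_split(100, 0.66)
folds = list(k_fold_splits(100, 10))
```

With a percentage split, a classifier is trained once on `train` and scored once on `test`. With cross-validation, it is trained and scored once per fold, and the per-fold accuracies are averaged, which makes the estimate less dependent on one particular split.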

Full Text