Identifying auxiliary web images using combination of analyses

Tewson Seeoun, Choochart Haruechaiyasak, Toshiaki Kondo

doi:10.1145/1631272.1631530

Abstract

As the Web gains popularity, Web sites become richer in media. Besides text, the most common form of media is the image. A Web page can use images in various ways, such as to illustrate stories, to summarize data, and to decorate the page. As a result, a large number of images are embedded in Web pages. However, not all Web images are informative, i.e., included in the page for the purpose of delivering useful information. Uninformative, or auxiliary, images include, for example, logos and banner advertisements. The benefit of classifying Web images as informative or auxiliary is the efficient use of available resources. Auxiliary images are insignificant and can be ignored in many tasks, including search engine indexing, for the sake of concise search results, and Web page printing, to reduce ink usage. This paper proposes a solution for the HP Multimedia Grand Challenge to identify informative multimedia content in Web pages. Our approach is based on a supervised machine learning model trained on a set of 32 features gathered from content analysis of images, Web page layout, and domain name. We adopt the Support Vector Machine (SVM) algorithm to train the classifier. The model is optimized by a grid search technique to select an appropriate set of kernel parameters. Evaluation based on 10-fold cross-validation yielded a classification accuracy of 94.08%. The classification results are used to annotate each image accordingly; in the prototype implementation, each image is highlighted with a different border color according to its class.
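The training procedure described in the abstract (an SVM tuned by grid search and scored with 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, an RBF kernel, an arbitrary parameter grid, feature standardization, and synthetic 32-dimensional data standing in for the real image, layout, and domain-name features. The names `build_search` and `make_synthetic` are hypothetical helpers introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_search(param_grid=None):
    """Grid search over SVM parameters, scored by 10-fold cross-validation."""
    if param_grid is None:
        # Assumed grid; the abstract does not state the values searched.
        param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}
    # Assumption: RBF kernel and standardized features.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")


def make_synthetic(n=300, n_features=32, seed=0):
    """Synthetic stand-in for the 32 features (NOT the paper's data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, n_features))
    # Label 1 = "informative", 0 = "auxiliary", from a toy decision rule.
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


X, y = make_synthetic()
search = build_search().fit(X, y)
best_params = search.best_params_   # best (C, gamma) combination found
best_accuracy = search.best_score_  # mean 10-fold accuracy for that combination
print(best_params, round(best_accuracy, 4))
```

With real data, `X` would hold one 32-element feature vector per image and `y` the informative/auxiliary labels. The accuracy printed here reflects only the synthetic data and says nothing about the 94.08% reported above.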

Full Text