Exploiting tag and word correlations for improved webpage clustering

Anusua Trivedi,Piyush Rai,Hal Daumé,Scott L Duvall

doi:10.1145/1871985.1871989

Abstract

Automatic clustering of webpages supports a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, webpage clustering algorithms use only features extracted from the page text. However, the advent of social-bookmarking websites, such as StumbleUpon and Delicious, has produced a large amount of user-generated content, including the tags associated with webpages. In this paper, we present a subspace-based feature extraction approach that uses tag information to complement the page contents and extract highly discriminative features, with the goal of improving clustering performance. We treat page text and tags as two separate views of the data and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We compare our subspace-based approach with a number of baselines that use tag information in other ways, and show that it leads to improved performance on the webpage clustering task. Although our results here are for webpage clustering, the same approach can also be used for webpage classification. Finally, we suggest possible future work on leveraging tag information in webpage clustering, especially when tags are available for only a small number of webpages rather than all of them.
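
The two-view idea in the abstract can be illustrated with a minimal sketch. It assumes the shared subspace is found with linear canonical correlation analysis (CCA), a standard way to maximize correlation between two views; the abstract does not name the exact method, so this choice is an assumption, as are the synthetic data, the function names (`inv_sqrt`, `cca`, `kmeans`), the regularization constant, and the use of a plain k-means as the clustering step.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(X, Y, k, reg=1e-6):
    """Linear CCA (assumed method): projections Wx, Wy and top-k correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, S, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is)
    return Cxx_is @ U[:, :k], Cyy_is @ Vt.T[:, :k], S[:k]

def kmeans(Z, k, iters=50, seed=0):
    """Plain Lloyd's k-means; any clustering algorithm could be used instead."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Synthetic stand-ins for the two views (hypothetical data, not from the paper):
# both views share a latent cluster signal and add independent noise.
rng = np.random.default_rng(0)
n, n_clusters = 300, 3
true = rng.integers(0, n_clusters, size=n)
latent = np.eye(n_clusters)[true]
X_text = latent @ rng.normal(size=(n_clusters, 20)) + rng.normal(size=(n, 20))
Y_tags = latent @ rng.normal(size=(n_clusters, 8)) + rng.normal(size=(n, 8))

Wx, Wy, corrs = cca(X_text, Y_tags, k=2)
Z_text = (X_text - X_text.mean(axis=0)) @ Wx   # page-text view in the shared subspace
Z_tags = (Y_tags - Y_tags.mean(axis=0)) @ Wy   # tag view in the shared subspace
labels = kmeans(Z_text, n_clusters)            # cluster in the learned subspace
```

Here `corrs` holds the canonical correlations between the projected views, and clustering runs on the projected page-text features. Note that only the projection learned from both views is needed at clustering time in this sketch; how the original work handles pages without tags is not stated in the abstract.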

Similar Papers

Leveraging Social Bookmarks from Partially Tagged Corpus for Improved Web Page Clustering
Anusua Trivedi ... Piyush Rai
ACM Transactions on Intelligent Systems and Technology | VOL. 3
01 Sep 2012

Query Directed Web Page Clustering
Daniel Crabtree ... Peter Andreae
01 Dec 2006

Toward creating a fairer ranking in search engine results
Ruoyuan Gao ... Chirag Shah
Information Processing & Management | VOL. 57
16 Oct 2019

Viewpoint Diversity in Search Results
Tim Draws ... Mehmet Orcun Yalcin
01 Jan 2023
