Abstract

Automatic clustering of Web pages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, introducing diversity in search results, etc. Typically, Web page clustering algorithms use only features extracted from the page-text. However, the advent of social-bookmarking Web sites, such as StumbleUpon.com and Delicious.com, has led to a huge amount of user-generated content such as the social tag information that is associated with the Web pages. In this article, we present a subspace based feature extraction approach that leverages the social tag information to complement the page-contents of a Web page for extracting beter features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We then present an extension that allows our approach to be applicable even if the Web page corpus is only partially tagged, that is, when the social tags are present for not all, but only for a small number of Web pages. We compare our subspace based approach with a number of baselines that use tag information in various other ways, and show that the subspace based approach leads to improved performance on the Web page clustering task. We also discuss some possible future work including an active learning extension that can help in choosing which Web pages to get tags for, if we only can get the social tags for only a small number of Web pages.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.