Abstract

The internet comprises a massive amount of information in the form of billions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively make use of surface web information, but the deep web remains largely unexploited. Machine learning techniques have commonly been employed to access deep web content. Among these, topic models provide a simple way to analyze large volumes of unlabeled text. A topic consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between the senses of words with multiple meanings. Clustering is one of the key solutions for organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model, Latent Dirichlet Allocation (LDA), to model content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we estimate the distributions of "topics per document" and "words per topic" using Gibbs sampling. Experimental results show that the proposed method clearly outperforms existing clustering methods.
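To illustrate the core machinery the abstract describes, the following is a minimal sketch of collapsed Gibbs sampling for LDA, followed by clustering documents by their dominant topic. The toy corpus, hyperparameters (alpha, beta), and all variable names are illustrative assumptions, not taken from the paper; a real deep web pipeline would first tokenize page and form contents into word-id sequences.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids in [0, V).
    Returns (theta, phi): "topics per document" and "words per topic"
    distributions, estimated from the final sample's counts.
    """
    rng = np.random.default_rng(seed)
    V = max(max(d) for d in docs) + 1        # vocabulary size
    D = len(docs)
    ndk = np.zeros((D, n_topics))            # doc-topic counts
    nkw = np.zeros((n_topics, V))            # topic-word counts
    nk = np.zeros(n_topics)                  # tokens per topic
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random initialization
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k | rest)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                  # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy corpus: word ids standing in for tokens extracted from web forms.
docs = [[0, 1, 0, 2], [0, 2, 1, 0], [3, 4, 5], [4, 5, 3, 3]]
theta, phi = lda_gibbs(docs, n_topics=2, n_iter=100)
clusters = theta.argmax(axis=1)  # assign each database to its dominant topic
```

Clustering by the argmax of theta is one simple way to turn the topic mixtures into database clusters; the paper's exact clustering criterion over the LDA representation may differ.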
