Abstract

Web directories like Wikipedia and Open Directory Mozilla facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintaining these directories is understandably difficult: it requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of web directories would therefore benefit the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web documents. We use centrality-based techniques to split the corpus into fine-grained clusters, followed by an agglomeration step based on locality-sensitive hashing to identify coarse-grained clusters in the web directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus, as measured by cluster-validity indices such as F-measure, Rand index, and cluster purity. The algorithm achieves significant speedup and scales well with both the number of processors and the size of the input corpus.
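The abstract does not give the paper's implementation; the following is a minimal sketch of the general idea behind an LSH-based agglomeration step, assuming MinHash signatures with banding over the token sets of fine-grained clusters. All names (`lsh_group`, `minhash_signature`, the example cluster labels) are hypothetical illustrations, not the authors' code.

```python
import hashlib
from collections import defaultdict

def _h(seed, token):
    # Deterministic 64-bit hash of (seed, token).
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(tokens, seeds):
    # One MinHash value per seed: the smallest hash over the token set.
    return tuple(min(_h(seed, t) for t in tokens) for seed in seeds)

def lsh_group(clusters, num_hashes=16, bands=4):
    """Merge fine-grained clusters whose signatures collide in any LSH band.

    clusters: dict mapping cluster name -> set of representative tokens.
    Returns a sorted list of merged groups (each a sorted list of names).
    """
    seeds = range(num_hashes)
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, tokens in clusters.items():
        sig = minhash_signature(tokens, seeds)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].append(name)
    # Union-find: clusters sharing any bucket are agglomerated.
    parent = {name: name for name in clusters}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for members in buckets.values():
        for m in members[1:]:
            parent[find(m)] = find(members[0])
    groups = defaultdict(list)
    for name in clusters:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical usage: highly similar clusters collide and merge,
# while a dissimilar one stays separate.
fine_clusters = {
    "sports-1": {"football", "goal", "league", "match"},
    "sports-2": {"football", "goal", "league", "match"},
    "science":  {"quantum", "physics"},
}
coarse = lsh_group(fine_clusters)
```

Banding trades precision for recall: fewer, wider bands require longer signature agreement (stricter merges), while more, narrower bands merge more aggressively.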
