Abstract

Web directories like Wikipedia and the Open Directory Project (DMOZ) facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintaining these web directories is understandably a difficult task that requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of web directories would therefore benefit the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web documents. We use centrality-based techniques to split the corpus into fine-grained clusters, followed by an agglomeration step based on locality-sensitive hashing to identify coarse-grained clusters in the web directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus, as measured by cluster-validity indices such as F-measure, Rand index, and cluster purity. The algorithm achieves a significant speedup and scales well both with the number of processors and with the size of the input corpus.
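The abstract describes a two-stage pipeline: fine-grained clusters are first produced (via centrality-based techniques), then agglomerated into coarse-grained clusters using locality-sensitive hashing (LSH). The paper's own implementation is not shown here, but the agglomeration idea can be sketched with MinHash-based LSH: each fine-grained cluster is represented by its term set, clusters receive MinHash signatures, and clusters whose signatures collide in any LSH band are merged. All names (`minhash_signature`, `lsh_agglomerate`, the band/row parameters) are illustrative assumptions, not the authors' API.

```python
# Illustrative sketch of LSH-based cluster agglomeration (assumed
# approach; the paper's actual parallel implementation may differ).
import hashlib
from collections import defaultdict

def minhash_signature(terms, num_hashes=8):
    # One MinHash value per salted hash function: the minimum digest
    # over all terms in the cluster's term set.
    return [
        min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
            for t in terms)
        for i in range(num_hashes)
    ]

def lsh_agglomerate(clusters, num_hashes=8, bands=4):
    """Merge fine-grained clusters whose term sets collide in an LSH band.

    clusters: dict mapping cluster id -> set of representative terms.
    Returns a list of coarse clusters (sets of fine-grained cluster ids).
    """
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for cid, terms in clusters.items():
        sig = minhash_signature(terms, num_hashes)
        for b in range(bands):
            band_key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[band_key].append(cid)

    # Union-find: clusters that share any band bucket belong together.
    parent = {cid: cid for cid in clusters}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    coarse = defaultdict(set)
    for cid in clusters:
        coarse[find(cid)].add(cid)
    return list(coarse.values())
```

Clusters with highly similar term sets share signature bands with high probability and are merged, while dissimilar clusters almost never collide, which is what makes LSH attractive for large corpora: candidate merges are found without all-pairs similarity comparisons.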
