A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

Maria Th Kotouza, Pericles A Mitkas, Fotis E Psomopoulos

doi:10.1186/s13677-019-0150-y

Abstract

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users’ queries by leveraging the structural relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights into the leaf clusters’ connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.

Highlights

Hierarchical clustering has been proven to be a useful technique in the field of document organization, as it constructs a hierarchy structure of document collections and sub-collections.
The best results achieved by an algorithm for each one of the datasets are highlighted in boldface, whereas the second highest results are presented in italics.
This subsection is divided into five parts: a) the comparison against baseline hierarchical clustering algorithms in terms of effectiveness is further discussed in “Effectiveness evaluation” section, b) the comparison against a baseline divisive hierarchical clustering algorithm in terms of memory usage and computational time is further discussed in “Performance statistical evaluation” section, c) the performance experiments of the proposed method running in the cloud are further discussed in “Performance testing in the cloud” section, d) the complexity analysis is presented in “Complexity analysis” section, and e) the overall proposed framework presented in “A new document clustering framework” section, applied on the NYTimes dataset, is further discussed in “Experimental results on the NYTimes dataset” section.

Summary

Introduction

Hierarchical clustering has been proven to be a useful technique in the field of document organization, as it constructs a hierarchy structure of document collections and sub-collections. Such a structure can make the browsing and navigation process easier and quicker [1] by hiding irrelevant information from the users. A medium to large set of documents can contain over 10,000 documents; this means that there can be millions of term-document relations, leading to an extremely high computational complexity and memory usage. This issue arises from the way most classical hierarchical clustering methods are implemented: they are based on the formulation of high dimensional distance matrices, used for pairwise comparisons between all the available data points.
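
The quadratic growth of pairwise comparisons can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the proposed framework: the function names are hypothetical, and the 8 bytes per distance is an assumption (one double-precision value per unique document pair, with no other overhead).

```python
def pair_count(n_docs: int) -> int:
    """Number of unique unordered document pairs: n * (n - 1) / 2."""
    return n_docs * (n_docs - 1) // 2


def distance_storage_bytes(n_docs: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold one distance per unique pair.

    bytes_per_value=8 assumes double-precision floats (an assumption,
    not a figure taken from the paper).
    """
    return pair_count(n_docs) * bytes_per_value


if __name__ == "__main__":
    # Illustrative collection sizes; only 10,000 is mentioned in the text.
    for n in (1_000, 10_000, 100_000):
        pairs = pair_count(n)
        mib = distance_storage_bytes(n) / 2**20
        print(f"{n:>7} documents: {pairs:>13,} pairs, {mib:,.1f} MiB")
```

Under these assumptions, a collection of 10,000 documents already requires about 50 million stored distances (roughly 400 MB), and a tenfold increase in documents multiplies that figure by about one hundred.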

Methods

Results

Conclusion