Parallel clustering algorithm for large-scale biological data sets.

Minchao Wang, Jiang Xie, Luonan Chen, Huiran Zhang, Wu Zhang, Hao Xie, Wang Ding, Yike Guo, Dongbo Dai

doi:10.1371/journal.pone.0091315

Open Access


Abstract

Background: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and much longer runtime. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix must be constructed before the algorithm runs, and constructing that matrix is itself time-consuming.

Methods: Two types of parallel architectures are proposed in this paper to accelerate the construction of the similarity matrix and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. A suitable scheme for data partition and reduction is designed to minimize the global communication cost among processes.

Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm can handle large-scale data sets effectively. The parallel affinity propagation also performs well when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
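The two steps named in the abstract, building a pairwise similarity matrix and then running affinity propagation on it, can be illustrated with a small serial sketch. This is not the authors' parallel implementation: it assumes negative squared Euclidean distance as the similarity measure, the standard responsibility/availability message updates of affinity propagation, and the median similarity as the default preference. The function names `similarity_matrix` and `affinity_propagation`, and the `damping` and `max_iter` defaults, are hypothetical choices made for illustration.

```python
import numpy as np


def similarity_matrix(X):
    """Pairwise similarities; ASSUMPTION: negative squared Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]
    return -np.sum(diff ** 2, axis=2)


def affinity_propagation(S, preference=None, damping=0.8, max_iter=300):
    """Serial affinity propagation on an n x n similarity matrix (illustrative)."""
    n = S.shape[0]
    S = S.astype(float).copy()
    if preference is None:
        # ASSUMPTION: use the median off-diagonal similarity as the preference.
        preference = np.median(S[~np.eye(n, dtype=bool)])
    np.fill_diagonal(S, preference)

    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive
    # are taken as exemplars; every other point joins its most similar exemplar.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```

Both steps are quadratic in the number of data points in time and memory, which is the bottleneck the abstract refers to. In this sketch each row of the similarity matrix and each row-wise responsibility update is independent of the others, while the availability update needs column sums across all rows, which is one plausible place where a reduction across processes would be required.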

Summary

Introduction

Data clustering groups a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). It is a common technique for data mining and analysis in many fields, such as pattern recognition, machine learning, and bioinformatics. Many novel clustering algorithms have been introduced to handle biological problems, including the detection of protein families/superfamilies [1,2], metabolic network analysis [3,4,5], and protein-protein interaction (PPI) network analysis [6,7]. Some classic clustering algorithms, such as linkage and graph partitioning, are used to handle the differences between microbial sequences. The correlation between human microbiome comparisons and various diseases can be extracted from the clustering results.

Methods

Discussion

Conclusion