Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis

Qi Mao, Wei Zheng, Volker Mai, Li Wang, Yijun Sun, Yunpeng Cai

doi:10.1109/icdm.2015.90

Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomics data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly because of its quadratic time and space complexity, which stems mainly from the need to compute and store a pairwise distance matrix. It is thus necessary to minimize the number of pairwise distances computed without degrading clustering performance. In addition, as high-performance computing systems become widely accessible, it is highly desirable that a clustering method can be easily adapted to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. We propose a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method partitions a large-scale sequence dataset into groups. In the second stage, a fast hierarchical agglomerative clustering method is applied to each group. By assembling the hierarchies from both stages, the hierarchy of the full dataset can be easily recovered. Theoretical results show that our method recovers the true hierarchy with high probability under some mild conditions and has linearithmic time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets show that our method achieves clustering accuracy comparable to ESPRIT-Tree and runs faster than greedy heuristic methods.
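The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (`distance`, `divide`, `agglomerate`, `two_stage_cluster`), the mismatch-fraction distance, the two-landmark split rule, the average-linkage merge, and the `max_group` and `threshold` parameters are all assumptions chosen for illustration. It only shows how a divisive partition around landmarks can limit the expensive pairwise comparisons to small groups.

```python
import random
from itertools import combinations


def distance(a, b):
    """Fraction of mismatched positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)


def divide(seqs, max_group, rng):
    """Stage 1 (simplified): recursively split a set around two landmarks."""
    if len(seqs) <= max_group:
        return [seqs]
    first = rng.choice(seqs)
    # Second landmark: the sequence farthest from the first landmark.
    second = max(seqs, key=lambda s: distance(s, first))
    left, right = [], []
    for s in seqs:
        nearer_first = distance(s, first) <= distance(s, second)
        (left if nearer_first else right).append(s)
    if not left or not right:
        # All sequences are identical, so no further split is possible.
        return [seqs]
    return divide(left, max_group, rng) + divide(right, max_group, rng)


def agglomerate(group, threshold):
    """Stage 2 (simplified): naive average-linkage merging inside one group."""
    clusters = [[s] for s in group]

    def linkage(c1, c2):
        total = sum(distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters; i < j always holds here.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


def two_stage_cluster(seqs, max_group=50, threshold=0.2, seed=0):
    """Partition with landmarks, then cluster each group independently."""
    rng = random.Random(seed)
    result = []
    for group in divide(list(seqs), max_group, rng):
        # Groups are independent, so this loop could run in parallel.
        result.extend(agglomerate(group, threshold))
    return result


if __name__ == "__main__":
    reads = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG", "GGGGGGGG"]
    for cluster in two_stage_cluster(reads, max_group=2, threshold=0.2):
        print(cluster)
```

Because pairwise distances are computed only between each sequence and two landmarks per split, and then within small groups, the sketch avoids building a full distance matrix. Unlike the method described in the abstract, it does not assemble the hierarchies from both stages, so clusters are never merged across groups.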

