Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomics data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly due to its quadratic time and space complexities stemming mainly from the need of computing and storing a pairwise distance matrix. It is thus necessary to minimize the number of pairwise distances computed without degrading clustering performance. On the other hand, as high-performance computing systems are becoming widely accessible, it is highly desirable that a clustering method can be easily adapted to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. We proposed a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method was proposed that partitions a large-scale sequence dataset into groups, and in the second stage, a fast hierarchical agglomerative clustering method is applied to each group. By assembling hierarchies from both stages, the hierarchy of the data can be easily recovered. Theoretical results showed that our method can recover the true hierarchy with a high probability under some mild conditions and has a linearithmic time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets showed that our method achieved clustering accuracy comparable to ESPRIT-Tree and ran faster than greedy heuristic methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call