HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences

João F Matias Rodrigues, Christian Von Mering

doi:10.1093/bioinformatics/btt657

Open Access

Journal: Bioinformatics	Publication Date: Nov 9, 2013
Citations: 48	License type: CC BY 3.0

Affiliation: SIB Swiss Institute of Bioinformatics

Abstract

Motivation: Nucleotide sequence data are being produced at an ever-increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps.

Results: Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster.

Availability and implementation: Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing.

Contact: mering@imls.uzh.ch

Supplementary Information: Supplementary data are available at Bioinformatics online.

Highlights

The time complexity of hierarchical clustering algorithms (HCA) is quadratic, $O(N^2)$, or even worse, $O(N^2 \log N)$, depending on the selected cluster linkage method (Day and Edelsbrunner, 1984)
HCAs have a number of advantages that make them attractive for applications in biology: (i) they are well defined and should be reproducible across implementations, (ii) they require nothing but a pairwise distance matrix as input and (iii) they are agglomerative, meaning that sets of clusters at arbitrary similarity thresholds can be extracted quickly by post-processing, once a complete clustering run has been executed
HCAs have been widely adopted in biology, in areas ranging from data mining to sequence analysis to evolutionary biology
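
To make the properties listed above concrete, here is a minimal sketch of agglomerative clustering that takes nothing but a pairwise distance matrix and records every merge. It is a deliberately naive illustration, not HPC-CLUST's algorithm or data layout: the function names `linkage_distance` and `naive_hca` and the merge-record format are assumptions made for this example, and the run time of this sketch is far worse than the quadratic bounds quoted above.

```python
from itertools import combinations

def linkage_distance(a, b, dist, method):
    """Distance between clusters a and b (lists of item indices)."""
    pair_d = [dist[i][j] for i in a for j in b]
    if method == "single":      # closest pair of members
        return min(pair_d)
    if method == "complete":    # farthest pair of members
        return max(pair_d)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError(f"unknown linkage method: {method}")

def naive_hca(dist, method="average"):
    """Merge clusters bottom-up; return a list of (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]], dist, method),
        )
        d = linkage_distance(clusters[x], clusters[y], dist, method)
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Toy symmetric distance matrix for four items (illustrative values only).
dist = [
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
for method in ("single", "complete", "average"):
    print(method, naive_hca(dist, method))
```

The three linkage methods agree on the first two merges for this toy matrix and differ only in the distance assigned to the final merge, which is where the choice of linkage changes the clusters obtained at a given threshold.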

Summary

INTRODUCTION

The time complexity of hierarchical clustering algorithms (HCA) is quadratic, $O(N^2)$, or even worse, $O(N^2 \log N)$, depending on the selected cluster linkage method (Day and Edelsbrunner, 1984). We present a distributed implementation of an HCA that can handle large numbers of sequences. It can compute single-, complete- and average-linkage clusters in a single run and produces a merge-log from which clusters can subsequently be parsed at any threshold. In contrast to CD-HIT, UCLUST and ESPRIT, which all take unaligned sequence data as their input, HPC-CLUST (like MOTHUR) takes as input a set of pre-aligned sequences. This allows for flexibility in the choice of alignment algorithm; a future version of HPC-CLUST may include the alignment step as well. Additional benchmarks are shown and discussed in the Supplementary Material.
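
The idea of parsing clusters from a merge-log at an arbitrary threshold can be sketched as follows. This is an illustration under stated assumptions, not HPC-CLUST's actual file format or code: the merge-log is assumed to be a list of `(a, b, similarity)` tuples, where `a` and `b` are indices of any member of the two clusters being joined, and `clusters_at_threshold` is a hypothetical helper name.

```python
def clusters_at_threshold(n_items, merge_log, threshold):
    """Replay all merges with similarity >= threshold using union-find."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, similarity in merge_log:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    # Group items by their root to obtain the clusters.
    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge-log for five sequences (illustrative values only).
merge_log = [(0, 1, 0.99), (2, 3, 0.98), (0, 2, 0.95), (0, 4, 0.90)]

print(clusters_at_threshold(5, merge_log, 0.97))  # [[0, 1], [2, 3], [4]]
print(clusters_at_threshold(5, merge_log, 0.95))  # [[0, 1, 2, 3], [4]]
```

Because the log is only replayed, not recomputed, extracting clusters at a new threshold costs roughly one pass over the merges, which is why a single complete clustering run suffices for any number of thresholds.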

METHODS

Clustering performance on a single computer

CONCLUSION