Abstract

Motivation: Nucleotide sequence data are being produced at an ever increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps.

Results: Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster.

Availability and implementation: Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing.

Contact: mering@imls.uzh.ch

Supplementary Information: Supplementary data are available at Bioinformatics online.

Highlights

  • The time complexity of hierarchical clustering algorithms (HCA) is quadratic, $O(N^2)$, or even worse, $O(N^2 \log N)$, depending on the selected cluster linkage method (Day and Edelsbrunner, 1984)

  • HCAs have a number of advantages that make them attractive for applications in biology: (i) they are well defined and should be reproducible across implementations, (ii) they require nothing but a pairwise distance matrix as input and (iii) they are agglomerative, meaning that sets of clusters at arbitrary similarity thresholds can be extracted quickly by post-processing, once a complete clustering run has been executed

  • HCAs have been widely adopted in biology, in areas ranging from data mining to sequence analysis to evolutionary biology
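Properties (ii) and (iii) above can be illustrated with a minimal Python sketch. It is not HPC-CLUST's implementation: the function names `single_linkage_merge_log` and `clusters_at`, the tuple layout of the merge log, and the toy distance matrix are all assumptions made for illustration. The sketch takes nothing but a pairwise distance matrix, records every merge once, and then extracts clusters at several thresholds by replaying that record. Sorting all N(N-1)/2 pairs dominates the cost of this naive version, so this particular sketch runs in $O(N^2 \log N)$ time.

```python
from itertools import combinations


def _find(parent, i):
    """Return the representative of i's cluster, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def single_linkage_merge_log(dist):
    """Single-linkage clustering from a symmetric N x N distance matrix.

    Returns a list of (distance, i, j) tuples, one per merge, in order of
    increasing distance. The tuple layout is a hypothetical format.
    """
    n = len(dist)
    parent = list(range(n))
    # Sorting every pair is the dominant cost of this naive version.
    pairs = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    merge_log = []
    for d, i, j in pairs:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # this pair joins two distinct clusters
            parent[rj] = ri
            merge_log.append((d, i, j))
    return merge_log


def clusters_at(merge_log, n, threshold):
    """Replay all merges with distance <= threshold and return the clusters."""
    parent = list(range(n))
    for d, i, j in merge_log:
        if d > threshold:
            break  # the log is sorted, so no later merge qualifies
        parent[_find(parent, j)] = _find(parent, i)
    groups = {}
    for i in range(n):
        groups.setdefault(_find(parent, i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy distances between four sequences (made-up values).
    dist = [
        [0.00, 0.01, 0.10, 0.12],
        [0.01, 0.00, 0.11, 0.13],
        [0.10, 0.11, 0.00, 0.02],
        [0.12, 0.13, 0.02, 0.00],
    ]
    log = single_linkage_merge_log(dist)
    for t in (0.0, 0.03, 0.10):
        print(t, clusters_at(log, len(dist), t))
```

The clustering itself runs once; each additional threshold only costs one pass over the merge log, which is what makes post-processing at arbitrary similarity cut-offs cheap.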


Summary

INTRODUCTION

The time complexity of hierarchical clustering algorithms (HCA) is quadratic, $O(N^2)$, or even worse, $O(N^2 \log N)$, depending on the selected cluster linkage method (Day and Edelsbrunner, 1984). We present a distributed implementation of an HCA that can handle large numbers of sequences. It can compute single-, complete- and average-linkage clusters in a single run and produces a merge-log from which clusters can subsequently be parsed at any threshold. In contrast to CD-HIT, UCLUST and ESPRIT, which all take unaligned sequence data as their input, HPC-CLUST (like MOTHUR) takes as input a set of pre-aligned sequences. This allows for flexibility in the choice of alignment algorithm; a future version of HPC-CLUST may include the alignment step as well. Additional benchmarks are shown and discussed in the Supplementary Material.
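To make the role of pre-aligned input and of the three linkage methods concrete, here is a small Python sketch. It is an illustration under stated assumptions, not HPC-CLUST's code: the helper names `aligned_distance` and `linkage_distance` are hypothetical, and the gap handling (columns where both sequences have a gap are skipped, a gap opposite a base counts as a mismatch) is an assumption that may differ from the tool's own rules. Because the sequences are already aligned, a pairwise distance reduces to a column-by-column comparison, and the linkage method only decides how the pairwise distances between two clusters are summarized.

```python
def aligned_distance(a, b, gap="-"):
    """Fraction of mismatching columns between two aligned sequences.

    Assumption: gap-gap columns are ignored; gap-base columns are mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def linkage_distance(cluster_a, cluster_b, method, dist=aligned_distance):
    """Distance between two clusters under the chosen linkage method."""
    pairwise = [dist(x, y) for x in cluster_a for y in cluster_b]
    if method == "single":    # closest pair of members
        return min(pairwise)
    if method == "complete":  # farthest pair of members
        return max(pairwise)
    if method == "average":   # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


if __name__ == "__main__":
    # Made-up aligned sequences of equal length.
    cluster_a = ["ACGTACGT", "ACGTACGA"]
    cluster_b = ["ACGTTTGA"]
    for method in ("single", "complete", "average"):
        print(method, linkage_distance(cluster_a, cluster_b, method))
```

The pairwise distances are the same for all three methods, which is why a single pass over them can in principle serve several linkage types at once.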

METHODS
Clustering performance on a single computer
CONCLUSION