Abstract

BLAST is arguably the single most important piece of software ever written for the biological sciences. It is the core of most bioinformatics workflows, being a critical component of genome homology searches and annotation. It has influenced the landscape of biology by aiding in everything from functional characterization of genes to pathogen detection to the development of novel vaccines. While BLAST is very popular, it is also often one of the most computationally intensive parts of a bioinformatics analysis. In our workflows, BLAST typically takes the majority of CPU time, and we need to parallelize it to finish in a reasonable time frame. Waiting for BLAST to finish without any clue of how long it’s going to take is kind of depressing, and you could waste a day of work on a job that was never going to finish. If you feel the same way we do, then check out Cunningham, a tool we designed to estimate BLAST runtimes for shotgun sequence datasets using sequence composition statistics. We’ve trained its models on real metagenomic sequence data using the Amazon EC2 cloud, and it provides a relatively quick estimate for datasets with up to tens of millions of sequences. It’s not perfect, but it’ll give you at least some idea of the expected runtime, how large a cluster you’re going to need, how much you’ll need to partition your data, etc. We use it all the time now, so we hope it’ll be useful to someone else out there. Cunningham has been implemented in CloVR for efficient autoscaling in the cloud and is freely available at http://clovr.org.
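To illustrate how a runtime estimate turns into a cluster size and a partitioning plan, here is a minimal Python sketch. It is not Cunningham's code or interface: the function names, parameters, and the assumption of ideal (linear) parallel scaling are all hypothetical, and the CPU-hour figure stands in for whatever estimate the tool reports.

```python
import math


def nodes_needed(est_cpu_hours, target_wall_hours, cores_per_node):
    """Smallest node count that meets the wall-clock target.

    Assumes ideal scaling: doubling the cores halves the wall time.
    Real BLAST jobs scale less cleanly, so treat this as a lower bound.
    """
    if est_cpu_hours < 0 or target_wall_hours <= 0 or cores_per_node <= 0:
        raise ValueError("estimate must be >= 0; target and cores must be > 0")
    cpu_hours_per_node = target_wall_hours * cores_per_node
    return max(1, math.ceil(est_cpu_hours / cpu_hours_per_node))


def partition_sizes(n_sequences, n_chunks):
    """Split n_sequences into n_chunks chunks of near-equal size."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be > 0")
    base, extra = divmod(n_sequences, n_chunks)
    # The first `extra` chunks take one additional sequence each.
    return [base + 1 if i < extra else base for i in range(n_chunks)]


if __name__ == "__main__":
    # Hypothetical numbers: a 1000 CPU-hour estimate, a 10-hour deadline,
    # and 8-core nodes.
    nodes = nodes_needed(1000, 10, 8)
    print(nodes, partition_sizes(5_000_000, nodes)[:3])
```

Splitting by sequence count assumes that runtime is spread roughly evenly across sequences; if it is not, balancing chunks by total bases may be a better choice.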
