Abstract

Background

Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to exponential growth of sequence databases. Researchers use BLAST to process these sequences. However, the traditional software parallelization techniques (threads, Message Passing Interface) applied in newer versions of BLAST are not adequate for processing these sequences in a timely manner.

Methods

A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. (The number of CPUs used to complete the job equals T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol 5:4818-4820, 2014; Grant et al., Bioinformatics 18:765-766, 2002; Mathog, Bioinformatics 19:1865-1866, 2003), and is accordingly referred to as the "dual segmentation" method. To implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results produced. Dual segmentation allows for massive parallelization, which lifts the scaling ceiling.

Results

BLAST jobs that previously failed or ran inefficiently to completion now finish at speeds that characteristically reduce wall-clock time from 27 days on 40 CPUs to a single day using 4104 tasks, with each task utilizing eight CPUs and completing in less than 7 minutes.

Conclusions

The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope, and execution time of each task. Besides significantly faster completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. "Trickling in" a swarm of small individual tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware: the CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24 GB RAM still perform well in 2017 for these dual segmentation jobs. Finally, these techniques are friendly to budget-conscious scientific research organizations, where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour, or on a space-available basis (as is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated.
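To make the dual segmentation idea concrete, here is a minimal Python sketch, not the paper's implementation: the segment counts, file names, and task-to-segment mapping are illustrative assumptions (8 query segments by 513 database segments is one hypothetical factorization of the 4104 tasks cited above).

    # Dual segmentation: split BOTH the query set and the database into
    # segments; each array-job task searches one (query, database) pair.
    Q = 8    # query-file segments (assumed for illustration)
    D = 513  # database segments (assumed; 8 * 513 = 4104 tasks)

    def task_to_pair(task_id, d_segments=D):
        """Map a 0-based array-task id to (query segment, database segment)."""
        return divmod(task_id, d_segments)

    for t in range(Q * D):
        q, d = task_to_pair(t)
        # Standard BLAST+ flags; the file names are placeholders. Per the
        # abstract, the modified BLAST would also be told the effective
        # number of sequences in the FULL database, so each segment reports
        # E-values as if the entire database had been searched.
        print(f"blastn -query query_part_{q}.fasta "
              f"-db db_part_{d} -out hits_{q}_{d}.tsv -outfmt 6")

After all T tasks finish, the per-pair output files would be consolidated (concatenated and re-sorted per query), which is the partial-results management step the abstract mentions.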

Highlights

  • Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to the exponential growth of sequence databases

  • The Basic Local Alignment Search Tool (BLAST) family of programs is used to address a fundamental problem in bioinformatics research: sequence search and alignment

  • Developers of BLAST apply a heuristic algorithm using a statistical model to speed up the search process and achieve linear time complexity



Introduction

Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to exponential growth of sequence databases. The BLAST family of programs is used to address a fundamental problem in bioinformatics research: sequence search and alignment. Using these programs, scientists compare query sequences with a library or database of sequences such as GenBank [4] to identify library sequences that resemble each query sequence. The developers of BLAST apply a heuristic algorithm using a statistical model to speed up the search process and achieve linear time complexity. This approach produces less accurate results than the exhaustive Needleman-Wunsch [5] and Smith-Waterman [6] algorithms created earlier for the same purposes, but those exhaustive algorithms are problematic for practical use in resource-constrained environments.
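To see why the exhaustive algorithms are costly, consider that Needleman-Wunsch fills an (m+1) by (n+1) score matrix for sequences of lengths m and n, giving quadratic O(mn) time per query-library pair, versus the near-linear behavior of BLAST's heuristic. The following is a textbook sketch of that dynamic program in Python (standard scoring values chosen for illustration, not code from the paper):

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        """Global-alignment score via the classic dynamic program.

        Runs in O(len(a) * len(b)) time and space; this per-pair quadratic
        cost is what BLAST's heuristic trades accuracy to avoid.
        """
        m, n = len(a), len(b)
        # dp[i][j] = best score aligning the prefixes a[:i] and b[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = dp[i - 1][0] + gap   # a[:i] aligned against gaps
        for j in range(1, n + 1):
            dp[0][j] = dp[0][j - 1] + gap   # b[:j] aligned against gaps
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
        return dp[m][n]

    print(needleman_wunsch("GCATGCU", "GATTACA"))  # -> 0

Smith-Waterman computes a local-alignment variant of the same matrix, so it shares the quadratic per-pair cost; at database scale, this is what makes the exhaustive algorithms impractical without heuristics.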

