Abstract

The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net).

Highlights

  • The application of next-generation sequencing (NGS) technologies has led to an explosion of short-read sequence datasets

  • MerAligner [29] is a parallel Unified Parallel C (UPC) short-read aligner for distributed-memory architectures which obtains good scalability on multi-core clusters. merAligner optimizes the distribution of the reference genome index in case that it is too large to fit in one node

  • We have analyzed the scalability of our UPC++ implementation by aligning four Illumina short-read datasets to the human genome hg38

Read more

Summary

Introduction

The application of next-generation sequencing (NGS) technologies has led to an explosion of short-read sequence datasets. The alignment of produced sequences to a given reference genome, i.e. short-read alignment (SRA), is one of the most important basic operations required for further downstream analysis. All of them are based on the seed-and-extend approach, using different seeding policies This approach maps a given read by first identifying seeds on the genome using efficient indexing data structures. MerAligner [29] is a parallel UPC short-read aligner for distributed-memory architectures which obtains good scalability on multi-core clusters. The execution model of UPC++ is single program multiple data (SPMD) As this language is able to work on both shared-memory and distributed-memory systems, each independent execution unit (UPC++ process) can be implemented as an OS process or a POSIX thread (Pthread). UPC++ takes advantage of C++ language features, such as templates, objectoriented design, operator overloading, and lambda functions (in C++ 11) to provide advanced PGAS features

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.