Abstract

Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features of modern computing architectures. To obtain the best execution time and exploit these hardware features, the system-level parameters must be tuned before running the application. We studied GATK HaplotypeCaller, a component of common NGS workflows that consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked, and the execution time of HaplotypeCaller was optimized through various system-level parameters, which included: (i) tuning parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning of the PairHMM library for vectorization, (iii) enabling Java 1.8 features through GATK source-code compilation and building a runtime environment for parallel sorting and bulk data transfer, and (iv) switching the default 'on-demand' CPU frequency mode to 'performance' mode to accelerate the Java threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of the NGS pipeline was reduced by 70.60% and 34.14% for GATK 3.3 and GATK 3.7, respectively.
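To make these tunings concrete, the following is a minimal, hypothetical sketch (not the study's actual scripts) of how such a run might be launched: JVM flags enable parallel garbage collection, the temporary directory is placed on tmpfs (/dev/shm) to approximate in-memory I/O, the vectorized PairHMM implementation is requested on the GATK 3.x command line, and the CPU frequency governor is switched to 'performance'. The file names, heap size, and thread counts are illustrative assumptions.

```python
import glob
import os
import subprocess

# Hypothetical input/output names -- adjust for a real run.
REF = "reference.fasta"
BAM = "sample.recal.bam"
OUT = "sample.vcf"
TMPDIR = "/dev/shm/gatk"   # tmpfs-backed directory to approximate in-memory I/O


def set_performance_governor():
    """Switch every CPU core from the default 'ondemand' governor to
    'performance' (requires root privileges)."""
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
        with open(path, "w") as fh:
            fh.write("performance\n")


def run_haplotypecaller():
    """Run GATK 3.x HaplotypeCaller with parallel GC, a shared-memory temp
    directory, multiple CPU threads, and the vectorized PairHMM."""
    os.makedirs(TMPDIR, exist_ok=True)
    cmd = [
        "java",
        "-Xmx32g",                         # large heap; illustrative value
        "-XX:+UseParallelGC",              # parallel garbage collection
        "-XX:ParallelGCThreads=8",
        f"-Djava.io.tmpdir={TMPDIR}",
        "-jar", "GenomeAnalysisTK.jar",
        "-T", "HaplotypeCaller",
        "-R", REF,
        "-I", BAM,
        "-o", OUT,
        "-nct", "8",                       # HaplotypeCaller CPU threads
        "--pair_hmm_implementation", "VECTOR_LOGLESS_CACHING",  # vectorized PairHMM
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    set_performance_governor()
    run_haplotypecaller()
```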

Highlights

  • The impact of next generation sequencing (NGS) technologies in revolutionizing the biological and clinical sciences has been unprecedented[1, 2]

  • The execution time of genome alignment using the Burrows-Wheeler Aligner (BWA) can be improved by parallelization that includes: (a) thread-parallelization by using multiple threads[12], (b) data-parallelization by splitting the input into distinct chunks or intermediate data and processing the chunks one by one within or across nodes[13], and (c) data-parallelization with concurrent execution by splitting the data into disjoint chunks and distributing them for concurrent processing within or across nodes (see the sketch after this list)

  • Non-Uniform Memory Access (NUMA) based multi-CPU design is a feature of modern High Performance Computing (HPC) architectures, and more than 2 terabytes of main memory can be available within a single node
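Below is a minimal, hypothetical sketch of the BWA parallelization strategies listed in the second highlight: each pre-split read chunk is aligned with multi-threaded `bwa mem -t` (thread-parallelization), and the disjoint chunks are processed concurrently within a node (data-parallelization with concurrent execution). The chunk names, chunk count, and thread counts are illustrative assumptions, not values from the paper.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

REF = "reference.fasta"
# Pre-split, disjoint read chunks (names are illustrative).
CHUNKS = ["reads.chunk0.fastq", "reads.chunk1.fastq",
          "reads.chunk2.fastq", "reads.chunk3.fastq"]
THREADS_PER_CHUNK = 8   # thread-parallelization inside each bwa mem call


def align_chunk(chunk: str) -> str:
    """Align one read chunk with multi-threaded 'bwa mem'."""
    sam = chunk.replace(".fastq", ".sam")
    with open(sam, "w") as out:
        subprocess.run(["bwa", "mem", "-t", str(THREADS_PER_CHUNK), REF, chunk],
                       stdout=out, check=True)
    return sam


if __name__ == "__main__":
    # Data-parallel, concurrent execution of the disjoint chunks on one node.
    with ProcessPoolExecutor(max_workers=len(CHUNKS)) as pool:
        for sam in pool.map(align_chunk, CHUNKS):
            print("aligned:", sam)
```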


Introduction

The impact of next generation sequencing (NGS) technologies in revolutionizing the biological and clinical sciences has been unprecedented[1, 2]. In addition to data-parallelization (e.g. distribution of independent chunks of data across the CPUs), concurrent parallelization (e.g. multi-threading) is implemented on the multi-core CPUs of modern HPC systems[5, 14]. These types of BWA optimizations were carried out in our earlier work[3, 14] on a traditional HPC system. These implementations simulate the in-memory computing concept, which may bring a performance benefit but falls short in resource utilization[14]. To address this issue, we proposed optimizing the data-intensive computing model by using an optimal number of threads for each sample and processing multiple samples in parallel within a node[3]. Most variant discovery algorithms fail to scale up on multi-core HPC systems, which results in multi-threading overhead, poor scalability, and underutilization of HPC resources[16]. To address these challenges, data-parallelization and pipeline-parallel execution models have been implemented: cache fusion was used to improve the performance of genome alignment, choke elimination was used to remove waiting time in the workflow, and a merged-portion algorithm framework was invoked for better performance and optimal resource utilization in an optimized data-portion model[17]
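As an illustration of the sample-level parallel model described above (an optimal thread count per sample, with several samples processed concurrently within one node), the following is a minimal, hypothetical sketch; the alignment command, sample names, and the 4-way concurrency are assumptions rather than the paper's actual pipeline code.

```python
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

SAMPLES = ["sampleA", "sampleB", "sampleC", "sampleD"]    # illustrative names
CONCURRENT_SAMPLES = 4                                    # samples per node
CORES = os.cpu_count() or 8
THREADS_PER_SAMPLE = max(1, CORES // CONCURRENT_SAMPLES)  # per-sample thread budget


def process_sample(sample: str) -> str:
    """Run one per-sample pipeline step (alignment here) with the
    per-sample thread budget, so all cores of the node stay busy."""
    out = f"{sample}.sam"
    with open(out, "w") as fh:
        subprocess.run(
            ["bwa", "mem", "-t", str(THREADS_PER_SAMPLE), "reference.fasta",
             f"{sample}_R1.fastq", f"{sample}_R2.fastq"],
            stdout=fh, check=True)
    return out


if __name__ == "__main__":
    # Several samples run concurrently within the node.
    with ProcessPoolExecutor(max_workers=CONCURRENT_SAMPLES) as pool:
        for result in pool.map(process_sample, SAMPLES):
            print("finished:", result)
```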

