IBM Power Research Articles

This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the LU factorization originally used for the condition number estimator, which is hostile to portable implementations, with the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in out-of-order task scheduling. Benchmarking experiments show significant improvements over existing state-of-the-art high-performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on the latest shared-memory systems from several vendors (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.
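The DAG view described in the abstract can be illustrated with a minimal sketch. It does not use StarPU or reproduce the authors' implementation: it relies only on Python's standard `graphlib` module, and the task names (`qr_panel_0`, `update_0_1`, and so on) are hypothetical placeholders for tile-level kernels. Tasks are released as soon as all of their predecessors have completed, which is the basic idea behind out-of-order scheduling.

```python
from graphlib import TopologicalSorter


def schedule(dag):
    """Group tasks into waves; every task in a wave has all predecessors done.

    `dag` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        # All tasks whose dependencies are satisfied; these could run in parallel.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Mark the whole wave as finished so that successors become ready.
        sorter.done(*ready)
    return waves


# Hypothetical tile tasks (names are illustrative, not taken from the paper).
dag = {
    "qr_panel_0": set(),
    "update_0_1": {"qr_panel_0"},
    "update_0_2": {"qr_panel_0"},
    "qr_panel_1": {"update_0_1"},
    "update_1_2": {"qr_panel_1", "update_0_2"},
}

waves = schedule(dag)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

Grouping into waves is coarser than what a dynamic runtime does: in this example `qr_panel_1` depends only on `update_0_1`, so a runtime that tracks individual completions could start it before `update_0_2` finishes, which is the kind of look-ahead the abstract refers to.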

Background: The decreasing costs of sequencing are driving the need for cost-effective, real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies.

Results: We present a high-throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred a total of only 6 TB of data across the platforms.

Conclusions: Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1211-6) contains supplementary material, which is available to authorized users.

Articles published on IBM Power

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

Low-synchronization, mostly lock-free, elastic scheduling for streaming runtimes

IBM Power9 Processor Architecture

Mixed-size concurrency: ARM, POWER, C/C++11, and SC

A hybrid computational strategy to address WGS variant analysis in >5000 samples.

Performance analysis of the Kahan‐enhanced scalar product on current multi‐core and many‐core processors

Heat transfer and entropy generation in mixed convection of a nanofluid within an inclined skewed cavity

X10 and APGAS at Petascale

Workload acceleration with the IBM POWER vector-scalar architecture

PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor

Deterministic Random Walk: A New Preconditioner for Power Grid Analysis

MCMC2 (version 1.1.1): A Monte Carlo code for multiply charged clusters

Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8

A Partitioned Global Address Space implementation of the European Centre for Medium Range Weather Forecasts Integrated Forecasting System

A data-centric algorithm for automated detection and extraction of isoparametric surfaces

High performance locks for multi-level NUMA systems

IBM POWER8 performance features and evaluation

Debugging post-silicon fails in the IBM POWER8 bring-up lab

The cache and memory subsystems of the IBM POWER8 processor

Transactional memory support in the IBM POWER8 processor
