The parallelism motifs of genomic data analysis.

Katherine Yelick, Ariful Azad, Saliya Ekanayake, Benjamin Brock, Giulia Guidi, Rob Egan, Cristina Teodoropol, Aydın Buluç, Steven Hofmeyr, Marquita Ellis, Oguz Selvitopi, Leonid Oliker, Evangelos Georganas, Muaaz Awan

doi:10.1098/rsta.2019.0394

Abstract

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
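To make the two patterns named above concrete, the following is a minimal serial sketch of how sorting and hashing can each be used to count k-mers (length-k substrings of reads). It is an illustration only, not code from the article: the function names `kmers`, `count_by_hashing` and `count_by_sorting` and the toy reads are assumptions chosen for this example.

```python
from collections import Counter
from itertools import groupby


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_by_hashing(reads, k):
    """Hashing pattern: each k-mer is looked up and updated in a hash table."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return dict(counts)


def count_by_sorting(reads, k):
    """Sorting pattern: sort all k-mers, then count runs of equal neighbours."""
    all_kmers = sorted(km for read in reads for km in kmers(read, k))
    return {km: sum(1 for _ in run) for km, run in groupby(all_kmers)}


# Toy input (hypothetical reads, chosen only to show that both routes agree).
reads = ["ACGTAC", "GTACGT"]
hashed = count_by_hashing(reads, 3)
sorted_counts = count_by_sorting(reads, 3)
```

Both routes produce the same table. In a parallel setting the hashing route tends to generate many small, irregular updates, while the sorting route moves data in bulk, which is one reason the two are worth treating as separate patterns.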

Highlights

The future of scientific computing will be increasingly data intensive due to the growth of data from sequencers, telescopes, microscopes, light sources, particle detectors and embedded environmental sensors
We describe parallelization challenges and approaches for high-performance genomic data analysis using a series of examples drawn in large part from the ExaBiome project, including k-mer counting, alignment, genome assembly, protein clustering and machine learning
We describe at a high level some of the algorithms and parallelization approaches used in genomic data analysis, selecting a set of problems that represent a diverse set of computational patterns and are prevalent across multiple applications
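As a rough illustration of the k-mer counting example mentioned in the highlights, the sketch below simulates a hash-partitioned count across several ranks inside one Python process. It is a sketch under stated assumptions rather than the ExaBiome implementation: the owner function based on `zlib.crc32`, the names `owner`, `partition_kmers` and `distributed_count`, and the list-of-lists stand-in for an all-to-all exchange are all hypothetical.

```python
import zlib
from collections import Counter


def owner(kmer, nranks):
    """Map a k-mer to the rank that owns it (deterministic hash, an assumption)."""
    return zlib.crc32(kmer.encode("ascii")) % nranks


def partition_kmers(reads, k, nranks):
    """Build one outgoing buffer per destination rank from the local reads."""
    outgoing = [[] for _ in range(nranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            outgoing[owner(km, nranks)].append(km)
    return outgoing


def distributed_count(reads_per_rank, k):
    """Simulate the exchange: rank dst receives send[src][dst] from every src."""
    nranks = len(reads_per_rank)
    send = [partition_kmers(reads, k, nranks) for reads in reads_per_rank]
    tables = []
    for dst in range(nranks):
        table = Counter()  # local hash table holding only the k-mers dst owns
        for src in range(nranks):
            table.update(send[src][dst])
        tables.append(table)
    return tables


# Three simulated ranks; the third starts with no reads (toy data).
tables = distributed_count([["ACGTAC"], ["GTACGT"], []], k=3)
merged = sum(tables, Counter())
```

Because the destination of each k-mer depends on its hash rather than on where the read was stored, the amount of data each rank sends and receives is data dependent, which is the kind of irregular communication the abstract refers to.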


Cite this article: Yelick K et al. 2020 The parallelism motifs of genomic data analysis.

Introduction

ExaBiome overview

A sampling of genomic analyses

Comparison with other parallelism motifs

Findings

Hardware and software support for parallel genome analysis

Summary


Journal: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Publication Date: Jan 20, 2020
Citations: 13
License type: cc-by
