Computational pan-genomics: status, promises and challenges.

Ali Ghaffaari, Siavash Sheikhizadeh, Lodewyk F. A. Wessels, Paul J. Kersey, Mohammed El-Kebir, Benjamin Langmead, Knut Reinert, Veli Mäkinen, Paul I.W. De Bakker, Bas E. Dutilh, Valentina Boeva, Gunnar W. Klau, Ashley D. Sanders, Paul Medvedev, Jiayin Wang, Thomas Abeel, Nadia Pisanti, Raoul J. P. Bonnal, Fabio Vandin, Alexander Schönhuth, Tobias Marschall, Eric Rivals, Sven Rahmann, Kai Ye, John C. Mu, Eleazar Eskin, Ben Raphael, Can Alkan, Manja Marz, Daniel Valenzuela, Matthias Schlesner, Rayan Chikhi, Louis J. Dijkstra, Wigard P. Kloosterman, Corinna Ernst, Benedict Paten, Ole Schulz-Trieglaff, Sandra Smit, Erwin Datema, Pierre Peterlongo, Jeroen De Ridder, Francesca Chiaromonte, Francesca D. Ciccarelli, Erik Garrison, Eric-Wubbo Lameijer, Pieter B. Neerincx, Evan E. Eichler, David Porubsky, Marcel Martin, Victor Guryev, Jan O. Korbel, Cornelia M. Van Duijn, Klaasjan G. Ouwens, Carl Shneider, Adam M. Novak, Dick De Ridder, Robin Cijvat, Jasmijn A. Baaijens, Ying Zhang

doi:10.1093/bib/bbw089

Abstract

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

Highlights

While we are aware that the above definition of a pan-genome is general, we argue that it is instrumental for identifying common computational problems that occur in different disciplines

Introduction

In 1995, the complete genome sequence for the bacterium Haemophilus influenzae was published [1], followed by the sequence for the eukaryote Saccharomyces cerevisiae in 1996 [2] and the landmark publication of the human genome in 2001 [3, 4]. These sequences, and many more that followed, have served as ‘reference genomes’, which formed the basis both for major advances in functional genomics and for studying genetic variation by re-sequencing other individuals from the same species [5, 6, 7, 8]. Such a reference sequence can take a number of forms, including:
