Abstract
Recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinates of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.
Highlights
The human reference genome is a fundamental resource for human genetics and biomedical research
We will first describe a data model for reference pangenome graphs, which establishes the foundation of this article
We will demonstrate the utility of pangenome graphs with a human graph generated from twenty human haplotypes and a primate graph generated from four species
Summary
The human reference genome is a fundamental resource for human genetics and biomedical research. The primary sequences of the reference genome GRCh38 [1] are a mosaic of haplotypes, with each haplotype segment derived from a single human individual. They cannot represent the genetic diversity in human populations, and as a result, each individual may carry thousands of large germline variants absent from the reference genome [2]. Some of these variants are likely associated with phenotype [3] but are often missed or misinterpreted when we map sequence data to GRCh38, in particular with short reads [4].