Abstract

In March 2019, 45 scientists and software engineers from around the world gathered at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome and to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including the advantages of graph genomes over a linear reference representation, the design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.

Highlights

  • What is pangenomics? The current human reference genome, GRCh38 (Schneider et al, 2017), derives from a draft sequence that was constructed from a handful of individuals (Lander et al, 2001), likely of African and European ancestries (Reich et al, 2009).

  • Use cases: Graph genomes can be used for inference of extension and phasing from sparse information derived from SNP chips and RNA sequencing (RNA-seq).

  • Ongoing improvements in sequencing technology and diminishing costs make it possible today to generate high-quality genome assemblies from diverse populations in a way that could only have been imagined during the Human Genome Project (HGP).


Summary

Introduction

What is pangenomics? The current human reference genome, GRCh38 (Schneider et al, 2017), derives from a draft sequence that was constructed from a handful of individuals (Lander et al, 2001), likely of African and European ancestries (Reich et al, 2009). A human “pangenome” is a representation of all genomic variation observed in human populations (Computational Pan-Genomics Consortium, 2018). In this context, a pangenome is a more comprehensive representation of genetic diversity than an individual diploid genome or a reference composed of linear chromosomes built from multiple individuals, such as GRCh38. Although bias and missing sequence may still persist in a pangenome, their effects should be substantially smaller and may even be ameliorated by adding new content to the framework. In addition to these issues with the current reference, several studies using long reads have reported an average of ~20,000 structural variants (SVs) per human genome, most of which fall within repetitive elements and segmental duplications (HGSVC) (Audano et al, 2019; Chaisson et al, 2015). It is first essential to phase haplotypes from at least one of the parents (Fan et al, 2012; Kitzman et al, 2012).

