A 3-way hybrid approach to generate a new high-quality chimpanzee reference genome (Pan_tro_3.0).

Lukas F K Kuderna, Arcadi Navarro, John Huddleston, David Gordon, Hafid Laayouni, Chad Tomlinson, Aitor Serres Armero, Javier Herrero, Paolo Ribeca, Inna Povolotskaya, Jaume Betranpetit, Annabel Tran, Lars Feuk, Joel Armstrong, Benedict Paten, Tyler Alioto, Edward Green, Ladeana W Hillier, Wesley C Warren, Raquel García-Pérez, Evan E Eichler, Andrew J Sharp, Jèssica Gómez-Garrido, Daniel J Ho, Ian T Fiddes, Tomàs Marquès‐Bonet

doi:10.1093/gigascience/gix098

Abstract

The chimpanzee is arguably the most important species for the study of human origins. A key resource for these studies is a high-quality reference genome assembly; however, as with most mammalian genomes, the current iteration of the chimpanzee reference genome assembly (Pan_tro_2.1.4) is highly fragmented: its sequence is scattered across more than 183 000 contigs, incorporating more than 159 000 gaps, with a genome-wide contig N50 of 51 Kbp. In this work, we produce an extensive and diverse array of sequencing datasets to rapidly assemble a new chimpanzee reference that surpasses previous iterations in bases represented and organized in large scaffolds. We show substantial improvements over Pan_tro_2.1.4 by several metrics, such as contiguity increased by >750% and 300% on contigs and scaffolds, respectively, and closure of 77% of the gaps in the Pan_tro_2.1.4 assembly, with the closed gaps spanning >850 Kbp of novel coding sequence based on RNASeq data. We further report more than 2700 genes that had putatively erroneous frame-shift predictions relative to human in Pan_tro_2.1.4 and show a substantial increase in the annotation of repetitive elements. We apply a simple 3-way hybrid approach to considerably improve the reference genome assembly for the chimpanzee, providing a valuable resource for the study of human origins. Furthermore, we produce extensive sequencing datasets that are all derived from the same cell line, generating a broad non-human benchmark dataset.

Highlights

To test the potentially combinatorial power of varied sequencing and mapping strategies, we created several different datasets on different platforms to leverage the advantages of each, as the shortcomings of one sequencing strategy might be compensated for by another [1]
We show substantial improvements over the current release of the chimpanzee genome (Pan_tro_2.1.4) by several metrics, such as contiguity increased by >750% and 300% on contigs and scaffolds, respectively, and closure of 77% of the gaps in the Pan_tro_2.1.4 assembly, with the closed gaps spanning >850 kilobase pairs (Kbp) of novel coding sequence based on RNASeq data
These diverse datasets complement the resources that were already available for the same cell line, namely 6-fold coverage of ABI Sanger capillary reads used for the initial chimpanzee genome assembly, 100-bp paired Illumina HiSeq data, a fosmid library at 6-fold physical coverage with available end sequences, a Bacterial Artificial Chromosome (BAC) library at 3-fold physical coverage with available end sequences, and around 700 finished BACs [4]
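The fold-coverage figures quoted for the read sets and clone libraries follow from simple arithmetic: total bases spanned by the units (reads, or clone inserts for physical coverage) divided by genome size. The sketch below illustrates that relationship only. The 3 Gbp genome size and the 40 Kbp fosmid insert length are round-number assumptions for illustration, not values taken from this work, and the function names are hypothetical.

```python
import math


def fold_coverage(n_units, unit_length_bp, genome_size_bp):
    """Fold coverage = total bases spanned by all units / genome size.

    For sequence coverage a unit is a read; for physical coverage of a
    clone library a unit is a whole clone insert, not just its end reads.
    """
    return n_units * unit_length_bp / genome_size_bp


def units_needed(target_fold, unit_length_bp, genome_size_bp):
    """Smallest number of units that reaches the target fold coverage."""
    return math.ceil(target_fold * genome_size_bp / unit_length_bp)


# Assumed, illustrative values (not from the paper):
GENOME_SIZE_BP = 3_000_000_000   # round-number mammalian genome size
FOSMID_INSERT_BP = 40_000        # typical fosmid insert length

clones = units_needed(6, FOSMID_INSERT_BP, GENOME_SIZE_BP)
print(clones)
print(fold_coverage(clones, FOSMID_INSERT_BP, GENOME_SIZE_BP))
```

Under these assumptions, a 6-fold physical coverage corresponds to several hundred thousand fosmid clones; the actual library size depends on the true insert-size distribution and genome size.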

Summary

Introduction

To test the potentially combinatorial power of varied sequencing and mapping strategies, we created several different datasets on different platforms to leverage the advantages of each, as the shortcomings of one sequencing strategy might be compensated for by another [1]. We show substantial improvements over the current release of the chimpanzee genome (Pan_tro_2.1.4) by several metrics, such as contiguity increased by >750% and 300% on contigs and scaffolds, respectively, and closure of 77% of the gaps in the Pan_tro_2.1.4 assembly, with the closed gaps spanning >850 Kbp of novel coding sequence based on RNASeq data.
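Contiguity comparisons of this kind are conventionally expressed through N50, the metric quoted for Pan_tro_2.1.4 in the abstract. The following minimal sketch shows how a contig or scaffold N50 and a relative improvement can be computed. The lengths are toy values, the function names are hypothetical, and reading the reported percentages as relative increases over the old value is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def percent_increase(old, new):
    """Relative increase of new over old, in percent (assumed reading)."""
    return (new - old) / old * 100


# Toy contig lengths in Kbp (hypothetical, not from either assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))

# Toy N50 values chosen only to illustrate a 750% relative increase.
print(percent_increase(100, 850))
```

Sorting once and accumulating from the longest sequence keeps the computation linear after the sort, which is sufficient even for assemblies with hundreds of thousands of contigs.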

Results

Conclusion