Hybrid de novo genome assembly and centromere characterization of the gray mouse lemur (Microcebus murinus)

Peter A Larsen, Richard A Gibbs, R Alan Harris, Adam D Brown, Donna M Muzny, Neva C Durand, Kim C Worley, Shwetha C Murali, C Ryan Campbell, Muhammad S Shamim, Jennifer Shelton, Susan J Brown, Anne D Yoder, Beth A Sullivan, Olga Dudchenko, Ido Machol, Muthuswamy Raveendran, E Aiden, Yue Liu, Jeffrey Rogers

doi:10.1186/s12915-017-0439-6

Abstract

Background: The de novo assembly of repeat-rich mammalian genomes using only high-throughput short read sequencing data typically results in highly fragmented genome assemblies that limit downstream applications. Here, we present an iterative approach to hybrid de novo genome assembly that incorporates datasets stemming from multiple genomic technologies and methods. We used this approach to improve the gray mouse lemur (Microcebus murinus) genome from early draft status to a near chromosome-scale assembly.

Methods: We used a combination of advanced genomic technologies to iteratively resolve conflicts and super-scaffold the M. murinus genome.

Results: We improved the M. murinus genome assembly to a scaffold N50 of 93.32 Mb. Whole genome alignments between our primary super-scaffolds and 23 human chromosomes revealed patterns that are congruent with historical comparative cytogenetic data, thus demonstrating the accuracy of our de novo scaffolding approach and allowing assignment of scaffolds to M. murinus chromosomes. Moreover, we utilized our independent datasets to discover and characterize sequences associated with centromeres across the mouse lemur genome. Quality assessment of the final assembly found 96% of mouse lemur canonical transcripts nearly complete, comparable to other published high-quality reference genome assemblies.

Conclusions: We describe a new assembly of the gray mouse lemur (Microcebus murinus) genome with chromosome-scale scaffolds produced using a hybrid bioinformatic and sequencing approach. The approach is cost effective and produces superior results based on metrics of contiguity and completeness. Our results show that emerging genomic technologies can be used in combination to characterize centromeres of non-model species and to produce accurate de novo chromosome-scale genome assemblies of complex mammalian genomes.

Highlights

The de novo assembly of repeat-rich mammalian genomes using only high-throughput short read sequencing data typically results in highly fragmented genome assemblies that limit downstream applications
We present an iterative approach to hybrid de novo genome assembly that incorporates datasets stemming from multiple genomic technologies and methods, namely Illumina, PacBio, Hi-C, and BioNano (Fig. 1, Additional file 1: Figure S1)
A second iteration of these two super-scaffolding steps corrected 308 putative misjoins, clustered 6934 contigs (85% of total contigs) representing 2.47 Gb (99%) of assembled sequence, and ordered 98% of the total sequence length in these clusters

Introduction

The de novo assembly of repeat-rich mammalian genomes using only high-throughput short-read sequencing data typically results in highly fragmented genome assemblies that limit downstream applications. Perhaps one of the most exciting areas of advancement has been in the field of genome sequencing and assembly, where it is now possible for individual researchers to produce genome assemblies for organisms of their choosing. Despite these recent advancements, there remain significant challenges to the production of high-quality de novo eukaryotic genome assemblies. An ideal de novo whole genome assembly will be as continuous as possible (i.e., have minimal gaps), will accurately reflect the linear organization of chromosomes, and will contain few, if any, errors in nucleotide sequence. Such high-quality assemblies can be annotated with all the genomic features that biologists wish to investigate, including protein-coding genes, noncoding genes, regulatory sequences, repetitive regions, and heterochromatic regions such as telomeres and centromeres. The low cost of next-generation sequencing (NGS), combined with its success in producing high-accuracy sequences, is driving the production of many new de novo genome assemblies using solely NGS data.
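Contiguity of the kind described above is usually summarized with the N50 statistic, the metric quoted in the abstract for the final assembly. The following is a minimal sketch, assuming the standard definition of N50 (the length L such that sequences of length L or longer together cover at least half of the total assembly length). The function name `n50` and the toy scaffold lengths are hypothetical and are not taken from the study's data or pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the standard definition: sort lengths from longest to
    shortest and report the length at which the running total first
    reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints 70
```

In this toy example the total length is 300, and the two longest scaffolds (80 + 70) already reach half of it, so the N50 is 70. Joining fragments into fewer, longer scaffolds raises the N50, which is why the statistic is commonly used to compare assembly contiguity.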

Methods

Results

Discussion

Conclusion