CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes.

Christophe Klopp,Heiner Kuhl,Ling Li,Matthias Stöck,Sven Wuertz,Xu-Fang Liang

doi:10.1093/gigascience/giaa034

Abstract

Background: Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce.

Results: Chromosome-Scale Assembler (CSA) is a novel computationally highly efficient bioinformatics pipeline that fills this gap. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. As CSA performs automated assembly of chromosome-sized scaffolds, we benchmark its performance against state-of-the-art reference genomes, i.e., conventionally built in a laborious fashion using multiple separate assembly tools and manual curation. CSA increases the contig lengths using scaffolding, local re-assembly, and gap closing. On certain datasets, initial contig N50 may be increased up to 4.5-fold. For smaller vertebrate genomes, chromosome-scale assemblies can be achieved within 12 h using low-cost, high-end desktop computers. Mammalian genomes can be processed within 16 h on compute-servers. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA calculates chromosome-scale assemblies from long-read data and genome comparisons alone. Even contig-level draft assemblies of diverged genomes are helpful for reconstructing chromosome-scale sequences. CSA is also capable of assembling ultra-long reads.

Conclusions: CSA can speed up and simplify chromosome-level assembly and significantly lower costs of large-scale family-level vertebrate genome projects.

Highlights

Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce
We show that Chromosome-Scale Assembler (CSA) is able to produce chromosomal-level assemblies for smaller vertebrate genomes within 12 h on low-cost computing equipment ($1,000–2,000, Intel i7, 128 GB random access memory (RAM)), using just long-read data and a diverged reference genome as input
Our results show that long-read data and well-chosen, order-level state-of-the-art reference genomes enable CSA to calculate highly continuous assemblies for most chromosomes, but in some cases clade-specific problems have to be resolved by manual curation

Summary

Introduction

Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA calculates chromosome-scale assemblies from long-read data and genome comparisons alone. Most vertebrate genomes can be assembled using noisy long reads [1,2,3], and the results, in terms of assembly contiguity measured as contig N50, can outperform those obtained by short-read sequencing at >100× coverage. The contig N50 of today’s noisy long-read assemblies reaches lengths similar to the scaffold N50 of high-quality short-read genome assemblies obtained some years ago. Most long-read assemblers produce only contigs [10,11,12,13,14,15] and do not incorporate additional information to order these contigs into scaffolds, which would enable further gap closing and lead to chromosomal-level assemblies.
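Since contig N50 and scaffold N50 are the contiguity measures used throughout, a minimal sketch of how N50 is commonly computed may help. It is not taken from the CSA code base; the function name `n50` and the toy length lists are hypothetical, and the definition assumed here is the usual one (the largest length L such that sequences of length at least L together cover at least half of the total assembly length).

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L sum to at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the midpoint.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical toy data: the same 300 units of sequence, first as
# contigs, then after some contigs have been joined into scaffolds.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
scaffold_lengths = [150, 90, 40, 20]

contig_n50 = n50(contig_lengths)
scaffold_n50 = n50(scaffold_lengths)
fold_increase = scaffold_n50 / contig_n50
print(contig_n50, scaffold_n50, round(fold_increase, 2))
```

In this toy case, joining contigs leaves the total length unchanged but raises N50, which is the kind of fold increase the abstract reports for scaffolding, local re-assembly, and gap closing.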

Methods

Results

Conclusion