Construction of Whole Genomes from Scaffolds Using Single Cell Strand-Seq Data.

Mark Hills, Peter M Lansdorp, Kieran O’Neill, Ester Falconer, Ashley D Sanders, Kerstin Howe, Victor Guryev

doi:10.3390/ijms22073617

Abstract

Accurate reference genome sequences provide the foundation for modern molecular biology and genomics, as the interpretation of sequence data to study evolution, gene expression, and epigenetics depends heavily on the quality of the genome assembly used for its alignment. Correctly organising sequenced fragments such as contigs and scaffolds in relation to each other is a critical and often challenging step in the construction of robust genome references. We previously identified misoriented regions in the mouse and human reference assemblies using Strand-seq, a single-cell sequencing technique that preserves DNA directionality. Here we demonstrate the ability of Strand-seq to build and correct full-length chromosomes by identifying which scaffolds belong to the same chromosome and determining their correct order and orientation, without the need for overlapping sequences. We show that Strand-seq maps assembly fragments into large related groups and chromosome-sized clusters without using new assembly data. Using template strand inheritance as a bi-allelic marker, we employ genetic mapping principles to cluster scaffolds that are derived from the same chromosome and order them within the chromosome based solely on directionality of DNA strand inheritance. We demonstrate the utility of our approach by generating improved genome assemblies for several model organisms, including the ferret, pig, Xenopus, zebrafish, Tasmanian devil, and guinea pig.
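The clustering idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each scaffold has already been given a single strand call per cell (`W` or `C`, with any other character treated as uninformative), and it groups scaffolds whose calls agree, or consistently disagree, across cells. The function names, the toy data, and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations


def concordance(a, b):
    """Fraction of informative cells in which two scaffolds share a strand call.

    Returns None when no cell has a W or C call for both scaffolds.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x in "WC" and y in "WC"]
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_scaffolds(states, threshold=0.9):
    """Group scaffolds whose strand calls co-segregate across cells.

    Concordance near 1 suggests same chromosome and same orientation;
    concordance near 0 suggests same chromosome but opposite orientation.
    Both are treated as linkage. The threshold is an illustrative assumption.
    """
    parent = {name: name for name in states}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for s, t in combinations(states, 2):
        c = concordance(states[s], states[t])
        if c is not None and max(c, 1 - c) >= threshold:
            parent[find(s)] = find(t)

    groups = {}
    for name in states:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy data (invented): one strand call per scaffold in each of eight cells.
states = {
    "scafA": "WCWWCCWC",
    "scafB": "WCWWCCWC",  # identical to scafA: linked, same orientation
    "scafC": "CWCCWWCW",  # complement of scafA: linked, opposite orientation
    "scafD": "WWCWCWCC",  # agrees with scafA in half the cells: unlinked
}

clusters = cluster_scaffolds(states)
```

In this toy example `scafA`, `scafB`, and `scafC` fall into one group and `scafD` stays alone. Treating both high and low concordance as linkage is what lets a misoriented scaffold be placed with its chromosome and then flagged for reorientation.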

Highlights

We previously showed that Strand-seq locates sister chromatid exchanges (SCEs) at unparalleled resolution, seen as a template strand switching from W to C or vice versa [4,17,19,20].
The quality of genome assemblies is determined by the methods employed to build them, the algorithms used to create contigs and chromosomes, and the complexity of the genome.
Algorithms used to build contigs from overlapping sequences can vary wildly [14], often resulting in chimeric contigs which may be retained in future builds.
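As a rough illustration of how a template strand switch might be located, the sketch below scans an ordered series of per-bin strand calls for each cell and reports where the call changes. It also makes an assumption not stated above: that a switch seen at the same position in most cells more likely reflects an assembly problem (a misoriented or chimeric fragment) than an SCE, which would be expected to occur sporadically. The function names, the toy calls, and the 0.8 cutoff are hypothetical.

```python
from collections import Counter


def switch_points(calls):
    """Indices i at which the strand call differs between bin i-1 and bin i."""
    return [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]


def recurrent_switches(cells, min_fraction=0.8):
    """Switch positions shared by at least min_fraction of cells.

    Assumption: recurrent switches point to the assembly, sporadic ones to SCEs.
    """
    counts = Counter(i for calls in cells for i in switch_points(calls))
    return sorted(i for i, n in counts.items() if n / len(cells) >= min_fraction)


# Toy data (invented): strand calls along eight ordered bins in four cells.
cells = [
    "WWWCCWWW",
    "CCCWWCCC",
    "WWWCCWCC",  # carries one extra switch at bin 6
    "CCCWWCCC",
]

shared = recurrent_switches(cells)
# Switches seen in each cell that are not shared across the population.
sporadic = [[i for i in switch_points(c) if i not in shared] for c in cells]
```

Here the switches at bins 3 and 5 appear in every cell, while the switch at bin 6 appears only in the third cell.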

Summary

Introduction

The mouse [1] and human [2] genome references have revolutionized biomedical research and facilitated many advances in studies of transcription, epigenetics, genetic variation, evolution, and cancer [3]. While both assemblies are of very high quality, they still contain fragments that have not been localized to specific chromosomes, and large regions (typically flanked by unbridged gaps) that are incorrectly oriented with respect to adjacent scaffolds [4,5]. Assemblies themselves evolve over time as sequences are added, gaps are closed, and errors are resolved.

Methods

Results

Conclusion