Abstract

We describe an integrative approach to improve the contiguity and haploidy of a reference genome assembly and demonstrate its impact with practical examples. Using two novel features of the Lep-Anchor software and a combination of dense linkage maps, overlap detection and bridging long reads, we generated an improved assembly of the nine-spined stickleback (Pungitius pungitius) reference genome. We were able to remove a significant number of haplotypic contigs, detect more genetic variation and improve the contiguity of the genome, especially that of the X chromosome. However, improved scaffolding cannot correct for mosaicism of erroneously assembled contigs, as demonstrated by a de novo assembly of a 1.6-Mbp inversion. Qualitatively similar gains were obtained with the genome of the three-spined stickleback (Gasterosteus aculeatus). Since the utility of genome-wide sequencing data in biological research depends heavily on the quality of the reference genome, the improved and fully automated approach described here should be helpful in refining reference genome assemblies.

Highlights

  • A great deal of present-day research in biology is based on genomic data that are processed and analysed in the context of a linear reference genome.

  • Starting from an existing high-quality contig assembly, original PacBio reads and ultra-dense linkage maps for the nine-spined stickleback (Pungitius pungitius), we were able to generate a significantly improved reference genome using largely automated methods.

  • Faced with the dilemma of correctly separating duplicated genome regions while simultaneously collapsing and merging haplotypic differences into a haploid sequence, all assembly programmes are poised to make errors. The magnitude of these errors depends on the heterozygosity of the reference individual and on the type of input data: long reads span more distant sites and can create longer haplotype blocks. The direction of the bias, towards either too long or too short a genome, depends on the algorithm.



Introduction

A great deal of present-day research in biology is based on genomic data that are processed and analysed in the context of a linear reference genome. Typical examples of this are whole-genome sequencing studies where sequencing reads are mapped to the reference genome and the characteristics of interest are derived from local dissimilarities and statistics based on the alignments (Korneliussen et al., 2014; Schraiber & Akey, 2015). The profound problem is that physical connectivity is lost during sequencing, and recovering it at the assembly stage is notoriously difficult. To this end, high-quality linkage maps are valuable because they allow the physical order and orientation of the assembled contigs to be inferred (Pengelly & Collins, 2019; Rastas, 2020; Stemple, 2013).
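To make the idea of anchoring contigs with a linkage map concrete, the following toy sketch orders contigs within one linkage group by the mean genetic position of their markers and guesses each contig's orientation from whether genetic position rises or falls along the contig. It is an illustration only and not the Lep-Anchor algorithm; the function name `order_and_orient`, the marker tuple layout and the example data are all hypothetical.

```python
from collections import defaultdict


def order_and_orient(markers):
    """Order and orient contigs within a single linkage group.

    markers: iterable of (contig, physical_pos_bp, map_pos_cM) tuples.
    Returns a list of (contig, orientation), where orientation is
    "+", "-" or "?" (undetermined, e.g. only one marker).
    Hypothetical helper for illustration; not part of any published tool.
    """
    by_contig = defaultdict(list)
    for contig, phys, cm in markers:
        by_contig[contig].append((phys, cm))

    placed = []
    for contig, pts in by_contig.items():
        mean_phys = sum(p for p, _ in pts) / len(pts)
        mean_cm = sum(c for _, c in pts) / len(pts)
        # The sign of the covariance between physical and genetic position
        # tells whether the contig runs with or against the map.
        cov = sum((p - mean_phys) * (c - mean_cm) for p, c in pts)
        orientation = "+" if cov > 0 else "-" if cov < 0 else "?"
        placed.append((mean_cm, contig, orientation))

    # Contigs are ordered along the chromosome by mean map position.
    placed.sort()
    return [(contig, orientation) for _, contig, orientation in placed]


# Made-up markers: ctgB's genetic position decreases along the contig,
# so it is placed in reverse orientation; ctgC has a single marker.
markers = [
    ("ctgB", 1000, 12.0), ("ctgB", 50000, 10.5), ("ctgB", 90000, 9.0),
    ("ctgA", 2000, 1.0), ("ctgA", 40000, 2.5),
    ("ctgC", 5000, 20.0),
]
result = order_and_orient(markers)
print(result)
```

A contig with a single marker, or with markers that all share one map position, can be placed but not oriented. This is one reason why denser maps and complementary evidence such as bridging long reads help.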

