Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms

Nadège Guiglielmoni, Karine Van Doninck, Alessandro Derzelle, Jean-François Flot, Antoine Houtain

doi:10.1186/s12859-021-04118-3

Open Access

Abstract

Background

Long-read sequencing is revolutionizing genome assembly: as PacBio and Nanopore technologies become more accessible in technicity and in cost, long-read assemblers flourish and are starting to deliver chromosome-level assemblies. However, these long reads are usually error-prone, making the generation of a haploid reference out of a diploid genome a difficult enterprise. Failure to properly collapse haplotypes results in fragmented and structurally incorrect assemblies and wreaks havoc on orthology inference pipelines, yet this serious issue is rarely acknowledged and dealt with in genomic projects, and an independent, comparative benchmark of the capacity of assemblers and post-processing tools to properly collapse or purge haplotypes is still lacking.

Results

We tested different assembly strategies on the genome of the rotifer Adineta vaga, a non-model organism for which high coverages of both PacBio and Nanopore reads were available. The assemblers we tested (Canu, Flye, NextDenovo, Ra, Raven, Shasta and wtdbg2) exhibited strikingly different behaviors when dealing with highly heterozygous regions, resulting in variable amounts of uncollapsed haplotypes. Filtering reads generally improved haploid assemblies, and we also benchmarked three post-processing tools aimed at detecting and purging uncollapsed haplotypes in long-read assemblies: HaploMerger2, purge_haplotigs and purge_dups.

Conclusions

We provide a thorough evaluation of popular assemblers on a non-model eukaryote genome with variable levels of heterozygosity. Our study highlights several strategies using pre- and post-processing approaches to generate haploid assemblies with high continuity and completeness. This benchmark will help users to improve haploid assemblies of non-model organisms, and evaluate the quality of their own assemblies.

Highlights

Long-read sequencing is revolutionizing genome assembly: as Pacific Biosciences (PacBio) and Nanopore technologies become more accessible in technicity and in cost, long-read assemblers flourish and are starting to deliver chromosome-level assemblies.
In addition to assessing the ability of these seven assemblers to collapse highly heterozygous regions, we investigated whether adding a pre-assembly read-filtering step or removing uncollapsed haplotypes post-assembly improved the assembly.
In the case of a haploid assembly of a diploid genome, all homozygous k-mers (i.e., k-mers that are shared by the two haplotypes) should be represented in the assembly, whereas only half of the heterozygous k-mers (i.e., k-mers that are found in only one haplotype) should be represented. To detect both underpurging and overpurging, we focused in our benchmark on the k-mer completeness of heterozygous k-mers: as we expect only half of them to be present in a haploid assembly, a well-collapsed assembly should exhibit a k-mer completeness of about 50%, whereas a lower value indicates that too many k-mers were lost (overpurging) and a higher value indicates that too many k-mers were retained (underpurging).
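
The heterozygous k-mer completeness described above can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes both haplotype sequences are known exactly, ignores reverse complements and sequencing errors, and the function names and sequences are hypothetical. In practice, k-mers are counted from reads and heterozygous k-mers are identified from their coverage.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def heterozygous_kmer_completeness(hap_a, hap_b, assembly, k):
    """Fraction of heterozygous k-mers (found in exactly one haplotype)
    that are also present in the assembly."""
    # Symmetric difference keeps k-mers unique to one haplotype.
    heterozygous = kmers(hap_a, k) ^ kmers(hap_b, k)
    if not heterozygous:
        raise ValueError("haplotypes share all k-mers; completeness is undefined")
    return len(heterozygous & kmers(assembly, k)) / len(heterozygous)


# Hypothetical toy haplotypes differing by a single SNP (T vs. A).
hap_a = "ACGTTGCA"
hap_b = "ACGATGCA"

collapsed = hap_a                    # one haplotype kept
uncollapsed = hap_a + "N" + hap_b    # both haplotypes kept as separate contigs
overpurged = "TGCA"                  # heterozygous region lost entirely

for name, asm in [("collapsed", collapsed),
                  ("uncollapsed", uncollapsed),
                  ("overpurged", overpurged)]:
    print(name, heterozygous_kmer_completeness(hap_a, hap_b, asm, k=3))
```

With these toy inputs the collapsed assembly scores 0.5, the uncollapsed one 1.0 and the overpurged one 0.0, matching the expectation that a well-collapsed haploid assembly sits near 50%.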

Summary

Introduction

Long-read sequencing is revolutionizing genome assembly: as PacBio and Nanopore technologies become more accessible in technicity and in cost, long-read assemblers flourish and are starting to deliver chromosome-level assemblies. These long reads are usually error-prone, making the generation of a haploid reference out of a diploid genome a difficult enterprise. Nanopore reads, on the other hand, have systematic errors in homopolymeric regions, and Nanopore contigs generally require further correction using Illumina or PacBio reads, in a process called “polishing” [5, 6]. Nanopore reads are currently much longer than PacBio reads, with runs attaining N50s over 100 kilobases (kb) and longest reads spanning over 1 Megabase (Mb) [7, 8].
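
For readers unfamiliar with the N50 statistic mentioned above, here is a minimal sketch of how it is commonly computed from a list of read (or contig) lengths. The function name and the example lengths are illustrative and do not come from the paper.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths in kb.
print(n50([2, 3, 4, 5, 6]))
```

In the example, the total is 20 kb; the two longest reads (6 kb and 5 kb) together reach 11 kb, which is the first point at which half of the total is covered, so the N50 is 5 kb.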

Journal: BMC Bioinformatics	Publication Date: Jun 5, 2021
Citations: 40	License type: open-access
