Abstract
MotivationRapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.ResultsHere we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.Availability and implementationThe source code is written in C and is available at https://github.com/dfguan/purge_dups.Supplementary information Supplementary data are available at Bioinformatics online.
Highlights
The superior and increasing throughput of long-read sequencing technologies, such as from Pacific Biosciences (Pacbio) and Oxford Nanopore Technologies (ONT), is revolutionizing the sequencing of genomes for new species (Phillippy, 2017)
Haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation
Long-read assemblers, such as Falcon (Chin et al, 2016) and Canu (Koren et al, 2017), typically generate haplotype-fused paths of a diploid genome, with Falcon-unzip (Chin et al, 2016) further able to separate the initial assembly into primary contigs and haplotigs
Summary
The superior and increasing throughput of long-read sequencing technologies, such as from Pacific Biosciences (Pacbio) and Oxford Nanopore Technologies (ONT), is revolutionizing the sequencing of genomes for new species (Phillippy, 2017). Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.