Abstract
Motivation: The recent technological advances in genome sequencing techniques have resulted in an exponential increase in the number of sequenced human and non-human genomes. The ever-increasing number of assemblies generated by novel de novo pipelines and strategies demands the development of new software to evaluate assembly quality and completeness. One way to determine the completeness of an assembly is by detecting its Presence–Absence Variations (PAVs) with respect to a reference, where PAVs between two assemblies are defined as the sequences present in one assembly but entirely missing in the other. Beyond assembly error or technology bias, PAVs can also reveal real genome polymorphism, a consequence of species or individual evolution or of horizontal transfer from viruses and bacteria.
Results: We present scanPAV, a pipeline for pairwise assembly comparison to identify and extract sequences present in one assembly but not the other. In this note, we use the GRCh38 reference assembly to assess the completeness of six human genome assemblies produced by various assembly strategies and sequencing technologies, including Illumina short reads, 10x Genomics linked reads, PacBio and Oxford Nanopore long reads, and Bionano optical maps. We also discuss the PAV polymorphism of seven Tasmanian devil whole genome assemblies of normal animal tissues and devil facial tumour 1 (DFT1) and 2 (DFT2) samples, and the identification of bacterial sequences as contamination in some of the tumour assemblies.
Availability and implementation: The pipeline is available under the MIT License at https://github.com/wtsi-hpag/scanPAV.
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
For a complete catalogue of genetic variations, it is important to include Presence–Absence Variations (PAVs) as sources of genetic divergence and diversity, together with SNPs/indels and CNVs.
The identification of PAVs in genome comparisons can be useful to detect real polymorphism or lateral transfer, and can also help assess an assembly's completeness and the strengths and weaknesses of a new technology or a new assembly pipeline. We present both types of PAV analysis using scanPAV: (i) the study of presence–absence sequences for the human reference GRCh38 and six other human assemblies, to assess the technologies used and the assembly strategies; and (ii) PAV detection for seven Tasmanian devil de novo assemblies from normal animal tissues and devil facial tumour samples.
We show the use of scanPAV to assess the completeness of six human genome assemblies compared to the reference GRCh38.
Summary
For a complete catalogue of genetic variations, it is important to include Presence–Absence Variations (PAVs) as sources of genetic divergence and diversity, together with SNPs/indels and CNVs. The identification of PAVs in genome comparisons can be useful to detect real polymorphism or lateral transfer, and can also help assess an assembly's completeness and the strengths and weaknesses of a new technology or a new assembly pipeline. We present both types of PAV analysis using scanPAV: (i) the study of presence–absence sequences for the human reference GRCh38 and six other human assemblies, to assess the technologies used and the assembly strategies; and (ii) PAV detection for seven Tasmanian devil de novo assemblies from normal animal tissues and devil facial tumour samples (Stammnitz, M.R. et al., The origins and vulnerabilities of two transmissible cancers in Tasmanian devils, in press at Cancer Cell).
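The definition of a PAV as sequence present in one assembly but entirely missing from the other can be illustrated with a small interval computation. The sketch below is not scanPAV's implementation; it only assumes that some aligner has already reported which stretches of a contig align to the other assembly, and it returns the stretches left uncovered as candidate "presence" sequences. The function name `uncovered_regions`, the half-open coordinate convention and the `min_len` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Half-open interval [start, end) in 0-based coordinates (an assumed convention).
Interval = Tuple[int, int]


def uncovered_regions(
    length: int, aligned: List[Interval], min_len: int = 1
) -> List[Interval]:
    """Return the stretches of a contig of the given length that no aligned
    interval covers and that are at least min_len bases long."""
    gaps: List[Interval] = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(aligned):
        # Clip each interval to the contig boundaries.
        start, end = max(start, 0), min(end, length)
        if start > cursor and start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing stretch after the last aligned interval.
    if length > cursor and length - cursor >= min_len:
        gaps.append((cursor, length))
    return gaps


if __name__ == "__main__":
    # Hypothetical contig of 1000 bases with three aligned stretches.
    hits = [(0, 200), (150, 400), (700, 900)]
    print(uncovered_regions(1000, hits, min_len=50))
```

Raising `min_len` discards short uncovered stretches, which in practice are more likely to reflect alignment noise than genuinely absent sequence; the appropriate threshold is an assumption left to the user here.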