Abstract
Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already proved invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess the factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performance of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges, and we provide insights into the precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to evaluate future nanopore sequencing datasets.
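As a reminder of how the precision and recall mentioned above are usually defined for a callset compared against a truth set, the following minimal Python sketch computes both metrics, plus the F1 score, from counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical and are not taken from the EViNCe pipeline; how calls are matched to truth variants (breakpoint tolerance, size similarity, genotype concordance) is left out and would depend on the benchmarking tool used.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for an SV callset.

    tp: calls that match a truth-set variant
    fp: calls with no matching truth-set variant
    fn: truth-set variants that no call matches
    Counts are assumed to come from an external comparison step.
    """
    # Precision: fraction of reported calls that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: fraction of true variants that were recovered.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Integrating several aligner and caller combinations typically trades one metric against the other: keeping only calls supported by multiple combinations tends to raise precision, while taking the union of all calls tends to raise recall.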
Highlights
Structural variants (SVs) are defined as DNA rearrangements ≥50 bp and include copy number variants (CNVs; deletions and duplications) as well as insertions, inversions, translocations, and more complex combinations of these described events (Alkan et al, 2011; Sudmant et al, 2015)
Since not all SV types are included in the NA24385 truth callset, we generated synthetic Oxford Nanopore Technologies (ONT) data (∼154 Gbp throughput), hereafter referred to as SI00001, harboring deletions and insertions as well as inversions, duplications, and translocations using the SV simulator VISOR
While short-read sequencing has been considered the gold standard for the majority of sequencing projects for years (Roberts et al, 2021), such data are biased in whole-genome sequencing studies because of the uneven coverage of regions with high or low GC content and the difficulty of mapping short reads in low-complexity regions
Summary
Structural variants (SVs) are defined as DNA rearrangements ≥50 bp and include copy number variants (CNVs; deletions and duplications) as well as insertions, inversions, translocations, and more complex combinations of these events (Alkan et al, 2011; Sudmant et al, 2015). Despite their importance, SVs have been largely understudied compared to SNVs because the dominant short-read sequencing technologies hinder their identification, especially in low-complexity regions, which are known to be SV hotspots (Mills et al, 2011). Long-read sequencing from Pacific Biosciences and Oxford Nanopore Technologies (ONT) has emerged in recent years (Chaisson et al, 2015; Jain et al, 2016) and has proved invaluable in identifying previously intractable DNA sequences (Li and Freudenberg, 2014; Bolognini et al, 2020), closing gaps in human genome assemblies, and unraveling otherwise undetected SVs at population scale (Beyter et al, 2020; Wu et al, 2021).