Abstract
Long-read sequencing enables variant detection in genomic regions that are considered difficult to map by short-read sequencing. To fully exploit the benefits of longer reads, here we present NanoCaller, a deep learning method that detects SNPs using long-range haplotype information, then phases long reads with the called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.
Highlights
Single-nucleotide polymorphisms (SNPs) and small insertions/deletions are two common types of genetic variants in human genomes
For SNP calling in NanoCaller, candidate SNP sites are selected according to the specified thresholds for minimum coverage and minimum frequency of alternative alleles
In the “Results” section, we present the performance of five NanoCaller models: ONT-HG001, ONT-HG002, CCS-HG001, CCS-HG002, and CLR-HG002, where ONT stands for Oxford Nanopore Technology, CCS for circular consensus sequencing, and CLR for Continuous Long Read sequencing; the first four have both SNP and indel deep learning models, whereas CLR-HG002 consists of only a SNP model
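The candidate-site selection mentioned in the highlights above (minimum coverage and minimum alternative-allele frequency) can be illustrated with a minimal sketch. This is not NanoCaller's code: the function name, the pileup representation (a per-position `Counter` of observed bases), the choice to test only the most frequent non-reference base, and the threshold values are all assumptions made for illustration.

```python
from collections import Counter


def select_candidate_sites(pileup, reference, min_coverage=4, min_alt_freq=0.15):
    """Return (position, ref_base, alt_base, alt_freq) for sites passing both thresholds.

    pileup:    dict mapping position -> Counter of base counts (hypothetical format)
    reference: dict mapping position -> reference base
    The default thresholds are illustrative values, not NanoCaller defaults.
    """
    candidates = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_coverage:
            continue  # too few reads cover this position
        ref_base = reference[pos]
        alt_counts = {b: n for b, n in counts.items() if b != ref_base}
        if not alt_counts:
            continue  # every read agrees with the reference
        # Assumption: only the most frequent non-reference base is tested.
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        alt_freq = alt_n / depth
        if alt_freq >= min_alt_freq:
            candidates.append((pos, ref_base, alt_base, alt_freq))
    return candidates


# Toy input: three positions with different depth and alternative-allele support.
pileup = {
    100: Counter({"A": 8, "G": 4}),   # well covered, strong alternative allele
    101: Counter({"C": 2, "T": 1}),   # below the coverage threshold
    102: Counter({"T": 19, "C": 1}),  # alternative allele too rare
}
reference = {100: "A", 101: "C", 102: "T"}
candidates = select_candidate_sites(pileup, reference)
```

With these toy counts, only position 100 passes both filters; position 101 fails on coverage and position 102 on alternative-allele frequency. In a real pipeline the thresholds would be tuned to the sequencing technology, since error rates differ between ONT, CCS, and CLR reads.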
Summary
Single-nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) are two common types of genetic variants in human genomes. Short-read variant calling methods, such as GATK [1] and FreeBayes [2], achieve excellent performance in detecting SNPs and small indels in genomic regions marked as traditional “high-confidence regions” in various benchmarking tests [3,4,5]. Since these methods were developed for short-read sequencing data with low per-base error