NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

Qian Liu, Li Fang, Mian Umair Ahsan, Kai Wang

doi:10.1186/s13059-021-02472-2


Open Access

https://doi.org/10.1186/s13059-021-02472-2


Abstract

Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, here we present a deep learning method NanoCaller, which detects SNPs using long-range haplotype information, then phases long reads with called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.

Highlights

Single-nucleotide polymorphisms (SNPs) and small insertions/deletions are two common types of genetic variants in human genomes
For SNP calling in NanoCaller, candidate SNP sites are selected according to the specified thresholds for minimum coverage and minimum frequency of alternative alleles
In the “Results” section, we present the performance of five NanoCaller models: ONT-HG001, ONT-HG002, CCS-HG001, CCS-HG002, and CLR-HG002, where ONT stands for Oxford Nanopore Technology, CCS for circular consensus sequencing, and CLR for continuous long-read sequencing; the first four each comprise both a SNP and an indel deep learning model, whereas CLR-HG002 consists of only a SNP model
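
The candidate-site selection mentioned in the highlights above can be illustrated with a minimal Python sketch. It is not NanoCaller's implementation: the function name select_candidate_sites, the dictionary-based toy pileup, and the threshold values are all assumptions made for illustration, and the real tool works on aligned reads rather than on strings of bases. The sketch only shows the idea of keeping a site when read depth and alternative-allele frequency both meet user-specified minimums.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical thresholds for illustration only; not NanoCaller's actual defaults.
MIN_COVERAGE = 8
MIN_ALT_FREQ = 0.15


def select_candidate_sites(
    pileup: Dict[int, str],
    reference: Dict[int, str],
    min_coverage: int = MIN_COVERAGE,
    min_alt_freq: float = MIN_ALT_FREQ,
) -> List[Tuple[int, str, float]]:
    """Return (position, alt_base, alt_frequency) for sites passing both thresholds.

    pileup maps a position to the string of bases observed in reads covering it;
    reference maps a position to the reference base at that position.
    """
    candidates = []
    for pos in sorted(pileup):
        bases = pileup[pos]
        depth = len(bases)
        if depth < min_coverage:
            continue  # too few reads cover this site
        # Count every observed base that differs from the reference.
        alt_counts = Counter(b for b in bases if b != reference[pos])
        if not alt_counts:
            continue  # all reads agree with the reference
        alt_base, alt_count = alt_counts.most_common(1)[0]
        alt_freq = alt_count / depth
        if alt_freq >= min_alt_freq:
            candidates.append((pos, alt_base, alt_freq))
    return candidates


# Toy data: four positions with different depths and allele mixes.
toy_pileup = {
    100: "AAAAAAAAAA",  # depth 10, no alternative allele
    101: "CCCCCTTTTT",  # depth 10, alternative T in half of the reads
    102: "GGGA",        # depth 4, below the coverage threshold
    103: "AAAAAAAAAG",  # depth 10, alternative frequency below the threshold
}
toy_reference = {100: "A", 101: "C", 102: "G", 103: "A"}

result = select_candidate_sites(toy_pileup, toy_reference)
print(result)
```

With the assumed thresholds, only position 101 in the toy data survives both filters. Lowering either threshold admits more sites, which trades a larger candidate set for fewer missed variants.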

Summary

Introduction

Single-nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) are two common types of genetic variants in human genomes. Variant calling methods for short reads, such as GATK [1] and FreeBayes [2], achieved excellent performance in detecting SNPs and small indels in genomic regions traditionally marked as “high-confidence regions” in various benchmarking tests [3,4,5]. Since these methods were developed for short-read sequencing data with low per-base error

Methods

Results

Discussion

Conclusion