A multi-task convolutional deep neural network for variant calling in single molecule sequencing

Ruibang Luo, Michael C Schatz, Fritz J Sedlazeck, Tak-Wah Lam

doi:10.1038/s41467-019-09025-z

Abstract

The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5–15%. To meet this need, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele, and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves F1-scores of 99.67%, 95.78%, and 90.53% on 1KP common variants, and 98.65%, 92.57%, and 87.26% for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows that Clairvoyante is sample agnostic, and it finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available as open source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize, and visualize the model.
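
For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition; the function name `f1_score` and the counts in the usage line are hypothetical and are not taken from the paper or from the Clairvoyante code base.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from raw variant-call counts.

    tp: true positive calls, fp: false positive calls, fn: missed true variants.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
example = f1_score(tp=90, fp=10, fn=10)
print(f"{example:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a caller cannot reach a high F1-score by trading away recall for precision or the reverse.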

Highlights

The accurate identification of DNA sequence variants is an important but challenging task in genomics.
We carried out unit tests, described in the Supplementary Note (Unit tests), to answer the following questions: (1) What are the characteristics of false positives and false negatives? (2) Can a lower learning rate and longer training provide better performance? (3) Can a model trained on truth variants from multiple samples provide better performance? (4) Can higher input data quality improve the variant calling performance? (5) Is the current network design sufficient in terms of learning capacity?
On a high-performance desktop graphics processing unit (GPU), the GTX 1080 Ti, each epoch takes 170 s, so training a model in the fast training mode finishes in about 5 h.
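
The two timing figures above imply a rough epoch count for the fast training mode. The sketch below only works through that arithmetic; the resulting number is an inference from "170 s per epoch" and "about 5 h", not a value stated in the text, and the variable names are illustrative.

```python
SECONDS_PER_EPOCH = 170      # stated in the text (GTX 1080 Ti)
TOTAL_TRAINING_HOURS = 5     # "about 5 h", stated in the text

total_seconds = TOTAL_TRAINING_HOURS * 3600

# Inferred, not stated: how many epochs fit into the quoted training time.
implied_epochs = total_seconds / SECONDS_PER_EPOCH
print(round(implied_epochs))
```

Since the 5 h figure is approximate, the implied count is only a ballpark of roughly one hundred epochs.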

Summary

Introduction

The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5–15%. To meet this need, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele, and indel length from aligned reads. Previous work has intensively studied the data characteristics that might contribute to higher variant calling performance, including the properties of the sequencing instrument[2], the quality of the preceding sequence aligners[3], and the alignability of the genome reference[4]. Experiments calling variants in multiple human genomes, both at common variant sites and genome-wide, show that Clairvoyante is on par with GATK UnifiedGenotyper on Illumina data and substantially outperforms Nanopolish and DeepVariant on PacBio and ONT data in both accuracy and speed.

Methods

Results

Conclusion