Abstract

Background: Modern Next-Generation and Third-Generation Sequencing methods such as Illumina and PacBio Circular Consensus Sequencing platforms provide accurate sequencing data. Parallel developments in Deep Learning have enabled the application of Deep Neural Networks to variant calling, surpassing the accuracy of classical approaches in many settings. DeepVariant, arguably the most popular among such methods, transforms the problem of variant calling into one of image recognition where a Deep Neural Network analyzes sequencing data that is formatted as images, achieving high accuracy. In this paper, we explore an alternative approach to designing Deep Neural Networks for variant calling, where we use meticulously designed Deep Neural Network architectures and customized variant inference functions that account for the underlying nature of sequencing data instead of converting the problem to one of image recognition.

Results: Results from 27 whole-genome variant calling experiments spanning Illumina, PacBio and hybrid Illumina-PacBio settings suggest that our method allows vastly smaller Deep Neural Networks to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. For example, our method reduces the number of indel call errors by up to 18%, 55% and 65% for Illumina, PacBio and hybrid Illumina-PacBio variant calling respectively, compared to a similarly trained DeepVariant pipeline. In these cases, our models are between 7 and 14 times smaller.

Conclusions: We believe that the improved accuracy and problem-specific customization of our models will enable more accurate pipelines and further method development in the field. HELLO is available at https://github.com/anands-repo/hello

Highlights

  • Modern Next-Generation and Third-Generation Sequencing methods such as Illumina and PacBio Circular Consensus Sequencing platforms provide accurate sequencing data

  • Tool versions and training hardware: in addition to HELLO, we performed experiments using DeepVariant version 1.1 [17] and the Genome Analysis ToolKit (GATK) version 4.2.0.0 [18]. Both HELLO and DeepVariant were trained using the same datasets in all our experiments

  • GATK was run for Illumina and PacBio datasets using the Docker image downloaded from Docker Hub


Summary

Introduction

Modern Next-Generation and Third-Generation Sequencing methods such as Illumina and PacBio Circular Consensus Sequencing (CCS) platforms provide accurate sequencing data. DeepVariant, arguably the most popular Deep Neural Network-based variant caller, transforms the problem of variant calling into one of image recognition, where a Deep Neural Network analyzes sequencing data that is formatted as images, achieving high accuracy. PacBio CCS reads are not affected as severely by mapping and mappability issues because of their longer length. While these reads have low average error rates, the errors are of indel type and are highly context specific [8], so it may be hard to call certain types of indel variants even when there is sufficient read coverage. Small variant calling using data from these two sequencing technologies provides high-accuracy results. Since the two sequencing platforms have different error profiles, it is beneficial to combine data from both to perform hybrid variant calling, in which each platform compensates for the other's weaknesses.

