Abstract

Nanopore sequencing is promising because of its long read length and high speed. During sequencing, a strand of DNA/RNA passes through a biological nanopore, which causes the current in the pore to fluctuate. During basecalling, these context-dependent current measurements are translated into the base sequence of the DNA/RNA strand. Accurate and fast basecalling is vital for downstream analyses such as genome assembly and the detection of single-nucleotide polymorphisms and genomic structural variants. However, owing to the various changes in DNA/RNA molecules, noise during sequencing, and limitations of basecalling methods, accurate basecalling remains a challenge. In this paper, we propose Causalcall, which uses an end-to-end temporal convolution-based deep learning model for accurate and fast nanopore basecalling. Built on a temporal convolutional network (TCN) and a connectionist temporal classification (CTC) decoder, Causalcall directly identifies base sequences of varying lengths from long time series of current measurements. In contrast to basecalling models that use recurrent neural networks (RNNs), the convolution-based model of Causalcall can speed up basecalling through matrix computation. Experiments on multiple species demonstrate the potential of the TCN-based model to improve basecalling accuracy and speed compared with an RNN-based model. In addition, experiments on genome assembly indicate the utility of Causalcall in reference-based genome assembly.

Highlights

  • Nanopore sequencing is a novel third-generation sequencing technology (Leggett and Clark, 2017), focusing on high-throughput, single-molecule, real-time, long-read, and direct DNA/RNA sequencing

  • We showed the potential of the temporal convolutional network (TCN)-based model to improve the accuracy and speed of basecalling compared with a recurrent neural network (RNN)-based model

  • Causalcall has the lowest insertion rate on phage and human. These results show that Causalcall effectively learned the base-related patterns hidden in the current measurements. They indicate that the TCN-based model of Causalcall has greater potential for improving the accuracy of basecalling than the RNN-based model of Chiron
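
To illustrate what "temporal convolution" means in the highlights above, here is a minimal sketch of a causal, dilated 1D convolution, the basic building block of a TCN. It is not the Causalcall implementation: the function name `causal_dilated_conv1d`, the kernel values, and the dilation are all hypothetical, and a real TCN stacks many such layers with learned weights and nonlinearities.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """Causal dilated convolution (illustrative only, not Causalcall's code).

    y[t] = sum_k kernel[k] * signal[t - k * dilation]
    Samples before the start of the signal are treated as zero, so the
    output at time t never depends on samples later than t.
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += weight * signal[idx]
        output.append(acc)
    return output


# Toy "current" trace and a hypothetical two-tap kernel.
trace = [1.0, 2.0, 3.0, 4.0]
smoothed = causal_dilated_conv1d(trace, kernel=[1.0, 1.0], dilation=2)
```

Every output position depends only on a fixed window of past inputs, so all positions can be computed independently of one another. An RNN, by contrast, must process time steps one after another, which is the contrast the highlights draw between the two model families.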


Summary

Introduction

Nanopore sequencing is a novel third-generation sequencing technology (Leggett and Clark, 2017), focusing on high-throughput, single-molecule, real-time, long-read, and direct DNA/RNA sequencing. It has rapidly developed in recent years and is used in research in a range of biological fields, such as bacterial/viral/plant/human genome assembly and DNA methylation detection (Loman et al, 2015; Quick et al, 2016; Xiao et al, 2017; Michael et al, 2018; Jain et al, 2018; Xiao et al, 2018; Liu et al, 2019). The current changes indicate the k-mers that pass through the nanopores. Such current measurements can be used to identify the base sequences of the DNA/RNA strands.
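
The abstract names a connectionist temporal classification (CTC) decoder as the step that turns per-time-step predictions into a base sequence of variable length. The sketch below shows the simplest form of CTC decoding, greedy (best-path) decoding, on made-up probabilities. It is an assumption for illustration only: the source does not say which decoding strategy Causalcall uses, and the names `ctc_greedy_decode`, `peaked`, `ALPHABET`, and `BLANK` are hypothetical.

```python
BLANK = "-"
ALPHABET = ["A", "C", "G", "T", BLANK]  # four bases plus the CTC blank


def ctc_greedy_decode(frame_probs, alphabet=ALPHABET, blank=BLANK):
    """Greedy CTC decoding: pick the most likely symbol per frame,
    merge consecutive repeats, then drop blanks."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    decoded = []
    previous = None
    for symbol in best_path:
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


def peaked(index, size=len(ALPHABET)):
    """Toy probability vector with most of its mass on one symbol."""
    probs = [0.1] * size
    probs[index] = 0.6
    return probs


# Frame-wise best symbols: A A - A C C - G
frames = [peaked(i) for i in (0, 0, 4, 0, 1, 1, 4, 2)]
bases = ctc_greedy_decode(frames)  # "AACG"
```

The blank symbol is what lets the decoder emit the same base twice in a row: the two runs of `A` are separated by a blank, so they decode to `AA` rather than being merged into a single `A`. This is how a fixed-rate stream of predictions can yield base sequences of varying lengths.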
