Abstract

Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing to produce very long reads, with an expected impact on genomics. However, the nanopore sequencing reads are susceptible to a fairly high error rate owing to the difficulty in identifying the DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing over the past years, it is still challenging to correct the sequences after applying the basecalling procedure. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm to correct the basecalling errors introduced by current basecallers provided by default. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolution neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from the raw electrical signals and the basecalled sequences from the basecallers. Our results showed NanoReviser, as a post-basecalling reviser, significantly improving the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. The performance of NanoReviser was found to be better than those of all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information to train our module. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and specifically reduced the error rate by over 10% for the regions of the sequence rich in methylated bases. To the best of our knowledge, NanoReviser is the first post-processing tool after basecalling to accurately correct the nanopore sequences without the time-consuming procedure of building the consensus sequence. The NanoReviser package is freely available at https://github.com/pkubioinformatics/NanoReviser.

Highlights

  • Oxford Nanopore Technologies (ONT) has recently shown the ability to use nanopores to achieve long-read sequencing, which has a tremendous impact on genomics (Brown and Clarke, 2016)

  • It is possible to produce long sequencing reads when using the MinION device with large arrays of nanopores, nanopore sequencing reads do show a fairly high error rate which could be introduced by abnormal fluctuations of the current or by the improper translation from raw electrical signals into DNA sequence, which has been sought to be corrected by the basecalling procedure developed by ONT (Rang et al, 2018)

  • DeepNano (Boža et al, 2017), the first basecaller based on deep learning and one applying recurrent neural networks (RNNs) to predict DNA bases, showed that a deep learning model could make a crucial contribution to the basecalling performance

Read more

Summary

Introduction

Oxford Nanopore Technologies (ONT) has recently shown the ability to use nanopores to achieve long-read sequencing, which has a tremendous impact on genomics (Brown and Clarke, 2016). BasecRAWller (Stoiber and James, 2017), designed to use the raw outputs of nanopores with unidirectional RNNs to basecall DNA bases, was shown to be able to process the electrical signals in a streaming fashion but at the expense of basecalling accuracy. Chiron (Teng et al, 2018), designed to adopt a combination of convolution neural network (CNN) and RNN methodologies together with a connectionist temporal classification decoder (CTC decoder) that was successfully used in machine translation, achieved a high basecalling accuracy. Despite this high accuracy, Chiron, required a high much computer time to get these accurate results. The only tool proposed previously for MinION sequencing error correction, nanoCORR (James, 2016), requires additional next-generation sequencing short read sequences, and is not a practical tool for error correction in its true sense

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call