Abstract
Background
Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area.

Results
Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors.

Conclusions
MiniScrub robustly improves the read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
Highlights
Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification
We developed a method called MiniScrub that performs de novo long read scrubbing using the combined power of fast approximate read-to-read overlapping, deep Convolutional Neural Networks, and a novel method for pileup image generation
MiniScrub uses minimizers to quickly overlap long reads, encodes these overlaps into pileup images, and uses a convolutional neural network to predict parts of reads below a certain quality threshold that should be removed
Summary
Method overview
The three steps involved in MiniScrub are illustrated in Fig. 1 and explained in further detail in the subsections below. The first step is training a CNN model, a step that only needs to be done once, in order to learn the error profile of a given sequencing technology and base caller. Model training starts with building a training set of reads from a known reference genome. These reads are mapped to the reference genome using GraphMap [26]. For each read segment we calculate its percent identity, i.e. the percentage of bases in the read that match the reference, as a label. We use a modified version of MiniMap2 [22] to obtain read-to-read overlaps between all reads in the training set (see below for details), and embed relevant information (minimizers matched, distance between minimizers, and base quality scores) into Red-Green-Blue (RGB) pixels to form “pileup” images. One image is generated for each read, and is broken into the same short segments as above.
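As a concrete illustration, the segment labeling and the packing of overlap features into RGB pixels described above can be sketched as follows. This is a minimal sketch, not MiniScrub's implementation: the segment length, channel assignments, and scaling constants (`max_gap`, `max_qual`) are illustrative assumptions.

```python
def segment_percent_identity(match_flags, segment_len=48):
    """Label each fixed-length read segment with its percent identity.

    match_flags: list with 1 where the read base matches the reference
    (from the GraphMap alignment), else 0. The segment length of 48 is
    a hypothetical value chosen for this sketch.
    """
    n_segments = len(match_flags) // segment_len
    labels = []
    for i in range(n_segments):
        seg = match_flags[i * segment_len:(i + 1) * segment_len]
        labels.append(100.0 * sum(seg) / segment_len)
    return labels


def encode_pixel(minimizer_matched, gap_to_next_minimizer, base_quality,
                 max_gap=255, max_qual=40):
    """Pack the three per-position overlap features into one RGB pixel.

    Channel assignment here is an assumption: R flags a matched
    minimizer, G holds the (clamped) distance to the next minimizer,
    and B scales the Phred base quality into the 0-255 range.
    """
    r = 255 if minimizer_matched else 0
    g = min(gap_to_next_minimizer, max_gap)
    b = int(255 * min(base_quality, max_qual) / max_qual)
    return (r, g, b)
```

Stacking one such pixel row per overlapping read yields the pileup image for a read, which the CNN then consumes to predict the per-segment percent-identity labels.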