Abstract

Nucleotides ratcheted through the biomolecular pores of nanopore sequencers generate raw picoampere currents, which are segmented into step-current level signals representing the nucleotide sequence. These ‘squiggles’ are a noisy, distorted representation of the underlying true stepped current levels due to experimental and algorithmic factors. We were interested in developing a simulation model to support a white-box approach to identifying common distortions, rather than relying on the black-box neural network techniques commonly used for basecalling nanopore signals. Dynamic time warped-space averaging (DTWA) techniques can generate a consensus from multiple noisy signals without introducing the key feature distortions that occur with standard averaging. As a preprocessing tool, DTWA could provide cleaner and more accurate current signals for direct RNA or DNA analysis tools. However, DTWA approaches need modification to take advantage of a priori knowledge of a common, underlying gold-standard RNA / DNA sequence. Using experimental data, we derive a simulation model that provides known squiggle distortion signals to assist in validating the performance of analysis tools such as DTWA. Simulation models were evaluated by comparing mocked and experimental squiggle characteristics from one Enolase mRNA squiggle group produced by an Oxford MinION nanopore sequencer, and cross-validated using other Enolase, Sequin R1_71_1 and Sequin R2_55_3 mRNA studies. New techniques identified high inserted-base but low deleted-base rates, generating consistent ×1.7 ratios of squiggle events to called bases. Similar probability density functions (PDFs) and cumulative distribution functions (CDFs) were found across all studies. Experimental PDFs were not the normal distributions that would be expected if squiggle distortion arose from segmentation algorithm artefacts, or from individual nucleotides randomly interacting with individual nanopores. Matching experimental and mocked CDFs required the assumption that there are unique features associated with individual raw-current data streams. Z-normalized signal-to-noise ratios suggest that intrinsic sensor limitations are responsible for half of the DTW differences between gold-standard and noisy squiggles.

Highlights

  • Raw current signals are recorded as each DNA or RNA molecule is ratcheted one nucleotide at a time by a motor protein through a nanopore biomolecule in a sensor array on a device such as the Oxford Nanopore MinION sequencer [1, 2]

  • Several simulation models were proposed for one temporal grouping, and initially accepted based on whether their mocked cumulative distribution function, M-CDF, matched the experimental Enolase E-CDF of that group

  • We investigated simulation models to characterize the typical distortions introduced by the production of raw current signals and their segmentation into current-step-level squiggles, given the stochastic nature of the motor protein, current measurement


Summary

Introduction

Raw current signals (nanostreams) are recorded as each DNA or RNA molecule is ratcheted one nucleotide at a time by a motor protein through a nanopore biomolecule in a sensor array on a device such as the Oxford Nanopore MinION sequencer [1, 2]. Raw signal segmentation algorithms generate squiggles that are a noisy and distorted representation of the underlying true stepped current levels due to many factors. These include 1) the uneven production of current steps per unit time due to the stochastic nature of the motor protein driving the steps, 2) homopolymerism, where long chains of multiple identical bases are misinterpreted, 3) in-silico chimeric reads, 4) experimental sensor errors and noise generated when measuring the current by steric configuration of the nucleotides, and 5) segmentation artefacts. These errors can be represented as a certain probability of insertions into, and deletions from, the squiggle. Insertions represent signals falsely interpreted as the presence of additional bases; deletions represent the failure of the passage of a base through a nanopore to generate a raw signal that can be segmented into a squiggle event.
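
The insertion and deletion description above can be illustrated with a minimal sketch. This is not the simulation model derived in the paper: the function name `mock_squiggle`, the parameters `p_insert`, `p_delete` and `noise_sd`, and the choice of independent per-base probabilities with optional Gaussian noise are all assumptions made for illustration only. In particular, the abstract reports that the experimental PDFs were not normal, so the Gaussian term is a placeholder rather than a claim about the real noise.

```python
import random


def mock_squiggle(gold_levels, p_insert, p_delete, noise_sd=0.0, seed=None):
    """Distort gold-standard current levels into a mocked squiggle.

    Hypothetical illustration: each base is deleted with probability
    p_delete; each retained base emits one event and then, repeatedly
    with probability p_insert, one further (inserted) event.
    """
    rng = random.Random(seed)
    events = []
    for level in gold_levels:
        # Deletion: the base passes without producing a segmentable event.
        if rng.random() < p_delete:
            continue
        # The genuine event for this base (Gaussian noise is an assumption).
        events.append(level + rng.gauss(0.0, noise_sd))
        # Insertions: extra events falsely read as additional bases.
        while rng.random() < p_insert:
            events.append(level + rng.gauss(0.0, noise_sd))
    return events


if __name__ == "__main__":
    # Arbitrary stand-in levels; real gold-standard levels would come
    # from the known reference sequence.
    gold = [float(i % 7) for i in range(20000)]
    mocked = mock_squiggle(gold, p_insert=0.4, p_delete=0.02, seed=1)
    print(len(mocked) / len(gold))
```

Under these assumptions the expected event-to-base ratio is (1 - p_delete) / (1 - p_insert), so a high insertion probability combined with a low deletion probability produces a ratio well above 1, in the same direction as the ×1.7 ratio reported in the abstract. The specific probabilities used here are arbitrary and are not taken from the paper.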
