Abstract

Third-generation sequencing offers some advantages over its next-generation sequencing predecessor, but with the caveat of a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster error-rich sequences, error type and frequency need to be accounted for. Levenshtein distance is a well-established mathematical algorithm for measuring the edit distance between words, and it can weight insertions, deletions and substitutions individually. However, there are drawbacks to using Levenshtein distance in a biological context, and hence it has rarely been used for this purpose. Therefore, this work describes novel modifications to the Levenshtein distance algorithm that optimize it for clustering error-rich biological sequencing data. This research has also led to new observations and characterization of how Levenshtein distance behaves under the novel modifications. The new computational tool developed during this work, Third-Generation Optimized Levenshtein distance (3GOLD), is more accurate than classic Levenshtein distance, Sequence-Levenshtein distance, Starcode, CD-HIT-EST and DNACLUST for clustering both simulated and biological third-generation sequenced reads produced by Pacific Biosciences and Oxford Nanopore Technologies platforms. Furthermore, 3GOLD is appropriate both for datasets with unknown cluster centroids, such as those generated with unique molecular identifiers, and for datasets with known centroids, such as barcoded datasets. A strength of this approach is its high accuracy in resolving small clusters and in reducing the number of singletons.--Author's abstract
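
To make concrete what weighting insertions, deletions and substitutions separately means, the following is a minimal sketch of the classic weighted Levenshtein distance in Python. It is not the 3GOLD algorithm, whose modifications are not specified in this abstract. The function name `weighted_levenshtein`, the parameter names and the example weights are assumptions chosen purely for illustration.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Classic weighted edit distance needed to turn sequence a into sequence b.

    Illustrative sketch only: the cost parameters are hypothetical and do not
    come from 3GOLD or any other published tool.
    """
    # prev[j] holds the cost of turning the empty prefix of a into b[:j],
    # which takes j insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                             # delete ca
                curr[j - 1] + ins_cost,                         # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# With unit costs this is the ordinary Levenshtein distance.
print(weighted_levenshtein("kitten", "sitting"))

# Hypothetical weights that make a deletion cheaper than a substitution,
# as one might choose for a platform assumed to be deletion-prone.
print(weighted_levenshtein("ACGT", "ACT", del_cost=0.5))
```

Lowering the cost of an error type that a sequencing platform produces frequently makes reads that differ only by such errors appear closer together, which is the general motivation for error-aware weighting when clustering.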

