MicroTaboo: a general and practical solution to the k-disjoint problem

Mohammed Al-Jaff, Manfred Grabherr, Eric Sandström

doi:10.1186/s12859-017-1644-6

Abstract

Background: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can be achieved efficiently by k-mer (k consecutive nucleotides) counting. However, several areas would benefit from a more stringent definition of “unique”, requiring that these sub-sequences of length W differ by more than k mismatches (i.e. a Hamming distance greater than k) from any other sub-sequence, which we term the k-disjoint problem. Examples include finding sequences unique to a pathogen for probe-based infection diagnostics; reducing off-target hits for re-sequencing or genome editing; detecting sequence (e.g. phage or viral) insertions; and detecting multiple substitution mutations. Since both sensitivity and specificity are critical, an exhaustive yet efficient solution is desirable.

Results: We present microTaboo, a method that allows for efficient and extensive sequence mining of unique (k-disjoint) sequences of up to 100 nucleotides in length. On a number of simulated and real data sets ranging from microbe- to mammalian-size genomes, we show that microTaboo is able to efficiently find all sub-sequences of a specified length W that do not occur within a threshold of k mismatches in any other sub-sequence. We illustrate several practical applications of microTaboo, including point substitution detection, sequence insertion detection, padlock probe target search, and candidate CRISPR target mining.

Conclusions: microTaboo implements a solution to the k-disjoint problem in an alignment- and assembly-free manner. microTaboo is available for Windows, Mac OS X, and Linux, running Java 7 and higher, under the GNU GPLv3 license, at: https://MohammedAlJaff.github.io/microTaboo
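To make the k-disjoint definition concrete, the following is a minimal brute-force sketch in Python. It is not microTaboo's algorithm (the tool itself is written in Java and is designed to be far more efficient); it only restates the definition as code. The function names `hamming` and `k_disjoint_windows`, and the split into a `query` and a `reference` sequence, are assumptions made for illustration. Reverse-complement matches are not considered here.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def k_disjoint_windows(query: str, reference: str, w: int, k: int) -> List[int]:
    """Return start positions of length-w windows in `query` whose Hamming
    distance to every length-w window in `reference` is greater than k.

    Hypothetical helper for illustration only: quadratic in sequence length,
    so suitable for toy inputs, not for genomes.
    """
    ref_windows = [reference[j:j + w] for j in range(len(reference) - w + 1)]
    hits = []
    for i in range(len(query) - w + 1):
        window = query[i:i + w]
        # Keep the window only if no reference window is within k mismatches.
        if all(hamming(window, r) > k for r in ref_windows):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Toy example: windows of length 4, allowing up to 1 mismatch.
    print(k_disjoint_windows("AAAATTTT", "AAAACCCC", w=4, k=1))
```

With k = 0 this reduces to ordinary exact-match uniqueness; raising k makes the criterion stricter, so fewer windows qualify.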

Highlights

A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can be achieved efficiently by k-mer (k consecutive nucleotides) counting.
Several areas in bioinformatics and biomedical research benefit from identifying short sub-sequences among a pool of reference sequences that are as unique as possible, i.e. the most similar sub-sequence differs by a given number of mismatches or more.
Point-mutation detection: We simulated a run aimed at detecting substitution mutations between closely related strains, such as those arising when drug resistance is gained, by randomly generating substitutions in genomes of different sizes: the Tobacco leaf curl Japan virus [16] (TbLCJV, ~2.5 kbp), E. coli (~5.5 Mbp), Saccharomyces cerevisiae [17] (12 Mbp), and Candida albicans [18] (~14.3 Mbp).
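The random-substitution step mentioned in the last highlight could be sketched roughly as follows. The text here does not say how positions or replacement bases were drawn, so uniform sampling of distinct positions and a uniform choice among the other three bases are assumptions, as is the function name `random_substitutions`.

```python
import random


def random_substitutions(seq: str, n: int, rng: random.Random) -> str:
    """Return a copy of `seq` with n distinct positions substituted.

    Assumption for illustration: positions are sampled uniformly without
    replacement, and each replacement base is drawn uniformly from the
    bases that differ from the original one.
    """
    bases = "ACGT"
    mutated = list(seq)
    for pos in rng.sample(range(len(seq)), n):
        mutated[pos] = rng.choice([b for b in bases if b != seq[pos]])
    return "".join(mutated)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    original = "ACGT" * 10
    mutant = random_substitutions(original, 5, rng)
    # Every substitution changes the base, so exactly 5 positions differ.
    print(sum(a != b for a, b in zip(original, mutant)))
```

Because each substituted position always receives a different base, the Hamming distance between the original and the mutated sequence equals the requested number of substitutions.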

Summary

Results

We present microTaboo, a method that allows for efficient and extensive sequence mining of unique (k-disjoint) sequences of up to 100 nucleotides in length. On a number of simulated and real data sets ranging from microbe- to mammalian-size genomes, we show that microTaboo is able to efficiently find all sub-sequences of a specified length W that do not occur within a threshold of k mismatches in any other sub-sequence. We illustrate several practical applications of microTaboo, including point substitution detection, sequence insertion detection, padlock probe target search, and candidate CRISPR target mining.

Conclusions: microTaboo implements a solution to the k-disjoint problem in an alignment- and assembly-free manner. microTaboo is available for Windows, Mac OS X, and Linux, running Java 7 and higher, under the GNU GPLv3 license, at: https://MohammedAlJaff.github.io/microTaboo

Journal: BMC Bioinformatics
Publication Date: May 2, 2017
License type: open-access
