Absent words and the (dis)similarity analysis of DNA sequences: an experimental study.

Mohammad Saifur Rahman, Maxime Crochemore, Tanver Athar, Ali Alatabbi, M Sohel Rahman

doi:10.1186/s13104-016-1972-z

Open Access

https://doi.org/10.1186/s13104-016-1972-z

Abstract

Background: An absent word with respect to a sequence is a word that does not occur in the sequence as a factor; an absent word is minimal if all of its proper factors do occur in that sequence. In this paper we explore the idea of using minimal absent words (MAW) to compute the distance between two biological sequences. The motivation and rationale for our work come from the potential advantage of extracting as little information as possible from large genomic sequences in order to compare sequences in an alignment-free manner.

Findings: We report an experimental study on the use of absent words as a distance measure among biological sequences. We provide recommendations on the best index to use based on our analysis. In particular, our analysis reveals that the best performers are: the length weighted index of relative absent word sets, the length weighted index of the symmetric difference of the MAW sets, and the Jaccard distance between the MAW sets. We also found that during the computation of the absent words, the reverse complements of the sequences should also be considered.

Conclusion: The use of MAW to compute the distance between two biological sequences has a potential advantage over alignment-based methods. It is expected that this potential advantage will encourage researchers and practitioners to use this as a (dis)similarity measure in the context of sequence comparison and phylogeny reconstruction. Therefore, we present here a comparison among different possible models and indexes and pave the way for biologists and researchers to choose an appropriate model for such comparisons.

Electronic supplementary material: The online version of this article (doi:10.1186/s13104-016-1972-z) contains supplementary material, which is available to authorized users.
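To make the definitions in the abstract concrete, the following is a minimal brute-force sketch in Python, not the authors' implementation. It enumerates minimal absent words up to a length cap and computes two of the indexes named above on the resulting sets. Several choices are assumptions made for illustration: the function names, the `max_len` cap, the handling of reverse complements by pooling the factors of both strands, and the weight 1/|w|^2 in the length weighted index (the abstract does not state the formula). Practical tools use linear-time suffix-array methods rather than this enumeration.

```python
def factors(s, max_len):
    """All factors (substrings) of s of length 0..max_len, including the empty word."""
    return {s[i:i + k] for k in range(max_len + 1) for i in range(len(s) - k + 1)}


def reverse_complement(s):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def minimal_absent_words(s, alphabet="ACGT", max_len=8, with_rc=False):
    """Minimal absent words of s with length <= max_len (brute force).

    A word w is a minimal absent word if w is not a factor of s while both
    w[:-1] and w[1:] are factors. With with_rc=True the factors of the
    reverse complement are pooled with those of s (an assumed convention).
    """
    facs = factors(s, max_len)
    if with_rc:
        facs |= factors(reverse_complement(s), max_len)
    maws = set()
    for u in facs:                      # u = w[:-1] is a factor by construction
        if len(u) == max_len:
            continue                    # extending u would exceed the length cap
        for c in alphabet:
            w = u + c
            if w not in facs and w[1:] in facs:
                maws.add(w)
    return maws


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b| between two sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def length_weighted_symdiff(a, b):
    """Sum of 1/|w|^2 over the symmetric difference (assumed weighting)."""
    return sum(1.0 / len(w) ** 2 for w in a ^ b)


if __name__ == "__main__":
    # Two short made-up sequences, purely for illustration.
    x, y = "ACGTACGGTA", "ACGTTCGGAA"
    mx = minimal_absent_words(x, with_rc=True)
    my = minimal_absent_words(y, with_rc=True)
    print(jaccard_distance(mx, my), length_weighted_symdiff(mx, my))
```

The length cap keeps the enumeration small; for short inputs it can be set to the sequence length plus one so that no minimal absent word is missed.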

Highlights

An absent word with respect to a sequence is a word that does not occur in the sequence as a factor; an absent word is minimal if all its factors on the other hand occur in that sequence
The use of minimal absent words (MAW) to compute the distance between two biological sequences has a potential advantage over alignment-based methods
It is expected that this potential advantage will encourage researchers and practitioners to use this as a (dis)similarity measure in the context of sequence comparison and phylogeny reconstruction

Summary

Introduction

An absent word with respect to a sequence is a word that does not occur in the sequence as a factor; an absent word is minimal if all of its proper factors do occur in that sequence. The motivation and rationale for our work come from the potential advantage of extracting as little information as possible from large genomic sequences in order to compare sequences in an alignment-free manner.

Journal: BMC Research Notes
Publication Date: Mar 22, 2016
Citations: 26
License type: cc-by