Abstract

Background: Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts.

Results: A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number. The resulting series of complex numbers is then transformed into a feature vector using the stationary discrete wavelet transform. After these steps, the original sequence has been turned into a feature vector of numeric values, which can then be used for clustering and/or classification.

Conclusions: Using two different types of applications, namely clustering and classification, we compared SSAW against state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These results make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
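The pipeline described in the abstract (k-mers, then complex numbers, then a stationary wavelet transform, then a feature vector) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SSAW implementation: the nucleotide-to-complex table `BASE_TO_COMPLEX`, the positional weighting in `kmer_to_complex`, the choice of a single-level Haar wavelet with periodic boundaries, and the four summary statistics returned by `feature_vector` are all hypothetical choices that the text above does not specify.

```python
import math
from typing import List, Tuple

# ASSUMPTION: a hypothetical nucleotide-to-complex mapping chosen for
# illustration only; the mapping actually used by SSAW is not given here.
BASE_TO_COMPLEX = {"A": 1 + 1j, "C": -1 + 1j, "G": -1 - 1j, "T": 1 - 1j}


def extract_kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_to_complex(kmer: str) -> complex:
    """Map a k-mer to one complex number.

    ASSUMPTION: each base contributes its complex value, weighted by
    0.5**position. Bases outside A/C/G/T raise KeyError.
    """
    return sum(BASE_TO_COMPLEX[b] * 0.5 ** i for i, b in enumerate(kmer))


def haar_swt_level1(signal: List[complex]) -> Tuple[List[complex], List[complex]]:
    """One level of an undecimated (stationary) Haar transform.

    Unlike the ordinary DWT there is no downsampling, so both output
    bands have the same length as the input. Boundaries wrap around.
    """
    n = len(signal)
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[(i + 1) % n]) * s for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) * s for i in range(n)]
    return approx, detail


def feature_vector(seq: str, k: int = 3) -> List[float]:
    """Turn a sequence into a short, fixed-length numeric vector.

    ASSUMPTION: the summary statistics below are a hypothetical way to
    obtain a fixed length; they are not taken from the paper.
    """
    kmers = extract_kmers(seq.upper(), k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    series = [kmer_to_complex(km) for km in kmers]
    approx, detail = haar_swt_level1(series)
    n = len(series)
    mean_approx = sum(approx) / n
    energy_approx = sum(abs(c) ** 2 for c in approx) / n
    energy_detail = sum(abs(c) ** 2 for c in detail) / n
    return [mean_approx.real, mean_approx.imag, energy_approx, energy_detail]
```

Because every sequence ends up as a vector of the same length, vectors produced this way could be passed to any standard clustering or classification routine, which is the use the abstract describes.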

Highlights

  • Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts

  • Alignment-free methods have been used on various sequence analysis problems in biology and medicine, including Deoxyribonucleic acid (DNA) sequences [6,7,8], RNA sequences [9], protein sequences [10, 11], as well as in detection of single nucleotide variants in genomes [12]

  • According to Bonham-Carter et al. [25], the word-based methods can be further divided into five categories, namely, base-base correlations (BBC), feature frequency profiles (FFPs), compositional vectors (CVs), string composition methods, and the D2-statistic family

Summary

Introduction

Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. Efficient and accurate similarity analysis for a large number of sequences is a challenging problem in computational biology [1, 2]. Alignment-based and alignment-free sequence similarity analysis are the two primary approaches to this problem. Alignment-free methods have been used on various sequence analysis problems in biology and medicine, including DNA sequences [6,7,8], RNA sequences [9], protein sequences [10, 11], as well as in detection of single nucleotide variants in genomes [12]. Alignment-free approaches are broadly divided into two groups [3]: word-based methods and information-theory-based methods. According to Bonham-Carter et al. [25], the word-based methods can be further divided into five categories, namely, base-base correlations (BBC), feature frequency profiles (FFPs), compositional vectors (CVs), string composition methods, and the D2-statistic family.
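To make the idea of a word-based method concrete, the sketch below compares two sequences through their normalized k-mer frequency profiles. It is a generic illustration of the word-counting idea and is not an implementation of any of the five categories listed above; the function names `kmer_profile` and `euclidean`, the default `ACGT` alphabet, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product
from typing import List


def kmer_profile(seq: str, k: int, alphabet: str = "ACGT") -> List[float]:
    """Relative frequency of every possible k-mer over the alphabet.

    The profile always has len(alphabet)**k entries, in lexicographic
    order of the alphabet, so profiles of different sequences line up.
    k-mers containing characters outside the alphabet are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("no valid k-mers found in sequence")
    return [counts[w] / total for w in words]


def euclidean(p: List[float], q: List[float]) -> float:
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Usage: no alignment is computed; only word counts are compared.
p = kmer_profile("ACGTACGTACGT", k=2)
q = kmer_profile("ACGTTTTTACGT", k=2)
distance = euclidean(p, q)
```

Computing a profile takes one pass over the sequence, which is the source of the running-time advantage over alignment that the paragraph above mentions.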
