Abstract

The Burrows-Wheeler Transform (BWT) is an extremely useful tool for lossless compression of text, and it has recently found many applications in bioinformatics. In this paper, the BWT is introduced from the viewpoint of combinatorics, and an equivalence relation on words is then proposed which shows that the transformation captures some common features of equivalent words. Based on the rationale that the extent to which two words differ can be evaluated from the factors that remain once their common features are excluded, a matrix representation for a DNA sequence is defined by means of a “subtraction operation” between the original word and its BWT word. A DNA sequence is thereby converted into a 24-dimensional vector whose components are the spectral norms of these matrices. To illustrate the use of this quantitative characterization of DNA sequences, phylogenetic trees are constructed for the full β-globin genes of 15 species and for the S segments of 13 hantaviruses. The resulting monophyletic clusters agree well with the established taxonomic groups.
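
For readers unfamiliar with the transform, the following minimal Python sketch shows the BWT as it is usually defined in combinatorics on words: sort all cyclic rotations of the word and read off the last column. The function name `bwt` is chosen here for illustration, and the naive sort-based construction is an assumption made for brevity, not the paper's implementation. The abstract does not specify the “subtraction operation” or the resulting matrices, so they are not sketched.

```python
def bwt(word: str) -> tuple[str, int]:
    """Naive Burrows-Wheeler Transform of `word` (illustrative sketch).

    Returns the last column of the sorted cyclic-rotation table together
    with the row index at which the original word appears. No end-of-text
    sentinel is appended, so the index is needed to invert the transform.
    """
    if not word:
        return "", 0
    n = len(word)
    # Build and lexicographically sort all n cyclic rotations of the word.
    rotations = sorted(word[i:] + word[:i] for i in range(n))
    # The transformed word is the last character of each sorted rotation.
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(word)
```

For example, `bwt("banana")` returns `("nnbaaa", 3)`. Any cyclic rotation of a word, such as `"ananab"`, has the same set of rotations and therefore the same transformed word, which is a well-known way in which the BWT captures features shared by related words. Whether this coincides with the equivalence relation proposed in the paper is not stated in the abstract.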
