PEER: A direct method for biosequence pattern mining through waits of optimal k-mers

Uddalak Mitra, Balaram Bhattacharyya, Tathagato Mukhopadhyay

doi:10.1016/j.ins.2019.12.072

Abstract

Achieving the accuracy of alignment-based methods at linear time complexity is desirable in biosequence studies. k-mer statistics are the principal alternative, but selecting the optimal k is crucial for effective feature extraction. Prevalent methods require successive trials with incrementing k to find the best match with a reference phylogeny tree. We observe that successive intervals (or waits) of optimal-length k-mers carry precise information about the sequence, so that features can be extracted from the entropies of the waits. We introduce a method, Pattern Extraction through Entropy Retrieval (PEER), that transforms a sequence into a vector of wait entropies of optimal k-mers. The distance between a pair of sequences is the Euclidean distance between their wait vectors. We present an analytical determination of the optimal k from the maximality of total wait entropy, which frees PEER from the usual multiple trials for obtaining the optimal k. We conduct experiments on several benchmark datasets of omics clades for phylogeny analysis and perform an in-depth comparison against seven state-of-the-art alignment-free methods. The phylogeny tree built from the PEER distance closely resembles the corresponding biological taxonomy and achieves the best Robinson-Foulds score. PEER can sense small artificial mutations within a sequence. It scales with linear time complexity, which makes it especially useful for comparing long sequences.
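
To make the idea of a wait-entropy vector concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a "wait" is the gap between the start positions of successive occurrences of the same k-mer, that the entropy is the Shannon entropy (base 2) of the observed wait lengths, and that a k-mer with fewer than two occurrences contributes zero. The function names `wait_entropy_vector` and `wait_vector_distance` are hypothetical, and k is passed in by hand because the analytical choice of the optimal k is not reproduced here.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def wait_entropy_vector(seq, k, alphabet="ACGT"):
    """Return one wait entropy per possible k-mer, in lexicographic order.

    Assumption (not taken from the paper): a wait is the difference between
    start positions of consecutive occurrences of the same k-mer.
    """
    # Single pass over the sequence: record where each k-mer starts.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        pos = positions.get(kmer, [])
        waits = [b - a for a, b in zip(pos, pos[1:])]
        if not waits:
            # Assumption: k-mers seen fewer than twice get entropy 0.
            vector.append(0.0)
            continue
        counts = Counter(waits)
        total = len(waits)
        # Shannon entropy of the empirical wait-length distribution.
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        vector.append(entropy)
    return vector


def wait_vector_distance(seq_a, seq_b, k):
    """Euclidean distance between the wait-entropy vectors of two sequences."""
    va = wait_entropy_vector(seq_a, k)
    vb = wait_entropy_vector(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

For example, `wait_vector_distance("ACGACTTAC", "ACACAC", 2)` compares two short DNA strings at k = 2. The vector has 4^k entries, so this enumeration is only practical for small k; a sparse representation keyed by observed k-mers would be the natural alternative for larger k.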

Full Text