Sequence Comparison Alignment-Free Approach Based on Suffix Tree and L-Words Frequency

Inês Soares, Ana Goios, António Amorim

doi:10.1100/2012/450124

Abstract

The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions about evolutionary history and is sometimes very difficult or impossible to perform because of the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by computing a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix that can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, obtained by determining a single optimal word length and combining suffix tree structures with the word counting tasks. Our approach is thus a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python and is freely available on the web.

Highlights

During the last decades many sequence comparison methods have been developed in order to recover evolutionary and phylogenetic signals as well as for the discovery of pathogenic mutations [1, 2]. The most common approaches are based on sequence alignments [3, 4].
The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps.
We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word counting tasks.

Summary

Introduction

Over the last decades, many sequence comparison methods have been developed to recover evolutionary and phylogenetic signals and to discover pathogenic mutations [1, 2]. The most common approaches are based on sequence alignments [3, 4]. Many alignment-free methods have also been proposed [5, 7, 8, 9]; being based on word frequencies or on match lengths, they are algorithmically simple and computationally faster than alignment methods. L-words counting in a sequence is usually performed with an overlapping sliding window that advances one base at a time. We present a new approach that determines a single optimal word length, L, and generates L-words frequency profiles using suffix tree theory. The algorithm was applied to a variety of mtDNA sequences that are difficult to handle with automated alignment methods, and its performance was compared with that of the available word counting alignment-free methodologies.
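The word counting and distance steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the plain one-base overlapping sliding window rather than a generalized suffix tree, and the function names (lword_frequencies, euclidean, distance_matrix), the DNA alphabet "ACGT", and the choice of relative frequencies (counts divided by the number of counted words) are assumptions made for illustration. Windows containing characters outside the alphabet are simply ignored.

```python
from collections import Counter
from itertools import product
from math import sqrt


def lword_frequencies(seq, L, alphabet="ACGT"):
    """Relative frequency of every possible L-word in seq.

    Words are counted with an overlapping window that moves one base
    at a time (naive approach, not a suffix tree). The profile is
    ordered lexicographically over the alphabet (assumption).
    """
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    words = ["".join(p) for p in product(alphabet, repeat=L)]
    # Only words made of alphabet characters contribute to the total.
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(seqs, L):
    """Symmetric matrix of pairwise distances between L-word profiles."""
    profiles = [lword_frequencies(s, L) for s in seqs]
    n = len(profiles)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    for row in distance_matrix(["ACGTACGT", "ACGTTGCA", "AAAAAAAA"], 2):
        print(["%.3f" % d for d in row])
```

The resulting matrix is the kind of input that a neighbor joining or multidimensional scaling routine would take. The naive window costs on the order of len(seq) * L per sequence, which is one motivation for the suffix tree described in the text.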

Methods

Results and Discussion

10 Pan troglodytes; 22 Pan paniscus; 29 primates; 104 Homo sapiens; 150 Homo sapiens