Abstract
Background: Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures used to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence-alignment-based methods for the reconstruction of phylogenetic trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that approximate ACSk have been introduced.
Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values than previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
Conclusions: Our method produces a better approximation for ACSk and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
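To make the quantity being approximated concrete, the following is a minimal brute-force sketch of ACSk under its commonly used definition: for every position i of X, take the longest prefix of X[i:] that matches some substring of Y with at most k mismatches, then average over all i. This is not the paper's linear-time heuristic, nor the exact O(n log^k n) algorithm. The function names `lcp_k` and `acs_k` are hypothetical and chosen only for illustration.

```python
def lcp_k(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:]
    that contains at most k mismatching positions."""
    mismatches = 0
    length = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break  # a further mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def acs_k(x, y, k=0):
    """Brute-force ACSk of x against y (illustrative only).

    For each start position i in x, find the best k-mismatch match
    against every start position j in y, then average over i.
    With k = 0 this reduces to the plain ACS measure.
    """
    if not x or not y:
        raise ValueError("sequences must be non-empty")
    total = 0
    for i in range(len(x)):
        total += max(lcp_k(x, i, y, j, k) for j in range(len(y)))
    return total / len(x)
```

Because every pair of start positions is compared, the sketch is at least quadratic in the sequence length, which is why it serves only as a reference definition for short inputs and not as a practical method.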
Highlights
Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly in the estimation of sequence similarity measures used to construct phylogenetic trees
Over the past two decades, many similarity measures based on alignment-free methods have been proposed for sequence comparison for a diverse range of bioinformatics applications
All the experiments were run on a system with two 2.4 GHz 14-core Intel E5-2680 v4 processors and 256 GB of main memory, running the Red Hat Enterprise Linux (RHEL) 7.0 operating system
Summary
Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly in the estimation of sequence similarity measures used to construct phylogenetic trees. Alignment-free methods are used to construct the pairwise distance matrix, a symmetric matrix of sequence similarity measures computed for every pair in the given set of sequences. Alignment-free methods for computing similarity measures can be classified by whether the seeds are exact or approximate and whether the seeds are of fixed or variable length. The most popular among the fixed-length exact seed methods are k-mer-based techniques (k-mers are fixed-length exact seeds of length k). These proceed by first constructing the sets of all the k-mers of a pair of sequences, and then estimating a similarity measure based either on the k-mer frequency profile (e.g., Euclidean distance, CVTree [5], FFP [6]) or on the intersection/differences of the k-mer sets (e.g., the Jaccard coefficient). Methods using approximate fixed-length seeds, such as spaced-seed approaches [8], allow the use of k-mers with mismatches at specific locations and make use of multiple patterns to improve accuracy.
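The fixed-length exact seed approach described above can be sketched in a few lines. The example below is a minimal illustration, not code from any of the cited tools: it extracts k-mers, computes the Jaccard coefficient over k-mer sets and a Euclidean distance over raw k-mer counts (an assumption made here; tools such as CVTree and FFP use their own normalizations), and fills a symmetric pairwise distance matrix. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(a, b, k):
    """Jaccard coefficient of the k-mer sets of sequences a and b."""
    sa, sb = set(kmers(a, k)), set(kmers(b, k))
    union = sa | sb
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(sa & sb) / len(union)


def euclidean_kmer(a, b, k):
    """Euclidean distance between raw k-mer count profiles of a and b."""
    ca, cb = Counter(kmers(a, k)), Counter(kmers(b, k))
    # Counter returns 0 for k-mers absent from one of the profiles.
    return sqrt(sum((ca[m] - cb[m]) ** 2 for m in set(ca) | set(cb)))


def pairwise_matrix(seqs, dist):
    """Symmetric matrix of dist(s, t) for every pair of sequences."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(seqs[i], seqs[j])
    return matrix
```

For example, `pairwise_matrix(seqs, lambda s, t: 1 - jaccard(s, t, 2))` turns the Jaccard similarity into a distance and yields the kind of matrix that a tree-building method such as neighbor joining would take as input.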