An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

Sriram P Chockalingam, Jodh Pannu, Sahar Hooshmand, Sharma V Thankachan, Srinivas Aluru

doi:10.1186/s12859-020-03738-5

Open Access

https://doi.org/10.1186/s12859-020-03738-5

Abstract

Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence-alignment-based methods for the reconstruction of phylogenetic trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that approximate ACSk have been introduced.

Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values than previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.

Conclusions: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.

Highlights

Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees.
Over the past two decades, many similarity measures based on alignment-free methods have been proposed for sequence comparison across a diverse range of bioinformatics applications.
All the experiments were run on a system with two 2.4 GHz 14-core Intel E5-2680 v4 processors and 256 GB of main memory, running the Red Hat Enterprise Linux (RHEL) 7.0 operating system.

Summary

Introduction

Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Alignment-free methods are used to construct the pairwise distance matrix, a symmetric matrix of sequence similarity measures computed for every pair in the given set of sequences. Alignment-free methods for the computation of similarity measures can be classified based on whether the seeds are exact or approximate and whether the seeds are of fixed or variable length. The most popular among the fixed-length exact seed methods are k-mer-based techniques, which proceed by first constructing the sets of all the k-mers (k-mers are fixed-length exact seeds of length k) of a pair of sequences, followed by the estimation of a similarity measure based either on the k-mer frequency profile (e.g. Euclidean distance, CVTree [5], FFP [6]) or on the intersection/differences of the k-mer sets (e.g. Jaccard coefficient). Methods using approximate fixed-length seeds, such as spaced-seed approaches [8], allow the use of k-mers with mismatches at specific locations and make use of multiple patterns to improve accuracy.
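The k-mer-based comparison described above can be illustrated with a short sketch. This is not the method proposed in the paper, and it is not taken from the adyar-rs source code; the function names (`kmers`, `jaccard_distance`, `euclidean_distance`, `distance_matrix`) are hypothetical and chosen only for this illustration. It assumes that the Jaccard coefficient is turned into a distance as one minus the coefficient, and that frequency profiles are normalised to relative frequencies before taking the Euclidean distance.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kmers(seq, k):
    """Return all overlapping substrings of length k (fixed-length exact seeds)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard_distance(a, b, k):
    """One minus the Jaccard coefficient of the k-mer sets of a and b."""
    sa, sb = set(kmers(a, k)), set(kmers(b, k))
    union = sa | sb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def euclidean_distance(a, b, k):
    """Euclidean distance between relative k-mer frequency profiles."""
    ca, cb = Counter(kmers(a, k)), Counter(kmers(b, k))
    # Guard against division by zero when a sequence has no k-mers.
    na = sum(ca.values()) or 1
    nb = sum(cb.values()) or 1
    words = set(ca) | set(cb)
    return sqrt(sum((ca[w] / na - cb[w] / nb) ** 2 for w in words))


def distance_matrix(seqs, k, dist):
    """Symmetric pairwise matrix: dist is evaluated once per unordered pair."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dist(seqs[i], seqs[j], k)
    return d


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TTTT"]  # toy sequences, not data from the paper
    for row in distance_matrix(toy, 2, jaccard_distance):
        print(row)
```

A matrix built this way could then be passed to a distance-based tree construction method; either distance function can be supplied as the `dist` argument because both share the same signature.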

Journal: BMC Bioinformatics
Publication Date: Nov 1, 2020
License type: open-access
