Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

Chris-Andre Leimeister, Burkhard Morgenstern

doi:10.1093/bioinformatics/btu331

Open Access

Journal: Bioinformatics	Publication Date: May 13, 2014
Citations: 147	License type: CC BY 3.0

Affiliation: University of Göttingen

Abstract

Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays.

Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.

Availability and implementation: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/

Contact: chris.leimeister@stud.uni-goettingen.de

Supplementary information: Supplementary data are available at Bioinformatics online.
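The k-mismatch idea from the Motivation paragraph can be illustrated with a small sketch. It is not the kmacs implementation: kmacs is written in C++ and uses generalized enhanced suffix arrays, whereas this Python version uses a naive quadratic scan. It also assumes a particular reading of the greedy heuristic, namely that for each position in the first sequence one finds a longest exact match in the second sequence and then extends that match without gaps until the (k+1)-th mismatch. The function names, the tie-breaking rule (first longest match wins), and the choice to count 0 for positions with no exact match are all assumptions made for illustration.

```python
def longest_exact_match(s1, i, s2):
    """Return (length, j) for the longest exact match of s1[i:] against any s2[j:].

    Naive scan over all start positions j; ties keep the first j found.
    Returns (0, -1) if s1[i] does not occur in s2 at all.
    """
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        n = 0
        while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
            n += 1
        if n > best_len:
            best_len, best_j = n, j
    return best_len, best_j


def extend_with_mismatches(s1, i, s2, j, k):
    """Length of the gap-free match of s1[i:] and s2[j:] that stops
    just before the (k+1)-th mismatch or at the end of either sequence."""
    n, mismatches = 0, 0
    while i + n < len(s1) and j + n < len(s2):
        if s1[i + n] != s2[j + n]:
            mismatches += 1
            if mismatches > k:
                break
        n += 1
    return n


def average_k_mismatch_length(s1, s2, k):
    """Average, over all positions i of s1, of the greedy k-mismatch
    match length against s2. With k = 0 this reduces to the average
    length of longest exact common substrings."""
    if not s1:
        raise ValueError("s1 must be non-empty")
    total = 0
    for i in range(len(s1)):
        _, j = longest_exact_match(s1, i, s2)
        if j >= 0:  # assumption: positions without an exact match count as 0
            total += extend_with_mismatches(s1, i, s2, j, k)
    return total / len(s1)
```

For example, `average_k_mismatch_length("ACGTA", "ACCTA", 0)` averages exact-match lengths only, while passing `k=1` lets each match run past one mismatch, so the average can only stay the same or grow. Turning this average into a distance requires the normalization defined in the paper, which is not reproduced here.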

Highlights

Comparative sequence analysis traditionally relies on pairwise or multiple sequence alignment
We present a greedy heuristic to approximate the length of longest common substrings with k mismatches, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays
Phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches

Summary

Introduction

Comparative sequence analysis traditionally relies on pairwise or multiple sequence alignment. With the huge datasets that are produced by next-generation sequencing technologies, today’s alignment algorithms reach their limits. With the growing number of completely or partially sequenced genomes, there is an urgent demand for faster sequence-comparison methods. Over the past two decades, a wide variety of alignment-free approaches were proposed (Vinga and Almeida, 2003). Whereas aligning two sequences takes time proportional to the product of their lengths, most alignment-free methods run in linear time. They are therefore increasingly used for genome-based phylogeny reconstruction and for large-scale protein sequence comparison. It is known, however, that alignment-free methods are generally less accurate than alignment-based approaches.

Similar Papers

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.
Sylvain Forêt ... Conrad J Burden
01 Dec 2006
BMC Bioinformatics | VOL. Suppl 7 5

Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.
Sebastian Horwege ... Martin Kollmar
14 May 2014
Nucleic acids research | VOL. 42

ALFRED: A Practical Method for Alignment-Free Distance Computation.
Sharma V Thankachan ... Yongchao Liu
03 May 2016
Journal of Computational Biology | VOL. 23

A novel method for comparative analysis of DNA sequences by Ramanujan-Fourier transform.
Changchuan Yin ... Jiasong Wang
01 Dec 2014
Journal of Computational Biology | VOL. 21
