Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Sebastian Horwege, Sebastian Lindner, Marcus Boden, Burkhard Morgenstern, Chris-André Leimeister, Klas Hatje, Martin Kollmar

doi:10.1093/nar/gku398


Open Access


Journal: Nucleic Acids Research
Publication Date: May 14, 2014
Citations: 68
License type: CC BY 3.0

Affiliation: University of Göttingen

Abstract

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing ‘don't care’ or ‘wildcard’ symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating, for each position in the first sequence, the length of the longest substring starting at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at the ‘Göttingen Bioinformatics Compute Server (GOBICS)’, at http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.
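
To make the second approach more concrete, the following is a minimal brute-force sketch of the quantity the abstract describes: for each position in the first sequence, the length of the longest substring starting there that occurs in the second sequence with up to k mismatches. The function names are hypothetical, and the exhaustive search is an assumption made for illustration; it is not the estimation procedure used by kmacs itself, and it does not compute the final distance value.

```python
def longest_k_mismatch_at(s1, i, s2, k):
    """Length of the longest substring of s1 starting at position i that
    matches a substring of s2 with at most k mismatches (brute force)."""
    best = 0
    for j in range(len(s2)):
        mismatches = 0
        length = 0
        # Extend the comparison s1[i:], s2[j:] until mismatch k+1 is reached.
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(s1, s2, k):
    """Average of the per-position lengths over all positions of s1."""
    lengths = [longest_k_mismatch_at(s1, i, s2, k) for i in range(len(s1))]
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    print(average_k_mismatch_length("ACGT", "ACCT", 0))  # 1.0
    print(average_k_mismatch_length("ACGT", "ACCT", 1))  # 2.5
```

In this toy example, allowing a single mismatch raises the average length from 1.0 to 2.5, which illustrates why inexact matches can pick up similarity that exact matches miss.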

Highlights

Comparative sequence analysis and phylogeny reconstruction are traditionally based on pairwise or multiple sequence alignments
Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction
After calculating the relative frequencies of all spaced words according to the fixed pattern P, our programme can use different distance measures to define pairwise distances among the input sequences based on their relative spaced-word frequencies
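
The spaced-word idea in the highlights above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: the pattern P is assumed to be written as a string of '1' (match position) and '0' (don't-care position), don't-care positions are rendered as '*', and the Euclidean distance is used as one possible distance measure. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def spaced_word_frequencies(seq, pattern):
    """Relative frequencies of spaced words in seq for a fixed pattern.

    Assumption: pattern is a string of '1' (match) and '0' (don't care);
    characters at don't-care positions are replaced by '*'.
    """
    counts = Counter()
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        word = "".join(c if p == "1" else "*" for c, p in zip(window, pattern))
        counts[word] += 1
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def euclidean_distance(f1, f2):
    """Euclidean distance between two spaced-word frequency profiles."""
    words = set(f1) | set(f2)
    return sqrt(sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    pattern = "101"  # hypothetical example pattern
    f1 = spaced_word_frequencies("ACGT", pattern)
    f2 = spaced_word_frequencies("ATGT", pattern)
    print(f1)
    print(euclidean_distance(f1, f2))
```

With the pattern "101", the windows ACG and ATG both map to the spaced word A*G, so a substitution at the don't-care position does not change the spaced-word profile.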

Summary

Introduction

Comparative sequence analysis and phylogeny reconstruction are traditionally based on pairwise or multiple sequence alignments. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. Various distance measures can be defined on sequences based on their different spaced-word composition.

Results

Conclusion


Similar Papers

Estimating evolutionary distances between genomic sequences from spaced-word matches.
Burkhard Morgenstern ... Bingyao Zhu
Algorithms for Molecular Biology | VOL. 10
11 Feb 2015

Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.
Chris-Andre Leimeister ... Burkhard Morgenstern
Bioinformatics | VOL. 30
13 May 2014

Development and Role of the Human Reference Sequence in Personal Genomics
Todd M Smith ... Sandra G Porter
16 Jun 2014

DNA sequence similarity search through content-based retrieval technique
Chia Hung Yeh ... Po Yi Sung
27 Aug 2003
