Abstract

Background: Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487–1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings.

Results: In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.

Highlights

  • Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences

  • Other approaches are based on the matching statistics [10], that is, on the lengths of common substrings of the input sequences [11, 12]

  • Since there is no exact solution to the k-mismatch longest common substring problem that is fast enough to be applied to long genomic sequences, we proposed a simple heuristic: we first search for longest exact matches and then extend these matches until the (k + 1)-th mismatch occurs
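
The heuristic in the last highlight can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the exact-match search is a naive quadratic scan rather than an index-based search, and recording a length of 0 when a position has no exact match at all is a simplifying assumption made here.

```python
def longest_exact_match(s1, i, s2):
    """Return (length, j) for the longest exact match between the suffix of s1
    starting at i and any position j in s2 (naive scan; first best j wins)."""
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        l = 0
        while i + l < len(s1) and j + l < len(s2) and s1[i + l] == s2[j + l]:
            l += 1
        if l > best_len:
            best_len, best_j = l, j
    return best_len, best_j


def extend_with_mismatches(s1, i, s2, j, k):
    """Length of the gap-free match starting at s1[i] and s2[j], extended
    until the (k + 1)-th mismatch or the end of either sequence."""
    length = 0
    mismatches = 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k + 1)-th mismatch is not part of the match
        length += 1
    return length


def kmismatch_lengths(s1, s2, k):
    """For every position i in s1, anchor at a longest exact match in s2 and
    extend it with up to k mismatches; return the list of resulting lengths."""
    lengths = []
    for i in range(len(s1)):
        _, j = longest_exact_match(s1, i, s2)
        if j < 0:
            lengths.append(0)  # assumption: no exact match means length 0
        else:
            lengths.append(extend_with_mismatches(s1, i, s2, j, k))
    return lengths
```

Calling `kmismatch_lengths(s1, s2, k)` gives one length per position of `s1`; these values can then be averaged or collected into a length distribution. Because the extension starts from a longest exact match, the result is a lower bound on the true k-mismatch longest common substring length at each position, which is why the approach is a heuristic.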


Summary

Introduction

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. In one such approach, distances are calculated from the average length of k-mismatch common substrings, analogous to the average common substring (ACS) approach; the implementation of this approach is called kmacs.

