Abstract

Background: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies.

Results: We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k = max(r1, r2) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.

Conclusion: Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.
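As a rough illustration of the word-length rule stated in the abstract, the sketch below picks k = max(r1, r2) + 1 and tallies overlapping k-words in two toy sequences. The function names, the example sequences, and the orders r1 = 1 and r2 = 2 are hypothetical and chosen only for illustration. The sketch stops at word counting and does not implement the paper's χ²-statistic.

```python
from collections import Counter


def optimal_word_length(r1: int, r2: int) -> int:
    """Word length suggested by the abstract: k = max(r1, r2) + 1."""
    if r1 < 0 or r2 < 0:
        raise ValueError("Markov orders must be non-negative")
    return max(r1, r2) + 1


def count_words(seq: str, k: int) -> Counter:
    """Count overlapping words (k-tuples) of length k in seq."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical orders for the two sequences (assumed, not estimated).
k = optimal_word_length(1, 2)

# Toy sequences, far shorter than real genomes or read sets.
counts_a = count_words("ACGTACGTAC", k)
counts_b = count_words("ACGGACGGAC", k)

# Words seen in either sequence, with their counts side by side.
for word in sorted(set(counts_a) | set(counts_b)):
    print(word, counts_a[word], counts_b[word])
```

In practice the orders would have to be estimated from the data, which the abstract notes can cost some statistical power depending on the estimator used.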

Highlights

  • Alignment-free sequence comparison using counts of word patterns has become an active research topic due to the large amount of sequence data from the new sequencing technologies

  • Optimal word length for the comparison of Markov sequences using the χ²-statistic: the following theorem gives the optimal word length for the comparison of two sequences using the χ²-statistics given in Eqs. (4) and (5)

  • We present simulation results to show the power of the statistic Sk in Eqs. (4) and (5) for different values of sequence length and word pattern length

Summary

Introduction

Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. The most widely used methods for sequence comparison are alignment-based algorithms such as the Smith-Waterman algorithm [1], BLAST [2], and BLAT [3]. In such studies, homologous genes among the genomes are identified and aligned, and their relationships are inferred using phylogenetic analysis tools to obtain gene trees. Most alignment-based methods do not consider the non-conserved regions, resulting in loss of information. Another drawback of alignment-based methods is the extremely long time needed for the analysis, especially when the number of genome sequences is large.
