Optimal choice of word length when comparing two Markov sequences using a χ2-statistic

Xin Bai, Fengzhu Sun, Kujin Tang, Michael Waterman, Jie Ren

doi:10.1186/s12864-017-4020-z

Open Access

https://doi.org/10.1186/s12864-017-4020-z

Journal: BMC Genomics	Publication Date: Oct 1, 2017
Citations: 8	License type: open-access

Affiliation: Fudan University, University of Southern California

Abstract

Background: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains, and the likelihood ratio test or the corresponding approximate χ2-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies.

Results: We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k = max(r1, r2) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown, and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.

Conclusion: Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.
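The word-length rule in the abstract can be illustrated with a minimal Python sketch. The function names (`count_words`, `optimal_word_length`, `pearson_chi2`) are hypothetical and introduced only for illustration. The comparison statistic shown is a generic two-sample Pearson χ2 over word counts, swapped in as a stand-in: it is not the statistic of Eqs. (4) and (5), which this page does not reproduce.

```python
from collections import Counter


def count_words(seq: str, k: int) -> Counter:
    """Count overlapping words (k-tuples) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def optimal_word_length(r1: int, r2: int) -> int:
    """Word length stated in the abstract: k = max(r1, r2) + 1."""
    return max(r1, r2) + 1


def pearson_chi2(counts_a: Counter, counts_b: Counter) -> float:
    """Generic two-sample Pearson chi-square over the pooled word set.

    Assumption: illustrative stand-in only, not the paper's statistic.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    stat = 0.0
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        pooled = (a + b) / (total_a + total_b)
        for observed, total in ((a, total_a), (b, total_b)):
            expected = pooled * total
            if expected > 0:  # skip cells with no expected mass
                stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical Markov orders r1 = 1 and r2 = 2 give k = 3.
    k = optimal_word_length(1, 2)
    a = count_words("ACGTACGTTGCAACGT", k)
    b = count_words("TTTTACGGGGCATTTA", k)
    print(k, pearson_chi2(a, b))
```

In practice r1 and r2 would come from one of the order estimators mentioned in the abstract; here they are simply supplied by hand.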

Highlights

Alignment-free sequence comparison using counts of word patterns has become an active research topic due to the large amount of sequence data from the new sequencing technologies.
Optimal word length for the comparison of Markov sequences using the χ2-statistic: the following theorem gives the optimal word length for the comparison of two sequences using the χ2-statistics given in Eqs. (4) and (5).
We present simulation results to show the power of the statistic S_k in Eqs. (4) and (5) for different values of sequence length and word pattern length.
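To make the simulation setting in the highlights concrete, the following is a minimal sketch of how a sequence could be drawn from a Markov chain of a given order. The function name `simulate_markov`, the uniform choice of the first `order` letters, and the example transition table are all assumptions made for illustration; the paper's actual simulation design is not described on this page.

```python
import random

ALPHABET = "ACGT"


def simulate_markov(length, order, transition, seed=None):
    """Draw a sequence from a Markov chain of the given order.

    transition maps each context (a string of `order` letters) to a list
    of four probabilities over ALPHABET. Assumption: the first `order`
    letters are drawn uniformly at random.
    """
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in range(order)]
    while len(seq) < length:
        # Slice from an explicit start index so that order 0 gives "".
        context = "".join(seq[len(seq) - order:])
        seq.append(rng.choices(ALPHABET, weights=transition[context])[0])
    return "".join(seq[:length])


if __name__ == "__main__":
    # Hypothetical first-order chain that favours repeating the last letter.
    table = {
        "A": [0.7, 0.1, 0.1, 0.1],
        "C": [0.1, 0.7, 0.1, 0.1],
        "G": [0.1, 0.1, 0.7, 0.1],
        "T": [0.1, 0.1, 0.1, 0.7],
    }
    print(simulate_markov(40, 1, table, seed=1))
```

Pairs of such sequences, generated under equal or different transition tables, are the kind of input on which the power of a comparison statistic can be estimated for different sequence lengths and word lengths.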

Summary

Introduction

Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. The most widely used sequence comparison methods, however, are alignment-based algorithms such as the Smith-Waterman algorithm [1], BLAST [2], and BLAT [3]. In such studies, homologous genes among the genomes are identified and aligned, and their relationships are inferred using phylogenetic analysis tools to obtain gene trees. Most alignment-based methods do not consider non-conserved regions, which results in a loss of information. Another drawback of alignment-based methods is the extremely long time needed for the analysis, especially when the number of genome sequences is large.

Similar Papers

Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction
Peng Zeng ... Jing Cai
Chinese medicine | VOL. 17
09 Aug 2022

Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.
Jie Ren ... Kai Song
Bioinformatics | VOL. 32
30 Jun 2015

Normal and compound poisson approximations for pattern occurrences in NGS reads.
Zhiyuan Zhai ... Kai Song
Journal of Computational Biology | VOL. 19
01 Jun 2012

Accurate Prediction of RH Genotypes Using Whole Genome Sequencing Data
Yan Zheng ... Stella T Chou
Blood | VOL. 132
29 Nov 2018
