The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches.

John L Spouge, Pijush Das, Ye Chen, Martin Frith

doi:10.1089/cmb.2024.0508

Abstract

Introduction: Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences.

Methods: We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Consider a reference sequence A (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence B (corresponding to, e.g., a set of reads or draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of A and B. The syncmer count from B yields an approximate Gaussian distribution for its length, and a p-value can test the length of B against the length of A using syncmer counts alone.

Results: The Gaussian distributions permit syncmer counts alone to estimate θ and mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts.

Conclusions: The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.
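The mutation model described in the abstract can be made concrete with a short simulation. The Python sketch below is an illustration only and is not the authors' estimator. The function names, the lexicographic ordering of s-mers, the `offsets` parameter (the allowed start positions of the smallest s-mer within a k-mer), and the position-wise comparison of A and B are all assumptions made for this example. It relies on one fact that follows directly from the model: a k-mer survives unmutated with probability (1 − θ)^k, so the fraction of surviving syncmers can be inverted to give a point estimate of θ. It does not reproduce the Gaussian sampling error or the length test that the abstract reports.

```python
import random

def mutate(seq, theta, rng):
    """Substitute each nucleotide independently with probability theta (no indels)."""
    out = []
    for base in seq:
        if rng.random() < theta:
            # A substitution always changes the base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def syncmer_positions(seq, k, s, offsets):
    """Start positions of k-mers whose smallest s-mer begins at an offset in `offsets`.

    Assumption: s-mers are ordered lexicographically, ties go to the leftmost.
    """
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        if smers.index(min(smers)) in offsets:
            hits.append(i)
    return hits

def estimate_theta(a, b, k, s, offsets):
    """Point estimate of theta from the fraction of A's syncmers left unchanged in B.

    Assumption: A and B are aligned position by position (no indels), so an
    unchanged syncmer is found at the same start position in B.
    """
    positions = syncmer_positions(a, k, s, offsets)
    if not positions:
        raise ValueError("sequence A contains no syncmers for these parameters")
    kept = sum(1 for i in positions if a[i:i + k] == b[i:i + k])
    frac = kept / len(positions)
    # Invert frac = (1 - theta)^k.
    return 1.0 - frac ** (1.0 / k)

if __name__ == "__main__":
    rng = random.Random(0)
    a = "".join(rng.choice("ACGT") for _ in range(20000))
    b = mutate(a, 0.05, rng)
    # offsets {0, k - s} selects k-mers whose smallest s-mer is first or last.
    print(estimate_theta(a, b, k=15, s=5, offsets={0, 10}))
```

Because neighbouring k-mers overlap, the survival events are not independent, which is why a sampling error for such an estimate needs the kind of analysis the paper provides rather than a plain binomial variance.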

More From: Journal of computational biology : a journal of computational molecular cell biology