Abstract

Background: High-density genomic data is often analyzed by combining information over windows of adjacent markers. Interpreting data grouped in windows rather than at individual locations may increase statistical power, simplify computation, reduce sampling noise, and reduce the total number of tests performed. However, use of adjacent marker information can result in over- or under-smoothing, undesirable window boundary specifications, or highly correlated test statistics. We introduce a method for defining windows based on statistically guided breakpoints in the data, as a foundation for the analysis of multiple adjacent data points. This method involves first fitting a cubic smoothing spline to the data and then identifying the inflection points of the fitted spline, which serve as the boundaries of adjacent windows. This technique does not require prior knowledge of linkage disequilibrium, and therefore can be applied to data collected from individual or pooled sequencing experiments. Moreover, in contrast to existing methods, an arbitrary choice of window size is not necessary, since window sizes are determined empirically and allowed to vary along the genome.

Results: Simulations applying this method were performed to identify selection signatures from pooled sequencing FST data, for which allele frequencies were estimated from a pool of individuals. The relative ratio of true to false positives was twice that generated by existing techniques. A comparison of the approach to a previous study that involved pooled sequencing FST data from maize suggested that outlying windows were more clearly separated from their neighbors than when using a standard sliding window approach.

Conclusions: We have developed a novel technique to identify window boundaries for subsequent analysis protocols. When applied to selection studies based on FST data, this method provides a high discovery rate and minimizes false positives.
The method is implemented in the R package GenWin, which is publicly available from CRAN.
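
The boundary-finding step described above can be illustrated with a minimal Python sketch. It is not the GenWin implementation: it assumes a cubic spline has already been fitted elsewhere and that its second derivative has been evaluated at the knot positions. The function names `inflection_points` and `assign_windows` are hypothetical. Because the second derivative of a cubic spline is piecewise linear between knots, a sign change between two adjacent knots can be located by linear interpolation.

```python
from bisect import bisect_right


def inflection_points(positions, second_deriv):
    """Return the positions where the second derivative changes sign.

    Assumes `second_deriv` holds the fitted spline's second derivative at
    each entry of `positions` (sorted, e.g. base-pair coordinates of knots).
    Exact zeros at a knot are not treated specially in this sketch.
    """
    breaks = []
    for i in range(len(positions) - 1):
        d0, d1 = second_deriv[i], second_deriv[i + 1]
        if d0 * d1 < 0:  # strict sign change between adjacent knots
            x0, x1 = positions[i], positions[i + 1]
            # Zero of the straight line through (x0, d0) and (x1, d1)
            breaks.append(x0 + (x1 - x0) * d0 / (d0 - d1))
    return breaks


def assign_windows(positions, breaks):
    """Label each marker with the index of the window it falls into."""
    return [bisect_right(breaks, p) for p in positions]


# Illustrative values only, not data from the study
positions = [0, 10, 20, 30, 40]
second_deriv = [1.0, -1.0, -3.0, 1.0, 2.0]
breaks = inflection_points(positions, second_deriv)  # [5.0, 27.5]
windows = assign_windows(positions, breaks)          # [0, 1, 1, 2, 2]
```

In this toy input the two sign changes yield three windows of unequal width, which mirrors the idea that window sizes are determined by the data rather than fixed in advance.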

Highlights

  • Simulations showed that both sliding windows and distinct windows of 5 or 10 single nucleotide polymorphisms (SNPs) identified markedly fewer quantitative trait loci (QTL) than did larger windows (Table 1)

  • The data varied greatly, and the significance thresholds set in the simulations without selection were so high that they were extremely difficult to exceed

Introduction

High-density genomic data is often analyzed by combining information over windows of adjacent markers. A recurrent question that arises during the analysis of high-density genotyping or sequencing information is how best to analyze noisy data. This question is particularly relevant when analyzing sequence data from pooled samples of populations, for which, depending on the number of individuals pooled and the level of coverage per site, allele frequency estimates at individual base pairs (bp) can be very imprecise [1]. To account for this variability, methods based on estimating parameters over “windows” have been successfully used to reduce sampling error while retaining true signal. When sliding windows are used, however, highly correlated statistics are generated, since each window overlaps with its neighboring windows.
