Abstract

Background: The Illumina 450k array has been widely used in epigenetic association studies. Current quality-control (QC) pipelines typically remove certain sets of probes, such as those containing a SNP or with multiple mapping locations. An additional set of potentially problematic probes is those with DNA methylation distributions characterized by two or more distinct clusters separated by gaps. Data-driven identification of such probes may offer additional insights for downstream analyses.

Results: We developed a procedure, termed “gap hunting,” to identify probes showing clustered distributions. Among 590 peripheral blood samples from the Study to Explore Early Development, we identified 11,007 “gap probes.” The vast majority (9199) are likely attributable to one or more underlying SNPs or other variants in the probe, although some SNP-affected probes do not produce a gap signal. Specific factors predict which SNPs lead to gap signals, including type of nucleotide change, probe type, DNA strand, and overall methylation state. These expected effects are demonstrated in paired genotype and 450k data on the same samples. Gap probes can also serve as a surrogate for the local genetic sequence on a haplotype scale and can be used to adjust for population stratification.

Conclusions: The characteristics of gap probes reflect potentially informative biology. QC pipelines may benefit from an efficient data-driven approach that “flags” gap probes, rather than filtering such probes, followed by careful interpretation of downstream association analyses. Our results should translate directly to the recently released Illumina EPIC array given the similar chemistry and content design.

Electronic supplementary material: The online version of this article (doi:10.1186/s13072-016-0107-z) contains supplementary material, which is available to authorized users.

Highlights

  • The Illumina 450k array has been widely used in epigenetic association studies

  • Of the 473,864 autosomal probes we measured in Study to Explore Early Development (SEED) I participants on the 450k, we identified 11,007 (2.3%) with clustered distributions of DNA methylation (DNAm) values which we term “gap signals.”

  • Using genotype data available from the same SEED individuals, we found that these 3 methylation clusters correspond to genotype at the single nucleotide polymorphism (SNP) rs299872; this SNP is located at the interrogated C site (Fig. 1, top panel)
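
The idea of a “gap signal” can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the gap threshold of 0.05, and the outlier cutoff of 0.01 are all assumptions chosen for illustration. The sketch sorts one probe's methylation values, starts a new group wherever two neighbouring values are further apart than the threshold, and flags the probe only if the samples outside the largest group are more than a small fraction of the total.

```python
from typing import List


def find_gap_groups(betas: List[float], threshold: float = 0.05) -> List[List[float]]:
    """Split one probe's methylation values into groups separated by gaps.

    `threshold` is an assumed minimum distance between neighbouring sorted
    values that counts as a gap; it is illustrative, not taken from the source.
    """
    ordered = sorted(betas)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > threshold:
            groups.append([curr])    # gap found: start a new group
        else:
            groups[-1].append(curr)  # no gap: extend the current group
    return groups


def is_gap_probe(betas: List[float], threshold: float = 0.05,
                 out_cutoff: float = 0.01) -> bool:
    """Flag a probe whose values form two or more well-populated groups.

    `out_cutoff` is an assumed fraction of samples: if the samples outside
    the largest group make up no more than this fraction, the gap is treated
    as outlier-driven and the probe is not flagged.
    """
    groups = find_gap_groups(betas, threshold)
    if len(groups) < 2:
        return False
    largest = max(len(group) for group in groups)
    return (len(betas) - largest) / len(betas) > out_cutoff


# Hypothetical data: three clusters, as expected for a probe overlapping a SNP.
three_clusters = [0.10, 0.11, 0.12] * 10 + [0.50, 0.52] * 5 + [0.90, 0.91] * 3
print(len(find_gap_groups(three_clusters)), is_gap_probe(three_clusters))
```

Flagging rather than filtering, as the abstract suggests, would amount to storing the result of `is_gap_probe` alongside each probe and consulting it when interpreting downstream associations.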

Introduction

The Illumina 450k array has been widely used in epigenetic association studies. Current quality-control (QC) pipelines typically remove certain sets of probes, such as those containing a SNP or with multiple mapping locations. Probes are characterized by 3 distinct features: a CpG site of interest, a single base extension (SBE) that incorporates a fluorescently labeled nucleotide for detection, and an additional 48 or 49 base pairs. The Type I design uses two probes per interrogated CpG site, one for the methylated sequence and one for the unmethylated sequence, with measurement based on signal from a single color channel (red or green) determined by the nucleotide base incorporated via SBE. The Type II design uses a single probe, with measurement based on the ratio of red and green signal intensities (a two-color array rather than one-color) [11]. In this design, the C base of the CpG site overlaps with the SBE site.
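
Under either design, the two signals are usually summarized as a single methylation proportion, often called a beta value. The sketch below assumes the conventional form M / (M + U + offset) with an offset of 100. The source does not state this formula, and the function name and parameters are hypothetical; here M and U stand for the methylated and unmethylated intensities, whichever probes or color channels they come from.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Return the methylation proportion for one probe in one sample.

    The default offset of 100 is an assumed convention that keeps the
    ratio stable when both intensities are close to zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)


# Hypothetical intensities for a mostly methylated site.
print(beta_value(methylated=4500.0, unmethylated=400.0))
```

Because the result is bounded between 0 and 1, a SNP that shifts the signal for some genotypes would be expected to show up as distinct clusters on this scale.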
