A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform

Joanna Zhuang,Andrew E Teschendorff,Martin Widschwendter

doi:10.1186/1471-2105-13-59


Open Access

https://doi.org/10.1186/1471-2105-13-59


Abstract

Background: The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context.

Results: Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensional reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis.

Conclusions: Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.
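To make the two methylation measures and the variance filtering step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly used logit relationship M = log2(β / (1 − β)), which holds when array offset terms are ignored. The function names (`beta_to_m`, `m_to_beta`, `variance_filter`), the clipping constant `eps` and the toy CpG identifiers are all hypothetical.

```python
import math
from statistics import pvariance


def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value in [0, 1] to an M-value on the log2 scale.

    Assumes M = log2(beta / (1 - beta)); `eps` is an illustrative clipping
    constant that keeps the logarithm finite at beta = 0 or beta = 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse of beta_to_m (without clipping): beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def variance_filter(profiles, keep):
    """Return the ids of the `keep` most variable CpGs.

    `profiles` maps a CpG id to its values across samples. The filter is
    unsupervised: it never looks at the phenotype.
    """
    ranked = sorted(profiles, key=lambda cpg: pvariance(profiles[cpg]), reverse=True)
    return ranked[:keep]


# Toy beta-values for three hypothetical CpGs across four samples.
betas = {
    "cpg_a": [0.10, 0.12, 0.11, 0.09],
    "cpg_b": [0.20, 0.80, 0.25, 0.75],
    "cpg_c": [0.50, 0.55, 0.45, 0.50],
}
m_values = {cpg: [beta_to_m(b) for b in vals] for cpg, vals in betas.items()}

top_beta = variance_filter(betas, keep=1)
top_m = variance_filter(m_values, keep=1)
```

Because the logit transform stretches values near 0 and 1, the variance ranking of CpGs can differ between the β and M scales on real data, even though it agrees on this toy example.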

Highlights

The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed.
We focused on four powerful classification algorithms that have been popular in the gene expression field: (i) supervised principal components analysis (SPCA) [21], (ii) the LASSO algorithm [40], (iii) the Elastic Net (ELNET) [32] and (iv) Support Vector Machines (SVM) [33,41].
In this study we ask whether the effect size of CpGs associated with a phenotype of interest (the "signal-to-noise ratio", SNR) and their number (the "signal strength") affect the performance of the different feature selection methods, and whether this depends on the methylation measure used.
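As an illustration of feature selection using test statistics, the sketch below ranks CpGs by the absolute value of a two-sample Welch t-statistic. This is an assumption made for illustration: the page does not say which test statistic the study used, nor how it computes SNR. The function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t-statistic for two groups with possibly unequal variances.

    Each group needs at least two values, and at least one group must have
    non-zero variance, otherwise the standard error is zero.
    """
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def rank_cpgs(profiles, labels):
    """Rank CpG ids by |t| between samples labelled 1 and samples labelled 0.

    `profiles` maps a CpG id to its values across samples (beta- or M-values);
    `labels` gives the binary phenotype of each sample in the same order.
    """
    scores = {}
    for cpg, values in profiles.items():
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[cpg] = abs(welch_t(group1, group0))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: three controls (0) followed by three cases (1).
labels = [0, 0, 0, 1, 1, 1]
profiles = {
    "cpg_noise": [0.30, 0.50, 0.40, 0.45, 0.35, 0.40],
    "cpg_signal": [0.10, 0.12, 0.14, 0.60, 0.62, 0.64],
}
ranking = rank_cpgs(profiles, labels)
```

In this toy example the CpG with a large between-group difference relative to its within-group spread ranks first. The same function can be applied to β-values or M-values to compare the resulting rankings.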

Summary

Introduction

The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. Most statistical reports on Infinium 27k DNA methylation (DNAm) data have focused on unsupervised clustering and normalisation methods [16,17,18,19], but as yet no study has performed a comprehensive comparison of feature selection and classification methods in this type of data. This is surprising given that feature selection and classification methods have been extensively explored in the context of gene expression data. Given that the high density Illumina Infinium 450k methylation array is starting to be used [10,35] and that this array offers the coverage and scalability for epigenome wide association studies (EWAS) [36], it has become a critical and urgent question to determine how best to perform feature selection on these beadarrays.



Journal: BMC Bioinformatics	Publication Date: Apr 24, 2012
Citations: 162	License type: cc-by

