Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data

Kipoong Kim, Hokeun Sun

doi:10.1186/s12859-019-3040-x

Abstract

Background: In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels of genes linked in a genetic network tend to be highly correlated with each other.

Results: We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes in the analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project.

Conclusions: The proposed variable selection approach can utilize prior biological network information in the analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by existing methods.

Highlights

In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection.
Since each gene consists of 10 CpG sites, we considered four representative group-based tests: a two-sample t-test based on principal component analysis (PCA), the global test [24], SAM-GS [25], and Hotelling’s T² test [26].
The proposed approach first captures gene-level signals from multiple CpG sites using a dimension reduction technique such as normalized principal components, and then performs network-based regularization based on biological network graph information.
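
The first step described above, summarizing the CpG sites of each gene into a single gene-level signal, can be illustrated with a minimal sketch. It assumes that "normalized principal component" means the first principal component score rescaled to unit variance, which may differ from the authors' exact definition. The function name `gene_level_score`, the simulated data, and the dimensions (other than 10 CpG sites per gene) are hypothetical.

```python
import numpy as np

def gene_level_score(cpg_matrix):
    """Summarize one gene's CpG sites by its first principal component.

    cpg_matrix: array of shape (n_samples, n_cpg_sites).
    Returns a vector of length n_samples, scaled to unit sample variance
    (an assumed reading of "normalized principal component").
    """
    # Center each CpG site so the SVD yields principal components.
    centered = cpg_matrix - cpg_matrix.mean(axis=0)
    # Columns of u * s are the principal component scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    score = u[:, 0] * s[0]
    sd = score.std(ddof=1)
    return score / sd if sd > 0 else score

# Simulated methylation values: 40 samples, 5 genes, 10 CpG sites per gene.
rng = np.random.default_rng(0)
n_samples, n_genes, n_cpg = 40, 5, 10
methylation = rng.normal(size=(n_samples, n_genes, n_cpg))

# One column per gene; this matrix would feed the regularization step.
gene_scores = np.column_stack(
    [gene_level_score(methylation[:, g, :]) for g in range(n_genes)]
)
```

The resulting `gene_scores` matrix has one column per gene instead of one per CpG site, so a gene network can be applied directly to its columns.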

Summary

Introduction

In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. Network-based regularization, proposed by Li and Li [1, 6], has shown promising selection results in the analysis of high-dimensional gene expression data. Variable selection remains computationally fast even for high-dimensional genomic data because we adopt well-established algorithms such as cyclic coordinate descent and gradient descent [11,12,13,14].
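
To make the idea of network-based regularization concrete, the sketch below evaluates a penalty that combines an L1 term with a quadratic term built from the normalized graph Laplacian of the gene network. This is the commonly cited general form of the Li and Li penalty; the exact parameterization used in this paper may differ. The function names, the tuning parameters `lam1` and `lam2`, and the three-gene chain network are assumptions chosen for illustration.

```python
import numpy as np

def normalized_laplacian(adjacency):
    """Normalized Laplacian of an undirected graph with no self-loops."""
    degree = adjacency.sum(axis=1).astype(float)
    connected = degree > 0
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    # Off-diagonal entries: -a_uv / sqrt(d_u * d_v).
    lap = -adjacency * np.outer(inv_sqrt, inv_sqrt)
    # Diagonal entries: 1 for connected genes, 0 for isolated genes.
    lap[np.diag_indices_from(lap)] = connected.astype(float)
    return lap

def network_penalty(beta, laplacian, lam1, lam2):
    """L1 sparsity term plus Laplacian smoothness term (assumed form)."""
    sparsity = np.abs(beta).sum()
    smoothness = beta @ laplacian @ beta
    return lam1 * sparsity + lam2 * smoothness

# Hypothetical chain network: gene 0 - gene 1 - gene 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
laplacian = normalized_laplacian(adjacency)

# Coefficients proportional to sqrt(degree) are treated as smooth over the
# network, so the quadratic term vanishes and only the L1 term remains.
smooth_beta = np.array([1.0, np.sqrt(2.0), 1.0])
# A single isolated signal on gene 0 is penalized by the quadratic term.
rough_beta = np.array([1.0, 0.0, 0.0])

smooth_value = network_penalty(smooth_beta, laplacian, lam1=0.0, lam2=1.0)
rough_value = network_penalty(rough_beta, laplacian, lam1=0.0, lam2=1.0)
```

The quadratic term shrinks degree-scaled coefficients of linked genes toward each other, which is why linked genes tend to be selected together, while the L1 term keeps the overall solution sparse.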

Methods

Results

Conclusion