For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Read full abstract