Nonparametric Variable Selection Using Machine Learning Algorithms in High Dimensional (Large P, Small N) Biomedical Applications

Christina M.R.

doi:10.5772/13541

Abstract

Biomedical research faces an ever-increasing volume of data that resists classical methods. Classical methods cannot be applied to high-dimensional datasets in which the number of parameters greatly exceeds the number of observations, the so-called “large p, small n” problem. Machine learning techniques have had tremendous success with such data in a wide variety of disciplines. Often these machine learning tools are combined so that the analysis includes a variable selection step and a model building step. In some cases the goal of the analysis is exploratory, and the researcher is more interested in knowing which variables are strongly related to the outcome variable than in predictive accuracy. In those situations, the goal may be to rank the input variables by their relative importance in predicting the outcome. Other purposes of variable selection include eliminating redundant or irrelevant variables and improving the performance of the predictive algorithm. Even when prediction is the goal, several machine learning algorithms require some dimension reduction before model building, so variable selection is an important problem.

Let Y be the outcome of interest. Y can be continuous or categorical: when Y is continuous we call this a regression problem, and when Y is categorical we call this a classification problem. Let $X_1, \ldots, X_p$ be a set of potential predictors (also called inputs). Y and each $X_j$ are vectors of n observations. The goal of variable selection, broadly defined, is to find the set of X’s that are strongly related to the outcome Y. Even for moderate values of p, estimating all possible linear models ($2^p$ of them) is computationally expensive, and thus some dimension reduction is needed.
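To make the cost of exhaustive search concrete, the following minimal sketch counts and enumerates the candidate predictor subsets. With p predictors there are $2^p$ subsets (counting the empty, intercept-only model), so the number of linear models to fit doubles with every added variable. The function names `count_linear_models` and `enumerate_subsets` and the labels `X1`, `X2`, `X3` are illustrative assumptions, not part of the chapter.

```python
from itertools import combinations


def count_linear_models(p: int) -> int:
    """Number of predictor subsets for p candidate variables.

    Each variable is either in or out of the model, giving 2**p
    subsets, including the empty (intercept-only) model.
    """
    return 2 ** p


def enumerate_subsets(names):
    """Yield every subset of the predictor names, smallest subsets first."""
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            yield subset


# Hypothetical predictor labels, used only for illustration.
subsets = list(enumerate_subsets(["X1", "X2", "X3"]))

# How quickly the search space grows for moderate p.
model_counts = {p: count_linear_models(p) for p in (10, 20, 30, 40)}

for p, count in model_counts.items():
    print(f"p = {p:2d}: {count:,} candidate models")
```

Already at p = 30 there are more than a billion candidate models, which is why exhaustive enumeration is practical only for very small p and some form of dimension reduction or guided search is used instead.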
If p is large and the set of all X’s contains redundant, irrelevant, or highly correlated variables, as is the case in many biomedical applications including genome-wide association studies and microarray studies, then the problem can be difficult. Further complicating matters, real-world data can have X’s of mixed type, where predictors are measured on different scales (categorical versus continuous), and the relationship between the predictors and the outcome may be highly non-linear with high-order interactions. Generally, one can consider several machine learning methods for variable selection: one is a greedy search algorithm that examines the conditional probability distribution of Y, the
