Abstract

Biomedical data is facing an ever increasing amount of data that resist classical methods. Classical methods cannot be applied in the case of high dimensional datasets where the number of parameters greatly exceeds the number of observations, the so-called “large p small n” problem. Machine Learning techniques have had tremendous success in these realms in a wide-variety of disciplines. Often these machine learning tools are combined to include a variable selection step and model building step. In some cases the goal of the analysis may be exploratory in nature and the researcher is more interested in knowing which set of variables are strongly related to the output variable rather than predictive accuracy. For those situations, the goal of the analysis may be to provide a ranking of the input variables based on their relative importance in predicting the outcome. Other purposes for variable selection include elimination of redundant or irrelevant variables and to improve the performance of the predictive algorithm. Even if prediction is the goal of the analysis, several machine learning algorithms require that some dimension reduction is done prior to the model building, thus variable selection is an important problem. Let Y be the outcome of interest. Y can be continuous or categorical. When Y is continuous we call this a regression problem and when Y is categorical we call this a classification problem. Let 1 p X ,...,X be a set of potential predictors (also called inputs). X and Y are vectors of n observations. The goal of variable selection, broadly defined, is finding the set of X’s that are strongly related the outcome Y. Even for moderate values of p, estimating all possible linear models ( 2 ) is computationally expensive and thus there needs to be some dimension reduction. If p is large, and the set of all X’s contain redundant, irrelevant or highly correlated variables, such as the case in many biomedical applications including genome wide association studies and microarray studies, then the problem can be difficult. Further complicating matters, real-world data can have X’s that are of mixed type, where predictors are measured on different scales (categorical versus continuous) and the relationship between the outcome may be highly non-linear with high-order interactions. Generally, one can consider several machine learning methods for variable selection: one is a greedy search algorithm that examines the conditional probability distribution of Y, the

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.