Abstract

Background: The development of accurate classification models depends upon the methods used to identify the most relevant variables. The aim of this article is to evaluate variable selection methods to identify important variables in predicting a binary response using nonlinear statistical models. Our goals in model selection include producing stable models that do not overfit, are interpretable, generate accurate predictions and have minimum bias. This work was motivated by data on clinical and laboratory features of Helicobacter pylori infections obtained from 60 individuals enrolled in a prospective observational study.

Results: We carried out a comprehensive performance comparison of several nonlinear classification models over the H. pylori data set. We compared variable selection results by Multivariate Adaptive Regression Splines (MARS), Logistic Regression with regularization, Generalized Additive Models (GAMs) and Bayesian Variable Selection in GAMs. We found that the MARS model approach has the highest predictive power because the nonlinearity assumptions of candidate predictors are strongly satisfied, a finding demonstrated via deviance chi-square testing procedures in GAMs.

Conclusions: Our results suggest that the physiological free amino acids citrulline, histidine, lysine and arginine are the major features for predicting H. pylori peptic ulcer disease on the basis of amino acid profiling.

Highlights

  • The analysis of high-dimensional data, where the number of predictors exceeds the sample size, poses many challenges for statisticians and calls for new statistical methodologies to select relevant variables in multivariate data. Feature selection is used to overcome the curse of dimensionality by removing non-essential variables to achieve a model with good predictive accuracy

  • We carried out a comprehensive performance comparison of several nonlinear classification models over the H. pylori data set

  • We found that the Multivariate Adaptive Regression Splines (MARS) model approach has the highest predictive power because the nonlinearity assumptions of candidate predictors are strongly satisfied, a finding demonstrated via deviance chi-square testing procedures in Generalized Additive Models (GAMs)


Summary

Introduction

The analysis of high-dimensional data, where the number of predictors exceeds the sample size, poses many challenges for statisticians and calls for new statistical methodologies to select relevant variables in multivariate data. Feature selection is used to overcome the curse of dimensionality by removing non-essential variables to achieve a model with good predictive accuracy. Since highly complex models are penalized by increased total error, regularization helps reduce complexity in classification by minimizing over-fitting of the training data. We evaluated this by maximizing goodness-of-fit while simultaneously minimizing the number of variables selected. In this study, discriminative features associated with H. pylori peptic ulcer disease were identified. We found that various free amino acid measurements could be associated with disease outcome. Many of these variables are highly correlated, and it is unknown which of them will yield the most stable classifier. The aim of this article is to evaluate variable selection methods to identify important variables in predicting a binary response using nonlinear statistical models. This work was motivated by data on clinical and laboratory features of Helicobacter pylori infections obtained from 60 individuals enrolled in a prospective observational study.
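The trade-off described above, between goodness-of-fit and the number of selected variables, can be illustrated with an L1-penalized (lasso) logistic regression. The sketch below is not the authors' implementation and does not use the H. pylori data: it fits synthetic data with more predictors than observations by proximal gradient descent, using only the Python standard library. The sample size of 60 echoes the study, but the number of predictors, the four informative columns, the penalty fractions, the omission of an intercept and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient(X, y, beta):
    """Gradient of the mean negative log-likelihood (no intercept, for brevity)."""
    n, p = len(X), len(beta)
    resid = [sigmoid(sum(x * b for x, b in zip(row, beta))) - yi
             for row, yi in zip(X, y)]
    return [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]


def soft_threshold(v, t):
    """Shrink v toward zero by t; anything within [-t, t] becomes zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_l1_logistic(X, y, lam, n_iter=200):
    """Proximal-gradient (ISTA) fit of L1-penalized logistic regression."""
    n, p = len(X), len(X[0])
    # Conservative step size: the squared Frobenius norm of X bounds the
    # Lipschitz constant of the gradient from above.
    step = 4.0 * n / sum(x * x for row in X for x in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        g = gradient(X, y, beta)
        beta = [soft_threshold(b - step * gj, step * lam)
                for b, gj in zip(beta, g)]
    return beta


# Synthetic data (hypothetical): 60 observations, 80 predictors,
# only the first four predictors influence the binary response.
random.seed(0)
n, p = 60, 80
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.0 if random.random() < sigmoid(2.0 * sum(row[:4])) else 0.0 for row in X]

# Smallest penalty at which every coefficient is shrunk to zero.
lam_max = max(abs(g) for g in gradient(X, y, [0.0] * p))

# Number of selected (nonzero) variables at decreasing penalty strengths.
counts = {}
for frac in (1.0, 0.5, 0.1):
    beta = fit_l1_logistic(X, y, frac * lam_max)
    counts[frac] = sum(b != 0.0 for b in beta)
    print(f"lambda = {frac:.1f} * lam_max -> {counts[frac]} variables selected")
```

At the largest penalty no variable is selected, and relaxing the penalty admits more variables into the model. In practice the penalty would be chosen by cross-validation rather than fixed in advance, and an established package would normally be preferred over a hand-written solver.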

