Abstract

When building a model to predict a clinical outcome using machine learning techniques, model developers are often interested in ranking the features according to their predictive ability. A commonly used approach to obtain a robust variable ranking is to apply recursive feature elimination (RFE) on multiple resamplings of the training set and then to aggregate the ranking results using the Borda count method. However, the presence of highly correlated features in the training set can degrade the ranking performance. In this work, we propose a variant of the RFE-Borda count method that takes into account the correlation between variables during the ranking procedure, in order to improve the ranking performance in the presence of highly correlated features. The proposed algorithm is tested on simulated datasets in which the true variable importance is known and compared to the standard RFE-Borda count method. According to the root mean square error between the estimated rank and the true (i.e., simulated) feature importance, the proposed algorithm outperforms the standard RFE-Borda count method. Finally, the proposed algorithm is applied to a case study on the development of a predictive model of type 2 diabetes onset.
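
As a point of reference for the baseline compared against in this work, the sketch below illustrates the standard RFE-Borda count scheme (not the proposed correlation-aware variant): RFE is run on several bootstrap variants of the training set, and the per-resample rankings are aggregated via the Borda count. It is a minimal sketch assuming scikit-learn with a logistic regression base estimator; the function name and parameters are illustrative and not taken from the paper.

```python
# Minimal sketch of the standard RFE + Borda count ranking scheme (assumed
# setup: scikit-learn, logistic regression base estimator; names are illustrative).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample


def rfe_borda_ranking(X, y, n_resamples=100, random_state=0):
    """Rank features by aggregating RFE rankings over bootstrap resamples."""
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    borda_scores = np.zeros(n_features)

    for _ in range(n_resamples):
        # Draw a bootstrap variant of the training set.
        X_b, y_b = resample(X, y, random_state=rng)

        # RFE down to a single feature yields a full ordering:
        # ranking_[i] = 1 for the best feature, n_features for the worst.
        rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
        rfe.fit(X_b, y_b)

        # Borda count: a feature ranked r-th receives (n_features - r) points.
        borda_scores += n_features - rfe.ranking_

    # Higher Borda score means more important; convert to rank positions (1 = best).
    order = np.argsort(-borda_scores)
    ranks = np.empty(n_features, dtype=int)
    ranks[order] = np.arange(1, n_features + 1)
    return ranks, borda_scores
```

With B = 100 resamples, as in the experiments reported in the highlights, the returned rank positions correspond to the aggregated Borda ranking; the correlation-aware variant proposed in the paper modifies how rankings are produced and aggregated and is not reproduced here.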

Highlights

  • Machine learning (ML) techniques are increasingly being adopted in a variety of medical applications for the development of clinical predictive models, i.e., models for the prediction of outcomes of clinical interest, using a set of candidate variables or features

  • The results of the variable ranking obtained for the representative dataset of Section 2.2.2 are reported in Table 4 for both the standard recursive feature elimination (RFE)-Borda count method and the proposed algorithm, performed on B = 100 training set variants generated by bootstrap resampling

  • We can observe that the standard RFE-Borda count approach, which ignores variable correlation, makes some ranking mistakes: x2 is ranked below x3; x5 is ranked in the 6th position, below x6; x8 is ranked in the 9th position, after x15; x9 and x10 are ranked in the 14th and 18th positions, respectively, and are surpassed in the ranking even by noise variables, such as x18, x19, and x20


Introduction

Machine learning (ML) techniques are increasingly being adopted in a variety of medical applications for the development of clinical predictive models, i.e., models for the prediction of outcomes of clinical interest from a set of candidate variables or features. Variable ranking, i.e., the ordering of features based on their importance for outcome prediction [1], is useful both to provide an interpretation of the model, i.e., to compare the predictive ability of different variables, and to perform feature selection, or model reduction, i.e., to identify the most important features and remove the unnecessary variables from the model. Models with a large number of input variables can be more difficult to interpret: noisy features, which are not related to the outcome, can have small and implausible effects in the identified model [2]. Moreover, models with many input variables are not easy to implement in clinical practice, because some variables may be difficult to collect in different clinical contexts [3].

