Abstract
When developing a model to predict a clinical outcome with machine learning techniques, model developers are often interested in ranking the features according to their predictive ability. A commonly used approach for obtaining a robust variable ranking is to apply recursive feature elimination (RFE) on multiple resamplings of the training set and then to aggregate the ranking results with the Borda count method. However, the presence of highly correlated features in the training set can deteriorate the ranking performance. In this work, we propose a variant of the RFE-Borda count method that accounts for the correlation between variables during the ranking procedure, with the aim of improving the ranking performance in the presence of highly correlated features. The proposed algorithm is tested on simulated datasets in which the true variable importance is known and is compared to the standard RFE-Borda count method. In terms of the root mean square error between the estimated rank and the true (i.e., simulated) feature importance, the proposed algorithm outperforms the standard RFE-Borda count method. Finally, the proposed algorithm is applied to a case study on the development of a predictive model of type 2 diabetes onset.
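The standard RFE-Borda count baseline referenced above can be summarized in a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes scikit-learn's RFE with a logistic regression base estimator and a toy simulated dataset generated with make_classification, and it covers only the standard method, since the proposed correlation-aware variant is not detailed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample


def rfe_borda_ranking(X, y, n_resamples=100, random_state=0):
    """Standard RFE-Borda count ranking on bootstrap resamples of the training set."""
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    borda_scores = np.zeros(n_features)

    for _ in range(n_resamples):
        # Bootstrap resampling: one of the B training set variants.
        Xb, yb = resample(X, y, replace=True, random_state=rng)

        # RFE down to a single feature; ranking_ assigns 1 to the last feature kept.
        rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1, step=1)
        rfe.fit(Xb, yb)

        # Borda count: a feature ranked r-th among p features receives p - r points.
        borda_scores += n_features - rfe.ranking_

    # Higher Borda score means more important; convert to a final rank (1 = most important).
    final_rank = np.argsort(np.argsort(-borda_scores)) + 1
    return final_rank, borda_scores


if __name__ == "__main__":
    # Hypothetical simulated data: 20 candidate features, only a few informative.
    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               n_redundant=2, random_state=42)
    rank, scores = rfe_borda_ranking(X, y, n_resamples=25)
    print("Estimated rank per feature (1 = most important):", rank)
```

On simulated data of this kind, the evaluation criterion mentioned in the abstract corresponds to the root mean square error between the estimated final rank and the known (simulated) importance order.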
Highlights
Machine learning (ML) techniques are increasingly being adopted in a variety of medical applications for the development of clinical predictive models, i.e., models for the prediction of outcomes of clinical interest, using a set of candidate variables or features.
The results of the variable ranking obtained for the representative dataset of Section 2.2.2 are reported in Table 4 for both the standard recursive feature elimination (RFE)-Borda count method and the proposed algorithm, performed on B = 100 training set variants generated by bootstrap resampling.
We can observe that the standard RFE-Borda count approach, which ignores variable correlation, makes several ranking errors: x2 is ranked below x3; x5 is ranked in the 6th position, below x6; x8 is ranked in the 9th position, after x15; x9 and x10 are ranked in the 14th and 18th positions, respectively, and they are even surpassed in the ranking by noise variables such as x18, x19, and x20.
Summary
Machine learning (ML) techniques are increasingly being adopted in a variety of medical applications for the development of clinical predictive models, i.e., models for the prediction of outcomes of clinical interest, using a set of candidate variables or features. Variable ranking, i.e., the ordering of features based on their importance for outcome prediction [1], is useful both to provide an interpretation of the model, i.e., to compare the predictive ability of different variables, and to perform feature selection, or model reduction, i.e., to identify the most important features and remove unnecessary variables from the model. Models with a large number of input variables can be more difficult to interpret: noisy features, which are not related to the outcome, can have small and implausible effects in the identified model [2]. Moreover, models with many input variables are harder to implement in clinical practice, because some variables may be difficult to collect in different clinical contexts [3].