Abstract
Classifying subjects into risk categories is a common challenge in medical research. Machine Learning (ML) methods are widely used for risk prediction and classification. The primary objective of such algorithms is to use several features to predict dichotomous responses (e.g., healthy/at risk). Like statistical inference modelling, ML modelling is subject to the problem of class imbalance: models are biased towards the majority class, which increases the false-negative rate. In this study, we built and evaluated thirty-six ML models to classify approximately 4300 female and 4100 male participants from the UK Biobank into three risk categories based on discretised visceral adipose tissue (VAT) measurements from magnetic resonance imaging. We also examined the effect of sampling techniques on the models when dealing with class imbalance. The sampling techniques had a significant impact on classification and improved risk status prediction by increasing the information contained within each variable. Based on domain expert criteria, the three best classification models for visceral fat prediction were identified for the female and male cohorts. The area under the receiver operating characteristic curve of the models tested with external data ranged from 0.78 to 0.89 for females and from 0.75 to 0.86 for males. These encouraging results will guide further development of models that predict the VAT value itself. Such models will help identify individuals with excess VAT volume who are at risk of developing metabolic disease, so that relevant lifestyle interventions can be appropriately targeted.
Highlights
Real-world data are often imbalanced and lack uniform distribution across classes
Across all the methods computed, resampling improved the Correctly Classified Instances (CCI) ratio compared with the original Targeted dataset (TD)
When the Logistic Regression (LR), Artificial Neural Network (ANN), C4.5 and Random Forest (RF) models were evaluated for the female cohort, performance with the Random Under Sampling (RUS) dataset was poorer than with the TD dataset (Fig. 9)
Summary
Real-world data are often imbalanced and lack uniform distribution across classes. Classification of imbalanced datasets is a significant challenge across both industrial and research domains [1]. When resampling methods are applied, questions about their suitability are often raised [9]. For example: is the resampled dataset representative of the population with respect to the response variable? Is it acceptable to artificially generate synthetic subjects for a class when training Machine Learning (ML) classification models? It has been argued that sampling methods discard the original class ratio during training and that this affects the accuracy metrics [10]. Training ML models with synthetic data may also compromise accuracy measures by misleading the cross-validation sampling process [11].
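One common way to limit the cross-validation concern raised above is to resample only the training portion of each fold, so that every held-out fold keeps the original class ratio and contains no resampled subjects. The following is a minimal sketch of that idea using random undersampling and the Python standard library only. It is not the pipeline used in the study: the function names, the fold count and the toy labels are all hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def random_undersample(indices, labels, rng):
    """Randomly drop samples so every class present matches the smallest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_min))
    return sorted(kept)


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split the result into k folds of near-equal size."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def resampled_cv_splits(labels, k=5, seed=0):
    """Yield (train, test) index lists; only the training part is resampled."""
    rng = random.Random(seed)
    folds = kfold_indices(len(labels), k, rng)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # Resampling happens after the split, so the test fold is untouched.
        yield random_undersample(train, labels, rng), test


# Hypothetical three-class labels with an imbalanced ratio (illustration only).
labels = ["low"] * 60 + ["medium"] * 30 + ["high"] * 10

for train, test in resampled_cv_splits(labels, k=5, seed=0):
    train_counts = Counter(labels[i] for i in train)
    test_counts = Counter(labels[i] for i in test)
    print(dict(train_counts), dict(test_counts))
```

In each printed pair, the training counts are equal across classes while the test counts stay close to the original imbalanced ratio, so metrics computed on the held-out fold reflect the population rather than the resampled data. The folds here are not stratified; a stratified split would be a reasonable refinement when a class is very small.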