Improving random forest predictions in small datasets from two-phase sampling designs

Sunwoo Han, Youyi Fong, Brian D Williamson

doi:10.1186/s12911-021-01688-3


Open Access

https://doi.org/10.1186/s12911-021-01688-3


Abstract

Background: While random forests are one of the most successful machine learning methods, their performance must be optimized for datasets that result from a two-phase sampling design with a small number of cases. This situation is common in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.

Methods: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning.

Results: Our experiments show that class balancing helps improve random forest prediction performance when variable screening is not applied, but has a negative impact on performance in the presence of variable screening. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests underperform generalized linear models for some subsets of markers. Prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and the extent of improvement depends critically on the dissimilarities between candidate learner predictions.

Conclusion: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
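The stacking result above can be illustrated with a small sketch. This is not the authors' implementation: it assumes two made-up vectors of risk scores (standing in for a random forest and a generalized linear model), combines them with a fixed equal-weight average instead of a fitted meta-learner, and scores everything with an unweighted rank-based AUC. The names `auc`, `stack_average`, and `mean_abs_difference` are hypothetical helpers introduced only for this example.

```python
from itertools import product


def auc(scores, labels):
    """Rank-based AUC: chance that a case outscores a control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c, k in product(cases, controls)
    )
    return wins / (len(cases) * len(controls))


def stack_average(pred_a, pred_b, weight_a=0.5):
    """Convex combination of two learners' predicted risks (assumed fixed weight)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def mean_abs_difference(pred_a, pred_b):
    """Crude dissimilarity between two prediction vectors."""
    return sum(abs(a - b) for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Made-up outcomes (1 = case, 0 = control) and made-up risk scores.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
rf_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2, 0.1, 0.3]
glm_scores = [0.5, 0.8, 0.7, 0.3, 0.6, 0.2, 0.4, 0.1]
stacked = stack_average(rf_scores, glm_scores)

print("RF AUC:     ", auc(rf_scores, labels))
print("GLM AUC:    ", auc(glm_scores, labels))
print("Stacked AUC:", auc(stacked, labels))
print("Dissimilarity:", mean_abs_difference(rf_scores, glm_scores))
```

In this toy data each learner misranks a different case-control pair, so the averaged score ranks every case above every control. If the two score vectors were nearly identical, averaging would change little, which is the intuition behind the dependence on dissimilarity between candidate learners.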

Highlights

While random forests are one of the most successful machine learning methods, their performance must be optimized for datasets that result from a two-phase sampling design with a small number of cases. This situation is common in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
In this paper we studied the optimal use of random forests (RF) for classification on a dataset from a two-phase sampling design, a common situation in prevention studies of public health importance, which often have a small number of disease endpoints.
Inverse sampling probability weighting (IPW) led to poorer performance due to the class imbalance problem in the RF training step.
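The interaction between IPW and class imbalance can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not the study's design: the cohort counts are invented, sampling is assumed to be stratified by outcome only, and `ipw_weights` and `case_fraction` are hypothetical helper names. It shows that weighting each phase-two subject by the inverse of its sampling probability restores the cohort-level case fraction, which is far more imbalanced than the phase-two sample itself.

```python
def ipw_weights(labels, n_phase1, n_phase2):
    """Inverse sampling probability weight for each phase-two subject.

    Assumes the sampling probability depends only on the outcome class:
    prob[class] = (number measured in phase two) / (number in the full cohort).
    """
    prob = {cls: n_phase2[cls] / n_phase1[cls] for cls in n_phase1}
    return [1.0 / prob[y] for y in labels]


def case_fraction(labels, weights=None):
    """Weighted share of cases; unweighted when no weights are given."""
    if weights is None:
        weights = [1.0] * len(labels)
    case_weight = sum(w for y, w in zip(labels, weights) if y == 1)
    return case_weight / sum(weights)


# Invented counts: every case is measured, but only 1 in 10 controls.
n_phase1 = {1: 40, 0: 2000}   # full cohort (phase one)
n_phase2 = {1: 40, 0: 200}    # subjects with biomarker measurements (phase two)
labels = [1] * n_phase2[1] + [0] * n_phase2[0]

weights = ipw_weights(labels, n_phase1, n_phase2)

print("Unweighted case fraction:", round(case_fraction(labels), 4))
print("IPW case fraction:       ", round(case_fraction(labels, weights), 4))
```

With these numbers the phase-two sample has one case for every five controls, while the weighted data has one case for every fifty. A learner trained with the weights therefore faces the cohort's original imbalance.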

Summary

Introduction

While random forests are one of the most successful machine learning methods, their performance must be optimized for datasets that result from a two-phase sampling design with a small number of cases. This situation is common in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. Prediction of a binary disease outcome from a collection of clinical covariates and biomarker measurements is a common task in biomedical studies. [...] Studies using two-phase sampling designs often have a small number of disease endpoints and a high cost associated with measuring biomarkers, so only a small representative subset of controls has biomarker measurements. Most conventional machine learning methods tend to be unsuccessful in situations with small sample sizes because the methods require a substantial amount of training data.

Source: Han et al. BMC Medical Informatics and Decision Making (2021) 21:322.



Journal: BMC Medical Informatics and Decision Making
Publication Date: Nov 22, 2021
Citations: 41
License type: open-access

