On Fair Performance Comparison between Random Survival Forest and Cox Regression: An Example of Colorectal Cancer Study

Sirin Cetin, Isa Dede, Wentian Li, Ayse Ulgen

doi:10.28991/scimedj-2021-0301-9


Open Access


Abstract

Random Forest (RF), a mostly model-free and robust machine learning method, has been successfully applied to right-censored survival data under the name Random Survival Forest (RSF). However, RF/RSF has its own distinct strategies in classification and prediction. First, it is an ensemble classifier, and its performance is an average over multiple rounds of data fitting. Second, the training set is generated by bootstrap (sampling with replacement), which repeatedly uses roughly 2/3 of all samples, and the testing set consists of the samples not used (out-of-bag samples). Neither feature is intrinsic to Cox regression or other single classifiers. Not considering these two features could lead to a biased comparison between the performance of the two methods. Using a colorectal survival dataset, we illustrate the problems that arise in Cox regression, when comparing with RSF, from using k-fold cross-validation, from using only one resampling without an ensemble average, and from using the whole dataset for both fitting and testing. We provide more accessible R code for simple calculation of the discordance index (D-index) and the unweighted integrated Brier score (IBS) for Cox regression, and the unweighted IBS for RSF.
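
The bootstrap scheme described above can be illustrated with a small simulation. The following is a minimal Python sketch, not the authors' R code: the function names, the sample size of 500, and the random seed are illustrative assumptions. It draws bootstrap samples, splits the indices into in-bag and out-of-bag sets, and averages the fraction of distinct in-bag samples over repeated rounds. For large samples this fraction approaches 1 - 1/e, which is where the "roughly 2/3" figure comes from.

```python
import math
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample of size n_samples (with replacement).

    Returns (in_bag, oob): in_bag may contain repeated indices,
    oob lists the indices that were never drawn.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


def mean_distinct_in_bag_fraction(n_samples, n_rounds, seed=0):
    """Average, over n_rounds bootstrap draws, the share of distinct samples in-bag."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        in_bag, _ = bootstrap_split(n_samples, rng)
        total += len(set(in_bag)) / n_samples
    return total / n_rounds


if __name__ == "__main__":
    # n_samples=500 is an arbitrary illustrative size, not the study's sample size.
    frac = mean_distinct_in_bag_fraction(n_samples=500, n_rounds=100)
    print(f"simulated distinct in-bag fraction: {frac:.3f}")
    print(f"large-sample limit 1 - 1/e:         {1 - math.exp(-1):.3f}")
```

A comparison that fits Cox regression on the same kind of bootstrap sample and tests it on the corresponding out-of-bag indices, averaged over many rounds, would mirror the resampling that RSF applies internally.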

Highlights

In cancer epidemiology studies, one of the most commonly used analyses is Cox regression, which regresses the right-censored time-to-death data on risk factors
We are interested in the Random Forest (RF) [2] (or Random Survival Forest (RSF) when it is applied to survival analysis), because RF/RSF is easy to explain, easy to code, and easy to apply to data
We run RSF 100 times on the colorectal dataset with 7 independent variables and calculate the integrated Brier score (IBS) for both out-of-bag (OOB) samples and in-bag (IB) samples, both directly provided by the rfsrc function

Summary

Introduction

One of the most commonly used analyses is Cox regression, which regresses the right-censored time-to-death data on risk factors. An all-embracing name, “machine learning”, covers a whole spectrum of these new techniques [1], most of them “model-free”, making fewer assumptions about the data. We are interested in the Random Forest (RF) [2] (or Random Survival Forest (RSF) when it is applied to survival analysis), because RF/RSF is easy to explain, easy to code, and easy to apply to data. Many articles have already been published about the application of RSF to survival data [3, 4] and its comparison to the standard method in survival analysis, i.e., Cox regression (also called proportional hazards regression). Even though public software is available, making the application of RSF easy, being unfamiliar with the new

Methods

Results

Discussion

Conclusion