Abstract
This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and the gap between training and test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastoma (n = 109) vs. brain metastasis (n = 58), and (2) a "difficult" task, low-grade (n = 163) vs. high-grade (n = 95) meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% of each of these datasets. We repeated random training-test set splitting for each dataset to create 1,000 different training-test set pairs. For each pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated with various validation methods within the training set, and then tested on the test set, using the area under the curve (AUC) as the evaluation metric. The AUCs in training and testing varied across training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing (the generalization gap) was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
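As a minimal, hypothetical sketch of the repeated-splitting experiment described above (not the authors' code), the following Python snippet uses synthetic data in place of the radiomics features and an L1-penalized logistic regression as a LASSO-type classifier; for each of many random training-test splits it records a cross-validated training AUC, the test AUC, and their difference.

# Minimal sketch, assuming scikit-learn; synthetic data stands in for a
# small radiomics dataset, and the split ratio, penalty strength, and
# cross-validation setup are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small radiomics dataset (~167 cases, 100 features)
X, y = make_classification(n_samples=167, n_features=100, n_informative=10,
                           random_state=0)

n_splits = 200   # number of random training-test set pairs (the study used 1,000)
gaps = []

for seed in range(n_splits):
    # One random training-test set pair
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # L1-penalized logistic regression as a LASSO-type classifier
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0))

    # "Training" AUC estimated by cross-validation within the training set
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc").mean()

    # Test AUC on the held-out split
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    gaps.append(cv_auc - test_auc)

gaps = np.array(gaps)
print(f"mean gap {gaps.mean():.3f} +/- {gaps.std():.3f}, "
      f"range [{gaps.min():.3f}, {gaps.max():.3f}]")

The spread of the recorded differences across splits illustrates how strongly a single random split can over- or underestimate test performance, mirroring the variability reported above.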
Highlights
Since the advent of precision and personalized medicine, machine learning (ML) has received great interest as a promising tool for identifying the best diagnosis and treatment for an individual patient.
Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
When the sample size is not sufficient for a radiomics ML task, the model’s performance estimated in training and that obtained in testing may vary widely between different training-test set pairs.
Summary
Since the advent of precision and personalized medicine, machine learning (ML) has received great interest as a promising tool for identifying the best diagnosis and treatment for an individual patient. Many ML studies with medical image data, including radiomics-based ML, are conducted in small groups of patients, especially when rare diseases are involved, yet still report promising predictive accuracies [4]. For their potential clinical usefulness to be ascertained, models must be rigorously validated in independent external datasets. Most published prediction models have not been validated externally [5], and the field of radiomics ML is no exception; there, the problem is magnified by the intrinsic difficulty of acquiring large datasets. Recent reports showed that external validation was missing in 81–96% of published radiomics-based studies [4, 6, 7].