Validation of classification models in cancer studies using simulated spectral data – A “sandbox” concept

Ekaterina Boichenko, Andrey Panchenko, Margarita Tyndyk, Mikhail Maydin, Stepan Kruglov, Viacheslav Artyushenko, Dmitry Kirsanov

doi:10.1016/j.chemolab.2022.104564

Abstract

Spectroscopy has become a popular method in research devoted to cancer diagnostics, therapy, and surgery, that is, wherever tumor cells must be detected among surrounding non-cancerous ones. Chemometric methods are usually applied to classify cancerous and non-cancerous sites, so proper validation of the classification models is required to ensure that the results are reliable. In this study, we suggest using real data to simulate spectral sets with varying characteristics (size, class distribution) and to validate the models under different conditions, by analogy with the “sandbox” used in software development. Near-infrared spectra (939–1796 nm) measured from breast tumors and healthy tissues of laboratory mice (152 spectra) were used to simulate spectral data sets of different sizes (50, 100, and 150 spectra). We propose a simple simulation method based on singular value decomposition of the real spectral data set and rearrangement of the calculated residuals. Several algorithms for selecting training and test sets (Kennard-Stone, DUPLEX, random selection, Monte Carlo cross-validation) were applied to the simulated data, and the corresponding support vector machine (SVM) classification models were trained, optimized, and validated using a series of test sets with varying “healthy:tumor” class distribution (1:1, 3:1, 1:3) and size (10%, 30%, and 50% of the training data set). The performance of the classification models, expressed as accuracy, sensitivity, and selectivity, is compared, and a validation strategy is proposed.
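The simulation idea summarized in the abstract (singular value decomposition of the real spectra followed by rearrangement of the residuals) can be illustrated with a minimal sketch. The abstract does not give the details of the procedure, so everything below is an assumption made for illustration: the function name `simulate_spectra`, the number of retained components, mean-centering before the decomposition, and resampling rows with replacement are all hypothetical choices, and the input is random placeholder data, not the measured mouse spectra.

```python
import numpy as np


def simulate_spectra(X, n_components, n_new, rng):
    """Sketch: simulate n_new spectra from real spectra X (samples x wavelengths).

    Assumed procedure (not confirmed by the source): split X into a low-rank
    SVD part and residuals, then pair low-rank rows with residual rows taken
    from other, randomly chosen samples.
    """
    mean = X.mean(axis=0)
    Xc = X - mean  # mean-centering is an assumption of this sketch

    # Thin SVD of the centered data: Xc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Structured (low-rank) part built from the first n_components
    structured = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]

    # Residuals: what the low-rank model does not explain
    residuals = Xc - structured

    # Draw rows for the structured part and, independently, rows of residuals
    # ("rearrangement"); sampling with replacement allows n_new > n_samples
    idx_struct = rng.integers(0, X.shape[0], size=n_new)
    idx_resid = rng.integers(0, X.shape[0], size=n_new)

    simulated = mean + structured[idx_struct] + residuals[idx_resid]
    return simulated, idx_struct


# Placeholder input: random numbers standing in for one class of spectra
# (hypothetical 60 spectra x 64 wavelength channels)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 64))
X_sim, source_rows = simulate_spectra(X_demo, n_components=3, n_new=50, rng=rng)
```

In this sketch the simulation would be run separately for each class (healthy, tumor) so that the class label of every simulated spectrum is known, which makes it possible to assemble test sets with a chosen class ratio such as 1:1, 3:1, or 1:3.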

Full Text