Analysis of longitudinal data using constrained repeated random sampling-cross validation (CORRS-CV) and partial least squares

Isabel Ten-Doménech, David Pérez-Guaita, Guillermo Quintás, Julia Kuligowski

doi:10.1016/j.chemolab.2023.104776

Abstract

Longitudinal data are an important source of information for studying systems or individuals over time. Such data are often analyzed using multivariate models to assess the association between high-dimensional data and an independent variable, e.g., for identifying biomarkers of disease, building accurate classifiers, or identifying descriptors of a time-dependent process. Cross validation (CV) is frequently used for model selection and development, but current CV strategies such as random k-fold CV or individual-based CV can provide overly optimistic estimates of model performance because independence between the samples included in the train and test subsets is often not ensured. To overcome this potential pitfall, we show with the help of simulated data how constrained repeated random subsampling – cross validation (CORRS-CV) improves the independence between train and test subsets during CV, thus providing an accurate estimate of model performance and facilitating the identification of informative variables.
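
The independence problem described above can be made concrete with a small check. The following Python sketch is an illustration only and is not the CORRS-CV procedure itself: it assumes a hypothetical design of 10 individuals sampled at 5 time points each, splits the samples with a plain random k-fold, and counts how many individuals contribute samples to both the train and test subsets of each fold. The helper names `random_kfold` and `shared_individuals` are invented for this example.

```python
import random


def random_kfold(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def shared_individuals(individual_of, test_idx):
    """Return the individuals with samples in both the test fold and the training set."""
    test_set = set(test_idx)
    test_ids = {individual_of[i] for i in test_idx}
    train_ids = {ind for i, ind in enumerate(individual_of) if i not in test_set}
    return test_ids & train_ids


# Hypothetical longitudinal design (an assumption, not taken from the paper):
# 10 individuals, each sampled at 5 time points, stored individual by individual.
n_individuals, n_timepoints = 10, 5
individual_of = [ind for ind in range(n_individuals) for _ in range(n_timepoints)]

folds = random_kfold(len(individual_of), k=5)

# Number of individuals that appear on both sides of each train/test split.
overlap_per_fold = [len(shared_individuals(individual_of, fold)) for fold in folds]
print(overlap_per_fold)
```

Any non-zero count means that samples from the same individual are used both to fit and to evaluate the model in that fold, which is one way the optimistic bias mentioned above can arise. The same check could be run on another grouping factor, such as the sampling time point, by passing a different label list in place of `individual_of`.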

Full Text