Abstract

Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so‐called multiblock data (e.g., multiomics studies). A popular analysis framework is sparse principal component analysis (PCA) of the concatenated data, where the sparseness of the component weights is usually induced by penalties. A crucial factor in the use of such penalized methods is proper tuning of the regularization parameters that determine how much weight the penalties receive. In this paper, we examine several model selection procedures to tune these regularization parameters for sparse PCA: cross‐validation, the Bayesian information criterion (BIC), the index of sparseness, and the convex hull procedure. Furthermore, to account for the multiblock structure, we present a sparse PCA algorithm with a group least absolute shrinkage and selection operator (LASSO) penalty that selects or cancels out blocks of data in an automated way; the tuning of the group LASSO parameter is studied for the same model selection procedures. We conclude that when the component weights are to be interpreted, cross‐validation with the one standard error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, BIC, and index of sparseness are all suitable.

Highlights

  • Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data.[1]

  • We conclude that when the component weights are to be interpreted, cross-validation with the one standard error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, Bayesian information criterion (BIC), and index of sparseness are all suitable.

  • In the conditions where the sparsity is 30%, only 10-fold CV and 10-fold CV with the one standard error rule attain Tucker congruence values above 0.85. This means that the BIC, index of sparseness (IS), and convex hull (CHull) with variance accounted for (VAF) procedures result in models whose estimated component weights are too dissimilar from the true component weights (Tucker congruence is sketched below).
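
Tucker's congruence coefficient quantifies the similarity between an estimated and a true weight vector as the cosine of the angle between them, with values above 0.85 conventionally read as fair similarity. A minimal numpy sketch (the function name and the matched-column assumption are ours, not from the paper):

```python
import numpy as np

def tucker_congruence(w_true, w_est):
    """Column-wise Tucker congruence between two J x Q weight matrices.

    Assumes the columns (components) of the two matrices are already
    matched; absolute values are taken because the sign of a component
    is arbitrary.
    """
    num = np.sum(w_true * w_est, axis=0)
    den = np.sqrt(np.sum(w_true ** 2, axis=0) * np.sum(w_est ** 2, axis=0))
    return np.abs(num / den)
```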


Summary

INTRODUCTION

Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data.[1] One way to perform selection at the level of the blocks would be to evaluate every possible subset of blocks; however, if the number of blocks and components is large, this is not feasible and can be expected to yield highly variable results (as is the case with the best subset selection method for variable selection). Another option to perform selection at the level of the blocks is to add a group LASSO penalty to the PCA objective; see Jenatton et al.,[26] Van Deun et al.,[14] and Erichson et al.[27] for similar proposals.

In the context of PCA, CV can be applied in several ways; a discussion and comparison with respect to selecting the number of components for the $X = TP^T$ model can be found in Bro et al.[15] In that comparison, the best performing method was CV with the eigenvector method; de Schipper and Van Deun[11] discussed this method in the context of sparse SCA to determine the values of the LASSO and ridge tuning parameters. Given the CV errors over a grid of tuning values, the one standard error rule can then be used to pick the final model (a sketch is given below).

One of the measures used to evaluate the estimated weights is the percentage of correctly identified nonzero weights, calculated as the percentage of nonzero weights in the true matrix that are recovered as a nonzero weight in the estimated matrix.
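
A minimal sketch of this recovery measure, assuming numpy arrays of matched shape (the names are illustrative):

```python
import numpy as np

def pct_nonzero_recovered(w_true, w_est, tol=1e-8):
    """Percentage of truly nonzero weights that are estimated as nonzero."""
    true_nonzero = np.abs(w_true) > tol
    est_nonzero = np.abs(w_est) > tol
    return 100.0 * np.sum(true_nonzero & est_nonzero) / np.sum(true_nonzero)
```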

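To illustrate the one standard error rule referred to above and in the conclusion: among a grid of tuning values, one selects the most penalized (sparsest) model whose mean CV error lies within one standard error of the minimum. A minimal sketch under the assumption that per-fold CV errors have already been computed (all names are hypothetical):

```python
import numpy as np

def one_se_rule(lambdas, cv_errors):
    """Pick the sparsest lambda whose CV error is within one SE of the minimum.

    lambdas   : 1-D array of tuning values, sorted from least to most penalized
    cv_errors : array of shape (len(lambdas), n_folds) with per-fold CV errors
    """
    mean_err = cv_errors.mean(axis=1)
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = np.argmin(mean_err)
    threshold = mean_err[best] + se_err[best]
    eligible = np.flatnonzero(mean_err <= threshold)
    return lambdas[eligible.max()]  # largest index = most penalized eligible model
```
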
RESULTS

CONCLUSION
For block $k$ with variables $j = 1, \dots, J_k$, the group LASSO penalty involves the per-block term $\sqrt{\sum_{j=1}^{J_k} (w_{jq_k})^2}$, whereas the LASSO penalty involves the term $\sum_{j=1}^{J_k} |w_{jq_k}|$.
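
As a concrete reading of these terms, the following sketch evaluates both penalties for one component's weight vector, with the blocks concatenated (block sizes and names are illustrative, not the paper's code):

```python
import numpy as np

def block_penalties(w, block_sizes):
    """Group LASSO and LASSO penalty terms per block for one component.

    w           : weight vector of one component, with the blocks concatenated
    block_sizes : list with the number of variables J_k in each block
    """
    terms = []
    start = 0
    for J_k in block_sizes:
        w_k = w[start:start + J_k]
        group_term = np.sqrt(np.sum(w_k ** 2))  # Euclidean norm of block k
        lasso_term = np.sum(np.abs(w_k))        # L1 norm of block k
        terms.append((group_term, lasso_term))
        start += J_k
    return terms
```

Summing these per-block terms over blocks and components, each weighted by its tuning parameter, yields the total penalty added to the PCA objective.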