Abstract

Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because of their usefulness for interpretation, in particular in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. In particular, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components and, simultaneously, identifies clusters of objects. Based on simulated and real gene expression data sets in which the number of variables is higher than the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of differing variances among the original variables, all data sets were standardized. Although both procedures perform well, the numerical tests highlight two main features that distinguish the two-step-SDP algorithm: it provides results faster than ALS and, since it employs a clustering procedure (k-means) on the variables, it outperforms the ALS algorithm in recovering the true variable partitioning underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms ALS for data sets with smaller sample size and more complex structure (i.e., higher error level in the CDPCA model). The proportion of variance explained by the components estimated by the two algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and takes similar values for both algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance. Moreover, the experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm yields better-quality object clusterings, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
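
As a point of reference for the comparison above, the CDPCA model is usually written (following Vichi and Saporta 2009) as X ≈ U Ybar A', where X is the standardized I x J data matrix, U is a binary I x P matrix assigning each object to one of P clusters, A is a J x Q loadings matrix whose disjointness constraint lets each variable load on exactly one component, and Ybar holds the P cluster centroids in the Q-dimensional component space. The Python/NumPy sketch below only illustrates this structure on toy data with hand-picked partitions; it implements neither the ALS nor the two-step-SDP algorithm, and the matrix sizes and partitions (var_part, obj_part) are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    I, J, P, Q = 10, 6, 2, 2            # objects, variables, object clusters, components
    X = rng.standard_normal((I, J))     # toy standardized data matrix

    # Disjoint loadings: each variable (row of A) loads on exactly one component.
    var_part = np.array([0, 0, 0, 1, 1, 1])               # illustrative variable partition
    A = np.zeros((J, Q))
    A[np.arange(J), var_part] = 1.0
    A /= np.linalg.norm(A, axis=0)                        # unit-norm columns

    # Object clusters: each object (row of U) belongs to exactly one cluster.
    obj_part = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])   # illustrative object partition
    U = np.zeros((I, P))
    U[np.arange(I), obj_part] = 1.0

    Y = X @ A                           # component scores of the objects
    Ybar = np.linalg.pinv(U) @ Y        # cluster centroids in the component space
    X_hat = U @ Ybar @ A.T              # CDPCA-type reconstruction of X
    print("proportion of total variance captured:", (X_hat**2).sum() / (X**2).sum())

Both algorithms discussed in the paper search over the partitions encoded in U and A (together with the corresponding centroids and loadings); the ratio printed at the end is the kind of explained-variance measure referred to above.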

Highlights

  • Ever-increasing problem size demands the development of novel techniques to perform statistical analysis

  • We focus our attention on Clustering and disjoint principal component analysis (CDPCA) applied to high-dimensional data, namely when the number of variables is much greater than the number of objects

  • We briefly review two recently proposed two-step iterative heuristic procedures for performing CDPCA on two-way data (Macedo 2015; Macedo and Freitas 2015), both following the idea of the four-step alternating least-squares (ALS) algorithm proposed by Vichi and Saporta (2009)


Introduction

Ever-increasing problem size demands the development of novel techniques to perform statistical analysis. Dimensionality reduction based on principal component analysis (PCA) aims to represent high-dimensional data in a lower-dimensional space while retaining as much of the variability of the original attributes as possible. This projection onto a low-dimensional space is provided by a new set of attributes called principal components (PCs), which are uncorrelated and defined as linear combinations of the original attributes (Jolliffe 2002). In an attempt to construct more interpretable PCs, PCA-based methodologies providing components with zero loadings have been proposed in the literature. Regardless of the type of sparseness constraint, whether it enforces sparseness in the loadings of each component or, less restrictively, in the loadings matrix as a whole (Adachi and Trendafilov 2016), these methodologies involve high computational complexity. Modified PCs with sparse loadings have been constructed using, for instance, the LASSO (elastic net) regression method (Zou et al 2006), convex semidefinite programming (SDP) relaxations (d’Aspremont et al 2007), a variable projection solver (Erichson et al 2018), and an iterative thresholding approach (Ma 2013).
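
To make the contrast with ordinary PCA concrete, the sketch below compares the dense loadings of standard PCs with the sparse loadings produced by a lasso-penalized sparse PCA, using scikit-learn purely for convenience. This is only one of the sparse-PCA approaches cited above (in the spirit of Zou et al 2006), not the CDPCA method studied in this paper, and the data, the number of components and the penalty value are arbitrary choices for illustration.

    import numpy as np
    from sklearn.decomposition import PCA, SparsePCA

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 20))    # toy data: 50 objects, 20 standardized variables

    # Ordinary PCA: every variable contributes to every component.
    pca = PCA(n_components=3).fit(X)
    print("nonzero loadings per ordinary PC:",
          np.count_nonzero(pca.components_, axis=1))

    # Lasso-penalized sparse PCA: part of the loadings is driven exactly to zero.
    spca = SparsePCA(n_components=3, alpha=2.0, random_state=1).fit(X)
    print("nonzero loadings per sparse PC:",
          np.count_nonzero(spca.components_, axis=1))

On such unstructured toy data the sparse components carry no particular meaning; the point is simply that the penalty zeroes out a share of each component's loadings, which is what makes sparse components easier to interpret.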
