Adaptive estimation in structured factor models with applications to overlapping clustering

Xin Bing, Florentina Bunea, Marten Wegkamp, Yang Ning

doi:10.1214/19-aos1877

Abstract

This work introduces a novel estimation method, called LOVE, for the entries and structure of a loading matrix $A$ in a latent factor model $X=AZ+E$, for an observable random vector $X\in \mathbb{R}^{p}$, with correlated unobservable factors $Z\in \mathbb{R}^{K}$, with $K$ unknown, and uncorrelated noise $E$. Each row of $A$ is scaled, and allowed to be sparse. In order to identify the loading matrix $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one latent factor. Although the number of factors $K$, the number of pure variables, and their locations are all unknown, we require only a mild condition on the covariance matrix of $Z$, and a minimum of two pure variables per latent factor, to show that $A$ is uniquely defined, up to signed permutations. Our proofs of model identifiability are constructive, and lead to our novel estimation method for the number of factors and the set of pure variables, from a sample of $n$ observations on $X$. This is the first step of our LOVE algorithm, which is optimization-free, and has low computational complexity of order $p^{2}$. The second step of LOVE is an easily implementable linear program that estimates $A$. We prove that the resulting estimator is near minimax-rate optimal for $A$, with respect to the $\|\cdot\|_{\infty,q}$ loss, for $q\geq 1$, up to logarithmic factors in $p$, and that it can be minimax-rate optimal in many cases of interest. The model structure is motivated by the problem of overlapping variable clustering, ubiquitous in data science. We define the population-level clusters as groups of those components of $X$ that are associated, via the matrix $A$, with the same unobservable latent factor, and multifactor association is allowed. Clusters are respectively anchored by the pure variables, and form overlapping subgroups of the $p$-dimensional random vector $X$.
The Latent model approach to OVErlapping clustering is reflected in the name of our algorithm, LOVE. The third step of LOVE estimates the clusters from the support of the columns of the estimated $A$. We guarantee cluster recovery with zero false positive proportion, and with false negative proportion control. The practical relevance of LOVE is illustrated through the analysis of an RNA-seq data set, devoted to determining the functional annotation of genes with unknown function.
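
To make the model structure concrete, the following minimal Python sketch simulates data from $X=AZ+E$ and reads overlapping clusters off the support of the columns of a known loading matrix. It is an illustration only, not the LOVE estimator: the matrix `A`, the factor covariance `C`, the noise level, and the helper names `clusters_from_loadings` and `pure_variables` are all hypothetical choices made for this example. It also assumes that "scaled" means the absolute values in each row of $A$ sum to at most one, so that a pure variable has a single entry equal to $\pm 1$.

```python
import numpy as np


def clusters_from_loadings(A, tol=1e-8):
    """One cluster per column of A: indices of the rows with a nonzero loading."""
    A = np.asarray(A, dtype=float)
    return [np.flatnonzero(np.abs(A[:, k]) > tol).tolist()
            for k in range(A.shape[1])]


def pure_variables(A, tol=1e-8):
    """Indices of rows of A with exactly one nonzero entry."""
    support = np.abs(np.asarray(A, dtype=float)) > tol
    return np.flatnonzero(support.sum(axis=1) == 1).tolist()


# Hypothetical loading matrix with p = 6 variables and K = 2 factors.
# Assumption: rows are scaled so that their absolute values sum to at most 1.
A = np.array([
    [1.0, 0.0],    # pure variable for factor 0
    [-1.0, 0.0],   # pure variable for factor 0 (negative sign)
    [0.0, 1.0],    # pure variable for factor 1
    [0.0, 1.0],    # pure variable for factor 1
    [0.5, 0.5],    # loads on both factors, so it sits in both clusters
    [0.0, 0.0],    # loads on no factor, so it sits in no cluster
])

# Simulate n observations with correlated factors and uncorrelated noise.
rng = np.random.default_rng(0)
n = 200
C = np.array([[1.0, 0.3], [0.3, 1.0]])                # assumed Cov(Z)
Z = rng.multivariate_normal(np.zeros(2), C, size=n)   # shape (n, K)
E = 0.1 * rng.standard_normal((n, A.shape[0]))        # shape (n, p)
X = Z @ A.T + E                                       # shape (n, p)

clusters = clusters_from_loadings(A)
pure = pure_variables(A)
```

Here `clusters` is `[[0, 1, 4], [2, 3, 4]]`: variable 4 belongs to both groups, which is the overlap the model allows, and each cluster is anchored by two pure variables (`pure` is `[0, 1, 2, 3]`). In practice $A$ is unknown and would have to be estimated from `X`; this sketch skips that step entirely.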
