Abstract

Distributed Principal Component Analysis (PCA) has been studied to deal with the case when data are stored across multiple machines and communication cost or privacy concerns prohibit the computation of PCA in a central location. However, the sub-Gaussian assumption in the related literature is restrictive in real applications, since outliers and heavy-tailed data are common in areas such as finance and macroeconomics. In this article, we propose a distributed algorithm for estimating the principal eigenspaces without any moment constraints on the underlying distribution. We study the problem under the elliptical family framework and adopt the sample multivariate Kendall’s tau matrix to extract eigenspace estimators from all submachines, which can be viewed as points in the Grassmann manifold. We then find the “center” of these points as the final distributed estimator of the principal eigenspace. We investigate the bias and variance of the distributed estimator and derive its convergence rate, which depends on the effective rank and eigengap of the scatter matrix and on the number of submachines. We show that the distributed estimator performs as if we had full access to the whole data. Simulation studies show that the distributed algorithm performs comparably with the existing one for light-tailed data, while showing great advantages for heavy-tailed data. We also extend the distributed algorithm to cases with limited communication constraints and with elliptical factor structure. Thorough simulation studies and a real application to a macroeconomic dataset verify the advantages of the proposed distributed algorithms. Supplementary materials for this article are available online.
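
The two-stage procedure described above can be illustrated with a minimal Python sketch. It is not the article's implementation. It assumes that the sample multivariate Kendall's tau matrix is the average of the outer products of normalized pairwise differences, and that the "center" of the local eigenspaces is taken as the leading eigenspace of the averaged projection matrices; the abstract does not specify either detail. The function names `kendall_tau_matrix`, `top_eigvecs`, and `distributed_eigenspace` are hypothetical.

```python
import numpy as np


def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau matrix of an (n, p) data block.

    Assumed form: the average over pairs i < j of u u^T, where
    u = (X_i - X_j) / ||X_i - X_j||. No moments of X are required,
    because every pairwise difference is scaled to unit length.
    """
    n, _ = X.shape
    i, j = np.triu_indices(n, k=1)          # all pairs with i < j
    D = X[i] - X[j]                         # pairwise differences
    norms = np.linalg.norm(D, axis=1)
    keep = norms > 0                        # skip exact ties
    U = D[keep] / norms[keep, None]
    return U.T @ U / U.shape[0]


def top_eigvecs(S, k):
    """Return the k leading eigenvectors of a symmetric matrix S."""
    _, V = np.linalg.eigh(S)                # eigenvalues in ascending order
    return V[:, ::-1][:, :k]


def distributed_eigenspace(blocks, k):
    """Aggregate local eigenspace estimates from several data blocks.

    Each block stands in for the data held by one submachine. Only the
    (p, k) local eigenvector matrix would need to be communicated.
    Assumption: the "center" is the top-k eigenspace of the average of
    the local projection matrices V V^T.
    """
    p = blocks[0].shape[1]
    avg_proj = np.zeros((p, p))
    for X in blocks:
        V = top_eigvecs(kendall_tau_matrix(X), k)   # local step
        avg_proj += V @ V.T
    avg_proj /= len(blocks)
    return top_eigvecs(avg_proj, k)                 # central step
```

Calling `distributed_eigenspace(blocks, k)` with a list of `(n, p)` arrays returns a `(p, k)` matrix with orthonormal columns. Averaging projection matrices rather than eigenvectors avoids the sign and rotation ambiguity of the local bases, which is one reason this choice of "center" is convenient in a sketch.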
