Abstract

Generalized canonical correlation analysis (GCCA), an extension of classical two-view CCA, integrates information from data samples acquired in multiple feature spaces (or 'views') to produce low-dimensional representations. Since the 1960s, (G)CCA has attracted much attention in statistics, machine learning, and data mining because of its importance in data analytics. Despite these efforts, the existing GCCA algorithms have serious complexity issues: their memory and computational costs typically grow quadratically and cubically, respectively, with the problem dimension (the number of samples or features). For example, handling views with $\approx 1{,}000$ features with such algorithms already requires storing $\approx 10^6$ values and costs $\approx 10^9$ flops per iteration, which makes it hard to push these methods much further. To circumvent such difficulties, we first propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed $\approx 100{,}000$. Our second contribution is a pair of distributed GCCA algorithms that compute the canonical components of different views in parallel and can thus further reduce the runtime significantly when multiple computing agents are available. We provide detailed convergence analyses and show that all of the proposed large-scale GCCA algorithms converge to a Karush-Kuhn-Tucker (KKT) point at least sublinearly. Judiciously designed synthetic and real-data experiments showcase the effectiveness of the proposed algorithms.
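
The abstract does not spell out the formulation, but a common way to cast GCCA is the MAX-VAR criterion: minimize $\sum_i \|X_i U_i - G\|_F^2$ over the per-view loadings $U_i$ and a shared representation $G$ with $G^\top G = I$. The sketch below is a simplified alternating scheme under that assumed formulation, not a reproduction of the paper's algorithm; the function and variable names (`gcca_iteration`, `views`, `G`, `reg`) are illustrative. It shows where linear scaling in the number of nonzeros can come from: every heavy operation is a sparse matrix-vector product, and no feature-by-feature matrix is ever formed.

```python
# Minimal MAX-VAR-style GCCA sketch (assumed formulation, not the paper's
# exact updates), exploiting sparsity so per-view cost is O(nnz(X_i) * K).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def gcca_iteration(views, G, reg=1e-3, cg_iters=20):
    """One alternating step: update each view's loadings U_i against the
    shared representation G (N x K, orthonormal columns), then refresh G.
    Memory stays linear in nnz(X_i) because the M_i x M_i Gram matrices
    are only applied implicitly, never stored."""
    K = G.shape[1]
    projections = []
    for X in views:
        M = X.shape[1]
        XtG = X.T @ G  # sparse-dense product, O(nnz(X) * K)
        # Apply (X^T X + reg * I) matrix-free and solve for each column of U
        # with conjugate gradients.
        A = spla.LinearOperator(
            (M, M), matvec=lambda v, X=X: X.T @ (X @ v) + reg * v)
        U = np.column_stack(
            [spla.cg(A, XtG[:, k], maxiter=cg_iters)[0] for k in range(K)])
        projections.append(X @ U)  # O(nnz(X) * K)
    # Shared-representation update: orthonormal matrix closest to the average
    # projection, via a thin SVD of an N x K matrix (cheap when K is small).
    S = sum(projections) / len(views)
    P, _, Qt = np.linalg.svd(S, full_matrices=False)
    return P @ Qt


# Usage sketch: a few large sparse views, random orthonormal starting G.
# views = [sp.random(100_000, 50_000, density=1e-5, format='csr')
#          for _ in range(3)]
# G = np.linalg.qr(np.random.randn(100_000, 5))[0]
# for _ in range(10):
#     G = gcca_iteration(views, G)
```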
