Abstract

One of the important issues associated with real-life high-dimensional data analysis is how to extract significant and relevant features from multiview data. The multiset canonical correlation analysis (MCCA) is a well-known statistical method for multiview data integration. It finds a linear subspace that maximizes the correlations among different views. However, the existing methods to find the multiset canonical variables are computationally very expensive, which restricts the application of the MCCA in real-life big data analysis. The covariance matrix of each high-dimensional view may also suffer from the singularity problem due to the limited number of samples. Moreover, the MCCA-based existing feature extraction algorithms are, in general, unsupervised in nature. In this regard, a new supervised feature extraction algorithm is proposed, which integrates multimodal multidimensional data sets by solving maximal correlation problem of the MCCA. A new block matrix representation is introduced to reduce the computational complexity for computing the canonical variables of the MCCA. The analytical formulation enables efficient computation of the multiset canonical variables under supervised ridge regression optimization technique. It deals with the "curse of dimensionality" problem associated with high-dimensional data and facilitates the sequential generation of relevant features with significantly lower computational cost. The effectiveness of the proposed multiblock data integration algorithm, along with a comparison with other existing methods, is demonstrated on several benchmark and real-life cancer data.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call