Kernel canonical correlation analysis for data combination of multiple-source datasets

Masaki Mitsuhiro, Takahiro Hoshino

doi:10.1007/s42081-020-00074-z

Abstract

To investigate the relationship between variables that are not observed simultaneously in the same dataset, “multiple-source datasets” obtained from different individuals or units must be integrated into a “(quasi) single-source dataset”, in which all the relevant variables are observed for the same units. Among various data combination methods, statistical matching, which is frequently used in practice in marketing and the social sciences, matches units from one dataset with similar units from another dataset, where similarity is measured by the distance between the units’ values of covariates related to the variables of interest. However, when multiple-source datasets have a large number of covariates, it is difficult to obtain an accurate quasi single-source dataset using matching methods, because the combinations of covariate values become complicated and/or it is difficult to deal with nonlinear relationships between the variables of interest. In this study, we propose a data combination method that combines an extension of kernel canonical correlation analysis with statistical matching. The proposed method estimates canonical variables in a common low-dimensional space that preserves the relationship between covariates and outcome variables. Using a simulation study and a real-world data analysis, we compare our method with existing methods and demonstrate its utility.
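
As a rough illustration of the statistical matching step described in the abstract, the following minimal Python sketch pairs each unit in one dataset with its nearest unit in another dataset, using Euclidean distance on shared covariates. It is not the authors' proposed method: the kernel canonical correlation step is omitted, and the function name `nearest_neighbor_match` and the toy data are hypothetical, introduced here only for illustration. In the proposed method, the distance would presumably be computed on the estimated canonical variables rather than on the raw covariates.

```python
import math


def nearest_neighbor_match(recipient_cov, donor_cov, donor_outcome):
    """For each recipient unit, borrow the outcome of the closest donor unit.

    Hypothetical helper: plain nearest-neighbor matching on raw covariates,
    without the kernel CCA dimension reduction described in the abstract.
    """
    matched = []
    for x in recipient_cov:
        # Euclidean distance from this recipient to every donor unit
        distances = [math.dist(x, z) for z in donor_cov]
        # Index of the donor with the smallest distance
        best = min(range(len(donor_cov)), key=distances.__getitem__)
        matched.append(donor_outcome[best])
    return matched


# Toy data (made up): dataset A observes covariates and y,
# dataset B observes the same covariates and w, on different units.
cov_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
y_a = [10, 20, 30]
cov_b = [(0.1, -0.1), (4.8, 5.2), (0.9, 1.2)]
w_b = [100, 300, 200]

# Attach a w value to every unit of A, giving a quasi single-source dataset
w_imputed = nearest_neighbor_match(cov_a, cov_b, w_b)
quasi_single_source = list(zip(y_a, w_imputed))
print(quasi_single_source)
```

With many covariates, raw distances of this kind become less informative, which is the difficulty that motivates matching in a common low-dimensional space instead.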

Full Text