The potential to study and improve different aspects of our lives keeps growing thanks to the abundance of data now available. Scientists and researchers often need to analyze data from different sources whose observations share only a subset of the variables and cannot always be linked to identify common individuals. This is the case, for example, when the information required to study a certain phenomenon comes from different sample surveys. Statistical matching is a common practice for combining such data sets. In this paper, we investigate two methods, based on Kernel Canonical Correlation Analysis (KCCA) and Super-Organizing Map (Super-OM), and extend them to statistical matching. These methods are designed to deal with various variable types, sample weights and incompatibilities among categorical variables. In the first case, we use KCCA, a non-linear extension of CCA, to create canonical variables that can be compared across the two data sets. In the second case, Super-OM uses self-organizing maps to create subgroups of individuals who share the same characteristics. Using the 2017 Belgian Statistics on Income and Living Conditions (SILC), we compare the performance of the proposed statistical matching methods by means of a cross-validation technique, treating the data as if they were available from two separate sources. The results indicate that our proposed methods outperform existing methods because they preserve the distribution of the generated variables while also providing good predictions, whereas existing methods typically achieve only one or the other. These new techniques open the door to improving statistical matching in other applications such as medicine, economics, …