Abstract

The potential to study and improve different aspects of our lives keeps growing thanks to the abundance of data available today. Scientists and researchers often need to analyze data from different sources whose observations share only a subset of variables and cannot always be linked to identify common individuals. This is the case, for example, when the information required to study a certain phenomenon comes from different sample surveys. Statistical matching is a common practice for combining such data sets. In this paper, we investigate two methods, based on Kernel Canonical Correlation Analysis (KCCA) and Super-Organizing Map (Super-OM), and extend them to statistical matching. These methods are designed to deal with various variable types, sample weights and incompatibilities among categorical variables. In the first case, we use KCCA, a non-linear extension of CCA, to create canonical variables that can be compared across the two data sets. In the second case, Super-OM uses organizing maps to create subgroups of individuals who share the same characteristics. Using the 2017 Belgian Statistics on Income and Living Conditions (SILC), we compare the performance of the proposed statistical matching methods by means of a cross-validation technique, treating the data as if they came from two separate sources. The results indicate that our proposed methods outperform existing methods because they preserve the distribution of the generated variables while also providing good predictions, whereas existing methods typically achieve only one or the other. These new techniques open the door to improving statistical matching in other applications such as medicine, economics, …
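
To make the statistical matching setting concrete, the sketch below fuses two toy files that share only some variables. It is not the KCCA or Super-OM method of this paper: it uses a plain nearest-neighbour donor rule on the common variables, one simple baseline approach to matching. The function name `nearest_donor_match`, the variable names and the toy records are all hypothetical and chosen only for illustration.

```python
import math


def nearest_donor_match(recipients, donors, common, target):
    """Give each recipient the `target` value of its closest donor.

    Closeness is the Euclidean distance on the `common` variables,
    which are the only variables observed in both files.
    """
    fused = []
    for rec in recipients:
        rec_point = [rec[v] for v in common]
        # Pick the donor record nearest to this recipient.
        best = min(
            donors,
            key=lambda d: math.dist(rec_point, [d[v] for v in common]),
        )
        new_rec = dict(rec)  # copy, so the recipient file is left untouched
        new_rec[target] = best[target]
        fused.append(new_rec)
    return fused


# Hypothetical toy files: A observes income, B observes health,
# and both observe age and household size.
survey_a = [
    {"age": 25, "hh_size": 1, "income": 21000},
    {"age": 60, "hh_size": 2, "income": 34000},
]
survey_b = [
    {"age": 27, "hh_size": 1, "health": "good"},
    {"age": 58, "hh_size": 2, "health": "fair"},
    {"age": 40, "hh_size": 4, "health": "good"},
]

fused = nearest_donor_match(survey_a, survey_b, ["age", "hh_size"], "health")
for row in fused:
    print(row)
```

In practice the common variables would be put on comparable scales before distances are computed, and sample weights and categorical variables need separate handling, which is part of what the methods in this paper are designed to address.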
