Abstract

This work provides a procedure to construct and visualize profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques: multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS on large datasets is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower’s interpolation formula is then used to project the remaining individuals onto this configuration. Throughout the process, Gower’s distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset obtained from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal overcomes the high computational cost of classical MDS with low error.

Highlights

  • One of the most important goals in visualizing data is to get a sense of how near or far objects are from each other

  • Since classical multidimensional scaling (MDS) suffers from computational problems as sample size increases, we propose instead a "fast" MDS based on the selection of a small random sample, which is clustered by means of an adapted version of the k-prototypes algorithm that can cope with Gower’s metric and weighted data

  • The visualization method that we propose is based on classical MDS applied to a random portion of the dataset plus the projection of the remaining individuals via Gower’s interpolation formula
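The two-step idea in the highlights above (classical MDS on a random subsample, then projection of the remaining individuals with Gower’s interpolation formula) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ implementation: the function names `classical_mds`, `gower_interpolate` and `sq_dists` are hypothetical, the toy data are synthetic, and plain Euclidean distances stand in for Gower’s distance so that the result can be checked exactly. The clustering step is not shown.

```python
import numpy as np

def classical_mds(D2, k=2):
    """Classical MDS from an (n, n) matrix of squared distances.
    Returns the (n, k) configuration and the diagonal of the
    doubly centered inner-product matrix B."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ D2 @ J                      # inner products
    eigval, eigvec = np.linalg.eigh(B)         # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:k]         # indices of the k largest
    lam, V = eigval[idx], eigvec[:, idx]
    X = V * np.sqrt(np.maximum(lam, 0.0))      # principal coordinates
    return X, np.diag(B)

def gower_interpolate(X, b, d2_new):
    """Gower's interpolation: x_new = 1/2 (X'X)^{-1} X' (b - d2).
    d2_new is (m, n): squared distances from m new points to the
    n sample points. Assumes X'X is invertible (k nonzero eigenvalues)."""
    rhs = X.T @ (b[None, :] - d2_new).T        # shape (k, m)
    return 0.5 * np.linalg.solve(X.T @ X, rhs).T

def sq_dists(A, B):
    """Squared Euclidean distances (toy stand-in for Gower's distance)."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

# Toy data: 200 synthetic points in the plane (an assumption for illustration).
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))

# Step 1: classical MDS on a small random sample only.
sample = rng.choice(200, size=30, replace=False)
rest = np.setdiff1d(np.arange(200), sample)
X_sample, b = classical_mds(sq_dists(points[sample], points[sample]), k=2)

# Step 2: project the remaining points onto the sample configuration.
X_rest = gower_interpolate(X_sample, b, sq_dists(points[rest], points[sample]))

config = np.empty((200, 2))
config[sample] = X_sample
config[rest] = X_rest
```

Only a 30 × 30 matrix is eigendecomposed here; each remaining point needs just its distances to the 30 sampled points. In this Euclidean toy case the projected configuration reproduces the original pairwise distances up to rotation and reflection; with a non-Euclidean distance the reconstruction would be approximate.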

Summary

Introduction

One of the most important goals in visualizing data is to get a sense of how near or far objects are from each other. Often, this is done with a scatter plot, because the Euclidean distance is the only one that our brain can interpret. Survey data coming from macro-surveys at national and cross-national levels are rather complex datasets of weighted and mixed data. They are composed of variables of different natures, such as binary, multi-state categorical and numerical variables; and, as a result of a multi-stage sampling methodology, they each include a weighting variable, so that each individual represents a group of different size in the target population. A further complexity may be their large or very large sample size (10^4 or larger).
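To make the notion of proximity for mixed data concrete, here is a simplified sketch of a Gower-type squared distance. It rests on several assumptions that are not taken from the text: numerical variables are scaled by their observed range, binary and multi-state variables are both handled by simple matching (Gower’s original coefficient treats binary variables separately), the squared distance is taken as one minus the similarity, and the sampling weights are not involved. The name `gower_sq_distance` is hypothetical.

```python
import numpy as np

def gower_sq_distance(num, cat):
    """Simplified Gower-type squared distance for mixed data.
    num: (n, p1) numerical variables; cat: (n, p2) categorical labels.
    Returns the (n, n) matrix 1 - s, where s is the average per-variable
    similarity (range-scaled for numerical, simple matching for categorical)."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0                  # constant columns: similarity 1
    num_sim = (1.0 - np.abs(num[:, None, :] - num[None, :, :]) / ranges).sum(axis=2)
    cat_sim = (cat[:, None, :] == cat[None, :, :]).sum(axis=2)
    s = (num_sim + cat_sim) / (num.shape[1] + cat.shape[1])
    return 1.0 - s

# Three hypothetical individuals: two numerical and two categorical variables.
num = [[0.0, 10.0], [1.0, 20.0], [0.5, 10.0]]
cat = [["a", "x"], ["b", "x"], ["a", "y"]]
D2 = gower_sq_distance(num, cat)
```

Individuals 0 and 1 sit at opposite ends of both numerical ranges and agree on one of the two categorical variables, so their similarity is 1/4 and their squared distance is 0.75.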
