Abstract

Data projection and clustering are common procedures in data science. Choosing an appropriate analysis method is crucial to avoid faulty pattern recognition, so it is necessary to know the properties, and especially the limitations, of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS), which is designed to address specific problems of structure discovery in high-dimensional spaces. The FCPS contains 10 datasets named “Atom”, “Chainlink”, “EngyTime”, “Golfball”, “Hepta”, “Lsun”, “Target”, “Tetra”, “TwoDiamonds”, and “WingNut”. When applied to the FCPS, common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters. Likewise, common data projection methods could only partially reproduce the data structure on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms, such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact.

Highlights

  • DataBionics Research Institute, University of Marburg, 35032 Marburg, Germany; Institute of Clinical Pharmacology, Goethe-University, 60590 Frankfurt am Main, Germany

  • The present report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS), which from the outset is focused on specific problems of structure-finding in high-dimensional spaces

  • The EngyTime dataset is a density-defined 2D dataset obtained from a Gaussian mixture model (GMM)
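To illustrate what "obtained from a Gaussian mixture model" means in practice, the following is a minimal sketch of sampling labelled 2D points from a two-component GMM. The function name `sample_gaussian_mixture` and all means, covariances, and weights are hypothetical choices made for illustration; they are not the published EngyTime parameters.

```python
import numpy as np

def sample_gaussian_mixture(n_points, means, covariances, weights, seed=0):
    """Draw labelled points from a Gaussian mixture model (illustrative helper)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    # Pick a mixture component for every point according to the weights.
    labels = rng.choice(len(weights), size=n_points, p=weights)
    points = np.empty((n_points, means.shape[1]))
    for k in range(len(weights)):
        mask = labels == k
        # Fill the rows belonging to component k with samples from its Gaussian.
        points[mask] = rng.multivariate_normal(
            means[k], covariances[k], size=int(mask.sum())
        )
    return points, labels

# Assumed parameters for illustration only, NOT the EngyTime values.
points, labels = sample_gaussian_mixture(
    n_points=1000,
    means=[[0.0, 0.0], [3.0, 3.0]],
    covariances=[np.eye(2), [[2.0, 0.0], [0.0, 0.5]]],
    weights=[0.5, 0.5],
)
```

Because the two components overlap, the class of a point in such data is determined by local density rather than by a clear gap between the groups.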

Summary

The exploration of high-dimensional data spaces is a challenge. From four dimensions upward, data spaces become increasingly hard to comprehend, and the everyday experience of spatial relationships between data points is replaced by strange phenomena for which the term “curse of dimensionality” was coined [1]. Projections into R² provide visualizations that support the search for groups of data points with common properties, that is, they support clustering. This is essential for the exploration of complex and large data. If an algorithm is not able to solve obvious problems in low-dimensional data, its results for high-dimensional spaces may not be trustworthy. To facilitate such checks, the present report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS), which from the outset is focused on specific problems of structure-finding in high-dimensional spaces. The FCPS has already been used successfully to compare the performance of clustering and data projection algorithms, as reported previously [3,4,5].
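As a rough illustration of projecting data into R², the sketch below builds two interlocked rings in three dimensions and projects them onto their first two principal components. Both helper functions are hypothetical, the rings are only loosely modelled on the idea of classes that are not linearly separable and are not the published FCPS data, and PCA is used here merely as one common linear projection method.

```python
import numpy as np

def make_interlocked_rings(n_per_ring=200, seed=0):
    """Generate two interlocked unit rings in 3D (illustrative data only)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_ring)
    # Ring 0 lies in the x-y plane around the origin.
    ring0 = np.column_stack(
        [np.cos(angles), np.sin(angles), np.zeros(n_per_ring)]
    )
    # Ring 1 lies in the x-z plane, shifted so that it passes through ring 0.
    ring1 = np.column_stack(
        [1.0 + np.cos(angles), np.zeros(n_per_ring), np.sin(angles)]
    )
    data = np.vstack([ring0, ring1])
    labels = np.repeat([0, 1], n_per_ring)
    return data, labels

def pca_project_2d(data):
    """Project data onto its first two principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

data, labels = make_interlocked_rings()
projected = pca_project_2d(data)
```

Plotting `projected` colored by `labels` typically shows the two rings crossing each other in the plane, which is one way a linear projection can fail to reproduce a structure that is clearly separated in three dimensions.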

General Properties of the FCPS Datasets
Chainlink
GolfBall
EngyTime
Target
TwoDiamonds
WingNut
Performance of Different Clustering Algorithms
Performance of Different Data Projection Methods
Methods
