Abstract

We present MDCGen, a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to the underlying distributions used to create multidimensional clusters. As a novelty, normal and non-normal distributions can be combined either to define values independently, feature by feature (i.e., multivariate distributions), or to establish overall intra-cluster distances. Highly flexible, parameterizable, and randomizable, MDCGen also implements the features classically sought in such generators: (a) customization of cluster separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) importing arbitrary distributions for the value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers a wider range of potential datasets for more comprehensive testing of clustering algorithms.

Highlights

  • Synthetic datasets are necessary because real data does not allow controlled, flexible testing of data mining algorithms and cannot be used to draw general conclusions

  • Synthetic datasets are to algorithms what simulations are to control strategies; i.e., they are intended to provide testbeds for exhaustive testing

  • MDCGen is devised for research purposes, to test clustering algorithms and clustering validation techniques


Summary

Introduction

Journal of Classification (2019) 36:599–618

Synthetic datasets are necessary because real data does not allow controlled, flexible testing of data mining algorithms and cannot be used to draw general conclusions. Although real datasets and scenarios are the ultimate reality check for competitive algorithms, it can be counterproductive to rely on real data during the design and development of new algorithms (Farber et al. 2010). In this respect, synthetic datasets are to algorithms what simulations are to control strategies; i.e., they are intended to provide testbeds for exhaustive testing. Among the capabilities expected of a dataset generator:

  • Control and allow different cluster properties in the same dataset (e.g., size, number of objects, shape, orientation)

  • Generate subspace clusters if desired, i.e., groups of objects that show a clear cluster structure in lower-dimensional subspaces but become sparse or noisy when additional dimensions are considered
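
To make the notion of a subspace cluster concrete, here is a minimal Python sketch that uses only the standard library. It is not MDCGen's code or API: the function names `make_subspace_cluster` and `per_dimension_std`, their parameters, and the choice of Gaussian values in the relevant dimensions with uniform values elsewhere are all assumptions made for illustration.

```python
import random
import statistics


def make_subspace_cluster(n_points, n_dims, relevant_dims,
                          center=0.5, spread=0.02, seed=0):
    """Hypothetical helper: rows are tight around `center` only in relevant_dims.

    In every other dimension the values are uniform on [0, 1], so the
    group looks like noise there.
    """
    rng = random.Random(seed)
    relevant = set(relevant_dims)
    points = []
    for _ in range(n_points):
        row = [
            rng.gauss(center, spread) if d in relevant else rng.uniform(0.0, 1.0)
            for d in range(n_dims)
        ]
        points.append(row)
    return points


def per_dimension_std(points):
    """Population standard deviation of each column (dimension)."""
    return [statistics.pstdev(column) for column in zip(*points)]


# A 4-dimensional group that is a cluster only in dimensions 0 and 1.
cluster = make_subspace_cluster(n_points=500, n_dims=4, relevant_dims=[0, 1])
stds = per_dimension_std(cluster)
print([round(s, 3) for s in stds])
```

In this sketch the per-dimension spread is small in dimensions 0 and 1 and close to that of a uniform distribution in dimensions 2 and 3, so the group is recognizable as a cluster only in the two-dimensional subspace.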

Related Work
Implementation
Object Distributions
A new set of values D that represent object-to-center distances is created:
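
As an illustration of distance-driven generation, the following Python sketch (standard library only) draws a set of distances D from a chosen distribution and places each object at that distance from the cluster center along a random direction. This is an assumed reading of the step, not the paper's actual procedure or formula; `sample_cluster_by_distance`, `draw_distance`, and the particular half-normal and gamma distributions are hypothetical choices.

```python
import math
import random


def sample_cluster_by_distance(center, n_points, draw_distance, seed=0):
    """Hypothetical helper: place points at drawn distances from `center`.

    `draw_distance` is any callable taking a random.Random instance and
    returning a non-negative object-to-center distance.
    """
    rng = random.Random(seed)
    dims = len(center)
    points, distances = [], []
    for _ in range(n_points):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        norm = math.sqrt(sum(x * x for x in direction))
        dist = draw_distance(rng)
        points.append([c + dist * x / norm for c, x in zip(center, direction)])
        distances.append(dist)
    return points, distances


center = [0.2, 0.8, 0.5]

# Distances from a half-normal distribution (absolute value of a Gaussian).
pts_normal, d_normal = sample_cluster_by_distance(
    center, 200, lambda rng: abs(rng.gauss(0.0, 0.05)))

# Distances from a non-normal (gamma) distribution.
pts_gamma, d_gamma = sample_cluster_by_distance(
    center, 200, lambda rng: rng.gammavariate(2.0, 0.02), seed=1)
```

Swapping the `draw_distance` callable changes the radial profile of the cluster without touching the placement logic, which is one simple way to mix normal and non-normal behavior in the same dataset.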
Cluster Placement
Overlap Control
Additional Features
Cluster Generation Summary
Parameters and Configuration
Conclusions
