Abstract

The nature of data in Astrophysics has changed, as in other scientific fields, over the past decades due to the increase in measurement capabilities. As a consequence, data are nowadays frequently high-dimensional and available in massive volumes or as streams. Model-based clustering techniques are popular tools, renowned for their probabilistic foundations and their flexibility. However, classical model-based techniques behave disappointingly in high-dimensional spaces, mainly because of their dramatic over-parametrization. Recent developments in model-based classification overcome these drawbacks and make it possible to classify high-dimensional data efficiently, even in the small n / large p situation. This work presents a comprehensive review of these recent approaches, including regularization-based techniques, parsimonious modeling, subspace classification methods, and classification methods based on variable selection. The use of these model-based methods is also illustrated on real-world classification problems in Astrophysics using R packages.
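The over-parametrization the abstract refers to can be made concrete by counting the free parameters of a full-covariance Gaussian mixture model. The following is a minimal illustrative sketch (the function name is ours, not from the paper); the review itself works with R packages, but the arithmetic is language-independent:

```python
# Illustrative sketch: counting the free parameters of a K-component
# Gaussian mixture in dimension p shows why classical model-based
# clustering is over-parametrized when p is large.

def gmm_param_count(K: int, p: int) -> int:
    """Free parameters of a full-covariance Gaussian mixture model:
    (K - 1) mixing proportions + K*p mean coordinates
    + K*p*(p+1)/2 covariance terms (one symmetric matrix per component)."""
    return (K - 1) + K * p + K * p * (p + 1) // 2

# The covariance terms grow quadratically with p, so in the
# small n / large p setting the sample size is quickly dwarfed.
print(gmm_param_count(4, 2))    # low dimension: a modest model
print(gmm_param_count(4, 100))  # high dimension: tens of thousands of parameters
```

Parsimonious and subspace models, reviewed later in the paper, attack exactly this quadratic term by constraining or factorizing the component covariance matrices.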

Highlights

  • As noticed by Hubble back in 1936 [32]: “The nebulae are so numerous that they cannot all be studied individually.”

  • Clustering may be a powerful tool for astrophysicists who have to face masses of data

  • The reader should keep in mind that principal component analysis (PCA) may be a useful exploratory tool to visualize the data but should not be used as a preprocessing step for clustering or classification


Summary

Introduction

As noticed by Hubble back in 1936 [32]: “The nebulae are so numerous that they cannot all be studied individually.” Model-based methods usually show a disappointing behavior in high-dimensional spaces. Since the dimension of observed data is usually higher than their intrinsic dimension, it is theoretically possible to reduce the dimension of the original space without losing any information. For this reason, dimension reduction methods are frequently used in practice to reduce the dimension of the data before the clustering step. To avoid the drawbacks of dimension reduction, several approaches have been proposed in the last decade that allow model-based methods to efficiently classify high-dimensional data. Subspace clustering techniques are mostly based on probabilistic versions of the factor analysis model and make it possible to classify the data in low-dimensional subspaces without reducing the dimension. Before moving forward, let us note that we present here only the models; unless otherwise noted, the inference of these models is done via the EM algorithm (see [Girard and Saracco’s Chapter]).
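Since EM is the common inference engine for all the models reviewed, a compact sketch may help fix ideas. The following is an illustrative pure-Python EM for a two-component univariate Gaussian mixture on synthetic data (function names and data are ours, not from the paper; in practice one would use the R packages the paper illustrates):

```python
import math
import random

# Minimal EM sketch for a two-component univariate Gaussian mixture,
# the basic inference scheme behind the model-based methods reviewed here.

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, n_iter=100):
    data = sorted(data)
    n = len(data)
    # Crude initialization: means at the sample quartiles.
    mu = [data[n // 4], data[3 * n // 4]]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate proportions, means and variances
        # from the responsibility-weighted sample.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
    return pi, mu, var

random.seed(0)
sample = [random.gauss(-3, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]
pi, mu, var = em_gmm(sample)
print(sorted(mu))  # the estimated means land near the true values -3 and 3
```

The E-step/M-step alternation shown here carries over unchanged to the multivariate and subspace models of the following sections; only the parametrization of the component densities differs.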

Curse and blessings of the dimensionality
Bellman’s curse of the dimensionality
The curse of dimensionality in model-based clustering
The blessing of dimensionality in clustering
Dimension reduction
Regularization
Parsimonious models
Subspace clustering methods
Mixture of high-dimensional Gaussian mixture models
The discriminative latent mixture models
Variable selection as a model selection problem
Variable selection by penalization of the loadings
Conclusion

