A Complete Classification and Clustering Model to Account for Continuous and Categorical Data in Presence of Missing Values and Outliers †

Guillaume Revillon, Ali Mohammad-Djafari

doi:10.3390/proceedings2019033023

Abstract

Classification and clustering problems are closely connected with pattern recognition, where many general algorithms have been developed and used in various fields. Depending on the complexity of patterns in data, classification and clustering procedures should take into consideration both continuous and categorical data, which can be partially missing and erroneous due to mismeasurements and human errors. However, most algorithms cannot handle missing data, and imputation methods are required to generate data before they can be used. Hence, the main objective of this work is to define a classification and clustering framework that handles both outliers and missing values. Here, an approach based on mixture models is preferred since mixture models provide a mathematically based, flexible and meaningful framework for the wide variety of classification and clustering requirements. More precisely, a scale mixture of Normal distributions is updated to handle outliers and missing data issues for any type of data. Then a variational Bayesian inference is used to find approximate posterior distributions of parameters and to provide a lower bound on the model log evidence, which is used as a criterion for selecting the number of clusters. Finally, experiments are carried out to exhibit the effectiveness of the proposed model through an application in Electronic Warfare.

Highlights

Classification and clustering problems are closely connected with pattern recognition [1] where many general algorithms [2,3,4] have been developed and used in various fields [5,6]
An approach based on mixture models is preferred since mixture models provide a mathematically based, flexible and meaningful framework for the wide variety of classification and clustering requirements [8]
Outliers are only considered for continuous data x_q, since only reliable categorical variables are assumed to be filled in databases and unreliable ones are processed as missing data
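
To illustrate why a scale mixture of Normal distributions is robust to outliers in continuous data, the sketch below assumes a univariate Student-t written as a Normal whose precision is multiplied by a Gamma-distributed latent scale. This specific choice, the function name and the parameter values are assumptions made for illustration; they are not taken from the paper. Under that assumption, the posterior mean of the latent scale given an observation has a closed form, and it shrinks as the observation moves away from the center.

```python
def expected_latent_scale(x, mu, sigma, nu):
    """Posterior mean of the latent scale u for one univariate observation.

    Assumed model (illustrative, not the paper's exact specification):
        u ~ Gamma(shape=nu/2, rate=nu/2)
        x | u ~ Normal(mu, sigma**2 / u)
    which gives E[u | x] = (nu + 1) / (nu + ((x - mu) / sigma)**2).
    """
    delta2 = ((x - mu) / sigma) ** 2  # squared standardized distance
    return (nu + 1.0) / (nu + delta2)


# A point at the center keeps a weight above 1, a distant point is downweighted.
central_weight = expected_latent_scale(0.0, mu=0.0, sigma=1.0, nu=4.0)
outlier_weight = expected_latent_scale(10.0, mu=0.0, sigma=1.0, nu=4.0)
```

In this sketch an observation ten standard deviations from the mean receives a much smaller weight than a central one, so it would contribute little to estimates of the mean and variance.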

Summary

Introduction

Classification and clustering problems are closely connected with pattern recognition [1], where many general algorithms [2,3,4] have been developed and used in various fields [5,6]. Depending on the complexity of patterns in data, classification and clustering procedures should take into consideration both continuous and categorical data, which can be partially missing and erroneous due to mismeasurements and human errors. The location mixture model [9] assumes that continuous variables follow a multivariate Normal distribution conditionally on the categorical variables. This approach is retained since it better models relations between continuous and categorical features when data patterns are mostly designed by first choosing patterns of categorical features to achieve a specific goal, and then choosing continuous features that meet constraints related to the chosen patterns and the problem environment. The location mixture model naturally responds to that dependence structure. A scale mixture of Gaussian distributions [11] is then updated to handle outliers and missing data issues for any type of data.
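
The dependence structure of the location model can be sketched as a two-step sampler: a categorical level is drawn first, then a continuous value is drawn from a Normal distribution whose parameters depend on that level. This is a minimal univariate sketch; the level names, probabilities, means and standard deviations are hypothetical values chosen for illustration, not parameters from the paper.

```python
import random

# Hypothetical parameters, for illustration only.
CATEGORY_PROBS = {"A": 0.6, "B": 0.4}   # probability of each categorical level
COND_MEAN = {"A": 0.0, "B": 5.0}        # mean of x given the level
COND_STD = {"A": 1.0, "B": 0.5}         # standard deviation of x given the level


def sample_location_model(n, seed=0):
    """Draw n (category, continuous value) pairs from the location model."""
    rng = random.Random(seed)
    levels = list(CATEGORY_PROBS)
    weights = [CATEGORY_PROBS[c] for c in levels]
    samples = []
    for _ in range(n):
        # Step 1: choose the categorical feature.
        c = rng.choices(levels, weights=weights, k=1)[0]
        # Step 2: draw the continuous feature conditionally on it.
        x = rng.gauss(COND_MEAN[c], COND_STD[c])
        samples.append((c, x))
    return samples


def conditional_means(samples):
    """Empirical mean of the continuous feature within each category."""
    sums, counts = {}, {}
    for c, x in samples:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


samples = sample_location_model(2000, seed=1)
means = conditional_means(samples)
```

With enough samples, the per-category empirical means should approach the conditional means set above, which is the sense in which the continuous feature depends on the categorical one.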

Assumptions on Mixed-Type Data

Distribution of Mixed-Type Data

Outlier Handling

Missing Data Handling

Model and Inference

Variational Bayesian Inference

Classification and Clustering

Application

Classification Experiment

Clustering Experiment

Findings

Conclusions

Publication Date: Dec 9, 2019
Citations: 4
License type: CC BY 4.0

Similar Papers

Missing data in bioarchaeology II: A test of ordinal and continuous data imputation.
Amanda Wissler ... Jane E Buikstra
American journal of biological anthropology | VOL. 179
12 Sep 2022

A novel machine learning-based imputation strategy for missing data in step-stress accelerated degradation test
Yaqiu Li ... Baimao Lei
Heliyon | VOL. 10
01 Feb 2024

A new imputation method for small software project data sets
Qinbao Song ... Martin Shepperd
The Journal of Systems & Software | VOL. 80
16 Jun 2006

How to deal with missing longitudinal data in cost of illness analysis in Alzheimer's disease-suggestions from the GERAS observational study.
Mark Belger ... Richard Dodel
BMC medical research methodology | VOL. 16
18 Jul 2016
