Abstract

Missing data are unavoidable in real-world applications of unsupervised machine learning, and their nonoptimal processing may decrease the quality of data-driven models. Imputation is a common remedy for missing values, but methods that directly estimate expected distances have also emerged. Because the treatment of missing values is rarely considered in clustering-related tasks, and distance metrics have a central role in both clustering and cluster validation, we developed a new toolbox that provides a wide range of algorithms for data preprocessing, distance estimation, clustering, and cluster validation in the presence of missing values. All of these are core elements in any comprehensive cluster analysis methodology. We describe the methodological background of the implemented algorithms and present multiple illustrations of their use. The experiments include validating distance estimation methods against selected reference methods and demonstrating the performance of internal cluster validation indices. The experimental results demonstrate the general usability of the toolbox for the straightforward realization of alternate data processing pipelines. Source code, data sets, results, and example macros are available on GitHub: https://github.com/markoniem/nanclustering_toolbox

Highlights

  • In many machine learning tasks, the volume of data is limited, necessitating that all the available data values be utilized as extensively as possible

  • The expectation maximization (EM) algorithm for estimating the mean vector μ and the covariance matrix Σ of a data set with missing values under the assumption of the conditional multivariate normal distribution is given in Algorithm 1

  • The toolbox supports computation strategies based on available data (ADS), partial distance (PDS), and expected distances (expected squared Euclidean distance (ESD) and expected Euclidean distance (EED)) that are used by clustering methods, cluster validation indices, and data preprocessing methods
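The EM update mentioned in the highlights can be illustrated with a short sketch. The following is a minimal, generic implementation of EM for the mean vector μ and covariance matrix Σ of normally distributed data with missing entries, not the toolbox's exact Algorithm 1; the function name `em_mvn` and its parameters are illustrative assumptions.

```python
import numpy as np

def em_mvn(X, n_iter=50, reg=1e-6):
    """EM estimates of the mean and covariance of a data matrix X
    (rows = observations) containing NaNs, assuming rows follow a
    multivariate normal distribution. A simplified sketch."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    # initialize from available-data statistics
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0) + reg)
    for _ in range(n_iter):
        X_hat = X.copy()
        C_sum = np.zeros((d, d))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            S_oo = sigma[np.ix_(o, o)] + reg * np.eye(o.sum())
            S_mo = sigma[np.ix_(m, o)]
            K = S_mo @ np.linalg.inv(S_oo)
            # E-step: conditional mean of missing entries given observed ones
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            # conditional covariance carries the imputation uncertainty
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update mean and covariance from completed data
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C_sum) / n
    return mu, sigma
```

The correction term `C_sum` is what distinguishes EM from naive conditional-mean imputation: without it, the covariance estimate would be biased toward zero in the missing dimensions.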


Summary

INTRODUCTION

In many machine learning tasks, the volume of data is limited, necessitating that all the available data values be utilized as extensively as possible. A well-known distance estimation method is the partial distance strategy (PDS) [3], which is known as a general similarity measure [4]. This approach shares limitations with the nearest neighbors method in that its accuracy is strongly correlated with the number of missing values in the data. The core elements of a comprehensive cluster analysis methodology are data selection, data preprocessing, selection of the distance measure, choice of the clustering criterion, selection of the missing data strategy, validation of the created algorithms, selection of the number of clusters, and interpretation of the results. The second part compares clustering methods and cluster validation indices on two-dimensional (2D) data sets with missing values.
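The partial distance strategy described above can be sketched in a few lines: the Euclidean distance is computed over the coordinates observed in both vectors and scaled up to the full dimensionality. This is a minimal illustration of the general idea, not code from the toolbox; the function name `partial_distance` is an assumption.

```python
import numpy as np

def partial_distance(x, y):
    """PDS sketch: Euclidean distance over commonly observed
    coordinates, rescaled by d / k where k is the number of
    coordinates observed in both vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = ~(np.isnan(x) | np.isnan(y))
    k = mask.sum()
    if k == 0:
        raise ValueError("no commonly observed coordinates")
    d = len(x)
    return np.sqrt((d / k) * np.sum((x[mask] - y[mask]) ** 2))
```

As noted above, the estimate degrades as the number of missing values grows, since fewer coordinates contribute to the sum and the rescaling factor d / k amplifies their noise.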

AVAILABLE DATA STRATEGY
EXPECTED SQUARED EUCLIDEAN DISTANCE
EXPECTED EUCLIDEAN DISTANCE
K-NEAREST NEIGHBORS IMPUTATION
TRANSFORMATION INTO SPHERICAL FORM
CLUSTER VALIDATION INDICES
INTERNAL CLUSTER VALIDATION INDICES
EXTERNAL CLUSTER VALIDATION INDICES
OVERVIEW OF THE TOOLBOX
GENERAL USE OF THE TOOLBOX
CLUSTER VALIDATION WITH MULTIDIMENSIONAL DATA
DISCUSSION
Findings
CONCLUSIONS
