Abstract

Missing data are unavoidable in real-world applications of unsupervised machine learning, and their nonoptimal processing may decrease the quality of data-driven models. Imputation is a common remedy for missing values, but methods that directly estimate expected distances have also emerged. Because the treatment of missing values is rarely considered in clustering-related tasks, and distance metrics have a central role in both clustering and cluster validation, we developed a new toolbox that provides a wide range of algorithms for data preprocessing, distance estimation, clustering, and cluster validation in the presence of missing values. All of these are core elements in any comprehensive cluster analysis methodology. We describe the methodological background of the implemented algorithms and present multiple illustrations of their use. The experiments include validating distance estimation methods against selected reference methods and demonstrating the performance of internal cluster validation indices. The experimental results demonstrate the general usability of the toolbox for the straightforward realization of alternate data processing pipelines. Source code, data sets, results, and example macros are available on GitHub: https://github.com/markoniem/nanclustering_toolbox

Highlights

  • In many machine learning tasks, the volume of data is limited, necessitating that all the available data values be utilized as extensively as possible

  • The expectation maximization (EM) algorithm for estimating the mean vector μ and the covariance matrix Σ of a data set with missing values under the assumption of the conditional multivariate normal distribution is given in Algorithm 1

  • The toolbox supports computation strategies based on available data (ADS), partial distance (PDS), and expected distances (expected squared Euclidean distance (ESD) and expected Euclidean distance (EED)) that are used by clustering methods, cluster validation indices, and data preprocessing methods
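The EM update mentioned in the highlights can be illustrated with a short sketch. The following is a minimal, generic implementation of EM for the mean vector μ and covariance matrix Σ of normally distributed data with missing entries, not the toolbox's exact Algorithm 1; the function name `em_mvn` and its parameters are illustrative assumptions.

```python
import numpy as np

def em_mvn(X, n_iter=50, reg=1e-6):
    """EM estimates of the mean and covariance of a data matrix X
    (rows = observations) containing NaNs, assuming rows follow a
    multivariate normal distribution. A simplified sketch."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    # initialize from available-data statistics
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0) + reg)
    for _ in range(n_iter):
        X_hat = X.copy()
        C_sum = np.zeros((d, d))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            S_oo = sigma[np.ix_(o, o)] + reg * np.eye(o.sum())
            S_mo = sigma[np.ix_(m, o)]
            K = S_mo @ np.linalg.inv(S_oo)
            # E-step: conditional mean of missing entries given observed ones
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            # conditional covariance carries the imputation uncertainty
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update mean and covariance from completed data
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C_sum) / n
    return mu, sigma
```

The correction term `C_sum` is what distinguishes EM from naive conditional-mean imputation: without it, the covariance estimate would be biased toward zero in the missing dimensions.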


Summary

INTRODUCTION

In many machine learning tasks, the volume of data is limited, necessitating that all the available data values be utilized as extensively as possible. A well-known distance estimation method is the partial distance strategy (PDS) [3], which is known as a general similarity measure [4]. This approach shares limitations with the nearest neighbors method in that its accuracy is strongly correlated with the number of missing values in the data. The core elements of a comprehensive cluster analysis methodology are data selection, data preprocessing, selection of the distance measure, choice of the clustering criterion, selection of the missing data strategy, validation of the created algorithms, selection of the number of clusters, and interpretation of the results. The second part compares clustering methods and cluster validation indices on two-dimensional (2D) data sets with missing values.
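The partial distance strategy described above can be sketched in a few lines: the Euclidean distance is computed over the coordinates observed in both vectors and scaled up to the full dimensionality. This is a minimal illustration of the general idea, not code from the toolbox; the function name `partial_distance` is an assumption.

```python
import numpy as np

def partial_distance(x, y):
    """PDS sketch: Euclidean distance over commonly observed
    coordinates, rescaled by d / k where k is the number of
    coordinates observed in both vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = ~(np.isnan(x) | np.isnan(y))
    k = mask.sum()
    if k == 0:
        raise ValueError("no commonly observed coordinates")
    d = len(x)
    return np.sqrt((d / k) * np.sum((x[mask] - y[mask]) ** 2))
```

As noted above, the estimate degrades as the number of missing values grows, since fewer coordinates contribute to the sum and the rescaling factor d / k amplifies their noise.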

AVAILABLE DATA STRATEGY
EXPECTED SQUARED EUCLIDEAN DISTANCE
EXPECTED EUCLIDEAN DISTANCE
K-NEAREST NEIGHBORS IMPUTATION
TRANSFORMATION INTO SPHERICAL FORM
CLUSTER VALIDATION INDICES
INTERNAL CLUSTER VALIDATION INDICES
EXTERNAL CLUSTER VALIDATION INDICES
OVERVIEW OF THE TOOLBOX
GENERAL USE OF THE TOOLBOX
CLUSTER VALIDATION WITH MULTIDIMENSIONAL DATA
DISCUSSION
Findings
CONCLUSIONS
