Abstract

Many machine learning procedures, including cluster analysis, are affected by missing values. This work proposes and evaluates a Kernel Fuzzy C-means clustering algorithm based on kernelization of the metric with local adaptive distances (VKFCM-K-LP) under three strategies for dealing with missing data. The first, the Whole Data Strategy (WDS), performs clustering only on the complete part of the dataset, i.e., it discards all instances with missing data. The second, the Partial Distance Strategy (PDS), computes partial distances over all available (observed) values and then rescales them by the reciprocal of the proportion of observed values. The third, the Optimal Completion Strategy (OCS), computes missing values iteratively as auxiliary variables in the optimization of a suitable objective function. The clustering results were evaluated according to different metrics. The best performance of the clustering algorithm was achieved under the PDS and OCS strategies. Under the OCS approach, new datasets were derived, with the missing values estimated dynamically during the optimization process. Clustering under the OCS strategy also outperformed the clusters obtained by applying the VKFCM-K-LP algorithm to a version of the data in which missing values had previously been imputed with the mean or the median of the observed values.
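
The rescaling step of the PDS can be illustrated with a minimal sketch. It assumes a plain squared Euclidean distance between one instance and one cluster prototype, not the kernelized metric with local adaptive distances used by VKFCM-K-LP, and it assumes missing entries are encoded as NaN. The function name `partial_sq_distance` is hypothetical and chosen only for illustration.

```python
import numpy as np

def partial_sq_distance(x, v):
    """Squared Euclidean partial distance between an instance x (NaN = missing)
    and a fully observed prototype v.

    Assumption: plain Euclidean metric, not the paper's kernelized metric.
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    observed = ~np.isnan(x)          # True where x has an observed value
    n_obs = int(observed.sum())
    if n_obs == 0:
        raise ValueError("x has no observed values")
    diff = x[observed] - v[observed]
    # Rescale by the reciprocal of the proportion of observed values:
    # 1 / (n_obs / p) = p / n_obs, where p is the number of variables.
    return (x.size / n_obs) * float(np.dot(diff, diff))

# Two of three variables observed: (3 / 2) * (1**2 + 3**2) = 15.0
d = partial_sq_distance([1.0, np.nan, 3.0], [0.0, 0.0, 0.0])
```

For a fully observed instance the scaling factor is 1, so the partial distance reduces to the ordinary squared Euclidean distance.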

Highlights

  • The incessant increase in volume and variety of data requires advances in methodologies in order to understand, process and summarize data automatically

  • Datasets with 5%, 10%, 15% and 20% of missing values were artificially generated using the methodology described in Section 7.1: the missingness indicator, a random variable M, was sampled from Bernoulli distributions with parameter θ taken from {0.05, 0.10, 0.15, 0.20}

  • The problem of missing data is commonly discussed in several areas of science, as statistical techniques used for data analysis, such as clustering, were originally proposed for datasets without missing values
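
The Bernoulli sampling of M mentioned in the second highlight might be sketched as follows. Section 7.1 is not reproduced here, so this sketch assumes that every entry of the data matrix is masked independently with probability θ and that missing entries are stored as NaN. The function name `mask_at_random` and the `seed` parameter are hypothetical.

```python
import numpy as np

def mask_at_random(X, theta, seed=0):
    """Return a copy of X with entries set to NaN, plus the boolean mask M.

    Assumption: each entry is masked independently, M ~ Bernoulli(theta).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    M = rng.random(X.shape) < theta   # True marks a missing entry
    X_missing = X.copy()
    X_missing[M] = np.nan
    return X_missing, M

# Generate one masked copy of a toy dataset per missing rate.
X = np.arange(1000, dtype=float).reshape(200, 5)
masked = {theta: mask_at_random(X, theta)[0] for theta in (0.05, 0.10, 0.15, 0.20)}
```

Because the mask is sampled, the realized fraction of missing entries in each copy is only approximately equal to θ.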


Summary

Introduction

The incessant increase in volume and variety of data requires advances in methodologies in order to understand, process and summarize data automatically. Cluster analysis is one of the main unsupervised techniques that are used to extract knowledge from data, due to its ability to aid in the process of understanding and visualizing data structures [1, 2]. The main goal in clustering is to organize the data (observations, data items, images, pixels etc.) based on similarity (or dissimilarity) criteria such that observations belonging to the same group show high degrees of similarity, while observations in different groups show high degrees of dissimilarity [3, 4].

