Abstract

In recent years, uncertain data clustering has become an active research topic in many fields, such as pattern recognition and machine learning. To tackle uncertainty in data, researchers have largely focused on replacing the traditional distance or similarity measures in existing centralized clustering algorithms with new metrics. However, the way uncertain data are represented also plays an essential role in clustering. In this paper, a Monte Carlo integration is adopted and modified to express uncertain data in probabilistic form. Three similarity measures, one of them novel, are then used to determine the closeness between two probability distributions. These measures are derived from the Kullback-Leibler divergence and the Jeffreys divergence. Finally, the density-based spatial clustering of applications with noise (DBSCAN) and k-medoids algorithms are modified and applied to one synthetic database and three real-world uncertain databases. The results confirm that the proposed clustering technique outperforms some of the existing algorithms.

Highlights

  • In data mining, data uncertainty means that the data deviate from the ground truth because of small perturbations, often referred to as noise

  • Three measures of closeness (KL-divergence, J-divergence, and a newly devised measure) are combined with the k-medoids and density-based spatial clustering of applications with noise (DBSCAN) algorithms

  • The B-spline function is one of the components of Monte Carlo integration (MCI), so the performance of MCI depends on the order of the B-spline function
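
To make the two named divergences concrete, the following is a minimal Python sketch that assumes each uncertain object has already been reduced to a discrete probability vector. The function names `kl_divergence` and `jeffreys_divergence`, the `eps` guard, and the discrete setting are illustrative assumptions, not the paper's implementation. The Jeffreys divergence is taken here in its standard form as the sum of the two directed KL divergences; the paper's exact variants and its newly devised measure are not reproduced.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(p || q), in nats.

    p and q are non-negative weight vectors of equal length. They are
    normalized to sum to 1, then clipped at eps so that log(0) and
    division by zero cannot occur (an illustrative guard, not from the paper).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    return float(np.sum(p * np.log(p / q)))


def jeffreys_divergence(p, q, eps=1e-12):
    """Jeffreys divergence: the symmetric sum D(p || q) + D(q || p)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)


if __name__ == "__main__":
    # Two hypothetical uncertain objects, each summarized as a histogram.
    a = [0.5, 0.5]
    b = [0.9, 0.1]
    print("KL(a || b):", kl_divergence(a, b))
    print("KL(b || a):", kl_divergence(b, a))
    print("J(a, b):   ", jeffreys_divergence(a, b))
```

KL-divergence is asymmetric, whereas the Jeffreys divergence is symmetric, which makes it easier to use as a pairwise dissimilarity. A matrix of such pairwise values could, for example, be passed to a clustering routine that accepts precomputed distances, although neither divergence satisfies the triangle inequality in general.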


Summary

Introduction

Data uncertainty means that the data deviate from the ground truth because of small perturbations, often referred to as noise. In the era of big data, uncertainty is one of the inherent characteristics of data. Uncertain data is found in abundance today in web applications, in IoT sensor networks [1], [2], and within enterprises [3], [4]. It appears in both structured and unstructured sources as a result of outdated sensors, inaccurate measurement, or sampling errors. Uncertainty is observed frequently in weather and climate prediction. Small random perturbations arise in the atmospheric state variables, namely the pressure, temperature, wind, and humidity readings captured by various sensors, either because the sensors age or because the atmosphere itself is non-linear. These perturbations in turn cause the forecast to diverge from reality.
