Clustering Big Data Based on Distributed Fuzzy K-Medoids: An Application to Geospatial Informatics

Magda M Madbouly, Mohamed A Osman, Saad M Darwish, Noha A Bagi

doi:10.1109/access.2022.3149548


Open Access

https://doi.org/10.1109/access.2022.3149548


Abstract

The advent of big data tied to spatial position, known as geospatial big data, offers new opportunities to understand the urban environment. Existing database processing methods cannot deliver reliable results quickly in a geospatial big data context, because they require approximation “measures” to be defined and because query execution time keeps growing. Clustering is an effective way to address this. Scaling and accelerating clustering algorithms while maintaining high clustering quality, however, remains a significant challenge. The paper’s primary contribution is a modified hierarchical distributed k-medoid clustering method designed for spatial query analysis on big data. To improve the efficiency of the k-medoid algorithm and obtain more precise clusters, the proposed model uses the Fuzzy k-Medoids method to handle outliers in the spatial data set and to deal with data uncertainty. The method is complex in nature because it does not assume that the correct number of clusters is known in advance. The proposed model has two phases. The first phase creates local clusters, each from a portion of the entire dataset, and makes extensive use of the parallelism provided by the Apache Spark framework. The second phase aggregates the local clusters to produce compact and reliable final clusters. The proposed model greatly reduces the amount of information exchanged during aggregation and automatically produces an appropriate number of clusters based on the characteristics of the dataset. The results show that the proposed model outperforms traditional K-medoids in the accuracy of the obtained centers in big data applications.

Highlights

Over the last few decades, the rapid development of information technology has resulted in an explosion of data from a variety of devices, propelling us into the era of big data.
The results indicate that the approach speeds up linearly and scales well when the complexity of the local clustering is NP, as its results are unaffected by the type of communication.
In phase 1, the local clustering algorithm uses the degree of fuzzification to improve the efficiency of the k-medoid algorithm by extending the search for medoids, yielding the best medoids according to the nature of the data in terms of uncertainty.
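
The role of the degree of fuzzification in the local clustering step can be illustrated with a minimal sketch. It assumes the standard fuzzy c-medoids formulation (memberships proportional to inverse distance raised to 1/(m-1), medoids chosen to minimize the membership-weighted distance), which is not necessarily the exact variant used in the paper. The function names, the Euclidean distance, the parameter `m`, and the explicit `init` argument are all illustrative assumptions; the Spark-based distribution and the aggregation phase are not shown.

```python
import math

def memberships(points, medoids, m=2.0):
    """Fuzzy membership of every point in every medoid's cluster.

    m > 1 is the degree of fuzzification (assumed name): values near 1
    approach hard assignment, larger values give softer memberships.
    """
    rows = []
    for p in points:
        d = [math.dist(p, c) for c in medoids]
        if 0.0 in d:
            # The point coincides with one or more medoids: split full
            # membership among them and give the others zero.
            row = [1.0 if x == 0.0 else 0.0 for x in d]
        else:
            row = [(1.0 / x) ** (1.0 / (m - 1.0)) for x in d]
        total = sum(row)
        rows.append([x / total for x in row])
    return rows

def update_medoids(points, u, k, m=2.0):
    """For each cluster, pick the data point with the lowest weighted cost."""
    medoids = []
    for j in range(k):
        def cost(q, j=j):
            return sum((u[i][j] ** m) * math.dist(points[i], q)
                       for i in range(len(points)))
        medoids.append(min(points, key=cost))
    return medoids

def fuzzy_k_medoids(points, init, m=2.0, max_iter=50):
    """Alternate membership and medoid updates until the medoids stop changing."""
    medoids = list(init)
    for _ in range(max_iter):
        u = memberships(points, medoids, m)
        new_medoids = update_medoids(points, u, len(medoids), m)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, memberships(points, medoids, m)

# Toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, u = fuzzy_k_medoids(points, init=[(1, 0), (10, 11)])
```

Like other k-medoid variants, this sketch is sensitive to the initial medoids, which is why `init` is passed explicitly here rather than chosen inside the function.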

Summary

Introduction

Over the last few decades, the rapid development of information technology has resulted in an explosion of data from a variety of devices, propelling us into the era of big data. Big data derived from devices such as smartphones and portable Global Positioning System (GPS) devices has permeated our daily lives and shown immense potential in practical applications such as climate science, disaster management, public health, crop protection, smart cities, emergency management, and environmental monitoring [1]. Numerous attempts have been made to use geospatial big data to track patterns in human activity and to conduct urban and environmental research using remote sensing imagery. By examining the temporal characteristics of pick-up and drop-off operations within geographic units, we can not only identify urban functions but also determine job-housing functional dynamics more accurately than is possible with traditional remote sensing images, allowing us to further explore intra- and inter-city spatial activity [2].

Methods

Results

Conclusion