Large Scale Data Using K-Means

Raheela Zaib, Ourlis Ourabah

doi:10.58496/mjbd/2023/006

Open Access

https://doi.org/10.58496/mjbd/2023/006

Abstract

Because of the exponential growth of high-dimensional datasets, conventional database querying strategies are inadequate for extracting useful information, and analysts must devise new techniques to meet these demands. Such massive data raises a range of new computational challenges, driven both by the growing number of data objects and by the growing number of features/attributes. Preprocessing the data with a reliable dimensionality reduction method improves the efficiency and accuracy of mining operations on high-dimensional data. We have therefore compiled the views of numerous researchers. Cluster analysis is a data analysis tool that has recently gained prominence in a number of disciplines. K-means, a common partitioning-based clustering algorithm, looks for a fixed number of clusters, each identified only by its centroid. However, the results depend heavily on the initial positions of the cluster centers. In addition, the number of distance calculations rises dramatically as the data grows more complex. This is because building an accurate model typically calls for a large and varied amount of training data. Training on a large collection of samples can take a substantial amount of time. For huge datasets in particular, there is a trade-off between speed and accuracy to weigh when deciding how to perform classification. The k-means method is commonly used to compress and summarize vector data, as well as to cluster it. For exact k-means, we present Asynchronous Selective Batched K-means (ASB K-means), a fast and memory-efficient GPU-based method. Our method can be configured to use much less GPU RAM than the size of the full dataset, which is a significant improvement over earlier GPU-based k-means methods. Datasets that are too large to fit in RAM can therefore still be clustered.
The approach uses a batched architecture and applies the triangle inequality in each k-means iteration to skip a data point if its membership assignment, that is, the cluster it belongs to, remains unchanged, allowing it to handle big datasets efficiently. This reduces the number of data points that must be transferred between the CPU's RAM and the GPU's global memory.
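
The filtering idea described above can be illustrated with a minimal CPU-only sketch. It is not the authors' GPU implementation: it assumes a simple bounds scheme in which each point keeps an upper bound on the distance to its assigned centroid and a lower bound on the distance to every other centroid, and both bounds are loosened by the centroid movement using the triangle inequality. The names batched_kmeans, two_nearest, batch_size, and the transferred counter are hypothetical; the counter only stands in for the points that would have to be copied from host memory to the device in each iteration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_nearest(point, centers):
    """Return (index of nearest center, its distance, second-nearest distance)."""
    ds = sorted((dist(point, c), j) for j, c in enumerate(centers))
    second = ds[1][0] if len(ds) > 1 else math.inf
    return ds[0][1], ds[0][0], second


def batched_kmeans(points, centers, batch_size=4, max_iter=50):
    """Exact Lloyd-style k-means with triangle-inequality filtering.

    Returns (centers, labels, transferred). transferred[t] is the number of
    points in iteration t that needed a full distance computation, which is
    an assumed stand-in for host-to-device transfers.
    """
    n, k = len(points), len(centers)
    centers = [list(c) for c in centers]
    labels = [0] * n
    upper = [math.inf] * n  # upper bound on distance to the assigned center
    lower = [0.0] * n       # lower bound on distance to any other center
    transferred = []

    for _ in range(max_iter):
        sent = 0
        for start in range(0, n, batch_size):  # process one batch at a time
            for i in range(start, min(start + batch_size, n)):
                if upper[i] <= lower[i]:
                    continue  # assignment provably unchanged: skip the point
                sent += 1
                labels[i], upper[i], lower[i] = two_nearest(points[i], centers)
        transferred.append(sent)

        # Update step: recompute each centroid and measure how far it moved.
        new_centers = []
        for j in range(k):
            members = [points[i] for i in range(n) if labels[i] == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        shift = [dist(a, b) for a, b in zip(centers, new_centers)]
        centers = new_centers
        if max(shift) == 0.0:
            break  # centroids stopped moving, so assignments are final

        # Triangle inequality: a center that moved by s changes any distance
        # to it by at most s, so loosen both bounds accordingly.
        biggest = max(shift)
        for i in range(n):
            upper[i] += shift[labels[i]]
            lower[i] -= biggest
    return centers, labels, transferred
```

On well-separated data, most points pass the bound test after the first iteration, so only a small fraction would need to be moved to the device. In a real GPU setting the inner loop over a batch would be a kernel launch and the skipped points would simply not be copied.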

Full Text