Big data clustering through fusion of FCM, optimized encoder-decoder CNN, and BiLSTM

F Belhabib, K El Moutaouakil, S Rbihou, A Elafaar

doi:10.23939/mmc2024.03.798

Abstract

Clustering is a fundamental component of processing and analyzing massive datasets. As an unsupervised learning method, it aims to organize large datasets into homogeneous clusters without relying on pre-existing labels. Our approach seeks to optimize this process by combining three techniques: the fuzzy C-means (FCM) method, an optimized encoder–decoder CNN model, and a bidirectional long short-term memory network (BiLSTM). This combination brings together supervised and unsupervised paradigms. The BiLSTM processes each data sequence in both directions using LSTM cells, which improves the understanding of data sequences, a crucial feature in Big Data clustering. FCM is improved by introducing a function that calculates the separation between the cluster center and the instance, which increases clustering precision. To improve performance and reduce computation time, our methodology uses the optimized encoder–decoder CNN model, whose refined architecture extracts data features more efficiently and thereby improves clustering quality. We evaluate the approach on the Fashion-MNIST dataset using accuracy, the adjusted Rand index (ARI), and normalized mutual information (NMI). In comparative analyses, our approach significantly outperforms existing models on these criteria, demonstrating its effectiveness for Big Data clustering.
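
For readers unfamiliar with FCM, the following is a minimal sketch of the standard fuzzy C-means iteration in plain Python. It is not the modified FCM proposed in the paper: the abstract does not specify the separation function, so the `dist` parameter is a hypothetical hook, with the Euclidean distance used as a placeholder. The function names, the fuzzifier `m=2.0`, and the toy data are likewise assumptions made for illustration.

```python
import math
import random


def euclidean(x, c):
    """Placeholder separation between an instance x and a cluster center c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def fcm(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, dist=euclidean, seed=0):
    """Standard fuzzy C-means; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by u_ij ** m.
        centers = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Membership update: u_ij is proportional to d_ij ** (-2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            d = [max(dist(points[i], c), 1e-12) for c in centers]  # avoid division by zero
            inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            total = sum(inv)
            new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u


# Toy data: two well-separated groups of 2-D points (illustrative only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, memberships = fcm(points, n_clusters=2)
# Hard labels are obtained by taking the cluster with the highest membership.
labels = [row.index(max(row)) for row in memberships]
```

A different separation function can be tried by passing it as `dist`; in a pipeline like the one described above, `points` would presumably be feature vectors produced by the encoder rather than raw pixels.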
