A Clustering Algorithm for Multi-Modal Heterogeneous Big Data With Abnormal Data.

An Yan,Wei Wang,Hongwei Geng,Yi Ren

doi:10.3389/fnbot.2021.680613

Open Access

https://doi.org/10.3389/fnbot.2021.680613

Abstract

Data abnormalities and missing data hamper traditional multi-modal heterogeneous big data clustering. To solve this issue, this paper establishes a multi-view heterogeneous big data clustering algorithm based on improved Kmeans clustering. First, for big data that involve heterogeneous data, we use multi-view data analysis to propose an advanced Kmeans algorithm, built on a multi-view heterogeneous system, to determine the similarity detection metrics. Then, a BP neural network method is used to predict the missing attribute values, complete the missing data, and restore the structure of the big data in its heterogeneous state. Last, we further propose a data denoising algorithm to denoise the abnormal data. Based on the above methods, we construct a framework, namely BPK-means, to resolve the problems of data abnormalities and missing data. Our approach is evaluated through a rigorous performance evaluation study. Compared with the original algorithm, both theoretical verification and experimental results show that the accuracy of the proposed method is greatly improved.

Highlights

As the carrier of information, data must accurately and reliably reflect the objective things in the real world (Murtagh and Pierre, 2014; Brzezinska and Horyn, 2020)
To address data abnormalities and missing data, and because a BP neural network can predict and detect unknown data well, this paper proposes a BPK-means algorithm that uses a BP neural network to improve the Kmeans algorithm
This paper proposes an improved Kmeans algorithm based on a BP neural network

Summary

INTRODUCTION

As the carrier of information, data must accurately and reliably reflect objects in the real world (Murtagh and Pierre, 2014; Brzezinska and Horyn, 2020). To address data abnormalities and missing data, and because a BP neural network can predict and detect unknown data well, this paper proposes BPK-means, an algorithm that uses a BP neural network to improve the Kmeans algorithm. In BPK-means, the first step uses the BP neural network to complete the missing attributes of data set D. In the second step, if the number of records N is ≤100,000, 60% of the data set is selected as the training sample set. In the third step, a three-layer BP neural network model is constructed, consisting of an input layer, a hidden layer and an output layer. In the fourth step, the S-type (sigmoid) transfer function f(x) is set. The network is then modeled on all the samples selected in the second step: the attributes of the data set are used as input, the number of output nodes is set to 1, and l is used in the design of the hidden layer. For some scenarios with high accuracy requirements, the cost is worth it.
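The missing-attribute completion step described above can be illustrated with a small sketch. This is not the authors' implementation. It is a minimal pure-Python illustration that assumes all attribute values are already scaled to [0, 1] and that a missing value is marked with None. The names select_training_rows, ThreeLayerBP and impute_missing, the hidden-layer size n_hidden, the learning rate and the epoch count are all hypothetical choices made for the example. The summary only gives the 60% rule for N ≤ 100,000, so the sketch applies the same fraction to any N as an assumption.

```python
import math
import random


def sigmoid(x):
    """S-type (logistic) transfer function f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_training_rows(rows, fraction=0.6, seed=0):
    """Randomly pick a fraction of the complete records as the training set.

    Assumption: 60% is used for any N; the source states it for N <= 100,000.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


class ThreeLayerBP:
    """Input layer -> one sigmoid hidden layer -> a single sigmoid output node."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb)))
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, epochs=2000, lr=0.5):
        """Plain stochastic backpropagation on squared error."""
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                delta_out = (out - target) * out * (1.0 - out)
                # Hidden deltas are computed before the output weights change.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                hb = hidden + [1.0]
                for j in range(len(self.w_out)):
                    self.w_out[j] -= lr * delta_out * hb[j]
                xb = list(x) + [1.0]
                for j, ws in enumerate(self.w_hidden):
                    for i in range(len(ws)):
                        ws[i] -= lr * delta_hidden[j] * xb[i]


def impute_missing(records, target_index, n_hidden=4):
    """Fill None values in one column, using the other attributes as input."""
    def inputs(row):
        return [v for i, v in enumerate(row) if i != target_index]

    complete = [r for r in records if r[target_index] is not None]
    samples = [(inputs(r), r[target_index])
               for r in select_training_rows(complete)]
    net = ThreeLayerBP(n_inputs=len(records[0]) - 1, n_hidden=n_hidden)
    net.train(samples)

    filled = []
    for r in records:
        row = list(r)  # copy so the caller's data is left untouched
        if row[target_index] is None:
            row[target_index] = net.predict(inputs(row))
        filled.append(row)
    return filled


if __name__ == "__main__":
    # Toy data: the third attribute is missing in one record.
    data = [[0.1, 0.2, 0.15], [0.4, 0.6, 0.5], [0.8, 0.6, 0.7],
            [0.3, 0.9, None], [0.5, 0.5, 0.5], [0.2, 0.8, 0.5]]
    for row in impute_missing(data, target_index=2):
        print(row)
```

The completed records could then be passed to a clustering routine such as Kmeans. Because the single output node uses a sigmoid, predictions always fall strictly between 0 and 1, which is why the sketch assumes pre-scaled attributes.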

Experimental Setup and Experimental Environment

CONCLUSION

Findings

DATA AVAILABILITY STATEMENT