Abstract

In this paper, we propose a technique for choosing the optimal normalization method for high-dimensional data at the preprocessing stage. It is well known that the quality of data preprocessing significantly influences subsequent processing steps such as classification, clustering, and forecasting. In our research, we used both the Shannon entropy and the relative ratio of Shannon entropy as the main criteria for evaluating normalization quality. Before applying cluster analysis, we reduce the data dimensionality using principal component analysis. The reduced data were then clustered with the fuzzy C-means algorithm, and the clustering quality was evaluated for each normalization method. The simulation results allow us to conclude that, for this type of data (gene expression profiles), the decimal scaling method is optimal, since it yields the minimal Shannon entropy of the investigated data compared with the other normalization methods. Moreover, the relative ratio of Shannon entropy does not exceed the permissible limits when the data dimensionality is reduced by principal component analysis.

Keywords: High dimensional data clustering; Shannon entropy; Data preprocessing; Normalizing; Optimization; Principal component analysis
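The two preprocessing steps named above, decimal scaling and principal component analysis, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `decimal_scaling` and `pca_reduce`, the per-column scaling, the number of retained components, and the synthetic data are all assumptions made for illustration. The entropy criteria and the fuzzy C-means step are left out, because the abstract does not specify how they were computed.

```python
import numpy as np


def decimal_scaling(x):
    """Decimal scaling: divide each column by 10**j, where j is the
    smallest integer such that max(|column|) / 10**j < 1.
    Per-column scaling is an assumption of this sketch."""
    x = np.asarray(x, dtype=float)
    max_abs = np.max(np.abs(x), axis=0)
    j = np.zeros_like(max_abs)
    nonzero = max_abs > 0  # all-zero columns are left unscaled
    j[nonzero] = np.floor(np.log10(max_abs[nonzero])) + 1
    return x / 10.0 ** j


def pca_reduce(x, n_components):
    """Project rows of x onto the first n_components principal
    components, computed via SVD of the mean-centered data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in for expression profiles: 20 objects, 50 features.
rng = np.random.default_rng(0)
data = rng.normal(scale=500.0, size=(20, 50))

normalized = decimal_scaling(data)       # every value now lies in (-1, 1)
reduced = pca_reduce(normalized, 3)      # 3 components: arbitrary choice
```

After this step, `reduced` would be the input to whichever clustering routine is used; comparing normalization methods would then mean repeating the pipeline with a different scaling function in place of `decimal_scaling`.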
