With the development of information technology, large amounts of data are generated every day in fields such as engineering, healthcare, finance, anomaly detection, image recognition, and artificial intelligence. This massive volume of data poses the challenge of accurate analysis and appropriate classification. Traditional clustering methods require the number of clusters to be specified in advance and are mostly distance-based, so they cannot effectively account for the correlations between different indicators in high-dimensional, multi-source data. Moreover, the number of clusters cannot adjust automatically when new data are generated. To improve the clustering analysis of high-dimensional, multi-source data in a big data environment, this study uses non-parametric mixture models based on distribution clustering, which do not require the number of clusters to be specified and can update automatically with the data. By combining Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE) with the non-parametric Bayesian Dirichlet Process Mixture Model (DPMM), a Bayesian non-parametric PCA model (PCA-DPMM) and a Bayesian non-parametric t-SNE model (TSNE-DPMM) are proposed. The Chinese restaurant process of the DPMM is used for sampling by introducing a finite normal mixture distribution. The clustering results on the iris dataset are compared and analyzed. DPMM and TSNE-DPMM reach a maximum accuracy of 0.97, while PCA-DPMM reaches a maximum accuracy of only 0.94. Across different numbers of iterations, the accuracy of TSNE-DPMM ranges from 0.92 to 0.97, that of DPMM from 0.66 to 0.97, and that of PCA-DPMM from 0.73 to 0.94. Therefore, the proposed TSNE-DPMM maintains accuracy while exhibiting better stability in its clustering results. Future research can explore improving the model by incorporating deep learning algorithms, among other approaches, to further enhance its performance.
Applying the TSNE-DPMM model to data analysis in other fields is another direction for future research. These efforts should help address the challenges of analyzing high-dimensional, multi-source data in a big data environment and of extracting valuable information from it.
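To illustrate why a Chinese restaurant process does not need the number of clusters fixed in advance, the following is a minimal sketch of drawing cluster assignments from the process prior alone. It is not the study's sampler: a full DPMM would also weight each choice by the likelihood of the data point under each cluster's normal component. The function name `sample_crp`, the concentration value `alpha=1.0`, and the choice of 150 points (the size of the iris dataset) are assumptions made for illustration.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Draw one set of table (cluster) assignments from a CRP prior.

    Hypothetical helper for illustration; `alpha` (> 0) is the
    concentration parameter controlling how readily new tables open.
    """
    rng = random.Random(seed)
    assignments = []  # table index chosen by each customer
    table_sizes = []  # number of customers currently at each table
    for _ in range(n_customers):
        # An existing table k is chosen with weight n_k (its current size);
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)  # open a new table (new cluster)
        else:
            table_sizes[k] += 1    # join an existing table
        assignments.append(k)
    return assignments, table_sizes


# Assumed settings: 150 points, alpha = 1.0.
assignments, table_sizes = sample_crp(n_customers=150, alpha=1.0)
print("number of clusters:", len(table_sizes))
print("cluster sizes:", table_sizes)
```

The number of occupied tables is an outcome of the draw rather than an input, and it can grow as more customers arrive, which is the property the abstract relies on when it says the number of clusters can update with the data.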