Application test of PCA based improved K-means clustering algorithm in analyzing NGO assistance needs in less developed countries

Linhui Qin

doi:10.54254/2755-2721/27/20230101

Abstract

Data volumes today grow at the petabyte (PB) and exabyte (EB) scale, and in this era of big data much of what is collected is unlabeled or unstructured. Compared with the more complex supervised learning, unsupervised learning on unlabeled data therefore has great potential and value for social development. K-means is one of the most commonly used clustering algorithms in unsupervised learning, but it has shortcomings. Because it measures distance from arithmetic means, the attributes of the data set must first be converted to numeric types. In addition, different random selections of the initial cluster centers influence the final clustering result to some degree and can ultimately lead to large deviations in the decisions based on it, especially for noisy, multidimensional, nonlinear social big data. To address this problem, this paper tests the application of a PCA-based improved K-means clustering algorithm to analyzing NGO assistance needs in less developed countries. First, national data for 167 less developed countries are read and cleaned. Second, the data are visualized and re-scaled in preparation for analysis. Principal component analysis (PCA) is then used to analyze and handle outliers. Clustering tendency is analyzed by combining the score from the Hopkins statistical test with a K-means model, which yields a list of the countries ultimately in need of assistance. Finally, the tests show that PCA-based data cleaning can effectively reduce noise in the data and improve the clustering result.
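The steps named in the abstract (re-scaling, PCA, the Hopkins statistic, K-means) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the data are synthetic, the number of features (5), the number of retained components (2), the number of clusters (3), the Hopkins sample size (20), and all function names are assumptions made for the example. Conventions for the Hopkins statistic also vary; the variant below is the one in which values near 0.5 suggest roughly uniform data and values near 1 suggest clustered data.

```python
import numpy as np

def standardize(X):
    """Re-scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def hopkins(X, m, rng):
    """Hopkins statistic from m sampled points and m uniform points."""
    n, d = X.shape
    idx = rng.choice(n, size=m, replace=False)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    # u: distance from each uniform point to its nearest real point
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # w: distance from each sampled real point to its nearest other real point
    dist = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), idx] = np.inf  # exclude the point itself
    w = dist.min(axis=1)
    return u.sum() / (u.sum() + w.sum())

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for the country table: 167 rows, 5 hypothetical indicators
rng = np.random.default_rng(0)
blob_centers = rng.normal(scale=5.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(n, 5))
               for c, n in zip(blob_centers, (60, 60, 47))])

Z = pca(standardize(X), n_components=2)   # re-scale, then reduce dimension
H = hopkins(Z, m=20, rng=rng)             # clustering tendency score
labels, centers = kmeans(Z, k=3, rng=rng) # cluster in the reduced space
print(f"Hopkins statistic: {H:.3f}")
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the initial centers are drawn at random, rerunning `kmeans` with a different seed can give different clusters, which is the sensitivity the abstract describes; clustering in the lower-dimensional PCA space is one way to reduce the influence of noise on that outcome.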

Full Text