Abstract

Big data has become part of everyday life for many people. As society moves into the big data era, data about people's lives are continuously collected, analyzed, and applied. Behind the scenes, server clusters must process hundreds of millions of records every day, so it is important to choose the right big data processing platform and algorithm for each kind of dataset. Classification algorithms are central to this work: in data mining, they screen and classify existing data to build a classification model or function. Such a model maps given data to a specified category and can be used to predict the trend of future data, which reduces the difficulty of the work and improves efficiency. This paper optimizes a classical classification algorithm, KNN, and designs a new normalized algorithm called PEWM_G KNN. From the perspective of distance measurement, we replace the traditional Euclidean metric with the Pearson correlation coefficient. We then refine the treatment of the datasets' attribute values by introducing the entropy weight method, which we combine with Pearson's measure to optimize the distance calculation. After the K value is fixed, we add a Gaussian function to carry out the class selection. We compare the effect of each step and test datasets of different data types and sizes in order to evaluate the algorithm under different scenarios. The datasets used are Iris, Breast Cancer, Dry Bean, and HTRU2, all from the University of California, Irvine.
Finally, we analyze the effect of different system configuration parameters on the prediction rate and time. The experimental results show that, compared with the original KNN algorithm, PEWM_G KNN yields a greater improvement on datasets with more complex attribute values and more records. Moreover, optimizing the platform parameters improves the algorithms' prediction rates and reduces the time required. We tested PEWM_G KNN on the Hadoop platform, configured with HDFS, Hadoop-YARN, ZooKeeper, Hadoop-HA, and MapReduce.
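
As a rough illustration of the three steps named above (entropy weights, a Pearson-based distance, and Gaussian-weighted selection among the K neighbors), the following Python sketch may help. The abstract does not state the exact formulas, so everything here is an assumption rather than the authors' implementation: the weighted form of the Pearson coefficient, the distance `1 - r`, the kernel width `sigma`, and the function names `entropy_weights`, `pearson_distance`, and `predict` are all hypothetical.

```python
import math
from collections import defaultdict


def entropy_weights(rows):
    """Entropy weight method: one weight per attribute, summing to 1.

    Assumes at least two rows. Attributes whose (min-max scaled) values are
    spread less evenly across the records receive larger weights.
    """
    n, m = len(rows), len(rows[0])
    diversities = []
    for j in range(m):
        col = [row[j] for row in rows]
        lo, hi = min(col), max(col)
        span = hi - lo
        # Min-max scale; a constant column is treated as uninformative.
        scaled = [(v - lo) / span if span else 1.0 for v in col]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        diversities.append(max(0.0, 1.0 - entropy))
    norm = sum(diversities)
    if norm == 0:
        return [1.0 / m] * m  # fall back to equal weights (assumption)
    return [d / norm for d in diversities]


def pearson_distance(a, b, w):
    """Distance 1 - r, where r is a weighted Pearson correlation (assumed form)."""
    sw = sum(w)
    mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
    mean_b = sum(wi * bi for wi, bi in zip(w, b)) / sw
    cov = sum(wi * (ai - mean_a) * (bi - mean_b) for wi, ai, bi in zip(w, a, b))
    var_a = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
    var_b = sum(wi * (bi - mean_b) ** 2 for wi, bi in zip(w, b))
    denom = math.sqrt(var_a * var_b)
    if denom == 0:
        return 1.0  # correlation undefined; treated as uncorrelated (assumption)
    return 1.0 - cov / denom


def predict(train_x, train_y, query, k=3, sigma=1.0):
    """Classify `query` by Gaussian-weighted voting among its k nearest neighbors."""
    w = entropy_weights(train_x)
    neighbors = sorted(
        ((pearson_distance(x, query, w), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        # Closer neighbors contribute more to their class.
        votes[label] += math.exp(-dist ** 2 / (2 * sigma ** 2))
    return max(votes, key=votes.get)


# Toy usage with made-up data (not one of the UCI datasets).
train_x = [[1, 2, 3, 4], [2, 3, 4, 5.5], [4, 3, 2, 1], [5, 4, 3, 1.5]]
train_y = ["A", "A", "B", "B"]
print(predict(train_x, train_y, [1, 2, 3, 5], k=3))
```

Because a Pearson-based distance compares the shape of two attribute vectors rather than their magnitudes, it behaves differently from the Euclidean metric on unscaled data. That may be one reason for preferring it, but the sketch makes no claim about the accuracy or timing results reported in the paper.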

