Abstract

To address the curse of dimensionality when mining high-dimensional data streams, and the poor real-time response and insufficient throughput of existing dimensionality reduction algorithms, we design a scheme for implementing the PCA algorithm on the Storm platform. The scheme implements each branch of the PCA algorithm with Storm's own components, and the components communicate through data flows to form task entities. Distribution and parallelization of the algorithm are achieved by setting the number of threads and the number of processes of each task entity. Experimental results from running the PCA algorithm on Storm and a computer cluster according to the scheme show that PCA on the Storm platform can meet the requirements of real-time dimensionality reduction of data streams.

Introduction

A data stream is a continuous, unpredictable, bursty, rapid, and time-varying sequence of data [1]. One key issue in data stream management is how to reduce the dimensionality of the stream and compress it effectively with limited storage, using a memory scanning method suited to the characteristics of the stream, and how to express the stream's information in compressed form [2]. As a classical linear dimensionality reduction algorithm, PCA is simple and has no parameter restrictions, and it has been widely used in data compression and feature extraction. However, running PCA on a data stream in a stand-alone environment cannot meet the real-time requirement, because throughput is low and computational complexity is high. We therefore address these problems with a distributed computing model and a computer cluster. Storm is an open source framework for distributed real-time computing [3], and it can efficiently handle large data streams.
In this paper, we design a scheme for implementing the PCA algorithm on the Storm platform, and configure a high-performance cluster environment to run PCA on Storm and a computer cluster according to the scheme. The results verify that the scheme is feasible and capable of real-time dimensionality reduction on data streams.

PCA Algorithm

Dimensionality reduction algorithms are mainly divided into linear and nonlinear algorithms. Linear algorithms mainly include PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), etc. PCA has a well-developed theory, a simple concept, convenient calculation, and optimal linear reconstruction error; in addition, it has no parameter restrictions and is widely used in data compression and feature extraction. The principle of PCA is to convert an original random vector with correlated components into a new random vector with uncorrelated components by an orthogonal transformation [4]. For a p-dimensional random vector $x = (x_1, x_2, \ldots, x_p)$, collect n samples $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ ($i = 1, 2, \ldots, n$; $n > p$) to construct the sample matrix X:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

3rd International Conference on Machinery, Materials and Information Technology Applications (ICMMITA 2015). © 2015. The authors. Published by Atlantis Press.

The steps of PCA are as follows:

(1) Standardization. Standardize the sample matrix: first obtain the mean and variance of each column, then obtain the standardized matrix Z from the means and standard deviations.

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p \tag{1}$$

The means and variances are calculated by:

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2.$$

(2) Correlation matrix calculation. Obtain the correlation coefficient matrix of the standardized matrix Z. The correlation coefficient can be calculated as:
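
The formula for this step is not given in this text. As an illustration only, steps (1) and (2) might be sketched in plain Python as follows. The sketch assumes the sample variance with an n − 1 divisor and a correlation matrix formed as Z^T Z / (n − 1), which is one common convention and not necessarily the authors' exact definition. The function names `standardize` and `correlation_matrix` and the toy data are hypothetical.

```python
import math


def standardize(X):
    """Return column-wise z-scores of X (a list of n rows, each with p values).

    Assumption: the standard deviation uses the n - 1 divisor.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
        for j in range(p)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]


def correlation_matrix(Z):
    """Return the p x p matrix Z^T Z / (n - 1) for a standardized matrix Z."""
    n, p = len(Z), len(Z[0])
    return [
        [sum(row[a] * row[b] for row in Z) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


# Toy sample matrix with n = 4 samples and p = 2 dimensions (n > p).
X = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
Z = standardize(X)
R = correlation_matrix(Z)
```

With these conventions each diagonal entry of `R` is 1 and `R` is symmetric, so only the upper triangle needs to be computed in practice.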
