Abstract
As Hadoop has gained popularity in the big data era, it has come to be widely used in various fields. Our self-designed and self-developed large-scale network traffic analysis cluster is built on Hadoop and runs off-line applications that analyze massive network traffic data. To evaluate the performance of the analysis cluster scientifically and reasonably, we propose a performance evaluation system. First, we take the execution times of three benchmark applications as the performance benchmark and select 40 metrics from customized statistical resource data. Then we identify the relationship between the resource data and the execution times using a statistical modeling approach composed of principal component analysis and multiple linear regression. After training the models on historical data, we can predict the execution times from current resource data. Finally, we evaluate the performance of the analysis cluster using the validated predictions of execution times. Experimental results show that the execution times predicted by the trained models fall within an acceptable error range, and that the performance evaluation results are accurate and reliable.
Highlights
With the rapid development of cloud computing, Hadoop[1], an advanced big data processing tool, has become the first choice for many researchers and companies.
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms, by linear combination, a set of possibly correlated original indicators into a smaller set of comprehensive, linearly uncorrelated indicators while retaining most of the information in the original indicators[8].
We propose a score criterion based on these three execution times. We also select a typical practical application whose execution time reflects the practical performance of HBLSNTAC.
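The modeling approach named in the abstract, principal component analysis followed by multiple linear regression, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are synthetic, the function names `make_synthetic_data`, `fit_pca_regression`, and `predict` are hypothetical, and the variance threshold used to choose the number of components is an arbitrary choice. Only the figure of 40 resource metrics comes from the text.

```python
import numpy as np


def make_synthetic_data(seed=0, n_samples=200, n_metrics=40, n_factors=3):
    """Synthetic stand-in for resource metrics and execution times (not real cluster data)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, n_factors))      # hidden load factors
    mixing = rng.normal(size=(n_factors, n_metrics))      # how factors show up in metrics
    X = latent @ mixing + 0.01 * rng.normal(size=(n_samples, n_metrics))
    # Execution time in seconds: linear in the hidden factors plus small noise.
    y = 300.0 + latent @ np.array([5.0, -3.0, 2.0]) + 0.01 * rng.normal(size=n_samples)
    return X, y


def fit_pca_regression(X, y, var_kept=0.9):
    """Standardize X, project onto leading principal components, then fit least squares."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                                   # guard against constant metrics
    Z = (X - mean) / std
    # Eigen-decomposition of the covariance of standardized data (symmetric matrix).
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                     # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Smallest number of components whose cumulative variance share reaches var_kept.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, var_kept)) + 1, X.shape[1])
    W = eigvecs[:, :k]
    scores = Z @ W                                        # uncorrelated component scores
    A = np.column_stack([np.ones(len(scores)), scores])   # intercept + components
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mean": mean, "std": std, "W": W, "coef": coef}


def predict(model, X):
    """Predict execution times for new resource data with a fitted model."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["W"]
    return model["coef"][0] + scores @ model["coef"][1:]


if __name__ == "__main__":
    X, y = make_synthetic_data()
    model = fit_pca_regression(X[:150], y[:150], var_kept=0.99)   # "historical" data
    pred = predict(model, X[150:])                                # "current" data
    rel_err = np.abs(pred - y[150:]) / y[150:]
    print("components kept:", model["W"].shape[1])
    print("mean relative error:", float(rel_err.mean()))
```

Regressing on component scores rather than on the raw metrics avoids the instability that strongly correlated predictors cause in ordinary least squares, which is the usual motivation for pairing the two techniques.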
Summary
With the rapid development of cloud computing, Hadoop[1], an advanced big data processing tool, has become the first choice for many researchers and companies. To analyze massive traffic data efficiently, we developed a Hadoop-based Large-scale Network Traffic Analysis Cluster (HBLSNTAC). It consists of one master (running NameNode and ResourceManager), one backup master (serving as the client access server and running SecondaryNameNode), and nine slaves (running DataNode, NodeManager, and ApplicationMaster). Users run various off-line statistical analysis applications on the cluster's massive data. These applications analyze basic statistical characteristics of network traffic, such as the time distribution or geographical distribution of the network behavior of mobile phone users. Finally, we draw conclusions and outline future work.