Abstract

As Hadoop has gained popularity in the big data era, it has come to be widely used in various fields. Our self-designed and self-developed large-scale network traffic analysis cluster is built on Hadoop and runs off-line applications that analyze massive network traffic data. To evaluate the performance of the analysis cluster scientifically and reasonably, we propose a performance evaluation system. First, we take the execution times of three benchmark applications as the performance benchmark and select 40 metrics of customized statistical resource data. Then we identify the relationship between the resource data and the execution times with a statistical modeling approach that combines principal component analysis and multiple linear regression. After training the models on historical data, we can predict the execution times from current resource data. Finally, we evaluate the performance of the analysis cluster using the validated predictions of execution times. Experimental results show that the execution times predicted by the trained models fall within an acceptable error range, and that the performance evaluation results are accurate and reliable.
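
The modeling step described in the abstract (principal component analysis followed by multiple linear regression) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the helper names `fit_pcr` and `predict_pcr` are hypothetical, and the 95% retained-variance threshold is an assumption chosen only for illustration. NumPy is assumed to be available.

```python
import numpy as np

def fit_pcr(X, y, var_keep=0.95):
    """Fit PCA on standardized metrics, then regress y on the retained components.

    X: (n_samples, n_metrics) resource data; y: (n_samples,) execution times.
    var_keep is an assumed threshold for the cumulative explained variance.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant metrics
    Z = (X - mu) / sigma             # standardize each metric

    # PCA via SVD: rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), var_keep) + 1)
    k = min(k, len(s))
    W = Vt[:k].T                     # loadings, shape (n_metrics, k)

    # Multiple linear regression of y on the component scores (with intercept).
    scores = Z @ W
    A = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mu": mu, "sigma": sigma, "W": W, "coef": coef}

def predict_pcr(model, X):
    """Predict execution times for new resource data with a fitted model."""
    Z = (X - model["mu"]) / model["sigma"]
    scores = Z @ model["W"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic stand-in for historical data: 40 correlated metrics driven by
# 3 hidden factors, and an execution time that depends on those factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 40))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 40))
y = 120.0 + latent @ np.array([15.0, -8.0, 5.0]) + rng.normal(scale=1.0, size=200)

model = fit_pcr(X[:150], y[:150])            # train on "historical" samples
pred = predict_pcr(model, X[150:])           # predict for "current" samples
rel_err = np.mean(np.abs(pred - y[150:]) / y[150:])
print(f"components kept: {model['W'].shape[1]}, mean relative error: {rel_err:.4f}")
```

Regressing on component scores rather than on all 40 raw metrics avoids the instability that correlated predictors cause in ordinary least squares, which is the usual motivation for pairing the two techniques.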

Highlights

  • With the rapid development of cloud computing, Hadoop [1], an advanced big data processing tool, has become the first choice for many researchers and companies

  • Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of possibly correlated original indicators, by linear combination, into a smaller set of comprehensive, linearly uncorrelated indicators that retain most of the information of the original indicators [8]

  • We propose a score criterion based on these three execution times; we select a typical practical application whose execution time reflects the practical performance of HBLSNTAC


Summary

Introduction

With the rapid development of cloud computing, Hadoop [1], an advanced big data processing tool, has become the first choice for many researchers and companies. To analyze massive traffic data efficiently, we developed a Hadoop-based Large-scale Network Traffic Analysis Cluster (HBLSNTAC). It consists of one master (running NameNode and ResourceManager), one backup master (which also works as the client access server and runs SecondaryNameNode), and nine slaves (running DataNode, NodeManager, and ApplicationMaster). Users run various off-line statistical analysis applications over the massive data on the cluster. These applications analyze the basic statistical characteristics of network traffic, such as the time distribution or geographical distribution of the network behavior of mobile phone users. Finally, we draw conclusions and give an outlook on future work.

The Design of Performance Evaluation System
Related Work
Performance Benchmark
Statistical Resource Data
Modeling Analysis
Principal Component Analysis
Multiple Linear Regression
Performance Evaluation
Experiment Setup
Modeling and Prediction
Prediction Verification
Multiple Practical Applications
One Single Practical Application
Summary
Conclusion and Future Work
Findings
Microsoft IT SES Enterprise Data Architect Team