Abstract

As demand for data processing and Hadoop data center construction grows, the performance of Hadoop data centers is often limited by inappropriate resource utilization. This paper introduces a new method for predicting resource utilization in large‐scale Hadoop clusters. The method adopts a two‐step model comprising performance simulation of Hadoop applications and resource utilization prediction. For the first step, a new simulator that integrates baseline tests with a multilayered network model is introduced and implemented. For the second step, a resource utilization predictor is proposed: by analyzing the pattern of resource utilization, a single‐task model is derived, and a parallel‐batch‐task‐based (PBT) model, which represents the behavior of real Hadoop applications by integrating the single‐task model, is introduced. Two test scenarios are configured to verify the method. In the data center scenario, Terasort, Wordcount, and Hive are selected as benchmarks; in the virtual machine scenario, Terasort is used as the benchmark. The experiments show that, in most cases, the error between the simulator results and the results from the experimental environment is less than 10%. The results confirm that the method can locate the resource bottleneck of a Hadoop cluster and supports agile cluster configuration for applications with massive data.
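
The accuracy claim above compares simulated values against measured ones. As a minimal sketch, assuming the error is a relative error with the measured value as the reference (the abstract does not state the exact definition), such a check could look like the following. The function names and the 10% default are illustrative assumptions, not the paper's implementation.

```python
def relative_error(simulated: float, measured: float) -> float:
    """Return |simulated - measured| / measured as a fraction.

    Assumption: the measured (experimental) value is the reference.
    """
    if measured <= 0:
        raise ValueError("measured value must be positive")
    return abs(simulated - measured) / measured


def within_tolerance(simulated: float, measured: float,
                     tolerance: float = 0.10) -> bool:
    """True if the simulated value deviates by less than `tolerance`."""
    return relative_error(simulated, measured) < tolerance
```

For example, with hypothetical values, a simulated job time of 95 s against a measured 100 s gives a relative error of 0.05, which falls inside a 10% tolerance.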
