Abstract

Developers of resource-allocation and scheduling algorithms share test datasets (i.e., benchmarks) to enable others to compare the performance of newly developed algorithms. However, real cloud datasets are often hard to acquire because of users’ data-confidentiality concerns and the policies maintained by Cloud Service Providers (CSPs). Access to large-scale test datasets that depict the realistic high-performance computing requirements of cloud users is therefore very limited, and a publicly available, realistic cloud dataset would significantly help other researchers to compare and benchmark their applications against an open-source benchmark. To meet these objectives, the contemporary state of the art is scrutinized to explore real workload behavior in the Google cluster traces. The dataset-generation process is then demonstrated, from small- to moderate-size cloud computing infrastructures, using the Monte Carlo simulation method to produce the Google Cloud Jobs (GoCJ) dataset based on an analysis of the Google cluster traces. With this article, the dataset is made publicly available so that other researchers in the field can investigate and benchmark their scheduling and resource-allocation schemes for the cloud. The GoCJ dataset is archived and available in the Mendeley Data repository.

Highlights

  • Developers of resource-allocation and scheduling algorithms share test datasets to enable others to compare the performance of newly developed algorithms

  • Datasets are becoming increasingly more pertinent when executing the performance assessment of cloud-scheduling, resource-allocation, and load-balancing algorithms used for eagle-eyed examination of efficiency and performance in a real-world cloud

  • Real cloud workload is hard to acquire for performance analysis and investigation due to the users’ data confidentiality and policies maintained by Cloud Service Providers (CSPs)

Summary

Datasets are becoming increasingly pertinent to the performance assessment of cloud-scheduling, resource-allocation, and load-balancing algorithms, enabling close examination of their efficiency and performance in a real-world cloud. The cloud provides its services as a platform or infrastructure for the real-time deployment, execution, or simulation of computation-hungry applications, e.g., big network-traffic data visualization [8], multi-threaded learning control mechanisms for neural networks [9], performance tests on merge sort and recursive merge sort for big-data processing [10], and parallelization of modified merge sort algorithms [11]. The GoCJ dataset reflects real workload behavior as perceived in the Google cluster traces [21,22,23,24,25,26] and in MapReduce logs from the M45 supercomputing cluster, so it is of particular significance and usefulness to researchers working on the scheduling of cluster- and cloud-based applications. The GoCJ dataset can serve as an alternative benchmark workload for evaluating scheduling and resource-allocation mechanisms with realistic HPC jobs in cloud computing; a minimal generation sketch is given below. The last section concludes the paper and identifies future directions for the GoCJ dataset.
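As an illustration of the Monte Carlo approach described above, the following is a minimal Python sketch of a GoCJ-style job-size generator. The class labels, MI ranges, and shares used here are assumptions for demonstration only and do not reproduce the published GoCJ distribution; the archived dataset and generator tool on Mendeley Data remain the authoritative source.

```python
import random

# Hypothetical job-size classes: (label, min MI, max MI, assumed share of jobs).
# These values are illustrative placeholders, not the published GoCJ parameters.
JOB_CLASSES = [
    ("small",        15_000,  55_000, 0.20),
    ("medium",       59_000,  99_000, 0.40),
    ("large",       101_000, 135_000, 0.30),
    ("extra-large", 150_000, 337_500, 0.04),
    ("huge",        525_000, 900_000, 0.06),
]

def generate_gocj_like_jobs(num_jobs, seed=None):
    """Monte Carlo sampling of job sizes in Million Instructions (MI):
    pick a class according to its assumed share, then draw a size
    uniformly at random within that class's MI range."""
    rng = random.Random(seed)
    weights = [share for _, _, _, share in JOB_CLASSES]
    jobs = []
    for _ in range(num_jobs):
        _, low, high, _ = rng.choices(JOB_CLASSES, weights=weights, k=1)[0]
        jobs.append(rng.randint(low, high))
    return jobs

if __name__ == "__main__":
    # One job size per line, mirroring the plain-text layout of GoCJ-style files.
    with open("GoCJ_like_100.txt", "w") as out:
        out.write("\n".join(str(mi) for mi in generate_gocj_like_jobs(100, seed=1)))
```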

Data Description
Data accessibility
Data Acquisition for Original Dataset
Reproduction of GoCJ Realistic Dataset
GoCJ Dataset Generator Tool
Data Distribution and Complexity of GoCJ Generator
Findings
User Notes