Abstract

As Cloud-based solutions have become more usable for users with very different needs, the volume of generated data keeps growing. These users range from scientists who process large data sets collected from sensors, to business analysts who make decisions based on large amounts of gathered data, to ordinary users who store or share documents via a Cloud platform. For example, ATLAS and the other detectors at CERN generate petabytes of data, and Facebook stores around 600 TB of data per day. In this context, efficient scheduling for Big Data applications is a challenge, and each type of incoming request calls for an appropriate scheduling technique. In this paper we propose a scheduling algorithm for different types of computation requests: independent tasks, following the bag-of-tasks (BoT) model, or tasks with dependencies, modeled as directed acyclic graphs (DAGs). These requests are scheduled for execution in a Cloud datacenter. The tasks in each request are scheduled on the available resources using the scheduling algorithm best suited to that request. We rely on a machine learning toolbox named MLBox to determine which algorithm should be used for a given request. We implemented four heuristics for scheduling BoTs and four heuristics for scheduling DAGs, and generated the training data for the machine learning algorithm by running multiple traditional scheduling algorithms and selecting the ‘best’ one for each request. We evaluate performance by comparing the schedules produced for different task requests by some of the traditional algorithms with those produced by our machine-learning-based scheduling algorithm.
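The training-data step described above (run several heuristics on a request, keep the 'best' one as the label) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the four BoT heuristics, the request features, or the criterion for 'best', so the sketch uses three simple task orderings, a handful of hypothetical features, and makespan on identical machines as the quality measure.

```python
from typing import Callable, Dict, List, Tuple


def makespan(order: List[float], n_machines: int) -> float:
    """Greedy list scheduling: each task goes to the currently least-loaded machine."""
    loads = [0.0] * n_machines
    for duration in order:
        idx = loads.index(min(loads))
        loads[idx] += duration
    return max(loads)


# Hypothetical stand-ins for the paper's BoT heuristics (not named in the abstract).
HEURISTICS: Dict[str, Callable[[List[float]], List[float]]] = {
    "fifo": lambda tasks: list(tasks),
    "shortest_first": lambda tasks: sorted(tasks),
    "longest_first": lambda tasks: sorted(tasks, reverse=True),
}


def label_request(tasks: List[float], n_machines: int) -> Tuple[List[float], str]:
    """Run every heuristic on one request and return (features, best heuristic name)."""
    results = {
        name: makespan(order(tasks), n_machines)
        for name, order in HEURISTICS.items()
    }
    # Assumption: 'best' means smallest makespan; ties go to the first heuristic listed.
    best = min(results, key=results.get)
    # Hypothetical features describing the request.
    features = [len(tasks), sum(tasks) / len(tasks), max(tasks), n_machines]
    return features, best


if __name__ == "__main__":
    features, best = label_request([2, 3, 7], n_machines=2)
    print(features, best)
```

Each `(features, best)` pair would form one training example; a classifier trained on many such pairs could then predict which heuristic to apply to a new request without running all of them.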
