Abstract
Big data analytics (BDA) applications are a new category of software applications that process large amounts of data on scalable parallel processing infrastructure to extract hidden value. Hadoop is the most mature open-source big data analytics framework; it implements the MapReduce programming model and processes big data as MapReduce jobs. Big data analytics jobs often run continuously and are not isolated from one another. Existing work mainly focuses on executing jobs in sequence, which is often inefficient and energy-intensive. In this paper, we propose a genetic algorithm-based job scheduling model for big data analytics applications to improve the efficiency of big data analytics. To implement the job scheduling model, we use an estimation module to predict the performance of clusters when executing analytics jobs. We have evaluated the proposed job scheduling model in terms of feasibility and accuracy.
Highlights
Big data analytics (BDA) applications are a new category of software applications that process large amounts of data using scalable parallel processing infrastructure to obtain hidden value.
Hadoop [1] is the most mature open-source big data analytics framework, which implements the MapReduce programming model [2] proposed by Google in 2004 to process big data.
The performance of a big data analytics application depends directly on the characteristics of its jobs and the configuration of the clusters that run them.
Summary
Big data analytics (BDA) applications are a new category of software applications that process large amounts of data using scalable parallel processing infrastructure to obtain hidden value. We propose an estimation module that predicts the performance of Hadoop clusters when executing different big data analytics jobs; its predictions can be used by genetic algorithms (GAs). With the information the estimation module provides, we present a genetic algorithm-based job scheduling model for geo-distributed data.

Berlinska and Drozdowski [8] propose a mathematical model of MapReduce and analyze MapReduce distributed computations as a divisible load scheduling problem, but they do not consider system constraints. Han et al. [12] proposed a Hadoop performance prediction model. It does not consider the data preparation phase addressed in this work, the prediction model is hard to interpret, and cost is not taken into consideration.
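The idea of a genetic algorithm that schedules jobs using predicted run times can be illustrated with a minimal sketch. It is not the paper's model: the run-time table `EST` stands in for the output of the estimation module and contains made-up numbers, the fitness function is plain makespan (the finish time of the busiest cluster), and the GA operators (elitism, one-point crossover, per-gene mutation) and all names are assumptions chosen for brevity. Energy cost and data locality, which the paper is concerned with, are left out.

```python
import random

# Hypothetical estimated run times in seconds: EST[job][cluster].
# In the paper's setting these would come from the estimation module;
# the values below are invented purely for illustration.
EST = [
    [30, 45, 60],
    [50, 20, 40],
    [25, 35, 15],
    [70, 60, 30],
    [20, 25, 55],
    [40, 30, 35],
]
N_JOBS, N_CLUSTERS = len(EST), len(EST[0])


def makespan(assign):
    """Finish time of the busiest cluster; assign[job] is a cluster index."""
    load = [0] * N_CLUSTERS
    for job, cluster in enumerate(assign):
        load[cluster] += EST[job][cluster]
    return max(load)


def schedule(pop_size=30, generations=100, mutation_rate=0.1, seed=0):
    """Search for a low-makespan assignment with a simple elitist GA."""
    rng = random.Random(seed)
    pop = [[rng.randrange(N_CLUSTERS) for _ in range(N_JOBS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        elite = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)       # pick two distinct parents
            cut = rng.randrange(1, N_JOBS)    # one-point crossover
            child = a[:cut] + b[cut:]
            for j in range(N_JOBS):           # per-gene mutation
                if rng.random() < mutation_rate:
                    child[j] = rng.randrange(N_CLUSTERS)
            children.append(child)
        pop = elite + children
    return min(pop, key=makespan)


best = schedule()
print("assignment:", best, "estimated makespan:", makespan(best))
```

Each chromosome is a list with one gene per job, so swapping in a different fitness function (for example, one that also weighs predicted energy use) only requires replacing `makespan`.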
More From: EURASIP Journal on Wireless Communications and Networking