Abstract
In the realm of big data, where datasets of immense scale pose processing challenges, distributed processing platforms such as the open-source Apache Spark have emerged to address these issues. Spark's internal configuration parameters affect execution time to varying degrees depending on job characteristics, which makes manual tuning daunting. The core focus of this study is optimizing Spark's internal configurations, with specific attention to three types of workloads: iterative-intensive, memory-intensive, and CPU-intensive. Applying Grid Search, Random Search, and Evolutionary Optimization algorithms yields substantial execution time reductions: 23.24% with Grid Search, 19.71% with Random Search, and 23.06% with Evolutionary Optimization. Notably, Evolutionary Optimization reaches near-optimal configurations approximately 29% faster than Grid Search. While Random Search and Evolutionary Optimization require similar search time, Random Search achieves a smaller execution time reduction for a given Spark workload. This research sheds light on the intricacies of algorithmic configuration tuning and its influence on Spark workload execution times, contributing to the broader effort of optimizing big data processing platforms.
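To make the tuning setup concrete, the sketch below shows how a Random Search over a Spark configuration space might be structured. The parameter names are standard Spark settings, but the search space values and the cost function are entirely illustrative assumptions: a real study would submit each sampled configuration to a cluster and measure the actual job execution time, whereas here a synthetic stand-in is used so the example is self-contained.

```python
import random

# Illustrative search space over a few well-known Spark parameters.
# The candidate values are assumptions, not recommendations.
SEARCH_SPACE = {
    "spark.executor.memory": ["2g", "4g", "8g"],
    "spark.executor.cores": [2, 4, 8],
    "spark.sql.shuffle.partitions": [100, 200, 400],
    "spark.default.parallelism": [50, 100, 200],
}


def run_workload(config):
    """Synthetic stand-in for submitting a Spark job and timing it.

    In practice this would launch the workload with the given
    configuration and return the measured execution time in seconds.
    """
    cost = 100.0
    cost -= {"2g": 0.0, "4g": 10.0, "8g": 15.0}[config["spark.executor.memory"]]
    cost -= config["spark.executor.cores"] * 2.0
    cost += abs(config["spark.sql.shuffle.partitions"] - 200) * 0.05
    return cost


def random_search(n_trials=30, seed=42):
    """Sample n_trials random configurations and keep the fastest one."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        elapsed = run_workload(config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time
```

Grid Search would instead enumerate the full Cartesian product of the candidate values (exhaustive but slow), and Evolutionary Optimization would mutate and recombine promising configurations across generations, which is consistent with the paper's finding that it converges faster than Grid Search at a similar quality.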