Abstract

Apache Flink, an increasingly popular big data framework that fuses batch and stream processing, has many configuration parameters that are performance-critical yet untamed. How to tune them for optimal performance has not yet been explored. Machine learning (ML) has been used to tune the configurations of other big data frameworks (e.g., Apache Spark), with significant performance improvements. However, ML-based tuning inherently requires a long time to collect a large amount of training data. In this article, we propose a guided machine learning (GML) approach that tunes the configurations of Flink while requiring significantly less time to collect training data than traditional ML approaches. GML introduces two techniques. First, it leverages generative adversarial networks (GANs) to generate part of the training data, reducing the time needed for training data collection. Second, GML guides an ML algorithm to select configurations whose corresponding performance is higher than the average performance of random configurations. We evaluate GML on a lab cluster with 4 servers and on a real production cluster at an internet company. The results show that GML significantly outperforms DAC (Datasize-Aware-Configuration) (Z. Yu et al. 2018), the state-of-the-art approach for tuning the configurations of Spark, reducing data collection time by 2.4× while also reducing 99th percentile latency by 30 percent. When GML is used at the internet company, it reduces latency by up to 57.8× compared to the configurations made by the company.
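
The second technique, guiding the search toward configurations that beat the average of random configurations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `guided_select`, the stand-in performance model `toy_latency_model`, and the configuration keys `parallelism` and `buffer_timeout_ms` are all hypothetical names chosen for illustration. Because the metric here is latency, "higher performance" is read as "lower predicted latency".

```python
from statistics import mean


def guided_select(candidates, predict_latency, random_baseline):
    """Keep candidates predicted to beat the mean latency of random configs.

    Assumption: predict_latency is any callable mapping a config dict to a
    predicted latency (lower is better). In practice it would be a trained model.
    """
    threshold = mean(predict_latency(c) for c in random_baseline)
    return [c for c in candidates if predict_latency(c) < threshold]


def toy_latency_model(config):
    """Hypothetical stand-in for a trained performance model (illustrative only)."""
    return 100.0 / config["parallelism"] + 0.5 * config["buffer_timeout_ms"]


# Randomly sampled configurations would normally form the baseline;
# fixed values are used here so the example is deterministic.
baseline = [
    {"parallelism": 1, "buffer_timeout_ms": 0},   # predicted latency 100.0
    {"parallelism": 10, "buffer_timeout_ms": 0},  # predicted latency 10.0
]
candidates = [
    {"parallelism": 2, "buffer_timeout_ms": 0},    # predicted latency 50.0
    {"parallelism": 1, "buffer_timeout_ms": 100},  # predicted latency 150.0
]

selected = guided_select(candidates, toy_latency_model, baseline)
print(selected)
```

With the baseline above, the mean predicted latency is 55.0, so only the first candidate is kept. The threshold filter is the simplest possible reading of "guiding"; the approach described in the abstract may steer the ML algorithm in a more elaborate way.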
