By ensuring well-developed complex big data platform architectures, data engineers provide data scientists and analysts infrastructure with computational and storage resources to perform their research. Based on such supports, data scientists are provided an opportunity to focus on their domain problems and design the required intelligent modules (i.e., prepare the data; select, train, and tune the machine-learning modules; and validate the results). However, there are still gaps between system engineering and data scientist/engineering teams. Generally, system engineers have limited knowledge on the application domains and the purposes of an analytical program. On the contrary, both data scientists and engineers are usually unfamiliar with the configuration of a computational system, file system, and database. However, the performance of an application can be affected by a system's configuration, and the data scientists and engineers have little information and knowledge about which of the system's properties can affect the application's performance. As a typical example, for Internet-scale applications that have thousands of computing nodes or billions of Internet of Things devices, even a slight improvement may have an enormous influence on energy management and environmental protection issues. To bridge the gap between system engineering and data scientist/engineering teams, we proposed the concept of a configuration layer based on a big data platform, Hadoop. We built a configuration tuner, BigExplorer, to collect and preprocess data. Furthermore, we also created golden configurations for performance improvement. Based on the processed data, we used a semi-automatic feature engineering technique to provide more features for data engineers and developed the performance model using three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector machine). Using the commonly used benchmarks of Word-Count, TeraSort, and Pig workloads, our configuration tuner achieved a significant performance improvement of 28%-51% for different workloads than using the rule-of-thumb configuration.
Read full abstract