LPG-model: A novel model for throughput prediction in stream processing, using a light gradient boosting machine, incremental principal component analysis, and deep gated recurrent unit network

Zheng Chu, Jiong Yu, Askar Hamdulla

doi:10.1016/j.ins.2020.05.042

Abstract

In recent years, the volume and velocity of streaming data have increased rapidly, and with them the number of real-time processing scenarios. Stream processing tasks face major challenges in load optimization, task scheduling, and resource management, and throughput prediction is a key technology in each of these areas. To predict the throughput of stream processing tasks accurately and efficiently, we propose a novel model named the LPG-model. It has three main components: a light gradient boosting machine (LightGBM), incremental principal component analysis (IPCA), and an evolving deep gated recurrent unit (GRU) network. Unlike existing state-of-the-art models, the LPG-model offers not only a network structure adaptation mechanism (hidden layer adaptation) but also feature processing mechanisms for streaming data. Data preprocessing handles missing values through an incremental interpolation mechanism and normalizes features through two incremental normalization mechanisms. LightGBM and IPCA together provide an efficient dimensionality reduction mechanism that improves the prediction efficiency of the LPG-model. The hidden layer growing mechanism of the evolving deep GRU network learns new knowledge from data streams while retaining previous knowledge, and the network also captures the temporal aspects of the data streams. Experimental results on four open-source benchmarks show that, under the prequential test-then-train protocol, the LPG-model is more accurate and efficient than state-of-the-art algorithms and networks. This demonstrates the effectiveness of the LPG-model for throughput prediction in stream processing tasks.
Furthermore, numerical results on standard data stream benchmark problems indicate that the LPG-model has the potential to reduce execution time on high-dimensional data streams while maintaining high classification accuracy.
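
To make the evaluation protocol concrete, the following is a minimal Python sketch of a prequential test-then-train loop combined with incremental feature normalization. It is not the LPG-model: the running z-score (Welford's algorithm) is only one assumed form of incremental normalization, and `OnlineLinearModel` is a hypothetical stand-in for the evolving deep GRU network. All class and function names are illustrative and do not come from the paper.

```python
import math


class RunningStandardizer:
    """Running z-score for one scalar feature (Welford's algorithm).
    Assumed normalization; the paper's two methods are not specified here."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / self.n)
        return (x - self.mean) / std if std > 0 else 0.0


class OnlineLinearModel:
    """Hypothetical stand-in learner: y ~ w * x + b, trained by SGD."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def learn(self, x, y):
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err


def prequential_abs_errors(stream, model, scaler):
    """Test-then-train: each sample is predicted before it is learned."""
    errors = []
    for x, y in stream:
        scaler.update(x)           # statistics are updated one sample at a time
        z = scaler.transform(x)
        errors.append(abs(model.predict(z) - y))  # test first
        model.learn(z, y)                         # then train
    return errors


if __name__ == "__main__":
    # Synthetic stream: a ramp feature with a constant throughput target.
    stream = [(float(i), 3.0) for i in range(200)]
    errs = prequential_abs_errors(stream, OnlineLinearModel(), RunningStandardizer())
    print(sum(errs) / len(errs))  # prequential mean absolute error
```

Because every sample is scored before the model sees its label, the accumulated error reflects performance on unseen data without a separate hold-out set, which is why the protocol suits unbounded streams.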

Full Text