Distributed Large-scale Time-series Data Processing and Analysis System Based on Spark Platform

Bangyan Du

doi:10.1109/bdacs53596.2021.00031

Abstract

As applications and user populations diversify, the big data era keeps producing new types of data, such as graph data, time series data, and spatial data. Predicting the future behavior of a system requires learning from its historical working states, and time series data is especially valuable for this purpose. Meanwhile, hardware resources are becoming cheaper, computing power has made major breakthroughs, and multi-machine, multi-core environments have become commonplace. However, current work on large-scale time series data concentrates either on data collection and storage or on single-machine processing, which can no longer meet existing needs. To address this problem, this paper designs and implements a distributed large-scale time series processing and analysis system based on the Spark platform. The system framework is mainly divided into three layers: a storage layer, an operator layer, and an algorithm layer. At the storage layer, the system organizes and indexes large-scale time series data on top of HDFS and Hive. At the operator layer, the system provides the basic operations commonly applied to time series data on the Spark platform, and users can combine these operators directly to implement custom time series processing algorithms. At the algorithm layer, the system implements several commonly used time series analysis algorithms on the Spark platform, including similarity query, clustering, and prediction, so that users can apply them to time series directly. Finally, performance and functional tests verify the feasibility and practicality of the system.
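The layering described in the abstract can be illustrated with a minimal sketch. This is not the paper's API: the function names (`sliding_window`, `z_normalize`, `top_k_similar`) are hypothetical, the distance measure (Euclidean distance on z-normalized windows) is an assumption chosen for simplicity, and plain single-process Python stands in for Spark so that the example stays self-contained. It only shows how an algorithm-layer routine such as a similarity query might be composed from reusable operator-layer primitives.

```python
import math
from typing import Iterator, List, Sequence, Tuple

# --- Operator layer (hypothetical names, not the paper's API) ---

def sliding_window(series: Sequence[float], size: int,
                   step: int = 1) -> Iterator[Tuple[int, List[float]]]:
    """Yield (start_index, window) pairs for fixed-length windows."""
    for start in range(0, len(series) - size + 1, step):
        yield start, list(series[start:start + size])

def z_normalize(window: Sequence[float]) -> List[float]:
    """Rescale a window to zero mean and unit (population) variance."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        # A constant window has no shape; map it to all zeros.
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# --- Algorithm layer: a similarity query built only from operators ---

def top_k_similar(series: Sequence[float], query: Sequence[float],
                  k: int = 1) -> List[Tuple[float, int]]:
    """Return the k (distance, start_index) pairs closest to the query."""
    q = z_normalize(query)
    scored = [(euclidean(z_normalize(window), q), start)
              for start, window in sliding_window(series, len(query))]
    return sorted(scored)[:k]

# Assumed toy data: two "peaks" with the same shape as the query.
series = [0.0, 1.0, 2.0, 1.0, 0.0, 5.0, 6.0, 7.0, 6.0, 5.0]
matches = top_k_similar(series, [10.0, 20.0, 10.0], k=2)
for distance, start in matches:
    print(f"window starting at index {start}: distance {distance:.6f}")
```

In a distributed setting the windowing and scoring steps would presumably run per partition on Spark rather than in a single Python loop; the sketch only mirrors the separation between operators and the algorithms built on them.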

Full Text