Parallel strategy for multiple scan operations with data replication

Xing Wei, Weining Qian, Huiqi Hu, Aoying Zhou, Huichao Duan

doi:10.1007/s11280-018-0625-7

Abstract

To support large-scale analytics for Web applications, the backend distributed data management system must provide access to massive data, which makes the scan operation a critical step. To improve scan performance, modern data management systems usually rely on simple partitioned parallelism. Under partitioned parallelism, a table consists of several partitions, and each scan operation can access multiple partitions independently. This is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address this problem, we propose a parallel strategy that schedules batched scan operations together, going beyond simple partitioned parallelism. First, we utilize replicas to increase parallelism and propose an effective load-balancing strategy over replica nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threaded parallelism on each node, which guarantees that all threads have even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-source distributed data management system. Experimental evaluation shows that our parallel scan strategy significantly improves the performance of scan operations.
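To make the idea of evening out per-thread workloads concrete, the following is a minimal sketch only. It is not the chunk-based scheduling algorithm of the paper, which the abstract does not specify. It substitutes the classic greedy longest-processing-time (LPT) heuristic, and it assumes that each chunk already has an estimated scan cost from some cost model. The function name assign_chunks and its parameters are hypothetical.

```python
import heapq


def assign_chunks(chunk_costs, num_threads):
    """Assign chunks to threads with the greedy LPT heuristic.

    chunk_costs: estimated scan cost per chunk (assumed to come from a cost model).
    num_threads: number of worker threads on the node.
    Returns (assignment, loads): chunk indices per thread and total cost per thread.
    """
    # Min-heap of (current_load, thread_id): the least-loaded thread pops first.
    heap = [(0, t) for t in range(num_threads)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_threads)]

    # Visit chunks from most to least expensive.
    order = sorted(range(len(chunk_costs)),
                   key=lambda i: chunk_costs[i], reverse=True)
    for i in order:
        load, t = heapq.heappop(heap)
        assignment[t].append(i)
        heapq.heappush(heap, (load + chunk_costs[i], t))

    loads = [sum(chunk_costs[i] for i in chunks) for chunks in assignment]
    return assignment, loads


if __name__ == "__main__":
    assignment, loads = assign_chunks([7, 5, 4, 3, 1], num_threads=2)
    print(assignment, loads)
```

LPT is a heuristic: it does not guarantee perfectly even loads in general, but it is a common baseline for this kind of balancing problem and shows what "even workloads under a cost model" means in practice.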

Full Text