Optimizing data regeneration and storage with data dependency for cloud scientific workflow systems

Meijuan Wang, Lin Zhou, Lei Fan

doi:10.1016/j.eswa.2023.121984

Abstract

The large volume, complex dependencies, and frequent reuse of data in data-intensive scientific workflow systems make data management increasingly complex yet crucial. Automating data management is an urgent need for scientific workflow systems and requires accurate models and efficient methods for data storage and request response. This paper addresses data regeneration and storage optimization using data dependency in cloud scientific workflow systems. By representing data storage strategies as 0-1 strings, we construct a global discrete model for data storage optimization, whose complexity is O(2^n), where n is the number of datasets. For data regeneration, we develop a deterministic method that determines the optimal regeneration strategy for requested datasets, with minimal computation, under any data storage strategy, so that the system can satisfy data requests automatically and evaluate the computation cost accurately. To solve the storage optimization problem, we construct an enumeration method and an elitist canonical genetic algorithm that shorten the representation of storage strategies to avoid redundant computation. Finally, experiments are conducted to test the proposed methods. Experimental results and comparisons show the validity and effectiveness of the proposed methods for data management in cloud scientific workflow systems.
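As a rough illustration of the 0-1 string representation and the exponential search space mentioned in the abstract, the following Python sketch enumerates every storage strategy for a small workflow. It is not the paper's model: the linear dependency chain, the cost figures, and the names gen_cost, store_cost, requests, regeneration_cost, total_cost, and best_strategy are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical linear workflow: dataset i is generated from dataset i-1.
# All numbers are illustrative assumptions, not values from the paper.
gen_cost = [10.0, 4.0, 6.0, 3.0]    # cost to compute dataset i from its predecessor
store_cost = [20.0, 5.0, 1.0, 4.0]  # cost to keep dataset i stored
requests = [1, 3, 2, 5]             # assumed number of requests for dataset i


def regeneration_cost(i, strategy):
    """Cost of producing dataset i under a 0-1 storage strategy.

    A stored dataset costs nothing to serve. Otherwise we recompute it,
    walking back to the nearest stored predecessor (or to the start).
    """
    cost = 0.0
    j = i
    while j >= 0 and strategy[j] == 0:
        cost += gen_cost[j]
        j -= 1
    return cost


def total_cost(strategy):
    """Storage cost plus request-weighted regeneration cost (assumed model)."""
    storage = sum(s * c for s, c in zip(strategy, store_cost))
    regen = sum(
        requests[i] * regeneration_cost(i, strategy)
        for i in range(len(strategy))
    )
    return storage + regen


def best_strategy(n):
    """Exhaustively search all 2**n strategies; feasible only for small n."""
    return min(product((0, 1), repeat=n), key=total_cost)


if __name__ == "__main__":
    best = best_strategy(len(gen_cost))
    print(best, total_cost(best))
```

Because the search visits all 2^n strings, it becomes impractical as n grows, which is consistent with the abstract's motivation for shortening the strategy representation and for using a genetic algorithm.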

Full Text