Abstract

Data in data-intensive scientific workflow systems are large in volume, have complex dependencies, and are frequently reused, which makes data management increasingly complex yet crucial. Automating data management is an urgent need for scientific workflow systems, and it requires accurate models and efficient methods for data storage and for responding to data requests. This paper addresses data regeneration and storage optimization in cloud scientific workflow systems by exploiting data dependency. By representing data storage strategies as 0-1 strings, we construct a global discrete model for data storage optimization whose search space has a complexity of O(2^n), where n is the number of datasets. For data regeneration, we develop a deterministic method that finds the optimal regeneration strategy, the one with minimal computation, for any requested dataset under any data storage strategy, so that the system can satisfy data requests automatically and evaluate the computation cost accurately. To solve the storage optimization problem, we construct an enumeration method and an elitist canonical genetic algorithm, both of which use a shortened representation of storage strategies to avoid redundant computation. Finally, experiments are conducted to test the proposed methods. The experimental results and comparisons show that the proposed methods are valid and effective for data management in cloud scientific workflow systems.
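
To make the 0-1 string model concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a linear dependency chain d0 -> d1 -> ... in which the raw input to d0 is always available, and a hypothetical cost model in which a stored dataset incurs its storage cost and a deleted dataset incurs its usage frequency times its regeneration cost. All function and parameter names (regeneration_cost, total_cost, best_strategy, gen_cost, store_cost, use_freq) are illustrative assumptions.

```python
from itertools import product


def regeneration_cost(i, strategy, gen_cost):
    """Computation needed to obtain dataset i under a 0-1 storage strategy.

    Assumption: linear chain, so dataset i is rebuilt from its nearest
    stored predecessor (or from the raw input if none is stored).
    """
    if strategy[i] == 1:
        return 0.0  # already stored, nothing to recompute
    cost = gen_cost[i]
    j = i - 1
    # Walk back through deleted predecessors; each must be regenerated too.
    while j >= 0 and strategy[j] == 0:
        cost += gen_cost[j]
        j -= 1
    return cost


def total_cost(strategy, gen_cost, store_cost, use_freq):
    """Hypothetical total cost: storage for kept datasets plus
    frequency-weighted regeneration for deleted ones."""
    total = 0.0
    for i, stored in enumerate(strategy):
        if stored:
            total += store_cost[i]
        else:
            total += use_freq[i] * regeneration_cost(i, strategy, gen_cost)
    return total


def best_strategy(gen_cost, store_cost, use_freq):
    """Plain enumeration over all 2^n strategies (feasible only for small n)."""
    n = len(gen_cost)
    return min(
        product((0, 1), repeat=n),
        key=lambda s: total_cost(s, gen_cost, store_cost, use_freq),
    )
```

With the made-up inputs gen_cost=[10, 1, 10], store_cost=[1, 5, 1], and use_freq=[1, 1, 1], this enumeration selects the strategy (1, 0, 1): the cheap-to-regenerate middle dataset is deleted and the two expensive ones are kept. Because the search visits every one of the 2^n strings, it is only practical for small n, which is the motivation for heuristic search such as a genetic algorithm.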
