A data dependency based strategy for intermediate data storage in scientific cloud workflow systems

Yuan, Dong, Yang, Yun, Liu, Xiao, Zhang, Gaofeng and Chen, Jinjun 2012, A data dependency based strategy for intermediate data storage in scientific cloud workflow systems, Concurrency and computation: practice and experience, vol. 24, no. 9, Special issue, pp. 956-976, doi: 10.1002/cpe.1636.

Title A data dependency based strategy for intermediate data storage in scientific cloud workflow systems
Author(s) Yuan, Dong
Yang, Yun
Liu, Xiao (ORCID iD: orcid.org/0000-0001-8400-5754)
Zhang, Gaofeng
Chen, Jinjun
Journal name Concurrency and computation: practice and experience
Volume number 24
Issue number 9
Season Special issue
Start page 956
End page 976
Total pages 21
Publisher Wiley-Blackwell
Place of publication Chichester, Eng.
Publication date 2012-06
ISSN 1532-0626
1532-0634
Keyword(s) data sets storage
cloud computing
scientific workflow
Science & Technology
Technology
Computer Science, Software Engineering
Computer Science, Theory & Methods
Computer Science
Summary Many scientific workflows are data intensive, generating large volumes of intermediate data during execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, the data to keep are selected manually according to the system's storage capacity. As doing science in the cloud has become popular, more intermediate data can be stored in scientific cloud workflows under a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and on this basis we develop a novel intermediate data storage strategy that reduces the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy achieves a cost-effective trade-off between computation cost and storage cost and is not strongly affected by inaccuracy in forecasting data sets' usage. It also takes users' tolerance of data access delay into consideration. We use Amazon's cost model and apply the strategy to general random workflows as well as specific astrophysics pulsar-searching scientific workflows for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution.
Language eng
DOI 10.1002/cpe.1636
Field of Research 080109 Pattern Recognition and Data Mining
Socio Economic Objective 970108 Expanding Knowledge in the Information and Computing Sciences
HERDC Research category C1.1 Refereed article in a scholarly journal
ERA Research output type C Journal article
Copyright notice ©2010, John Wiley & Sons Ltd
Persistent URL http://hdl.handle.net/10536/DRO/DU:30087716
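
The storage-versus-regeneration trade-off described in the Summary can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simple linear dependency chain rather than a full IDG, and the `DataSet` fields, the `choose_stored` function, and the greedy rule (store a data set when its daily storage cost is below its expected daily regeneration cost) are hypothetical names and simplifications introduced here for illustration only. All cost figures are made up.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    gen_cost: float      # cost to compute it from its direct predecessor (USD), illustrative
    storage_rate: float  # cost to keep it stored (USD per day), illustrative
    uses_per_day: float  # forecast usage frequency, illustrative


def choose_stored(chain):
    """Greedy pass over a linear dependency chain (a simplification of an IDG).

    A deleted data set must be regenerated from the nearest stored
    predecessor, so its regeneration cost includes the generation cost
    of every deleted data set in between.
    """
    stored = []
    pending_regen = 0.0  # generation cost of deleted predecessors since the last stored one
    for ds in chain:
        regen_cost = pending_regen + ds.gen_cost
        regen_rate = regen_cost * ds.uses_per_day  # expected regeneration cost per day
        if ds.storage_rate < regen_rate:
            stored.append(ds.name)   # cheaper to keep than to recompute on demand
            pending_regen = 0.0
        else:
            pending_regen = regen_cost  # delete it; successors inherit its cost
    return stored


chain = [
    DataSet("d1", gen_cost=10.0, storage_rate=0.5, uses_per_day=0.1),
    DataSet("d2", gen_cost=2.0, storage_rate=1.0, uses_per_day=0.1),
    DataSet("d3", gen_cost=5.0, storage_rate=0.3, uses_per_day=0.2),
]
print(choose_stored(chain))  # ['d1', 'd3']
```

In this example `d2` is cheap to regenerate and expensive to store, so it is deleted; its generation cost is then added to the regeneration cost of `d3`, which makes storing `d3` more attractive. The sketch ignores the users' delay tolerance mentioned in the Summary.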

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: Cited 27 times in TR Web of Science
Cited 50 times in Scopus
Created: Wed, 07 Dec 2016, 13:42:43 EST
