A cost-effective strategy for intermediate data storage in scientific cloud workflow systems
conference contribution
posted on 2010-07-01, 00:00 authored by D Yuan, Y Yang, Xiao Liu, J Chen
Many scientific workflows are data intensive, generating a large volume of intermediate data during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, such data are stored selectively according to the system storage capacity, as determined manually. As doing science in the cloud has become popular, more intermediate data can be stored in scientific cloud workflows under a pay-for-use model. In this paper, we build an Intermediate data Dependency Graph (IDG) from the data provenances in scientific workflows. Based on the IDG, we develop a novel intermediate data storage strategy that can reduce the cost of the scientific cloud workflow system by automatically storing the most appropriate intermediate datasets in cloud storage. We utilise Amazon's cost model and apply the strategy to an astrophysics pulsar searching scientific workflow for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution. © 2010 IEEE.
Location
Atlanta, Ga.
Start date
2010-04-19
End date
2010-04-23
ISBN-13
9781424464432
Language
eng
Publication classification
E Conference publication, E1.1 Full written paper - refereed
Copyright notice
2010, IEEE
Title of proceedings
IPDPS 2010 : Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing
Event
Parallel & Distributed Processing. International Symposium (2010 : Atlanta, Georgia)
Publisher
IEEE
Place of publication
Piscataway, N.J.