Scientific applications are usually data intensive [1, 2], and the datasets they generate are often terabytes or even petabytes in size. As reported by Szalay and Gray, science is in an exponential world, and the amount of scientific data will double every year over the next decade and beyond. Producing scientific datasets involves a large number of computation-intensive tasks, e.g., in scientific workflows, and therefore takes a long time to execute. The generated datasets contain important intermediate or final results of the computation and need to be stored as valuable resources, because: (1) data can be reused: scientists may need to re-analyze the results or apply new analyses to the existing datasets; (2) data can be shared: for collaboration, the computation results may be shared, so the datasets are used by scientists from different institutions. Storing valuable generated application datasets saves the regeneration cost when they are reused, not to mention the waiting time caused by regeneration. However, the large size of scientific datasets makes storing them a major challenge.
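The trade-off described above, paying to keep a dataset versus paying to regenerate it on reuse, can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the chapter's cost model: the `Dataset` fields, the function names, the flat per-GB-month and per-CPU-hour prices, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of one generated dataset."""
    size_gb: float          # size of the stored dataset
    regen_cpu_hours: float  # compute time needed to regenerate it once
    uses_per_month: float   # expected number of reuses per month


def monthly_storage_cost(d: Dataset, storage_price_per_gb_month: float) -> float:
    """Cost of keeping the dataset in cloud storage for one month."""
    return d.size_gb * storage_price_per_gb_month


def monthly_regeneration_cost(d: Dataset, compute_price_per_hour: float) -> float:
    """Expected monthly cost of regenerating the dataset on every reuse."""
    return d.regen_cpu_hours * compute_price_per_hour * d.uses_per_month


def cheaper_to_store(d: Dataset, storage_price: float, compute_price: float) -> bool:
    """True when storing costs no more than regenerating on demand."""
    return (monthly_storage_cost(d, storage_price)
            <= monthly_regeneration_cost(d, compute_price))


# Assumed prices (USD), chosen only for illustration.
STORAGE_PRICE = 0.10  # per GB per month
COMPUTE_PRICE = 0.10  # per CPU hour

large_rarely_used = Dataset(size_gb=500, regen_cpu_hours=100, uses_per_month=0.5)
small_costly = Dataset(size_gb=20, regen_cpu_hours=200, uses_per_month=1.0)

for name, d in [("large_rarely_used", large_rarely_used),
                ("small_costly", small_costly)]:
    print(name, "store" if cheaper_to_store(d, STORAGE_PRICE, COMPUTE_PRICE)
          else "regenerate")
```

With these made-up numbers, the large, rarely reused dataset is cheaper to regenerate, while the small dataset that is expensive to compute is cheaper to keep. The sketch ignores the waiting time caused by regeneration, which the text notes as an additional reason to store.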
Bandwidth is another common kind of resource in the cloud. In [1], the authors state that the cost-effective way of doing science in the cloud is to upload all the application data to cloud storage and run all the applications with cloud services. We therefore assume that scientists upload all the original data to the cloud to conduct their experiments. Because transferring data within one cloud service provider's facilities is usually free, the data transfer cost of managing the application datasets is not counted. In [15], the authors discuss the scenario of running scientific applications across different cloud service providers.
The prices may fluctuate from time to time according to market factors.
Amazon cloud services offer different CPU instances at different prices; using more expensive, higher-performance CPU instances reduces computation time. This creates a trade-off between time and cost [34], which is different from the trade-off between computation and storage and is therefore outside this chapter's scope.
Deelman, E., G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. in ACM/IEEE Conference on Supercomputing (SC’08). pp. 1–12. 2008. Austin, Texas, USA.
Ludascher, B., I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, and E.A. Lee, Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 2005. 18(10): pp. 1039–1065.
Szalay, A.S. and J. Gray, Science in an Exponential World. Nature, 2006. 440: pp. 23–24.
Deelman, E., D. Gannon, M. Shields, and I. Taylor, Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, 2009. 25(5): pp. 528–540.
Bose, R. and J. Frew, Lineage Retrieval for Scientific Data Processing: A Survey. ACM Computing Surveys, 2005. 37(1): pp. 1–28.
Burton, A. and A. Treloar. Publish My Data: A Composition of Services from ANDS and ARCS. in 5th IEEE International Conference on e-Science, (e-Science ’09) pp. 164–170. 2009. Oxford, UK.
Foster, I., Z. Yong, I. Raicu, and S. Lu. Cloud Computing and Grid Computing 360-Degree Compared. in Grid Computing Environments Workshop (GCE’08). pp. 1–10. 2008. Austin, Texas, USA.
Buyya, R., C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Generation Computer Systems, 2009. 25(6): pp. 599–616.
Amazon Cloud Services: http://aws.amazon.com/.
Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI’2008). pp. 29–42. 2008. San Diego, CA, USA.
Adams, I., D.D.E. Long, E.L. Miller, S. Pasupathy, and M.W. Storer. Maximizing Efficiency by Trading Storage for Computation. in Workshop on Hot Topics in Cloud Computing (HotCloud’09). pp. 1–5. 2009. San Diego, CA, USA.
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflows. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Yuan, D., Y. Yang, X. Liu, G. Zhang, and J. Chen, A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems. Concurrency and Computation: Practice and Experience, 2010. (http://dx.doi.org/10.1002/cpe.1636)
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud. in 4th IEEE International Conference on Cloud Computing (Cloud2011). pp. 1–8. 2011. Washington DC, USA.
Yuan, D., Y. Yun, X. Liu, and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 2011. 72(2): pp. 316–332.
Chiba, T., T. Kielmann, M.d. Burger, and S. Matsuoka. Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds. in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2010). pp. 5–14. 2010. Melbourne, Australia.
Juve, G., E. Deelman, K. Vahi, and G. Mehta. Data Sharing Options for Scientific Workflows on Amazon EC2. in ACM/IEEE Conference on Supercomputing (SC’10). pp. 1–9. 2010. New Orleans, Louisiana, USA.
Li, J., M. Humphrey, D. Agarwal, K. Jackson, C.v. Ingen, and Y. Ryu. eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Yuan, D., Y. Yang, X. Liu, and J. Chen, A Data Placement Strategy in Scientific Cloud Workflows. Future Generation Computer Systems, 2010. 26(8): pp. 1200–1214.
Eucalyptus. Available from: http://open.eucalyptus.com/.
Nimbus. Available from: http://www.nimbusproject.org/.
OpenNebula. Available from: http://www.opennebula.org/.
Armbrust, M., A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A View of Cloud Computing. Commun. ACM, 2010. 53(4): pp. 50–58.
Assuncao, M.D.d., A.d. Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. in 18th ACM International Symposium on High Performance Distributed Computing (HPDC’09). pp. 1–10. 2009. Garching, Germany.
Kondo, D., B. Javadi, P. Malecot, F. Cappello, and D.P. Anderson. Cost-Benefit Analysis of Cloud Computing versus Desktop Grids. in 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’09). pp. 1–12. 2009. Rome, Italy.
Cho, B. and I. Gupta. New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. in IEEE 30th International Conference on Distributed Computing Systems (ICDCS). pp. 305–314. 2010. Genova, Italy.
Gunda, P.K., L. Ravindranath, C.A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. in 9th Symposium on Operating Systems Design and Implementation (OSDI’2010). pp. 1–14. 2010. Vancouver, Canada.
Bao, Z., S. Cohen-Boulakia, S.B. Davidson, A. Eyal, and S. Khanna. Differencing Provenance in Scientific Workflows. in 25th IEEE International Conference on Data Engineering (ICDE’09). pp. 808–819. 2009. Shanghai, China.
Groth, P. and L. Moreau, Recording Process Documentation for Provenance. IEEE Transactions on Parallel and Distributed Systems, 2009. 20(9): pp. 1246–1259.
Muniswamy-Reddy, K.-K., P. Macko, and M. Seltzer. Provenance for the Cloud. in 8th USENIX Conference on File and Storage Technology (FAST’10). pp. 197–210. 2010. San Jose, CA, USA.
Osterweil, L.J., L.A. Clarke, A.M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley. Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance. in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 319–329. 2008. Atlanta, Georgia: ACM.
Foster, I., J. Vockler, M. Wilde, and Z. Yong. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. in 14th International Conference on Scientific and Statistical Database Management, (SSDBM’02). pp. 37–46. 2002. Edinburgh, Scotland, UK.
Simmhan, Y.L., B. Plale, and D. Gannon, A Survey of Data Provenance in E-Science. SIGMOD Rec., 2005. 34(3): pp. 31–36.
Garg, S.K., R. Buyya, and H.J. Siegel, Time and Cost Trade-Off Management for Scheduling Parallel Applications on Utility Grids. Future Generation Computer Systems, 2010. 26(8): pp. 1344–1355.
Liu, X., D. Yuan, G. Zhang, J. Chen, and Y. Yang, SwinDeW-C: A Peer-to-Peer Based Cloud Workflow System, in Handbook of Cloud Computing, B. Furht and A. Escalante, Editors. 2010, Springer. pp. 309–332.
Yang, Y., K. Liu, J. Chen, J. Lignier, and H. Jin. Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G. in IEEE International Conference on e-Science and Grid Computing. pp. 51–58. 2007. Bangalore, India.
- Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud
- Springer New York
- Chapter 5