ABSTRACT
A huge amount of data is constantly being produced worldwide. Data from the IoT, from scientific simulations, or from any other field of eScience accumulate on top of historical data sets and form the seed for future Big Data processing, whose final goal is to generate added value and discover knowledge. In such computing processes, data are the main resource; however, organizing and managing data throughout their entire life cycle is a complex research topic. To this end, Data LifeCycle (DLC) models have been proposed to efficiently organize large and complex data sets, from creation to consumption, in any field and at any scale, for effective data usage and big data exploitation.
Several DLC frameworks can be found in the literature, each defined for specific environments and scenarios. However, we found no global and comprehensive DLC model that can be easily adapted to different scientific areas. For this reason, in this paper we describe the Comprehensive Scenario Agnostic Data LifeCycle (COSA-DLC) model, a DLC model which: i) is proved to be comprehensive, as it addresses the 6Vs challenges (namely Value, Volume, Variety, Velocity, Variability and Veracity); and ii) can be easily adapted to any particular scenario and, therefore, fit the requirements of a specific scientific field. We also include two use cases to illustrate the ease of adaptation to different scenarios. We conclude that a comprehensive, scenario-agnostic DLC model provides several advantages, such as facilitating global data management, organization and integration, easing adaptation to any kind of scenario, guaranteeing good data quality levels and, therefore, saving design time and effort for the scientific and industrial communities.
Index Terms
- Towards a comprehensive data lifecycle model for big data environments