
2021 | OriginalPaper | Chapter

Metadata Management on Data Processing in Data Lakes

Authors: Imen Megdiche, Franck Ravat, Yan Zhao

Published in: SOFSEM 2021: Theory and Practice of Computer Science

Publisher: Springer International Publishing


Abstract

A data lake (DL) is a well-known Big Data analysis solution. It stores not only data but also the processes that were carried out on these data. It is commonly agreed that data preparation and transformation take most of a data analyst’s time. To improve the efficiency of data processing in a DL, we propose a framework that includes a metadata model and algebraic transformation operations. The metadata model ensures the findability, accessibility, interoperability and reusability of data processes, as well as the data lineage of processes. Moreover, each process is described through a set of coarse-grained data transformation operations that can be applied to different types of datasets. We illustrate and validate our proposal with a real medical use case implementation.
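To make the idea of process metadata more concrete, the following is a minimal sketch of how a process could be recorded as a list of coarse-grained operations and how lineage could then be traced. It is not the authors' metadata model, which this page does not detail. All class names, field names, operation names and dataset identifiers (`Operation`, `ProcessMetadata`, `Catalog`, `raw_visits`, and so on) are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Operation:
    """One coarse-grained transformation step (hypothetical naming)."""
    name: str              # e.g. "filter", "join", "aggregate"
    description: str = ""


@dataclass
class ProcessMetadata:
    """Hypothetical metadata record describing one data process."""
    process_id: str
    inputs: list[str]      # identifiers of the source datasets
    output: str            # identifier of the dataset the process produces
    operations: list[Operation] = field(default_factory=list)


class Catalog:
    """In-memory catalog that registers processes and answers simple queries."""

    def __init__(self) -> None:
        # Assumption: each dataset is produced by at most one process.
        self._by_output: dict[str, ProcessMetadata] = {}

    def register(self, process: ProcessMetadata) -> None:
        self._by_output[process.output] = process

    def lineage(self, dataset: str) -> list[str]:
        """Return ids of the processes that led to `dataset`, upstream first."""
        seen: set[str] = set()
        ordered: list[str] = []

        def visit(name: str) -> None:
            process = self._by_output.get(name)
            if process is None or process.process_id in seen:
                return  # raw dataset, or process already visited
            seen.add(process.process_id)
            for source in process.inputs:
                visit(source)
            ordered.append(process.process_id)

        visit(dataset)
        return ordered

    def find_by_operation(self, op_name: str) -> list[str]:
        """Findability: ids of all processes that use a given operation."""
        return sorted(
            p.process_id
            for p in self._by_output.values()
            if any(op.name == op_name for op in p.operations)
        )


# Made-up example: two chained processes over illustrative datasets.
catalog = Catalog()
catalog.register(ProcessMetadata(
    "p1", ["raw_visits"], "clean_visits", [Operation("filter")]))
catalog.register(ProcessMetadata(
    "p2", ["clean_visits", "raw_patients"], "visits_by_patient",
    [Operation("join"), Operation("aggregate")]))

print(catalog.lineage("visits_by_patient"))
print(catalog.find_by_operation("join"))
```

Keying the catalog by output dataset keeps the lineage query a simple upstream walk. A real data lake would more likely store such records in a database, and it would need to handle datasets produced by several processes.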


Metadata

Title: Metadata Management on Data Processing in Data Lakes

Authors: Imen Megdiche, Franck Ravat, Yan Zhao

Copyright Year: 2021

DOI: https://doi.org/10.1007/978-3-030-67731-2_40
