2021 | OriginalPaper | Book chapter

Metadata Management on Data Processing in Data Lakes

Written by: Imen Megdiche, Franck Ravat, Yan Zhao

Published in: SOFSEM 2021: Theory and Practice of Computer Science

Publisher: Springer International Publishing

Abstract

A data lake (DL) is known as a Big Data analysis solution. It stores not only data but also the processes that were carried out on these data. It is commonly agreed that data preparation and transformation take most of a data analyst’s time. To improve the efficiency of data processing in a DL, we propose a framework that includes a metadata model and algebraic transformation operations. The metadata model ensures the findability, accessibility, interoperability and reusability of data processes, as well as their data lineage. Moreover, each process is described through a set of coarse-grained data transformation operations that can be applied to different types of datasets. We illustrate and validate our proposal with an implementation of a real medical use case.
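
As a rough illustration of the idea in the abstract, the sketch below records each process as a metadata entry listing its input datasets, output datasets and coarse-grained operations, then walks those entries backwards to recover lineage. It is not the metadata model from the chapter: the class names, fields, operation kinds and dataset identifiers are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Operation:
    """One coarse-grained transformation step (hypothetical structure)."""
    kind: str          # illustrative label such as "filter" or "join"
    description: str


@dataclass
class ProcessMetadata:
    """Hypothetical metadata record describing one data process in the lake."""
    name: str
    inputs: List[str]    # identifiers of the datasets the process reads
    outputs: List[str]   # identifiers of the datasets the process produces
    operations: List[Operation] = field(default_factory=list)


def upstream_datasets(target: str, processes: List[ProcessMetadata]) -> Set[str]:
    """Collect every dataset that `target` was directly or indirectly derived from."""
    seen: Set[str] = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for process in processes:
            if current in process.outputs:
                for source in process.inputs:
                    if source not in seen:
                        seen.add(source)
                        frontier.append(source)
    return seen


# Made-up example records; the dataset names are assumptions for illustration.
processes = [
    ProcessMetadata(
        name="clean",
        inputs=["raw_visits"],
        outputs=["clean_visits"],
        operations=[Operation("filter", "drop rows with missing identifiers")],
    ),
    ProcessMetadata(
        name="enrich",
        inputs=["clean_visits", "patients"],
        outputs=["visits_enriched"],
        operations=[Operation("join", "attach patient attributes to visits")],
    ),
]

lineage = upstream_datasets("visits_enriched", processes)
```

Keeping inputs and outputs on the process record, rather than on the datasets, means the same record can serve both for searching processes to reuse and for answering lineage questions.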

Metadata
Title
Metadata Management on Data Processing in Data Lakes
Written by
Imen Megdiche
Franck Ravat
Yan Zhao
Copyright year
2021
DOI
https://doi.org/10.1007/978-3-030-67731-2_40