Published in: World Wide Web 6/2023

06-11-2023

A semantic and service-based approach for adaptive mutli-structured data curation in data lakehouses

Authors: Firas Zouari, Chirine Ghedira-Guegan, Khouloud Boukadi, Nadia Kabachi

Abstract

Several data management architectures have recently emerged to cope with the challenges imposed by big data. Among them, data lakehouses are attracting considerable interest from industry and academia because they can hold disparate multi-structured batch and streaming data sources in a single repository. The heterogeneity and complexity of such data call for a dedicated process that improves their quality and extracts value from them. Data curation is that process: it encompasses several tasks that clean and enrich data so that they continue to fit the user's requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to assemble a curation pipeline that matches end-user requirements, which may vary with the user's decision context. Moreover, they are dedicated to curating batch data sources of a single structure type only (e.g., semi-structured). Considering the changing requirements of the user and the need to build a customized data curation pipeline according to the characteristics of both the user and the data source, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The proposed framework is built upon a new modular ontology for data characterization and evaluation and a curation service composition approach, both of which we detail in this paper. The experimental findings validate the performance of these contributions in terms of effectiveness and execution time.
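To make the five-module flow named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. Every function name, the `Profile` type, the completeness metric, and the rule-based `compose` step are assumptions introduced for illustration only; in particular, the simple rules below merely stand in for the paper's curation service composition approach, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]
Service = Callable[[List[Record]], List[Record]]


@dataclass
class Profile:
    """Hypothetical characterization of a data source."""
    structure: str       # "structured" or "semi-structured"
    completeness: float  # share of non-missing values, in [0, 1]


def collect(source: List[Record]) -> List[Record]:
    # Module 1 (data collection): here, just copy an in-memory batch.
    return list(source)


def evaluate_quality(records: List[Record]) -> float:
    # Module 2 (data quality evaluation): completeness as one example metric.
    cells = [v for r in records for v in r.values()]
    if not cells:
        return 1.0
    return sum(v is not None for v in cells) / len(cells)


def characterize(records: List[Record], completeness: float) -> Profile:
    # Module 3 (data characterization): records sharing one key set are
    # treated as structured, anything else as semi-structured.
    key_sets = {frozenset(r) for r in records}
    structure = "structured" if len(key_sets) <= 1 else "semi-structured"
    return Profile(structure, completeness)


def align_fields(records: List[Record]) -> List[Record]:
    # Example curation service: give every record the same set of fields.
    fields = sorted({k for r in records for k in r})
    return [{k: r.get(k) for k in fields} for r in records]


def drop_incomplete(records: List[Record]) -> List[Record]:
    # Example curation service: remove records with missing values.
    return [r for r in records if all(v is not None for v in r.values())]


def compose(profile: Profile, min_completeness: float) -> List[Service]:
    # Module 4 (curation service composition): a rule-based stand-in that
    # selects services from the source profile and the user requirement.
    pipeline: List[Service] = []
    if profile.structure == "semi-structured":
        pipeline.append(align_fields)
    if profile.completeness < min_completeness:
        pipeline.append(drop_incomplete)
    return pipeline


def curate(records: List[Record], pipeline: List[Service]) -> List[Record]:
    # Module 5 (data curation): apply the composed services in order.
    for service in pipeline:
        records = service(records)
    return records


def run(source: List[Record], min_completeness: float = 0.9) -> List[Record]:
    records = collect(source)
    profile = characterize(records, evaluate_quality(records))
    return curate(records, compose(profile, min_completeness))
```

In this sketch, the same source yields a different pipeline when the user's `min_completeness` requirement changes, which is the kind of adaptivity the abstract argues existing approaches lack.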

Metadata
Title
A semantic and service-based approach for adaptive mutli-structured data curation in data lakehouses
Authors
Firas Zouari
Chirine Ghedira-Guegan
Khouloud Boukadi
Nadia Kabachi
Publication date
06-11-2023
Publisher
Springer US
Published in
World Wide Web / Issue 6/2023
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-023-01218-3
