Skip to main content

2018 | OriginalPaper | Buchkapitel

ETL Processes in the Era of Variety

verfasst von : Nabila Berkani, Ladjel Bellatreche, Laurent Guittet

Erschienen in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX

Verlag: Springer Berlin Heidelberg

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Nowadays, we are living in an open and connected world, where small, medium and large companies are looking for integrating data from various data sources to satisfy the requirements of new applications such as delivering real-time alerts and trigger automated actions, complex system failure detection, anomalies detection, etc. The process of getting these data from their sources to its home system in efficient and correct manner is known by data ingestion, usually refer to Extract, Transform, Load (ETL) widely studied in data warehouses. In the context of rapidly technology changing and the explosion of data sources, ETL processes have to consider two main issues: (a) the variety of data sources that spans traditional, XML, semantic, graph databases, etc. and (b) the variety of storage platforms, where the home system may have several stores (known by polystore), where one hosts a particular type of data. These issues directly impact the efficiency and the deployment flexibility of ETL. In this paper, we deal with these issues. Firstly, thanks to Model Driven Engineering, we make generic different types of data sources. This genericity allows overloading the ETL operators for each type of sources. This genericity is illustrated by considering three types of the most popular data sources: relational, semantic and graph databases. Secondly, we show the impact of genericity of operators in the ETL workflow, where a Web-service-driven approach for orchestrating the ETL flows is given. Thirdly, the extracted and merged data obtained by the ETL workflow are deployed according their favorite stores. Finally, our finding is validated through a proof of concept tool using the LUBM semantic database and Yago graph deployed in Oracle RDF Semantic Graph 12c.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
2.
Zurück zum Zitat Ali, S.M.F., Wrembel, R.: From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26(6), 777–801 (2017)CrossRef Ali, S.M.F., Wrembel, R.: From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26(6), 777–801 (2017)CrossRef
3.
Zurück zum Zitat Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)MATH Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)MATH
5.
Zurück zum Zitat Berkani, N., Bellatreche, L., Khouri, S.: Towards a conceptualization of ETL and physical storage of semantic data warehouses as a service. Cluster Comput. 16(4), 915–931 (2013)CrossRef Berkani, N., Bellatreche, L., Khouri, S.: Towards a conceptualization of ETL and physical storage of semantic data warehouses as a service. Cluster Comput. 16(4), 915–931 (2013)CrossRef
6.
Zurück zum Zitat Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., Rosati, R.: Data integration in data warehousing. Int. J. Coop. Inf. Syst. 10(3), 237–271 (2001)CrossRef Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., Rosati, R.: Data integration in data warehousing. Int. J. Coop. Inf. Syst. 10(3), 237–271 (2001)CrossRef
9.
Zurück zum Zitat DeWitt, D.J., et al.: Split query processing in polybase. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1255–1266. ACM (2013) DeWitt, D.J., et al.: Split query processing in polybase. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1255–1266. ACM (2013)
10.
Zurück zum Zitat Dong, X.L., Srivastava, D.: Big data integration. PVLDB 6(11), 118 (2013) Dong, X.L., Srivastava, D.: Big data integration. PVLDB 6(11), 118 (2013)
11.
Zurück zum Zitat Duggan, J., et al.: The BigDAWG polystore system. ACM SIGMOD Rec. 44(2), 11–16 (2015)CrossRef Duggan, J., et al.: The BigDAWG polystore system. ACM SIGMOD Rec. 44(2), 11–16 (2015)CrossRef
12.
Zurück zum Zitat Inmon, W.H.: Building the Data Warehouse. Wiley, Hoboken (2002) Inmon, W.H.: Building the Data Warehouse. Wiley, Hoboken (2002)
13.
Zurück zum Zitat Mazón, J.-N., Trujillo, J.: An MDA approach for the development of data warehouses. In: JISBD, pp. 208–208 (2009) Mazón, J.-N., Trujillo, J.: An MDA approach for the development of data warehouses. In: JISBD, pp. 208–208 (2009)
16.
Zurück zum Zitat Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris, R., Pau, R., Pereira, J.: CloudMdsQL: querying heterogeneous cloud data stores with a common language. Distrib. Parallel Databases 34(4), 463–503 (2016)CrossRef Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris, R., Pau, R., Pereira, J.: CloudMdsQL: querying heterogeneous cloud data stores with a common language. Distrib. Parallel Databases 34(4), 463–503 (2016)CrossRef
17.
Zurück zum Zitat Lenzerini, M.: Data integration: a theoretical perspective. In: ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 233–246 (2002) Lenzerini, M.: Data integration: a theoretical perspective. In: ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 233–246 (2002)
19.
Zurück zum Zitat Nakuçi, E., Theodorou, V., Jovanovic, P., Abelló, A.: Bijoux: data generator for evaluating ETL process quality. In: ACM DOLAP, pp. 23–32 (2014) Nakuçi, E., Theodorou, V., Jovanovic, P., Abelló, A.: Bijoux: data generator for evaluating ETL process quality. In: ACM DOLAP, pp. 23–32 (2014)
20.
Zurück zum Zitat Nebot, V., Berlanga, R.: Building data warehouses with semantic web data. Decis. Support Syst. 52(4), 853–868 (2012)CrossRef Nebot, V., Berlanga, R.: Building data warehouses with semantic web data. Decis. Support Syst. 52(4), 853–868 (2012)CrossRef
21.
Zurück zum Zitat Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ unifying semi-structured query language, and an expressiveness benchmark of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR, abs/1405.3631 (2014) Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ unifying semi-structured query language, and an expressiveness benchmark of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR, abs/1405.3631 (2014)
22.
Zurück zum Zitat Raventós, R., Olivé, A.: An object-oriented operation-based approach to translation between MOF metaschemas. Data Knowl. Eng. 67(3), 444–462 (2008)CrossRef Raventós, R., Olivé, A.: An object-oriented operation-based approach to translation between MOF metaschemas. Data Knowl. Eng. 67(3), 444–462 (2008)CrossRef
23.
Zurück zum Zitat Rodriguez, M.A., Neubauer, P.: Constructions from dots and lines. CoRR, abs/1006.2361 (2010)CrossRef Rodriguez, M.A., Neubauer, P.: Constructions from dots and lines. CoRR, abs/1006.2361 (2010)CrossRef
24.
Zurück zum Zitat Shmueli, O., Tsur, S.: Logical diagnosis of LDL programs. New Gener. Comput. 9(3/4), 277–304 (1991)CrossRef Shmueli, O., Tsur, S.: Logical diagnosis of LDL programs. New Gener. Comput. 9(3/4), 277–304 (1991)CrossRef
25.
Zurück zum Zitat Simitsis, A., Vassiliadis, P., Sellis, T.-K.: Optimizing ETL processes in data warehouses. In: ICDE, pp. 564–575 (2005) Simitsis, A., Vassiliadis, P., Sellis, T.-K.: Optimizing ETL processes in data warehouses. In: ICDE, pp. 564–575 (2005)
26.
Zurück zum Zitat Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: Optimizing analytic data flows for multiple execution engines. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 829–840. ACM (2012) Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: Optimizing analytic data flows for multiple execution engines. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 829–840. ACM (2012)
27.
Zurück zum Zitat Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010) Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010)
28.
Zurück zum Zitat Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int. J. Semant. Web Inf. Syst. 3(4), 1–24 (2007)CrossRef Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int. J. Semant. Web Inf. Syst. 3(4), 1–24 (2007)CrossRef
29.
Zurück zum Zitat Stonebraker, M.: Technical perspective - one size fits all: an idea whose time has come and gone. Commun. ACM 51(12), 76 (2008)CrossRef Stonebraker, M.: Technical perspective - one size fits all: an idea whose time has come and gone. Commun. ACM 51(12), 76 (2008)CrossRef
30.
Zurück zum Zitat Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007) Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)
32.
Zurück zum Zitat Tziovara, P., Vassiliadis, P., Simitsis, A.: Deciding the physical implementation of ETL workflows. In: DOLAP, pp. 49–56 (2007) Tziovara, P., Vassiliadis, P., Simitsis, A.: Deciding the physical implementation of ETL workflows. In: DOLAP, pp. 49–56 (2007)
33.
Zurück zum Zitat Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009) Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009)
34.
Zurück zum Zitat Vassiliadis, P., Simitsis, A., Baikousi, E.: A taxonomy of ETL activities. In: ACM DOLAP, pp. 25–32 (2009) Vassiliadis, P., Simitsis, A., Baikousi, E.: A taxonomy of ETL activities. In: ACM DOLAP, pp. 25–32 (2009)
35.
Zurück zum Zitat Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S.: A generic and customizable framework for the design of etl scenarios. Inf. Syst. 30(7), 492–525 (2005)CrossRef Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S.: A generic and customizable framework for the design of etl scenarios. Inf. Syst. 30(7), 492–525 (2005)CrossRef
36.
Zurück zum Zitat Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL processes. In: DOLAP, pp. 14–21 (2002) Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL processes. In: DOLAP, pp. 14–21 (2002)
37.
Zurück zum Zitat Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL activities as graphs. In: DMDW, pp. 52–61 (2002) Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL activities as graphs. In: DMDW, pp. 52–61 (2002)
39.
Zurück zum Zitat Zhu, M., Risch, T.: Querying combined cloud-based and relational databases. In: 2011 International Conference on Cloud and Service Computing (CSC), pp. 330–335. IEEE (2011) Zhu, M., Risch, T.: Querying combined cloud-based and relational databases. In: 2011 International Conference on Cloud and Service Computing (CSC), pp. 330–335. IEEE (2011)
Metadaten
Titel
ETL Processes in the Era of Variety
verfasst von
Nabila Berkani
Ladjel Bellatreche
Laurent Guittet
Copyright-Jahr
2018
Verlag
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-662-58415-6_4