2020 | OriginalPaper | Book chapter

Data Engineering for Data Science: Two Sides of the Same Coin

Written by: Oscar Romero, Robert Wrembel

Published in: Big Data Analytics and Knowledge Discovery

Publisher: Springer International Publishing

Abstract

A de facto technological standard of data science is based on notebooks (e.g., Jupyter), which provide an integrated environment to execute data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most of the data science languages process data locally, i.e., in workstations with limited memory, and store data in files. Thus, this approach neglects the benefits brought by over 40 years of R&D in the area of data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate for a standardized data engineering approach for data science and we present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which next enables the semi-automation of the logical and physical designs of such DPPs.
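The contrast the abstract draws between local, in-memory processing and database-side processing can be illustrated with a minimal sketch. This is not the authors' architecture. The `readings` table, the function names, and the use of an in-memory SQLite database as a stand-in for a database server are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical 'readings' table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (key TEXT, value INTEGER)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1), ("b", 2), ("a", 3)],
    )
    return conn


def total_per_key_in_memory(conn):
    """Notebook-style: fetch every row into local memory, then aggregate in Python."""
    totals = {}
    for key, value in conn.execute("SELECT key, value FROM readings").fetchall():
        totals[key] = totals.get(key, 0) + value
    return totals


def total_per_key_in_database(conn):
    """Database-style: the engine aggregates; only one row per key is returned."""
    rows = conn.execute("SELECT key, SUM(value) FROM readings GROUP BY key")
    return dict(rows)
```

Both functions return the same totals. The difference is where the work happens: the first moves all rows to the client before aggregating, while the second returns only the aggregated result, which matters once the data no longer fits in a workstation's memory.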

Metadata
Title
Data Engineering for Data Science: Two Sides of the Same Coin
Written by
Oscar Romero
Robert Wrembel
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-59065-9_13
