2020 | OriginalPaper | Chapter

Data Engineering for Data Science: Two Sides of the Same Coin

Authors: Oscar Romero, Robert Wrembel

Published in: Big Data Analytics and Knowledge Discovery

Publisher: Springer International Publishing

Abstract

The de facto technological standard in data science is the notebook (e.g., Jupyter), which provides an integrated environment for executing data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most data science languages process data locally, i.e., on workstations with limited memory, and store data in files. Thus, this approach neglects the benefits brought by over 40 years of R&D in the area of data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate a standardized data engineering approach for data science and present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which in turn enables the semi-automation of the logical and physical designs of such DPPs.
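
The contrast the abstract draws between local, file-based processing and database-managed processing can be illustrated with a minimal sketch. It is not taken from the chapter: the sample data, the table name `readings`, and the helper names `mean_in_memory` and `mean_in_database` are all hypothetical, and SQLite merely stands in for a full database system. The first function mimics the notebook habit of loading every row into workstation memory before aggregating. The second delegates the aggregation to the database engine.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a file a notebook would read.
CSV_TEXT = "sensor,reading\na,1.0\na,3.0\nb,10.0\n"


def mean_in_memory(csv_text):
    """Notebook style: materialize all rows locally, then aggregate in Python."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # whole file held in memory
    totals = {}
    for row in rows:
        total, count = totals.get(row["sensor"], (0.0, 0))
        totals[row["sensor"]] = (total + float(row["reading"]), count + 1)
    return {sensor: total / count for sensor, (total, count) in totals.items()}


def mean_in_database(csv_text):
    """Data engineering style: load into a DBMS and push the aggregation down."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in DBMS
    try:
        conn.execute("CREATE TABLE readings (sensor TEXT, reading REAL)")
        reader = csv.DictReader(io.StringIO(csv_text))
        # Rows are streamed into the table instead of being collected in a list.
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?)",
            ((r["sensor"], float(r["reading"])) for r in reader),
        )
        cursor = conn.execute(
            "SELECT sensor, AVG(reading) FROM readings GROUP BY sensor"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(mean_in_memory(CSV_TEXT))
    print(mean_in_database(CSV_TEXT))
```

Both functions return the same per-sensor averages. In the second variant, the grouping and averaging run inside the database engine, which is where a query optimizer, indexes, and storage management can be applied once the data outgrows a single workstation.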

Metadata

Title: Data Engineering for Data Science: Two Sides of the Same Coin
Authors: Oscar Romero, Robert Wrembel
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-59065-9_13
