2019 | OriginalPaper | Chapter

Ontologies for Data Science: On Its Application to Data Pipelines

Authors: Miguel-Ángel Sicilia, Elena García-Barriocanal, Salvador Sánchez-Alonso, Marçal Mora-Cantallops, Juan-José Cuadrado

Published in: Metadata and Semantic Research

Publisher: Springer International Publishing

Abstract

Ontologies are usually applied to drive intelligent applications and as a resource for integrating or extracting information, as in Natural Language Processing (NLP) tasks. Further, ontologies such as the Gene Ontology (GO) are used as artifacts for very specific research aims. However, the value of ontologies for data analysis tasks may go beyond these uses and extend to supporting the reuse and composition of data acquisition, integration and fusion code. This requires that both data and code artifacts support meta-descriptions based on shared conceptualizations. In this paper, we discuss the different concerns in semantically describing data pipelines as a key reusable artifact that could be retrieved, compared and reused with a degree of automation if semantically consistent descriptions are provided. Concretely, we propose attaching semantic descriptions of data and analytic transformations to current backend-independent distributed processing frameworks such as Apache Beam, as these already abstract away the specifics of the supporting execution engines.
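
The idea of attaching semantic descriptions to pipeline transformations can be illustrated with a minimal sketch. It is written in plain Python and deliberately does not use the Apache Beam API; the names `SemanticDescription`, `describe`, `composable` and `run_pipeline`, as well as the `http://example.org/onto#` terms, are hypothetical and are not taken from the paper or from any real ontology. The sketch only shows how input and output concepts attached to each step might be compared before steps are composed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder namespace: these IRIs are illustrative assumptions, not real ontology terms.
EX = "http://example.org/onto#"


@dataclass(frozen=True)
class SemanticDescription:
    """Hypothetical meta-description attached to one transformation."""
    input_concept: str   # concept describing the data the step consumes
    output_concept: str  # concept describing the data the step produces
    operation: str       # concept describing the kind of transformation


def describe(input_concept: str, output_concept: str, operation: str) -> Callable:
    """Decorator that attaches a SemanticDescription to a transformation function."""
    def decorator(fn: Callable) -> Callable:
        fn.semantics = SemanticDescription(input_concept, output_concept, operation)
        return fn
    return decorator


@describe(EX + "TemperatureCelsius", EX + "TemperatureKelvin", EX + "UnitConversion")
def celsius_to_kelvin(values: Sequence[float]) -> List[float]:
    return [v + 273.15 for v in values]


@describe(EX + "TemperatureKelvin", EX + "MeanTemperatureKelvin", EX + "Mean")
def mean(values: Sequence[float]) -> List[float]:
    values = list(values)
    return [sum(values) / len(values)]


def composable(first: Callable, second: Callable) -> bool:
    """Two steps compose if the first produces the concept the second consumes."""
    return first.semantics.output_concept == second.semantics.input_concept


def run_pipeline(steps: List[Callable], data: Sequence[float]) -> List[float]:
    """Check semantic consistency of adjacent steps, then apply them in order."""
    for prev, nxt in zip(steps, steps[1:]):
        if not composable(prev, nxt):
            raise ValueError(
                f"{prev.__name__} yields {prev.semantics.output_concept}, "
                f"but {nxt.__name__} expects {nxt.semantics.input_concept}"
            )
    result = list(data)
    for step in steps:
        result = step(result)
    return result
```

Under these assumptions, `run_pipeline([celsius_to_kelvin, mean], readings)` is accepted, while the reversed order is rejected before any data is processed. A real implementation would presumably rely on ontology reasoning (for example, subsumption between concepts) rather than the exact string equality used here.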

Footnotes
9
At the time of this writing, Beam is mainly used as a data transformation framework and does not include distributed machine learning algorithms. For example, TensorFlow Extended (TFX) [4] is built on top of Beam, but no algorithms are available as aggregate transformations.

10
The transformations are just examples; they are not intended as analytics with real practical value.

11
This would require integrating a parser for that language for checking and manipulation, which could include exploiting semantic mappings.
 
Literature
1. Akidau, T., et al.: The dataflow model. Proc. VLDB Endow. 8(12), 1792–1803 (2015)
3. Baškarada, S., Koronios, A.: Unicorn data scientist: the rarest of breeds. Program 51(1), 65–74 (2017)
4. Baylor, D., Breck, E., Cheng, H.T., et al.: TFX: a tensorflow-based production-scale machine learning platform. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395. ACM (2017)
6. Buitinck, L., Louppe, G., Blondel, M., et al.: API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238 (2013)
7. Capadisli, S., Auer, S., Ngonga Ngomo, A.C.: Linked SDMX data. Semantic Web 6(2), 105–112 (2015)
8. Figuerola, C.G., Groves, T.: Analysing the potential of Wikipedia for science education using automatic organization of knowledge. Program 51(4), 373–386 (2017)
9. Guazzelli, A., Zeller, M., Lin, W.C., Williams, G.: PMML: an open standard for sharing models. R J. 1(1), 60–65 (2009)
10. Hajra, A., Tochtermann, K.: Linking science: approaches for linking scientific publications across different LOD repositories. Int. J. Metadata Semant. Ontol. 12(2–3), 124–141 (2017)
11. Karimova, Y., Castro, J.A., Silva, J.R.D., Pereira, N., Rodrigues, J., Ribeiro, C.: Description+ annotation: semantic data publication workflow with Dendro and B2NOTE. Int. J. Metadata Semant. Ontol. 12(4), 182–194 (2017)
12. Lanza, J., et al.: Managing large amounts of data generated by a smart city internet of things deployment. Int. J. Semant. Web Inf. Syst. (IJSWIS) 12(4), 22–42 (2016)
13. Lytras, M.D., Raghavan, V., Damiani, E.: Big data and data analytics research: from metaphors to value space for collective wisdom in human decision making and smart machines. Int. J. Semant. Web Inf. Syst. (IJSWIS) 13(1), 1–10 (2017)
14. McPhillips, T., Bowers, S., Zinn, D., Ludäscher, B.: Scientific workflow design for mere mortals. Future Gener. Comput. Syst. 25(5), 541–551 (2009)
15. Madin, J., Bowers, S., Schildhauer, M., Krivov, S., Pennington, D., Villa, F.: An ontology for describing and synthesizing ecological observation data. Ecol. Inform. 2(3), 279–296 (2007)
16. Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
17. Patterson, E., Baldini, I., Mojsilovic, A., Varshney, K.R.: Semantic representation of data science programs. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 5847–5849 (2018)
18. Patterson, E., Baldini, I., Mojsilovic, A., Varshney, K.R.: Teaching machines to understand data science code by semantic enrichment of dataflow graphs. arXiv preprint arXiv:1807.05691 (2018)
19. Pease, A., Niles, I., Li, J.: The suggested upper merged ontology: a large ontology for the semantic web and its applications. In: Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web, vol. 28, pp. 7–10 (2002)
20. Peña, O., Aguilera, U., López-de-Ipiña, D.: Exploring LOD through metadata extraction and data-driven visualizations. Program 50(3), 270–287 (2016)
22. Sicilia, M.A., García-Barriocanal, E., Sánchez-Alonso, S., Rodríguez-García, D.: Ontologies of engineering knowledge: general structure and the case of software engineering. Knowl. Eng. Rev. 24(3), 309–326 (2009)
23. Wu, D., Zhu, L., Xu, X., Sakr, S., Lu, Q., Sun, D.: A pipeline framework for heterogeneous execution environment of big data processing. IEEE Softw. 33(2), 60–67 (2016)
24. Zhang, X., Li, K., Zhao, C., Pan, D.: A survey on units ontologies: architecture, comparison and reuse. Program 51(2), 193–213 (2017)
25. Zheng, J., et al.: The Ontology of Biological and Clinical Statistics (OBCS) for standardized and reproducible statistical analysis. J. Biomed. Semant. 7(1), 53 (2016)
Metadata
Title
Ontologies for Data Science: On Its Application to Data Pipelines
Authors
Miguel-Ángel Sicilia
Elena García-Barriocanal
Salvador Sánchez-Alonso
Marçal Mora-Cantallops
Juan-José Cuadrado
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-14401-2_16