Published in: The Journal of Supercomputing 8/2021

01.02.2021

Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data

Author: Björn Schembera


Abstract

The deluge of dark data is about to happen. Lacking data management capabilities, especially in the field of supercomputing, and missing data documentation (i.e., missing metadata annotation) constitute a major source of dark data. The present work contributes to addressing this challenge by presenting ExtractIng, a generic automated metadata extraction toolkit. Existing metadata of simulation output files, scattered throughout the file system, can be aggregated, parsed and converted to the EngMeta metadata model. Use cases from computational engineering are considered to demonstrate the viability of ExtractIng. The evaluation results show that the metadata extraction is simulation-code independent in the sense that it can handle data outputs from various fields of science, is easy to integrate into simulation workflows, and is compatible with a multitude of computational environments.
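The aggregate, parse and convert steps named in the abstract can be illustrated with a minimal sketch. It is not ExtractIng's implementation and does not use the actual EngMeta schema: the key names in KEY_MAP, the XML element names, the `*.log` file pattern and the `key = value` line format are all assumptions made for illustration only.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical mapping from keys found in output files to target element names.
# Neither the keys nor the element names are taken from ExtractIng or EngMeta.
KEY_MAP = {
    "code": "simulationCode",
    "nodes": "computeNodes",
    "timestep": "timeStep",
}

# Assumed line format: "key = value" or "key: value".
LINE_RE = re.compile(r"^\s*(\w+)\s*[=:]\s*(.+?)\s*$")


def collect_metadata(root_dir, pattern="*.log"):
    """Aggregate known key/value pairs from all matching files under root_dir."""
    found = {}
    for path in sorted(Path(root_dir).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            match = LINE_RE.match(line)
            if match and match.group(1).lower() in KEY_MAP:
                # The first occurrence of a key wins; later duplicates are ignored.
                found.setdefault(match.group(1).lower(), match.group(2))
    return found


def to_xml(metadata):
    """Convert the collected pairs into a flat XML record (illustrative layout)."""
    root = ET.Element("dataset")
    for key, value in metadata.items():
        ET.SubElement(root, KEY_MAP[key]).text = value
    return ET.tostring(root, encoding="unicode")


def demo():
    """Build a small directory tree with two output files and extract from it."""
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "run1"
        run.mkdir()
        (run / "job.log").write_text("code = ExampleSolver\nnodes: 128\n", encoding="utf-8")
        (run / "md.log").write_text("timestep = 0.002\nunrelated text\n", encoding="utf-8")
        return to_xml(collect_metadata(tmp))


if __name__ == "__main__":
    print(demo())
```

Keeping the key mapping in a single table separate from the file traversal is one way to stay independent of a particular simulation code: supporting another code would then mean adding entries to the table rather than changing the parser.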


Footnotes
4. https://spark.apache.org/, last accessed Feb 14th 2020.
8. http://cfconventions.org/, last accessed Feb 26th 2020.
10. https://dataverse.org/, last accessed March 2nd 2020.
11. Interestingly, the authors speak of a “data swamp” in terms of dark data, contrasting this with a “data lake” of well-annotated data.
Metadata
Title: Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data
Author: Björn Schembera
Publication date: 01.02.2021
Publisher: Springer US
Published in: The Journal of Supercomputing, Issue 8/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-020-03602-6
