
Open Access 2021 | Original Paper | Book Chapter

2. Research Data Infrastructures and Engineering Metadata

Authors: Martin Thomas Horsch, Silvia Chiacchiera, Welchy Leite Cavalcanti, Björn Schembera

Published in: Data Technology in Materials Modelling

Publisher: Springer International Publishing


Abstract

This chapter introduces metadata models as a semantic technology for knowledge representation to describe selected aspects of a research asset. The process of building a hierarchical metadata model is re-enacted in this chapter and highlighted by the example of EngMeta. Moreover, an overview of data infrastructures is given, their general architecture and functions are discussed, and multiple examples of data infrastructures in materials modelling are presented.
The two core elements of data technology in every field of science in general, and in materials modelling in particular, are metadata or ontologies on the one hand and data infrastructures on the other. Even though they can work independently, they are strongly connected. Whereas metadata describes the data, the task of the research data infrastructure is to store and preserve the data and to connect it with its metadata description. In this way, mere data becomes semantically interoperable and therefore a valuable piece of information that respects the FAIR principles.
The chapter introduces metadata models as a semantic technology for knowledge representation to describe selected aspects of a research asset in Sect. 2.1. The process of building a hierarchical metadata model is re-enacted and highlighted by the example of EngMeta [1]. Moreover, the chapter gives an overview of data infrastructures in Sect. 2.2; there, the general architecture and functions are discussed and multiple examples of data infrastructures in materials modelling are given.

2.1 Engineering Metadata

This section examines engineering metadata. The term is deliberately ambiguous. First, engineering metadata denotes metadata that is used for engineering applications, such as materials modelling. Second, engineering metadata refers, in a more general way, to the art of designing metadata.
This section is organized as follows. First, Sect. 2.1.1 describes in a general way how an ontology-based metadata model is created. Second, Sect. 2.1.2 illustrates this process with EngMeta, a metadata model for engineering.

2.1.1 How to Engineer Metadata

The art of engineering a metadata model comprises several consecutive steps, which are described in this subsection. It may happen that this process, or a single step, has to be iterated several times to arrive at a fine-grained, purposeful description of the research asset. In short, the following steps are necessary to engineer a metadata model. First, a consensus must be reached about what purpose metadata actually serves in the given context. Then, an object model has to be carved out of the research process. Last, the object model has to be transferred to a formal representation and implemented, whereby it becomes a metadata model.

2.1.1.1 Definitions of Metadata and Metadata Models

At the beginning of designing metadata for a certain purpose, it first has to be discussed how metadata is defined. Usually, metadata is defined as a structured form of knowledge representation or, simply, as many authors put it, as “data about data” [2]. Edwards describes this as the holy grail of information science:
Extensive, highly structured metadata often are seen as a holy grail, a magic chalice both necessary and sufficient to render sharing and reusing data seamless, perhaps even automatic. [3, p. 672]
However, metadata is always strongly context dependent. To tackle this context dependence, metadata must serve as a mode of communication:
We propose an alternative view of metadata, focusing on its role in an ephemeral process of scientific communication, rather than as an enduring outcome or product. [3, p. 667]
Following this, metadata takes the role of a semantic technology: Its task is to relieve data producers and data consumers of direct communication and negotiation, and it should thereby diminish the “science friction” [3] that occurs in every process where research data is exchanged. To illustrate science friction, imagine two researchers exchanging a dataset that is not properly described by metadata. The receiver might interpret the variable \(t_i\) as a data point in a time series. To obtain clarification, the receiver would have to contact the sender of the data, and this process, too, can be defective. This example shows the importance of metadata as a semantic asset, and therefore as a mode of fixed, negotiated communication.
Additionally, as Jane Greenberg puts it, metadata should semantically support the specific workflow [4]. For example, metadata may describe a data point with an error bar and define the form of the error, thereby supporting the interpretation of the data point.
Following this discussion of metadata, a metadata model can be seen as the middle ground between a non-formal model and a complete formalization of metadata keys, according to [5]. Its task is to describe research objects, or parts of them, and their relations to other objects. Metadata models are still interpretations; however, they are constructed in a transparent and comprehensible way, are derived from a common understanding of the research object, and lead to a fixed negotiation. The approach described in this chapter could also be called an ontology-based metadata model, since the metadata model is engineered from an object model. As depicted in Fig. 1.1, hierarchical models such as EngMeta range below an ontology; however, their task is likewise to balance the depth of domain knowledge representation against the depth of digitization. The question of how a metadata model differs from an ontology has already been discussed in Sect. 1.2.

2.1.1.2 Object Model

The object model is the starting point for engineering a metadata model and marks the first phase in the creation process [5]. In this phase, an object model (or, respectively, an ontology description) is carved out in a non-formal or natural language, possibly containing graphical elements, describing and explicating all the relevant objects, terms, relations and rules. Every person potentially involved has to contribute to this process, since the metadata model will act as a semantic convention for a common understanding of the research data described.
The first part of engineering an object model is a clear and fixed understanding of what the object of research is and by what data it is represented. This can only be achieved by analysing the research process together with all the stakeholders involved. In this step, the following information must be gathered:
  • Entities: All relevant entities (or objects) of the research process must be identified. This includes finding classes of entities and grouping or merging entities. In materials modelling, one relevant entity is, for example, the component, which represents a chemical species.
  • Attributes: For each entity defined in the previous step, attributes describing the entity must be found. To stay with the example, the component is characterized by attributes such as a name, the SMILES or IUPAC code and a unit.
  • Relations: In this part, the relations between the entities must be clarified, i.e. how they are linked to each other to deliver a holistic description. The arguments must be reasonable, but they are strongly specific to the research. For example, one could argue that the component is related to the simulated target system. Usually in metadata modelling, is-part-of relations are sufficient to model the vast majority of cases. However, relations are not limited to these hierarchical types and may give a semantically more advanced description, which eventually leads towards ontologies.
Figure 2.1 shows how a component in materials modelling could be represented by an entity, some attributes and a relation according to the example given above; a minimal code sketch follows below. All the entities can then be categorized according to the classes proposed in Sect. 1.3. The component entity would be categorized as discipline-specific metadata.
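To make the object model concrete, the following minimal sketch renders the component entity, its attributes and its is-part-of relation to the target system in Python. The class and field names follow the running example and are illustrative assumptions, not part of any published model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    """Entity: a chemical species occurring in the simulated system."""
    name: str                         # attribute: human-readable name
    smiles: Optional[str] = None      # attribute: SMILES code
    iupac: Optional[str] = None       # attribute: IUPAC name
    quantity: Optional[float] = None  # attribute: amount of substance
    unit: Optional[str] = None        # attribute: unit of the quantity

@dataclass
class TargetSystem:
    """Entity: the simulated target system."""
    name: str
    # Relation: each component *is part of* the target system.
    components: List[Component] = field(default_factory=list)

water = Component(name="water", smiles="O", iupac="oxidane")
system = TargetSystem(name="aqueous mixture", components=[water])
```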
In this step, the question also arises whether the description needs to be data centric or process centric. The answer strongly depends on the research process. For example, in code development, one needs to continuously follow the changes made to the code, i.e. the process of programming. Hence, the appropriate description of programming can only be process centric.1 In data science applications, whether a data-centric or a process-centric description should be chosen depends strongly on the workflow. In general, if data is the main outcome, even in a chain of process steps, one might want to choose a data-centric approach. If the processes are central to the research endeavour, and each process has a discrete output, one might choose a process-centric description. Of course, both approaches are not mutually exclusive: A data-centric approach also includes process information, and a process-centric approach includes an elaborated description of the data. It is just a matter of hierarchical structuring and precedence. In Sect. 2.1.2, we will discuss why and how a data-centric model was chosen for computational engineering and realized in EngMeta.

2.1.1.3 The Metadata Model and Its Implementation

When the object model is converted to a formal language, special care has to be taken if parts of the object model already exist in some standard. With respect to the categorization introduced in Sect. 1.3, the probability of finding existing, fitting standards for technical or descriptive metadata is high, whereas suitable standards for process and domain-specific metadata are not likely to be found. Some of the relevant standards are described in the section cited above; however, a vast number of standards exists.
Another consideration when implementing the model is choosing the right formal language for representing the metadata model. Most likely, this will be XSD2 or JSON Schema.3 Both offer a strict structural definition of the entities, attributes and relations, and the decision is more or less based on the setting of the metadata model: What skills are available, and what are the technical requirements for the implementation? For example, the question of which standard is supported by the database or repository where the metadata will later be stored is crucial when deciding on an implementation language.
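As an illustration, a JSON Schema fragment for the component entity from the running example might look as follows. This is a minimal sketch with assumed key names, not the EngMeta definition; it is validated here with the Python jsonschema package.

```python
from jsonschema import validate  # pip install jsonschema

component_schema = {
    "type": "object",
    "properties": {
        "name":   {"type": "string"},
        "smiles": {"type": "string"},
        "iupac":  {"type": "string"},
        "unit":   {"type": "string"},
    },
    "required": ["name"],
    "additionalProperties": False,
}

# A conforming record passes silently; a malformed one raises
# jsonschema.exceptions.ValidationError.
validate(instance={"name": "water", "smiles": "O"}, schema=component_schema)
```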

2.1.1.4 Metadata Processes

A metadata model alone is not sufficient. As Edwards puts it, metadata products such as models have to be accompanied by metadata processes:
Metadata products can be powerful resources, but very often—perhaps even usually—they work only when metadata processes are also available. [3, p. 668]
Otherwise, if processes are not available, so-called “metadata friction” occurs and the semantic assets become worthless. This term denotes the additional effort of (manual) metadata annotation and management, which has to be reduced by corresponding processes. This view is backed by the FAIR principles [6] and the additional guidance from an EU report [7]. The FAIR principles put metadata description forward as the main concept, and the study [7] complements this rather technical approach with processes surrounding these principles. In the case of materials modelling, and computational engineering in general, these processes would include, but are not limited to, the following:
  • Automated metadata extraction: One finding of [8] is that manual metadata annotation is a barrier to good research data management, especially in the engineering sciences. Hence, automated metadata extraction is a major supporting process.
  • Data and metadata stewardship: Data and metadata need clear responsibilities and roles that define stewardship. Such a role carries the responsibility for supporting metadata annotation, building metadata models and checking the data inventory for unindexed data. One example of such a role is the Scientific Data Officer [9].
  • Incentives: One main process to support metadata products is providing incentives to use the models and to tag the data with metadata. These incentives can be either intrinsic or extrinsic. Intrinsic incentives include low barriers for metadata annotation. Extrinsic incentives include making metadata annotation of the published research data mandatory for scientific publication.
  • Culture: To support metadata annotation, cultural processes also have to be adapted. Metadata annotation and research data management have to be seen as an essential part of scientific practice. The process of science has to be adapted to 1. publishing the data Open Access and 2. applying the FAIR paradigm of data description to it. This cultural change is linked to the above process of incentives: as of now, researchers only get recognition for publishing papers, not data.

2.1.2 Metadata for Engineering: The EngMeta Metadata Scheme

In this subsection, an example of a metadata model and its design is given. EngMeta [1, 8, 10] is a semantic metadata standard for computational engineering and was designed following the principles of the above subsection. Following Staab [5], EngMeta can be referred to as an ontology-based metadata model. A comparison to VIMMP as a genuine ontology is carried out in Sect. 4.5. EngMeta was designed as a joint effort of researchers from the computational engineering sciences (process engineering and aerodynamics), from the library sciences and from the computer sciences. This allowed the design of an integrated metadata model covering all the relevant research aspects in all four categories described in Sect. 1.3.

2.1.2.1 The Object Model of EngMeta

For the design of EngMeta, the object of research had to be identified first. This seems to be an easy task, but the devil is in the detail.
As aerodynamics and molecular dynamics served as use cases, it was clear that computational engineering and its outcomes were the common ground, but not more. All four metadata categories defined in Sect. 1.3 had to be fleshed out with concrete representations, which could only be accomplished by analysing the research itself for common entities and attributes of process and domain. Both the technical and the descriptive metadata keys were quite straightforward, since their specificity is low (see Fig. 1.2). The process metadata and the domain-specific metadata were harder to carve out of the two use cases and could only be gathered by a detailed analysis of the research process. The following entities were determined as process metadata for computational engineering:
  • processingStep serves as the highest level of the description for the provenance of the data and describes one processing step in the research process.
  • environment describes the computational environment on which the research was conducted, e.g. the hardware and compiler.
  • software describes the software environment in which the research was conducted, e.g. the code and its version.
The following entities were determined as domain-specific metadata for computational engineering applications. They were seen as common ground stemming from the use cases of aerodynamics and thermodynamics, but they are also applicable to use cases in materials modelling and beyond:
  • system: This key represents the simulated target system (or the observed system) and its characteristics, which are the metadata keys listed below.
  • variable: This metadata field represents the variables and parameters used, which can be either controlled or measured. This is not bound to a specific field of research but holds more generally for most applications in computational science, as variables and parameters are the basis of every simulation.
  • method: This field holds the information on the simulation method, such as “simulation with umbrella sampling”.
  • component: This metadata key describes the names and SMILES/IUPAC codes of the molecules and solvents used within the simulation.
  • force field: This key describes the force field used for the simulation.
  • boundaryCondition: This key describes the properties at the boundaries between two components.
  • spatial resolution: This key defines the spatial resolution of a simulation.
  • temporal resolution: This key defines the temporal resolution of a simulation, for example, the number of timesteps, the interval and other characteristics.
It also became clear that the model would be data centric, since the research process in computational engineering reaches a steady state when a dataset is produced by a simulation or by post-processing of some data. However, it is crucial to document the processing steps as well for a good provenance description. This leads to an object model where the dataset is at the top of the hierarchy and can include several processing steps, as the sketch below illustrates.
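A minimal sketch of this data-centric hierarchy is given below. The key names echo the entities listed above, but the exact nesting and all example values are illustrative assumptions, not the EngMeta schema.

```python
# The dataset is the root of the hierarchy; processing steps are
# aggregated below it as provenance information.
dataset = {
    "title": "Vapour-liquid equilibrium of a binary mixture",  # example title
    "system": {
        "components": [{"name": "water", "smiles": "O"}],
        "temporalResolution": {"numberOfTimesteps": 1000000},
    },
    "processingStep": [
        {
            "name": "equilibration run",
            "environment": {"hardware": "HPC cluster", "compiler": "gcc"},
            "software": {"name": "md-code", "version": "1.0"},  # placeholders
            "method": "molecular dynamics",
        }
    ],
}
```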
The complete object model of EngMeta, with all entities, their attributes and relations, is depicted in Fig. 2.2. The four metadata categories are coloured differently.

2.1.2.2 The Metadata Model of EngMeta and its Implementation

After setting up the object model, it was investigated whether there are metadata standards that serve the purpose of describing research assets in computational engineering as defined by the object model. No single standard was found; however, it was identified that different metadata standards cover certain aspects of the EngMeta entities. This coverage is shown in Table 2.1 with respect to the four metadata categories. CodeMeta is a description of software tools and serves for the software part of EngMeta. DataCite is the standard for descriptive metadata and, moreover, enables the data to obtain a DOI; it was therefore integrated into EngMeta. PREMIS is a standard for technical metadata, and ExptML was integrated for the description of experimental devices, which can also be modelled with EngMeta. As PROV is a standard for provenance, a crosswalk to this standard was developed in order to achieve semantic interoperability [1]. Moreover, the table includes a comparison to VIMMP, which is discussed in Chap. 4, regarding existing standards. The model has been implemented as an XML Schema Definition (XSD) and is available for open use and modification.4
Table 2.1 Existing standards that were used in EngMeta and VIMMP with respect to the four categories defined in Sect. 1.3

Category         | EngMeta Metadata Model     | VIMMP Ontology
Technical        | PREMIS                     | —
Descriptive      | DataCite                   | MMTO, OTRAS, VICO
Process          | CodeMeta, ExptML, UnitsML  | VISO
Domain specific  | —                          | VISO, VOV
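To illustrate what a crosswalk such as the one to PROV might look like in practice, the following sketch maps a few EngMeta entities to PROV-O classes. The mapping shown is an assumption for illustration, not the published crosswalk [1].

```python
# Illustrative mapping from EngMeta entities to PROV-O classes.
ENGMETA_TO_PROV = {
    "dataset":        "prov:Entity",
    "processingStep": "prov:Activity",
    "software":       "prov:SoftwareAgent",
}

def to_prov_type(engmeta_entity: str) -> str:
    """Return the PROV-O class mapped to an EngMeta entity."""
    return ENGMETA_TO_PROV[engmeta_entity]

print(to_prov_type("processingStep"))  # -> prov:Activity
```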

2.1.2.3 The Metadata Processes Supporting EngMeta

As discussed in Sect. 2.1.1.4, a metadata model needs to be complemented with metadata processes; otherwise, it will not be fully effective in making research data FAIR. In the example of EngMeta, the model was complemented by automated metadata extraction, the establishment of a research data management competence centre and an institutional repository. Details on the repository can be found in the following section on research data infrastructures, especially in Sect. 2.2.3.1. FOKUS was established as the main competence centre for questions and support regarding research data management at the University of Stuttgart. The automated metadata extraction tool ExtractIng was designed and implemented. It extracts all the existing metadata, stemming from log, job and various other files in the HPC and simulation environment, and converts them to the EngMeta metadata model. It can be integrated into the specific research process, and it was shown what an automated approach looks like for the simulation sciences: right after the simulation run, the ExtractIng tool is triggered, transforming all the scattered metadata into a standardized form according to EngMeta. Then, the metadata can be automatically uploaded to the repository together with the data, forming a dataset within the repository that includes all relevant semantic information for FAIR interoperability.
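The following sketch conveys the idea of such an extraction step. The log format, the key names and the function are assumptions for illustration and do not reproduce the actual ExtractIng implementation.

```python
import re
from pathlib import Path

# Keys the extractor knows how to map onto EngMeta fields (assumed).
KNOWN_KEYS = {"temperature", "timesteps", "software", "version"}

def extract_metadata(logfile: Path) -> dict:
    """Scan a log file for lines such as 'temperature = 300.0'."""
    record = {}
    pattern = re.compile(r"^\s*(\w+)\s*[:=]\s*(.+?)\s*$")
    for line in logfile.read_text().splitlines():
        match = pattern.match(line)
        if match and match.group(1) in KNOWN_KEYS:
            record[match.group(1)] = match.group(2)
    return record

# Example: metadata = extract_metadata(Path("run.log")); the resulting
# dictionary is then mapped onto EngMeta keys and uploaded with the data.
```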

2.2 Research Data Infrastructures

Research data infrastructures enable the data to become findable and accessible (the FA in FAIR), whereas semantic standards enable interoperability and reusability (the IR in FAIR). Hence, research data infrastructures are the second crucial pillar of FAIR data technology, as both parts are inseparable for semantic interoperability in materials modelling. Research data infrastructures are akin to repositories in that they ensure the enrichment of data with metadata, long-term preservation and open-access availability for the scientific community. Moreover, data infrastructures serve as the link between the data and the community and therefore play a significant role in science.
This section is organized as follows. First, the requirements and functions of data infrastructures are explained in detail in Sect. 2.2.1. Then, generic architectural key characteristics are discussed in Sect. 2.2.2. Finally, examples of research data infrastructures relevant for materials modelling are highlighted in Sect. 2.2.3.

2.2.1 Requirements and Functions

Data infrastructures in materials modelling should, besides the typical data management tasks of storing, sharing and enabling FAIR data, support the specific research by integrating open simulation codes, analytics tools and the management of the scientific workflow [11]. This means that a data infrastructure goes beyond a mere archival repository. However, the core of every data infrastructure is an archive with repository functions. The OAIS reference model (ISO 14721) gives an orientation as to what such a core may look like [12], and the following functionality was derived from this framework:
  • Data ingest: Functionalities for ingesting data have to be defined and implemented. This includes the design of an appropriate user interface and integration into the workflow.
  • Data preservation and archiving: Originally split into two functionalities in the OAIS framework, for our purpose of defining functionalities for materials modelling, merging them into one is sufficient. This functionality ensures permanent storage of the ingested data; at this layer, data preservation corresponds to bitstream preservation.
  • Data management: This functionality corresponds to metadata management and to linking the data objects according to metadata information.
  • Administration: This functionality includes not only administrative tasks but also policy management and the authentication and authorization infrastructure (AAI).
  • Data access: This functionality must be designed and implemented as a user interface in order to ensure data access for users. Moreover, this includes capabilities to search and explore the data infrastructure.
As mentioned earlier, the above basic functions have to be accompanied by supportive functions for the scientific workflow; a minimal interface sketch follows after this list. These should include the following:
  • Workflow support: The above functionalities have to be integrated seamlessly into the scientific workflow of the field.
  • Service tool integration: As moving data is expensive, the data infrastructure has to offer data analytics and processing tools close to the data repository. This can also include visualization services.
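As a compact summary of the functional decomposition above, the following sketch expresses the OAIS-derived functions as an abstract Python interface. The method names and signatures are illustrative assumptions, not taken from any concrete repository software.

```python
from abc import ABC, abstractmethod

class DataInfrastructure(ABC):
    """OAIS-derived core functions of a data infrastructure."""

    @abstractmethod
    def ingest(self, files: list, metadata: dict) -> str:
        """Data ingest: accept files plus metadata, return an identifier."""

    @abstractmethod
    def preserve(self, identifier: str) -> None:
        """Data preservation and archiving: ensure permanent storage."""

    @abstractmethod
    def manage(self, identifier: str, metadata: dict) -> None:
        """Data management: maintain metadata and links between objects."""

    @abstractmethod
    def administer(self, policy: dict) -> None:
        """Administration: policy management and AAI."""

    @abstractmethod
    def access(self, query: str) -> list:
        """Data access: search, explore and retrieve data objects."""
```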

2.2.2 Architectures

Data infrastructures can be logically divided into three major layers, which are depicted in Fig. 2.3 [13]. The functions defined in Sect. 2.2.1 have to be implemented in a specific layer or throughout all three layers. Which function resides in which layer depends on the precise implementation of a data infrastructure.5
The base layer of a data infrastructure is the storage layer (l1), where the data objects are physically stored and bitstream preservation is guaranteed. Technically, this layer can exist in a distributed and/or hierarchical setting and is often a combination of hard disk and tape storage. The intermediate layer is the object layer (l2), whose basic functionality is metadata management. In this layer, data from the storage layer is enriched with metadata, and data objects become information objects with a persistent identifier, whose purpose is to make the data citable. The third layer is the service layer (l3); it includes the user interface and marks the visible part of the data infrastructure. Moreover, this layer includes additional services, such as automated metadata extraction.
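The role of the object layer can be summarized in a small sketch: a bitstream on the storage layer becomes a citable information object once metadata and a persistent identifier are attached. The class, the field names and the example values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InformationObject:
    """A data object enriched on the object layer (l2)."""
    pid: str           # persistent identifier, e.g. a DOI
    metadata: dict     # descriptive and domain-specific metadata
    storage_path: str  # location of the bitstream on the storage layer (l1)

obj = InformationObject(
    pid="doi:10.1234/example",                   # hypothetical identifier
    metadata={"title": "Example dataset"},
    storage_path="/archive/tape/0042/data.tar",  # hypothetical location
)
```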
Basically, data infrastructures implement all three layers; however, they can operate in distributed environments. Usually, the base layer (l1) is the hardware part of the data infrastructure, whereas the layers (l2) and (l3) are the software part. The functionalities of the layers (l2) and (l3) are usually covered by repository software. A repository is a store for data that organizes this data in some logical manner and makes the data available to a specified group of persons. It is important to mention that a repository is not a filesystem, i.e. its purpose is not to manage files in directory structures. Instead, a repository must be imagined as collections of files organized in sets in some logical manner (for example, as datasets, as linked data or in a loose hierarchical structure) that are described by metadata, are searchable and retrievable, and are provided with a persistent identifier.
Table 2.2 Data repository software

Repository | Origin              | Sample installation (type[, field])
Dataverse  | Data management     | University of Stuttgart / DaRUS (institutional)
DSpace     | Document management | Fraunhofer-Gesellschaft / Fordatis (institutional)
Fedora     | Document management | Saarland University / CLARIN (domain-specific, linguistics)
Invenio    | Data management     | Swiss National Computing Centre / Materials Cloud ARCHIVE (domain-specific, materials modelling)
Out-of-the-box generic repository software packages are generally available and serve different purposes. Some of these packages stem from document management, whereas others have their origins in data/file management. Their origin has to be taken into account when evaluating a repository for a specific use case or domain. Table 2.2 gives an overview of typical data repository software. For example, Dataverse originates from the management of datasets, whereas DSpace stems from managing document files.6 However, DSpace is also capable of managing datasets, and the Fraunhofer-Gesellschaft is using it to store research data in its institutional repository Fordatis7 [16].
Research data infrastructures can be classified into institutional and domain-specific infrastructures. Institutional data infrastructures correspond to research data management at an institutional level and are not bound to a specific discipline. An example of this type is DaRUS, which is discussed in Sect. 2.2.3.1 of this chapter. A domain-specific data infrastructure, by contrast, is bound to a specific discipline and can span multiple institutions. An example of a domain-specific data infrastructure for materials modelling is NOMAD, which is discussed in Sect. 2.2.3.2.

2.2.3 Examples of Research Data Infrastructures in Materials Modelling

2.2.3.1 DaRUS

Even though the Data Repository of the University of Stuttgart (DaRUS)8 is an institutional repository and not limited to materials modelling, it is discussed here since its development was strongly driven by the EngMeta metadata model. Moreover, it is an example of a loosely coupled data infrastructure. Its development was motivated by the need for a sustainable repository for the University of Stuttgart, in particular for the materials modelling community at the university, as well as by the precursory design of EngMeta. Within the repository, EngMeta serves as the semantic core, and the repository is built around the metadata model, an approach that can be called metadata-driven repository development. The requirements, such as handling large datasets, stemmed from aerodynamics and molecular dynamics [17].
DaRUS is based on Dataverse; the driving factors for choosing this repository software were its design for research data management, its integration with the DOI persistent identifier infrastructure, its adaptability to metadata standards and its monolithic design. In the Dataverse repository software package, all data is organized in Dataverses (an organizational structure), datasets and files [18]. A Dataverse is the highest element in the hierarchical data organization structure of the repository and typically represents an institute or a research project. A dataset in the Dataverse terminology corresponds to a directory or a collection of files. As of July 2020, DaRUS holds almost 600 files in 49 datasets, which are organized in 60 Dataverses, mainly from the fields of engineering, computer science and physics.
As DaRUS is an institutional repository, it is only loosely coupled to the research infrastructure, since it is generic. This means that the service layer (l3) is basically the generic Dataverse web GUI. Additional services can be integrated by using one of the APIs that Dataverse offers, such as REST or SWORD. For example, an automated toolchain (as an external tool) was implemented using the Dataverse API for the specific use case of thermodynamics: after a simulation run, automated metadata extraction is triggered; then, the extracted metadata, together with the data, is automatically ingested into the DaRUS repository [19].
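A minimal sketch of such a toolchain step against the Dataverse native API is given below. The server URL, the API token, the collection alias and the abbreviated metadata payload are placeholders; the full dataset JSON structure is documented in the Dataverse API guide.

```python
import requests

SERVER = "https://darus.example.org"   # placeholder, not the real host
HEADERS = {"X-Dataverse-key": "xxxx"}  # API token of the depositing user

# Abbreviated metadata payload; see the Dataverse API guide for the
# full 'datasetVersion'/'metadataBlocks' structure.
dataset_json = {"datasetVersion": {"metadataBlocks": {}}}

# 1. Create a dataset inside a Dataverse collection ('my_institute').
resp = requests.post(
    f"{SERVER}/api/dataverses/my_institute/datasets",
    headers=HEADERS,
    json=dataset_json,
)
pid = resp.json()["data"]["persistentId"]

# 2. Attach a simulation output file to the new dataset.
with open("result.dat", "rb") as handle:
    requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        headers=HEADERS,
        params={"persistentId": pid},
        files={"file": handle},
    )
```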

2.2.3.2 NOMAD

In contrast to DaRUS, the Novel Materials Discovery (NOMAD) laboratory9 (or Novel Materials Discovery Center of Excellence, NOMAD CoE) is a prime example of a domain-specific data infrastructure that is highly integrated [20] into a virtual research environment. The repository part is complemented with the NOMAD Archive, the NOMAD Encyclopedia, the NOMAD Visualization Tools and the NOMAD Analytics Toolkit. NOMAD is recommended by Nature10 for depositing supplementary data when submitting a research article on materials modelling.
The NOMAD Repository is the central component of the laboratory and holds input and output data from materials simulations, free of charge, with a retention period of 10 years. The NOMAD Archive holds the open-access data from the repository converted into a code-independent format. To accomplish this, developing a metadata definition and a metadata component was crucial. It serves, just as proposed in Sect. 2.1.1.1, as a common understanding11 and, in line with the overall outline of this book, for making data semantically interoperable. The metadata definition uses 168 aligned and 2,360 code-specific metadata keys. For example, the different terms for quantities had to be mapped to one aligned term; a sketch of this kind of alignment follows below. According to [20], the development of this component of the data infrastructure was a challenge. The NOMAD Encyclopedia is the part of the NOMAD data infrastructure that provides millions of calculations via a web GUI with a materials-oriented view and therefore serves as a knowledge base and a materials classification system. The NOMAD Visualization Tools are a centralized service for data visualization within the data infrastructure, allowing users interactive graphical analysis in materials modelling. Additionally, the NOMAD Analytics Toolkit is a big data analytics approach to support data evaluation, for example, scanning for specific thermoelectric materials or finding suitable materials for heterogeneous catalysis.
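The following sketch illustrates the kind of term alignment described above. The code names, the key spellings and the mapping itself are assumptions for illustration, not NOMAD's actual metadata definitions.

```python
# Illustrative map from (code, code-specific key) to the aligned key.
CODE_SPECIFIC_TO_ALIGNED = {
    ("code_a", "etot"):        "energy_total",
    ("code_b", "TotalEnergy"): "energy_total",
    ("code_a", "nstep"):       "number_of_steps",
}

def align(code: str, key: str, value):
    """Translate one code-specific metadata entry to the aligned vocabulary."""
    aligned_key = CODE_SPECIFIC_TO_ALIGNED.get((code, key))
    if aligned_key is None:
        raise KeyError(f"no aligned term for {key!r} from {code!r}")
    return aligned_key, value

print(align("code_b", "TotalEnergy", -123.4))  # -> ('energy_total', -123.4)
```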
In the NOMAD laboratory, the archive and the repository components correspond to the storage layer (l1) and the object layer (l2), whereas the encyclopedia, the analytics toolkit and the visualization tools correspond to the service layer (l3), which is strongly coupled to the base layers.
As of February 2020, the NOMAD data infrastructure holds 49 TB of raw data in the repository and 19 TB in the archive in normalized, annotated form in 758 datasets.12

2.2.3.3 Materials Cloud

The Materials Cloud13 is another domain-specific data infrastructure; it includes all three aforementioned layers and implements them with specific technology supporting the data life cycle in materials modelling [11]. The Materials Cloud is, just like NOMAD, recommended by Nature for supplementary data for journal submissions in materials modelling. In the Materials Cloud, the ARCHIVE, DISCOVER, EXPLORE, WORK and LEARN components form the data infrastructure.
The ARCHIVE component represents the open-access research data repository with long-term storage, metadata protocols (including metadata harvesting for Google Dataset Search and B2FIND) and persistent identifiers (DOIs). The hardware backend of ARCHIVE is hosted at the Swiss National Computing Centre, it is free of charge, and data records are preserved for 10 years. For the software layer, Invenio will be used. ARCHIVE is moderated, which means that all ingested data is first checked against certain criteria, just as on preprint servers. The DISCOVER component provides the browsing capabilities for curated datasets from ARCHIVE and offers interactive visualization. The EXPLORE part of the system is the component that tracks and displays provenance information of the datasets to ensure FAIR and reproducible data. All this information is recorded by the AiiDA system, which can be imagined as a git-style methodology for data; the information is shown in a provenance graph. The WORK component is the part of the Materials Cloud data infrastructure that allows working with the available data, either through stand-alone tools that perform inexpensive calculations or through AiiDA lab. AiiDA lab is a tool for defining workflows and orchestrating them from the web interface; it lets users connect to and use remote computational resources or other repositories that support the OPTIMADE standard,14 such as NOMAD (a query sketch is given below). The LEARN part of the system features educational material, such as tutorials, video lectures and a downloadable virtual machine image for training purposes in materials modelling. This part is important since it covers metadata processes as discussed in Sect. 2.1.1.4.
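As an illustration of the OPTIMADE interface mentioned above, the following sketch queries the structures endpoint of a provider. The base URL is a placeholder; the /v1/structures endpoint and the filter grammar follow the OPTIMADE specification.

```python
import requests

BASE = "https://example.org/optimade"  # placeholder provider URL

# Retrieve up to five structures containing both aluminium and oxygen.
resp = requests.get(
    f"{BASE}/v1/structures",
    params={"filter": 'elements HAS ALL "Al","O"', "page_limit": 5},
)
for entry in resp.json()["data"]:
    print(entry["id"], entry["attributes"].get("chemical_formula_reduced"))
```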
Just like NOMAD, the Materials Cloud is a highly integrated data infrastructure, where the ARCHIVE component acts as the storage layer (l1) and the object layer (l2). The service layer (l3) is made up of the DISCOVER, EXPLORE, WORK and LEARN components.

2.2.3.4 Chemotion, MoMaF and NFDI

The Science Data Center for Molecular Materials Research (MoMaF)15 is one of the four Science Data Center (SDC) projects of the state of Baden-Württemberg in Germany, started in late 2019. Its goal is to support the data life cycle and implement the FAIR principles through a domain-specific repository for molecular materials research, the digitalization of lab books and metadata standards.
MoMaF relies on preliminary work conducted in the Chemotion project,16 whose aim was to build a data infrastructure for synthetic and analytic chemistry [21, 22]. The core of Chemotion is a repository that allows users to collect, reuse and publish data. It is complemented with discipline-specific data processing tools, incorporates DOI generation and supports the publication process, e.g. peer review of submissions and comparison of submissions with the PubChem database. The repository architecture consists of a private workspace and a publication area. Electronic laboratory notebooks play a crucial role here and can be imported into the private workspace. Research data17 can, after metadata have been added and a review has taken place, be staged from the private workspace to the publication area, where it is provided with a DOI and made Open Data. Within this approach, too, we can see how a repository on the object layer is complemented with additional tools in the service layer, such as data processing tools or electronic laboratory notebooks.
The work and the results from the MoMaF SDC will later be used in the National Research Data Infrastructure (NFDI) for Chemistry [23], one of the NFDI projects in Germany. Another project within the NFDI that will also have an impact on materials modelling is the NFDI for Catalysis.18
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
1
This is reflected by tools such as git, which include metadata for every commit to describe the process.
 
5
Mapping these functions to layers is not trivial; an example can be found in [14].
 
6
In the context of this chapter, iRODS has to be mentioned. Even though it is not a classical repository software package, it offers a unified namespace, and its functionalities include repository-style data management at the filesystem level [15].
 
17
Chemotion has two structuring elements, samples (batches of molecules) and reactions, leading to the principle that information is kept along with, and can be linked to, the chemical process.
 
References
1.
B. Schembera, D. Iglezakis, EngMeta: metadata for computational engineering. IJMSO 14(1), 26–38 (2020)
2.
A.J.G. Hey, A.E. Trefethen, The data deluge: an e-science perspective, in Grid Computing: Making the Global Infrastructure a Reality, ed. by F. Berman, G.C. Fox, A.J.G. Hey (Wiley, 2003), pp. 809–824
4.
J. Greenberg, Metadata and the world wide web. Encycl. Libr. Inf. Sci. 3, 1876–1888 (2003)
5.
S. Staab, Wissensmanagement mit Ontologien und Metadaten. Inform.-Spektr. 25(3), 194–209 (2002)
6.
M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton et al., The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016)
9.
B. Schembera, J.M. Durán, Dark data as the new challenge for big data science and the introduction of the scientific data officer. Philos. Technol. 33, 93–115 (2020)
10.
B. Schembera, D. Iglezakis, The genesis of EngMeta: a metadata model for research data in computational engineering, in Proceedings of MTSR 2018, ed. by E. Garoufallou, F. Sartori, R. Siatri, M. Zervas, CCIS, vol. 846 (Springer, Cham, Switzerland, 2018), pp. 127–132
11.
L. Talirz, S. Kumbhar, E. Passaro, A.V. Yakutovich, V. Granata, F. Gargiulo, M. Borelli, M. Uhrin, S.P. Huber, S. Zoupanos, C.S. Adorf, C.W. Andersen, O. Schütt, C.A. Pignedoli, D. Passerone, J. VandeVondele, T.C. Schulthess, B. Smit, G. Pizzi, N. Marzari, Materials Cloud, a platform for open computational science. Sci. Data 7, 299 (2020). arXiv:2003.12510 [cond-mat.mtrl-sci]
12.
OAIS, Reference model for an Open Archival Information System. Technical Report 650.0-M-2 (Magenta Book), Issue 2, CCSDS (2012)
13.
B. Schembera, T. Bönisch, Challenges of research data management for high performance computing, in Proceedings of TPDL 2017, ed. by J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis, I. Karydis, LNCS, vol. 10450 (Springer, Heidelberg, Germany, 2017), pp. 140–151
14.
J. Askhoj, M. Nagamori, S. Sugimoto, Archiving as a service: a model for the provision of shared archiving services using cloud computing, in Proceedings of iConference 2011 (ACM, New York, USA, 2011), pp. 151–158
15.
M. Hedges, A. Hasan, T. Blanke, Management and preservation of research data with iRODS, in Proceedings of the ACM 1st Workshop on CyberInfrastructure: Information Management in eScience, ed. by P. Mitra (ACM, New York, USA, 2011), pp. 17–22
16.
A. Wuchner, Das Projekt FORDATIS – Aufbau einer Forschungsdateninfrastruktur in der Fraunhofer-Gesellschaft, in Forschungsdaten: sammeln, sichern, strukturieren, ed. by B. Mittermaier (Forschungszentrum Jülich), pp. 57–78
18.
B. Selent, B. Schembera, D. Iglezakis, A. Seeland, Datenmanagement in Infrastrukturen, Prozessen und Lebenszyklen für die Ingenieurwissenschaften (Abschlussbericht des BMBF-Projektes DIPL-ING). Tech. rep., Universität Stuttgart, Stuttgart (2019). https://doi.org/10.2314/KXP:1693393980
20.
C. Draxl, M. Scheffler, NOMAD: the FAIR concept for big data-driven materials science. MRS Bull. 43(9), 676–682 (2018)
21.
P. Tremouilhac, A. Nguyen, Y.C. Huang, S. Kotov, D.S. Lütjohann, F. Hübsch, N. Jung, S. Bräse, Chemotion ELN: an open source electronic lab notebook for chemists in academia. J. Cheminform. 9(1), 1–13 (2017)
22.
P. Tremouilhac, C.L. Lin, P.C. Huang, Y.C. Huang, A. Nguyen, N. Jung, F. Bach, R. Ulrich, B. Neumair, A. Streit, S. Bräse, The repository Chemotion: infrastructure for sustainable research in chemistry. Angew. Chem. Int. Ed. (2020). https://doi.org/10.1002/anie.202007702
23.
C. Steinbeck, O. Koepler, F. Bach, S. Herres-Pawlis, N. Jung, J.C. Liermann, S. Neumann, M. Razum, C. Baldauf, F. Biedermann, T.W. Bocklitz, F. Boehm, F. Broda, P. Czodrowski, T. Engel, M.G. Hicks, S.M. Kast, C. Kettner, W. Koch, G. Lanza, A. Link, R.A. Mata, W.E. Nagel, A. Porzel, N. Schlörer, T. Schulze, H.G. Weinig, W. Wenzel, L.A. Wessjohann, S. Wulle, NFDI4Chem: towards a national research data infrastructure for chemistry in Germany. Res. Ideas Outcomes 6, e55852 (2020)
Metadata
Title: Research Data Infrastructures and Engineering Metadata
Authors: Martin Thomas Horsch, Silvia Chiacchiera, Welchy Leite Cavalcanti, Björn Schembera
Copyright year: 2021
DOI: https://doi.org/10.1007/978-3-030-68597-3_2
