2.1 Engineering Metadata
2.1.1 How to Engineer Metadata
2.1.1.1 Definitions of Metadata and Metadata Models
However, metadata is always strongly context-dependent. To tackle its context dependence, metadata must serve as a mode of communication:
“Extensive, highly structured metadata often are seen as a holy grail, a magic chalice both necessary and sufficient to render sharing and reusing data seamless, perhaps even automatic.” [3, p. 672]
Following this, metadata takes the role of a semantic technology: its task is to relieve data producers and data consumers of direct communication and negotiation, and it should therefore diminish “science friction” [3], which occurs in every process in which research data is exchanged. To illustrate science friction, imagine two researchers exchanging a dataset that is not properly described by metadata. The receiver might assume that the variable \(t_i\) is a data point in a time series. To clarify this, the receiver would have to contact the sender of the data, and this process can itself be error-prone. This example shows the importance of metadata as a semantic asset, and therefore as a mode of fixed, negotiated communication.
“We propose an alternative view of metadata, focusing on its role in an ephemeral process of scientific communication, rather than as an enduring outcome or product.” [3, p. 667]
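The exchange scenario above can be made concrete with a minimal sketch. It assumes a hypothetical metadata record shipped alongside the dataset; the dictionary layout, the function name `describe`, and the meaning assigned to `t_i` are all invented for illustration and are not taken from the source.

```python
# Hypothetical metadata record for the exchanged dataset; the meaning given
# for t_i is invented for illustration and is not taken from the text.
DATASET_METADATA = {
    "t_i": {"description": "temperature at grid point i", "unit": "K"},
}


def describe(symbol, metadata):
    """Return the documented meaning of a symbol, or flag it as undocumented."""
    entry = metadata.get(symbol)
    if entry is None:
        # Without metadata the receiver has to guess or contact the sender.
        return f"{symbol}: undocumented, contact the data producer"
    return f"{symbol}: {entry['description']} [{entry['unit']}]"


documented = describe("t_i", DATASET_METADATA)
undocumented = describe("t_i", {})
```

With the record present, the receiver no longer has to guess whether `t_i` is a time stamp; without it, the only remaining route is the direct negotiation that metadata is meant to reduce.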
2.1.1.2 Object Model
- Entities. All relevant entities (or objects) of the research process must be identified. This includes finding classes of entities and grouping or merging entities. In materials modelling, one relevant entity is, for example, the component, which represents a chemical species.
- Attributes. For each entity defined in the previous step, attributes describing the entity must be found. To stay with the example, the component is characterized by attributes such as a name, the SMILES or IUPAC code, and a unit.
- Relations. In this step, the relations between the entities must be clarified, that is, how they are linked to each other to deliver a holistic description. The arguments must be reasonable, but are strongly specific to the research. For example, one could argue that the component is related to the simulated target system. Usually in metadata modelling, is-part-of relations are sufficient to model the vast majority of cases. However, relations are not limited to these hierarchical types and may give a semantically richer description, which eventually leads towards ontologies.
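The three steps above (entities, attributes, relations) could be sketched as follows. This is a minimal illustration, not the object model of any particular scheme: the class names `Component` and `TargetSystem`, the field names, and the example values are assumptions, and the is-part-of relation is expressed simply as containment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Entity: a chemical species, with the attributes named in the text."""
    name: str
    smiles: str  # SMILES code of the species
    unit: str    # hypothetical: unit in which amounts of the species are given


@dataclass
class TargetSystem:
    """Entity: the simulated target system (hypothetical class name)."""
    label: str
    # Relation: each component is-part-of the target system.
    components: List[Component] = field(default_factory=list)


water = Component(name="water", smiles="O", unit="mol")
ethanol = Component(name="ethanol", smiles="CCO", unit="mol")
mixture = TargetSystem(label="water-ethanol mixture", components=[water, ethanol])

part_names = [c.name for c in mixture.components]
```

Containment is enough for hierarchical is-part-of relations; richer relation types would need explicit, typed links between entities, which is the step towards ontologies mentioned above.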
2.1.1.3 The Metadata Model and Its Implementation
2.1.1.4 Metadata Processes
“Metadata products can be powerful resources, but very often—perhaps even usually—they work only when metadata processes are also available.” [3, p. 668]
Otherwise, if processes are not available, so-called “metadata friction” would occur and the semantic assets would become worthless. This phenomenon denotes the additional effort of (manual) metadata annotation and management, which has to be reduced by corresponding processes. This view is backed by the FAIR principles [6] and the additional guidance from an EU report [7]. The FAIR principles establish metadata description as the main concept, and the report [7] complements this rather technical approach with processes surrounding these principles. In the case of materials modelling, and computational engineering in general, these processes would include, but are not limited to, the following:
- Automated metadata extraction. One finding of [8] is that manual metadata annotation is a barrier to good research data management, especially in the engineering sciences. Hence, automated metadata extraction is a major supporting process.
- Data and metadata stewardship. Data and metadata need clear responsibilities and roles that define stewardship. Such a role is responsible for supporting metadata annotation, building metadata models and checking the data inventory for unindexed data. One example of such a role is the Scientific Data Officer [9].
- Incentives. One main process to support metadata products is to provide incentives to use metadata models and tag the data with metadata. These incentives can be either intrinsic or extrinsic. Intrinsic incentives would include low barriers for metadata annotation. Extrinsic incentives would include making metadata annotation of the published research data mandatory for scientific publication.
- Culture. To support metadata annotation, cultural processes also have to be adapted. Metadata annotation and research data management have to be seen as an essential part of scientific practice. The process of science has to be adapted to (1) publishing the data Open Access and (2) applying the FAIR paradigm of data description to it. However, this cultural change may be linked to the above process of incentives: as of now, researchers only get recognition for publishing papers and not for publishing the data.
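Of the processes listed above, automated metadata extraction lends itself most directly to a code sketch. The example below assumes a hypothetical simulation log whose header consists of commented `key = value` lines; the log format, the code name `ExampleMD`, and the function `extract_metadata` are invented for illustration, and real simulation codes will need their own parsers.

```python
import re

# Hypothetical log excerpt; real simulation codes use their own formats.
LOG_TEXT = """\
# code = ExampleMD
# version = 1.4.2
# timestep = 0.002 ps
step 0 energy -1234.5
step 1 energy -1234.9
"""

# Matches commented header lines of the form "# key = value".
HEADER_LINE = re.compile(r"^#\s*(\w+)\s*=\s*(.+?)\s*$")


def extract_metadata(log_text):
    """Collect 'key = value' pairs from the commented header lines of a log."""
    metadata = {}
    for line in log_text.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            key, value = match.groups()
            metadata[key] = value
    return metadata


extracted = extract_metadata(LOG_TEXT)
```

The point of such a step is that the researcher does not have to retype information the simulation code has already written down, which lowers the annotation barrier described above.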
2.1.2 Metadata for Engineering: The EngMeta Metadata Scheme
2.1.2.1 The Object Model of EngMeta
- processingStep serves as the highest level of the description of the provenance of the data and describes one processing step in the research process.
- environment describes the computational environment in which the research was conducted, e.g. the hardware and compiler.
- software describes the software environment in which the research was conducted, e.g. the code and its version.
- system represents the simulated target system (or the observed system) and its characteristics, which are the metadata keys listed below.
- variable represents the variables and parameters used, which can be either controlled or measured variables. This key is not bound to a specific field of research but holds more generally for most applications in computational science, as variables and parameters are the basis of every simulation.
- method holds the information on the simulation method, such as “simulation with umbrella sampling”.
- component describes the names and SMILES/IUPAC codes of the molecules and solvents used within the simulation.
- force field describes the force field that is used for the simulation.
- boundaryCondition describes the properties on the boundaries of two components.
- spatial resolution defines the spatial resolution of a simulation.
- temporal resolution defines the temporal resolution of a simulation, for example, the number of timesteps, the interval and other characteristics.
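To give a feel for how the keys above might fit together, the following sketch builds one record as a nested dictionary and checks it for missing keys. The key names come from the list; the nesting, the sub-fields (such as `hardware` or `kind`), all values, and the helper `missing_system_keys` are assumptions made for illustration and do not reproduce the actual EngMeta schema.

```python
# Key names follow the list above; nesting, sub-fields and values are illustrative.
record = {
    "processingStep": {
        "environment": {"hardware": "example cluster", "compiler": "gcc"},
        "software": {"name": "ExampleMD", "version": "1.4.2"},
    },
    "system": {
        "method": "simulation with umbrella sampling",
        "component": [{"name": "water", "smiles": "O"}],
        "variable": [{"name": "temperature", "kind": "controlled"}],
    },
}

# A subset of the system-level keys, chosen for this sketch only.
SYSTEM_KEYS = ("variable", "method", "component", "boundaryCondition")


def missing_system_keys(rec):
    """Return the system-level keys from SYSTEM_KEYS that a record lacks."""
    system = rec.get("system", {})
    return [key for key in SYSTEM_KEYS if key not in system]


missing = missing_system_keys(record)
```

A check of this kind could flag incomplete descriptions before data is deposited; here it would report that `boundaryCondition` has not been filled in.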
2.1.2.2 The Metadata Model of EngMeta and Its Implementation
Metadata category | EngMeta Metadata Model | VIMMP Ontology |
---|---|---|
Technical | PREMIS | – |
Descriptive | DataCite | MMTO, OTRAS, VICO |
Process | CodeMeta, ExptML, UnitsML | VISO |
Domain specific | – | VISO, VOV |
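The table can also be read as a lookup structure. The sketch below copies its contents into a dictionary, assuming that the first column holds the metadata category and the other two hold the standards used by EngMeta and the ontologies used by VIMMP; the names `STANDARDS` and `categories_using` are hypothetical.

```python
# Contents copied from the table above; an empty list stands for an empty cell.
STANDARDS = {
    "Technical": {"EngMeta": ["PREMIS"], "VIMMP": []},
    "Descriptive": {"EngMeta": ["DataCite"], "VIMMP": ["MMTO", "OTRAS", "VICO"]},
    "Process": {"EngMeta": ["CodeMeta", "ExptML", "UnitsML"], "VIMMP": ["VISO"]},
    "Domain specific": {"EngMeta": [], "VIMMP": ["VISO", "VOV"]},
}


def categories_using(name):
    """List the metadata categories in which a standard or ontology appears."""
    return [
        category
        for category, columns in STANDARDS.items()
        if any(name in entries for entries in columns.values())
    ]


viso_categories = categories_using("VISO")
```

Such a lookup makes it easy to see, for instance, that VISO covers both process and domain-specific metadata on the VIMMP side.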
2.1.2.3 The Metadata Processes Supporting EngMeta
2.2 Research Data Infrastructures
2.2.1 Requirements and Functions
- Data Ingest. Functionalities for ingesting data have to be defined and implemented. This includes the design of an appropriate user interface and integration into the workflow.
- Data Preservation and Archiving. These are originally two separate functionalities in the OAIS framework; for our purpose of defining functionalities for materials modelling, merging them into one is sufficient. This functionality should ensure permanent storage of the ingested data. On this layer, data preservation corresponds to bitstream preservation.
- Data Management. This functionality corresponds to metadata management and to linking the data objects according to metadata information.
- Administration. This functionality includes not only administrative tasks, but also policy management and authentication and authorization infrastructure (AAI).
- Data Access. This functionality must be designed and implemented through a user interface in order to ensure data access for users. Moreover, it includes capabilities to search and explore the data infrastructure.
- Workflow support. The above functionalities have to be integrated seamlessly into the scientific workflow in the field.
- Service tool integration. As moving data is expensive, the data infrastructure has to enable data analytics and processing tools close to the data repository. This can also include visualization services.
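Several of the functionalities above (ingest, bitstream preservation, metadata-based data management, and access) can be illustrated with a toy in-memory repository. This is only a sketch under strong simplifications: the class `InMemoryRepository` and its method names are hypothetical, storage is a Python dictionary rather than an archive, and administration, workflow support and tool integration are left out.

```python
import hashlib


class InMemoryRepository:
    """Toy stand-in for a research data repository (illustrative only)."""

    def __init__(self):
        self._objects = {}    # identifier -> stored bytes
        self._checksums = {}  # identifier -> SHA-256 checksum recorded at ingest
        self._metadata = {}   # identifier -> metadata dictionary

    def ingest(self, data, metadata):
        """Data ingest: store the bytes and their metadata, return an identifier."""
        checksum = hashlib.sha256(data).hexdigest()
        identifier = checksum[:12]  # hypothetical identifier scheme
        self._objects[identifier] = data
        self._checksums[identifier] = checksum
        self._metadata[identifier] = dict(metadata)
        return identifier

    def verify(self, identifier):
        """Bitstream preservation: compare stored bytes with the ingest checksum."""
        current = hashlib.sha256(self._objects[identifier]).hexdigest()
        return current == self._checksums[identifier]

    def search(self, **criteria):
        """Data management: find identifiers whose metadata match all criteria."""
        return [
            identifier
            for identifier, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]

    def access(self, identifier):
        """Data access: return the stored bytes."""
        return self._objects[identifier]


repo = InMemoryRepository()
pid = repo.ingest(b"step 0 energy -1234.5\n", {"method": "molecular dynamics"})
hits = repo.search(method="molecular dynamics")
```

Separating `search` (which operates only on metadata) from `access` (which returns the data) mirrors the distinction the list draws between data management and data access.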
2.2.2 Architectures
Repository | Origin | Sample installation (type[, field]) |
---|---|---|
Dataverse | Data management | University of Stuttgart/DaRUS (institutional) |
DSpace | Document management | Fraunhofer Gesellschaft/Fordatis (institutional) |
Fedora | Document management | Saarland University/CLARIN (domain-specific, linguistics) |
Invenio | Data management | Swiss National Supercomputing Centre/Materials Cloud ARCHIVE (domain-specific, materials modelling) |