
2021 | Book

The Semantic Web – ISWC 2021

20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings

Editors: Andreas Hotho, Eva Blomqvist, Stefan Dietze, Achille Fokoue, Ying Ding, Payam Barnaghi, Armin Haller, Mauro Dragoni, Harith Alani

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 20th International Semantic Web Conference, ISWC 2021, which took place in October 2021. Due to the COVID-19 pandemic, the conference was held virtually.
The papers included in this volume deal with the latest advances in fundamental research, innovative technology, and applications of the Semantic Web, linked data, knowledge graphs, and knowledge processing on the Web. The papers are organized into a research track, a resources track, and an in-use track. The research track covers theoretical, analytical, and empirical aspects of the Semantic Web and its intersection with other disciplines. The resources track promotes the sharing of resources that support, enable, or utilize Semantic Web research, including datasets, ontologies, software, and benchmarks. Finally, the in-use track is dedicated to contributions that describe the application and adoption of Semantic Web technologies in practical, real-world settings.

Table of Contents

Frontmatter

Research Track

Frontmatter
PCSG: Pattern-Coverage Snippet Generation for RDF Datasets

For reusing an RDF dataset, understanding its content is a prerequisite. To support the comprehension of its large and complex structure, existing methods mainly generate an abridged version of an RDF dataset by extracting representative data patterns as a summary. As a complement, recent attempts extract a representative subset of concrete data as a snippet. We extend this line of research by injecting the strength of summaries into snippets. We propose to generate a pattern-coverage snippet that best exemplifies the patterns of entity descriptions and links in an RDF dataset. Our approach incorporates formulations of the group Steiner tree and set cover problems to generate compact snippets. This extensible approach can also model query relevance, allowing it to be used in dataset search. Experiments on thousands of real RDF datasets demonstrate the effectiveness and practicability of our approach.

Xiaxia Wang, Gong Cheng, Tengteng Lin, Jing Xu, Jeff Z. Pan, Evgeny Kharlamov, Yuzhong Qu
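
The abstract above frames snippet selection as a set cover problem. As a rough illustration only (hypothetical entities and patterns, not the authors' PCSG implementation), a greedy set-cover step for picking snippet elements that jointly cover a dataset's patterns could look like this:

```python
# Greedy set cover over data patterns: repeatedly pick the candidate
# snippet element that covers the most still-uncovered patterns.
# Hypothetical toy data; not the PCSG system itself.

def greedy_pattern_cover(candidates, patterns):
    """candidates: element id -> set of patterns that element exemplifies."""
    uncovered = set(patterns)
    snippet = []
    while uncovered:
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # remaining patterns cannot be covered by any candidate
        snippet.append(best)
        uncovered -= gained
    return snippet

candidates = {
    "e1": {"(Person, foaf:name)", "(Person, worksFor, Org)"},
    "e2": {"(Org, rdfs:label)"},
    "e3": {"(Person, foaf:name)", "(Org, rdfs:label)"},
}
patterns = {"(Person, foaf:name)", "(Person, worksFor, Org)", "(Org, rdfs:label)"}
print(greedy_pattern_cover(candidates, patterns))  # ['e1', 'e2']
```

The actual approach additionally uses a group Steiner tree formulation to keep the selected elements connected, which this sketch omits.
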
A Source-to-Target Constraint Rewriting for Direct Mapping

Most of the existing structured digital information today is still stored in relational databases. It is therefore important for the Semantic Web effort to expose the information in relational databases as RDF, or to allow it to be queried using SPARQL. Direct mapping is a fully automated approach for converting well-structured relational data to RDF that does not require formulating explicit mapping rules [2, 8]. Along with the mapped RDF data, it is desirable to have a description of that data. Previous work [3, 8] has attempted to describe the RDF graph in terms of OWL axioms, which is problematic, partly due to the open-world semantics of OWL. We start from the direct mapping suggested by Sequeda et al. [8], which integrates and extends the functionalities of proposal [10] and the W3C recommendation [2], and present a source-to-target, semantics-preserving rewriting of constraints in an SQL database schema to equivalent SHACL [7] constraints on the RDF graph. We thus provide a SHACL description of the RDF data generated by the direct mapping without the need to perform a costly validation of those constraints on the generated data. Following the approach of [8], we define the rewriting from SQL constraints to SHACL by a set of Datalog rules. We prove that our source-to-target rewriting of constraints is constraint preserving and weakly semantics preserving.

Ratan Bahadur Thapa, Martin Giese
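
To make the idea of the rewriting concrete, here is an illustrative, hand-written example (not the output of the paper's Datalog rules) of how an SQL NOT NULL constraint might surface as a SHACL shape over direct-mapped RDF, checked with the pyshacl library; the paper's point is that such shapes can be derived directly from the schema, so this validation step becomes unnecessary:

```python
# Requires rdflib and pyshacl.  Illustrative only: the shape below is a
# hand-written analogue of an SQL NOT NULL constraint, not the output of
# the paper's Datalog-based rewriting.
from rdflib import Graph
from pyshacl import validate

shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/db/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Rough analogue of:  CREATE TABLE Person (name VARCHAR NOT NULL, ...)
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:Person-name ;
                  sh:datatype xsd:string ;
                  sh:minCount 1 ] .
"""

data_ttl = """
@prefix ex: <http://example.org/db/> .
ex:Person-1 a ex:Person .                              # no name: violation
ex:Person-2 a ex:Person ; ex:Person-name "Alice" .
"""

conforms, _, _ = validate(Graph().parse(data=data_ttl, format="turtle"),
                          shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"))
print(conforms)  # False, because ex:Person-1 lacks a name
```
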
Learning to Predict the Departure Dynamics of Wikidata Editors

Wikidata, one of the largest open collaborative knowledge bases, has drawn much attention from researchers and practitioners since its launch in 2012. As it is collaboratively developed and maintained by a large community of volunteer editors, understanding and predicting the departure dynamics of those editors is crucial, but has not been studied extensively in previous works. In this paper, we investigate the synergistic effect of two different types of features, statistical and pattern-based ones, with DeepFM as our classification model, a combination that has not been explored in a similar context, for predicting whether a Wikidata editor will stay on or leave the platform. Our experimental results show that using the two sets of features with DeepFM provides the best performance with respect to AUROC (0.9561) and F1 score (0.8843), and achieves substantial improvements compared to using either set of features alone and over a wide range of baselines.

Guangyuan Piao, Weipeng Huang
Towards Neural Schema Alignment for OpenStreetMap and Knowledge Graphs

OpenStreetMap (OSM) is one of the richest openly available sources of volunteered geographic information. Although OSM includes various geographical entities, their descriptions are highly heterogeneous, incomplete, and do not follow any well-defined ontology. Knowledge graphs can potentially provide valuable semantic information to enrich OSM entities. However, interlinking OSM entities with knowledge graphs is inherently difficult due to the large, heterogeneous, ambiguous, and flat OSM schema and to annotation sparsity. This paper tackles the alignment of OSM tags with the corresponding knowledge graph classes holistically by jointly considering the schema and instance layers. We propose a novel neural architecture that capitalizes upon a shared latent space for tag-to-class alignment, created using linked entities in OSM and knowledge graphs. Our experiments aligning OSM datasets for several countries with two of the most prominent openly available knowledge graphs, namely Wikidata and DBpedia, demonstrate that the proposed approach outperforms state-of-the-art schema alignment baselines by up to 37 percentage points in F1-score. The resulting alignment facilitates new semantic annotations for over 10 million OSM entities worldwide, over a 400% increase compared to the existing annotations.

Alishiba Dsouza, Nicolas Tempelmeier, Elena Demidova
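
A minimal sketch of the core idea, tag-to-class alignment in a shared latent space, under the assumption of already-trained embeddings (random placeholder vectors and hypothetical tag/class names, not the representations learned by the paper's model):

```python
# Rank candidate knowledge-graph classes for an OSM tag by cosine
# similarity in a shared embedding space.  The vectors are random
# stand-ins for trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
tag_emb = {"amenity=school": rng.normal(size=dim)}
class_emb = {"dbo:School": rng.normal(size=dim),
             "dbo:University": rng.normal(size=dim),
             "dbo:Hospital": rng.normal(size=dim)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

tag = "amenity=school"
ranked = sorted(class_emb, key=lambda c: cosine(tag_emb[tag], class_emb[c]),
                reverse=True)
print(tag, "->", ranked)  # classes ordered by similarity to the tag
```
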
Improving Inductive Link Prediction Using Hyper-relational Facts

For many years, link prediction on knowledge graphs (KGs) has been a purely transductive task, not allowing for reasoning on unseen entities. Recently, increasing effort has been put into exploring semi- and fully inductive scenarios, enabling inference over unseen and emerging entities. Still, all these approaches only consider triple-based KGs, whereas their richer counterparts, hyper-relational KGs (e.g., Wikidata), have not yet been properly studied. In this work, we classify different inductive settings and study the benefits of employing hyper-relational KGs on a wide range of semi- and fully inductive link prediction tasks powered by recent advancements in graph neural networks. Our experiments on a novel set of benchmarks show that qualifiers over typed edges can lead to absolute performance gains of 6% (for the Hits@10 metric) compared to triple-only baselines. Our code is available at https://github.com/mali-git/hyper_relational_ilp .

Mehdi Ali, Max Berrendorf, Mikhail Galkin, Veronika Thost, Tengfei Ma, Volker Tresp, Jens Lehmann
Large-Scale Multi-granular Concept Extraction Based on Machine Reading Comprehension

The concepts in knowledge graphs (KGs) enable machines to understand natural language and thus play an indispensable role in many applications. However, existing KGs have poor coverage of concepts, especially fine-grained ones. In order to supply existing KGs with more fine-grained and new concepts, we propose a novel concept extraction framework, namely MRC-CE, to extract large-scale multi-granular concepts from the descriptive texts of entities. Specifically, MRC-CE is built on a machine reading comprehension model based on BERT, which can extract more fine-grained concepts with a pointer network. Furthermore, random forest and rule-based pruning are also adopted to enhance MRC-CE's precision and recall simultaneously. Our experiments on multilingual KGs, i.e., English Probase and Chinese CN-DBpedia, justify MRC-CE's superiority over state-of-the-art extraction models in KG completion. In particular, after running MRC-CE for each entity in CN-DBpedia, more than 7,053,900 new concepts (instanceOf relations) are supplied to the KG. The code and datasets have been released at https://github.com/fcihraeipnusnacwh/MRC-CE .

Siyu Yuan, Deqing Yang, Jiaqing Liang, Jilun Sun, Jingyue Huang, Kaiyan Cao, Yanghua Xiao, Rui Xie

Open Access

Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering

Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it requires a deep semantic and linguistic understanding of the question and the ability to associate it with the various objects present in the image, it is an ambitious task that requires multi-modal reasoning from both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method performs context-driven, sequential reasoning over the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths, which are the basis for deriving answers. We conduct an experimental study on the challenging GQA dataset, based on both manually curated and automatically generated scene graphs. Our results show that we keep up with human performance on manually curated scene graphs. Moreover, we find that Graphhopper outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs by a significant margin.

Rajat Koner, Hang Li, Marcel Hildebrandt, Deepan Das, Volker Tresp, Stephan Günnemann
EDG-Based Question Decomposition for Complex Question Answering over Knowledge Bases

Knowledge base question answering (KBQA) aims at automatically answering factoid questions over knowledge bases (KBs). For complex questions that require multiple KB relations or constraints, KBQA faces many challenges, including question understanding, component linking (e.g., entity, relation, and type linking), and query composition. In this paper, we propose a novel graph structure called Entity Description Graph (EDG) to represent the structure of complex questions, which can help alleviate the above issues. By leveraging the EDG structure of given questions, we implement a QA system over DBpedia, called EDGQA. Extensive experiments demonstrate that EDGQA outperforms the state of the art on both LC-QuAD and QALD-9, and that EDG-based decomposition is a feasible approach for complex question answering over KBs.

Xixin Hu, Yiheng Shu, Xiang Huang, Yuzhong Qu
Zero-Shot Visual Question Answering Using Knowledge Graph

Incorporating external knowledge into Visual Question Answering (VQA) has become a vital practical need. Existing methods mostly adopt pipeline approaches with different components for knowledge matching and extraction, feature learning, etc. However, such pipeline approaches suffer when some component does not perform well, leading to error cascading and poor overall performance. Furthermore, the majority of existing approaches ignore the answer bias issue: many answers may never have appeared during training (i.e., unseen answers) in real-world applications. To bridge these gaps, in this paper, we propose a Zero-shot VQA algorithm using a knowledge graph and a mask-based learning mechanism for better incorporating external knowledge, and present new answer-based Zero-shot VQA splits for the F-VQA dataset. Experiments show that our method can achieve state-of-the-art performance in Zero-shot VQA with unseen answers, while also substantially improving existing end-to-end models on the normal F-VQA task.

Zhuo Chen, Jiaoyan Chen, Yuxia Geng, Jeff Z. Pan, Zonggang Yuan, Huajun Chen
Learning to Recommend Items to Wikidata Editors

Wikidata is an open knowledge graph built by a global community of volunteers. As it advances in scale, it faces substantial challenges around editor engagement, both in attracting new editors to keep up with the sheer amount of work and in retaining existing editors. Experience from other online communities and peer-production systems, including Wikipedia, suggests that personalised recommendations could help, especially for newcomers, who are sometimes unsure about how best to contribute to an ongoing effort. For this reason, we propose WikidataRec, a recommender system for Wikidata items. The system uses a hybrid of content-based and collaborative filtering techniques to rank items for editors, relying on both item features and previous item-editor interactions. A neural network, named the neural mixture of representations, is designed to learn fine-grained weights for combining item-based representations and to optimize them with editor-based representations obtained from item-editor interactions. To facilitate further research in this space, we also create two benchmark datasets: a general-purpose one with 220,000 editors responsible for 14 million interactions with 4 million items, and a second one focusing on the contributions of more than 8,000 more active editors. We perform an offline evaluation of the system on both datasets with promising results. Our code and datasets are available at https://github.com/WikidataRec-developer/Wikidata_Recommender .

Kholoud AlGhamdi, Miaojing Shi, Elena Simperl
Graph-Boosted Active Learning for Multi-source Entity Resolution

Supervised entity resolution methods rely on labeled record pairs for learning matching patterns between two or more data sources. Active learning minimizes the labeling effort by selecting informative pairs for labeling. The existing active learning methods for entity resolution all target two-source matching scenarios and ignore signals that only exist in multi-source settings, such as the Web of Data. In this paper, we propose ALMSER, a graph-boosted active learning method for multi-source entity resolution. To the best of our knowledge, ALMSER is the first active learning-based entity resolution method that is especially tailored to the multi-source setting. ALMSER exploits the rich correspondence graph that exists in multi-source settings for selecting informative record pairs. In addition, the correspondence graph is used to derive complementary training data. We evaluate our method using five multi-source matching tasks having different profiling characteristics. The experimental evaluation shows that leveraging graph signals leads to improved results over active learning methods using margin-based and committee-based query strategies in terms of F1 score on all tasks.

Anna Primpeli, Christian Bizer
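
For context, the margin-based query strategy that ALMSER is compared against can be sketched in a few lines; the classifier, features, and candidate pool below are placeholders, and ALMSER's own graph-boosted selection goes beyond this baseline:

```python
# Margin-based active learning baseline for entity resolution:
# query the unlabeled record pair whose predicted match probability is
# closest to the decision boundary (0.5).  Toy features/labels only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_labeled = rng.normal(size=(20, 5))          # similarity feature vectors
y_labeled = np.array([0, 1] * 10)             # match / non-match labels
X_pool = rng.normal(size=(100, 5))            # unlabeled candidate pairs

clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)
proba = clf.predict_proba(X_pool)[:, 1]
query_idx = int(np.argmin(np.abs(proba - 0.5)))   # most uncertain pair
print("ask the annotator to label pair", query_idx)
```
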
Computing CQ Lower-Bounds over OWL 2 Through Approximation to RSA

Conjunctive query (CQ) answering over knowledge bases is an important reasoning task. However, with expressive ontology languages such as OWL, query answering is computationally very expensive. The PAGOdA system addresses this issue by using a tractable reasoner to compute lower and upper-bound approximations, falling back to a fully-fledged OWL reasoner only when these bounds don’t coincide. The effectiveness of this approach critically depends on the quality of the approximations, and in this paper we explore a technique for computing closer approximations via RSA, an ontology language that subsumes all the OWL 2 profiles while still maintaining tractability. We present a novel approximation of OWL 2 ontologies into RSA, and an algorithm to compute a closer (than PAGOdA) lower bound approximation using the RSA combined approach. We have implemented these algorithms in a prototypical CQ answering system, and we present a preliminary evaluation of our system that shows significant performance improvements w.r.t. PAGOdA.

Federico Igne, Stefano Germano, Ian Horrocks
Fast ObjectRank for Large Knowledge Databases

ObjectRank is an essential tool for evaluating the importance of nodes for a user-specified query in heterogeneous graphs. However, existing methods are not applicable to massive graphs because they iteratively compute scores for all nodes and edges. This paper proposes SchemaRank, which detects the exact top-k important nodes for a given query within a short running time. SchemaRank dynamically excludes unpromising nodes and edges while ensuring that it detects the same top-k important nodes as ObjectRank. Our extensive evaluations demonstrate that SchemaRank outperforms existing methods in running time by up to two orders of magnitude.

Hiroaki Shiokawa
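
ObjectRank can be viewed as query-personalized PageRank over a typed, schema-weighted graph. The exhaustive iterative computation that SchemaRank avoids for top-k answers can be sketched with networkx on a toy graph (the edge weights and restart distribution below are invented for illustration):

```python
# Baseline ObjectRank-style computation: personalized PageRank where the
# restart probability is concentrated on the query's base set.  Toy graph;
# in ObjectRank the edge weights come from the schema (authority transfer).
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("paper:1", "author:A", 1.0), ("paper:2", "author:A", 1.0),
    ("paper:2", "author:B", 1.0), ("author:A", "paper:3", 0.5),
])

personalization = {n: 0.0 for n in G}
personalization["paper:1"] = 1.0        # query node(s) = base set

scores = nx.pagerank(G, alpha=0.85, personalization=personalization,
                     weight="weight")
top3 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top3)
```
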
Open Domain Question Answering over Knowledge Graphs Using Keyword Search, Answer Type Prediction, SPARQL and Pre-trained Neural Models

Question Answering (QA) over vague or complex open-domain information needs is hard to make adequate, satisfying, and pleasing for end users. In this paper we investigate an approach where QA complements a general-purpose interactive keyword search system over RDF. We describe the role of QA in that context, and we detail and evaluate a pipeline for QA that involves a general-purpose entity search service over RDF, answer type prediction, entity enrichment through SPARQL, and pre-trained neural models. The fact that we start from a general-purpose keyword search over RDF makes the proposed pipeline widely applicable and realistic, in the sense that it does not presuppose the availability of a knowledge-graph-specific training dataset. We evaluate various aspects of the pipeline, including the effect of answer type prediction, as well as the performance of QA over existing benchmarks. The results show that, even when using different data sources for training, the proposed pipeline achieves satisfactory performance. Moreover, we show that QA can in turn improve the ranking of the retrieved entities.

Christos Nikas, Pavlos Fafalios, Yannis Tzitzikas
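
As an illustration of the entity-enrichment step in such a pipeline, the facts of a candidate entity returned by keyword search can be fetched over SPARQL and handed to a neural reader; the endpoint and entity below are examples, not the system's actual configuration:

```python
# Enrich a candidate entity with its literal facts via SPARQL (DBpedia's
# public endpoint used as an example).  Requires the SPARQLWrapper package.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p ?o WHERE {
        <http://dbpedia.org/resource/Nikola_Tesla> ?p ?o .
        FILTER(isLiteral(?o) && langMatches(lang(?o), "en"))
    } LIMIT 50
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["p"]["value"], "->", row["o"]["value"][:80])
```
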
Automatically Extracting OWL Versions of FOL Ontologies

While OWL and RDF are by far the most popular logic-based languages for Semantic Web ontologies, some well-designed ontologies are only available in languages with much richer expressivity, such as first-order logic (FOL) or the ISO standard Common Logic. This inhibits reuse of these ontologies by the wider Semantic Web community. While converting OWL ontologies to FOL is straightforward, the reverse problem of finding the closest OWL approximation of an FOL ontology is undecidable. However, for most practical purposes, a “good enough” OWL approximation need not be perfect to enable wider reuse by the Semantic Web community. This paper outlines such a conversion approach by first normalizing FOL sentences into a function-free prenex conjunctive normal form (FF-PCNF) that strips away minor syntactic differences, and then applying a pattern-based approach to identify common OWL axioms. It is tested on over 2,000 FOL ontologies from the Common Logic Ontology Repository.

Torsten Hahmann, Robert W. Powell II
Using Compositional Embeddings for Fact Checking

Unsupervised fact checking approaches for knowledge graphs commonly combine path search and scoring to predict the likelihood of assertions being true. Current approaches search for such metapaths in the discrete search space spanned by the input knowledge graph and make no use of continuous representations of knowledge graphs. We hypothesize that augmenting existing approaches with information from continuous knowledge graph representations has the potential to improve their performance. Our approach, Esther, searches for metapaths in compositional embedding spaces instead of the graph itself. By being able to explore longer metapaths, it can detect supplementary evidence for assertions being true that can be exploited by existing fact checking approaches. We evaluate Esther by combining it with 10 other approaches in an ensemble learning setting. Our results agree with our hypothesis and suggest that all other approaches benefit from being combined with Esther, by 20.65% AUC-ROC on average. Our code is open source and can be found at https://github.com/dice-group/esther .

Ana Alexandra Morim da Silva, Michael Röder, Axel-Cyrille Ngonga Ngomo
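
A toy sketch of what searching metapaths in a compositional embedding space can look like: under TransE-style composition, a chain of relations is represented by the sum of their vectors, and a metapath is promising when its composed vector lies close to the target relation. The vectors and relation names below are random placeholders, not Esther's trained embeddings:

```python
# Score length-2 metapaths by cosine similarity between the composed
# (summed) relation vectors and the target relation's vector.
import itertools
import numpy as np

rng = np.random.default_rng(42)
dim = 16
relations = {r: rng.normal(size=dim)
             for r in ["bornIn", "locatedIn", "citizenOf", "capitalOf"]}
target = relations["citizenOf"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

paths = [(p, cosine(relations[p[0]] + relations[p[1]], target))
         for p in itertools.permutations(relations, 2)]
for path, score in sorted(paths, key=lambda x: x[1], reverse=True)[:3]:
    print(path, round(score, 3))
```
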
Background Knowledge in Schema Matching: Strategy vs. Data

The use of external background knowledge can be beneficial for the task of matching schemas or ontologies automatically. In this paper, we exploit six general-purpose knowledge graphs as sources of background knowledge for the matching task. The background sources are evaluated by applying three different exploitation strategies. We find that explicit strategies still outperform latent ones and that the choice of the strategy has a greater impact on the final alignment than the actual background dataset on which the strategy is applied. While we could not identify a universally superior resource, BabelNet achieved consistently good results. Our best matcher configuration with BabelNet performs very competitively when compared to other matching systems even though no dataset-specific optimizations were made.

Jan Portisch, Michael Hladik, Heiko Paulheim
A Graph-Based Approach for Inferring Semantic Descriptions of Wikipedia Tables

There are millions of high-quality tables available in Wikipedia. These tables cover many domains and contain useful information. To make use of these tables for data discovery or data integration, we need precise descriptions of the concepts and relationships in the data, known as semantic descriptions. However, creating semantic descriptions is a complex process requiring considerable manual effort and can be error-prone. In this paper, we present a novel probabilistic approach for automatically building semantic descriptions of Wikipedia tables. Our approach leverages hyperlinks in a Wikipedia table and existing knowledge in Wikidata to construct a graph of possible relationships in the table and its context, and then uses collective inference to distinguish genuine from spurious relationships to form the final semantic description. In contrast to existing methods, our solution can handle tables that require complex semantic descriptions of n-ary relations (e.g., the population of a country in a particular year) or implicit contextual values to describe the data accurately. In our empirical evaluation, our approach outperforms state-of-the-art systems on the SemTab2020 dataset, by as much as 28% in F1 score on a large set of Wikipedia tables.

Binh Vu, Craig A. Knoblock, Pedro Szekely, Minh Pham, Jay Pujara
Generative Relation Linking for Question Answering over Knowledge Bases

Relation linking is essential to enable question answering over knowledge bases. Although there are various efforts to improve relation linking performance, the current state-of-the-art methods do not achieve optimal results, thereby negatively impacting the overall end-to-end question answering performance. In this work, we propose a novel approach for relation linking, framing it as a generative problem that facilitates the use of pre-trained sequence-to-sequence models. We extend such sequence-to-sequence models with the idea of infusing structured data from the target knowledge base, primarily to enable these models to handle the nuances of the knowledge base. Moreover, we train the model to generate a structured output consisting of a list of argument-relation pairs, enabling a knowledge validation step. We compare our method against existing relation linking systems on four different datasets derived from DBpedia and Wikidata. Our method reports large improvements over the state of the art while using a much simpler model that can be easily adapted to different knowledge bases.

Gaetano Rossiello, Nandana Mihindukulasooriya, Ibrahim Abdelaziz, Mihaela Bornea, Alfio Gliozzo, Tahira Naseem, Pavan Kapanipathi

Open Access

Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages

Semantic markup, such as Schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google’s Dataset Search. Dataset Search relies on Schema.org to identify pages that describe datasets. While Schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide Schema.org/Dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search’s Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with Schema.org/Dataset markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.

Tarfah Alrashed, Dimitris Paparas, Omar Benjelloun, Ying Sheng, Natasha Noy
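
For readers unfamiliar with the markup in question, the snippet below shows the kind of Schema.org/Dataset JSON-LD a page may carry (a made-up example). Detecting the markup is the easy part; the paper's classifier addresses the harder question of whether such a page truly describes a dataset:

```python
# Read a page's JSON-LD block and check whether it declares
# Schema.org/Dataset markup.  The JSON-LD below is a fabricated example.
import json

page_jsonld = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "description": "Daily rainfall, 2010-2020, for three weather stations.",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""

markup = json.loads(page_jsonld)
print("Schema.org/Dataset markup present:", markup.get("@type") == "Dataset")
# A page can declare @type Dataset and still not be a dataset page;
# that veracity question is what the neural classifier answers.
```
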
Learning Visual Models Using a Knowledge Graph as a Trainer

Traditional computer vision approaches, based on neural networks (NNs), are typically trained on a large amount of image data. By minimizing the cross-entropy loss between a prediction and a given class label, the NN and its visual embedding space are learned to fulfill a given task. However, due to the sole dependence on the image data distribution of the training domain, these models tend to fail when applied to a target domain that differs from their source domain. To learn an NN that is more robust to domain shifts, we propose the knowledge graph neural network (KG-NN), a neuro-symbolic approach that supervises the training using image-data-invariant auxiliary knowledge. The auxiliary knowledge is first encoded in a knowledge graph with respective concepts and their relationships, which is then transformed into a dense vector representation via an embedding method. Using a contrastive loss function, KG-NN learns to adapt its visual embedding space, and thus its weights, according to the image-data-invariant knowledge graph embedding space. We evaluate KG-NN on visual transfer learning tasks for classification using the mini-ImageNet dataset and its derivatives, as well as road sign recognition datasets from Germany and China. The results show that a visual model trained with a knowledge graph as a trainer outperforms a model trained with cross-entropy in all experiments, in particular when the domain gap increases. Besides better performance and stronger robustness to domain shifts, KG-NN adapts to multiple datasets and classes without suffering heavily from catastrophic forgetting.

Sebastian Monka, Lavdim Halilaj, Stefan Schmid, Achim Rettinger
Controlled Query Evaluation over Prioritized Ontologies with Expressive Data Protection Policies

We study information disclosure in Description Logic ontologies, in the spirit of Controlled Query Evaluation, where query answering is filtered through optimal censors that maximize answers while hiding data protected by a declarative policy. Previous works have considered limited forms of policy, typically constituted by conjunctive queries (CQs) whose answers must never be inferred by a user. Also, existing implementations adopt approximated notions of censors that may turn out to be too restrictive in practice in terms of the amount of non-protected information returned to users. In this paper we enrich the framework by extending CQs in the policy with comparison predicates and by introducing preferences between ontology predicates, which can be exploited to decide the portion of a secret that can be disclosed to a user, thus in principle augmenting the throughput of query answers. We show that answering CQs in our framework is first-order rewritable for DL-Lite_A ontologies and safe policies, and thus in AC^0 in data complexity. We also present some experiments on a popular benchmark, showing the effectiveness and feasibility of our approach in a real-world scenario.

Gianluca Cima, Domenico Lembo, Lorenzo Marconi, Riccardo Rosati, Domenico Fabio Savo
ProGS: Property Graph Shapes Language

Knowledge graphs such as Wikidata are created by a diversity of contributors and a range of sources leaving them prone to two types of errors. The first type of error, falsity of facts, is addressed by property graphs through the representation of provenance and validity, making triples occur as first-order objects in subject position of metadata triples. The second type of error, violation of domain constraints, has not been addressed with regard to property graphs so far. In RDF representations, this error can be addressed by shape languages such as SHACL or ShEx, which allow for checking whether graphs are valid with respect to a set of domain constraints. Borrowing ideas from the syntax and semantics definitions of SHACL, we design a shape language for property graphs, ProGS, which allows for formulating shape constraints on property graphs including their specific constructs, such as edges with identities and key-value annotations to both nodes and edges. We define a formal semantics of ProGS, investigate the resulting complexity of validating property graphs against sets of ProGS shapes, compare with corresponding results for SHACL, and implement a prototypical validator that utilizes answer set programming.

Philipp Seifer, Ralf Lämmel, Steffen Staab
Improving Knowledge Graph Embeddings with Ontological Reasoning

Knowledge graph (KG) embedding models have emerged as powerful means for KG completion. To learn the representation of a KG, entities and relations are projected into a low-dimensional vector space so that not only are existing triples in the KG preserved, but new triples can also be predicted. Embedding models might learn a good representation of the input KG, but due to the nature of machine learning approaches, they often lose the semantics of entities and relations, which might lead to nonsensical predictions. To address this issue we propose to improve the accuracy of embeddings using ontological reasoning. More specifically, we present ReasonKGE, a novel iterative approach that dynamically identifies, via symbolic reasoning, inconsistent predictions produced by a given embedding model and feeds them back as negative samples for retraining this model. In order to address the scalability problem that arises when integrating ontological reasoning into the training process, we propose an advanced technique to generalize the inconsistent predictions to other semantically similar negative samples during retraining. Experimental results demonstrate improvements in the accuracy of the facts produced by our method compared to the state of the art.

Nitisha Jain, Trung-Kien Tran, Mohamed H. Gad-Elrab, Daria Stepanova

Resources Track

Frontmatter
CKGG: A Chinese Knowledge Graph for High-School Geography Education and Beyond

As part of a long-term research effort to provide students with better computer-aided education, we create CKGG, a Chinese knowledge graph for the geography domain at the high school level. Using GeoNames and Wikidata as a basis, we transform and integrate various kinds of geographical data in different formats from diverse sources, including gridded temperature data in NetCDF, precipitation data in HDF5, solar radiation data in AAIGrid, polygon data in GPKG, climate and ocean current data in images, and government data in tables. The current version of CKGG contains 1.5 billion triples and is accessible as Linked Data. We also publish a reified version for provenance tracking. We illustrate the potential application of CKGG with a prototype.

Yulin Shen, Ziheng Chen, Gong Cheng, Yuzhong Qu
A High-Level Ontology Network for ICT Infrastructures

The ICT infrastructures of medium and large organisations that offer ICT services (infrastructure, platforms, software, applications, etc.) are becoming increasingly complex. Nowadays, these environments combine all sorts of hardware (e.g., CPUs, GPUs, storage elements, network equipment) and software (e.g., virtual machines, servers, microservices, services, products, AI models). Tracking, understanding and acting upon all the data produced in the context of such environments is hence challenging. Configuration management databases have so far been widely used to store and provide access to relevant information and views on these components and on their relationships. However, different databases are organised according to different schemas. Despite existing efforts to standardise the main entities relevant for configuration management, there is not yet a core set of ontologies that describes these environments homogeneously and that can be easily extended when new types of items appear. This paper presents an ontology network created with the purpose of serving as an initial step towards a homogeneous representation of this domain, and which has already been used to produce a knowledge graph for a large ICT company.

Oscar Corcho, David Chaves-Fraga, Jhon Toledo, Julián Arenas-Guerrero, Carlos Badenes-Olmedo, Mingxue Wang, Hu Peng, Nicholas Burrett, José Mora, Puchao Zhang
Chimera: A Bridge Between Big Data Analytics and Semantic Technologies

In the last decades, knowledge graph (KG) empowered analytics have been used to extract advanced insights from data. Several companies have integrated legacy relational databases with semantic technologies using Ontology-Based Data Access (OBDA). In practice, this approach enables analysts to write SPARQL queries over both KGs and SQL relational data sources while making most of the implementation details transparent. However, the volume of data is continuously increasing, and a growing number of companies are adopting distributed storage platforms and distributed computing engines. There is a gap between big data and semantic technologies: Ontop, one of the reference OBDA systems, is limited to legacy relational databases, and compatibility with the big data analytics engine Apache Spark is still missing. This paper introduces Chimera, an open-source software suite that aims at filling this gap. Chimera enables a new type of round-tripping data science pipeline: data scientists can query data stored in a data lake using SPARQL through Ontop and SparkSQL while saving the semantic results of such analysis back in the data lake. This new type of pipeline semantically enriches data from Spark before saving them back.

Matteo Belcao, Emanuele Falzone, Enea Bionda, Emanuele Della Valle
Scalable Transformation of Big Geospatial Data into Linked Data

In the era of big data, a vast amount of geospatial data has become available, originating from a large diversity of sources. In most cases, this data does not follow the linked data paradigm, and the existing transformation tools have proved ineffective due to the large volume and velocity of geospatial data. This is because none of the existing tools can effectively utilize the processing power of clusters of computers. We present the system GeoTriples-Spark, which is able to massively transform big geospatial data into RDF graphs using Apache Spark. We evaluate GeoTriples-Spark's performance and scalability in standalone and distributed environments and show that it exhibits superior performance and scalability when compared to all of its competitors.

George Mandilaras, Manolis Koubarakis
AgroLD: A Knowledge Graph for the Plant Sciences

Recent advances in sequencing technologies and high-throughput phenotyping have revolutionized analysis in the field of the plant sciences. However, there is an urgent need to effectively integrate and assimilate complementary information to understand the biological system in its entirety. We have developed AgroLD, a knowledge graph that exploits Semantic Web technologies to integrate information on plant species and in this way facilitate the formulation and validation of new scientific hypotheses. AgroLD contains around 900M triples created by annotating and integrating more than 100 datasets coming from 15 data sources. Our objective is to offer a domain-specific knowledge platform to answer complex biological and plant science questions related to the implication of genes in, for instance, plant disease resistance or adaptive responses to climate change. In this paper, we present the results of the project, which focused on genomics, proteomics and phenomics. We present the AgroLD pipeline for lifting the data, the open source tools developed for these purposes, as well as the web application that allows users to explore the data.

Pierre Larmande, Konstantin Todorov
LiterallyWikidata - A Benchmark for Knowledge Graph Completion Using Literals

In order to transform a Knowledge Graph (KG) into a low dimensional vector space, it is beneficial to preserve as much semantics as possible from the different components of the KG. Hence, some link prediction approaches have been proposed so far which leverage literals in addition to the commonly used links between entities. However, the procedures followed to create the existing datasets do not pay attention to literals. Therefore, this study presents a set of KG completion benchmark datasets extracted from Wikidata and Wikipedia, named LiterallyWikidata. It has been prepared with the main focus on providing benchmark datasets for multimodal KG Embedding (KGE) models, specifically for models using numeric and/or text literals. Hence, the benchmark is novel as compared to the existing datasets in terms of properly handling literals for those multimodal KGE models. LiterallyWikidata contains three datasets which vary both in size and structure. Benchmarking experiments on the task of link prediction have been conducted on LiterallyWikidata with extensively tuned unimodal/multimodal KGE models. The datasets are available at https://doi.org/10.5281/zenodo.4701190 .

Genet Asefa Gesese, Mehwish Alam, Harald Sack
A Framework for Quality Assessment of Semantic Annotations of Tabular Data

Much information is conveyed within tables, which can be semantically annotated by humans or by (semi-)automatic approaches. Nevertheless, many applications cannot take full advantage of semantic annotations because of their low quality. A few methodologies exist for the quality assessment of semantic annotations of tabular data, but they do not automatically assess quality as a multidimensional concept through different quality dimensions. The quality dimensions are implemented in STILTool 2, a web application to automate the quality assessment of the annotations. The evaluation is carried out by comparing the quality of semantic annotations with gold standards. The work presented here has been applied to at least three use cases. The results show that our approach can give hints about quality issues and how to address them.

Roberto Avogadro, Marco Cremaschi, Ernesto Jiménez-Ruiz, Anisa Rula

Open Access

EduCOR: An Educational and Career-Oriented Recommendation Ontology

With the increased dependence on online learning platforms and educational resource repositories, a unified representation of digital learning resources becomes essential to support a dynamic and multi-source learning experience. We introduce the EduCOR ontology, an educational, career-oriented ontology that provides a foundation for representing online learning resources for personalised learning systems. The ontology is designed to enable learning material repositories to offer learning path recommendations, which correspond to the user's learning goals and preferences, academic and psychological parameters, and labour-market skills. We present the multiple patterns that compose the EduCOR ontology, highlighting its cross-domain applicability and integrability with other ontologies. A demonstration of the proposed ontology on the real-life learning platform eDoer is discussed as a use case. We evaluate the EduCOR ontology using both gold standard and task-based approaches. The comparison of EduCOR to three gold schemata, and its application in two use cases, shows its coverage and adaptability to multiple OER repositories, which allows generating user-centric and labour-market oriented recommendations. Resource: https://tibonto.github.io/educor/ .

Eleni Ilkou, Hasan Abu-Rasheed, Mohammadreza Tavakoli, Sherzod Hakimov, Gábor Kismihók, Sören Auer, Wolfgang Nejdl
The Punya Platform: Building Mobile Research Apps with Linked Data and Semantic Features

Modern smartphones offer advanced sensing, connectivity, and processing capabilities for data acquisition, processing, and generation: but it can be difficult and costly to develop mobile research apps that leverage these features. Nevertheless, in life sciences and other scientific domains, there often exists a need to develop advanced mobile apps that go beyond simple questionnaires: ranging from sensor data collection and processing to self-management tools for chronic patients in healthcare. We present Punya, an open source, web-based platform based on MIT App Inventor that simplifies building Linked Data-enabled, advanced mobile apps that exploit smartphone capabilities. We posit that its integration with Linked Data facilitates the development of complex application and business rules, communication with heterogeneous online services, and interaction with the Internet of Things (IoT) data sources using the smartphone hardware. To that end, Punya includes an embedded semantic rule engine, integration with GraphQL and SPARQL to access remote graph data, and support for IoT devices using Bluetooth Low Energy and Linked Data Platform Constrained Application Protocol (LDP-CoAP). Moreover, Punya supports generating Linked Data descriptions of collected data. The platform includes built-in tutorials to quickly build apps using these different technologies. In this paper, we present a short discussion of the Punya platform, its current adoption that includes over 500 active users as well as the larger app-building MIT App Inventor community of which it is a part, and future development directions that would greatly benefit Semantic Web and Linked Data application developers as well as researchers who leverage Linked Open Data resources for their research. Resource: http://punya.mit.edu

Evan W. Patton, William Van Woensel, Oshani Seneviratne, Giuseppe Loseto, Floriano Scioscia, Lalana Kagal
BEEO: Semantic Support for Event-Based Data Analytics

Recent developments in data analysis and machine learning support novel data-driven operations. Event data provide social and environmental context; thus, such data may become essential to the workflow of data analytic pipelines. In this paper, we introduce our Business Event Exchange Ontology (BEEO), based on Schema.org, which enables data integration and analytics for event data. BEEO is available under the Apache 2.0 license on GitHub and is seeing adoption among both its creator companies and other product and service companies. We present and discuss the ontology development drivers and process, its structure, and its usage in different real use cases. Resource Type: Ontology. License: Apache 2.0. DOI: https://doi.org/10.5281/zenodo.4695281 . Repository: https://github.com/UNIMIBInside/Business-Event-Exchange-Ontology

Michele Ciavotta, Vincenzo Cutrona, Flavio De Paoli, Matteo Palmonari, Blerina Spahiu
Rail Topology Ontology: A Rail Infrastructure Base Ontology

Engineering projects for railway infrastructure typically involve many subsystems which need consistent views of the planned and built infrastructure and its underlying topology. Consistency is typically ensured by exchanging and verifying data between tools using XML-based data formats and UML-based object-oriented models. A tighter alignment of these data representations via a common topology model could decrease the development effort of railway infrastructure engineering tools. A common semantic model is also a prerequisite for the successful adoption of railway knowledge graphs. Based on the RailTopoModel standard, we developed the Rail Topology Ontology as a model to represent core features of railway infrastructures in a standard-compliant manner. This paper describes the ontology and its development method, and discusses its suitability for integrating data of railway engineering systems and other sources in a knowledge graph. With the Rail Topology Ontology, software engineers and knowledge scientists have a standard-based ontology for representing railway topologies to integrate disconnected data sources. We use the Rail Topology Ontology for our rail knowledge graph and plan to extend it with rail infrastructure ontologies derived from existing data exchange standards, since many such standards use the same base model as the presented ontology, viz., RailTopoModel.

Stefan Bischof, Gottfried Schenner

In-Use Track

Frontmatter
Mapping Manuscript Migrations on the Semantic Web: A Semantic Portal and Linked Open Data Service for Premodern Manuscript Research

This paper presents the Mapping Manuscript Migrations (MMM) system in use for modeling, aggregating, publishing, and studying heterogeneous, distributed premodern manuscript databases on the Semantic Web. A general “Sampo model” is applied to publishing and using linked data in Digital Humanities (DH) research and to creating the MMM system that includes a semantic portal and a Linked Open Data (LOD) service. The idea is to provide the manuscript data publishers with a novel collaborative way to enrich their contents with related data of the other providers and by reasoning. For the end user, the MMM Portal facilitates semantic faceted search and exploration of the data, integrated seamlessly with data analytic tools for solving research problems in manuscript studies. In addition, the SPARQL endpoint of the LOD service can be used with external tools for customized use in DH research and applications. The MMM services are available online, based on metadata of over 220 000 manuscripts from the Schoenberg Database of Manuscripts of the Schoenberg Institute for Manuscript Studies (University of Pennsylvania), the Medieval Manuscripts in Oxford Libraries, and Bibale of Institut de recherche et d’histoire des textes in Paris. Evaluation of the MMM Portal suggests that the system is useful in manuscript studies and outperforms current systems online in searching, exploring, and analyzing data.

Eero Hyvönen, Esko Ikkala, Mikko Koho, Jouni Tuominen, Toby Burrows, Lynn Ransom, Hanno Wijsman
Wikibase as an Infrastructure for Knowledge Graphs: The EU Knowledge Graph

Knowledge graphs are being deployed in many enterprises and institutions. An easy-to-use, well-designed infrastructure for such knowledge graphs is not obvious. After the success of Wikidata, many institutions are looking at the software infrastructure behind it, namely Wikibase. In this paper we introduce Wikibase, describe its different software components and the tools that have emerged around it. In particular, we detail how Wikibase is used as the infrastructure behind the “EU Knowledge Graph”, which is deployed at the European Commission. This graph mainly integrates projects funded by the European Union, and is used to make these projects visible to and easily accessible by citizens with no technical background. Moreover, we explain how this deployment compares to a more classical approach to building RDF knowledge graphs, and point to other projects that are using Wikibase as an underlying infrastructure.

Dennis Diefenbach, Max De Wilde, Samantha Alipio
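
Every Wikibase instance ships with the same SPARQL query service as Wikidata, so a deployment like the EU Knowledge Graph can be consumed programmatically in the same way. The sketch below runs against Wikidata's public endpoint as a stand-in; the endpoint URL, query, and item/property IDs would differ for another Wikibase instance:

```python
# Query a Wikibase SPARQL endpoint (Wikidata's public endpoint used as an
# example).  Requires the SPARQLWrapper package; set a descriptive agent.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/sparql"    # swap in your own Wikibase endpoint
sparql = SPARQLWrapper(endpoint, agent="wikibase-example/0.1")
sparql.setQuery("""
    SELECT ?item ?itemLabel WHERE {
        ?item wdt:P31 wd:Q3918 .                  # instance of: university
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["itemLabel"]["value"])
```
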
Leveraging Semantic Technologies for Digital Interoperability in the European Railway Domain

The European Union Agency for Railways is a European authority tasked with the provision of a legal and technical framework to support harmonized and safe cross-border railway operations throughout the EU. So far, the agency has relied on traditional application-centric approaches to support the data exchange among the multiple actors interacting within the railway domain. This led, however, to isolated digital environments that consequently added barriers to digital interoperability while increasing the cost of maintenance and innovation. In this work, we show how Semantic Web technologies are leveraged to create a semantic layer for data integration across the base registries maintained by the agency. We validate the usefulness of this approach by supporting route compatibility checks, a highly demanded use case in this domain, which was not available over the agency's registries before. Our contributions include (i) an official ontology for the railway infrastructure and authorized vehicle types, including 28 reference datasets; (ii) a reusable Knowledge Graph describing the European railway infrastructure; (iii) a cost-efficient system architecture that enables high flexibility for use case development; and (iv) an open source and RDF-native Web application to support route compatibility checks. This work demonstrates how data-centric system design, powered by Semantic Web technologies and Linked Data principles, provides a framework to achieve data interoperability and unlock new and innovative use cases and applications. Based on the results obtained during this work, ERA officially decided to make Semantic Web and Linked Data-based approaches the default setting for any future development of the data, registers and specifications under the agency's remit for data exchange mandated by the EU legal framework. The next steps, which are already underway, include further developing and bringing these solutions to a production-ready state.

Julián Andrés Rojas, Marina Aguado, Polymnia Vasilopoulou, Ivo Velitchkov, Dylan Van Assche, Pieter Colpaert, Ruben Verborgh
Use of Semantic Technologies to Inform Progress Toward Zero-Carbon Economy

To investigate the effect of possible changes to decarbonise the economy, a detailed picture of the current production system is needed. Material/energy flow analysis (MEFA) allows for building such a model. There are, however, prohibitive barriers to the integration and use of the diverse datasets necessary for a system-wide yet technically detailed MEFA study. Herein we describe a methodology exploiting Semantic Web technologies to integrate and reason on top of this diverse production system data. We designed an ontology to model the structure of our data, and developed a declarative logic-based approach to address the many challenges arising from data integration and usage in this context. Further, the system is designed for easy access to the needed data in terms relevant for additional modelling and to be usable by non-experts, allowing for wide use of our methodology. Our experiments with UK production data confirm the usefulness of this methodology through a case study based on the UK production system.

Stefano Germano, Carla Saunders, Ian Horrocks, Rick Lupton
Towards Semantic Interoperability in Historical Research: Documenting Research Data and Knowledge with Synthesis

A vast area of research in historical science concerns the documentation and study of artefacts and related evidence. Current practice mostly uses spreadsheets or simple relational databases to organise the information as rows with multiple columns of related attributes. This form lends itself to data analysis and scholarly interpretation; however, it also poses problems, including i) the difficulty of collaborative but controlled documentation by a large number of users, ii) the lack of representation of the details from which the documented relations are inferred, iii) the difficulty of extending the underlying data structures as well as of combining and integrating data from multiple and diverse information sources, and iv) the limited ability to reuse the data beyond the context of a particular research activity. To support historians in coping with these problems, in this paper we describe the Synthesis documentation system and its use by a large number of historians in the context of an ongoing research project in the field of the History of Art. The system is Web-based and collaborative, and makes use of existing standards for information documentation and publication (CIDOC-CRM, RDF), focusing on semantic interoperability and the production of data of high value and long-term validity.

Pavlos Fafalios, Konstantina Konsolaki, Lida Charami, Kostas Petrakis, Manos Paterakis, Dimitris Angelakis, Yannis Tzitzikas, Chrysoula Bekiari, Martin Doerr
On Constructing Enterprise Knowledge Graphs Under Quality and Availability Constraints

Knowledge graph technologies have proven their applicability and usefulness for integrating data silos and answering questions spanning the different sources. However, the integration of data can pose some risks and challenges (security, audit needs, quality control, ...). In this paper we abstract from two client use cases, one in the banking domain and one in the pharmaceutical domain, to highlight those risks and challenges and propose a generic approach to address them. This approach leverages Semantic Web technologies and is implemented using Stardog.

Matthew Kujawinski, Christophe Guéret, Chandan Kumar, Brennan Woods, Pavel Klinov, Evren Sirin
Reconciling and Using Historical Person Registers as Linked Open Data in the AcademySampo Portal and Data Service

This paper presents a method for extracting and reassembling a genealogical network automatically from a biographical register of historical people. The method is applied to a dataset of short textual biographies about all 28 000 Finnish and Swedish academic people educated in 1640–1899 in Finland. The aim is to connect and disambiguate the relatives mentioned in the biographies in order to build a continuous genealogical network, which can be used in Digital Humanities for data and network analysis of historical academic people and their lives. An artificial neural network approach is presented for solving a supervised learning task to disambiguate relatives mentioned in the register descriptions, using basic biographical information enhanced with an ontology of vocations and additional, occasionally sparse, genealogical information. Evaluation results of the record linkage are promising and provide novel insights into the problem of historical people register reconciliation. The outcome of the work has been used in practice as part of the in-use AcademySampo portal and linked open data service, a new member of the Sampo series of cultural heritage applications for Digital Humanities.

Petri Leskinen, Eero Hyvönen
Backmatter
Metadata
Title: The Semantic Web – ISWC 2021
Editors: Andreas Hotho, Eva Blomqvist, Stefan Dietze, Achille Fokoue, Ying Ding, Payam Barnaghi, Armin Haller, Mauro Dragoni, Harith Alani
Copyright Year: 2021
Electronic ISBN: 978-3-030-88361-4
Print ISBN: 978-3-030-88360-7
DOI: https://doi.org/10.1007/978-3-030-88361-4
