2018 | Book

The Semantic Web – ISWC 2018

17th International Semantic Web Conference, Monterey, CA, USA, October 8–12, 2018, Proceedings, Part I

Editors: Denny Vrandečić, Prof. Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Dr. Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, Prof. Elena Simperl

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 11136 and 11137 constitutes the refereed proceedings of the 17th International Semantic Web Conference, ISWC 2018, held in Monterey, CA, USA, in October 2018. The ISWC conference is the premier international forum for the Semantic Web and Linked Data community. The 62 full papers included in the two volumes were selected from 250 submissions across three tracks: 39 Research Track papers selected from 164 submissions, 17 Resource Track papers selected from 55 submissions, and 6 In-Use Track papers selected from 31 submissions.

Table of Contents

Frontmatter

Research Track

Frontmatter
Fine-Grained Evaluation of Rule- and Embedding-Based Systems for Knowledge Graph Completion

In recent years, embedding methods have attracted increasing attention as a means for knowledge graph completion. Similarly, rule-based systems have been studied for this task in the past. What is missing so far is a common evaluation that includes more than one type of method. We close this gap by comparing representatives of both types of systems in a frequently used evaluation protocol. Leveraging the explanatory qualities of rule-based systems, we present a fine-grained evaluation that gives insight into characteristics of the most popular datasets and points out the different strengths and shortcomings of the examined approaches. Our results show that models such as TransE, RESCAL or HolE have difficulty solving certain types of completion tasks that can be solved by a rule-based approach with high precision. At the same time, there are other completion tasks that are difficult for rule-based systems. Motivated by these insights, we combine both families of approaches via ensemble learning. The results support our assumption that the two methods complement each other in a beneficial way.

Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, Heiner Stuckenschmidt
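
The ensemble idea above can be illustrated with a small sketch: an embedding model such as TransE scores a candidate triple by how well h + r approximates t, a rule-based system contributes the confidence of the best rule predicting the triple, and a weighted combination ranks the candidate. The vectors, the rule confidence, and the equal weighting below are invented for illustration and are not values from the paper.

```python
# Toy sketch (not the paper's implementation): combining a TransE-style
# embedding score with a rule-based confidence for knowledge graph completion.
import numpy as np

# Hypothetical 3-dimensional embeddings for entities and relations.
entity_vecs = {"berlin": np.array([0.9, 0.1, 0.0]),
               "germany": np.array([1.0, 0.3, 0.1])}
relation_vecs = {"capitalOf": np.array([0.1, 0.2, 0.1])}

def transe_score(h, r, t):
    """TransE plausibility: the smaller ||h + r - t||, the more plausible.
    Mapped into (0, 1] so it can be mixed with a rule confidence."""
    dist = np.linalg.norm(entity_vecs[h] + relation_vecs[r] - entity_vecs[t])
    return 1.0 / (1.0 + dist)

# Confidence of the best rule that fires for the candidate triple, e.g.
# capitalOf(X, Y) <- seatOfGovernment(X, Y). The value is made up.
rule_confidence = 0.85

candidate = ("berlin", "capitalOf", "germany")
ensemble = 0.5 * transe_score(*candidate) + 0.5 * rule_confidence
print(f"ensemble score for {candidate}: {ensemble:.3f}")
```
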
Aligning Knowledge Base and Document Embedding Models Using Regularized Multi-Task Learning

Knowledge Bases (KBs) and textual documents contain rich and complementary information about real-world objects, as well as relations among them. While text documents describe entities in free form, KBs organize such information in a structured way. This makes these two information representation forms hard to compare and integrate, limiting the possibility of using them jointly to improve predictive and analytical tasks. In this article, we study this problem, and we propose KADE, a solution based on regularized multi-task learning of KB and document embeddings. KADE can potentially incorporate any KB and document embedding learning method. Our experiments on multiple datasets and methods show that KADE effectively aligns document and entity embeddings, while maintaining the characteristics of the embedding models.

Matthias Baumgartner, Wen Zhang, Bibek Paudel, Daniele Dell’Aglio, Huajun Chen, Abraham Bernstein
Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets

Knowledge Base Population (KBP) is an important problem in Semantic Web research and a key requirement for successful adoption of semantic technologies in many applications. In this paper we present Socrates, a deep-learning-based solution for Automated Knowledge Base Population from Text. Socrates does not require manual annotations, which would make the solution hard to adapt to a new domain. Instead, it exploits a partially populated knowledge base and a large corpus of text documents to train a set of deep neural network models. As a result of the training process, the system learns how to identify implicit relations between entities across a highly heterogeneous set of documents from various sources, making it suitable for large-scale knowledge extraction from Web documents. The main contributions of this paper include (a) a novel approach based on composite contexts to acquire implicit relations from Title Oriented Documents, and (b) an architecture for unifying relation extraction using binary, unary, and composite contexts. We provide an extensive evaluation of the system across three benchmarks with different characteristics, showing that our unified framework can consistently outperform state-of-the-art solutions. Remarkably, Socrates ranked first in both the knowledge base population and the attribute validation tracks of the Semantic Web Challenge at ISWC 2017.

Michael Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, Gaetano Rossiello
Towards Encoding Time in Text-Based Entity Embeddings

Knowledge Graphs (KGs) are widely used abstractions to represent entity-centric knowledge. Approaches to embed entities, entity types and relations represented in the graph into vector spaces - often referred to as KG embeddings - have become increasingly popular for their ability to capture the similarity between entities and support other reasoning tasks. However, the representation of time has received little attention in these approaches. In this work, we make a first step towards encoding time into vector-based entity representations using a text-based KG embedding model named Typed Entity Embeddings (TEEs). In TEEs, each entity is represented by a vector that represents the entity and its type, which is learned from entity mentions found in a text corpus. Inspired by evidence from the cognitive sciences and application-oriented concerns, we propose an approach to encode representations of years into TEEs by aggregating the representations of the entities that occur in event-based descriptions of the years. These representations are used to define two time-aware similarity measures to control the implicit effect of time on entity similarity. Experimental results show that the linear order of years obtained using our model is highly correlated with the natural flow of time, and demonstrate the effectiveness of the time-aware similarity measures proposed to flatten the time effect on entity similarity.

Federico Bianchi, Matteo Palmonari, Debora Nozza
Rule Learning from Knowledge Graphs Guided by Embedding Models

Rules over a Knowledge Graph (KG) capture interpretable patterns in data and various methods for rule learning have been proposed. Since KGs are inherently incomplete, rules can be used to deduce missing facts. Statistical measures for learned rules such as confidence reflect rule quality well when the KG is reasonably complete; however, these measures might be misleading otherwise. So it is difficult to learn high-quality rules from the KG alone, and scalability dictates that only a small set of candidate rules could be generated. Therefore, the ranking and pruning of candidate rules are major problems. To address this issue, we propose a rule learning method that utilizes probabilistic representations of missing facts. In particular, we iteratively extend rules induced from a KG by relying on feedback from a precomputed embedding model over the KG and external information sources including text corpora. Experiments on real-world KGs demonstrate the effectiveness of our novel approach both with respect to the quality of the learned rules and fact predictions that they produce.

Vinh Thinh Ho, Daria Stepanova, Mohamed H. Gad-Elrab, Evgeny Kharlamov, Gerhard Weikum
A Novel Ensemble Method for Named Entity Recognition and Disambiguation Based on Neural Network

Named entity recognition (NER) and disambiguation (NED) are subtasks of information extraction that aim to recognize named entities mentioned in text, to assign them pre-defined types, and to link them with their matching entities in a knowledge base. Many approaches, often exposed as web APIs, have been proposed to solve these tasks in recent years. These APIs classify entities using different taxonomies and disambiguate them with different knowledge bases. In this paper, we describe Ensemble Nerd, a framework that collects the responses of numerous extractors, normalizes them and combines them in order to produce a final entity list according to the pattern (surface form, type, link). The presented approach is based on representing the extractor responses as real-valued vectors and on using them as input samples for two deep learning networks: ENNTR (Ensemble Neural Network for Type Recognition) and ENND (Ensemble Neural Network for Disambiguation). We train these networks using specific gold standards. We show that the resulting models outperform each single extractor's responses in terms of micro and macro F1 measures computed by the GERBIL framework.

Lorenzo Canale, Pasquale Lisena, Raphaël Troncy
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs

Many question answering systems over knowledge graphs rely on entity and relation linking components in order to connect the natural language input to the underlying knowledge graph. Traditionally, entity linking and relation linking have been performed either as dependent sequential tasks or as independent parallel tasks. In this paper, we propose a framework called EARL, which performs entity linking and relation linking as a joint task. EARL implements two different solution strategies for which we provide a comparative analysis in this paper: The first strategy is a formalisation of the joint entity and relation linking tasks as an instance of the Generalised Travelling Salesman Problem (GTSP). In order to be computationally feasible, we employ approximate GTSP solvers. The second strategy uses machine learning in order to exploit the connection density between nodes in the knowledge graph. It relies on three base features and re-ranking steps in order to predict entities and relations. We compare the strategies and evaluate them on a dataset with 5000 questions. Both strategies significantly outperform the current state-of-the-art approaches for entity and relation linking.

Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, Jens Lehmann
TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs. This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.

Sepideh Mesbah, Christoph Lofi, Manuel Valle Torre, Alessandro Bozzon, Geert-Jan Houben
An Ontology-Driven Probabilistic Soft Logic Approach to Improve NLP Entity Annotations

Many approaches for Knowledge Extraction and Ontology Population rely on well-known Natural Language Processing (NLP) tasks, such as Named Entity Recognition and Classification (NERC) and Entity Linking (EL), to identify and semantically characterize the entities mentioned in natural language text. Despite being intrinsically related, the analyses performed by these tasks differ, and combining their output may result in NLP annotations that are implausible or even conflicting considering common world knowledge about entities. In this paper we present a Probabilistic Soft Logic (PSL) model that leverages ontological entity classes to relate NLP annotations from different tasks that refer to the same entity mentions. The intuition behind the model is that an annotation likely implies some ontological classes on the entity identified by the mention, and annotations from different tasks on the same mention have to share more or less the same implied entity classes. In a setting with various NLP tools returning multiple, confidence-weighted, candidate annotations on a single mention, the model can be operationally applied to compare the different annotation combinations, and to possibly revise the tools’ best annotation choice. We experimented with applying the model to the candidate annotations produced by two state-of-the-art tools for NERC and EL, on three different datasets. The results show that the joint “a posteriori” annotation revision suggested by our PSL model consistently improves the original scores of the two tools.

Marco Rospocher
Ontology Driven Extraction of Research Processes

We address the automatic extraction from publications of two key concepts for representing research processes: the concept of research activity and the sequence relation between successive activities. These representations are driven by the Scholarly Ontology, specifically conceived for documenting research processes. Unlike usual named entity recognition and relation extraction tasks, we are facing textual descriptions of activities of widely variable length, while pairs of successive activities often span multiple sentences. We developed and experimented with several sliding window classifiers using Logistic Regression, SVMs, and Random Forests, as well as a two-stage pipeline classifier. Our classifiers employ task-specific features, as well as word, part-of-speech and dependency embeddings, engineered to exploit distinctive traits of research publications written in English. The extracted activities and sequences are associated with other relevant information from publication metadata and stored as RDF triples in a knowledge base. Evaluation on datasets from three disciplines, Digital Humanities, Bioinformatics, and Medicine, shows very promising performance.

Vayianos Pertsas, Panos Constantopoulos, Ion Androutsopoulos
Enriching Knowledge Bases with Counting Quantifiers

Information extraction traditionally focuses on extracting relations between identifiable entities, such as ⟨Monterey, locatedIn, California⟩. Yet, texts often also contain counting information, stating that a subject is in a specific relation with a number of objects, without mentioning the objects themselves, for example, “California is divided into 58 counties”. Such counting quantifiers can help in a variety of tasks such as query answering or knowledge base curation, but are neglected by prior work. This paper develops the first full-fledged system for extracting counting information from text, called CINEX. We employ distant supervision using fact counts from a knowledge base as training seeds, and develop novel techniques for dealing with several challenges: (i) non-maximal training seeds due to the incompleteness of knowledge bases, (ii) sparse and skewed observations in text sources, and (iii) high diversity of linguistic patterns. Experiments with five human-evaluated relations show that CINEX can achieve 60% average precision for extracting counting information. In a large-scale experiment, we demonstrate the potential for knowledge base enrichment by applying CINEX to 2,474 frequent relations in Wikidata. CINEX can assert the existence of 2.5M facts for 110 distinct relations, which is 28% more than the existing Wikidata facts for these relations.

Paramita Mirza, Simon Razniewski, Fariz Darari, Gerhard Weikum
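
To make the distant-supervision setup concrete, the sketch below labels candidate sentences for one relation by comparing the number stated in the sentence with the number of facts already in the knowledge base; because KBs are incomplete, a stated count at least as large as the KB count is kept as a (possibly non-maximal) positive seed. The relation, counts, and sentences are invented and are not taken from CINEX.

```python
# Illustration of distant supervision with fact counts as (noisy, non-maximal)
# training seeds for counting-quantifier extraction.
import re

# Hypothetical KB: number of known objects per (subject, relation); incomplete.
kb_fact_counts = {("California", "containsAdministrativeDivision"): 45}

candidate_sentences = [
    ("California", "containsAdministrativeDivision",
     "California is divided into 58 counties."),
    ("California", "containsAdministrativeDivision",
     "California borders 3 other US states."),
]

def seed_label(subject, relation, sentence):
    """Keep a sentence as a positive seed if the count it states is at least
    the KB count (the KB may miss facts, so exact equality is not required)."""
    match = re.search(r"\b(\d+)\b", sentence)
    if not match:
        return None
    return int(match.group(1)) >= kb_fact_counts.get((subject, relation), 0)

for subj, rel, text in candidate_sentences:
    print(seed_label(subj, rel, text), "->", text)
```
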
QA4IE: A Question Answering Based Framework for Information Extraction

Information Extraction (IE) refers to automatically extracting structured relation tuples from unstructured texts. Common IE solutions, including Relation Extraction (RE) and open IE systems, can hardly handle cross-sentence tuples, and are severely restricted by limited relation types as well as informal relation specifications (e.g., free-text based relation tuples). In order to overcome these weaknesses, we propose a novel IE framework named QA4IE, which leverages flexible question answering (QA) approaches to produce high-quality relation triples across sentences. Based on the framework, we develop a large IE benchmark with high-quality human evaluation. This benchmark contains 293K documents, 2M golden relation triples, and 636 relation types. We compare our system with several IE baselines on our benchmark and the results show that our system achieves substantial improvements.

Lin Qiu, Hao Zhou, Yanru Qu, Weinan Zhang, Suoheng Li, Shu Rong, Dongyu Ru, Lihua Qian, Kewei Tu, Yong Yu
Constructing a Recipe Web from Historical Newspapers

Historical newspapers provide a lens on customs and habits of the past. For example, recipes published in newspapers highlight what and how we ate and thought about food. The challenge here is that newspaper data is often unstructured and highly varied. Digitised historical newspapers add an additional challenge, namely that of fluctuations in OCR quality. Therefore, it is difficult to locate and extract recipes from them. We present our approach, based on distant supervision and automatically extracted lexicons, to identify recipes in digitised historical newspapers, to generate recipe tags, and to extract ingredient information. We provide OCR quality indicators and assess their impact on the extraction process. We enrich the recipes with links to information on the ingredients. Our research shows how natural language processing, machine learning, and Semantic Web technologies can be combined to construct a rich dataset from heterogeneous newspapers for the historical analysis of food culture.

Marieke van Erp, Melvin Wevers, Hugo Huurdeman
Structured Event Entity Resolution in Humanitarian Domains

In domains such as humanitarian assistance and disaster relief (HADR), events, rather than named entities, are the primary focus of analysts and aid officials. An important problem that must be solved to provide situational awareness to aid providers is automatic clustering of sub-events that refer to the same underlying event. An effective solution to the problem requires judicious use of both domain-specific and semantic information, as well as statistical methods like deep neural embeddings. In this paper, we present an approach, AugSEER (Augmented feature sets for Structured Event Entity Resolution), that combines advances in deep neural embeddings both on text and graph data with minimally supervised inputs from domain experts. AugSEER can operate in both online and batch scenarios. On five real-world HADR datasets, AugSEER is found, on average, to outperform the next best baseline result by almost 15% on the cluster purity metric and by 3% on the F1-Measure metric. In contrast, text-based approaches are found to perform poorly, demonstrating the importance of semantic information in devising a good solution. We also use sub-event clustering visualizations to illustrate the qualitative potential of AugSEER.

Mayank Kejriwal, Jing Peng, Haotian Zhang, Pedro Szekely
That’s Interesting, Tell Me More! Finding Descriptive Support Passages for Knowledge Graph Relationships

We address the problem of finding descriptive explanations of facts stored in a knowledge graph. This is important in high-risk domains such as healthcare, intelligence, etc., where users need additional information for decision making, and is especially crucial for applications that rely on automatically constructed knowledge graphs, where machine-learned systems extract facts from an input corpus and the workings of the extractors are opaque to the end-user. We follow an approach inspired by information retrieval and propose a simple, yet effective and efficient, solution that takes into account passage-level as well as document-level properties to produce a ranked list of passages describing a given input relation. We test our approach using Wikidata as the knowledge base and Wikipedia as the source corpus, and report results of user studies conducted to study the effectiveness of our proposed model.

Sumit Bhatia, Purusharth Dwivedi, Avneet Kaur
Exploring RDFS KBs Using Summaries

Ontology summarization aspires to produce an abridged version of the original data source highlighting its most important concepts. However, in an ideal scenario, the user should not be limited only to static summaries. Starting from the summary, s/he should be able to further explore the data source, requesting more detailed information for a particular part of it. In this paper, we present a new approach enabling the dynamic exploration of summaries through two novel operations, zoom and extend. Extend focuses on a specific subgraph of the initial summary, whereas zoom operates on the whole graph, both providing granular information access to the end-user. We show that calculating these operators is NP-complete and provide approximations for their calculation. Then, we show that using extend we can answer more queries focusing on specific nodes, whereas using global zoom we can answer more queries overall. Finally, we show that the algorithms employed can efficiently approximate both operators.

Georgia Troullinou, Haridimos Kondylakis, Kostas Stefanidis, Dimitris Plexousakis
What Is the Cube Root of 27? Question Answering Over CodeOntology

We present an unsupervised approach to process natural language questions that cannot be answered by factual question answering nor advanced data querying, requiring instead ad-hoc code generation and execution. To address this challenging task, our system performs language-to-code translation by interpreting the natural language question and generating a SPARQL query that is run against CodeOntology, a large RDF repository containing millions of triples representing Java code constructs. The query retrieves a number of Java source code snippets and methods, ranked on both syntactic and semantic features, to find the best candidate, which is then executed to get the correct answer. The evaluation of the system is based on a dataset extracted from StackOverflow, and experimental results show that our approach is comparable with other state-of-the-art proprietary systems, such as the closed-source WolframAlpha computational knowledge engine.

Mattia Atzeni, Maurizio Atzori
GraFa: Scalable Faceted Browsing for RDF Graphs

Faceted browsing has become a popular paradigm for user interfaces on the Web and has also been investigated in the context of RDF graphs. However, current faceted browsers for RDF graphs encounter performance issues when faced with two challenges: scale, where large datasets generate many results, and heterogeneity, where large numbers of properties and classes generate many facets. To address these challenges, we propose GraFa: a faceted browsing system for heterogeneous large-scale RDF graphs based on a materialisation strategy that performs an offline analysis of the input graph in order to identify a subset of the exponential number of possible facet combinations that are candidates for indexing. In experiments over Wikidata, we demonstrate that materialisation allows for displaying (exact) faceted views over millions of diverse results in under a second while keeping index sizes relatively small. We also present initial usability studies over GraFa.

José Moreno-Vega, Aidan Hogan
Semantics and Validation of Recursive SHACL

With the popularity of RDF as an independent data model came the need for specifying constraints on RDF graphs, and for mechanisms to detect violations of such constraints. One of the most promising schema languages for RDF is SHACL, a recent W3C recommendation. Unfortunately, the specification of SHACL leaves open the problem of validation against recursive constraints. This omission is important because SHACL by design favors constraints that reference other ones, which in practice may easily yield reference cycles. In this paper, we propose a concise formal semantics for the so-called “core constraint components” of SHACL. This semantics handles arbitrary recursion, while being compliant with the current standard. Graph validation is based on the existence of an assignment of SHACL “shapes” to nodes in the graph under validation, stating which shapes are verified or violated, while verifying the targets of the validation process. We show in particular that the design of SHACL forces us to consider cases in which these assignments are partial, or, in other words, where the truth value of a constraint at some nodes of a graph may be left unknown. Dealing with recursion also comes at a price, as validating an RDF graph against SHACL constraints is NP-hard in the size of the graph, and this lower bound still holds for constraints with stratified negation. Therefore we also propose a tractable approximation to the validation problem.

Julien Corman, Juan L. Reutter, Ognjen Savković
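
For readers unfamiliar with the kind of recursion at issue, the snippet below spells out a shape that references itself (a person's parent must again conform to the person shape) as a Turtle string and parses it with rdflib; the shape and namespace are invented, and rdflib only parses the shape graph here, since how to validate data against such a recursive shape is precisely the question the paper formalises.

```python
# A minimal recursive SHACL shape: PersonShape requires ex:parent values to
# conform to PersonShape again. Parsing with rdflib only checks the RDF syntax;
# the validation semantics for such recursion is what the paper defines.
from rdflib import Graph

recursive_shape = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:parent ;
        sh:node ex:PersonShape    # reference back to the enclosing shape
    ] .
"""

g = Graph()
g.parse(data=recursive_shape, format="turtle")
print(f"parsed {len(g)} triples describing a recursive shape")
```
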
Certain Answers for SPARQL with Blank Nodes

Blank nodes in RDF graphs can be used to represent values known to exist but whose identity remains unknown. A prominent example of such usage can be found in the Wikidata dataset where, e.g., the author of Beowulf is given as a blank node. However, while SPARQL considers blank nodes in a query as existentials, it treats blank nodes in RDF data more like constants. Running SPARQL queries over datasets with unknown values may thus lead to counter-intuitive results, which may make the standard SPARQL semantics unsuitable for datasets with existential blank nodes. We thus explore the feasibility of an alternative SPARQL semantics based on certain answers. In order to estimate the performance costs that would be associated with such a change in semantics for current implementations, we adapt and evaluate approximation techniques proposed in a relational database setting for a core fragment of SPARQL. To further understand the impact that such a change in semantics may have on query solutions, we analyse how this new semantics would affect the results of user queries over Wikidata.

Daniel Hernández, Claudio Gutierrez, Aidan Hogan
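
The counter-intuitive behaviour discussed above can be reproduced in a few lines of rdflib: the author of Beowulf is modelled as a blank node (a value known to exist but unknown), yet a standard SPARQL query returns that blank node as if it were an ordinary constant. The IRIs are simplified stand-ins for the Wikidata identifiers, not the actual ones.

```python
# Standard SPARQL treats a blank node in the data like a constant: the query
# below "answers" the author question with the blank node itself, even though
# the blank node only asserts that some unknown author exists.
from rdflib import Graph, Namespace, BNode

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Beowulf, EX.author, BNode()))  # the author exists but is unknown

results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?who WHERE { ex:Beowulf ex:author ?who }
""")
for row in results:
    print(row.who)  # prints an internal blank-node label rather than "unknown"
```
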
Efficient Handling of SPARQL OPTIONAL for OBDA

OPTIONAL is a key feature in SPARQL for dealing with missing information. While this operator is used extensively, it is also known for its complexity, which can make efficient evaluation of queries with OPTIONAL challenging. We tackle this problem in the Ontology-Based Data Access (OBDA) setting, where the data is stored in a SQL relational database and exposed as a virtual RDF graph by means of an R2RML mapping. We start with a succinct translation of a SPARQL fragment into SQL. It fully respects bag semantics and three-valued logic and relies on the extensive use of the LEFT JOIN operator and COALESCE function. We then propose optimisation techniques for reducing the size and improving the structure of generated SQL queries. Our optimisations capture interactions between JOIN, LEFT JOIN, COALESCE and integrity constraints such as attribute nullability, uniqueness and foreign key constraints. Finally, we empirically verify effectiveness of our techniques on the BSBM OBDA benchmark.

Guohui Xiao, Roman Kontchakov, Benjamin Cogrel, Diego Calvanese, Elena Botoeva
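
As a rough intuition for the translation, not the paper's actual algorithm, the snippet below mimics a SPARQL OPTIONAL over two small tables in SQLite: the optional pattern becomes a LEFT JOIN, and COALESCE supplies a fallback where the optional value is missing. The table and column names are invented.

```python
# Rough analogue of SPARQL OPTIONAL in SQL: LEFT JOIN keeps rows without a
# match and COALESCE substitutes a default for the resulting NULL values.
# An illustration of the general idea, not the paper's translation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person(id TEXT, name TEXT);
    CREATE TABLE email(person_id TEXT, address TEXT);
    INSERT INTO person VALUES ('p1', 'Ada'), ('p2', 'Grace');
    INSERT INTO email  VALUES ('p1', 'ada@example.org');  -- p2 has no email
""")

rows = conn.execute("""
    SELECT p.name, COALESCE(e.address, 'unknown') AS address
    FROM person p
    LEFT JOIN email e ON e.person_id = p.id
""").fetchall()

print(rows)  # [('Ada', 'ada@example.org'), ('Grace', 'unknown')]
```
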
Representativeness of Knowledge Bases with the Generalized Benford’s Law

Knowledge bases (KBs) such as DBpedia, Wikidata, and YAGO contain a huge number of entities and facts. Several recent works induce rules or calculate statistics on these KBs. Most of these methods are based on the assumption that the data is a representative sample of the studied universe. Unfortunately, KBs are biased because they are built from crowdsourcing and opportunistic agglomeration of available databases. This paper aims at approximating the representativeness of a relation within a knowledge base. For this, we use the generalized Benford’s law, which indicates the distribution expected by the facts of a relation. We then compute the minimum number of facts that have to be added in order to make the KB representative of the real world. Experiments show that our unsupervised method applies to a large number of relations. For numerical relations where ground truths exist, the estimated representativeness proves to be a reliable indicator.

Arnaud Soulet, Arnaud Giacometti, Béatrice Markhoff, Fabian M. Suchanek
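
The Benford-style check can be sketched in a few lines: collect the number of facts per subject for a relation, take the leading digit of each count, and compare the observed digit frequencies against the expected distribution. The classic first-digit law is used below purely as an illustration; the paper fits a generalised form with an exponent parameter, and the counts here are random placeholders rather than KB data.

```python
# Sketch of a Benford-style representativeness check on per-entity fact counts.
# The classic log10(1 + 1/d) expectation is shown; the paper uses a generalised
# Benford's law with a fitted exponent.
import math
import random
from collections import Counter

random.seed(0)
fact_counts = [random.randint(1, 500) for _ in range(10_000)]  # placeholder counts

first_digits = [int(str(n)[0]) for n in fact_counts]
observed = Counter(first_digits)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)          # classic Benford expectation
    freq = observed[d] / len(first_digits)
    print(f"digit {d}: observed {freq:.3f}  expected {expected:.3f}")
```
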
Detecting Erroneous Identity Links on the Web Using Network Metrics

In the absence of a central naming authority on the Semantic Web, it is common for different datasets to refer to the same thing by different IRIs. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. Studies dating back as far as 2009 have observed that the owl:sameAs property is sometimes used incorrectly. In this paper, we show how network metrics such as the community structure of the owl:sameAs graph can be used to detect such possibly erroneous statements. One benefit of the presented approach is that it can be applied to the network of owl:sameAs links itself, and does not rely on any additional knowledge. In order to illustrate its ability to scale, the approach is evaluated on the largest collection of identity links to date, containing over 558M owl:sameAs links scraped from the LOD Cloud.

Joe Raad, Wouter Beek, Frank van Harmelen, Nathalie Pernelle, Fatiha Saïs
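
A miniature version of the network-metric idea: build the undirected owl:sameAs graph, detect communities, and flag links that bridge two otherwise densely connected communities as candidates for erroneous identity statements. The toy graph below is invented (the identifiers are not real Wikidata or DBpedia IRIs), and the modularity-based community detector is simply one readily available choice, not necessarily the metric used in the paper.

```python
# Toy sketch: owl:sameAs links that connect two dense communities are
# candidates for erroneous identity statements.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
# Two tight clusters of aliases, each (supposedly) denoting one real-world thing...
g.add_edges_from([("dbp:Boston", "wd:Q111111"), ("wd:Q111111", "geo:Boston"),
                  ("dbp:Boston", "geo:Boston")])
g.add_edges_from([("dbp:Boston_UK", "wd:Q222222"), ("wd:Q222222", "geo:BostonLincs"),
                  ("dbp:Boston_UK", "geo:BostonLincs")])
# ...plus one link that wrongly merges the two clusters.
g.add_edge("geo:Boston", "dbp:Boston_UK")

communities = list(greedy_modularity_communities(g))
membership = {node: i for i, comm in enumerate(communities) for node in comm}

suspicious = [(u, v) for u, v in g.edges() if membership[u] != membership[v]]
print("candidate erroneous owl:sameAs links:", suspicious)
```
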
SPgen: A Benchmark Generator for Spatial Link Discovery Tools

A number of real and synthetic benchmarks have been proposed for evaluating the performance of link discovery systems. So far, only a limited number of link discovery benchmarks target the problem of linking geo-spatial entities. However, some of the largest knowledge bases of the Linked Open Data Web, such as LinkedGeoData, contain vast amounts of spatial information. Several systems that manage spatial data and consider the topology of the spatial resources and the topological relations between them have been developed. In order to assess the ability of these systems to handle the vast amount of spatial data and perform the much needed data integration in the Linked Geo Data Cloud, it is imperative to develop benchmarks for geo-spatial link discovery. In this paper we propose the Spatial Benchmark Generator SPgen, which can be used to test the performance of link discovery systems that deal with topological relations as proposed in the state-of-the-art DE-9IM (Dimensionally Extended nine-Intersection Model). SPgen implements all topological relations of DE-9IM between LineStrings and Polygons in two-dimensional space. A comparative analysis with benchmarks produced using SPgen to assess and identify the capabilities of the AML, OntoIdea, RADON and Silk spatial link discovery systems is provided.

Tzanina Saveta, Irini Fundulaki, Giorgos Flouris, Axel-Cyrille Ngonga-Ngomo
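
For readers unfamiliar with DE-9IM, the snippet below uses the Shapely library (a common implementation of these topological predicates, not part of the benchmark generator itself) to compute the intersection matrix and named relations between a LineString and a Polygon, the two geometry types the generator targets. The coordinates are arbitrary.

```python
# DE-9IM in practice: Shapely exposes both the raw intersection matrix and
# named topological predicates derived from it.
from shapely.geometry import LineString, Polygon

line = LineString([(0, 0), (4, 4)])                  # partly inside the polygon
poly = Polygon([(1, 0), (5, 0), (5, 5), (1, 5)])

print(line.relate(poly))    # the 9-character DE-9IM intersection matrix
print(line.crosses(poly))   # True: the line passes from outside to inside
print(line.within(poly))    # False: the line is not entirely inside
```
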
Specifying, Monitoring, and Executing Workflows in Linked Data Environments

We present an ontology for representing workflows over components with Read-Write Linked Data interfaces and give an operational semantics to the ontology via a rule language. Workflow languages have been successfully applied for modelling behaviour in enterprise information systems, in which the data is often managed in a relational database. Linked Data interfaces have been widely deployed on the web to support data integration in very diverse domains, increasingly also in scenarios involving the Internet of Things, in which application behaviour is often specified using imperative programming languages. With our work we aim to combine workflow languages, which allow for the high-level specification of application behaviour by non-expert users, with Linked Data, which allows for decentralised data publication and integrated data access. We show that our ontology is expressive enough to cover the basic workflow patterns and demonstrate the applicability of our approach with a prototype system that observes pilots carrying out tasks in a virtual reality aircraft cockpit. On a synthetic benchmark from the building automation domain, the runtime scales linearly with the number of Internet of Things devices.

Tobias Käfer, Andreas Harth
Mapping Diverse Data to RDF in Practice

Converting data from diverse data sources to custom RDF datasets often faces several practical challenges related to the need to restructure and transform the source data. Existing RDF mapping languages assume that the resulting datasets mostly preserve the structure of the original data. In this paper, we present real cases that highlight the limitations of existing languages, and describe D2RML, a transformation-oriented RDF mapping language which addresses such practical needs by incorporating a programming flavor into the mapping process.

Alexandros Chortaras, Giorgos Stamou
A Novel Approach and Practical Algorithms for Ontology Integration

Today a wealth of knowledge and data is distributed using Semantic Web standards. Especially in the (bio)medical domain, several sources such as SNOMED, NCI, and FMA are distributed in the form of OWL ontologies. These can be matched and integrated in order to create one large medical Knowledge Base. However, an important issue is that the structure of these ontologies may be profoundly different; hence, using the mappings as initially computed can lead to incoherences or changes in their original structure which may affect applications. In this paper we present a framework and novel approach for integrating independently developed ontologies. Starting from an initial seed ontology which may already be in use by an application, new sources are used to iteratively enrich and extend the seed one. To deal with structural incompatibilities we present a novel fine-grained approach which is based on mapping repair and alignment conservativity, formalise it, and provide an exact as well as an approximate but practical algorithm. Our framework has already been used to integrate a number of medical ontologies and support real-world healthcare services provided by Babylon Health. Finally, we also perform an experimental evaluation and compare with state-of-the-art ontology integration systems that take into account the structure and coherency of the integrated ontologies, obtaining encouraging results.

Giorgos Stoilos, David Geleta, Jetendr Shamdasani, Mohammad Khodadadi
Practical Ontology Pattern Instantiation, Discovery, and Maintenance with Reasonable Ontology Templates

Reasonable Ontology Templates (OTTR) is a language for representing ontology modelling patterns in the form of parameterised ontologies. Ontology templates are simple and powerful abstractions useful for constructing, interacting with, and maintaining ontologies. With ontology templates, modelling patterns can be uniquely identified and encapsulated, broken down into convenient and manageable pieces, instantiated, and used as queries. Formal relations defined over templates support sophisticated maintenance tasks for sets of templates, such as revealing redundancies and suggesting new templates for representing implicit patterns. Ontology templates are designed for practical use: an OTTR vocabulary and convenient serialisation formats for the Semantic Web and for terse specification of template definitions and bulk instances are available, including an open-source implementation for using templates. Our approach is successfully tested on a real-world large-scale ontology in the engineering domain.

Martin G. Skjæveland, Daniel P. Lupp, Leif Harald Karlsen, Henrik Forssell
Pragmatic Ontology Evolution: Reconciling User Requirements and Application Performance

Increasingly, organizations are adopting ontologies to describe their large catalogues of items. These ontologies need to evolve regularly in response to changes in the domain and the emergence of new requirements. An important step of this process is the selection of candidate concepts to include in the new version of the ontology. This operation needs to take into account a variety of factors and in particular reconcile user requirements and application performance. Current ontology evolution methods focus either on ranking concepts according to their relevance or on preserving compatibility with existing applications. However, they do not take into consideration the impact of the ontology evolution process on the performance of computational tasks – e.g., in this work we focus on instance tagging, similarity computation, generation of recommendations, and data clustering. In this paper, we propose the Pragmatic Ontology Evolution (POE) framework, a novel approach for selecting from a group of candidates a set of concepts able to produce a new version of a given ontology that (i) is consistent with a set of user requirements (e.g., a maximum number of concepts in the ontology), (ii) is parametrised with respect to a number of dimensions (e.g., topological considerations), and (iii) effectively supports relevant computational tasks. Our approach also supports users in navigating the space of possible solutions by showing how certain choices, such as limiting the number of concepts or privileging trendy concepts rather than historical ones, would reflect on the application performance. An evaluation of POE on the real-world scenario of the evolving Springer Nature taxonomy for editorial classification yielded excellent results, demonstrating a significant improvement over alternative approaches.

Francesco Osborne, Enrico Motta
Towards Empty Answers in SPARQL: Approximating Querying with RDF Embedding

The LOD cloud offers a plethora of RDF data sources where users discover items of interest by issuing SPARQL queries. A common problem users face is that of empty answers: given a SPARQL query that returns nothing, how can the query be refined to obtain a non-empty result set? In this paper, we propose an RDF graph embedding based framework to solve the SPARQL empty-answer problem in terms of a continuous vector space. We first project the RDF graph into a continuous vector space by an entity context preserving translational embedding model which is specially designed for SPARQL queries. Then, given a SPARQL query that returns an empty set, we partition it into several parts and compute approximate answers by leveraging RDF embeddings and the translation mechanism. We also generate alternative queries for the returned answers, which helps users clarify their expectations and ultimately refine the original query. To validate the effectiveness and efficiency of our framework, we conduct extensive experiments on a real-world RDF dataset. The results show that our framework can significantly improve the quality of approximate answers and speed up the generation of alternative queries.

Meng Wang, Ruijie Wang, Jun Liu, Yihe Chen, Lei Zhang, Guilin Qi
Query-Based Linked Data Anonymization

We introduce and develop a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility policies are specified as SPARQL queries. Our approach is data-independent and requires inspecting only the privacy and utility policies in order to determine the sequence of anonymization operations applicable to any graph instance for satisfying the policies. We prove the soundness of our algorithms and gauge their performance through experiments.

Remy Delanaux, Angela Bonifati, Marie-Christine Rousset, Romuald Thion
Answering Provenance-Aware Queries on RDF Data Cubes Under Memory Budgets

The steadily-growing popularity of semantic data on the Web and the support for aggregation queries in SPARQL 1.1 have propelled the interest in Online Analytical Processing (OLAP) and data cubes in RDF. Query processing in such settings is challenging because SPARQL OLAP queries usually contain many triple patterns with grouping and aggregation. Moreover, one important factor of query answering on Web data is its provenance, i.e., metadata about its origin. Some applications in data analytics and access control require augmenting the data with provenance metadata and running queries that impose constraints on this provenance. This task is called provenance-aware query answering. In this paper, we investigate the benefit of caching some parts of an RDF cube augmented with provenance information when answering provenance-aware SPARQL queries. We propose provenance-aware caching (PAC), a caching approach based on a provenance-aware partitioning of RDF graphs, and a benefit model for RDF cubes and SPARQL queries with aggregation. Our results on real and synthetic data show that PAC significantly outperforms the LRU strategy (least recently used) and the Jena TDB native caching in terms of hit rate and response time.

Luis Galárraga, Kim Ahlstrøm, Katja Hose, Torben Bach Pedersen
Bash Datalog: Answering Datalog Queries with Unix Shell Commands

Dealing with large tabular datasets often requires extensive preprocessing. This preprocessing happens only once, so loading and indexing the data in a database or triple store may be overkill. In this paper, we present an approach that allows preprocessing large tabular data in Datalog – without indexing the data. The Datalog query is translated to Unix Bash and can be executed in a shell. Our experiments show that, for the use case of data preprocessing, our approach is competitive with state-of-the-art systems in terms of scalability and speed, while at the same time requiring only a Bash shell on a Unix system.

Thomas Rebele, Thomas Pellissier Tanon, Fabian Suchanek
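
As a flavour of what such a translation produces, the hand-written sketch below evaluates the classic rule grandparent(X, Z) :- parent(X, Y), parent(Y, Z) over a plain whitespace-separated fact file using the Unix sort and join commands driven from Python. The file name and facts are invented, and the actual system generates pipelines like this automatically from the Datalog program.

```python
# Hand-written illustration of Datalog-over-shell evaluation: the join in
#   grandparent(X, Z) :- parent(X, Y), parent(Y, Z)
# becomes Unix `sort` + `join` over plain text facts, with no database or index.
import os
import subprocess
import tempfile

facts = "alice bob\nbob carol\ncarol dave\n"  # parent(X, Y) facts, one "X Y" per line

workdir = tempfile.mkdtemp()
parents = os.path.join(workdir, "parent.txt")
with open(parents, "w") as f:
    f.write(facts)

pipeline = (
    f"sort -k2,2 {parents} > {workdir}/by_child.txt && "
    f"sort -k1,1 {parents} > {workdir}/by_parent.txt && "
    f"join -1 2 -2 1 {workdir}/by_child.txt {workdir}/by_parent.txt"
)
result = subprocess.run(pipeline, shell=True, capture_output=True, text=True)
# Each output line is "Y X Z": the shared variable, the grandparent, the grandchild.
print(result.stdout)
```
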
WORQ: Workload-Driven RDF Query Processing

Cloud-based systems provide a rich platform for managing large-scale RDF data. However, the distributed nature of these systems introduces several performance challenges, e.g., disk I/O and network shuffling overhead, especially for RDF queries that involve multiple join operations. To alleviate these challenges, this paper studies the effect of several optimization techniques that enhance the performance of RDF queries. Based on the query workload, reduced sets of intermediate results (or reductions, for short) that are common for certain join pattern(s) are computed. Furthermore, these reductions are not computed beforehand, but are rather computed only for the frequent join patterns in an online fashion using Bloom filters. Rather than caching the final results of each query, we show that caching the reductions allows reusing intermediate results across multiple queries that share the same join patterns. In addition, we introduce an efficient solution for RDF queries with unbound properties. Based on a realization of the proposed optimizations on top of Spark, extensive experimentation using two synthetic benchmarks and a real dataset demonstrates how these optimizations lead to an order of magnitude enhancement in terms of preprocessing, storage, and query performance compared to the state-of-the-art solutions.

Amgad Madkour, Ahmed M. Aly, Walid G. Aref
Canonicalisation of Monotone SPARQL Queries

Caching in the context of expressive query languages such as SPARQL is complicated by the difficulty of detecting equivalent queries: deciding if two conjunctive queries are equivalent is NP-complete, where adding further query features makes the problem undecidable. Despite this complexity, in this paper we propose an algorithm that performs syntactic canonicalisation of SPARQL queries such that the answers for the canonicalised query will not change versus the original. We can guarantee that the canonicalisation of two queries within a core fragment of SPARQL (monotone queries with select, project, join and union) is equal if and only if the two queries are equivalent; we also support other SPARQL features but with a weaker soundness guarantee: that the (partially) canonicalised query is equivalent to the input query. Despite the fact that canonicalisation must be harder than the equivalence problem, we show the algorithm to be practical for real-world queries taken from SPARQL endpoint logs, and further show that it detects more equivalent queries than when compared with purely syntactic methods. We also present the results of experiments over synthetic queries designed to stress-test the canonicalisation method, highlighting difficult cases.

Jaime Salas, Aidan Hogan
Cross-Lingual Classification of Crisis Data

Many citizens nowadays flock to social media during crises to share or acquire the latest information about the event. Due to the sheer volume of data typically circulated during such events, it is necessary to be able to efficiently filter out irrelevant posts, thus focusing attention on the posts that are truly relevant to the crisis. Current methods for classifying the relevance of posts to a crisis or set of crises typically struggle to deal with posts in different languages, and it is not viable during rapidly evolving crisis situations to train new models for each language. In this paper we test statistical and semantic classification approaches on cross-lingual datasets from 30 crisis events, consisting of posts written mainly in English, Spanish, and Italian. We experiment with scenarios where the model is trained on one language and tested on another, and where the data is translated to a single language. We show that the addition of semantic features extracted from external knowledge bases improves accuracy over a purely statistical model.

Prashant Khare, Grégoire Burel, Diana Maynard, Harith Alani
Measuring Semantic Coherence of a Conversation

Conversational systems have become increasingly popular as a way for humans to interact with computers. To be able to provide intelligent responses, conversational systems must correctly model the structure and semantics of a conversation. We introduce the task of measuring semantic (in)coherence in a conversation with respect to background knowledge, which relies on the identification of semantic relations between concepts introduced during a conversation. We propose and evaluate graph-based and machine learning-based approaches for measuring semantic coherence using knowledge graphs, their vector space embeddings and word embedding models, as sources of background knowledge. We demonstrate how these approaches are able to uncover different coherence patterns in conversations on the Ubuntu Dialogue Corpus.

Svitlana Vakulenko, Maarten de Rijke, Michael Cochez, Vadim Savenkov, Axel Polleres
Combining Truth Discovery and RDF Knowledge Bases to Their Mutual Advantage

This study exploits knowledge expressed in RDF Knowledge Bases (KBs) to enhance Truth Discovery (TD) performance. TD aims to identify facts (true claims) when conflicting claims are made by several sources. Based on the assumption that true claims are provided by reliable sources and reliable sources provide true claims, TD models iteratively compute value confidence and source trustworthiness in order to determine which claims are true. We propose a model that exploits the knowledge extracted from an existing RDF KB in the form of rules. These rules are used to quantify the evidence given by the RDF KB to support a claim. This evidence is then integrated into the computation of the confidence value to improve its estimation. The enhanced TD model efficiently obtains a larger set of reliable facts which, in turn, can be used to populate RDF KBs. Empirical experiments on real-world datasets showed the potential of the proposed approach, which led to an improvement of up to 18% compared to the model we modified.

Valentina Beretta, Sébastien Harispe, Sylvie Ranwez, Isabelle Mougenot
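
The fixed point at the heart of such TD models can be sketched in a few lines: a claim's confidence is aggregated from the trustworthiness of the sources asserting it, and a source's trustworthiness is the average confidence of its claims, iterated until the values stabilise. The sources, claims, and the simple aggregation below are illustrative only; the paper's model additionally injects rule-derived evidence from an RDF KB into the confidence update.

```python
# Minimal truth-discovery fixed point (illustration only): alternate between
# updating claim confidence from source trust and source trust from claim
# confidence. Real models add dampening; the paper further adds KB rule evidence.
from math import prod

claims_by_source = {                     # source -> claims it asserts
    "siteA": {("Paris", "capitalOf", "France")},
    "siteB": {("Paris", "capitalOf", "France")},
    "siteC": {("Lyon", "capitalOf", "France")},
}

trust = {s: 0.5 for s in claims_by_source}          # initial trustworthiness
all_claims = {c for cs in claims_by_source.values() for c in cs}
confidence = {}

for _ in range(10):
    confidence = {                                   # claims backed by more trusted
        c: 1 - prod(1 - trust[s]                     # sources gain confidence
                    for s, cs in claims_by_source.items() if c in cs)
        for c in all_claims
    }
    trust = {s: sum(confidence[c] for c in cs) / len(cs)
             for s, cs in claims_by_source.items()}

for claim, conf in sorted(confidence.items(), key=lambda kv: -kv[1]):
    print(f"{conf:.2f}  {claim}")
```
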
Content Based Fake News Detection Using Knowledge Graphs

This paper addresses the problem of fake news detection. There are many works already in this space; however, most of them target social media and do not use the news content itself for decision making. In this paper, we propose some novel approaches, including the B-TransE model, to detecting fake news based on news content using knowledge graphs. In our solutions, we need to address a few technical challenges. Firstly, computational-oriented fact checking is not comprehensive enough to cover all the relations needed for fake news detection. Secondly, it is challenging to validate the correctness of the triples extracted from news articles. Our approaches are evaluated with Kaggle’s ‘Getting Real about Fake News’ dataset and some true articles from mainstream media. The evaluations show that some of our approaches achieve F1-scores over 0.80.

Jeff Z. Pan, Siyana Pavlova, Chenxi Li, Ningxi Li, Yangmei Li, Jinshuo Liu
Backmatter
Metadata
Title: The Semantic Web – ISWC 2018
Editors: Denny Vrandečić, Prof. Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Dr. Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, Prof. Elena Simperl
Copyright Year: 2018
Electronic ISBN: 978-3-030-00671-6
Print ISBN: 978-3-030-00670-9
DOI: https://doi.org/10.1007/978-3-030-00671-6
