
## About this book

The 47 revised full papers presented together with three invited talks were carefully reviewed and selected from 204 submissions. This program was completed by a demonstration and poster session, in which researchers had the chance to present their latest results and advances in the form of live demos. In addition, the PhD Symposium program included 10 contributions, selected out of 21 submissions.

The core tracks of the research conference were complemented with new tracks focusing on linked data; machine learning; mobile web, sensors and semantic streams; natural language processing and information retrieval; reasoning; semantic data management, big data, and scalability; services, APIs, processes and cloud computing; smart cities, urban and geospatial data; trust and privacy; and vocabularies, schemas, and ontologies.

## Table of contents

### Detecting Similar Linked Datasets Using Topic Modelling

The Web of Data is growing continuously with respect to both the size and the number of the datasets published. Porting a dataset to five-star Linked Data, however, requires the publisher of this dataset to link it with the already available linked datasets. Given the size and growth of the Linked Data Cloud, the current, mostly manual approach used for detecting relevant datasets for linking is obsolete. We study the use of topic modelling for dataset search experimentally and present Tapioca, a linked dataset search engine that automatically provides data publishers with similar existing datasets. Our search engine uses a novel approach for determining the topical similarity of datasets. This approach relies on probabilistic topic modelling to determine related datasets using solely the metadata of datasets. We evaluate our approach on a manually created gold standard and with a user study. Our evaluation shows that our algorithm significantly outperforms a set of comparable baseline algorithms, including standard search engines, by 6 % F1-score. Moreover, we show that it can be used on a large real-world dataset with comparable performance.

Michael Röder, Axel-Cyrille Ngonga Ngomo, Ivan Ermilov, Andreas Both

### Heuristics for Connecting Heterogeneous Knowledge via FrameBase

With recent advances in information extraction techniques, various large-scale knowledge bases covering a broad range of knowledge have become publicly available. As no single knowledge base covers all information, many applications require access to integrated knowledge from multiple knowledge bases. Achieving this, however, is challenging due to differences in knowledge representation. To address this problem, this paper proposes to use linguistic frames as a common representation and maps heterogeneous knowledge bases to the FrameBase schema, which is formed by a large inventory of these frames. We develop several methods to create complex mappings from external knowledge bases to this schema, using text similarity measures, machine learning, and different heuristics. We test them with different widely used large-scale knowledge bases, YAGO2s, Freebase and WikiData. The resulting integrated knowledge can then be queried in a homogeneous way.

Jacobo Rouces, Gerard de Melo, Katja Hose

### Dataset Recommendation for Data Linking: An Intensional Approach

With the growing quantity and diversity of publicly available web datasets, most notably Linked Open Data, recommending datasets that meet specific criteria has become an increasingly important, yet challenging problem. This task is of particular interest when addressing issues such as entity retrieval, semantic search and data linking. Here, we focus on that last issue. We introduce a dataset recommendation approach to identify linking candidates based on the presence of schema overlap between datasets. While an understanding of the nature of the content of specific datasets is a crucial prerequisite, we adopt the notion of dataset profiles, where a dataset is characterized through a set of schema concept labels that best describe it and can be potentially enriched by retrieving their textual descriptions. We identify schema overlap with the help of a semantico-frequential concept similarity measure and a ranking criterion based on the tf*idf cosine similarity. The experiments, conducted over all available linked datasets on the Linked Open Data cloud, show that our method achieves an average precision of up to $$53\,\%$$ for a recall of $$100\,\%$$. As an additional contribution, our method returns the mappings between the schema concepts across datasets – a particularly useful input for the data linking step.

Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze, Konstantin Todorov
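The ranking criterion named in the abstract above, tf*idf cosine similarity over dataset profiles, can be illustrated with a minimal sketch. It is not the authors' implementation: the profile contents, the dataset names, and the functions `tfidf_vectors`, `cosine` and `rank_candidates` are assumptions made for illustration, and the semantico-frequential concept similarity measure is left out entirely.

```python
import math
from collections import Counter

def tfidf_vectors(profiles):
    """Map each dataset name to a sparse tf*idf vector over its concept labels."""
    n = len(profiles)
    # Document frequency: in how many profiles each label occurs.
    df = Counter(label for labels in profiles.values() for label in set(labels))
    vectors = {}
    for name, labels in profiles.items():
        tf = Counter(labels)
        vectors[name] = {
            label: (count / len(labels)) * math.log(n / df[label])
            for label, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(label, 0.0) for label, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(source, profiles):
    """Rank all other datasets by similarity of their profile to the source profile."""
    vectors = tfidf_vectors(profiles)
    scores = {
        name: cosine(vectors[source], vec)
        for name, vec in vectors.items()
        if name != source
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical profiles: each dataset is described by schema concept labels.
profiles = {
    "source": ["Person", "Organization", "Publication"],
    "candidate_a": ["Person", "Publication", "Conference"],
    "candidate_b": ["River", "Mountain", "City"],
}
ranking = rank_candidates("source", profiles)
print(ranking)
```

In this toy setting the candidate that shares concept labels with the source profile ranks first, and the candidate with no shared labels receives a score of zero.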

### From Queriability to Informativity, Assessing “Quality in Use” of DBpedia and YAGO

In recent years, an increasing number of semantic data sources have been published on the web. These sources are further interlinked to form the Linking Open Data (LOD) cloud. To make full use of these data sets, it is necessary to understand their data quality. Researchers have proposed several metrics and have developed numerous tools to measure the quality of the data sets in LOD along different dimensions. However, few studies evaluate data set quality from the users’ usability perspective, even though usability has a great impact on the spread and reuse of LOD data sets. On the other hand, usability is well studied in the area of software quality. In the newly published standard ISO/IEC 25010, usability is further broadened to include the notion of “quality in use” besides the other two factors, namely, internal and external. In this paper, we first adapt the notions and the methods used in software quality to assess data set quality. Second, we formally define two quality dimensions, namely, Queriability and Informativity, from the perspective of quality in use. The two proposed dimensions correspond to querying and answering, respectively, which are the most frequent usage scenarios for accessing LOD data sets. Then we provide a series of metrics to measure the two dimensions. Last, we apply the metrics to two representative data sets in LOD (i.e., YAGO and DBpedia). In the evaluation process, we select dozens of questions from both QALD and WebQuestions and ask a group of users to construct queries as well as to check the answers with the help of our usability testing tool. The findings from the assessment not only illustrate the capability of our method and metrics but also give new insights into the data quality of the two knowledge bases.

Tong Ruan, Yang Li, Haofen Wang, Liang Zhao

### Normalized Semantic Web Distance

In this paper, we investigate the Normalized Semantic Web Distance (NSWD), a semantics-aware distance measure between two concepts in a knowledge graph. Our measure advances the Normalized Web Distance, a recently established distance between two textual terms, to be more semantically aware. In addition to the theoretic fundamentals of the NSWD, we investigate its properties and qualities with respect to computation and implementation. We investigate three variants of the NSWD that make use of all semantic properties of nodes in a knowledge graph. Our performance evaluation based on the Miller-Charles benchmark shows that the NSWD is able to correlate with human similarity assessments on both Freebase and DBpedia knowledge graphs with values up to 0.69. Moreover, we verified the semantic awareness of the NSWD on a set of 20 unambiguous concept-pairs. We conclude that the NSWD is a promising measure with (1) a reusable implementation across knowledge graphs, (2) sufficient correlation with human assessments, and (3) awareness of semantic differences between ambiguous concepts.

Tom De Nies, Christian Beecks, Fréderic Godin, Wesley De Neve, Grzegorz Stepien, Dörthe Arndt, Laurens De Vocht, Ruben Verborgh, Thomas Seidl, Erik Mannens, Rik Van de Walle
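For orientation, the Normalized Web Distance that the NSWD builds on is commonly defined from occurrence counts as (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))). The sketch below computes that textual-term distance only; how the NSWD variants derive the counts from a knowledge graph is not described in the abstract, so the count arguments and the function name `normalized_web_distance` are assumptions for illustration.

```python
import math

def normalized_web_distance(f_x, f_y, f_xy, n):
    """Normalized Web Distance from raw counts.

    f_x, f_y: number of documents containing term x, respectively term y
    f_xy:     number of documents containing both terms
    n:        total number of documents (normalization constant)
    """
    if f_xy == 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_x, log_y = math.log(f_x), math.log(f_y)
    return (max(log_x, log_y) - math.log(f_xy)) / (math.log(n) - min(log_x, log_y))

# Hypothetical counts: identical occurrence patterns give distance 0.
same = normalized_web_distance(100, 100, 100, 10_000)
# Partial overlap gives a distance between 0 and 1 for these counts.
partial = normalized_web_distance(1_000, 100, 10, 1_000_000)
print(same, partial)
```

A distance of zero means the two terms always occur together; larger values indicate that they co-occur less often than their individual frequencies would suggest.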

### Gleaning Types for Literals in RDF Triples with Application to Entity Summarization

Associating meaning with data in a machine-readable format is at the core of the Semantic Web vision, and typing is one such process. Typing (assigning a class selected from schema) information can be attached to URI resources in RDF/S knowledge graphs and datasets to improve quality, reliability, and analysis. There are two types of properties: object properties, and datatype properties. Type information can be made available for object properties as their object values are URIs. Typed object properties allow richer semantic analysis compared to datatype properties, whose object values are literals. In fact, many datatype properties can be analyzed to suggest types selected from a schema similar to object properties, enabling their wider use in applications. In this paper, we propose an approach to glean types for datatype properties by processing their object values. We show the usefulness of generated types by utilizing them to group facts on the basis of their semantics in computing diversified entity summaries by extending a state-of-the-art summarization algorithm.

Kalpa Gunaratna, Krishnaprasad Thirunarayan, Amit Sheth, Gong Cheng

### TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud

Deciding which RDF vocabulary terms to use when modeling data as Linked Open Data (LOD) is far from trivial. In this paper, we propose TermPicker as a novel approach enabling vocabulary reuse by recommending vocabulary terms based on various features of a term. These features include the term’s popularity, whether it is from an already used vocabulary, and the so-called schema-level pattern (SLP) feature that exploits which terms other data providers on the LOD cloud use to describe their data. We apply Learning To Rank to establish a ranking model for vocabulary terms based on the utilized features. The results show that using the SLP-feature improves the recommendation quality by 29–36 % considering the Mean Average Precision and the Mean Reciprocal Rank at the first five positions compared to recommendations based on solely the term’s popularity and whether it is from an already used vocabulary.

Johann Schaible, Thomas Gottron, Ansgar Scherp
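The two evaluation measures mentioned in the abstract above, Mean Average Precision and Mean Reciprocal Rank at the first five positions, can be sketched as follows. The recommendation lists and relevance judgments are invented for illustration, and the normalization of average precision at a cutoff varies between papers; dividing by `min(len(relevant), k)` is one common choice and an assumption here, not necessarily the one used in the study.

```python
def reciprocal_rank(ranked, relevant, k=5):
    """1 / position of the first relevant item within the top k, else 0."""
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def average_precision(ranked, relevant, k=5):
    """Average of precision values at each relevant hit within the top k."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    denominator = min(len(relevant), k)
    return total / denominator if denominator else 0.0

# Hypothetical recommendation lists with their relevant vocabulary terms.
queries = [
    (["foaf:name", "dc:title", "foaf:knows", "rdfs:label", "dc:creator"],
     {"dc:title", "rdfs:label"}),
    (["foaf:Person", "foaf:Agent", "dc:subject", "foaf:name", "dc:date"],
     {"foaf:Person"}),
]

mrr_at_5 = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
map_at_5 = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(mrr_at_5, map_at_5)
```

Both measures reward placing relevant terms near the top of the list; MRR looks only at the first relevant term, while MAP takes every relevant term in the top five into account.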

### Implicit Entity Linking in Tweets

Sujan Perera, Pablo N. Mendes, Adarsh Alex, Amit P. Sheth, Krishnaprasad Thirunarayan

### Fast Approximate A-Box Consistency Checking Using Machine Learning

Ontology reasoning is typically a computationally intensive operation. While soundness and completeness of results are required in some use cases, for many others, a sensible trade-off between computation efforts and correctness of results makes more sense. In this paper, we show that it is possible to approximate a central task in reasoning, i.e., A-box consistency checking, by training a machine learning model which approximates the behavior of that reasoner for a specific ontology. On four different datasets, we show that such learned models consistently achieve an accuracy above 95 % at less than 2 % of the runtime of a reasoner, using a decision tree with no more than 20 inner nodes. For example, this allows for validating 293M Microdata documents against the schema.org ontology in less than 90 min, compared to 18 days required by a state-of-the-art ontology reasoner.

Heiko Paulheim, Heiner Stuckenschmidt

Petar Ristoski, Peter Mika

Liang Zheng, Jiang Xu, Jidong Jiang, Yuzhong Qu, Gong Cheng

### DoSeR - A Knowledge-Base-Agnostic Framework for Entity Disambiguation Using Semantic Embeddings

Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their entities in a knowledge base. It finds its application in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally so in facilitating artificial intelligence applications, such as Semantic Search, Reasoning and Question & Answering. In this work, we propose DoSeR (Disambiguation of Semantic Resources), a (named) entity disambiguation framework that is knowledge-base-agnostic in terms of RDF (e.g. DBpedia) and entity-annotated document knowledge bases (e.g. Wikipedia). Initially, our framework automatically generates semantic entity embeddings given one or multiple knowledge bases. Subsequently, DoSeR accepts documents with a given set of surface forms as input and collectively links them to entities in a knowledge base with a graph-based approach. We evaluate DoSeR on seven different data sets against publicly available, state-of-the-art (named) entity disambiguation frameworks. Our approach outperforms the state-of-the-art approaches that make use of RDF knowledge bases and/or entity-annotated document knowledge bases by up to 10 % F1 measure.

Stefan Zwicklbauer, Christin Seifert, Michael Granitzer

### Embedding Mapping Approaches for Tensor Factorization and Knowledge Graph Modelling

Latent embedding models are the basis of state-of-the-art statistical solutions for modelling Knowledge Graphs and Recommender Systems. However, to be able to perform predictions for new entities and relation types, such models have to be retrained completely to derive the new latent embeddings. This could be a potential limitation when fast predictions for new entities and relation types are required. In this paper we propose approaches that can map new entities and new relation types into the existing latent embedding space without the need for retraining. Our proposed models are based on the observable (even incomplete) features of a new entity, e.g. a subset of observed links to other known entities. We show that these mapping approaches are efficient and are applicable to a wide variety of existing factorization models, including nonlinear models. We report performance results on multiple real-world datasets and evaluate the performance from different aspects.

Yinchong Yang, Cristóbal Esteban, Volker Tresp

### Comparing Vocabulary Term Recommendations Using Association Rules and Learning to Rank: A User Study

When modeling Linked Open Data (LOD), reusing appropriate vocabulary terms to represent the data is difficult, because there are many vocabularies to choose from. Vocabulary term recommendations could alleviate this situation. We present a user study evaluating a vocabulary term recommendation service that is based on how other data providers have used RDF classes and properties in the LOD cloud. Our study compares the machine learning technique Learning to Rank (L2R), the classical data mining approach Association Rule mining (AR), and a baseline that does not provide any recommendations. Results show that utilizing AR, participants needed less time and less effort to model the data, which in the end resulted in models of better quality.

Johann Schaible, Pedro Szekely, Ansgar Scherp

### Full-Text Support for Publish/Subscribe Ontology Systems

In this work, we envision a publish/subscribe ontology system that is able to index large numbers of expressive continuous queries and filter them against RDF data that arrive in a streaming fashion. To this end, we propose a SPARQL extension that supports the creation of full-text continuous queries and propose a family of main-memory query indexing algorithms which perform matching at low complexity and minimal filtering time. We experimentally compare our approach against a state-of-the-art competitor (extended to handle indexing of full-text queries) both on structural and full-text tasks using real-world data. Our approach proves two orders of magnitude faster than the competitor in all types of filtering tasks.

Lefteris Zervakis, Christos Tryfonopoulos, Spiros Skiadopoulos, Manolis Koubarakis

### Heaven: A Framework for Systematic Comparative Research Approach for RSP Engines

Benchmarks like LSBench, SRBench, CSRBench and, more recently, CityBench satisfy the growing need for shared datasets, ontologies and queries to evaluate window-based RDF Stream Processing (RSP) engines. However, no clear winner emerges from the evaluation. In this paper, we claim that the RSP community needs to adopt a Systematic Comparative Research Approach (SCRA) if it wants to move a step forward. To this end, we propose a framework that enables SCRA for window-based RSP engines. The contributions of this paper are: (i) the requirements that tools aiming at enabling SCRA must satisfy; (ii) the architecture of a facility to design and execute experiments guaranteeing repeatability, reproducibility and comparability; (iii) $$\mathcal{H}$$eaven, a proof-of-concept implementation of this architecture that we released as open source; (iv) two RSP engine implementations, also open source, that we propose as baselines for the comparative research (i.e., they can serve as terms of comparison in future works). We demonstrate the effectiveness of $$\mathcal{H}$$eaven using the baselines by: (i) showing that top-down hypothesis verification is not straightforward even in controlled conditions and (ii) providing examples of bottom-up comparative analysis.

Riccardo Tommasini, Emanuele Della Valle, Marco Balduini, Daniele Dell’Aglio

### Bridging the Gap Between Formal Languages and Natural Languages with Zippers

The Semantic Web is founded on a number of Formal Languages (FL) whose benefits are precision, lack of ambiguity, and the ability to automate reasoning tasks such as inference or query answering. This, however, poses the challenge of mediation between machines and users because the latter generally prefer Natural Languages (NL) for accessing and authoring knowledge. In this paper, we introduce a design pattern based on Abstract Syntax Trees (AST), Huet’s zippers and Montague grammars to zip together a natural language and a formal language. Unlike question answering, translation does not go from NL to FL but, as the pattern’s symbol suggests, from ASTs (A) of an intermediate language to both NL and FL. ASTs are built interactively and incrementally through a user-machine dialog where the user only sees NL, and the machine only sees FL.

Sébastien Ferré

### Towards Monitoring of Novel Statements in the News

In media monitoring, users have a clearly defined information need: to find previously unknown statements regarding certain entities or relations mentioned in natural-language text. However, commonly used keyword-based search technologies are focused on finding relevant documents and cannot judge the novelty of statements contained in the text. In this work, we propose a new semantic novelty measure that allows retrieving statements that are both novel and relevant from natural-language sentences in news articles. Relevance is defined by a semantic query of the user, while novelty is ensured by checking whether the extracted statements are related to, but not present in, a knowledge base containing the currently known facts. Our evaluation, performed on English news texts and on CrunchBase as the knowledge base, demonstrates the effectiveness, unique capabilities and future challenges of this novel approach to novelty.

Michael Färber, Achim Rettinger, Andreas Harth

### AskNow: A Framework for Natural Language Query Formalization in SPARQL

Natural Language Query Formalization involves semantically parsing queries in natural language and translating them into their corresponding formal representations. It is a key component for developing question-answering (QA) systems on RDF data. The chosen formal representation language in this case is often SPARQL. In this paper, we propose a framework, called AskNow, where users can pose queries in English to a target RDF knowledge base (e.g. DBpedia), which are first normalized into an intermediary canonical syntactic form, called Normalized Query Structure (NQS), and then translated into SPARQL queries. NQS facilitates the identification of the desire (or expected output information) and the user-provided input information, and establishing their mutual semantic relationship. At the same time, it is sufficiently adaptive to query paraphrasing. We have empirically evaluated the framework with respect to the syntactic robustness of NQS and semantic accuracy of the SPARQL translator on standard benchmark datasets.

Mohnish Dubey, Sourish Dasgupta, Ankit Sharma, Konrad Höffner, Jens Lehmann

### Knowledge Extraction for Information Retrieval

Document retrieval is the task of returning relevant textual resources for a given user query. In this paper, we investigate whether the semantic analysis of the query and the documents, obtained by exploiting state-of-the-art Natural Language Processing techniques (e.g., Entity Linking, Frame Detection) and Semantic Web resources (e.g., YAGO, DBpedia), can improve the performance of the traditional term-based similarity approach. Our experiments, conducted on a recently released document collection, show that Mean Average Precision (MAP) increases by 3.5 percentage points when combining textual and semantic analysis, thus suggesting that semantic content can effectively improve the performance of Information Retrieval systems.

Francesco Corcoglioniti, Mauro Dragoni, Marco Rospocher, Alessio Palmero Aprosio

### Efficient Graph-Based Document Similarity

Assessing the relatedness of documents is at the core of many applications such as document retrieval and recommendation. Most similarity approaches operate on word-distribution-based document representations - fast to compute, but problematic when documents differ in language, vocabulary or type, and neglecting the rich relational knowledge available in Knowledge Graphs. In contrast, graph-based document models can leverage valuable knowledge about relations between entities - however, due to expensive graph operations, similarity assessments tend to become infeasible in many applications. This paper presents an efficient semantic similarity approach exploiting explicit hierarchical and transversal relations. We show in our experiments that (i) our similarity measure provides a significantly higher correlation with human notions of document similarity than comparable measures, (ii) this also holds for short documents with few annotations, (iii) document similarity can be calculated efficiently compared to other graph-traversal based approaches.

### Semantic Topic Compass – Classification Based on Unsupervised Feature Ambiguity Gradation

Characterising social media topics often requires new features to be continuously taken into account, thus increasing the need for classifier retraining. One challenging aspect is the emergence of ambiguous features, which can affect classification performance. In this paper we investigate the impact of the use of ambiguous features in a topic classification task, and introduce the Semantic Topic Compass (STC) framework, which characterises ambiguity in a topic’s feature space. STC makes use of topic priors derived from structured knowledge sources to facilitate the semantic feature grading of a topic. Our findings demonstrate that the proposed framework offers competitive boosts in performance across all datasets.

Amparo Elizabeth Cano, Hassan Saif, Harith Alani, Enrico Motta

### Supporting Arbitrary Custom Datatypes in RDF and SPARQL

In the Resource Description Framework, literals are composed of a Unicode string (the lexical form), a datatype IRI, and optionally, when the datatype IRI is rdf:langString, a language tag. Any IRI can take the place of a datatype IRI, but the specification only defines the precise meaning of a literal when the datatype IRI is among a predefined subset. Custom datatypes have reported use on the Web of Data, and show some advantages in representing some classical structures. Yet, their support by RDF processors is rare and implementation-specific. In this paper, we first present the minimal set of functions that should be defined in order to make a custom datatype usable in query answering and reasoning. Based on this, we discuss solutions that would enable: (i) data publishers to publish the definition of arbitrary custom datatypes on the Web, and (ii) generic RDF processors or SPARQL query engines to discover custom datatypes on-the-fly, and to perform operations on them accordingly. Finally, we detail a concrete solution that targets arbitrarily complex custom datatypes, give an overview of its implementation in Jena and ARQ, and report the results of an experiment on a real-world DBpedia use case.

Maxime Lefrançois, Antoine Zimmermann
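The literal structure described at the start of the abstract above (lexical form, datatype IRI, and a language tag only for rdf:langString) can be modelled in a few lines. This is an illustrative data structure, not the paper's solution or the Jena API: the class `Literal` and the custom datatype IRI under `http://example.org/` are assumptions, while the two well-known datatype IRIs are the standard ones.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

@dataclass(frozen=True)
class Literal:
    lexical_form: str                 # the Unicode string
    datatype: str = XSD_STRING        # any IRI may appear here
    language: Optional[str] = None    # only allowed with rdf:langString

    def __post_init__(self):
        has_tag = self.language is not None
        is_lang_string = self.datatype == RDF_LANG_STRING
        # A language tag is present exactly when the datatype is rdf:langString.
        if has_tag != is_lang_string:
            raise ValueError("language tag requires rdf:langString, and vice versa")

# A language-tagged string.
greeting = Literal("bonjour", RDF_LANG_STRING, "fr")

# A literal with a custom datatype. The IRI is hypothetical; a generic processor
# can store this literal but cannot interpret "5.0 m" without a definition of
# the datatype.
length = Literal("5.0 m", "http://example.org/datatypes#length")
print(greeting, length)
```

The second literal shows the situation the abstract addresses: it is syntactically valid, but its value is opaque to a processor that does not know the datatype behind the IRI.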

### Handling Inconsistencies Due to Class Disjointness in SPARQL Updates

The problem of updating ontologies has received increased attention in recent years. In the approaches proposed so far, either the update language is restricted to sets of ground atoms or, where the full SPARQL update language is allowed, the TBox language is restricted so that no inconsistencies can arise. In this paper we discuss directions to overcome these limitations. Starting from a DL-Lite fragment covering RDFS and concept disjointness axioms, we define three semantics for SPARQL instance-level (ABox) update: under cautious semantics, inconsistencies are resolved by rejecting updates potentially introducing conflicts; under brave semantics, instead, conflicts are overridden in favor of new information where possible; finally, the fainthearted semantics is a compromise between the former two approaches, designed to accommodate as much of the new information as possible, as long as consistency with the prior knowledge is not violated. We show how these semantics can be implemented in SPARQL via rewritings of polynomial size and draw first conclusions from their practical evaluation.

Albin Ahmeti, Diego Calvanese, Axel Polleres, Vadim Savenkov
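The difference between the cautious and the brave semantics described in the abstract above can be illustrated on a toy ABox. The sketch rests on strong simplifying assumptions that are not from the paper: class assertions are plain (individual, class) pairs, only insertions are considered, no RDFS inference is applied, clashes among the new assertions themselves are ignored, and the fainthearted semantics is left out. The function names are hypothetical, and the paper's actual approach works through SPARQL rewritings rather than in-memory sets.

```python
def clashes(a, b, disjoint):
    """Two class assertions clash if they type the same individual with disjoint classes."""
    return a[0] == b[0] and frozenset((a[1], b[1])) in disjoint

def cautious_insert(abox, new, disjoint):
    """Reject the whole update if any new assertion conflicts with existing data."""
    if any(clashes(n, old, disjoint) for n in new for old in abox):
        return set(abox)
    return set(abox) | set(new)

def brave_insert(abox, new, disjoint):
    """Accept the update and drop existing assertions that conflict with it."""
    kept = {old for old in abox if not any(clashes(n, old, disjoint) for n in new)}
    return kept | set(new)

# Hypothetical data: Student and Professor are declared disjoint.
disjoint = {frozenset(("Student", "Professor"))}
abox = {("alice", "Student"), ("bob", "Student")}
update = {("alice", "Professor")}

after_cautious = cautious_insert(abox, update, disjoint)
after_brave = brave_insert(abox, update, disjoint)
print(after_cautious)
print(after_brave)
```

Under the cautious policy the ABox stays unchanged because the update would make alice both a Student and a Professor; under the brave policy the older Student assertion for alice is removed and the new assertion is kept.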

### A Contextualised Semantics for owl:sameAs

Identity relations are at the foundation of the Semantic Web and the Linked Data Cloud. In many instances the classical interpretation of identity is too strong for practical purposes. This is particularly the case when two entities are considered the same in some but not all contexts. Unfortunately, modeling the specific contexts in which an identity relation holds is cumbersome and, due to arbitrary reuse and the Open World Assumption, it is impossible to anticipate all contexts in which an entity will be used. We propose an alternative semantics for owl:sameAs that partitions the original relation into a hierarchy of subrelations. The subrelation to which an identity statement belongs depends on the dataset in which the statement occurs. Adding future assertions may change the subrelation to which an identity statement belongs, resulting in a context-dependent and non-monotonic semantics. We show that this more fine-grained semantics is better able to characterize the actual use of owl:sameAs as observed in Linked Open Datasets.

Wouter Beek, Stefan Schlobach, Frank van Harmelen

### The Lazy Traveling Salesman – Memory Management for Large-Scale Link Discovery

Links between knowledge bases build the backbone of the Linked Data Web. In previous works, several time-efficient algorithms have been developed for computing links between knowledge bases. Most of these approaches rely on comparing resource properties based on similarity or distance functions as well as combinations thereof. However, these approaches pay little attention to the fact that very large datasets cannot be held in the main memory of most computing devices. In this paper, we present a generic memory management for Link Discovery. We show that the problem at hand is a variation of the traveling salesman problem and is thus NP-complete. We thus provide efficient graph-based algorithms that allow scheduling link discovery tasks efficiently. Our evaluation on real data shows that our approach allows computing links between large amounts of resources efficiently.

Axel-Cyrille Ngonga Ngomo, Mofeed M. Hassan

### RDF Query Relaxation Strategies Based on Failure Causes

Recent advances in Web-information extraction have led to the creation of several large Knowledge Bases (KBs). Querying these KBs often results in empty answers that do not serve the users’ needs. Relaxation of the failing queries is one of the cooperative techniques used to retrieve alternative results. Most of the previous work on RDF query relaxation computes a set of relaxed queries and executes them in a similarity-based ranking order. Thus, these approaches relax an RDF query without knowing its failure causes (FCs). In this paper, we study the idea of identifying these FCs to speed up the query relaxation process. We propose three relaxation strategies based on various information levels about the FCs of the user query and of its relaxed queries as well. A set of experiments conducted on the LUBM benchmark shows the impact of our proposal in comparison with a state-of-the-art algorithm.

Géraud Fokou, Stéphane Jean, Allel Hadjali, Mickaël Baron

### CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments

The Linked Data Fragment (LDF) approach promotes a new trade-off between performance and data availability for querying Linked Data. While data providers’ HTTP caches play a crucial role in LDF performance, LDF clients also cache data during SPARQL query processing. Unfortunately, as these clients do not collaborate, they cannot take advantage of this large decentralized cache hosted by clients. In this paper, we propose CyCLaDEs, an overlay network based on the similarity of LDF fragments. For each LDF client, CyCLaDEs builds a neighborhood of LDF clients hosting related fragments in their cache. During query processing, the neighborhood cache is checked before a request is sent to the LDF server. Experimental results show that CyCLaDEs is able to handle a significant amount of LDF query processing and provide a more specialized cache on the client side.

Pauline Folz, Hala Skaf-Molli, Pascal Molli

Finding relevant resources on the Semantic Web today is a dirty job: no centralized query service exists and the support for natural language access is limited. We present LOTUS: Linked Open Text UnleaShed, a text-based entry point to a massive subset of today’s Linked Open Data Cloud. Recognizing the use case dependency of resource retrieval, LOTUS provides an adaptive framework in which a set of matching and ranking algorithms are made available. Researchers and developers are able to tune their own LOTUS index by choosing and combining the matching and ranking algorithms that suit their use case best. In this paper, we explain the LOTUS approach, its implementation and the functionality it provides. We demonstrate the ease with which LOTUS enables text-based resource retrieval at an unprecedented scale in concrete and domain-specific scenarios. Finally, we provide evidence for the scalability of LOTUS with respect to the LOD Laundromat, the largest collection of easily accessible Linked Open Data currently available.

Filip Ilievski, Wouter Beek, Marieke van Erp, Laurens Rietveld, Stefan Schlobach

### Query Rewriting in RDF Stream Processing

Querying and reasoning over RDF streams are two increasingly relevant areas in the broader scope of processing structured data on the Web. While RDF Stream Processing (RSP) has focused so far on extending SPARQL for continuous query and event processing, stream reasoning has concentrated on ontology evolution and incremental materialization. In this paper we propose a different approach for querying RDF streams over ontologies, based on the combination of query rewriting and stream processing. We show that it is possible to rewrite continuous queries over streams of RDF data, while maintaining efficiency for a wide range of scenarios. We provide a detailed description of our approach, as well as an implementation, StreamQR, which is based on the kyrie rewriter, and can be coupled with a native RSP engine, namely CQELS. Finally, we show empirical evidence of the performance of StreamQR in a series of experiments based on the SRBench query set.

Jean-Paul Calbimonte, Jose Mora, Oscar Corcho
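
To make the idea of combining query rewriting with stream processing concrete, the following is a minimal sketch and not the StreamQR implementation. It assumes a toy subclass hierarchy, a hypothetical `rewrite` helper that expands a class into its subclasses, and a count-based sliding window; none of these names come from the paper, and neither kyrie nor CQELS is used.

```python
from collections import deque

# Hypothetical toy ontology (subclass -> superclass); not taken from the paper.
SUBCLASS_OF = {
    "TemperatureObservation": "Observation",
    "HumidityObservation": "Observation",
    "AirTemperatureObservation": "TemperatureObservation",
}


def rewrite(target_class):
    """Expand a class into itself plus all of its transitive subclasses."""
    expanded = {target_class}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in expanded and sub not in expanded:
                expanded.add(sub)
                changed = True
    return expanded


def continuous_query(stream, target_class, window_size=3):
    """After each incoming triple, yield the subjects in the current window
    whose rdf:type is one of the rewritten classes."""
    classes = rewrite(target_class)      # rewriting happens once, up front
    window = deque(maxlen=window_size)   # count-based sliding window
    for triple in stream:
        window.append(triple)
        yield sorted(s for s, p, o in window if p == "rdf:type" and o in classes)


stream = [
    ("obs1", "rdf:type", "AirTemperatureObservation"),
    ("obs2", "rdf:type", "HumidityObservation"),
    ("obs3", "rdf:type", "Sensor"),
    ("obs4", "rdf:type", "Observation"),
]
results = list(continuous_query(stream, "TemperatureObservation", window_size=2))
```

The point of the sketch is that the ontology is consulted only when the query is rewritten, so the per-triple work on the stream stays a plain pattern match.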

### Linking Data, Services and Human Know-How

An increasing number of everyday tasks involve a mixture of human actions and machine computation. This paper presents the first framework that allows non-programmer users to create and execute workflows where each task can be completed by a human or a machine. In this framework, humans and machines interact through a shared knowledge base which is both human and machine understandable. This knowledge base is based on the PROHOW Linked Data vocabulary that can represent human instructions and link them to machine functionalities. Our hypothesis is that non-programmer users can describe how to achieve certain tasks at a level of abstraction which is both human and machine understandable. This paper presents the PROHOW vocabulary and describes its usage within the proposed framework. We substantiate our claim with a concrete implementation of our framework and by experimental evidence.

Paolo Pareti, Ewan Klein, Adam Barker

### VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data and its Application to Spatiotemporally Dependent Data

Powered by Semantic Web technologies, the Linked Data paradigm aims at weaving a globally interconnected graph of raw data that transforms the ways we publish, retrieve, share, reuse, and integrate data from a variety of distributed and heterogeneous sources. In practice, however, this vision faces substantial challenges with respect to data quality, coverage, and longevity, the amount of background knowledge required to query distant data, the reproducibility of query results and their derived (scientific) findings, and the lack of computational capabilities required for many tasks. One key issue underlying these challenges is the trade-off between storing data and computing them. Intuitively, data that is derived from already stored data, changes frequently in space and time, or is the result of some workflow or procedure, should be computed. However, this functionality is not readily available on the Linked Data cloud with its current technology stack. In this work, we introduce a proxy that can transparently run on top of arbitrary SPARQL endpoints to enable the on-demand computation of Linked Data together with the provenance information required to understand how they were derived. While our work can be generalized to multiple domains, we focus on two geographic use cases to showcase the proxy’s capabilities.

Blake Regalia, Krzysztof Janowicz, Song Gao

### Learning to Classify Spatiotextual Entities in Maps

In this paper, we present an approach for automatically recommending categories for spatiotextual entities, based on already existing annotated entities. Our goal is to facilitate the annotation process in crowdsourcing map initiatives such as OpenStreetMap, so that more accurate annotations are produced for the newly created spatial entities, while at the same time increasing the reuse of already existing tags. We define and construct a set of training features to represent the attributes of the spatiotextual entities and to capture their relation with the categories they are annotated with. These features include spatial, textual and semantic properties of the entities. We evaluate four different approaches, namely SVM, kNN, clustering+SVM and clustering+kNN, on several combinations of the defined training features and we examine which configurations of the algorithms achieve the best results. The presented work is deployed in OSMRec, a plugin for the JOSM tool that is commonly used for editing content in OpenStreetMap.

Giorgos Giannopoulos, Nikos Karagiannakis, Dimitrios Skoutas, Spiros Athanasiou

### Supporting Geo-Ontology Engineering Through Spatial Data Analytics

Geo-ontologies are becoming first-class artifacts in spatial data management because of their ability to represent places and points of interest. Several general-purpose geo-ontologies are available and widely employed to describe spatial entities across the world. The cultural, contextual and geographic differences between locations, however, call for more specialized and spatially-customized geo-ontologies. In order to help ontology engineers in (re)engineering geo-ontologies, spatial data analytics can provide interesting insights on territorial characteristics, thus revealing peculiarities and diversities between places. In this paper we propose a set of spatial analytics methods and tools to evaluate existing instances of a general-purpose geo-ontology within two distinct urban environments, in order to support ontology engineers in two tasks: (1) the identification of possible location-specific ontology restructuring activities, like specializations or extensions, and (2) the specification of new potential concepts to formalize neighborhood semantic models. We apply the proposed approach to datasets related to the cities of Milano and London extracted from LinkedGeoData, present the experimental results, and discuss their value in assisting geo-ontology engineering.

Gloria Re Calegari, Emanuela Carlino, Irene Celino, Diego Peroni

### Provenance Management for Evolving RDF Datasets

Tracking the provenance of information published on the Web is of crucial importance for effectively supporting trustworthiness, accountability and repeatability in the Web of Data. Although extensive work has been done on computing the provenance for SPARQL queries, little research has been conducted for the case of SPARQL updates. This paper proposes a new provenance model that borrows properties from both how and where provenance models, and is suitable for capturing the triple and attribute level provenance of data introduced via SPARQL INSERT updates. To the best of our knowledge, this is the first model that deals with the provenance of SPARQL updates using algebraic expressions, in the spirit of the well-established model of provenance semirings. We present an algorithm that records the provenance of SPARQL update results, and a reconstruction algorithm that uses this provenance to identify a SPARQL update that is compatible with the original one, given only the recorded provenance. Our approach is implemented and evaluated on top of the Virtuoso Database Engine.

Argyro Avgoustaki, Giorgos Flouris, Irini Fundulaki, Dimitris Plexousakis

### Private Record Linkage: Comparison of Selected Techniques for Name Matching

The rise of Big Data Analytics has shown the utility of analyzing all aspects of a problem by bringing together disparate data sets. Efficient and accurate private record linkage algorithms are necessary to achieve this. However, records are often linked based on personally identifiable information, and protecting the privacy of individuals is critical. This paper contributes to this field by studying an important component of the private record linkage problem: linking based on names while keeping those names encrypted, both on disk and in memory. We explore the applicability, accuracy and speed of three different primary approaches to this problem (along with several variations) and compare the results to common name-matching metrics on unprotected data. While these approaches are not new, this paper provides a thorough analysis on a range of datasets containing systematically introduced flaws common to name-based data entry, such as typographical errors, optical character recognition errors, and phonetic errors.

Pawel Grzebala, Michelle Cheatham

### An Ontology-Driven Approach for Semantic Annotation of Documents with Specific Concepts

This paper deals with an ontology-driven approach for semantic annotation of documents from a corpus where each document describes an entity of the same domain. The goal is to annotate each document with concepts that are too specific to be explicitly mentioned in the texts. The only thing we know about the concepts is their labels, i.e., we have no semantic information about these concepts. Moreover, their characteristics in the texts are incomplete. We propose an ontology-based approach, named Saupodoc, that performs this particular annotation process by combining several approaches. Saupodoc relies on a domain ontology relative to the field under study, which has a pivotal role, on its population with property assertions coming from documents and external resources, and on its enrichment with formal specific concept definitions. Experiments have been carried out in two application domains, showing the benefit of the approach compared to well-known classifiers.

Céline Alec, Chantal Reynaud-Delaître, Brigitte Safar

### Qanary – A Methodology for Vocabulary-Driven Open Question Answering Systems

It is very challenging to access the knowledge expressed within (big) data sets. Question answering (QA) aims at making sense out of data via a simple-to-use interface. However, QA systems are very complex and earlier approaches are mostly singular and monolithic implementations for QA in specific domains. Therefore, it is cumbersome and inefficient to design and implement new or improved approaches, in particular as many components are not reusable. Hence, there is a strong need for enabling best-of-breed QA systems, where the best performing components are combined, aiming at the best quality achievable in the given domain. Taking into account the high variety of functionality that might be of use within a QA system and therefore reused in new QA systems, we provide an approach driven by a core QA vocabulary that is aligned to existing, powerful ontologies provided by domain-specific communities. We achieve this by a methodology for binding existing vocabularies to our core QA vocabulary without re-creating the information provided by external components. We thus provide a practical approach for rapidly establishing new (domain-specific) QA systems, while the core QA vocabulary is re-usable across multiple domains. To the best of our knowledge, this is the first approach to open QA systems that is agnostic to implementation details and that inherently follows the linked data principles.

Andreas Both, Dennis Diefenbach, Kuldeep Singh, Saeedeh Shekarpour, Didier Cherix, Christoph Lange

### Test-Driven Development of Ontologies

Emerging ontology authoring methods to add knowledge to an ontology focus on ameliorating the validation bottleneck. The verification of the newly added axiom is still one of trying and seeing what the reasoner says, because a systematic testbed for ontology authoring is missing. We sought to address this by introducing the approach of test-driven development for ontology authoring. We specify 36 generic tests, as TBox queries and TBox axioms tested through individuals, and structure their inner workings in an ‘open box’-way, which cover the OWL 2 DL language features. This is implemented as a Protégé plugin so that one can perform a TDD test as a black box test. We evaluated the two test approaches on their performance. The TBox queries were faster, and that effect is more pronounced the larger the ontology is.

C. Maria Keet, Agnieszka Ławrynowicz
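
The test-first cycle described above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the authors' Protégé plugin: the hypothetical `entails_subclass` function stands in for a reasoner and only follows told subclass axioms transitively, and the class names are invented for illustration.

```python
def entails_subclass(axioms, sub, sup):
    """Return True if `sub` is a subclass of `sup` by reflexivity or by
    chaining the told (sub, super) axioms. A stand-in for a real reasoner."""
    seen, frontier = set(), [sub]
    while frontier:
        current = frontier.pop()
        if current == sup:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(b for a, b in axioms if a == current)
    return False


def tdd_add_subclass(axioms, sub, sup):
    """Run the test first, add the axiom only if the test fails,
    then re-run the test to confirm it now passes."""
    if entails_subclass(axioms, sub, sup):
        return "already entailed"
    axioms.add((sub, sup))
    return "added" if entails_subclass(axioms, sub, sup) else "failed"


# Hypothetical example ontology as a set of (subclass, superclass) pairs.
ontology = {("Lion", "Carnivore"), ("Carnivore", "Animal")}
first = tdd_add_subclass(ontology, "Lion", "Animal")      # follows by transitivity
second = tdd_add_subclass(ontology, "Giraffe", "Animal")  # missing, so it is added
```

Running the test before editing tells the author whether the axiom is redundant; running it again afterwards confirms that the edit had the intended effect.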

### Semantically Enhanced Quality Assurance in the JURION Business Use Case

The publishing industry is undergoing major changes. These changes are mainly based on technical developments and related habits of information consumption. Wolters Kluwer has already engaged in new solutions to meet these challenges: on the one hand, to improve all backend processes for generating good-quality content, and on the other hand, to deliver frontend information and software that make the customer’s life easier. JURION is an innovative legal information platform developed by Wolters Kluwer Germany (WKD) that merges and interlinks over one million documents of content and data from diverse sources such as national and European legislation and court judgments, extensive internally authored content and local customer data, as well as social media and web data (e.g. DBpedia). In collecting and managing this data, all stages of the Data Lifecycle are present – extraction, storage, authoring, interlinking, enrichment, quality analysis, repair and publication. Ensuring data quality is a key step in the JURION data lifecycle. In this industry paper we present two use cases for verifying quality: (1) integrating quality tools in the existing software infrastructure and (2) improving the data enrichment step by checking the external sources before importing them in JURION. We open-source part of our extensions and provide a screencast with our prototype in action.

Dimitris Kontokostas, Christian Mader, Christian Dirschl, Katja Eck, Michael Leuthold, Jens Lehmann, Sebastian Hellmann

### Adaptive Linked Data-Driven Web Components: Building Flexible and Reusable Semantic Web Interfaces

Due to the increasing amount of Linked Data openly published on the Web, user-facing Linked Data Applications (LDAs) are gaining momentum. One of the major entrance barriers for Web developers to contribute to this wave of LDAs is the required knowledge of Semantic Web (SW) technologies such as the RDF data model and SPARQL query language. This paper presents an adaptive component-based approach together with its open source implementation for creating flexible and reusable SW interfaces driven by Linked Data. Linked Data-driven (LD-R) Web components abstract the complexity of the underlying SW technologies in order to allow reuse of existing Web components in LDAs, enabling Web developers who are not experts in SW to develop interfaces that view, edit and browse Linked Data. In addition to the modularity provided by the LD-R components, the proposed RDF-based configuration method allows application assemblers to reshape their user interface for different use cases, by either reusing existing shared configurations or by creating their proprietary configurations.

Ali Khalili, Antonis Loizou, Frank van Harmelen

### Building the Seshat Ontology for a Global History Databank

This paper describes OWL ontology re-engineering from the wiki-based social science codebook (thesaurus) developed by the Seshat: Global History Databank. The ontology describes human history as a set of over 1500 time series variables and supports variable uncertainty, temporal scoping, annotations and bibliographic references. The ontology was developed to transition from traditional social science data collection and storage techniques to an RDF-based approach. RDF supports automated generation of high usability data entry and validation tools, data quality management, incorporation of facts from the web of data and management of the data curation lifecycle. This ontology re-engineering exercise identified several pitfalls in modelling social science codebooks with semantic web technologies; provided insights into the practical application of OWL to complex, real-world modelling challenges; and has enabled the construction of new, RDF-based tools to support the large-scale Seshat data curation effort. The Seshat ontology is an exemplar of a set of ontology design patterns for modelling uncertainty or temporal bounds in standard RDF. Thus the paper provides guidance for deploying RDF in the social sciences. Within Seshat, OWL-based data quality management will assure the data is suitable for statistical analysis. Publication of Seshat as high-quality, linked open data will enable other researchers to build on it.

Rob Brennan, Kevin Feeney, Gavin Mendel-Gleason, Bojan Bozic, Peter Turchin, Harvey Whitehouse, Pieter Francois, Thomas E. Currie, Stephanie Grohmann

### RMLEditor: A Graph-Based Mapping Editor for Linked Data Mappings

Although several tools have been implemented to generate Linked Data from raw data, users still need to be aware of the underlying technologies and Linked Data principles to use them. Mapping languages make it possible to detach the mapping definitions from the implementation that executes them. However, no thorough research has been conducted on how to facilitate the editing of mappings. We propose the RMLEditor, a visual graph-based user interface, which allows users to easily define the mappings that deliver the RDF representation of the corresponding raw data. Neither knowledge of the underlying mapping language nor of the technologies used is required. The RMLEditor aims to facilitate the editing of mappings, and thereby lowers the barriers to creating Linked Data. The RMLEditor is developed for use by data specialists who are partners of (i) a companies-driven pilot and (ii) a community group. The current version of the RMLEditor was validated: participants indicate that it is adequate for its purpose and that the graph-based approach enables users to conceive the linked nature of the data.

Pieter Heyvaert, Anastasia Dimou, Aron-Levi Herregodts, Ruben Verborgh, Dimitri Schuurman, Erik Mannens, Rik Van de Walle

### Enriching a Small Artwork Collection Through Semantic Linking

Mauro Dragoni, Elena Cabrio, Sara Tonelli, Serena Villata

### Ontology-Based Data Access for Maritime Security

The maritime security domain is challenged by a number of data analysis needs focusing on increasing maritime situational awareness, i.e., the detection and analysis of abnormal vessel behaviors and suspicious vessel movements. A need has emerged for efficient processing of dynamic and/or static vessel data that come from different heterogeneous sources. In this paper we describe how we address the challenge of combining and processing real-time and static data from different sources using ontology-based data access techniques, and we explain how the application of semantic web technologies increases the value of data and improves the processing workflow in the maritime domain.

Stefan Brüggemann, Konstantina Bereta, Guohui Xiao, Manolis Koubarakis

### WarSampo Data Service and Semantic Portal for Publishing Linked Open Data About the Second World War History

This paper presents the WarSampo system for publishing collections of heterogeneous, distributed data about the Second World War on the Semantic Web. WarSampo is based on harmonizing massive datasets using event-based modeling, which makes it possible to enrich datasets semantically with each other’s contents. WarSampo has two components. First, WarSampo Data is a Linked Open Data (LOD) service for Digital Humanities (DH) research and for creating applications related to war history. Second, the semantic WarSampo Portal has been created to test and demonstrate the usability of the data service. The WarSampo Portal allows both historians and laymen to study war history and the destinies of their family members in the war from different interlinked perspectives. Published in November 2015, the WarSampo Portal had some 20,000 distinct visitors during the first three days, showing that the public has a great interest in this kind of application.

Eero Hyvönen, Erkki Heino, Petri Leskinen, Esko Ikkala, Mikko Koho, Minna Tamper, Jouni Tuominen, Eetu Mäkelä

### Predicting Drug-Drug Interactions Through Large-Scale Similarity-Based Link Prediction

Drug-Drug Interactions (DDIs) are a major cause of preventable adverse drug reactions (ADRs), causing a significant burden on the patients’ health and the healthcare system. It is widely known that clinical studies cannot sufficiently and accurately identify DDIs for new drugs before they are made available on the market. In addition, existing public and proprietary sources of DDI information are known to be incomplete and/or inaccurate and so not reliable. As a result, there is an emerging body of research on in-silico prediction of drug-drug interactions. We present Tiresias, a framework that takes in various sources of drug-related data and knowledge as inputs, and provides DDI predictions as outputs. The process starts with semantic integration of the input data that results in a knowledge graph describing drug attributes and relationships with various related entities such as enzymes, chemical structures, and pathways. The knowledge graph is then used to compute several similarity measures between all the drugs in a scalable and distributed framework. The resulting similarity metrics are used to build features for a large-scale logistic regression model to predict potential DDIs. We highlight the novelty of our proposed approach and perform thorough evaluation of the quality of the predictions. The results show the effectiveness of Tiresias in both predicting new interactions among existing drugs and among newly developed and existing drugs.
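
The step from similarity measures to a logistic regression prediction can be sketched as follows. This is a simplified illustration under stated assumptions, not the Tiresias pipeline: the drug records are invented toy data, Jaccard similarity is one possible similarity measure chosen here for brevity, and the weights and bias are hand-set hypothetical values rather than trained parameters.

```python
import math

# Invented toy records standing in for entries of a drug knowledge graph.
DRUGS = {
    "drugA": {"enzymes": {"CYP3A4", "CYP2D6"}, "pathways": {"p1", "p2"}},
    "drugB": {"enzymes": {"CYP3A4"}, "pathways": {"p1"}},
    "drugC": {"enzymes": {"CYP1A2"}, "pathways": {"p9"}},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(d1, d2):
    """One similarity feature per attribute type for a candidate drug pair."""
    return [jaccard(DRUGS[d1][key], DRUGS[d2][key]) for key in ("enzymes", "pathways")]


# Hypothetical hand-set parameters; a real model would learn these from known DDIs.
WEIGHTS = [2.0, 1.5]
BIAS = -1.0


def predict(d1, d2):
    """Logistic regression score in (0, 1) for the candidate pair."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(d1, d2)))
    return 1.0 / (1.0 + math.exp(-z))


score_ab = predict("drugA", "drugB")  # pair with shared enzymes and pathways
score_ac = predict("drugA", "drugC")  # pair with nothing in common
```

With these toy values, the pair that shares enzymes and pathways receives a higher score than the pair that shares nothing, which is the behaviour a similarity-based feature set is meant to capture.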

### Semantics Driven Human-Machine Computation Framework for Linked Islamic Knowledge Engineering

Formalized knowledge engineering activities, including semantic annotation and linked data management tasks in specialized domains, suffer from a considerable knowledge acquisition bottleneck, owing to the limited availability of experts and the inefficacy of computational approaches. Human Computation & Crowdsourcing (HC&C) methods advocate leveraging human processing power to solve problems that are still difficult to solve computationally. Contextualized to the domain of Islamic Knowledge, my research investigates the synergistic interplay of these HC&C methods and the semantic web, and will seek to devise a semantics-driven human-machine computation framework for knowledge engineering in specialized and knowledge-intensive domains. The overall objective is to augment automated knowledge extraction and text mining methods using a hybrid approach that combines the collective intelligence of the crowd with that of experts to facilitate activities in formalized knowledge engineering, thus overcoming the so-called knowledge acquisition bottleneck.

Amna Basharat

### Towards Scalable Federated Context-Aware Stream Reasoning

With the rising interest in internet connected devices and sensor networks, better known as the Internet of Things, data streams are becoming ubiquitous. Integration and processing of these data streams is challenging. Semantic Web technologies are able to deal with the variety of data but are not able to deal with the velocity of the data. An emerging research domain, called stream reasoning, tries to bridge the gap between traditional stream processing and semantic reasoning. Research in the past years has resulted in several prototyped RDF Stream Processors, each of them with its own features and application domain. They all cover querying over RDF streams but lack support for complex reasoning. This paper presents how adaptive stream processing and context-awareness can be used to enhance semantic reasoning over streaming data. The result is a federated context-aware architecture that makes it possible to leverage reasoning capabilities on data streams produced by distributed sensor devices. The proposed solution is illustrated by use cases in pervasive health care and smart cities.

Alexander Dejonghe

### Machine-Crowd Annotation Workflow for Event Understanding Across Collections and Domains

People need context to process the massive amount of information online. Context is often expressed by a specific event taking place. The multitude of data streams used to mention events provide an inconceivable amount of information redundancy and perspectives. This poses challenges to both humans, i.e., to reduce the information overload and consume the meaningful information, and machines, i.e., to generate a concise overview of the events. For machines to generate such overviews, they need to be taught to understand events. The goal of this research project is to investigate whether combining machine output with crowd perspectives boosts the event understanding of state-of-the-art natural language processing tools and improves their event detection. To answer this question, we propose an end-to-end research methodology for machine processing, defining experimental data and setup, gathering event semantics, and evaluating results. We present preliminary results that indicate crowdsourcing as a reliable approach for (1) linking events and their related entities in cultural heritage collections and (2) identifying salient event features (i.e., relevant mentions and sentiments) for online data. We provide an evaluation plan for the overall research methodology of crowdsourcing event semantics across modalities and domains.

Oana Inel

### Distributed Context-Aware Applications by Means of Web of Things and Semantic Web Technologies

Ambient Assisted Living aims to provide context-aware and adaptive applications that assist elderly and impaired people in their everyday living environment. This requires the recognition of user intentions and activities by means of multi-modal and heterogeneous sensing devices. An unresolved problem is the lack of interoperability and extensibility of the setting. Moreover, to achieve adaptivity, a context-aware environment must consider user impairments as well as capabilities and must continuously monitor user activities. This complicates the on-the-fly integration of new sensing devices and applications. Furthermore, a flexible and expressive domain model for describing and processing user profiles, intentions and activities is required. Our approach to overcoming these integration and modeling problems is to use the Web of Things and Semantic Web technologies. Another unresolved problem concerns the security of the collected sensitive data. To prevent the manipulation of applications through unauthorized access, we introduce ontology-based security policies for context-aware applications, considering their managed context data.

Nicole Merkle

### On Learnability of Constraints from RDF Data

RDF is structured, dynamic, and schemaless data, which enables a great deal of flexibility for Linked Data to be available in an open environment such as the Web. However, for RDF data, flexibility turns out to be the source of many data quality and knowledge representation issues. Tasks such as assessing data quality in RDF require a different set of techniques and tools compared to other data models. Furthermore, since the use of existing schema, ontology and constraint languages is not mandatory, there is always room for misunderstanding the structure of the data. Neglecting this problem can represent a threat to the widespread use and adoption of RDF and Linked Data. Users should be able to learn the characteristics of RDF data in order to determine, for example, its fitness for a given use case. For that purpose, in this doctoral research, we propose the use of constraints to inform users about characteristics that RDF data naturally exhibits, in cases where ontologies (or any other form of explicitly given constraints or schemata) are not present or not expressive enough. We aim to address the problems of defining and discovering classes of constraints to help users in data analysis and assessment of RDF and Linked Data quality.

Emir Muñoz

### A Knowledge-Based Framework for Events Representation and Reuse from Historical Archives

Thanks to digitization techniques, historical archives have become a source of a considerable amount of biographical, factual and geographical data that needs to be structured in order to be usable in higher-level applications. In this paper we present the design of an ontology-based framework aimed at formally representing and extracting historical events from archives; this process should serve the purpose of (semi-)automatically building narratives that allow users to explore the archives themselves. This proposal refers to Ph.D. research at an early stage and is part of a wider project, Harlock’900, established between the Computer Science department of the University of Turin and the Istituto Gramsci, a cultural foundation promoting research in contemporary history.

Marco Rovera

### Unsupervised Conceptualization and Semantic Text Indexing for Information Extraction

The goal of my thesis is the extension of the Distributional Hypothesis [13] from the word to the concept level. This will be achieved by creating data-driven methods to create and apply conceptualizations, taxonomic semantic models that are grounded in the input corpus. Such conceptualizations can be used to disambiguate all words in the corpus, so that we can extract richer relations and create a dense graph of semantic relations between concepts. These relations will reduce sparsity issues, a common problem for contextualization techniques. By extending our conceptualization with named entities and multi-word entities (MWE), we can create a Linked Open Data knowledge base that is linked to existing knowledge bases like Freebase.

Eugen Ruppert

### Continuously Self-Updating Query Results over Dynamic Heterogeneous Linked Data

Our society is evolving towards massive data consumption from heterogeneous sources, which includes rapidly changing data like public transit delay information. Many applications that depend on dynamic data consumption require highly available server interfaces. Existing interfaces involve substantial costs to publish rapidly changing data with high availability, and are therefore only possible for organisations that can afford such an expensive infrastructure. In my doctoral research, I investigate how to publish and consume real-time and historical Linked Data on a large scale. To reduce server-side costs for making dynamic data publication affordable, I will examine different possibilities to divide query evaluation between servers and clients. This paper discusses the methods I aim to follow together with preliminary results and the steps required to use this solution. An initial prototype achieves significantly lower server processing cost per query, while maintaining reasonable query execution times and client costs. Given these promising results, I feel confident this research direction is a viable solution for offering low-cost dynamic Linked Data interfaces as opposed to the existing high-cost solutions.

Ruben Taelman

### Exploiting Disagreement Through Open-Ended Tasks for Capturing Interpretation Spaces

An important aspect of the semantic web is that systems have an understanding of the content and context of text, images, sounds and videos. Although research in these fields has progressed over the last years, there is still a semantic gap between the available multimedia data and the metadata annotated by humans to describe its content. This research investigates how the complete interpretation space of humans about the content and context of this data can be captured. The methodology consists of using open-ended crowdsourcing tasks that optimize the capturing of multiple interpretations, combined with disagreement-based metrics for evaluating the results. These descriptions can be used meaningfully to improve information retrieval and recommendation of multimedia, to train and evaluate machine learning components, and to train and assess experts.

Benjamin Timmermans

### A Semantic Approach for Process Annotation and Similarity Analysis

Research in the area of process modeling and analysis has a long-established tradition. There are quite a few formalisms for capturing processes, which are also accompanied by a number of optimization approaches. We introduce a novel approach, which employs semantics, for process annotation and analysis. In particular, we distinguish between target processes and current processes. Target process models describe how a process should ideally run and define a framework for current processes, which, in contrast, capture how processes actually run in real-life use cases. In some cases, current processes do not match the target process models and can even surpass them. Therefore, one is interested in the similarity between the defined target process model and current processes. The comparisons can consider different characteristics of processes such as service quality measures and dimensions. Current solutions apply process mining methods to discover hidden structures or try to infer knowledge about processes by using specific ontologies. To this end, we propose a novel method to capture and formalize processes, employing semantics and devising strategies and similarity measures that exploit the semantic representation to calculate similarities between target and current processes. As part of the similarity analysis, we consider different service qualities and dimensions in order to determine how they influence the target process models.

Tobias Weller

### Backmatter
