
2015 | Book

The Semantic Web. Latest Advances and New Domains

12th European Semantic Web Conference, ESWC 2015, Portorož, Slovenia, May 31 – June 4, 2015. Proceedings

Editors: Fabien Gandon, Marta Sabou, Harald Sack, Claudia d’Amato, Philippe Cudré-Mauroux, Antoine Zimmermann

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 12th European Semantic Web Conference, ESWC 2015, held in Portorož, Slovenia, in May/June 2015. The 43 revised full papers presented together with three invited talks were carefully reviewed and selected from 164 submissions. The program was completed by a demonstration and poster session, in which researchers had the chance to present their latest results and advances in the form of live demos. In addition, the PhD Symposium program included 12 contributions, selected out of 16 submissions. The core tracks of the research conference (Vocabularies, Schemas, Ontologies; Reasoning; Linked Data; Semantic Web and Web Science; Semantic Data Management, Big Data, Scalability; Natural Language Processing and Information Retrieval; Machine Learning; Mobile Web, Internet of Things and Semantic Streams; Services, Web APIs and the Web of Things; and the In-Use and Industrial Track) were complemented by two new tracks focusing on linking machine and human computation at web scale: Cognition and Semantic Web, and Human Computation and Crowdsourcing.

Table of Contents

Frontmatter

Vocabularies, Schemas, Ontologies

Frontmatter
Requirements for and Evaluation of User Support for Large-Scale Ontology Alignment

Currently, one of the challenges for the ontology alignment community is user involvement in the alignment process. At the same time, the focus of the community has shifted towards large-scale matching, which introduces an additional dimension to this issue. This paper aims to provide a set of requirements that foster user involvement in large-scale ontology alignment tasks. Further, we present and discuss the results of a literature study of 7 ontology alignment systems, as well as a heuristic evaluation and an observational user study of 3 ontology alignment systems, to reveal the coverage of the requirements in the systems and the support for the requirements in the user interfaces.

Valentina Ivanova, Patrick Lambrix, Johan Åberg
RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration

A major challenge in information management today is the integration of huge amounts of data distributed across multiple data sources. A suggested approach to this problem is ontology-based data integration, where legacy data systems are integrated via a common ontology that represents a unified global view over all data sources. However, data is often not natively born using these ontologies; instead, much data resides in legacy relational databases. Therefore, mappings that relate the legacy relational data sources to the ontology need to be constructed. Recently, techniques and systems that automatically construct such mappings have been developed. The quality metrics of these systems are, however, often based only on self-designed benchmarks. This paper introduces RODI, a new publicly available benchmarking suite designed to cover a wide range of mapping challenges in Relational-to-Ontology Data Integration scenarios. RODI provides a set of different relational data sources and ontologies (representing a wide range of mapping challenges) as well as a scoring function with which the performance of relational-to-ontology mapping construction systems may be evaluated.

Christoph Pinkel, Carsten Binnig, Ernesto Jiménez-Ruiz, Wolfgang May, Dominique Ritze, Martin G. Skjæveland, Alessandro Solimando, Evgeny Kharlamov
VocBench: A Web Application for Collaborative Development of Multilingual Thesauri

We introduce VocBench, an open source web application for editing thesauri complying with the SKOS and SKOS-XL standards. VocBench has a strong focus on collaboration, supported by workflow management for content validation and publication. Dedicated user roles provide a clean separation of competences, addressing different specificities ranging from management aspects to vertical competences on content editing, such as conceptualization versus terminology editing. Extensive support for scheme management allows editors to fully exploit the possibilities of the SKOS model, as well as to fulfill its integrity constraints. We discuss thoroughly the main features of VocBench, detail its architecture, and evaluate it under both a functional and user-appreciation ground, through a comparison with state-of-the-art and user questionnaires analysis, respectively. Finally, we provide insights on future developments.

Armando Stellato, Sachit Rajbhandari, Andrea Turbati, Manuel Fiorelli, Caterina Caracciolo, Tiziano Lorenzetti, Johannes Keizer, Maria Teresa Pazienza
Leveraging and Balancing Heterogeneous Sources of Evidence in Ontology Learning

Ontology learning (OL) aims at the (semi-)automatic acquisition of ontologies from sources of evidence, typically domain text. Recently, there has been a trend towards the application of multiple and heterogeneous evidence sources in OL. Heterogeneous sources provide benefits, such as higher accuracy by exploiting redundancy across evidence sources, and including complementary information. When using evidence sources which are heterogeneous in quality, amount of data provided and type, then a number of questions arise, for example: How many sources are needed to see significant benefits from heterogeneity, what is an appropriate number of evidences per source, is balancing the number of evidences per source important, and to what degree can the integration of multiple sources overcome low quality input of individual sources? This research presents an extensive evaluation based on an existing OL system. It gives answers and insights on the research questions posed for the OL task of concept detection, and provides further hints from experience made. Among other things, our results suggest that a moderate number of evidences per source as well as a moderate number of sources resulting in a few thousand data instances are sufficient to exploit the benefits of heterogeneous evidence integration.

Gerhard Wohlgenannt

Reasoning

Frontmatter
A Context-Based Semantics for SPARQL Property Paths Over the Web

As of today, there exists no standard language for querying Linked Data on the Web, where navigation across distributed data sources is a key feature. A natural candidate seems to be SPARQL, which has recently been enhanced with navigational capabilities thanks to the introduction of property paths (PPs). However, the semantics of SPARQL restricts the scope of navigation via PPs to single RDF graphs. This restriction limits the applicability of PPs on the Web. To fill this gap, in this paper we provide formal foundations for evaluating PPs on the Web, thus contributing to the definition of a query language for Linked Data. In particular, we introduce a query semantics for PPs that couples navigation at the data level with navigation on the Web graph. Given this semantics, we find that for some PP-based SPARQL queries a complete evaluation on the Web is not feasible. To enable systems to identify queries that can be evaluated completely, we establish a decidable syntactic property of such queries.
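
To make the property-path feature concrete, the following minimal sketch (our illustration, not the authors' Web-scoped semantics) evaluates a PP over a single RDF graph with the rdflib library; the example namespace and data are hypothetical.

```python
# Sketch: SPARQL 1.1 property paths over a *single* RDF graph, i.e. the
# setting whose restriction this paper lifts. Assumes the rdflib library.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace for the demo

g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.knows, EX.carol))

# 'ex:knows+' navigates one or more 'knows' edges: a property path.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE { ex:alice ex:knows+ ?person }
""")
for row in results:
    print(row.person)  # ex:bob, then ex:carol
```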

Olaf Hartig, Giuseppe Pirrò
Distributed and Scalable OWL EL Reasoning

OWL 2 EL is one of the tractable profiles of the Web Ontology Language (OWL), a W3C-recommended standard. OWL 2 EL provides sufficient expressivity to model large biomedical ontologies as well as streaming data such as traffic, while at the same time allowing for efficient reasoning services. Existing reasoners for OWL 2 EL, however, use only a single machine and are thus constrained by memory and computational power. At the same time, the automated generation of ontological information from streaming data and text can lead to very large ontologies which exceed the capacities of these reasoners. We thus describe a distributed reasoning system that scales well using a cluster of commodity machines. We also apply our system to a use case on city traffic data and show that it can handle volumes which cannot be handled by current single-machine reasoners.

Raghava Mutharaju, Pascal Hitzler, Prabhaker Mateti, Freddy Lécué
Large Scale Rule-Based Reasoning Using a Laptop

Although recent developments have shown that it is possible to reason over large RDF datasets with billions of triples in a scalable way, the reasoning process can still be a challenging task with respect to the growing amount of available semantic data. To date, reasoner implementations that are able to process large-scale datasets usually rely on a MapReduce-based implementation that runs on a cluster of computing nodes. In this paper we address this circumstance by identifying the resource-consuming parts of a reasoning process and providing a solution for a more efficient implementation in terms of memory consumption. As a basis we use a rule-based reasoner concept from our previous work. In detail, we introduce an approach for a memory-efficient implementation of the RETE algorithm. Furthermore, we introduce a compressed triple-index structure that can be used to identify duplicate triples and needs only a few bytes to represent a triple. Based on these concepts we show that it is possible to apply all RDFS rules to more than 1 billion triples on a single laptop, reaching a throughput that is comparable to, or even higher than, that of state-of-the-art MapReduce-based reasoners. Thus, we show that the resources needed for large-scale lightweight reasoning can be massively reduced.
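
As a rough illustration of the duplicate-detection idea (a sketch under our own assumptions, not the authors' actual index layout), each triple can be reduced to a few-byte fingerprint:

```python
# Sketch of a compact triple index for duplicate detection: each triple is
# represented by an 8-byte digest instead of its full string form.
# Illustrative only; the paper's actual structure may differ.
import hashlib

def triple_key(s: str, p: str, o: str) -> bytes:
    """Derive a fixed 8-byte fingerprint from a triple's terms."""
    h = hashlib.blake2b(digest_size=8)
    for term in (s, p, o):
        h.update(term.encode("utf-8"))
        h.update(b"\x00")  # separator to avoid ambiguous concatenations
    return h.digest()

seen = set()
def is_duplicate(s: str, p: str, o: str) -> bool:
    key = triple_key(s, p, o)
    if key in seen:
        return True   # (small false-positive risk from hash collisions)
    seen.add(key)
    return False
```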

Martin Peters, Sabine Sachweh, Albert Zündorf
RDF Digest: Efficient Summarization of RDF/S KBs

The exponential growth of the Web and the extended use of Semantic Web technologies have brought to the fore the need for quick understanding, flexible exploration and selection of complex web documents and schemas. In this direction, ontology summarization aspires to produce an abridged version of the original ontology that highlights its most representative concepts. In this paper, we present RDF Digest, a novel platform that automatically produces summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS document/graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph, our algorithm exploits the semantics and the structure of the schema and the distribution of the corresponding data/instances. A preliminary evaluation demonstrates the benefits of our approach and the considerable advantages gained.

Georgia Troullinou, Haridimos Kondylakis, Evangelia Daskalaki, Dimitris Plexousakis

Linked Data

Frontmatter
A Comparison of Data Structures to Manage URIs on the Web of Data

Uniform Resource Identifiers (URIs) are one of the cornerstones of the Web; they are also exceedingly important on the Web of data, since RDF graphs and Linked Data both heavily rely on URIs to uniquely identify and connect entities. Due to their hierarchical structure and their string serialization, sets of related URIs typically contain a high degree of redundant information and are systematically dictionary-compressed or encoded at the back-end (e.g., in the triple store). This paper presents, to the best of our knowledge, the first systematic comparison of the most common data structures used to encode URI data. We evaluate a series of data structures in terms of their read/write performance and memory consumption.
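
For orientation, the baseline against which such structures compete is a plain bidirectional dictionary that maps URIs to integer IDs; the sketch below is illustrative and not one of the implementations evaluated in the paper.

```python
# Sketch of dictionary encoding for URIs: triples are stored as integer
# tuples, with the dictionary translating in both directions.
class UriDictionary:
    def __init__(self):
        self._uri_to_id = {}
        self._id_to_uri = []

    def encode(self, uri: str) -> int:
        """Return the ID for a URI, assigning a fresh one on first sight."""
        if uri not in self._uri_to_id:
            self._uri_to_id[uri] = len(self._id_to_uri)
            self._id_to_uri.append(uri)
        return self._uri_to_id[uri]

    def decode(self, uri_id: int) -> str:
        return self._id_to_uri[uri_id]

d = UriDictionary()
triple = tuple(d.encode(t) for t in (
    "http://example.org/alice",
    "http://xmlns.com/foaf/0.1/knows",
    "http://example.org/bob",
))
print(triple, d.decode(triple[0]))
```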

Ruslan Mavlyutov, Marcin Wylot, Philippe Cudre-Mauroux
Heuristics for Fixing Common Errors in Deployed schema.org Microdata

Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata, extracted from more than 250 million web pages, for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.
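
To give a flavour of such consumer-side heuristics, here is a hedged sketch of two plausible repair rules (namespace normalization and property-name casing); the paper's actual rule set may differ.

```python
# Sketch of consumer-side Microdata repair heuristics. Both rules below are
# illustrative examples of common deployment errors, not the paper's rules.
import re

SCHEMA = "http://schema.org/"

def normalize_property(prop: str) -> str:
    """Map a possibly mistyped Microdata property URI to canonical form."""
    # Heuristic 1: repair frequent namespace mistakes (https, www, no slash).
    m = re.match(r"https?://(www\.)?schema\.org/*(?P<local>.*)", prop)
    if not m:
        return prop  # not a schema.org property; leave untouched
    local = m.group("local")
    # Heuristic 2: schema.org property names start lowercase ('name', not 'Name').
    if local and local[0].isupper():
        local = local[0].lower() + local[1:]
    return SCHEMA + local

print(normalize_property("https://www.schema.org/Name"))  # http://schema.org/name
```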

Robert Meusel, Heiko Paulheim

Semantic Web and Web Science

Frontmatter
Using @Twitter Conventions to Improve #LOD-Based Named Entity Disambiguation

State-of-the-art named entity disambiguation approaches tend to perform poorly on social media content, and microblogs in particular. Tweets are processed individually, and the richer, microblog-specific context is largely ignored. This paper focuses specifically on quantifying the impact on entity disambiguation performance when readily available contextual information is included from URL content, hashtag definitions, and Twitter user profiles. In particular, including URL content significantly improves performance. Similarly, user profile information for @mentions improves recall by over 10 % with no adverse impact on precision. We also share a new corpus of tweets, hand-annotated with DBpedia URIs with high inter-annotator agreement.

Genevieve Gorrell, Johann Petrak, Kalina Bontcheva
Knowledge Enabled Approach to Predict the Location of Twitter Users

Knowledge bases have been used to improve performance in applications ranging from web search and event detection to entity recognition and disambiguation. More recently, knowledge bases have been used to analyze social data. A key challenge in social data analysis is identifying the geographic location of online users in a social network such as Twitter. Existing approaches to predicting the location of users based on their tweets rely solely on social media features or probabilistic language models. These approaches are supervised and require a large training dataset of geo-tagged tweets to build their models. As most Twitter users are reluctant to publish their location, the collection of geo-tagged tweets is a time-intensive process. To address this issue, we present an alternative, knowledge-based approach to predicting a Twitter user's location at the city level. Our approach utilizes Wikipedia as a knowledge source by exploiting its hyperlink structure. Our experiments on a publicly available dataset demonstrate performance comparable to state-of-the-art techniques.

Revathy Krishnamurthy, Pavan Kapanipathi, Amit P. Sheth, Krishnaprasad Thirunarayan

Semantic Data Management, Big Data, Scalability

Frontmatter
A Compact In-Memory Dictionary for RDF Data

While almost all dictionary compression techniques focus on static RDF data, we present a compact in-memory RDF dictionary for dynamic and streaming data. To do so, we analysed the structure of terms in real-world datasets and observed a high degree of common prefixes. We studied the applicability of Trie data structures to RDF data to reduce the memory occupied by common prefixes, and discovered that all existing Trie implementations lead to either poor performance or excessive memory wastage.

In our approach, we address the existing limitations of Tries for RDF data and propose a new Trie variant with optimizations explicitly designed to improve performance on RDF data. Furthermore, we show how we use this Trie as an in-memory dictionary by using a memory address as the numerical ID instead of an integer counter. This design removes the need for an additional decoding data structure and further reduces the occupied memory. An empirical analysis on real-world datasets shows that, with a reasonable overhead, our technique uses 50–59 % less memory than a conventional uncompressed dictionary.
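
The following toy trie conveys the underlying idea (shared prefixes share paths, and a node object can serve as a stable term ID, analogous to the paper's use of memory addresses); it omits all of the paper's optimizations.

```python
# Minimal character-level prefix trie for URIs. Illustrative sketch only;
# the paper's optimized variant is substantially more memory-efficient.
class TrieNode:
    __slots__ = ("children", "terminal")
    def __init__(self):
        self.children = {}
        self.terminal = False

class UriTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, uri: str) -> TrieNode:
        """Insert a URI; the final node doubles as its stable ID."""
        node = self.root
        for ch in uri:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
        return node

trie = UriTrie()
a = trie.insert("http://example.org/alice")
b = trie.insert("http://example.org/bob")   # shares the long common prefix
print(a is trie.insert("http://example.org/alice"))  # True: stable ID
```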

Hamid R. Bazoobandi, Steven de Rooij, Jacopo Urbani, Annette ten Teije, Frank van Harmelen, Henri Bal
Quality Assessment of Linked Datasets Using Probabilistic Approximation

With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation for implementing a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated in the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance.
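
As an example of the probabilistic machinery involved, reservoir sampling keeps a fixed-size uniform sample of a triple stream, on which a metric can then be estimated; this sketch is generic and not Luzzu-specific.

```python
# Reservoir sampling (Algorithm R): a uniform random sample of k items from
# a stream of unknown length, usable to approximate dataset quality metrics.
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. estimate a metric (here: fraction of subjects ending in an even digit)
# from a 1000-triple sample of a million-triple stream.
triples = ((f"s{i}", "p", f"o{i}") for i in range(1_000_000))
sample = reservoir_sample(triples, k=1000)
print(sum(int(s[-1]) % 2 == 0 for s, _, _ in sample) / len(sample))
```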

Jeremy Debattista, Santiago Londoño, Christoph Lange, Sören Auer
Cooperative Techniques for SPARQL Query Relaxation in RDF Databases

This paper addresses the problem of failing RDF queries. Query relaxation is one of the cooperative techniques that provides users with alternative answers instead of an empty result. While previous works on query relaxation over RDF data have focused on defining new relaxation operators, we investigate in this paper techniques to find the parts of an RDF query that are responsible for its failure. Finding such subqueries, named Minimal Failing Subqueries (MFSs), is of great interest to efficiently perform the relaxation process. We propose two algorithmic approaches for computing MFSs. The first approach (LBA) intelligently leverages the subquery lattice of the initial RDF query, while the second approach (MBA) is based on a particular matrix that improves the performance of LBA. Our approaches also compute a particular kind of relaxed RDF queries, called Maximal Succeeding Subqueries (XSSs). XSSs are subqueries with a maximal number of triple patterns of the initial query. To validate our approaches, a set of thorough experiments is conducted on the LUBM benchmark and a comparative study with other approaches is done.
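
Conceptually, an MFS is a failing subquery all of whose proper subqueries succeed. The brute-force sketch below (ours; the paper's LBA and MBA avoid exploring the full lattice) makes that definition executable for small queries.

```python
# Naive MFS enumeration over the subquery lattice. A subquery (set of triple
# patterns) "fails" if it returns no results; it is an MFS if it fails while
# every proper subset succeeds. Illustrative only.
from itertools import combinations

def find_mfs(patterns, fails):
    """patterns: list of triple patterns; fails(frozenset) -> bool."""
    mfss = []
    for size in range(1, len(patterns) + 1):
        for subset in combinations(patterns, size):
            s = frozenset(subset)
            # Skip supersets of known MFSs: they fail, but are not minimal.
            if fails(s) and not any(m <= s for m in mfss):
                mfss.append(s)
    return mfss

# Toy failure oracle: any subquery containing pattern 'tp2' fails.
print(find_mfs(["tp1", "tp2", "tp3"], fails=lambda s: "tp2" in s))
# -> [frozenset({'tp2'})]
```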

Géraud Fokou, Stéphane Jean, Allel Hadjali, Mickael Baron
HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce

HDT is a binary RDF serialization aiming at minimizing the space overheads of traditional RDF formats, while providing retrieval features in compressed space. Several HDT-based applications, such as the recent Linked Data Fragments proposal, leverage these features for diverse publication, interchange and consumption purposes. However, scalability issues emerge in HDT construction because the whole RDF dataset must be processed in a memory-consuming task. This hinders the evolution of novel applications and techniques at Web scale. This paper introduces HDT-MR, a MapReduce-based technique to process huge RDF datasets and build the HDT serialization. HDT-MR performs in linear time with the dataset size and has proven able to serialize datasets of up to several billion triples, preserving HDT compression and retrieval features.

José M. Giménez-García, Javier D. Fernández, Miguel A. Martínez-Prieto
Processing Aggregate Queries in a Federation of SPARQL Endpoints

More and more RDF data is exposed on the Web via SPARQL endpoints. With the recent SPARQL 1.1 standard, these datasets can be queried in novel and more powerful ways, e.g., complex analysis tasks involving grouping and aggregation, and even data from multiple SPARQL endpoints, can now be formulated in a single query. This enables Business Intelligence applications that access data from federated web sources and can combine it with local data. However, as both aggregate and federated queries have become available only recently, state-of-the-art systems lack sophisticated optimization techniques that facilitate efficient execution of such queries over large datasets. To overcome these shortcomings, we propose a set of query processing strategies and the associated Cost-based Optimizer for Distributed Aggregate queries (CoDA) for executing aggregate SPARQL queries over federations of SPARQL endpoints. Our comprehensive experiments show that CoDA significantly improves performance over current state-of-the-art systems.
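
A query of the targeted class combines SPARQL 1.1 grouping and aggregation with the SERVICE keyword. The sketch below uses the SPARQLWrapper library; the endpoint URLs are examples only, and federation support and timeouts vary by endpoint.

```python
# Sketch of a federated aggregate SPARQL query: group/count over data pulled
# from a remote endpoint via SERVICE. Illustrative endpoints and query.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
SELECT ?type (COUNT(?s) AS ?n) WHERE {
  SERVICE <https://dbpedia.org/sparql> {   # remote member of the federation
    ?s a ?type .
  }
}
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 5
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")  # example only
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["type"]["value"], row["n"]["value"])
```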

Dilshod Ibragimov, Katja Hose, Torben Bach Pedersen, Esteban Zimányi
A Survey of HTTP Caching Implementations on the Open Semantic Web

Scalability of the data access architecture in the Semantic Web is dependent on the establishment of caching mechanisms to take the load off of servers. Unfortunately, there is a chicken and egg problem here: Research, implementation, and evaluation of caching infrastructure is uninteresting as long as data providers do not publish relevant metadata. And publishing metadata is useless as long as there is no infrastructure that uses it.

We show by means of a survey of live RDF data sources that caching metadata is already prevalent enough to be used in some cases. On the other hand, such metadata is not commonly used even on relatively static data, and where it is given, it is set very conservatively. We point out future directions and give recommendations for the enhanced use of caching in the Semantic Web.
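
The kind of probe such a survey relies on is simple: request an RDF document and inspect its HTTP caching headers. A minimal sketch with the requests library, using an example URL:

```python
# Probe the caching metadata of a published RDF document over HTTP.
import requests

resp = requests.head(
    "https://dbpedia.org/data/Berlin.ttl",  # example RDF document
    allow_redirects=True,
    timeout=10,
)
for header in ("Cache-Control", "ETag", "Last-Modified", "Expires"):
    print(f"{header}: {resp.headers.get(header, '(absent)')}")
```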

Kjetil Kjernsmo
Query Execution Optimization for Clients of Triple Pattern Fragments

In order to reduce the server-side cost of publishing queryable Linked Data, Triple Pattern Fragments (TPF) were introduced as a simple interface to RDF triples. They allow for SPARQL query execution at low server cost, by partially shifting the load from servers to clients. The previously proposed query execution algorithm uses more HTTP requests than necessary, and only makes partial use of the available metadata. In this paper, we propose a new query execution algorithm for a client communicating with a TPF server. In contrast to a greedy solution, we maintain an overview of the entire query to find the optimal steps for solving a given query. We show multiple cases in which our algorithm reaches solutions with far fewer HTTP requests, without significantly increasing the cost in other cases. This improves the efficiency of common SPARQL queries against TPF interfaces, augmenting their viability compared to the more powerful, but more costly, SPARQL interface.
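
For context, a TPF client retrieves one triple pattern at a time over plain HTTP and reads count metadata embedded in each fragment to plan its next request. A minimal sketch follows (example endpoint URL; metadata parsing omitted):

```python
# Fetch one Triple Pattern Fragment page; the subject/predicate/object
# query parameters follow the standard TPF interface.
import requests

def fetch_fragment(base, s=None, p=None, o=None):
    """Request a TPF page; the Turtle response embeds count metadata."""
    params = {k: v for k, v in
              (("subject", s), ("predicate", p), ("object", o)) if v}
    resp = requests.get(base, params=params,
                        headers={"Accept": "text/turtle"}, timeout=10)
    resp.raise_for_status()
    return resp.text

page = fetch_fragment(
    "https://fragments.dbpedia.org/2016-04/en",  # public TPF endpoint (example)
    p="http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
)
# The embedded void:triples / hydra estimate tells a smart client how
# expensive each pattern is, so it can fetch the cheapest one first.
print(page[:300])
```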

Joachim Van Herwegen, Ruben Verborgh, Erik Mannens, Rik Van de Walle

Natural Language Processing and Information Retrieval

Frontmatter
LIME: The Metadata Module for OntoLex

The OntoLex W3C Community Group has been working for more than three years on a shared lexicon model for ontologies, called lemon. The lemon model consists of a core model that is complemented by a number of modules accounting for specific aspects in the modeling of lexical information within ontologies. In many usage scenarios, the discovery and exploitation of linguistically grounded ontologies may benefit from summarizing information about their linguistic expressivity and lexical coverage by means of metadata. That situation is compounded by the fact that lemon allows the independent publication of ontologies, lexica and the lexicalizations linking them. While the VoID vocabulary already addresses the need for general metadata about interlinked datasets, it is unable by itself to represent the more specific metadata relevant to lemon. To solve this problem, we developed a module of lemon, named LIME (Linguistic Metadata), which extends VoID with a vocabulary of metadata about the ontology-lexicon interface.

Manuel Fiorelli, Armando Stellato, John P. McCrae, Philipp Cimiano, Maria Teresa Pazienza
Learning a Cross-Lingual Semantic Representation of Relations Expressed in Text

Learning cross-lingual semantic representations of relations from textual data is useful for tasks like cross-lingual information retrieval and question answering. So far, research has been mainly focused on cross-lingual entity linking, which is confined to linking between phrases in a text document and their corresponding entities in a knowledge base but cannot link to relations. In this paper, we present an approach for inducing clusters of semantically related relations expressed in text, where relation clusters (i) can be extracted from text of different languages, (ii) are embedded in a semantic representation of the context, and (iii) can be linked across languages to properties in a knowledge base. This is achieved by combining multi-lingual semantic role labeling (SRL) with cross-lingual entity linking followed by spectral clustering of the annotated SRL graphs. With our initial implementation we learned a cross-lingual lexicon of relation expressions from English and Spanish Wikipedia articles. To demonstrate its usefulness we apply it to cross-lingual question answering over linked data.

Achim Rettinger, Artem Schumilin, Steffen Thoma, Basil Ell
HAWK – Hybrid Question Answering Using Linked Data

The decentralized architecture behind the Web has led to pieces of information being distributed across data sources with varying structure. Hence, answering complex questions often requires combining information from structured and unstructured data sources. We present HAWK, a novel entity search approach for Hybrid Question Answering based on combining Linked Data and textual data. The approach uses predicate-argument representations of questions to derive equivalent combinations of SPARQL query fragments and text queries. These are executed so as to integrate the results of the text queries into SPARQL and thus generate a formal interpretation of the query. We present a thorough evaluation of the framework, including an analysis of the influence of entity annotation tools on the generation process of the hybrid queries and a study of the overall accuracy of the system. Our results show that HAWK achieves an F-measure of 0.68 on the training phase and 0.61 on the test phase of the Question Answering over Linked Data (QALD-4) hybrid query benchmark.

Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Christina Unger

Machine Learning

Frontmatter
Automating RDF Dataset Transformation and Enrichment

With the adoption of RDF across several domains come growing requirements pertaining to the completeness and quality of RDF datasets. Currently, this problem is most commonly addressed by manually devising means of enriching an input dataset. The few tools that aim at supporting this endeavour usually focus on supporting the manual definition of enrichment pipelines. In this paper, we present a supervised learning approach based on a refinement operator for enriching RDF datasets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against eight manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples.

Mohamed Ahmed Sherif, Axel-Cyrille Ngonga Ngomo, Jens Lehmann
Semi-supervised Instance Matching Using Boosted Classifiers

Instance matching concerns identifying pairs of instances that refer to the same underlying entity. Current state-of-the-art instance matchers use machine learning methods. Supervised learning systems achieve good performance by training on significant amounts of manually labeled samples. To alleviate the labeling effort, this paper presents a minimally supervised instance matching approach that is able to deliver competitive performance using only 2 % training data and little parameter tuning. As a first step, the classifier is trained in an ensemble setting using boosting. Iterative semi-supervised learning is then used to improve the performance of the boosted classifier even further, by re-training it on the most confident samples labeled in the current iteration. Empirical evaluations on a suite of six publicly available benchmarks show that the proposed system outcompetes optimization-based minimally supervised approaches in 1–7 iterations. The system's average F-measure is shown to be within 2.5 % of that of recent supervised systems that require more training samples for effective performance.
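
A generic sketch of this boosting-plus-self-training loop, using scikit-learn's AdaBoost as a stand-in (the paper's feature extraction for instance pairs is domain-specific and omitted here):

```python
# Boosting with iterative self-training: retrain on the unlabeled samples
# the current model labels most confidently. Illustrative, not the paper's
# exact configuration.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def self_train(X_seed, y_seed, X_unlabeled, rounds=5, top_k=50):
    X_lab, y_lab = X_seed.copy(), y_seed.copy()
    pool = X_unlabeled.copy()
    for _ in range(rounds):
        clf = AdaBoostClassifier(n_estimators=100, random_state=0)
        clf.fit(X_lab, y_lab)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        picked = np.argsort(-proba.max(axis=1))[:top_k]  # most confident
        X_lab = np.vstack([X_lab, pool[picked]])
        y_lab = np.concatenate([y_lab, proba[picked].argmax(axis=1)])
        pool = np.delete(pool, picked, axis=0)
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)); y = (X[:, 0] > 0).astype(int)
model = self_train(X[:10], y[:10], X[10:])  # ~2 % labeled seed data
print(model.score(X, y))
```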

Mayank Kejriwal, Daniel P. Miranker
Assigning Semantic Labels to Data Sources

There is a huge demand to be able to find and integrate heterogeneous data sources, which requires mapping the attributes of a source to the concepts and relationships defined in a domain ontology. In this paper, we present a new approach to find these mappings, which we call semantic labeling. Previous approaches map each data value individually, typically by learning a model based on features extracted from the data using supervised machine-learning techniques. Our approach differs from existing approaches in that we take a holistic view of the data values corresponding to a semantic label and use techniques that treat this data collectively, which makes it possible to capture characteristic properties of the values associated with a semantic label as a whole. Our approach supports both textual and numeric data and proposes the top k semantic labels along with their associated confidence scores. Our experiments show that the approach has higher label prediction accuracy, has lower time complexity, and is more scalable than existing systems.

S.K. Ramnandan, Amol Mittal, Craig A. Knoblock, Pedro Szekely
Inductive Classification Through Evidence-Based Models and Their Ensembles

In the context of the Semantic Web, one of the most important issues related to the class-membership prediction task (through inductive models) on ontological knowledge bases concerns the imbalance of the training examples distribution, mostly due to the heterogeneous nature and the incompleteness of the knowledge bases. An ensemble learning approach has been proposed to cope with this problem. However, the majority voting procedure exploited for deciding the membership does not explicitly consider the uncertainty and the conflict among the classifiers of an ensemble model. Moving from this observation, we propose to integrate the Dempster-Shafer (DS) theory with ensemble learning. Specifically, we propose an algorithm for learning Evidential Terminological Random Forest models, an extension of Terminological Random Forests with the DS theory. An empirical evaluation showed that: (i) the resulting models perform better for datasets with a lot of positive and negative examples, and have a less conservative behavior than the voting-based forests; (ii) the new extension decreases the variance of the results.

Giuseppe Rizzo, Claudia d’Amato, Nicola Fanizzi, Floriana Esposito

Mobile Web, Internet of Things and Semantic Streams

Frontmatter
Standardized and Efficient RDF Encoding for Constrained Embedded Networks

In the context of the Web of Things (WoT), embedded networks face the challenge of becoming ever more complex, as the number of interacting heterogeneous devices and different hardware resource classes keeps increasing. When it comes to the development and use of embedded networks in the WoT domain, Semantic Web technologies are seen as one way to tackle this complexity. For example, properties and capabilities of embedded devices may be semantically described in order to enable an effective search over different classes of devices, semantic data integration may be deployed to integrate data produced by these devices, or embedded devices may be empowered to reason about semantic data in the context of WoT applications. Despite these possibilities, a wide adoption of Semantic Web or Linked Data technologies in the domain of embedded networks has not been established yet. One reason for this is the inefficient representation of semantic data: serialization formats for RDF data, such as plain-text XML, are not suitable for embedded devices. In this paper, we present an approach that enables constrained devices, such as microcontrollers with very limited hardware resources, to store and process semantic data. Our approach is based on the W3C Efficient XML Interchange (EXI) format. To show the applicability of the approach, we provide an EXI-based µRDF Store and show associated evaluation results.

Sebastian Käbisch, Daniel Peintner, Darko Anicic

Services, Web APIs, and the Web of Things

Frontmatter
SPSC: Efficient Composition of Semantic Services in Unstructured P2P Networks

The problem of automated semantic peer-to-peer (P2P) service composition has been addressed in cross-disciplinary research of semantic web and P2P computing. Solutions for semantic web service composition in structured P2P networks benefit from the underlying distributed global index but at the cost of network traffic overhead for its maintenance. Current solutions to service composition in unstructured P2P networks with selective flooding can be more robust against changes but suffer from redundant messaging, lack of efficient semantics-empowered search heuristics and proven soundness. In this paper, we present a novel approach, called SPSC, for efficient semantic service composition planning in unstructured P2P networks. SPSC peers conduct a guarded heuristics-based composition to jointly plan complex workflows of semantic services in OWL-S. The semantic service query branching method based on local observations by peers about the semantic overlay alleviates the problem of reaching dead-ends in the not fully observable and heuristically pruned search space. We theoretically prove that the SPSC approach is sound and provide a lower bound of its completeness. Finally, our experimental evaluation shows that SPSC achieves high cumulative recall with relatively low traffic overhead.

Xiaoqi Cao, Patrick Kapahnke, Matthias Klusch
Linked Data-as-a-Service: The Semantic Web Redeployed

Ad-hoc querying is crucial to access information from Linked Data, yet publishing queryable RDF datasets on the Web is not a trivial exercise. The most compelling argument to support this claim is that the Web contains hundreds of thousands of data documents, while only 260 queryable SPARQL endpoints are provided. Even worse, the SPARQL endpoints we do have are often unstable, may not comply with the standards, and may differ in supported features. In other words, hosting data online is easy, but publishing Linked Data via a queryable API such as SPARQL appears to be too difficult. As a consequence, in practice, there is no single uniform way to query the LOD Cloud today. In this paper, we therefore combine a large-scale Linked Data publication project (LOD Laundromat) with a low-cost server-side interface (Triple Pattern Fragments), in order to bridge the gap between the Web of downloadable data documents and the Web of live queryable data. The result is a repeatable, low-cost, open-source data publication process. To demonstrate its applicability, we made over 650,000 data documents available as data APIs, consisting of 30 billion triples.

Laurens Rietveld, Ruben Verborgh, Wouter Beek, Miel Vander Sande, Stefan Schlobach

Cognition and Semantic Web

Frontmatter
Gagg: A Graph Aggregation Operator

Graph aggregation is an important operation when studying graphs and has been applied in many fields. The heterogeneity, fine granularity and semantic richness of RDF graphs introduce unique requirements when aggregating the data. In this work, we propose Gagg, an RDF graph aggregation operator that is both expressive and flexible. We provide a formal definition of Gagg on top of the SPARQL Algebra, define its operational semantics and describe an algorithm to answer graph aggregation queries. Our evaluation results show significant improvements in performance compared to plain-SPARQL graph aggregation.

Fadi Maali, Stéphane Campinas, Stefan Decker
FrameBase: Representing N-Ary Relations Using Semantic Frames

Large-scale knowledge graphs such as those in the Linked Data cloud are typically represented as subject-predicate-object triples. However, many facts about the world involve more than two entities. While n-ary relations can be converted to triples in a number of ways, unfortunately, the structurally different choices made in different knowledge sources significantly impede our ability to connect them. They also make it impossible to query the data concisely and without prior knowledge of each individual source. We present FrameBase, a wide-coverage knowledge-base schema that uses linguistic frames to seamlessly represent and query n-ary relations from other knowledge bases, at different levels of granularity connected by logical entailment. It also opens possibilities to draw on natural language processing techniques for querying and data mining.

Jacobo Rouces, Gerard de Melo, Katja Hose

Human Computation and Crowdsourcing

Frontmatter
Towards Hybrid NER: A Study of Content and Crowdsourcing-Related Performance Factors

This paper explores the factors that influence the human component in hybrid approaches to named entity recognition (NER) in microblogs, which combine state-of-the-art automatic techniques with human and crowd computing. We identify a set of content and crowdsourcing-related features (number of entities in a post, types of entities, skipped true-positive posts, average time spent to complete the tasks, and interaction with the user interface) and analyse their impact on the accuracy of the results and the timeliness of their delivery. Using CrowdFlower and a simple, custom-built gamified NER tool, we run experiments on three datasets from the related literature and a fourth newly annotated corpus. Our findings show that crowd workers are adept at recognizing people, locations, and implicitly identified entities within shorter microposts. We expect these findings to inform the design of more advanced NER pipelines, in particular the way in which tweets are chosen to be outsourced or processed by automatic tools. Experimental results are published as JSON-LD for further use by the research community.

Oluwaseyi Feyisetan, Markus Luczak-Roesch, Elena Simperl, Ramine Tinati, Nigel Shadbolt
Ranking Entities in the Age of Two Webs, an Application to Semantic Snippets

The advances of the Linked Open Data (LOD) initiative are giving rise to a more structured Web of data. Indeed, a few datasets act as hubs (e.g., DBpedia) connecting many other datasets. They have also made possible new Web services for entity detection inside plain text (e.g., DBpedia Spotlight), thus allowing for new applications that can benefit from a combination of the Web of documents and the Web of data. To ease the emergence of these new applications, we propose a query-biased algorithm (LDRANK) for ranking Web of data resources with associated textual data. Our algorithm combines link analysis with dimensionality reduction. We use crowdsourcing to build a publicly available and reusable dataset for the evaluation of query-biased ranking of Web of data resources detected in Web pages. We show that, on this dataset, LDRANK outperforms the state of the art. Finally, we use this algorithm for the construction of semantic snippets, whose usefulness we evaluate with a crowdsourcing-based approach.

Mazen Alsarem, Pierre-Edouard Portier, Sylvie Calabretto, Harald Kosch

In-Use and Industrial Track

Frontmatter
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry

Named Entity Resolution (NER) is an information extraction task that involves detecting mentions of named entities within texts and mapping them to their corresponding entities in a given knowledge resource. Systems and frameworks for performing NER have been developed by both academia and industry, with different features and capabilities. Nevertheless, what all approaches have in common is that their satisfactory performance in a given scenario does not constitute a trustworthy predictor of their performance in a different one, the reason being the scenarios' different characteristics (target entities, input texts, domain knowledge, etc.). With that in mind, in this paper we describe a metric-based Diagnostic Framework that can be used to identify the causes behind the low performance of NER systems in industrial settings and to take appropriate actions to increase it.

Panos Alexopoulos, Ronald Denaux, Jose Manuel Gomez-Perez
Using Ontologies for Modeling Virtual Reality Scenarios

Serious games with 3D interfaces are Virtual Reality (VR) systems that are becoming common for the training of military and emergency teams. A platform for the development of serious games should allow the addition of semantics to the virtual environment and the modularization of the artificial intelligence controlling the behaviors of non-playing characters, in order to support a productive end-user development environment. In this paper, we report on the ontology design activity performed in the context of the PRESTO project, which aims to realize a conceptual model that abstracts away from the graphical and geometrical properties of the entities in the virtual reality, as well as from the behavioral models associated with the non-playing characters. The feasibility of the proposed solution has been validated through real-world examples and discussed with the actors using the modeled ontologies in everyday practical activities.

Mauro Dragoni, Chiara Ghidini, Paolo Busetta, Mauro Fruet, Matteo Pedrotti
Supporting Open Collaboration in Science Through Explicit and Linked Semantic Description of Processes

The Web was originally developed to support collaboration in science. Although scientists benefit from many forms of collaboration on the Web (e.g., blogs, wikis, forums, code sharing, etc.), most collaborative projects are coordinated over email, phone calls, and in-person meetings. Our goal is to develop a collaborative infrastructure for scientists to work on complex science questions that require multi-disciplinary contributions to gather and analyze data, that cannot occur without significant coordination to synthesize findings, and that grow organically to accommodate new contributors as needed as the work evolves over time. Our approach is to develop an organic data science framework that is based on a task-centered organization of the collaboration, includes principles from the social sciences for successful online communities, and exposes an open science process. Our approach is implemented as an extension of a semantic wiki platform, and captures formal representations of task decomposition structures, relations between tasks and users, and other properties of tasks, data, and other relevant science objects. All these entities are captured through the semantic wiki user interface, represented as Semantic Web objects, and exported as Linked Data.

Yolanda Gil, Felix Michel, Varun Ratnakar, Jordan Read, Matheus Hauder, Christopher Duffy, Paul Hanson, Hilary Dugan
Crowdmapping Digital Social Innovation with Linked Data

The European Commission recently became interested in mapping digital social innovation in Europe. In order to understand this rapidly developing but little-known area, a visual and interactive survey was created to crowd-source a map of digital social innovation, available at http://digitalsocial.eu. Over 900 organizations participated, and Linked Data was used as the backend, with a number of valuable advantages. The data was processed using SPARQL and network analysis, and a number of concrete policy recommendations resulted from the analysis.

Harry Halpin, Francesca Bria
Desperately Searching for Travel Offers? Formulate Better Queries with Some Help from Linked Data

Various studies have reported on the inefficiencies of existing travel search engines, and on user frustration generated through hours of searching and browsing, often with no satisfactory results. Not only do users fail to find the right offer in the myriad of websites, they also end up browsing through many offers that do not correspond to their criteria. The Semantic Web framework is a reasonable candidate to improve this. In this paper, we present a semantic travel offer search system named "RE-ONE (Relevance Engine-One)". We especially highlight its ability to help users formulate better search queries. An example of a permitted query is "in Croatia at the seaside where there is Vegetarian Restaurant". We conducted two experiments to evaluate the query auto-completion mechanism. The results showed that our system outperforms the Google Custom Search baseline. Queries freely conducted in RE-ONE are shown to be 63.4 % longer in terms of number of words and 27 % richer in terms of number of search criteria. RE-ONE better supports users' query formulation process by giving suggestions in greater accordance with users' flow of ideas.

Chun Lu, Milan Stankovic, Philippe Laublet
Towards the Russian Linked Culture Cloud: Data Enrichment and Publishing

In this paper we present an architecture and approach for publishing Linked Open Data in the cultural heritage domain. We demonstrate our approach to building a system for both data publishing and consumption, and show how user benefits can be achieved with semantic technologies. For domain knowledge representation, the CIDOC-CRM ontology is used. As the main source of trusted data, we use the data of the web portal of the Russian Museum. For data enrichment, we selected DBpedia and the published Linked Data of the British Museum. The evaluation shows the potential of semantic applications for data publishing in a contextual environment, semantic search, visualization and automated enrichment, according to the needs and expectations of art experts and regular museum visitors.

Dmitry Mouromtsev, Peter Haase, Eugene Cherny, Dmitry Pavlov, Alexey Andreev, Anna Spiridonova
From Symptoms to Diseases – Creating the Missing Link

A wealth of biomedical datasets is by now published as Linked Open Data. Each of these datasets has a particular focus, such as providing information on diseases or symptoms of a certain kind. Hence, a comprehensive view can only be provided by integrating information from various datasets. Although links between diseases and symptoms can be found, these links are far too sparse to enable practical applications such as disease-centric access to clinical reports that are annotated with symptom information. For this purpose, we build a model of disease-symptom relations. Utilizing existing ontology mappings, we propagate semantic type information for disease and symptom across ontologies. Then entities of the same semantic type from different ontologies are clustered, and object properties between entities are mapped to cluster-level relations. The effectiveness of our approach is demonstrated by integrating all available disease-symptom relations from different biomedical ontologies, resulting in significantly increased linkage between datasets.

Heiner Oberkampf, Turan Gojayev, Sonja Zillner, Dietlind Zühlke, Sören Auer, Matthias Hammon
Using Semantic Web Technologies for Enterprise Architecture Analysis

Enterprise Architecture (EA) models are established means for decision makers in organizations. They describe the business processes, the application landscape and the IT infrastructure, as well as the relationships between those layers. Current research focuses mainly on frameworks, modeling and documentation approaches for EA; once these models are established, however, methods for their analysis are rare. In this paper we propose the use of Semantic Web technologies to represent the EA and perform analyses. We present an approach for transforming an existing EA model into an ontology. Using this knowledge base, simple questions can be answered with the query language SPARQL. The major benefits of Semantic Web technologies emerge when defining and applying more complex analyses. Change impact analysis is important to estimate the effects and costs of a change to an EA model element. To show the benefits of Semantic Web technologies for EA, we implemented an approach to change impact analysis and executed it within a case study.
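
Once the EA model is represented as RDF, change impact analysis reduces to a transitive-closure query. A minimal sketch with rdflib, where the ea:dependsOn vocabulary is hypothetical:

```python
# Change impact as a transitive-closure SPARQL query over an EA ontology.
# The ea:dependsOn property and element names are illustrative assumptions.
from rdflib import Graph, Namespace

EA = Namespace("http://example.org/ea#")
g = Graph()
g.add((EA.CRMApp, EA.dependsOn, EA.AppServer))
g.add((EA.AppServer, EA.dependsOn, EA.Database))

# Everything transitively affected by a change to the Database element:
impacted = g.query("""
    PREFIX ea: <http://example.org/ea#>
    SELECT ?element WHERE { ?element ea:dependsOn+ ea:Database }
""")
for row in impacted:
    print(row.element)  # ea:AppServer, then ea:CRMApp
```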

Maximilian Osenberg, Melanie Langermeier, Bernhard Bauer
PADTUN - Using Semantic Technologies in Tunnel Diagnosis and Maintenance Domain

A Decision Support System (DSS) in the tunnelling domain deals with identifying pathologies based on disorders present in various tunnel portions and on contextual factors affecting a tunnel. Another key step in diagnosing pathologies is to identify regions of interest (ROI). In practice, tunnel experts intuitively abstract regions of interest by selecting tunnel portions that are susceptible to the same types of pathologies with some distance approximation. This complex diagnosis process is often subjective and scales poorly across cases and transport structures. In this paper, we introduce the PADTUN system, a working prototype of a DSS in the tunnelling domain using semantic technologies. Ontologies are developed and used to capture tacit knowledge from tunnel experts. Tunnel inspection data are annotated with ontologies to take advantage of the inferencing capabilities offered by semantic technologies. In addition, an intelligent mechanism is developed to exploit abstraction and inference capabilities to identify ROI. PADTUN was developed in the real-world setting offered by the NeTTUN EU Project and applied in a tunnel diagnosis use case with Société Nationale des Chemins de Fer Français (SNCF), France. We show how the use of semantic technologies allows addressing the complex issues of pathology and ROI inferencing and matching experts' expectations of decision support.

Dhavalkumar Thakker, Vania Dimitrova, Anthony G. Cohn, Joaquin Valdes

PhD Symposium

Frontmatter
Crowdsourcing Disagreement for Collecting Semantic Annotation

This paper proposes an approach to gathering semantic annotation, which rejects the notion that human interpretation can have a single ground truth, and is instead based on the observation that disagreement between annotators can signal ambiguity in the input text, as well as how the annotation task has been designed. The purpose of this research is to investigate whether disagreement-aware crowdsourcing is a scalable approach to gather semantic annotation across various tasks and domains. We propose a methodology for answering this question that involves, for each task and domain: defining the crowdsourcing setup, experimental data collection, and evaluating both the setup and the results. We present initial results for the task of medical relation extraction, and propose an evaluation plan for crowdsourcing semantic annotation for several tasks and domains.

Anca Dumitrache
Ontology Change in Ontology-Based Information Integration Systems

Ontology change is an important part of the Semantic Web field that helps researchers and practitioners to deal with changes performed in ontologies. Ontology change is especially important in Ontology-Based Information Integration (OBII) systems, where several ontologies are interrelated and therefore, changes raise various complexities and implications, such as modifications of ontology mappings and change propagation. Current approaches to ontology change mainly focus on a single ontology and therefore do not properly address the constraints specific to OBII systems. To address the challenge of ontology change in OBII contexts, we plan to adapt successful techniques proposed both by Semantic Web and Model-Driven Engineering communities. We discuss the research goals, methods, and evaluation options to address this challenge. Real-world case studies are used for the development and evaluation of the proposed methods.

Fajar Juang Ekaputra
Creating Learning Material from Web Resources

We observed that learners use general Web resources as learning material. In order to overcome problems such as distraction and abandonment of a given learning task, we want to integrate these Web resources into Web-based learning systems and make them available as learning material within the learning context. We present an approach to generating learning material from Web resources that extracts a semantic fingerprint for these resources, obtains educational objectives, and publishes the learning material as Linked Data.

Katrin Krieger
The Design and Implementation of Semantic Web-Based Architecture for Augmented Reality Browser

Due to the proliferation of smartphones, Augmented Reality applications have become widespread nowadays. Augmented Reality browsers in particular have enjoyed wide popularity among these applications: they extend the physical environment with location-aware additional information. At present, however, current Augmented Reality browsers typically rely on a single, specific data source, even though an enormous number of data sources is available. The Semantic Web could help to bridge this gap. The goal of this work is to combine Augmented Reality and Semantic Web technologies in order to enhance existing mobile Augmented Reality browsers. For this purpose, we utilize the advantages of Semantic Web technologies such as data integration, a unified data model, and publicly available semantic data sources, among other things.

Tamás Matuszka
Information Extraction for Learning Expressive Ontologies

Ontologies are used to represent knowledge in a formal and unambiguous way, facilitating its reuse and sharing among people and computer systems. A large amount of knowledge is traditionally available in unstructured text sources and manually encoding their content into a formal representation is costly and time-consuming. Several methods have been proposed to support ontology engineers in the ontology building process, but they mostly turned out to be inadequate for building rich and expressive ontologies. We propose some concrete research directions for designing an effective methodology for semi-supervised ontology learning. This methodology will integrate a new axiom extraction technique which exploits several features of the text corpus.

Giulio Petrucci
A Scalable Adaptive Method for Complex Reasoning Over Semantic Data Streams

Data streams are the infinite sequences of data elements being generated by companies, social networks, mobile phones, smart homes, public transport vehicles and other modern infrastructures. Current stream processing solutions can handle streams of data to timely produce new results, but they lack the complex reasoning capacities that are required to go from data to actionable knowledge. Conversely, engines that can perform such complex reasoning tasks are mostly designed to work on static data. The main aim of my research proposal is to provide a solution to perform complex reasoning on dynamic semantic information in a scalable way. At its core, this requires a solution which combines the advantages of both the stream processing and reasoning research areas, and has flexible heuristics for adapting the stream reasoning processes in order to enhance scalability.

Thu-Le Pham
Sequential Decision Making with Medical Interpretation Algorithms in the Semantic Web

Supporting physicians in their daily work with state-of-the-art technology is an important ongoing undertaking. If a radiologist wants to see the tumour region of a head scan of a new patient, a system needs to build a workflow of several interpretation algorithms, all processing the image in one way or another. If many such interpretation algorithms are available, the system needs to select viable candidates, choose the optimal interpretation algorithms for the current patient, and finally execute them correctly on the right data. We work towards developing such a system by using RDF and OWL to annotate interpretation algorithms and data, executing interpretation algorithms on a data-driven and declarative basis, and integrating so-called meta components. These let us flexibly decide which interpretation algorithms to execute in order to optimally solve the current task.

Patrick Philipp
Towards Linked Open Data Enabled Data Mining
Strategies for Feature Generation, Propositionalization, Selection, and Consolidation

Background knowledge from Linked Open Data sources can be used to improve the results of a data mining problem at hand: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is a tedious manual work. In this paper we propose a set of desiderata, and identify the challenges for developing a framework for unsupervised generation of data mining features from Linked Data.

Petar Ristoski
Semantic Support for Recording Laboratory Experimental Metadata: A Study in Food Chemistry

A fundamental principle of scientific enquiry is to create proper documentation of data and methods during experimental research [1, 5].

Dena Tahvildari
Exploiting Semantics from Ontologies to Enhance Accuracy of Similarity Measures

Precisely determining semantic similarity between entities has become a building block for data mining tasks, and existing approaches tackle this problem mainly by considering ontology-based annotations to decide relatedness. Nevertheless, because semantic similarity measures usually rely on the ontology class hierarchy and blindly treat ontology facts, they may erroneously assign high similarity values to dissimilar entities. We propose ColorSim, a similarity measure that considers the semantics of OWL2 annotations, e.g., relationship types, as well as implicit facts and the processes by which they are inferred, to accurately compute the relatedness of two ontology-annotated entities. We compare ColorSim with state-of-the-art approaches and report on preliminary experimental results that suggest the benefits of exploiting knowledge encoded in the ontologies to measure similarity.

Ignacio Traverso-Ribón
e-Document Standards as Background Knowledge in Context-Based Ontology Matching

Ontology matching is the process of finding correspondences between heterogeneous ontologies, thereby supporting semantic interoperability between different information systems. Using contextual information relative to the ontologies being matched is referred to as context-based ontology matching and is considered a promising direction for improving matching performance. This PhD investigates how such contextual information, often residing in disparate sources and represented in different formats, can be optimally represented to ontology matching systems, and how these systems can best employ this context to produce accurate and correct correspondences. Currently we are investigating how the international e-Document standard Universal Business Language from the transport logistics domain can provide useful context when matching domain ontologies for this particular domain. Early evaluation tests and analysis of the results suggest that the current version of the Universal Business Language ontology does not impact the matching results and that further reconfiguration and enhancement are needed.

Audun Vennesland
Semantics-Enabled User Interest Mining

Microblogging services such as Twitter allow users to express their feelings and views in real time through microposts. This provides a wealth of information, both collectively and individually, that can be effectively mined so as to facilitate personalization, recommendation and customized search. A fundamental task in this respect is to extract users' interests. This has mainly been done using probabilistic models that rely on measures such as the frequency of co-occurrence of important phrases, which forgoes the underlying semantics of the phrases in favor of highlighting the role of syntactic repetition of content. Some recent works have considered the role of semantics by using knowledge bases such as DBpedia and Freebase. However, they limit the topics of interest to a set of individual concepts extracted from the microposts in isolation, i.e., without considering the relationships of the microposts to each other or to other users. This proposal seeks to build further on these works by introducing a definition of topical interest which enables the identification of more specific and semantically complex topics involving multiple interrelated concepts. Based on this definition, methods will be introduced for the detection of both explicitly observed and implicitly implied user interests, in addition to the identification of user interest shifts based on temporal clues.

Fattane Zarrinkalam
Backmatter
Metadata

Title: The Semantic Web. Latest Advances and New Domains
Editors: Fabien Gandon, Marta Sabou, Harald Sack, Claudia d’Amato, Philippe Cudré-Mauroux, Antoine Zimmermann
Copyright Year: 2015
Electronic ISBN: 978-3-319-18818-8
Print ISBN: 978-3-319-18817-1
DOI: https://doi.org/10.1007/978-3-319-18818-8