
2018 | Book

The Semantic Web

15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings

Edited by: Dr. Aldo Gangemi, Prof. Roberto Navigli, Maria-Esther Vidal, Prof. Dr. Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, Mehwish Alam

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 15th International Conference on the Semantic Web, ESWC 2018, held in Heraklion, Crete, Greece, in June 2018.

The 48 revised full papers presented were carefully reviewed and selected from 179 submissions. The papers cover a wide range of topics such as logical modelling and reasoning, natural language processing, databases, data storage and access, machine learning, distributed systems, information retrieval and data mining, social networks, and Web science and Web engineering.

Table of Contents

Frontmatter
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud

Vocabularies are used for modeling data in Knowledge Graphs (KGs) like the Linked Open Data Cloud and Wikidata. During their lifetime, vocabularies are subject to changes: new terms are coined, while existing terms are modified or deprecated. We first quantify the amount and frequency of changes in vocabularies. Subsequently, we investigate to what extent and when these changes are adopted in the evolution of KGs. We conduct our experiments on three large-scale KGs: the Billion Triples Challenge datasets, the Dynamic Linked Data Observatory dataset, and Wikidata. Our results show that the change frequency of terms is rather low but can have a high impact due to the large amount of distributed graph data on the web. Furthermore, not all coined terms are used, and most of the deprecated terms are still used by data publishers. The adoption time of terms from different vocabularies ranges from very fast (a few days) to very slow (a few years). Surprisingly, we observed some adoptions before the vocabulary changes were even published. Understanding the evolution of vocabulary terms is important to avoid wrong assumptions about the modeling status of data published on the web, which may result in difficulties when querying the data from distributed sources.

Mohammad Abdel-Qader, Ansgar Scherp, Iacopo Vagliano
GSP (Geo-Semantic-Parsing): Geoparsing and Geotagging with Machine Learning on Top of Linked Data

Recently, user-generated content in social media has opened up new alluring possibilities for understanding the geospatial aspects of many real-world phenomena. Yet, the vast majority of such content lacks explicit, structured geographic information. Here, we describe the design and implementation of a novel approach for associating geographic information with text documents. GSP exploits powerful machine learning algorithms on top of the rich, interconnected Linked Data in order to overcome limitations of previous state-of-the-art approaches. In detail, our technique performs semantic annotation to identify relevant tokens in the input document, traverses a sub-graph of Linked Data to extract possible geographic information related to the identified tokens, and optimizes its results by means of a Support Vector Machine classifier. We compare our results with those of 4 state-of-the-art techniques and baselines on ground-truth data from 2 evaluation datasets. Our GSP technique achieves excellent performance, with a best F1 = 0.91, considerably outperforming benchmark techniques that achieve F1 ≤ 0.78.

Marco Avvenuti, Stefano Cresci, Leonardo Nizzoli, Maurizio Tesconi
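As a rough illustration of the final step described in the abstract above (re-ranking candidate geotags with a Support Vector Machine), the following sketch trains an SVM over hand-made candidate features; the feature names and values are hypothetical and not the authors' actual feature set.

```python
# Hypothetical sketch: classifying candidate geographic annotations with an SVM,
# in the spirit of GSP's final optimization step (features are illustrative).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each candidate = (annotation confidence, graph distance in Linked Data,
# log of the place's population); label = 1 if the candidate is a correct geotag.
X_train = [
    [0.92, 1, 6.1],
    [0.40, 3, 2.3],
    [0.75, 2, 5.0],
    [0.15, 4, 1.2],
]
y_train = [1, 0, 1, 0]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

candidate = [[0.80, 2, 4.7]]
print(clf.predict(candidate), clf.decision_function(candidate))
```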
LDP-DL: A Language to Define the Design of Linked Data Platforms

Linked Data Platform 1.0 (LDP) is the W3C Recommendation for exposing linked data in a RESTful manner. While several implementations of the LDP standard exist, deploying an LDP from existing data sources still involves much manual development, because these implementations offer no support for automating the generation of LDPs. To this end, we propose an approach whose core is a language for specifying how existing data sources should be used to generate LDPs in a way that is independent of, compatible with, and deployable on any LDP implementation. We formally describe the syntax and semantics of the language and its implementation. We show that our approach (1) allows the reuse of the same design for multiple deployments, (2) allows the reuse of the same data with different designs, (3) is open to heterogeneous data sources, (4) can cope with hosting constraints, and (5) significantly automates the deployment of LDPs.

Noorani Bakerally, Antoine Zimmermann, Olivier Boissier
Empirical Analysis of Ranking Models for an Adaptable Dataset Search

Currently available datasets still have a large unexplored potential for interlinking. Ranking techniques contribute to this task by scoring datasets according to the likelihood of finding entities related to those of a target dataset. Ranked datasets can be either manually selected for standalone link discovery tasks or automatically inspected by programs that go through the ranking looking for entity links. This work presents empirical comparisons between different ranking models and argues that different algorithms could be used depending on whether the ranking is handled manually or automatically and, also, on the available metadata of the datasets. Experiments indicate that the ranking algorithms that performed best with respect to nDCG do not always have the best recall at position k, for high recall levels. The best ranking model for the manual use case (with respect to nDCG) may need 13% more datasets to reach 90% recall: instead of the slice of 34% of the datasets at the top of the ranking required by the best model for the automatic use case (with respect to recall@k), it would need almost 47% of the ranking.

Angelo B. Neves, Rodrigo G. G. de Oliveira, Luiz André P. Paes Leme, Giseli Rabello Lopes, Bernardo P. Nunes, Marco A. Casanova
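The two metrics contrasted in the abstract above, nDCG and recall at position k, can be computed as in the following minimal sketch (relevance labels and cut-off are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the first k positions of a ranking.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (perfectly sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(relevances, k):
    # Fraction of all relevant items that appear in the top k.
    total_relevant = sum(1 for r in relevances if r > 0)
    hits = sum(1 for r in relevances[:k] if r > 0)
    return hits / total_relevant if total_relevant else 0.0

# Relevance of the datasets in the order they appear in a ranking (1 = relevant).
ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(ndcg_at_k(ranking, 5), recall_at_k(ranking, 5))
```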
sameAs.cc: The Closure of 500M owl:sameAs Statements

The owl:sameAs predicate is an essential ingredient of the Semantic Web architecture. It allows parties to independently mint names, while at the same time ensuring that these parties are able to understand each other's data. An online resource that collects all owl:sameAs statements on the Linked Open Data Cloud therefore has both practical impact (it helps data users and providers find different names for the same entity) and analytical value (it reveals important aspects of the connectivity of the LOD Cloud). This paper presents sameAs.cc: the largest dataset of identity statements that has been gathered from the LOD Cloud to date. We describe an efficient approach for calculating and storing the full equivalence closure over this dataset. The dataset is published online, as well as a web service from which the data and its equivalence closure can be queried.

Wouter Beek, Joe Raad, Jan Wielemaker, Frank van Harmelen
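A toy illustration of the kind of equivalence closure described above can be computed with union-find over owl:sameAs pairs; sameAs.cc relies on a far more scalable implementation, so this sketch only conveys the underlying idea (the IRIs are made up).

```python
from collections import defaultdict

parent = {}

def find(x):
    # Find the representative of x's identity set, with path compression.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

same_as = [
    ("ex:Amsterdam", "dbr:Amsterdam"),
    ("dbr:Amsterdam", "wd:Q727"),
    ("ex:Berlin", "dbr:Berlin"),
]
for s, o in same_as:
    union(s, o)

# Group IRIs into identity sets (the equivalence classes of the closure).
classes = defaultdict(set)
for iri in parent:
    classes[find(iri)].add(iri)
print(list(classes.values()))
```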
Modeling and Preserving Greek Government Decisions Using Semantic Web Technologies and Permissionless Blockchains

We present a re-engineering of Diavgeia, the Greek government portal for open and transparent public administration. We study how decisions of Greek government institutions can be modeled using ontologies expressed in OWL and queried using SPARQL. We also discuss how to use the Bitcoin blockchain to ensure that government decisions remain immutable. We provide an open-source implementation, called DiavgeiaRedefined, that generates and visualizes the decisions inside a web browser, offers a SPARQL endpoint for retrieving and querying these decisions, and provides citizens with an automated tool for verifying correctness and detecting possible foul play by an adversary. We conclude with experimental results illustrating that our scheme is efficient and feasible.

Themis Beris, Manolis Koubarakis
Towards a Binary Object Notation for RDF

The recent JSON-LD standard, which specifies an object notation for RDF, has been adopted by a number of data providers on the Web. In this paper, we present a novel usage of JSON-LD as a compact format to exchange and query RDF data in constrained environments, in the context of the Web of Things. A typical exchange between Web of Things agents involves small pieces of semantically described data (RDF datasets of fewer than one hundred triples). In this context, we show how JSON-LD, serialized in binary JSON formats like EXI4JSON and CBOR, outperforms the state of the art. Our experiments were performed on datasets provided by the literature, as well as a production dataset exported from Siemens Desigo CC. We also provide a formalism for JSON-LD and show how it offers a lightweight alternative to SPARQL via JSON-LD framing.

Victor Charpenay, Sebastian Käbisch, Harald Kosch
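The space savings the abstract refers to can be illustrated by encoding a small, hypothetical JSON-LD document with a binary JSON format such as CBOR, assuming the cbor2 Python package; the document below is illustrative, not from the paper's datasets.

```python
import json
import cbor2

doc = {
    "@context": {"@vocab": "https://example.org/vocab#"},
    "@id": "urn:dev:temperature-sensor-1",
    "@type": "TemperatureSensor",
    "value": 21.5,
    "unit": "Cel",
}

json_bytes = json.dumps(doc).encode("utf-8")
cbor_bytes = cbor2.dumps(doc)

print(len(json_bytes), "bytes as plain JSON")
print(len(cbor_bytes), "bytes as CBOR")
assert doc == cbor2.loads(cbor_bytes)  # lossless round trip
```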
User-Centric Ontology Population

Ontologies are a basic tool to formalize and share knowledge. However, very often the conceptualization of a specific domain depends on the particular user’s needs. We propose a methodology to perform user-centric ontology population that efficiently includes human-in-the-loop at each step. Given the existence of suitable target ontologies, our methodology supports the alignment of concepts in the user’s conceptualization with concepts of the target ontologies, using a novel hierarchical classification approach. Our methodology also helps the user to build, alter and grow their initial conceptualization, exploiting both the target ontologies and new facts extracted from unstructured data. We evaluate our approach on a real-world example in the healthcare domain, in which adverse phrases for drug reactions, as extracted from user blogs, are aligned with MedDRA concepts. The evaluation shows that our approach has high efficacy in assisting the user to both build the initial ontology (HITS@10 up to 99.5%) and to maintain it (HITS@10 up to 99.1%).

Kenneth Clarkson, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Joseph Terdiman, Steve Welch
Using Ontology-Based Data Summarization to Develop Semantics-Aware Recommender Systems

In the current information-centric era, recommender systems are gaining momentum as tools able to assist users in daily decision-making tasks. They may exploit users’ past behavior combined with side/contextual information to suggest new items or pieces of knowledge users might be interested in. Within the recommendation process, Linked Data have already been proposed as a valuable source of information to enhance the predictive power of recommender systems, not only in terms of accuracy but also of diversity and novelty of results. In this direction, one of the main open issues in using Linked Data to feed a recommendation engine is feature selection: how to select only the most relevant subset of the original Linked Data, thus avoiding both useless processing of data and the so-called “curse of dimensionality” problem. In this paper, we show how ontology-based (linked) data summarization can drive the selection of properties/features useful to a recommender system. In particular, we compare a fully automated feature selection method based on ontology-based data summaries with more classical ones, and we evaluate the performance of these methods in terms of accuracy and aggregate diversity of a recommender system exploiting the top-k selected features. We set up an experimental testbed relying on datasets related to different knowledge domains. Results show the feasibility of a feature selection process driven by ontology-based data summaries for Linked Data-enabled recommender systems.

Tommaso Di Noia, Corrado Magarelli, Andrea Maurino, Matteo Palmonari, Anisa Rula
PageRank and Generic Entity Summarization for RDF Knowledge Bases

Ranking and entity summarization are operations that are tightly connected and recur in many different domains. Possible application fields include information retrieval, question answering, named entity disambiguation, co-reference resolution, and natural language generation. Still, the use of these techniques is limited because there are few accessible resources: PageRank computations are resource-intensive, and entity summarization is a complex research field in itself. We present two generic and highly reusable resources for RDF knowledge bases: a component for PageRank-based ranking and a component for entity summarization. The two components, namely PageRankRDF and summaServer, are provided in the form of open-source code along with example datasets and deployments. In addition, this work outlines the application of the components for PageRank-based RDF ranking and entity summarization in the question answering project WDAqua.

Dennis Diefenbach, Andreas Thalhammer
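A rough, non-scalable sketch of PageRank over the IRI-to-IRI link structure of an RDF graph, using rdflib and networkx; the dedicated PageRankRDF component is the resource actually provided by the authors, and the tiny Turtle snippet here is purely illustrative.

```python
import networkx as nx
from rdflib import Graph, URIRef

ttl = """
@prefix ex: <https://example.org/> .
ex:Alice ex:knows ex:Bob .
ex:Bob   ex:knows ex:Carol .
ex:Carol ex:knows ex:Alice .
ex:Dave  ex:knows ex:Alice .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

# Build a directed IRI-to-IRI graph (literals are ignored) and run PageRank.
nxg = nx.DiGraph()
for s, p, o in g:
    if isinstance(s, URIRef) and isinstance(o, URIRef):
        nxg.add_edge(str(s), str(o))

for iri, score in sorted(nx.pagerank(nxg, alpha=0.85).items(), key=lambda kv: -kv[1]):
    print(f"{score:.4f}  {iri}")
```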
Answering Multiple-Choice Questions in Geographical Gaokao with a Concept Graph

Answering questions in Gaokao (the national college entrance examination in China) poses a great challenge for recent AI systems, where the difficulty of questions and the lack of formal knowledge are two main obstacles, among others. In this paper, we focus on answering multiple-choice questions in geographical Gaokao. Specifically, a concept graph is automatically constructed from textbook tables and the Chinese wiki encyclopedia to capture core concepts and relations in geography. Based on this concept graph, a graph-search-based question answering approach is designed to find explainable inference paths between questions and options. We developed an online system called CGQA and conducted experiments on two real datasets created from the last ten years of geographical Gaokao. Our experimental results demonstrate that CGQA can generate accurate judgments and provide explainable solving procedures. Additionally, CGQA showed promising improvement when combined with existing approaches.

Jiwei Ding, Yuan Wang, Wei Hu, Linfeng Shi, Yuzhong Qu
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets

Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology or the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large amounts of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan’13–Nov’17). Metadata information about the tweets as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. Next to a description of the extraction and annotation process, we present use cases to illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.

Pavlos Fafalios, Vasileios Iosifidis, Eirini Ntoutsi, Stefan Dietze
HDTQ: Managing RDF Datasets in Compressed Space

HDT (Header-Dictionary-Triples) is a compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple, and traditional HDT is not capable of handling this additional information. This work introduces HDTQ (HDT Quads), an extension of HDT that is able to represent quadruples (or quads) while still being highly compact and queryable. Two HDTQ-based approaches are introduced, Annotated Triples and Annotated Graphs, and their performance is compared to the leading open-source RDF stores on the market. Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.

Javier D. Fernández, Miguel A. Martínez-Prieto, Axel Polleres, Julian Reindorf
Answers Partitioning and Lazy Joins for Efficient Query Relaxation and Application to Similarity Search

Query relaxation has been studied as a way to find approximate answers when user queries are too specific or do not align well with the data schema. We are here interested in the application of query relaxation to similarity search of RDF nodes based on their description. However, this is challenging because existing approaches have a complexity that grows in a combinatorial way with the size of the query and the number of relaxation steps. We introduce two algorithms, answers partitioning and lazy join, that together significantly improve the efficiency of query relaxation. Our experiments show that our approach scales much better with the size of queries and the number of relaxation steps, to the point where it becomes possible to relax large node descriptions in order to find similar nodes. Moreover, the relaxed descriptions provide explanations for their semantic similarity.

Sébastien Ferré
Evaluation of Schema.org for Aggregation of Cultural Heritage Metadata

On the World Wide Web, a very large number of resources are made available through digital libraries. The existence of many individual digital libraries, maintained by different organizations, brings challenges to the discoverability, sharing and reuse of the resources. A widely used approach is metadata aggregation, where centralized efforts like Europeana facilitate the discoverability and use of the resources by collecting their associated metadata. The cultural heritage domain embraced the aggregation approach while, at the same time, the technological landscape kept evolving. Nowadays, cultural heritage institutions are increasingly applying technologies designed for wider interoperability on the Web. In this context, we have identified the Schema.org vocabulary as a potential technology for innovating metadata aggregation. We conducted two case studies that analysed Schema.org metadata from collections of cultural heritage institutions. We used the requirements of the Europeana Network as evaluation criteria. These include the recommendations of the Europeana Data Model, which is a collaborative effort from all the domains represented in Europeana: libraries, museums, archives, and galleries. We concluded that Schema.org poses no obstacle that cannot be overcome to allow data providers to deliver metadata in full compliance with Europeana requirements and with the desired semantic quality. However, Schema.org’s cross-domain applicability raises the need to accompany its adoption with recommendations and/or specifications regarding how data providers should create their Schema.org metadata, so that they can meet the specific requirements of Europeana or other cultural aggregation networks.

Nuno Freire, Valentine Charles, Antoine Isaac
Dynamic Planning for Link Discovery

With the growth of the number and the size of RDF datasets comes an increasing need for scalable solutions to support the linking of resources. Most Link Discovery frameworks rely on complex link specifications for this purpose. We address the scalability of the execution of link specifications by presenting the first dynamic planning approach for Link Discovery dubbed Condor. In contrast to the state of the art, Condor can re-evaluate and reshape execution plans for link specifications during their execution. Thus, it achieves significantly better runtimes than existing planning solutions while retaining an F-measure of 100%. We quantify our improvement by evaluating our approach on 7 datasets and 700 link specifications. Our results suggest that Condor is up to 2 orders of magnitude faster than the state of the art and requires less than 0.1% of the total runtime of a given specification to generate the corresponding plan.

Kleanthi Georgala, Daniel Obraczka, Axel-Cyrille Ngonga Ngomo
A Dataset for Web-Scale Knowledge Base Population

For many domains, structured knowledge is in short supply, while unstructured text is plentiful. Knowledge Base Population (KBP) is the task of building or extending a knowledge base from text, and systems for KBP have grown in capability and scope. However, existing datasets for KBP are all limited by multiple issues: small in size, not open or accessible, only capable of benchmarking a fraction of the KBP process, or only suitable for extracting knowledge from title-oriented documents (documents that describe a particular entity, such as Wikipedia pages). We introduce and release CC-DBP, a web-scale dataset for training and benchmarking KBP systems. The dataset is based on Common Crawl as the corpus and DBpedia as the target knowledge base. Critically, by releasing the tools to build the dataset, we enable the dataset to remain current as new crawls and DBpedia dumps are released. Also, the modularity of the released tool set resolves a crucial tension between the ease that a dataset can be used for a particular subtask in KBP and the number of different subtasks it can be used to train or benchmark.

Michael Glass, Alfio Gliozzo
EventKG: A Multilingual Event-Centric Temporal Knowledge Graph

One of the key requirements to facilitate semantic analytics of information regarding contemporary and historical events on the Web, in the news and in social media is the availability of reference knowledge repositories containing comprehensive representations of events and temporal relations. Existing knowledge graphs, with popular examples including DBpedia, YAGO and Wikidata, focus mostly on entity-centric information and are insufficient in terms of their coverage and completeness with respect to events and temporal relations. EventKG presented in this paper is a multilingual event-centric temporal knowledge graph that addresses this gap. EventKG incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations extracted from several large-scale knowledge graphs and semi-structured sources and makes them available through a canonical representation.

Simon Gottschalk, Elena Demidova
Semantic Concept Discovery over Event Databases

In this paper, we study the problem of identifying certain types of concepts (e.g., persons, organizations, topics) for a given analysis question, with the goal of assisting a human analyst in writing a deep analysis report. We consider a case where we have a large event database describing events and their associated news articles, along with metadata describing various event attributes such as the people and organizations involved and the topic of the event. We describe the use of semantic technologies in question understanding and deep analysis of the event database, and show a detailed evaluation of our proposed concept discovery techniques using reports from the Human Rights Watch organization and other sources. Our study finds that combining our neural-network-based semantic term embeddings over structured data with an index-based method can significantly outperform either method alone.

Oktie Hassanzadeh, Shari Trewin, Alfio Gliozzo
Smart Papers: Dynamic Publications on the Blockchain

Distributed Ledgers (DLs), also known as blockchains, provide decentralised, tamper-free registries of transactions among partners that distrust each other. For the scientific community, DLs have been proposed to decentralise and make more transparent each step of the scientific workflow. For the particular case of dissemination and peer-reviewing, DLs can provide the cornerstone to realise open decentralised publishing systems where social interactions between peers are tamper-free, enabling trustworthy computation of bibliometrics. In this paper, we propose the use of DL-backed smart contracts to track a subset of social interactions for scholarly publications in a decentralised and reliable way, yielding Smart Papers. We show how our Smart Papers approach complements current models for decentralised publishing, and analyse cost implications.

Michał R. Hoffman, Luis-Daniel Ibáñez, Huw Fryer, Elena Simperl
Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of utmost social and cultural importance to focus efforts on languages whose speakers only have access to limited Wikipedia content. We investigate supporting communities by generating summaries for Wikipedia articles in underserved languages, given structured data as input. We focus on an important support for such summaries: ArticlePlaceholders, dynamically generated content pages in underserved Wikipedias that enable native speakers to access existing information in Wikidata. To extend those ArticlePlaceholders, we provide a system which processes the triples of the KB as they are provided by the ArticlePlaceholder and generates a comprehensible textual summary. This data-driven approach is employed with the goal of understanding how well it matches the communities’ needs on two underserved languages on the Web: Arabic, a language with a big community but disproportionate access to knowledge online, and Esperanto, an easily acquired artificial language whose Wikipedia content is maintained by a small but devoted community. With the help of the Arabic and Esperanto Wikipedians, we conduct a study which evaluates not only the quality of the generated text, but also the usefulness of our end-to-end system for any underserved Wikipedia version.

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl
What Does an Ontology Engineering Community Look Like? A Systematic Analysis of the schema.org Community

We present a systematic analysis of participation and interactions within the community behind schema.org, one of the largest and most relevant ontology engineering projects in recent times. Previous work in this space has focused on ontology collaboration tools and the roles that different contributors play within these projects. This paper takes a broader view and looks at the entire life cycle of the collaborative process to gain insights into how new functionality is proposed and accepted, and how contributors engage with one another, based on real-world data. The analysis resulted in several findings. First, the collaborative ontology engineering roles identified in previous studies, which were much more closely tied to ontology editors, apply to community interaction contexts as well. At the same time, participation inequality is less pronounced than the 90-9-1 rule for Internet communities. In addition, schema.org seems to facilitate a form of collaboration that is friendly towards newcomers, whose concerns receive as much attention from the community as those of their longer-serving peers.

Samantha Kanza, Alex Stolz, Martin Hepp, Elena Simperl
FERASAT: A Serendipity-Fostering Faceted Browser for Linked Data

Accidental knowledge discoveries occur most frequently during capricious and unplanned search and browsing of data. This type of undirected, random, and exploratory search and browsing of data results in Serendipity – the art of unsought finding. In our previous work we extracted a set of serendipity-fostering design features for developing intelligent user interfaces on Semantic Web and Linked Data browsing environments. The features facilitate the discovery of interesting and valuable facts in (linked) data which were not initially sought for. In this work, we present an implementation of those features called FERASAT. FERASAT provides an adaptive multigraph-based faceted browsing interface to catalyze serendipity while browsing Linked Data. FERASAT is already in use within the domain of science, technology & innovation (STI) studies to allow researchers who are not familiar with Linked Data technologies to explore heterogeneous interlinked datasets in order to observe and interpret surprising facts from the data relevant to policy and innovation studies. In addition to an analysis of the related work, we describe two STI use cases in the paper and demonstrate how different serendipity design features are addressed in those use cases.

Ali Khalili, Peter van den Besselaar, Klaas Andries de Graaf
Classifying Crises-Information Relevancy with Semantics

Social media platforms have become key portals for sharing and consuming information during crisis situations. However, humanitarian organisations and affected communities often struggle to sift through the large volumes of data that are typically shared on such platforms during crises to determine which posts are truly relevant to the crisis and which are not. Previous work on automatically classifying crisis information has mostly focused on statistical features. However, such approaches tend to be inappropriate when processing data on a type of crisis that the model was not trained on, such as processing information about a train crash when the classifier was trained on floods, earthquakes, and typhoons. In such cases, the model needs to be retrained, which is costly and time-consuming. In this paper, we explore the impact of semantics in classifying Twitter posts across the same, and different, types of crises. We experiment with 26 crisis events, using a hybrid system that combines statistical features with various semantic features extracted from external knowledge bases. We show that adding semantic features has no noticeable benefit over statistical features when classifying same-type crises, whereas it enhances the classifier performance by up to 7.2% when classifying information about a new type of crisis.

Prashant Khare, Grégoire Burel, Harith Alani
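The hybrid combination of statistical and semantic features mentioned above can be sketched, for illustration only, as a scikit-learn pipeline in which TF-IDF features over the tweet text are concatenated with features over linked entity labels; the columns and example data are hypothetical, not the authors' pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    "text": ["train derailed near the station", "great concert last night"],
    "entities": ["Derailment RailTransport Accident", "Concert Music"],  # KB labels
    "relevant": [1, 0],
})

# Statistical features from the raw text, semantic features from entity labels.
features = ColumnTransformer([
    ("statistical", TfidfVectorizer(), "text"),
    ("semantic", TfidfVectorizer(), "entities"),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(data[["text", "entities"]], data["relevant"])

new_post = pd.DataFrame({"text": ["flooding reported downtown"],
                         "entities": ["Flood NaturalDisaster"]})
print(model.predict(new_post))
```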
Efficient Temporal Reasoning on Streams of Events with DOTR

Many ICT applications need to make sense of large volumes of streaming data to detect situations of interest and enable timely reactions. Stream Reasoning (SR) aims to combine the performance of stream/event processing and the reasoning expressiveness of knowledge representation systems by adopting Semantic Web standards to encode streaming elements. We argue that the mainstream SR model is not flexible enough to properly express the temporal relations common in many applications. We show that the model can miss relevant information and lead to inconsistent derivations. Moving from these premises, we introduce a novel SR model that provides expressive ontological and temporal reasoning by neatly decoupling their scope to avoid losses and inconsistencies. We implement the model in the DOTR system that defines ontological reasoning using Datalog rules and temporal reasoning using a Complex Event Processing language that builds on metric temporal logic. We demonstrate the expressiveness of our model through examples and benchmarks, and we show that DOTR outperforms state-of-the-art SR tools, processing data with millisecond latency.

Alessandro Margara, Gianpaolo Cugola, Dario Collavini, Daniele Dell’Aglio
Intelligent Clients for Replicated Triple Pattern Fragments

Following the Triple Pattern Fragments (TPF) approach, intelligent clients are able to improve the availability of Linked Data. However, data availability is still limited by the availability of TPF servers. Although some existing TPF servers belonging to different organizations already replicate the same datasets, existing intelligent clients are not able to take advantage of replicated data to provide fault tolerance and load balancing. In this paper, we propose Ulysses, an intelligent TPF client that takes advantage of replicated datasets to provide fault tolerance and load balancing. By reducing the load on a server, Ulysses improves the overall Linked Data availability and reduces data hosting costs for organizations. Ulysses relies on an adaptive client-side load balancer and a cost model to distribute the load among heterogeneous replicated TPF servers. Experiments demonstrate that Ulysses reduces the load on TPF servers, tolerates failures and improves query execution time in case of heavy load on servers.

Thomas Minier, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal
Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects

Images on the Web encapsulate diverse knowledge about varied abstract concepts. They cannot be sufficiently described with models learned from image-caption pairs that mention only a small number of visual object categories. In contrast, large-scale knowledge graphs contain many more concepts that can be detected by image recognition models. Hence, to assist description generation for those images which contain visual objects unseen in image-caption pairs, we propose a two-step process that leverages large-scale knowledge graphs. In the first step, a multi-entity recognition model is built to annotate images with concepts not mentioned in any caption. In the second step, those annotations are leveraged as external semantic attention and constrained inference in the image description generation model. Evaluations show that our models outperform most of the prior work on out-of-domain MSCOCO image description generation and also scale better to broad domains with more unseen objects.

Aditya Mogadala, Umanga Bista, Lexing Xie, Achim Rettinger
Benchmarking of a Novel POS Tagging Based Semantic Similarity Approach for Job Description Similarity Computation

Most solutions providing hiring analytics involve mapping provided job descriptions to a standard job framework, thereby requiring computation of a document similarity score between two job descriptions. Finding semantic similarity between a pair of documents is a problem that is yet to be solved satisfactorily over all possible domains/contexts. Most document similarity calculation exercises require a large corpus of data for training the underlying models. In this paper we compare three methods of document similarity for job descriptions: topic modeling (LDA), doc2vec, and a novel part-of-speech-tagging-based document similarity (POSDC) calculation method. LDA and doc2vec require a large corpus of data to train, while POSDC exploits a domain-specific property of descriptive documents (such as job descriptions) that enables us to compare two documents in isolation. The POSDC method is based on an action-object-attribute representation of documents that allows meaningful comparisons. We use Stanford CoreNLP and NLTK WordNet to do a multilevel semantic match between the actions and corresponding objects, sklearn for topic modeling and gensim for doc2vec. We compare the results from these three methods based on the IBM Kenexa Talent Frameworks job taxonomy.

Joydeep Mondal, Sarthak Ahuja, Kushal Mukherjee, Sudhanshu Shekhar Singh, Gyana Parija
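A toy example of the WordNet-based matching the POSDC method relies on, comparing the action and the object of two job-description statements with NLTK path similarity; the words and thresholds are illustrative, not taken from the paper.

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def best_similarity(word_a, word_b, pos):
    # Highest path similarity over all synset pairs of the two words.
    scores = [
        s1.path_similarity(s2)
        for s1 in wn.synsets(word_a, pos=pos)
        for s2 in wn.synsets(word_b, pos=pos)
        if s1.path_similarity(s2) is not None
    ]
    return max(scores, default=0.0)

# "manage a team" vs. "supervise a group": compare actions and objects separately.
action_score = best_similarity("manage", "supervise", wn.VERB)
object_score = best_similarity("team", "group", wn.NOUN)
print(action_score, object_score)
```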
A Tri-Partite Neural Document Language Model for Semantic Information Retrieval

Previous work in information retrieval has shown that using evidence, such as concepts and relations, from external knowledge sources can enhance retrieval performance. Recently, deep neural approaches have emerged as state-of-the-art models for capturing word semantics. This paper presents a new tri-partite neural document language framework that leverages explicit knowledge to jointly constrain word, concept, and document learning representations to tackle a number of issues including polysemy and granularity mismatch. We show the effectiveness of the framework in various IR tasks.

Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf
Multiple Models for Recommending Temporal Aspects of Entities

Entity aspect recommendation is an emerging task in semantic search that helps users discover serendipitous and prominent information with respect to an entity, of which salience (e.g., popularity) is the most important factor in previous work. However, entity aspects are temporally dynamic and often driven by events happening over time. For such cases, aspect suggestion based solely on salience features can give unsatisfactory results, for two reasons. First, salience is often accumulated over a long time period and does not account for recency. Second, many aspects related to an event entity are strongly time-dependent. In this paper, we study the task of temporal aspect recommendation for a given entity, which aims at recommending the most relevant aspects and takes into account time in order to improve search experience. We propose a novel event-centric ensemble ranking method that learns from multiple time and type-dependent models and dynamically trades off salience and recency characteristics. Through extensive experiments on real-world query logs, we demonstrate that our method is robust and achieves better effectiveness than competitive baselines.

Tu Ngoc Nguyen, Nattiya Kanhabua, Wolfgang Nejdl
GDPRtEXT - GDPR as a Linked Data Resource

The General Data Protection Regulation (GDPR) is the new European data protection law whose compliance affects organisations in several aspects related to the use of consent and personal data. With emerging research and innovation in data management solutions claiming assistance with various provisions of the GDPR, the task of comparing the degree and scope of such solutions is a challenge without a way to consolidate them. With GDPR as a linked data resource, it is possible to link together information and approaches addressing specific articles and thereby compare them. Organisations can take advantage of this by linking queries and results directly to the relevant text, thereby making it possible to record and measure their solutions for compliance towards specific obligations. GDPR text extensions (GDPRtEXT) uses the European Legislation Identifier (ELI) ontology published by the European Publications Office for exposing the GDPR as linked data. The dataset is published using DCAT and includes an online webpage with HTML id attributes for each article and its subpoints. A SKOS vocabulary is provided that links concepts with the relevant text in GDPR. To demonstrate how related legislations can be linked to highlight changes between them for reusing existing approaches, we provide a mapping from Data Protection Directive (DPD), which was the previous data protection law, to GDPR showing the nature of changes between the two legislations. We also discuss in brief the existing corpora of research that can benefit from the adoption of this resource.

Harshvardhan J. Pandit, Kaniz Fatema, Declan O’Sullivan, Dave Lewis
Transfer Learning for Item Recommendations and Knowledge Graph Completion in Item Related Domains via a Co-Factorization Model

With the popularity of Knowledge Graphs (KGs) in recent years, there have been many studies that leverage the abundant background knowledge available in KGs for the task of item recommendation. However, little attention has been paid to the incompleteness of KGs when leveraging knowledge from them. In addition, previous studies have mainly focused on exploiting knowledge from a KG for item recommendations, and it is unclear whether knowledge can be exploited in the other direction, i.e., whether user-item interaction histories can be used to improve the performance of completing the KG with regard to the domain of items. In this paper, we investigate the effect of knowledge transfer between two tasks: (1) item recommendations, and (2) KG completion, via a co-factorization model (CoFM), which can be seen as a transfer learning model. We evaluate CoFM by comparing it to three competitive baseline methods for each task. Results indicate that considering the incompleteness of a KG outperforms a state-of-the-art factorization method leveraging existing knowledge from the KG, and performs better than other baselines. In addition, the results show that exploiting user-item interaction histories also improves the performance of completing the KG with regard to the domain of items, which has not been investigated before.

Guangyuan Piao, John G. Breslin
Modeling and Summarizing News Events Using Semantic Triples

Summarizing news articles is becoming crucial for allowing quick and concise access to information about daily events. This task can be challenging when the same event is reported with various levels of detail or is subject to diverse viewpoints. A well-established technique in the area of news summarization consists in modeling events as a set of semantic triples. These triples are weighted, mainly based on their frequencies, and then fused to build summaries. Typically, triples are extracted from main clauses, which might lead to information loss. Moreover, some crucial facets of news, such as reasons or consequences, are mostly reported in subordinate clauses and thus are not properly handled. In this paper, we build on an existing approach that uses a graph structure to model sentences, allowing access to any triple independently of the clause it belongs to. Summary sentences are then generated by taking the top-ranked paths that contain many triples and show grammatical correctness. We provide several improvements to that approach. First, we leverage node degrees for finding the most important triples and facets shared among sentences. Second, we enhance the process of triple fusion by providing more effective similarity measures that exploit entity linking and predicate similarity. We performed extensive experiments using the DUC’04 and DUC’07 datasets showing that our approach outperforms baseline approaches by a large margin in terms of ROUGE and PYRAMID scores.

Radityo Eko Prasojo, Mouna Kacimi, Werner Nutt
GNIS-LD: Serving and Visualizing the Geographic Names Information System Gazetteer as Linked Data

In this dataset description paper we introduce the GNIS-LD, an authoritative and public domain Linked Dataset derived from the Geographic Names Information System (GNIS) which was developed by the U.S. Geological Survey (USGS) and the U.S. Board on Geographic Names. GNIS provides data about current, as well as historical, physical, and cultural geographic features in the United States. We describe the dataset, introduce an ontology for geographic feature types, and demonstrate the utility of recent linked geographic data contributions made in conjunction with the development of this resource. Co-reference resolution links to GeoNames.org and DBpedia are provided in the form of owl:sameAs relations. Finally, we point out how the adapted workflow is foundational for complex Digital Line Graph (DLG) data from the USGS National Map and how the GNIS-LD data can be integrated with DLG and other data sources such as sensor observations.

Blake Regalia, Krzysztof Janowicz, Gengchen Mai, Dalia Varanka, E. Lynn Usery
Event-Enhanced Learning for KG Completion

Statistical learning of relations between entities is a popular approach to address the problem of missing data in Knowledge Graphs. In this work we study how relational learning can be enhanced with background knowledge of a special kind: event logs, i.e., sequences of entities that may occur in the graph. Events naturally appear as background knowledge in many important applications. We propose various embedding models that combine entities of a Knowledge Graph and event logs. Our evaluation shows that our approach outperforms state-of-the-art baselines on real-world manufacturing and road traffic Knowledge Graphs, as well as in a controlled scenario that mimics manufacturing processes.

Martin Ringsquandl, Evgeny Kharlamov, Daria Stepanova, Marcel Hildebrandt, Steffen Lamparter, Raffaello Lepratti, Ian Horrocks, Peer Kröger
Exploring Enterprise Knowledge Graphs: A Use Case in Software Engineering

When reusing software architectural knowledge, such as design patterns or design decisions, software architects need support for exploring architectural knowledge collections, e.g., for finding related items. While semantic-based architectural knowledge management tools are limited to supporting lookup-based tasks through faceted search and fall short of enabling exploration, semantic-based exploratory search systems primarily focus on web-scale knowledge graphs without having been adapted to enterprise-scale knowledge graphs (EKG). We investigate how and to what extent exploratory search can be supported on EKGs of architectural knowledge. We propose an approach for building exploratory search systems on EKGs and demonstrate its use within Siemens, which resulted in the STAR system used in practice by 200–300 software architects. We found that the EKG’s ontology allows making previously implicit organisational knowledge explicit and this knowledge informs the design of suitable relatedness metrics to support exploration. Yet, the performance of these metrics heavily depends on the characteristics of the EKG’s data. Therefore both statistical and user-based evaluations can be used to select the right metric before system implementation.

Marta Sabou, Fajar J. Ekaputra, Tudor Ionescu, Juergen Musil, Daniel Schall, Kevin Haller, Armin Friedl, Stefan Biffl
Using Link Features for Entity Clustering in Knowledge Graphs

Knowledge graphs holistically integrate information about entities from multiple sources. A key step in the construction and maintenance of knowledge graphs is the clustering of equivalent entities from different sources. Previous approaches for such an entity clustering suffer from several problems, e.g., the creation of overlapping clusters or the inclusion of several entities from the same source within clusters. We therefore propose a new entity clustering algorithm CLIP that can be applied both to create entity clusters and to repair entity clusters determined with another clustering scheme. In contrast to previous approaches, CLIP not only uses the similarity between entities for clustering but also further features of entity links such as the so-called link strength. To achieve a good scalability we provide a parallel implementation of CLIP based on Apache Flink. Our evaluation for different datasets shows that the new approach can achieve substantially higher cluster quality than previous approaches.

Alieh Saeedi, Eric Peukert, Erhard Rahm
Modeling Relational Data with Graph Convolutional Networks

Knowledge graphs enable a wide variety of applications, including question answering and information retrieval. Despite the great effort invested in their creation and maintenance, even the largest (e.g., Yago, DBpedia or Wikidata) remain incomplete. We introduce Relational Graph Convolutional Networks (R-GCNs) and apply them to two standard knowledge base completion tasks: link prediction (recovery of missing facts, i.e., subject-predicate-object triples) and entity classification (recovery of missing entity attributes). R-GCNs are related to a recent class of neural networks operating on graphs, and are developed specifically to handle the highly multi-relational data characteristic of realistic knowledge bases. We demonstrate the effectiveness of R-GCNs as a stand-alone model for entity classification. We further show that factorization models for link prediction such as DistMult can be significantly improved through the use of an R-GCN encoder model to accumulate evidence over multiple inference steps in the graph, demonstrating a large improvement of 29.8% on FB15k-237 over a decoder-only baseline.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, Max Welling
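For orientation, the per-layer propagation rule at the core of R-GCNs, in its usual formulation, aggregates neighbour representations separately for each relation type:

```latex
h_i^{(l+1)} = \sigma\!\left( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^{r}} \frac{1}{c_{i,r}}\, W_r^{(l)} h_j^{(l)} \;+\; W_0^{(l)} h_i^{(l)} \right)
```

Here N_i^r denotes the neighbours of node i under relation r, c_{i,r} is a normalization constant, and the relation-specific weight matrices W_r^{(l)} are regularized (e.g., via basis decomposition) to keep the parameter count manageable on highly multi-relational data.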
Ontology-Driven Sentiment Analysis of Product and Service Aspects

With so much opinionated, but unstructured, data available on the Web, sentiment analysis has become popular with both companies and researchers. Aspect-based sentiment analysis goes one step further by relating the expressed sentiment in a text to the topic, or aspect, the sentiment is expressed on. This enables a detailed analysis of the sentiment expressed in, for example, reviews of products or services. In this paper we propose a knowledge-driven approach to aspect sentiment analysis that complements traditional machine learning methods. By utilizing common domain knowledge, as encoded in an ontology, we improve the sentiment analysis of a given aspect. The domain knowledge is used to determine which words are expressing sentiment on the given aspect as well as to disambiguate sentiment carrying words or phrases. The proposed method has a highly competitive performance of over 80% accuracy on both SemEval-2015 and SemEval-2016 data, significantly outperforming the considered baselines.

Kim Schouten, Flavius Frasincar
Frankenstein: A Platform Enabling Reuse of Question Answering Components

Recent remarkable efforts of the question answering (QA) community have yielded core components that accomplish QA tasks. However, implementing a QA system is still costly. Aiming to provide an efficient way for the collaborative development of QA systems, the Frankenstein framework was developed, which allows the dynamic composition of question answering pipelines based on the input question. In this paper, we provide a full range of reusable components as independent modules of Frankenstein, populating the ecosystem and leading to the option of creating many different components and QA systems. Just by using the components described here, 380 different QA systems can be created, offering the QA community many new insights. Additionally, we provide resources which support the performance analysis of QA tasks, QA components, and complete QA systems. Hence, Frankenstein is dedicated to improving the efficiency of the research process w.r.t. QA.

Kuldeep Singh, Andreas Both, Arun Sethupat, Saeedeh Shekarpour
Querying APIs with SPARQL: Language and Worst-Case Optimal Algorithms

Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web still resides in other formats and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language that allows bringing data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the number of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also perform a set of experiments that empirically confirm the optimality of our approach.

Matthieu Mosser, Fernando Pieressa, Juan Reutter, Adrián Soto, Domagoj Vrgoč
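For illustration only, the following sketch mimics the idea of bringing API JSON into SPARQL answers by materializing a JSON response as RDF with rdflib and querying it; the vocabulary, resource IRIs and mapping are hypothetical, and the paper instead extends SPARQL itself and optimizes the number of API requests.

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("https://example.org/vocab#")
g = Graph()

# Stand-in for the JSON a Web API might return (in practice fetched over HTTP).
api_response = [
    {"id": "1", "title": "The Hobbit", "year": 1937},
    {"id": "2", "title": "The Lord of the Rings", "year": 1954},
]

# Lift the JSON documents into RDF triples.
for item in api_response:
    book = URIRef(f"https://example.org/book/{item['id']}")
    g.add((book, EX.title, Literal(item["title"])))
    g.add((book, EX.year, Literal(item["year"])))

# Answer a SPARQL query over the lifted data.
results = g.query("""
    PREFIX ex: <https://example.org/vocab#>
    SELECT ?title WHERE { ?b ex:title ?title ; ex:year ?y . FILTER(?y >= 1950) }
""")
for row in results:
    print(row.title)
```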
Task-Oriented Complex Ontology Alignment: Two Alignment Evaluation Sets

Simple ontology alignments, which have been largely studied, link one entity of a source ontology to one entity of a target ontology. One of their limitations is, however, their lack of expressiveness, which can be overcome by complex alignments. Although different complex matching approaches have emerged in the literature, there is a lack of complex reference alignments on which these approaches can be systematically evaluated. This paper proposes two sets of complex alignments between 10 pairs of ontologies from the well-known OAEI conference simple alignment dataset. The methodology for creating the alignment sets is described and takes into account the use of the alignments for two tasks: ontology merging and query rewriting. The ontology merging alignment set contains 313 correspondences and the query rewriting one 431. We report an evaluation of state-of-the-art complex matchers on the proposed alignment sets.

Élodie Thiéblin, Ollivier Haemmerlé, Nathalie Hernandez, Cassia Trojahn
Where is My URI?

One of the foundations of the Semantic Web is the possibility to dereference URIs so that applications can negotiate their semantic content. However, this is often infeasible, as the availability of such information depends on the reliability of networks, services, and human factors. Moreover, it has been shown that around 90% of the information published as Linked Open Data is available as data dumps and more than 60% of endpoints are offline. To this end, we propose a Web service called Where is my URI?. Our service aims at indexing URIs and their use in order to let Linked Data consumers find the respective RDF data source in case such information cannot be retrieved from the URI alone. We rank the corresponding datasets following the rationale that a dataset contributes to the definition of a URI in proportion to the number of literals it provides. We finally describe potential use cases of applications that can immediately benefit from our simple yet useful service.

Andre Valdestilhas, Tommaso Soru, Markus Nentwig, Edgard Marx, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo
Optimizing Semantic Reasoning on Memory-Constrained Platforms Using the RETE Algorithm

Mobile hardware improvements have opened the door for deploying rule systems on ubiquitous, mobile platforms. By executing rule-based tasks locally, fewer remote (cloud) resources are needed, bandwidth usage is reduced, and local, time-sensitive tasks are no longer influenced by network conditions. Further, with data being increasingly published in semantic format, an opportunity arises for rule systems to leverage the embedded semantics of semantic, ontology-based data. To support this kind of ontology-based reasoning in rule systems, rule-based axiomatizations of ontology semantics can be utilized (e.g., OWL 2 RL). Nonetheless, recent benchmarks have found that any kind of semantic reasoning on mobile platforms still lacks scalability, at least when directly re-using existing (PC- or server-based) technologies. To create a tailored solution for resource-constrained platforms, we propose changes to RETE, the mainstay algorithm for production rule systems. In particular, we present an adapted algorithm that, by selectively pooling RETE memories, aims to better balance memory usage with performance. We show that this algorithm is well-suited to many typical Semantic Web scenarios. Using our custom RETE algorithm and an OWL 2 RL ruleset, we perform an extensive evaluation of semantic, ontology-based reasoning on both the PC and mobile platforms.

William Van Woensel, Syed Sibte Raza Abidi
Efficient Ontology-Based Data Integration with Canonical IRIs

In this paper, we study how to efficiently integrate multiple relational databases using an ontology-based approach. In ontology-based data integration (OBDI) an ontology provides a coherent view of multiple databases, and SPARQL queries over the ontology are rewritten into (federated) SQL queries over the underlying databases. Specifically, we address the scenario where records with different identifiers in different databases can represent the same entity. The standard approach in this case is to use sameAs to model the equivalence between entities. However, the standard semantics of sameAs may cause an exponential blow up of query results, since all possible combinations of equivalent identifiers have to be included in the answers. The large number of answers is not only detrimental to the performance of query evaluation, but also makes the answers difficult to understand due to the redundancy they introduce. This motivates us to propose an alternative approach, which is based on assigning canonical IRIs to entities in order to avoid redundancy. Formally, we present our approach as a new SPARQL entailment regime and compare it with the sameAs approach. We provide a prototype implementation and evaluate it in two experiments: in a real-world data integration scenario in Statoil and in an experiment extending the Wisconsin benchmark. The experimental results show that the canonical IRI approach is significantly more scalable.

Guohui Xiao, Dag Hovland, Dimitris Bilidas, Martin Rezk, Martin Giese, Diego Calvanese
Formal Query Generation for Question Answering over Knowledge Bases

Question answering (QA) systems often consist of several components such as Named Entity Disambiguation (NED), Relation Extraction (RE), and Query Generation (QG). In this paper, we focus on the QG process of a QA pipeline on a large-scale Knowledge Base (KB), with noisy annotations and complex sentence structures. We therefore propose SQG, a SPARQL Query Generator with modular architecture, enabling easy integration with other components for the construction of a fully functional QA pipeline. SQG can be used on large open-domain KBs and handle noisy inputs by discovering a minimal subgraph based on uncertain inputs, that it receives from the NED and RE components. This ability allows SQG to consider a set of candidate entities/relations, as opposed to the most probable ones, which leads to a significant boost in the performance of the QG component. The captured subgraph covers multiple candidate walks, which correspond to SPARQL queries. To enhance the accuracy, we present a ranking model based on Tree-LSTM that takes into account the syntactical structure of the question and the tree representation of the candidate queries to find the one representing the correct intention behind the question. SQG outperforms the baseline systems and achieves a macro F1-measure of 75% on the LC-QuAD dataset.

Hamid Zafar, Giulio Napolitano, Jens Lehmann
A LOD Backend Infrastructure for Scientific Search Portals

In recent years, Linked Data has become a key technology for organizations to publish their data collections on the web and to connect them with other data sources. With the ongoing change in the research infrastructure landscape, where integrated search for comprehensive research information gains importance, organizations are challenged to connect their historically unconnected databases with each other. In this article, we present a Linked Open Data based backend infrastructure for a scientific search portal which is set as an additional layer between unconnected non-RDF data collections and makes the links between datasets visible and usable for retrieval. In addition, Linked Data technologies are used to organize different versions and aggregations of datasets. We evaluate the in-use application of our approach in a scientific search portal for the social sciences by investigating the benefit of links between different data sources in a user study.

Benjamin Zapilko, Katarina Boland, Dagmar Kern
Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network

In recent years, the increasing propagation of hate speech on social media and the urgent need for effective counter-measures have drawn significant investment from governments, companies, and empirical research. Despite a large number of emerging scientific studies addressing the problem, a major limitation of existing work is the lack of comparative evaluations, which makes it difficult to assess the contribution of individual works. This paper introduces a new method based on a deep neural network combining convolutional and gated recurrent networks. We conduct an extensive evaluation of the method against several baselines and the state of the art on the largest collection of publicly available Twitter datasets to date, and show that, compared to previously reported results on these datasets, our proposed method is able to capture both word sequence and order information in short texts, and sets a new benchmark by outperforming the state of the art on 6 out of 7 datasets by between 1 and 13% in F1. We also extend the existing dataset collection on this task by creating a new dataset covering different topics.

Ziqi Zhang, David Robinson, Jonathan Tepper
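A minimal Keras sketch of a convolution + GRU text classifier in the spirit of the architecture described above; the layer sizes and hyperparameters are illustrative, not the authors' configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len, num_classes = 20000, 50, 2

model = tf.keras.Sequential([
    layers.Input(shape=(max_len,)),              # token-id sequences
    layers.Embedding(vocab_size, 128),
    layers.Dropout(0.2),
    layers.Conv1D(filters=100, kernel_size=4, activation="relu"),  # word n-gram features
    layers.MaxPooling1D(pool_size=4),
    layers.GRU(100),                             # sequence/order information
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```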
Backmatter
Metadata
Title
The Semantic Web
Edited by
Dr. Aldo Gangemi
Prof. Roberto Navigli
Maria-Esther Vidal
Prof. Dr. Pascal Hitzler
Raphaël Troncy
Laura Hollink
Anna Tordai
Mehwish Alam
Copyright year
2018
Electronic ISBN
978-3-319-93417-4
Print ISBN
978-3-319-93416-7
DOI
https://doi.org/10.1007/978-3-319-93417-4