
2017 | Book | 1st edition

Semantic Keyword-Based Search on Structured Data Sources

COST Action IC1302 Second International KEYSTONE Conference, IKC 2016, Cluj-Napoca, Romania, September 8–9, 2016, Revised Selected Papers

Editors: Andrea Calì, Dorian Gorgan, Martín Ugarte

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the Second COST Action IC1302 International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources, IKC 2016, held in Cluj-Napoca, Romania, in September 2016. The 15 revised full papers and 2 invited papers were carefully reviewed and selected from 18 initial submissions and cover the areas of keyword extraction, natural language search, graph databases, information retrieval techniques for keyword search, and document retrieval.

Table of Contents


Invited Papers

Retrieval, Crawling and Fusion of Entity-centric Data on the Web
While the Web of (entity-centric) data has seen tremendous growth over the past years, take-up and re-use are still limited. Data vary heavily with respect to their scale, quality, coverage and dynamics, which poses challenges for tasks such as entity retrieval and search. This chapter provides an overview of approaches to deal with the increasing heterogeneity of Web data. On the one hand, recommendation, linking, profiling and retrieval can provide efficient means to enable discovery and search of entity-centric data, specifically when dealing with traditional knowledge graphs and linked data. On the other hand, embedded markup such as Microdata and RDFa has emerged as a novel, Web-scale source of entity-centric knowledge. Markup has seen increasing adoption over the last few years, driven by initiatives such as schema.org, and constitutes an increasingly important source of entity-centric data on the Web, being of the same order of magnitude as the Web itself with regard to dynamics and scale. To this end, markup data lends itself as a source for aiding tasks such as knowledge base augmentation, where data fusion techniques are required to address its inherent characteristics, such as redundancy, heterogeneity and lack of links. Future directions are concerned with exploiting the complementary nature of markup data and traditional knowledge graphs.
Stefan Dietze
Data Multiverse: The Uncertainty Challenge of Future Big Data Analytics
With the explosion of data sizes, extracting valuable insight out of big data becomes increasingly difficult. New challenges begin to emerge that complement traditional, long-standing challenges related to building scalable infrastructure and runtime systems that can deliver the desired level of performance and resource efficiency. This vision paper focuses on one such challenge, which we refer to as analytics uncertainty: with so much data available from so many sources, it is difficult to anticipate what the data can be useful for, if at all. As a consequence, it is difficult to anticipate which data processing algorithms and methods are the most appropriate to extract value and insight. In this context, we contribute a study of the current state of the art in big data analytics, the use cases where analytics uncertainty is emerging as a problem, and future research directions to address it.
Radu Tudoran, Bogdan Nicolae, Götz Brasche

Information Extraction and Retrieval

Experiments with Document Retrieval from Small Text Collections Using Latent Semantic Analysis or Term Similarity with Query Coordination and Automatic Relevance Feedback
Users face the Vocabulary Gap problem when attempting to retrieve relevant textual documents from small databases, especially when there are only a small number of relevant documents, as it is likely that different terms are used in queries and relevant documents to describe the same concept. To enable comparison of results of different approaches to semantic search in small textual databases, the PIKES team constructed an annotated test collection and Gold Standard comprising 35 search queries and 331 articles. We present two different possible solutions. In one, we index an unannotated version of the PIKES collection using Latent Semantic Analysis (LSA), retrieving relevant documents using a combination of query coordination and automatic relevance feedback. Although we outperform prior work, this approach is dependent on the underlying collection and is not necessarily scalable. In the second approach, we use an LSA model generated by SEMILAR from a Wikipedia dump to generate a Term Similarity Matrix (TSM). Queries are automatically expanded with related terms from the TSM and are submitted to a term-by-document matrix Vector Space Model of the PIKES collection. Coupled with a combination of query coordination and automatic relevance feedback, we also outperform prior work with this approach. The advantage of the second approach is that it is independent of the underlying document collection.
Colin Layfield, Joel Azzopardi, Chris Staff
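The second approach above is described only at a high level; a minimal sketch of term-similarity-based query expansion feeding a vector space match might look as follows. The similarity scores, the 0.7 expansion threshold, and the toy vocabulary are invented for illustration; the paper's actual matrix is derived from a Wikipedia LSA model built with SEMILAR.

```python
# Hypothetical term similarity matrix (TSM): term -> {related term: similarity}.
import math

tsm = {
    "car": {"automobile": 0.9, "vehicle": 0.8, "road": 0.4},
    "engine": {"motor": 0.85},
}

def expand(query, threshold=0.7):
    """Add TSM neighbours above the threshold to the query terms."""
    terms = set(query)
    for t in query:
        terms |= {r for r, s in tsm.get(t, {}).items() if s >= threshold}
    return terms

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

qvec = {t: 1.0 for t in expand(["car", "engine"])}
doc = {"automobile": 1.0, "motor": 1.0, "repair": 1.0}  # no literal query term
print(cosine(qvec, doc) > 0)  # expansion bridges the vocabulary gap
```

Without expansion the query and document share no terms and the cosine score is zero; the expanded query matches via "automobile" and "motor", which is exactly the vocabulary-gap effect the abstract targets.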
Unsupervised Extraction of Conceptual Keyphrases from Abstracts
The extraction of meaningful keyphrases is important for a variety of applications, such as recommender systems, solutions for browsing literature, or automatic categorization of documents. Since this task is not trivial, a great number of different approaches have been introduced in the past, either focusing on single aspects of the process or utilizing the characteristics of a certain type of document. Especially when it comes to supporting the user in grasping the topics of a document (e.g. in the display of search results), precise keyphrases can be very helpful. However, in such situations usually only the abstract or a short excerpt is available, which most approaches do not take into account. Methods based on the frequency of words are not appropriate in this case, since short texts do not contain sufficient word statistics for a frequency analysis. Moreover, many existing methods are supervised and therefore depend on domain knowledge or manually annotated data, which in many scenarios is not available. Therefore, we present an unsupervised graph-based approach for extracting meaningful keyphrases from abstracts of scientific articles. We show that even though our method is not based on manually annotated data or corpora, it works surprisingly well.
Philipp Ludwig, Marcus Thiel, Andreas Nürnberger
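As a rough illustration of the kind of unsupervised graph-based extraction described above, the following sketch ranks words by a PageRank-style score over a word co-occurrence graph, in the spirit of TextRank; the window size, damping factor and scoring are assumptions for illustration, not the authors' actual algorithm.

```python
from collections import defaultdict
from itertools import combinations

def extract_keywords(words, window=2, damping=0.85, iters=50, top_k=3):
    """Rank words by PageRank over a sliding-window co-occurrence graph."""
    neighbors = defaultdict(set)
    for i in range(len(words) - window + 1):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:                 # undirected edge per co-occurrence
                neighbors[a].add(b)
                neighbors[b].add(a)
    nodes = list(neighbors)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):             # power-iteration style ranking
        score = {n: (1 - damping) + damping *
                 sum(score[m] / len(neighbors[m]) for m in neighbors[n])
                 for n in nodes}
    return sorted(nodes, key=score.get, reverse=True)[:top_k]

words = ("graph based keyword extraction ranks keyword candidates "
         "on a word graph").split()
print(extract_keywords(words))
```

Note that no frequency statistics or annotated corpus is needed: well-connected words rise to the top purely from the graph structure, which is why this family of methods suits short abstracts.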
Back to the Sketch-Board: Integrating Keyword Search, Semantics, and Information Retrieval
We reproduce recent research results combining semantic and information retrieval methods. Additionally, we expand the existing state of the art by combining the semantic representations with IR methods from the probabilistic relevance framework. We demonstrate a significant increase in performance, as measured by standard evaluation metrics.
Joel Azzopardi, Fabio Benedetti, Francesco Guerra, Mihai Lupu
Topic Detection in Multichannel Italian Newspapers
Nowadays, any person, company or public institution uses and exploits different channels to share private or public information with other people (friends, customers, relatives, etc.) or institutions. This context has changed journalism: major newspapers report news not just on their own websites, but also on several social media platforms such as Twitter or YouTube. The use of multiple communication media creates a need to integrate and analyse the content published globally, not just at the level of a single medium. Such an analysis should provide a comprehensive overview of the information that reaches end users and how they consume it; it should identify the main topics in the news flow and reveal the mechanisms of publishing news on different media (e.g. news timelines). Currently, most work in this area still focuses on a single medium, so an analysis across different media (channels) should improve the results of topic detection. This paper shows the application of a graph analytical approach, called KeyGraph, to a set of very heterogeneous documents such as the news published on various media. A preliminary evaluation of the news published over a five-day period was able to identify the main topics within the publications of a single newspaper, as well as within the publications of 20 newspapers on several online channels.
Laura Po, Federica Rollo, Raquel Trillo Lado
Random Walks Analysis on Graph Modelled Multimodal Collections
Nowadays, there is a proliferation of information objects from different modalities—text, image, audio, video. Different types of relations between information objects (e.g. similarity or semantic relations) have motivated graph-based search in multimodal Information Retrieval. In this paper, we formulate a Random Walks problem on our model for multimodal IR, which is robust over different distributions of modalities. We investigate query-dependent and query-independent Random Walks on our model. The results show that query-dependent Random Walks yield higher precision than query-independent Random Walks. We additionally investigate the contribution of the graph structure (quantified by the number and weights of incoming and outgoing links) to the final ranking in both types of Random Walks, and observe that query-dependent Random Walks are less dependent on the graph structure. The experiments are conducted on a multimodal collection of about 400,000 documents and images.
Serwah Sabetghadam, Mihai Lupu, Andreas Rauber
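A query-dependent random walk of the kind compared above is commonly realised as a random walk with restart, where the restart (personalization) vector concentrates probability mass on query-relevant nodes, while the query-independent variant restarts uniformly. The toy graph, restart probability and node names below are illustrative assumptions, not the paper's actual 400,000-item collection.

```python
def random_walk(adj, restart, alpha=0.15, iters=100):
    """Power iteration: adj maps node -> neighbour list,
    restart maps node -> restart probability."""
    nodes = list(adj)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each step: restart with prob. alpha, else follow an out-link.
        new = {n: alpha * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            share = (1 - alpha) * score[n] / len(adj[n])
            for m in adj[n]:
                new[m] += share
        score = new
    return score

# Toy multimodal graph linking text documents (t*) and images (i*).
adj = {"t1": ["i1", "t2"], "t2": ["t1"], "i1": ["t1", "i2"], "i2": ["i1"]}
uniform = {n: 0.25 for n in adj}   # query-independent restart
focused = {"t1": 1.0}              # query-dependent: restart at relevant t1
print(random_walk(adj, focused)["t1"] > random_walk(adj, uniform)["t1"])
```

Concentrating the restart vector boosts the scores of nodes near the query, which is the intuition behind the higher precision reported for the query-dependent variant.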

Text and Digital Libraries

A Software Processing Chain for Evaluating Thesaurus Quality
Thesauri are knowledge models commonly used for information classification and retrieval, whose structure is defined by standards that describe the main features the concepts and relations must have. However, following these standards requires deep knowledge of the field the thesaurus is going to cover and experience in thesaurus creation. To help in this task, this paper describes a software processing chain that provides different validation components that evaluate the quality of the main thesaurus features.
Javier Lacasta, Gilles Falquet, Javier Nogueras-Iso, Javier Zarazaga-Soria
Comparison of Collaborative and Content-Based Automatic Recommendation Approaches in a Digital Library of Serbian PhD Dissertations
Digital libraries have become an excellent information resource for researchers. However, users of digital libraries would be served better by having the relevant items ‘pushed’ to them. In this research, we present various automatic recommendation systems to be used in a digital library of Serbian PhD Dissertations. We experiment with the use of Latent Semantic Analysis (LSA) in both content and collaborative recommendation approaches, and evaluate the use of different similarity functions. We find that the best results are obtained when using a collaborative approach that utilises LSA and Pearson similarity.
Joel Azzopardi, Dragan Ivanovic, Georgia Kapitsaki
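The collaborative approach reported above as best-performing rests on a user-to-user similarity function; a minimal sketch of Pearson similarity over rating vectors follows. The user names and ratings are made up for illustration, and the real system additionally applies LSA to the rating matrix before comparing users.

```python
import math

def pearson(u, v):
    """Pearson correlation over the items both users rated."""
    shared = [i for i in u if i in v]
    if len(shared) < 2:
        return 0.0                      # not enough overlap to correlate
    mu = sum(u[i] for i in shared) / len(shared)
    mv = sum(v[i] for i in shared) / len(shared)
    num = sum((u[i] - mu) * (v[i] - mv) for i in shared)
    den = (math.sqrt(sum((u[i] - mu) ** 2 for i in shared))
           * math.sqrt(sum((v[i] - mv) ** 2 for i in shared)))
    return num / den if den else 0.0

ratings = {
    "ana":   {"thesisA": 5, "thesisB": 1, "thesisC": 4},
    "boris": {"thesisA": 4, "thesisB": 2, "thesisC": 5},
    "vera":  {"thesisA": 1, "thesisB": 5, "thesisC": 2},
}
print(round(pearson(ratings["ana"], ratings["boris"]), 2))  # -> 0.84
```

Users with correlated tastes (ana and boris) score near +1 and can "push" each other's unread dissertations, while anti-correlated users (ana and vera) score negatively and are ignored as neighbours.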
Keyword-Based Search on Bilingual Digital Libraries
This paper outlines the main features of Bibliša, a tool that offers various possibilities for enhancing queries submitted to large collections of aligned parallel text residing in a bilingual digital library. Bibliša supports keyword queries as an intuitive way of specifying information needs. The keyword queries, initiated in Serbian or English, can be expanded semantically, morphologically, and into the other language, using different supporting monolingual and bilingual resources. The terminological and lexical resources are of various types, such as wordnets, electronic dictionaries, and SQL and NoSQL databases, which are distributed across different servers and accessed in various ways. The web application has been tested on a collection of texts from 3 journals and 2 projects, comprising 299 documents generated from TMX and stored in a NoSQL database. The tool allows full-text and metadata search, with extraction of concordance sentence pairs to support translation and terminology work.
Ranka Stanković, Cvetana Krstev, Duško Vitas, Nikola Vulović, Olivera Kitanović
Network-Enabled Keyword Extraction for Under-Resourced Languages
In this paper we discuss the advantages of network-enabled keyword extraction from texts in under-resourced languages. Network-enabled methods are briefly introduced, while the focus of the paper is placed on the difficulties these methods must overcome when dealing with content in under-resourced languages (mainly manifested as a lack of natural language processing resources: corpora and tools). Additionally, the paper discusses how to circumvent the lack of NLP tools with a network-enabled method such as the SBKE method.
Slobodan Beliga, Sanda Martinčić-Ipšić

Documents and Information Retrieval

Making Sense of Citations
To this day, the analysis of citations has been aimed mainly at exploring different ways to count them, such as the total count, the h-index or the s-index, in order to quantify a researcher's overall contribution and impact. In this work we show how considering the structured metadata that accompanies citations, such as the publication outlet in which they appeared, can lead to a considerably more insightful understanding of the ways in which a researcher has impacted the work of others.
Xenia Koulouri, Claudia Ifrim, Manolis Wallace, Florin Pop
An Ontology-Based Approach to Information Retrieval
We define a general framework for ontology-based information retrieval (IR). In our approach, document and query expansion rely on a base taxonomy that is extracted from a lexical database or a Linked Data set (e.g. WordNet or Wiktionary). Each term from a document or query is modelled as a vector of base concepts from the base taxonomy. We define a set of mapping functions which map multiple ontological layers (dimensions) onto the base taxonomy. This way, each concept from the included ontologies can also be represented as a vector of base concepts from the base taxonomy. We propose a general weighting schema which is used for the vector space model. Our framework can therefore take into account various lexical and semantic relations between terms and concepts (e.g. synonymy, hierarchy, meronymy, antonymy, geo-proximity, etc.). This allows us to avoid certain vocabulary problems (e.g. synonymy, polysemy) as well as to reduce the vector size in IR tasks.
Ana Meštrović, Andrea Calì

Collaboration and Semantics

Game with a Purpose for Verification of Mappings Between Wikipedia and WordNet
The paper presents a Game with a Purpose for the verification of automatically generated mappings, focusing on mappings between WordNet synsets and Wikipedia articles. A general description of the idea behind games with a purpose is given. The TGame system, a 2D platform mobile game with the verification process embedded in its gameplay, is described. Additional mechanisms for preventing cheating, increasing player motivation and gathering feedback are also presented. The evaluation of the proposed solution and future work are also described.
Tomasz Boiński
TB-Structure: Collective Intelligence for Exploratory Keyword Search
In this paper we address an exploratory search challenge by presenting a new (structure-driven) collaborative filtering technique. The aim is to increase search effectiveness by predicting a seeker's implicit intents at an early stage of the search process. This is achieved by uncovering behavioural patterns within large datasets of preserved collective search experience. We apply a specific tree-based data structure called a TB (There-and-Back) structure for compact storage of search history in the form of merged query trails: sequences of queries iteratively approaching a seeker's goal. The organization of TB-structures allows new implicit trails to be inferred for the prediction of a seeker's intents. Experiments demonstrate both the storage compactness and the inference potential of the proposed structure.
Vagan Terziyan, Mariia Golovianko, Michael Cochez
Using Natural Language to Search Linked Data
There are many endeavours aiming to offer users more effective ways of getting relevant information from the web. One of them is the concept of Linked Data, which provides interconnected data sources. However, querying these types of data is difficult not only for conventional web users but also for experts in the field. Therefore, a more convenient way of querying would be of great value. One direction is to allow the user to use natural language. To make this task easier, we have proposed a method for translating a natural language query into a SPARQL query. It is based on sentence structure, utilizing dependencies between the words in user queries. The dependencies are used to map the query to the semantic web structure, which is in the next step translated into a SPARQL query. According to our first experiments, we are able to answer a significant group of user queries.
Viera Rozinajová, Peter Macko
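A toy illustration of the dependency-driven translation sketched above: a real system would use a dependency parser and an ontology mapping, whereas the hand-written "dependencies" and the DBpedia-style prefixes and property names below are assumptions for illustration, not the authors' actual pipeline.

```python
def to_sparql(dependencies):
    """dependencies: (relation, head, dependent) triples from a parse."""
    subj = prop = obj = None
    for rel, head, dep in dependencies:
        if rel == "nsubj" and dep.lower() == "who":
            subj = "?person"          # wh-word becomes the query variable
            prop = head               # the verb maps to a property
        elif rel == "dobj":
            obj = dep                 # the object becomes a resource
    return f"SELECT {subj} WHERE {{ {subj} dbo:{prop} dbr:{obj} . }}"

# "Who wrote Hamlet?" -> nsubj(wrote, Who), dobj(wrote, Hamlet)
deps = [("nsubj", "wrote", "Who"), ("dobj", "wrote", "Hamlet")]
print(to_sparql(deps))
# SELECT ?person WHERE { ?person dbo:wrote dbr:Hamlet . }
```

The point of the dependency detour is that the grammatical roles (subject, verb, object), rather than word order, determine which part of the question becomes the SPARQL variable, property and resource.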
The Use of Semantics in the CrossCult H2020 Project
CrossCult is a newly started project that aims to make reflective history a reality in the European cultural context. In this paper we examine how the project aims to take advantage of advances in semantic technologies in order to achieve its goals. Specifically, we explain what the quest for reflection is and, through practical examples from two of the project's flagship pilots, show how semantics can assist in this direction.
Stavroula Bampatzia, Omar Gustavo Bravo-Quezada, Angeliki Antoniou, Martín López Nores, Manolis Wallace, George Lepouras, Costas Vassilakis