About this book

This book constitutes the thoroughly refereed post-conference proceedings of the Third COST Action IC1302 International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources, IKC 2017, held in Gdańsk, Poland, in September 2017.

The 13 revised full papers and 5 short papers included in the first part of the book were carefully reviewed and selected from numerous submissions. The second part contains reports that summarize the major activities and achievements that have taken place in the context of the action: the short term scientific missions, the outcome of the summer schools, and the results achieved within the following four work packages: representation of structured data sources; keyword search; user interaction and keyword query interpretation; and research integration, showcases, benchmarks and evaluations. Also included is a short report generated by the chairs of the action. The papers cover a broad range of topics in the area of keyword search combining expertise from many different related fields such as information retrieval, natural language processing, ontology management, indexing, semantic web and linked data.

Table of Contents

Frontmatter

Proceedings of the KEYSTONE Conference 2017

Frontmatter

Formalization and Visualization of the Narrative for Museum Guides

There is a wide range of meta-data standards for the documentation of museum related information, such as CIDOC-CRM; these standards focus on the description of distinct exhibits. In contrast, there is a lack of standards for the digitization and documentation of the routes followed and information provided by museum guides. In this work we propose the notion of the narrative, which can be used to model a guided museum visit. We provide a formalization for the narrative so that it can be digitally encoded, and thus preserved, shared, re-used, further developed and exploited, and also propose an intuitive visualization approach.

Ioannis Bourlakos, Manolis Wallace, Angeliki Antoniou, Costas Vassilakis, George Lepouras, Anna Vassiliki Karapanagiotou

Data Reduction Techniques Applied on Automatic Identification System Data

In recent years, the constant increase in waterway traffic has generated a high volume of Automatic Identification System (AIS) data, which requires considerable effort to process and analyze in near real-time. In this paper, we analyze an AIS data set and propose a data reduction technique that can be applied to AIS data without losing important information. The technique reduces the data to a manageable size so that it can be used for further analysis or in AIS data visualization applications.

Claudia Ifrim, Iulian Iuga, Florin Pop, Manolis Wallace, Vassilis Poulopoulos

FIRE: Finding Important News REports

Every day, an immeasurable number of news items are published. Social media greatly contributes to the dissemination of information, making it difficult to stay on top of what is happening. Twitter stands out among popular social networks due to its large user base and the immediateness with which news is spread. In this paper, we present a solution named Finding Important News REports (FIRE) that exploits the information available on Twitter to identify and track breaking news, and the defining articles that discuss them. The methods used in FIRE present context-specific problems when dealing with the micro-messages of Twitter, and thus they are the subject of research. FIRE demonstrates how Twitter’s conversation habits do nothing to shackle the detection of important news. On the contrary, the developed system is able to extract newsworthy stories that are important to the general population, and does so before Twitter itself. Moreover, the results emphasize the need for reliable and efficient spam and noise filtering tools.

Nicholas Mamo, Joel Azzopardi

Analysing and Visualising Parliamentary Questions: A Linked Data Approach

In many national parliaments, Members can exercise a basic Parliamentary function of holding the Executive to account by submitting Questions to Government Ministers. In certain parliaments, Members also have the option of requesting either a written or an oral answer. Parliamentary Questions (PQs) often generate significant media attention and public interest, and are considered to be a very useful tool for parliamentarians to scrutinise the Government’s operative and financial administration. Interesting insights about individual Members of Parliament (MPs) as well as about the Parliament as a collective institution can be gleaned by analysing PQs. In this paper we present a linked data approach to PQs that is complemented with visualisations intended to make this rich repository of parliamentary data more accessible to citizens. We use PQ data from the Maltese Parliament ranging over the last four legislatures and present an application called PQViz that exploits graph analytics to expose interesting insights from this data.

Charlie Abela, Joel Azzopardi

Keyword Extraction from Parallel Abstracts of Scientific Publications

In this paper, we study keyword extraction from parallel abstracts of scientific publications in the Serbian and English languages. The keywords are extracted by a selectivity-based keyword extraction (SBKE) method. The method is based on the structural and statistical properties of text represented as a complex network. The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the performance of the method across languages, since the experimental environment and data are controlled. The achieved keyword extraction results, measured with an F1 score, are 49.57% for English and 46.73% for Serbian if we disregard keywords that are not present in the abstracts. When we evaluate against the whole keyword set, the F1 scores are 40.08% and 45.71% respectively. This work shows that SBKE can be easily ported to a new language, domain and type of text in the sense of its structure. Still, there are drawbacks: the method can extract only the words that appear in the text.

Slobodan Beliga, Olivera Kitanović, Ranka Stanković, Sanda Martinčić-Ipšić
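
The two evaluation settings reported in the abstract above (scoring only against annotated keywords that occur in the abstract, and scoring against the whole annotated set) can be illustrated with a minimal sketch. The function names and the toy data are invented for illustration, single-word keywords are assumed, and this is not the authors' evaluation code.

```python
def f1_score(extracted, gold):
    """F1 of an extracted keyword set against a gold (annotated) set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(extracted, gold, abstract_text):
    """Return (F1 against gold keywords present in the text,
    F1 against the whole gold keyword set).

    Assumption: keywords are single words, so presence is a word lookup.
    """
    words = set(abstract_text.lower().split())
    gold_in_text = [k for k in gold if k.lower() in words]
    return f1_score(extracted, gold_in_text), f1_score(extracted, gold)


# Invented toy data, not taken from the paper's corpus.
abstract = "keyword extraction from complex network models of text"
gold = ["keyword", "network", "selectivity"]   # "selectivity" is not in the text
extracted = ["keyword", "network", "text"]
restricted_f1, full_f1 = evaluate(extracted, gold, abstract)
```

Because a method that selects words from the text can never return an annotated keyword that is absent from it, the score against the whole set cannot exceed the restricted score, which matches the direction of the figures quoted in the abstract.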

Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System

The plagiarism detection problem involves finding patterns in unstructured text documents. In this approach, documents are considered similar if they contain some identical phrases of a defined minimal length. The typical methods used to find similar documents in digital libraries are not suitable for plagiarism detection, because the documents they find may have similar content with no guarantee that they contain any identical phrases. The article describes an example method of searching for similar documents containing identical phrases in big document repositories, and presents the problem of selecting a storage and computing platform suitable for using this method in plagiarism detection systems. We compare implementations of the method on two computing platforms, KASKADA and Hadoop, in different configurations, in order to test and compare their performance and scalability. The method using the default tools available on the Hadoop platform, i.e. HDFS and Apache Spark, offers worse performance than the method implemented on the KASKADA platform using NFS (Network File System) and the Master/Slave processing model. The advantage of the Hadoop platform increases with the use of additional data structures (hash-map) and tools offered on this platform, i.e. HBase (NoSQL). The tools integrated with the Hadoop platform make it possible to create an efficient and scalable method for finding similar documents in big repositories. The KASKADA platform offers efficient tools for analysing data in real-time processes, i.e. when there is no need to compare the input data to a large collection of information (patterns) or to use advanced data structures.
The contribution of this article is the comparison of the two computing and storage platforms in order to achieve better performance of the method used in the plagiarism detection system to find similar documents containing identical phrases.

Andrzej Sobecki, Marcin Kepa
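
The idea of matching documents by identical phrases of a minimal length through a hash-map, as described in the abstract above, can be sketched in a few lines. This is a single-machine illustration under stated assumptions: a phrase is taken to be a run of `min_len` consecutive words, the hash-map is a plain Python dictionary, and all names are hypothetical. It is not the authors' KASKADA or Hadoop implementation.

```python
from collections import defaultdict


def phrases(text, min_len):
    """Yield every run of min_len consecutive words in text."""
    words = text.lower().split()
    for i in range(len(words) - min_len + 1):
        yield " ".join(words[i:i + min_len])


def build_index(documents, min_len):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for phrase in phrases(text, min_len):
            index[phrase].add(doc_id)
    return index


def find_similar(query_text, index, min_len):
    """Return {doc_id: number of distinct phrases shared with the query}."""
    hits = defaultdict(int)
    for phrase in set(phrases(query_text, min_len)):
        # .get avoids creating empty entries in the defaultdict index
        for doc_id in index.get(phrase, ()):
            hits[doc_id] += 1
    return dict(hits)


# Invented toy repository.
repository = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "a completely different sentence about search engines",
}
index = build_index(repository, min_len=4)
matches = find_similar("we saw the quick brown fox jumps today", index, min_len=4)
```

A document that merely discusses the same topic shares no exact phrase with the query and is not returned, which is the distinction the abstract draws between content similarity and identical phrases.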

Assessing Word Difficulty for Quiz-Like Game

Verification of mappings is a laborious task. Our research aims at providing a framework for manual verification of mappings using a crowdsourcing approach, which we plan to implement as a quiz-like game. To present texts appropriately for each game level, the mappings have to be evaluated in terms of difficulty. In this paper we present an algorithm for assessing word difficulty. Three approaches are presented and experimental results are shown. Plans for future work are also provided.

Jakub Jagoda, Tomasz Boiński

From Deep Learning to Deep University: Cognitive Development of Intelligent Systems

Search is not only an instrument to find intended information. The ability to search is a basic cognitive skill that helps people to explore the world. It is largely based on personal intuition and creativity. However, due to the emerging big data challenge, people require new forms of training to develop or improve this ability. Current developments within Cognitive Computing and Deep Learning enable artificial systems to learn and gain human-like cognitive abilities. This means that the skill of searching efficiently and creatively within huge data spaces becomes one of the most important for cognitive systems aiming at autonomy. This skill cannot be pre-programmed; it requires learning. We propose using collective search expertise to train creative association-driven navigation across heterogeneous information spaces. We argue that artificial cognitive systems, as well as humans, need special environments, like universities, to train skills of autonomy and creativity.

Mariia Golovianko, Svitlana Gryshko, Vagan Terziyan

From a Web Services Catalog to a Linked Ecosystem of Services

In this paper, we present a Linked ecosystem of Web services where Web services, mashups and users are represented as a multigraph structure. For illustration and experimental purposes, a graph has been constructed by gathering web services metadata from ProgrammableWeb. The graph is stored in a Neo4j graph database and serves as a repository for a realistic collection of web services for achieving services/mashups discovery and recommendation.

Fatma Slaimi, Sana Sellami, Omar Boucelma

Towards Keyword-Based Search over Environmental Data Sources

This paper describes the problem of keyword-based search over environmental data sources. Based on a number of assumptions that simplify this general problem, a prototype of a search engine for environmental data was designed, implemented and evaluated. This first solution serves as a proof of concept that illustrates its applicability in different domains, for both expert and non-expert users. The requirements analysis undertaken and the subsequent design and implementation helped in the identification of a number of new research challenges.

David Álvarez-Castro, José R. R. Viqueira, Alberto Bugarín

Collaboration Networks Analysis: Combining Structural and Keyword-Based Approaches

This paper proposes a method for the analysis of the characteristics of collaboration networks. The method uses social network analysis metrics which are especially applicable to directed and weighted collaboration networks. By using the proposed method it is possible to investigate the global structure of collaboration networks, such as density, centralisation, assortativity and the dynamics of network growth. Furthermore, the method proposes appropriate network centrality measures (degree and its variations for directed and weighted networks) for ranking the nodes. In addition, the proposed method combines a keyword-based approach and the Louvain algorithm for the community detection task. Next, the paper describes a case study in which the proposed method is applied to the collaboration networks that emerged from STSMs in the KEYSTONE COST Action.

Ana Meštrović

Exploration of Web Search Results Based on the Formal Concept Analysis

In this paper, we present an approach to support exploratory search by structuring search results based on concept lattices, which are created on the fly using advanced methods from the area of Formal Concept Analysis (FCA). The main aim of the approach is to organize query-based search engine results (e.g. web documents) as a hierarchy of clusters that are composed of documents with similar attributes. The concept lattice provides a structured view of the query-related domains and hence can improve the understanding of document properties and shared features. Additionally, we applied a fuzzy extension of FCA in order to support the usage of different types of attributes within the analyzed query results set. The approach has been integrated into an interactive web search interface. It provides a smooth integration of keyword-based web search and interactive visualization of the concept lattice and its concepts in order to support complex search tasks.

Peter Butka, Thomas Low, Michael Kotzyba, Stefan Haun, Andreas Nürnberger
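
To make the notion of a concept lattice over search results more concrete, the following minimal sketch enumerates the formal concepts of a small, crisp (non-fuzzy) document-attribute context. It uses a naive enumeration over attribute subsets, which is exponential in the number of attributes, and the data and function name are invented for illustration. It is not the on-the-fly algorithm or the fuzzy extension used by the authors.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all (extent, intent) pairs of a {object: set_of_attributes} context.

    extent: the objects sharing every attribute in the intent
    intent: the attributes shared by every object in the extent
    """
    objects = set(context)
    attributes = set().union(*context.values()) if context else set()
    concepts = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(sorted(attributes), size):
            # Objects that have all attributes in the candidate subset
            extent = frozenset(o for o in objects if set(subset) <= context[o])
            # Closure: every attribute common to all objects in the extent
            intent = frozenset(
                a for a in attributes if all(a in context[o] for o in extent)
            )
            concepts.add((extent, intent))
    return concepts


# Invented toy search results with their attributes.
results = {
    "d1": {"python", "tutorial"},
    "d2": {"python", "library"},
    "d3": {"java", "tutorial"},
}
concepts = formal_concepts(results)
```

Each concept is a cluster of documents together with the attributes they all share; ordering the concepts by inclusion of their extents gives the hierarchy that the abstract describes.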

Challenges in Applying Machine Learning Methods: Studying Political Interactions on Social Networks

This paper discusses the potential role of Machine Learning (ML) methods in social science research in general, and specifically in studies of the political behavior of users in social networks (SN). It explores challenges that arose in a set of studies we conducted on the classification of comments on politicians' posts, and suggests ways of addressing them. These challenges apply to a larger set of online political behavior studies.

Chaya Liebeskind, Karine Nahon

Wikidata and DBpedia: A Comparative Study

DBpedia and Wikidata are two online projects focused on offering structured data from Wikipedia in order to ease its exploitation on the Linked Data Web. In this paper, a comparison of these two widely-used structured data sources is presented. This comparison considers the most relevant data quality dimensions in the state of the art of the scientific research. As fundamental differences between both projects, we can highlight that Wikidata has an open centralised nature, whereas DBpedia is more popular in the Semantic Web and the Linked Open Data communities and depends on the different linguistic editions of Wikipedia.

D. Abián, F. Guerra, J. Martínez-Romanos, Raquel Trillo-Lado

Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

One of the challenges in information retrieval is attempting to search a corpus of documents that may contain multiple languages. This exploratory study expands upon earlier research employing Latent Semantic Analysis (so-called Multi-Lingual Latent Semantic Indexing, or ML-LSI/LSA). We experiment using this approach, and a new one, in a multi-lingual context utilising two similar languages, namely Serbian and Croatian. Traditionally, with an LSA approach, a parallel corpus would be needed in order to train the system by combining identical documents in two languages into one document. We repeat that approach and also experiment with creating a semantic space using the parallel corpus on its own, without merging the documents together, to test the hypothesis that, with very similar languages, the merging of documents may not be required for good results.

Colin Layfield, Dragan Ivanović, Joel Azzopardi

Item-Based Vs User-Based Collaborative Recommendation Predictions

The use of personalised recommendation systems to push interesting items to users has become a necessity in the digital world that contains overwhelming amounts of information. One of the most effective ways to achieve this is by considering the opinions of other similar users – i.e. through collaborative techniques. In this paper, we compare the performance of item-based and user-based recommendation algorithms as well as propose an ensemble that combines both systems. We investigate the effect of applying LSA, as well as varying the neighbourhood size on the different algorithms. Finally, we experiment with the inclusion of content-type information in our recommender systems. We find that the most effective system is the ensemble system that uses LSA.

Joel Azzopardi
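
The difference between the user-based and item-based predictions compared in the abstract above, and a simple way of combining them, can be sketched as follows. This sketch assumes cosine similarity, a similarity-weighted average over all available neighbours, and an ensemble that averages the two predictions; these choices, the function names and the toy ratings are assumptions for illustration. It does not include LSA or neighbourhood-size tuning and is not the author's system.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def transpose(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    by_item = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item


def predict(matrix, row, col):
    """Similarity-weighted average of column `col` over rows similar to `row`."""
    target = matrix.get(row, {})
    num = den = 0.0
    for other, vec in matrix.items():
        if other == row or col not in vec:
            continue
        sim = cosine(target, vec)
        num += sim * vec[col]
        den += abs(sim)
    return num / den if den else None


def user_based(ratings, user, item):
    # Rows are users: weight other users' ratings of `item`.
    return predict(ratings, user, item)


def item_based(ratings, user, item):
    # Rows are items: weight the user's ratings of other items.
    return predict(transpose(ratings), item, user)


def ensemble(ratings, user, item):
    preds = [user_based(ratings, user, item), item_based(ratings, user, item)]
    preds = [p for p in preds if p is not None]
    return sum(preds) / len(preds) if preds else None


# Invented toy ratings: predict Ann's rating for item "c".
ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
u_pred = user_based(ratings, "ann", "c")
i_pred = item_based(ratings, "ann", "c")
e_pred = ensemble(ratings, "ann", "c")
```

The user-based prediction is a blend of what Bob and Cat gave item "c", while the item-based prediction is a blend of what Ann gave items "a" and "b", so the two can disagree noticeably on the same data.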

An Integrated Smart City Platform

Smart Cities aim to create a higher quality of life for their citizens, improve business services and promote the tourism experience. Fostering smart city innovation at local and regional level requires a set of mature technologies to discover, integrate and harmonize multiple data sources, and the exposure of effective applications for end-users (citizens, administrators, tourists ...). In this context, Semantic Web technologies and Linked Open Data principles provide a means for sharing knowledge about cities as physical, economical, social, and technical systems, enabling the development of smart city services. Despite the tremendous effort these communities have made so far, there is a lack of comprehensive and effective platforms that handle the entire process of identification, ingestion, consumption and publication of data for Smart Cities. In this paper, a complete open-source platform to boost the integration, semantic enrichment, publication and exploitation of public data to foster smart cities in local and national administrations is proposed. Starting from mature software solutions, we propose a platform to facilitate the harmonization of datasets (open and private, static and dynamic in real time) of the same domain generated by different authorities. The platform provides a unified dataset oriented to smart cities that can be exploited to offer services to citizens in a uniform way, to easily release open data, and to monitor the status of city services in real time by means of a suite of web applications.

Paolo Nesi, Laura Po, José R. R. Viqueira, Raquel Trillo-Lado

Accessing the Deep Web with Keywords: A Foundational Approach

The Deep Web is constituted by data that are generated dynamically as the result of interactions with Web pages. The problem of accessing Deep Web data presents many challenges: it has been shown that answering even simple queries on such data requires the execution of recursive query plans. There is a gap between the theoretical understanding of this problem and the practical approaches to it. The main reason behind this is that the problem is to be studied by considering the database as part of the input, but queries can be processed by accessing data according to limitations, expressed as so-called access patterns. In this paper we embark on the task of closing the above gap by giving a precise definition that reflects the practical nature of accessing Deep Web data sources. In particular, we define the problem of querying Deep Web sources with keywords. We describe two scenarios: in the first, called unrestricted, the query answering algorithm has full access to the data; in the second, called restricted, the algorithm can access the data only according to the access patterns. We formalise the decision problem associated with query answering in the Deep Web, explaining its relevance in both the aforementioned scenarios. We then present some complexity results.

Andrea Calì, Martín Ugarte

The KEYSTONE COST Action

Frontmatter

The KEYSTONE IC1302 COST Action

As more and more data becomes available on the Web, as its complexity increases and as the Web’s user base shifts towards a more general non-technical population, keyword searching is becoming a valuable alternative to traditional SQL queries, mainly due to its simplicity and the lower effort/expertise it requires. Existing approaches suffer from a number of limitations when applied to multi-source scenarios requiring some form of query planning, without direct access to database instances, and with frequent updates precluding any effective implementation of data indexes. Typical scenarios include Deep Web databases, virtual data integration systems and data on the Web. Therefore, building effective keyword searching techniques can have an extensive impact since it allows non-professional users to access large amounts of information stored in structured repositories through simple keyword-based query interfaces. This revolutionises the paradigm of searching for data since users are offered access to structured data in a similar manner to the one they already use for documents. To build a successful, unified and effective solution, the action “semantic KEYword-based Search on sTructured data sOurcEs” (KEYSTONE) promoted synergies across several disciplines, such as semantic data management, the Semantic Web, information retrieval, artificial intelligence, machine learning, user interaction, interface design, and natural language processing. This paper describes the main achievements of this COST Action.

Francesco Guerra, Yannis Velegrakis, Jorge Cardoso, John G. Breslin

KEYSTONE WG1: Activities and Results Overview on Representation of Structured Data Sources

The main goal of research in the KEYSTONE COST Action IC1302 is to manage large amounts of heterogeneous data, particularly structured data, in order to provide users (people or software agents) with the data they require in an effective way and at minimum cost. The processes of managing and organizing data to provide it to users efficiently also generate new data that can be collected and exploited to improve the processes; i.e., data about the processes involved can be used as feedback to improve them. KEYSTONE is organized in 4 working groups: Representation of Structured Data Sources (WG1), Keyword-based Search (WG2), User Interaction and Keyword Query Interpretation (WG3), and Research Integration, Showcases, Benchmarks and Evaluations (WG4). This chapter covers the research related to WG1, focusing on profiling, assessment, representation and discovery of structured datasets. The results of WG1 influence WG2 and WG3, whereas WG4 focuses on the integration of the results of all working groups and how to exploit them.

Raquel Trillo-Lado, Stefan Dietze

KEYSTONE WG2: Activities and Results Overview on Keyword Search

In this chapter we summarize the activities and results achieved by the Keyword Search Working Group (WG2) of the KEYSTONE COST Action IC1302. We present the goals of WG2 and its main activities in the course of the action, and provide a summary of selected publications related to the WG2 goals and co-authored by WG2 members. We conclude with a summary of open research directions in the area of keyword search for structured data.

Julian Szymański, Elena Demidova

KEYSTONE WG3: Activities and Results Overview on User Interaction

The User Interaction WG investigates issues related to the semantic disambiguation of queries based on context and on keyword annotations with respect to reference ontologies, the development of languages for keyword searching, and the use of user feedback to improve results.

Omar Boucelma

KEYSTONE Activities and Results Overview on Training Schools

This chapter reports on the results and provides a brief overview of the topics addressed by the 25 lectures and 8 industrial talks given in the three Training Schools organized in the scope of the KEYSTONE (Semantic KEYword-based Search on sTructured data sOurcEs) COST action IC1302.

Charlie Abela, Antonio Fariña, Mihai Lupu, Raquel Trillo-Lado, José R. R. Viqueira

KEYSTONE Activities and Results Overview on Enabling Mobility & Fostering Collaborations Through STSM

This article gives an overview of the research mobility, connectivity and collaboration activities facilitated by the KEYSTONE COST Action IC1302 in the context of its Short-Term Scientific Missions (STSMs). It provides data and figures regarding all funded STSMs over the term of the action in terms of the funding mechanism used, distribution among members, and participating personnel and institutions, along with summaries of associated research projects.

Abdulhussain E. Mahdi

Backmatter
