
About this book

This, the 39th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains extended and revised versions of seven papers selected from the 37 contributions presented at the 28th International Conference on Database and Expert Systems Applications, DEXA 2017, held in Lyon, France, in August 2017. Topics covered include knowledge bases, clustering algorithms, parallel frequent itemset mining, model-driven engineering, virtual machines, recommendation systems, and federated SPARQL query processing.



Querying Interlinked Data by Bridging RDF Molecule Templates

Linked Data initiatives have encouraged the publication of a large number of RDF datasets created independently by different data providers. These datasets can be accessed using different Web interfaces, e.g., SPARQL endpoints; however, federated query engines are still required in order to provide an integrated view of these datasets. Given the large number of Web-accessible RDF datasets, SPARQL federated query engines implement query processing techniques to effectively select the relevant datasets that provide the data required to answer a query. Existing federated query engines usually utilize coarse-grained description methods where datasets are characterized based on their vocabularies or schema, and details about data in the dataset are ignored, e.g., classes, properties, or relations. This lack of source description may lead to the erroneous selection of data sources for a query, and to unnecessary retrieval of data and source communication, thus affecting the performance of query processing over the federation. We address the problem of federated SPARQL query processing and devise MULDER, a query engine for federations of RDF data sources. MULDER describes data sources in terms of an abstract description of entities belonging to the same RDF class, dubbed an RDF molecule template, and utilizes these templates for source selection, query decomposition, and optimization. We empirically study the performance and continuous efficiency of MULDER on existing benchmarks, and compare it with existing federated SPARQL query engines. The experimental results suggest that RDF molecule templates empower MULDER, and allow for a selection of RDF data sources that not only reduces execution time, but also increases answer completeness and continuous efficiency of MULDER.
Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, Sören Auer
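
To make the idea of template-based source selection more concrete, the following is a minimal sketch and not MULDER's actual implementation. It assumes that a molecule template can be reduced to an RDF class, the set of predicates its entities use, and the endpoint that serves them. The names `MoleculeTemplate` and `select_sources`, the prefixed predicate names, and the example.org URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeTemplate:
    """Simplified, assumed view of an RDF molecule template."""
    rdf_class: str          # class shared by the described entities
    predicates: frozenset   # predicates those entities are known to use
    endpoint: str           # endpoint that serves these entities


def select_sources(star_predicates, templates):
    """Return the endpoints whose templates cover every predicate
    of a star-shaped subquery (all triple patterns share one subject)."""
    needed = set(star_predicates)
    return sorted({t.endpoint for t in templates if needed <= t.predicates})


# Hypothetical federation of two endpoints.
templates = [
    MoleculeTemplate("dbo:Person",
                     frozenset({"foaf:name", "dbo:birthPlace"}),
                     "http://example.org/people/sparql"),
    MoleculeTemplate("dbo:Drug",
                     frozenset({"rdfs:label", "dbo:activeIngredient"}),
                     "http://example.org/drugs/sparql"),
]

selected = select_sources({"foaf:name", "dbo:birthPlace"}, templates)
```

In this sketch a coarse, vocabulary-only description could not tell the two endpoints apart if both used `rdfs:label`, whereas matching the whole predicate set against a class-level template narrows the choice to one endpoint.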

A Package-to-Group Recommendation Framework

Recommender systems are important information filtering techniques that retrieve interesting and personalized items for users based on their profiles and past activities. The goal of most recommender systems is to identify a ranked list of items that are likely to be of interest to users. However, there are several applications, such as trip planning, where the items to be selected are not intended for single users but for a group of users, and where the group members are interested in package recommendations as collections of items. Recent research on recommender systems has generalized recommendations to suggest packages of items to single users (package recommendations), and single items to groups of users (group recommendations). However, the package-to-group recommendation task has not gained much attention. In this paper, we focus on the task of recommending packages of items to groups of users. This is a task with several real-life scenarios, such as recommending a set of Points of Interest packages to tourist groups. We formally define the problem of top-k package-to-group recommendations and propose two models for estimating the preference of a group for a package, incorporating features such as package constraint, user impact and package viability. We design ranking algorithms for finding the top-k package-to-group recommendations and we compare our proposed models with baseline approaches stemming from related works. The experimental evaluation of our proposals, using the Yelp dataset, demonstrates that our models find packages of high quality considering important features of package-to-group recommendations.
Idir Benouaret, Dominique Lenne
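
The top-k package-to-group task can be illustrated with a small sketch. It is not one of the two models proposed in the paper. It assumes that a user's preference for a package is the mean of that user's item ratings, that the group preference is an impact-weighted average over users, and that the package constraint is a cost budget. The function names, the weights, and the toy data are hypothetical.

```python
from heapq import nlargest
from itertools import combinations


def group_score(package, ratings, impact):
    """Impact-weighted average of each user's mean rating for the package."""
    total_weight = sum(impact.values())
    score = 0.0
    for user, weight in impact.items():
        user_score = sum(ratings[user][item] for item in package) / len(package)
        score += weight * user_score
    return score / total_weight


def top_k_packages(items, cost, budget, size, ratings, impact, k):
    """Enumerate fixed-size packages within budget and keep the k best."""
    candidates = (
        p for p in combinations(sorted(items), size)
        if sum(cost[i] for i in p) <= budget
    )
    return nlargest(k, candidates, key=lambda p: group_score(p, ratings, impact))


# Toy data: three points of interest, two users with equal impact.
cost = {"a": 1, "b": 2, "c": 3}
ratings = {
    "u1": {"a": 5, "b": 3, "c": 1},
    "u2": {"a": 4, "b": 4, "c": 2},
}
impact = {"u1": 1.0, "u2": 1.0}

best = top_k_packages(cost.keys(), cost, budget=4, size=2,
                      ratings=ratings, impact=impact, k=2)
```

Exhaustive enumeration is only workable for very small item sets; it is used here to keep the ranking logic visible.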

Statistical Relation Cardinality Bounds in Knowledge Bases

There is an increasing number of Semantic Web knowledge bases (KBs) available on the Web, created in academia and industry alike. In this paper, we address the problem of lack of structure in these KBs due to their schema-free nature, which is required for open environments such as the Web. Relation cardinality is an important structural aspect of data that has not received enough attention in the context of KBs. We propose a definition for relation cardinality bounds that can be used to unveil the structure that KB data naturally exhibit. Information about relation cardinalities, such as that a person can have two parents and zero or more children, that a book should have at least one author, or that a country should have more than two cities, can be useful for data users and knowledge engineers when writing queries and reusing or engineering KB systems. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties in the domain of knowledge; however, their declaration is optional and consistency with the instance data is not ensured. We first address the problem of mining relation cardinality bounds by proposing an algorithm that normalises and filters the data to ensure the accuracy and robustness of the mined cardinality bounds. Then we show how these bounds can be used to assess two relevant data quality dimensions: consistency and completeness. Finally, we report that relation cardinality bounds can also be used to expose structural characteristics of a KB by mapping the bounds into a constraint language to declare the actual shape of data.
Emir Muñoz, Matthias Nickles
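
As a rough illustration of what mining a cardinality bound means, the sketch below counts distinct objects per subject for one predicate and reports the observed minimum and maximum. It is a simplification under stated assumptions and not the paper's algorithm: the normalisation and outlier filtering steps are omitted, and the function name, the predicate `hasParent`, and the triples are hypothetical.

```python
from collections import defaultdict


def cardinality_bounds(triples, predicate, subjects):
    """Observed (min, max) number of distinct objects per subject.

    `subjects` lists the entities of interest, so that entities with
    no value for the predicate contribute a count of zero.
    Assumes `subjects` is non-empty.
    """
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)  # a set ignores duplicate triples
    counts = [len(values[s]) for s in subjects]
    return min(counts), max(counts)


triples = [
    ("alice", "hasParent", "bob"),
    ("alice", "hasParent", "carol"),
    ("alice", "hasParent", "carol"),   # duplicate statement
    ("dave", "hasParent", "eve"),
    ("dave", "knows", "alice"),
]

bounds = cardinality_bounds(triples, "hasParent", ["alice", "dave", "frank"])
```

A lower bound of zero here comes from `frank`, who has no recorded parent. Whether that reflects the domain or incomplete data is exactly the kind of question the completeness assessment described above is meant to address.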

ETL Processes in the Era of Variety

Nowadays, we are living in an open and connected world, where small, medium and large companies are seeking to integrate data from various data sources to satisfy the requirements of new applications such as delivering real-time alerts and triggering automated actions, complex system failure detection, anomaly detection, etc. The process of getting these data from their sources to their home system in an efficient and correct manner is known as data ingestion, usually referred to as Extract, Transform, Load (ETL), which has been widely studied in data warehouses. In the context of rapidly changing technology and the explosion of data sources, ETL processes have to consider two main issues: (a) the variety of data sources, which spans traditional, XML, semantic, graph databases, etc., and (b) the variety of storage platforms, where the home system may have several stores (known as a polystore), each of which hosts a particular type of data. These issues directly impact the efficiency and the deployment flexibility of ETL. In this paper, we deal with these issues. Firstly, thanks to Model Driven Engineering, we make different types of data sources generic. This genericity allows overloading the ETL operators for each type of source. It is illustrated by considering three of the most popular types of data sources: relational, semantic and graph databases. Secondly, we show the impact of the genericity of operators on the ETL workflow, where a Web-service-driven approach for orchestrating the ETL flows is given. Thirdly, the extracted and merged data obtained by the ETL workflow are deployed according to their preferred stores. Finally, our findings are validated through a proof-of-concept tool using the LUBM semantic database and the Yago graph deployed in Oracle RDF Semantic Graph 12c.
Nabila Berkani, Ladjel Bellatreche, Laurent Guittet

eVM: An Event Virtual Machine Framework

Information and communication technology (ICT) is impacting our daily lives more than ever before. Many existing applications guide users in their daily activities (e.g., navigation through traffic, health monitoring, managing home comfort, socializing with others). Although these applications are different in terms of purpose and application domain, they all detect events and propose actions and decision-making aids to users. However, there is no common backbone for event detection that can be instantiated, re-used, and reconfigured in different use cases. In this paper, we propose eVM, a generic event Virtual Machine able to detect events in different contexts while allowing domain experts to model and define the targeted events prior to detection. eVM simultaneously considers the various features of the defined events (e.g., temporal, geographical), and uses the latter to detect different feature-centric events (e.g., time-centric, location-centric). eVM is based on different components (an event query language, a query compiler, an event detection core, etc.), but mainly the event detection modules are detailed here. We show that eVM is re-usable in different contexts and that the performance of our prototype is quasi-linear in most cases. Our experimental results show that the detection accuracy is improved when, besides spatio-temporal information, other features are considered.
Elio Mansour, Richard Chbeir, Philippe Arnould

Interactive Exploration of Subspace Clusters on Multicore Processors

The PreDeCon clustering algorithm finds arbitrarily shaped clusters in high-dimensional feature spaces, which remains an active research topic with many potential applications. However, it suffers from poor runtime performance, as well as a lack of user interaction. Our new method AnyPDC introduces a novel approach to cope with these problems by casting PreDeCon into an anytime algorithm. In this anytime scheme, it quickly produces an approximate result and iteratively refines it toward the result of PreDeCon at the end. AnyPDC not only significantly speeds up PreDeCon clustering but also allows users to interact with the algorithm during its execution. Moreover, by maintaining an underlying cluster structure consisting of so-called primitive clusters and by block processing of neighborhood queries, AnyPDC can be efficiently executed in parallel on shared memory architectures such as multi-core processors. Experiments on large real-world datasets show that AnyPDC achieves high-quality approximate results early on, leading to orders of magnitude speedup compared to PreDeCon. Moreover, while anytime techniques are usually slower than batch ones, the algorithmic solution in AnyPDC is actually faster than PreDeCon even if run to the end. AnyPDC also scales well with the number of threads on multi-core CPUs.
The Hai Pham, Jesper Kristensen, Son T. Mai, Ira Assent, Jon Jacobsen, Bay Vo, Anh Le

MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets

Mining frequent itemsets in large datasets using the MapReduce programming model has received much attention in recent years. For instance, many efficient Frequent Itemset Mining (FIM) algorithms, such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat, have been parallelized following the MapReduce principle. However, most approaches focus on job partitioning and/or load balancing without considering the extensibility depending on required memory assumptions. A challenge in designing parallel FIM algorithms therefore consists in finding ways to guarantee that the data structures used during the mining process always fit in the local memory of the processing nodes during all computation steps. In this paper, we propose MapFIM+, a two-phase approach to frequent itemset mining in very large datasets that benefits both from a MapReduce-based distributed Apriori method and from local in-memory FIM methods. In our approach, MapReduce is first used to generate frequent itemsets until the prefix-projected databases fit in local memory, and an optimized local in-memory mining process is then launched to generate all remaining frequent itemsets from each prefix-projected database on individual processing nodes. MapFIM+ improves our previous algorithm MapFIM by using an exact evaluation of prefix-projected database sizes during the MapReduce phase. This improvement makes MapFIM+ more efficient, especially for databases leading to huge candidate sets, by significantly reducing communication and disk I/O costs. Performance evaluation shows that MapFIM+ is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
Khanh-Chuong Duong, Mostafa Bamha, Arnaud Giacometti, Dominique Li, Arnaud Soulet, Christel Vrain
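
For readers unfamiliar with the level-wise Apriori scheme that the first phase builds on, here is a minimal single-machine sketch. It is not MapFIM+: there is no MapReduce distribution, no prefix-projected database, and no memory-aware switch to local mining. The function name and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

result = apriori(transactions, min_support=3)
```

Each pass over `level` corresponds to one round of candidate counting. The candidate set can grow quickly at each level, which is the memory pressure the abstract above refers to.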

