
About this book

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core topics in computer science. Since the 1990s, the Internet has been the main driving force behind application development in all domains. The growing demand for sharing resources (e.g., computing resources, services, metadata, data sources) across networked sites has driven the evolution of data- and knowledge-management systems from centralized to decentralized architectures that enable large-scale, highly scalable distributed applications.

This, the 44th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains six fully revised and extended papers selected from the 35th conference on Data Management – Principles, Technologies and Applications (BDA 2019). The topics covered include big data, graph data streams, workflow execution in the cloud, privacy in crowdsourcing, secure distributed computing, machine learning, and data mining for recommendation systems.



Scalable Saturation of Streaming RDF Triples

In the Big Data era, RDF data are produced in high volumes. While there exist proposals for reasoning over large RDF graphs using big data platforms, there is a dearth of solutions that do so in environments where RDF data are dynamic and new instance and schema triples can arrive at any time. In this work, we present the first solution for reasoning over large streams of RDF data using big data platforms. In doing so, we focus on the saturation operation, which seeks to infer implicit RDF triples given RDF Schema or OWL constraints. Unlike existing solutions, which saturate RDF data in bulk, our solution carefully identifies the fragment of the existing (and already saturated) RDF dataset that needs to be considered given the fresh RDF statements delivered by the stream. It thereby performs the saturation incrementally. Experimental analysis shows that our solution outperforms existing bulk-based saturation solutions.
Mohammad Amin Farvardin, Dario Colazzo, Khalid Belhajjame, Carlo Sartiani
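
The incremental idea described in the abstract can be illustrated with a minimal sketch: a toy in-memory fixpoint over two RDFS entailment rules, applied only to freshly arrived triples rather than the whole dataset. This is an assumption-laden illustration, not the authors' big-data implementation.

```python
# Toy incremental RDFS saturation: only rules that can fire on the fresh
# triples are applied, and newly derived triples are fed back until a
# fixpoint is reached, instead of re-saturating the whole dataset in bulk.

def saturate_increment(saturated, fresh):
    """saturated: set of (s, p, o) triples, already closed under the rules.
    fresh: newly arrived triples. Returns the updated saturated set."""
    todo = set(fresh) - saturated
    while todo:
        t = todo.pop()
        saturated.add(t)
        s, p, o = t
        derived = set()
        # rdfs9: (x rdf:type C) + (C rdfs:subClassOf D) => (x rdf:type D)
        if p == "rdf:type":
            for (c, p2, d) in saturated:
                if p2 == "rdfs:subClassOf" and c == o:
                    derived.add((s, "rdf:type", d))
        if p == "rdfs:subClassOf":
            # rdfs11: subClassOf transitivity, in both join directions
            for (c, p2, d) in saturated:
                if p2 == "rdfs:subClassOf":
                    if d == s:
                        derived.add((c, "rdfs:subClassOf", o))
                    if c == o:
                        derived.add((s, "rdfs:subClassOf", d))
            # a fresh schema triple may also retype existing instances
            for (x, p2, c) in saturated:
                if p2 == "rdf:type" and c == s:
                    derived.add((x, "rdf:type", o))
        todo |= derived - saturated
    return saturated
```

The point of the sketch is that the work done per stream batch is proportional to the triples touched by the fresh statements, mirroring the fragment-identification step the abstract describes.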

Efficient Execution of Scientific Workflows in the Cloud Through Adaptive Caching

Many scientific experiments are now carried out using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to variations in task execution times and output data sizes. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to a factor of 3.5 with 6 workflow re-executions.
Gaëtan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu, Patrick Valduriez
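
The caching trade-off the abstract describes can be illustrated with a hypothetical sketch; the class name `AdaptiveCache` and the compute-cost-per-byte heuristic are assumptions for illustration, not the OpenAlea implementation.

```python
# Sketch: cache a task's intermediate output only when recomputing it is
# expensive relative to the storage it would occupy, adapting to observed
# execution time and output size.
import hashlib
import pickle
import time

class AdaptiveCache:
    def __init__(self, min_ratio=1e-6):
        self.store = {}             # cache key -> stored output
        self.min_ratio = min_ratio  # seconds of compute per byte stored

    def _key(self, task_name, inputs):
        # identical task + identical inputs => identical key => reuse
        blob = pickle.dumps((task_name, inputs))
        return hashlib.sha256(blob).hexdigest()

    def run(self, task_name, func, inputs):
        key = self._key(task_name, inputs)
        if key in self.store:       # cache hit: skip task re-execution
            return self.store[key]
        t0 = time.perf_counter()
        out = func(*inputs)
        elapsed = time.perf_counter() - t0
        size = len(pickle.dumps(out))
        # adaptive decision: cache only if compute cost justifies the space
        if elapsed / max(size, 1) >= self.min_ratio:
            self.store[key] = out
        return out
```

A real system would also account for storage and transfer prices in the cloud; the sketch only shows the shape of the decision.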

From Task Tuning to Task Assignment in Privacy-Preserving Crowdsourcing Platforms

Specialized worker profiles on crowdsourcing platforms may contain a large amount of identifying and possibly sensitive personal information (e.g., personal preferences, skills, available slots, available devices), raising strong privacy concerns. This has led to the design of privacy-preserving crowdsourcing platforms, which aim to enable efficient crowdsourcing processes while providing strong privacy guarantees even when the platform is not fully trusted. In this paper, we propose two contributions. First, we propose the PKD algorithm, with the goal of supporting a large variety of aggregate usages of worker profiles within a privacy-preserving crowdsourcing platform. The PKD algorithm combines homomorphic encryption and differential privacy to compute (perturbed) partitions of the multi-dimensional skill space of the actual population of workers and a (perturbed) COUNT of workers per partition. Second, we propose to benefit from recent progress in Private Information Retrieval (PIR) techniques in order to design a solution to task assignment that is both private and affordable. We perform an in-depth study of the problem of using PIR techniques for proposing tasks to workers, show that it is NP-hard, and come up with the PKD PIR Packing heuristic, which groups tasks together according to the partitioning output by the PKD algorithm. In a nutshell, we design the PKD algorithm and the PKD PIR Packing heuristic, formally prove their security against honest-but-curious workers and/or platform, analyze their complexities, and demonstrate their quality and affordability in real-life scenarios through an extensive experimental evaluation performed over both synthetic and realistic datasets.
Joris Duguépéroux, Tristan Allard
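
The differential-privacy half of the perturbed COUNT step can be sketched as follows; the homomorphic-encryption layer is omitted, and `partition_of` and the function names are hypothetical, not the paper's PKD code.

```python
# Sketch: perturb the per-partition worker COUNT with Laplace noise of
# scale 1/epsilon. Adding or removing one worker changes exactly one count
# by at most 1 (sensitivity 1), which yields epsilon-differential privacy.
import math
import random

def laplace_noise(scale):
    # inverse-CDF sampling of a Laplace(0, scale) variate
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturbed_counts(partition_of, workers, partitions, epsilon):
    """partition_of: maps a worker profile to its partition id.
    Returns a noisy COUNT of workers for every partition id."""
    counts = {p: 0.0 for p in partitions}
    for w in workers:
        counts[partition_of(w)] += 1
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon
    return {p: c + laplace_noise(scale) for p, c in counts.items()}
```

In the actual protocol the counts are aggregated under encryption before any party can see them; the sketch shows only the noise calibration.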

Secure Distributed Queries over Large Sets of Personal Home Boxes

Smart disclosure initiatives and new regulations such as the GDPR allow individuals to regain control over their data by gathering their entire digital life in a Personal Data Management System (PDMS). Multiple PDMS architectures exist and differ in their ability to preserve data privacy and to perform collective computations crossing the data of multiple individuals (e.g., epidemiological or social studies), but none of them satisfies both objectives. The emergence of Trusted Execution Environments (TEEs) changes the game. We propose a solution called Trusted PDMS, combining TEE and PDMS properties to manage the data of each individual, and a complete framework to execute collective computations on top of them, with strong privacy and fault-tolerance guarantees. We demonstrate the practicality of the solution through a real case study being conducted over 10,000 patients in the healthcare field.
Riad Ladjel, Nicolas Anciaux, Philippe Pucheral, Guillaume Scerri

Evaluating Classification Feasibility Using Functional Dependencies

With the vast number of available tools and libraries for data science, it has never been easier to make use of classification algorithms: a few lines of code are enough to apply dozens of algorithms to any dataset. It is therefore “super easy” for data scientists to produce machine learning (ML) models in a very limited time. On the other hand, domain experts may have the impression that such ML models are just a black box, almost magic, that works on any dataset without their really understanding why. For this reason, which relates to the interpretability of machine learning, there is an urgent need to reconcile domain experts with ML models and to identify, at the right level of abstraction, techniques that involve them in ML model construction.
In this paper, we address this notion of trusting ML models by using data dependencies. We argue that functional dependencies characterize the existence of the function that a classification algorithm seeks to define. From this simple yet crucial remark, we make several contributions. First, we show how functional dependencies can give a tight upper bound on classification accuracy, leading to impressive experimental results on UCI datasets with state-of-the-art ML methods. Second, we point out how to generate very difficult synthetic datasets for classification, showing that for some datasets it makes no sense to use ML methods at all. Third, we propose a practical and scalable solution to assess the existence of a function before applying ML techniques, making it possible to take real-life data into account and to keep domain experts in the loop.
Marie Le Guilly, Jean-Marc Petit, Vasile-Marian Scuturici
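
The central observation, that features which fail to functionally determine the class cap the achievable accuracy, can be sketched in a few lines. This is a minimal illustration of the bound, not the paper's scalable solution.

```python
# Sketch of the accuracy upper bound implied by the functional-dependency
# view: any classifier is a function of the features x, so within a group
# of rows sharing the same x it can be right on at most the majority class.
from collections import Counter, defaultdict

def accuracy_upper_bound(rows):
    """rows: list of (x, y) pairs, x a hashable feature tuple, y a label.
    Returns the best accuracy any deterministic classifier can reach."""
    groups = defaultdict(Counter)
    for x, y in rows:
        groups[x][y] += 1
    best = sum(max(counts.values()) for counts in groups.values())
    return best / len(rows)
```

When the functional dependency X → y holds, every group is pure and the bound is 1.0; conflicting rows with identical features pull the bound below 1.0 regardless of the ML method used.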

Enabling Decision Support Through Ranking and Summarization of Association Rules for TOTAL Customers

Our focus in this experimental analysis paper is to investigate existing measures for ranking association rules and to understand how they can be augmented to enable real-world decision support as well as personalized customer recommendations. For example, by analyzing receipts of TOTAL customers, one can find that customers who buy windshield wash also buy engine oil and energy drinks, or that middle-aged customers from the South of France subscribe to a car wash program. Such actionable insights can immediately guide business decision making, e.g., product promotion, product recommendation, or targeted advertising. We present an analysis of 30 million unique sales receipts, spanning 35 million records, by almost 1 million customers, generated at 3,463 gas stations over three years. Our finding is that the 35 commonly used measures for ranking association rules, such as Confidence and Piatetsky-Shapiro, can be summarized into 5 clusters based on similarity in their rankings. We then use one representative measure from each cluster to run a user study with a data scientist and a product manager at TOTAL. Our analysis draws actionable insights to enable decision support for TOTAL decision makers: rules that favor Confidence are best for determining which products to recommend, while rules that favor Recall are well suited to finding customer segments to target. Finally, we present how association rules using the representative measures can provide customers with personalized product recommendations.
Idir Benouaret, Sihem Amer-Yahia, Senjuti Basu Roy, Christiane Kamdem-Kengne, Jalil Chagraoui
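
Two of the rule-ranking measures named in the abstract, Confidence and Recall, can be computed as follows for a rule X ⇒ Y over transactions represented as item sets (a minimal sketch, not the paper's evaluation pipeline).

```python
# Standard association-rule measures for a rule X => Y:
# Confidence = P(Y | X), how precise the rule is on the customers it fires on;
# Recall     = P(X | Y), how much of Y's buyers the rule covers.

def support(transactions, itemset):
    # fraction of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, X, Y):
    return support(transactions, X | Y) / support(transactions, X)

def recall(transactions, X, Y):
    return support(transactions, X | Y) / support(transactions, Y)
```

With such definitions, a product-recommendation use case would rank rules by `confidence`, while a customer-segmentation use case would rank them by `recall`, matching the split the user study reports.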

