
## About this book

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing (e.g., computing resources, services, metadata, data sources) across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability.

This, the 46th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains six fully revised selected regular papers. Topics covered include an elastic framework for genomic data management, medical data cloud federations, temporal pattern mining, scalable schema discovery, load shedding, and selectivity estimation using linked Bayesian networks.

## Table of contents

### Extracting Insights: A Data Centre Architecture Approach in Million Genome Era

Abstract
Advances in high-throughput sequencing technologies have drastically reduced the price of genome sequencing and led to exponential growth in genomic sequencing data. Genomic data is often stored in shared repositories and is both heterogeneous and unstructured. Technically and culturally, it belongs to the big data domain because of the challenges of volume, velocity and variety.
Appropriate models for data storage, management, processing and analytics are required to meet the growing challenges of genomic and clinical data. Existing research on the storage, management and analysis of genomic and clinical data does not provide a comprehensive solution: it offers either a Hadoop-based solution that lacks robust computing support for data mining and knowledge discovery, or a distributed in-memory solution that reduces runtime effectively but lacks robustness in data storage, resource management, reservation, and scheduling.
In this paper, we present a scalable and elastic framework for genomic data storage, management, and processing that addresses the weaknesses of existing approaches. Fundamental to our framework is a distributed resource management system with a plug-and-play NoSQL component and an in-memory, distributed computing framework with machine learning and visualisation plugin tools. We evaluate the Avro, CSV, HBase, ORC, and Parquet datastores and benchmark their performance. A case study of machine-learning-based genotype clustering is presented to demonstrate and evaluate the effectiveness of the presented framework. The results show an overall performance improvement of the genomics data analysis pipeline of 49% over existing approaches. Finally, we make recommendations on state-of-the-art technologies and tools for effective architecture approaches to the management of, and knowledge discovery from, large datasets.
Tariq Abdullah, Ahmed Ahmet

### Dynamic Estimation and Grid Partitioning Approach for Multi-objective Optimization Problems in Medical Cloud Federations

Abstract
Data sharing is important in the medical domain. Sharing data allows large-scale analysis over many data sources, which provides more accurate results. Cloud federations can facilitate the sharing of medical data stored on different cloud platforms, such as Amazon and Microsoft. The pay-as-you-go model in cloud federations raises important Multi-Objective Optimization Problems (MOOPs) related to users’ preferences, such as response time and monetary cost. However, optimizing a query in a cloud federation becomes more complex as variety increases, especially because of the wide range of communication and pricing models. The variety of virtual machine configurations also adds to the complexity of generating the space of candidate solutions. Indeed, in such a context, it is difficult to provide accurate estimations and optimal solutions to make relevant decisions. The first challenge is how to estimate accurate parameter values for MOOPs in a cloud federation consisting of different sites. To address this estimation problem, we present the Dynamic Regression Algorithm (DREAM). DREAM focuses on reducing the size of historical data while maintaining the estimation accuracy. The second challenge is how to find an approximate optimal solution in MOOPs using an efficient Multi-Objective Optimization algorithm. To address this problem, we present Non-dominated Sorting Genetic Algorithms based on Grid partitioning (NSGA-G) for MOOPs. The proposed algorithm is integrated into the Intelligent Resource Scheduler, a solution for heterogeneous databases, to solve MOOPs in cloud federations. We validate our algorithms with experiments on a decision support benchmark.
Trung-Dung Le, Verena Kantere, Laurent d’Orazio
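
As a minimal illustration of the multi-objective setting this abstract describes, the following Python sketch reduces a set of candidate query plans to its non-dominated (Pareto-optimal) subset over two objectives, response time and monetary cost. It is neither DREAM nor NSGA-G: the `Plan` type, the plan names, and the numbers are hypothetical, and both objectives are assumed to be minimized.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    name: str             # hypothetical plan label
    response_time: float  # seconds, lower is better (assumption)
    cost: float           # monetary cost, lower is better (assumption)


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.response_time <= b.response_time and a.cost <= b.cost
    strictly_better = a.response_time < b.response_time or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep only the plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


candidates = [
    Plan("A", 10.0, 5.0),
    Plan("B", 8.0, 7.0),
    Plan("C", 12.0, 6.0),  # slower and more expensive than A
    Plan("D", 8.0, 9.0),   # as fast as B but more expensive
]
front = pareto_front(candidates)
```

Here `front` keeps plans A and B: neither is better than the other on both objectives, so choosing between them depends on the user's preference between time and money.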

### Temporal Pattern Mining for E-commerce Dataset

Abstract
Over the last few years, several data mining algorithms have been developed to understand customers’ behaviors in e-commerce platforms. They aim to extract knowledge and predict future actions on the website. In this paper we present three algorithms: SEPM−, SEPM+ and SEPM++ (Sequential Event Pattern Mining), for mining sequential frequent patterns. Our goal is to mine clickstream data to extract and analyze useful sequential patterns of clicks. For this purpose, we augment the vertical representation of patterns with additional information about the items’ duration. Then based on this representation, we propose the necessary algorithms to mine sequential frequent patterns with the average duration of each of their items. Also, the direction of durations’ variation in the sequence is taken into account by the algorithms. This duration is used as a proxy of the interest of the user in the content of the page. Finally, we categorize the resulting patterns and we prove that they are more discriminating than the standard ones. Our approach is tested on real data, and patterns found are analyzed to extract users’ discriminatory behaviors. The experimental results on both real and synthetic datasets indicate that our algorithms are efficient and scalable.
Mohamad Kanaan, Remy Cazabet, Hamamache Kheddouci

### Scalable Schema Discovery for RDF Data

Abstract
The semantic web provides access to an increasing number of linked datasets expressed in RDF. One feature of these datasets is that they are not constrained by a schema. Such a schema could be very useful, as it helps users understand the structure of the entities and can ease the exploitation of the dataset. Several works have proposed clustering-based schema discovery approaches that provide good-quality schemas, but their ability to process very large RDF datasets is still a challenge. In this work, we address the problem of automatic schema discovery, focusing on scalability issues. We introduce an approach, relying on a scalable density-based clustering algorithm, which provides the classes composing the schema of a large dataset. We propose a novel distribution method which splits the initial dataset into subsets, and we provide a scalable design of our algorithm to process these subsets efficiently in parallel. We present a thorough experimental evaluation showing the effectiveness of our proposal.
Redouane Bouhamoum, Zoubida Kedad, Stéphane Lopes

### Load-Aware Shedding in Stream Processing Systems

Abstract
Distributed stream processing systems are gaining momentum as a tool to perform analytics on continuous data streams. Load shedding is a technique used to handle unpredictable spikes in the input load whenever available computing resources are not adequately provisioned. In this paper, we propose Load-Aware Shedding (LAS), a novel load shedding solution that, unlike previous works, relies neither on a pre-defined cost model nor on any assumption about the tuple execution duration. Leveraging sketches, LAS efficiently estimates the execution duration of each tuple with small error bounds and uses this knowledge to proactively shed input streams at any operator, limiting queuing latencies while dropping as few tuples as possible. We provide a theoretical analysis proving that LAS is an $(\varepsilon, \delta)$-approximation of the optimal online load shedder. Furthermore, through an extensive practical evaluation based on simulations and a prototype, we evaluate its impact on stream processing applications.
Nicoló Rivetti, Yann Busnel, Leonardo Querzoni
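
The idea of estimating per-tuple execution time from a sketch and shedding on that estimate can be illustrated with a short Python sketch. This is not the LAS algorithm from the paper: `DurationSketch`, `admit`, the matrix sizes, and the per-window time budget are all assumptions made for illustration, built on a Count-Min-style structure that tracks hit counts and summed durations per cell.

```python
import hashlib
from typing import Iterable, Iterator


class DurationSketch:
    """Count-Min-style matrices: freq counts tuples, work sums their durations."""

    def __init__(self, rows: int = 4, cols: int = 64) -> None:
        self.rows, self.cols = rows, cols
        self.freq = [[0] * cols for _ in range(rows)]
        self.work = [[0.0] * cols for _ in range(rows)]

    def _col(self, row: int, key: str) -> int:
        # One salted hash per row; blake2b gives the same result on every run.
        digest = hashlib.blake2b(key.encode(), salt=bytes([row]), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.cols

    def update(self, key: str, duration: float) -> None:
        """Record one observed execution duration for a tuple with this key."""
        for r in range(self.rows):
            c = self._col(r, key)
            self.freq[r][c] += 1
            self.work[r][c] += duration

    def estimate(self, key: str, default: float = 0.0) -> float:
        """Mean duration in the least-populated cell the key maps to."""
        cells = []
        for r in range(self.rows):
            c = self._col(r, key)
            cells.append((self.freq[r][c], self.work[r][c]))
        f, w = min(cells)  # fewest hits means least collision noise
        return w / f if f else default


def admit(stream: Iterable[str], sketch: DurationSketch, budget: float) -> Iterator[str]:
    """Yield tuples while their estimated total duration fits the window budget."""
    used = 0.0
    for key in stream:
        est = sketch.estimate(key)
        if used + est > budget:
            continue  # shed this tuple
        used += est
        yield key


sketch = DurationSketch()
for _ in range(100):
    sketch.update("cheap", 1.0)
    sketch.update("expensive", 10.0)
admitted = list(admit(["expensive", "cheap", "expensive", "cheap"], sketch, budget=12.0))
```

Because the shedder works from estimated durations, not from a fixed cost model, it can drop one expensive tuple and keep several cheap ones inside the same budget.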

### Selectivity Estimation with Attribute Value Dependencies Using Linked Bayesian Networks

Abstract
Relational query optimisers rely on cost models to choose between different query execution plans. Selectivity estimates are known to be a crucial input to the cost model. In practice, standard selectivity estimation procedures are prone to large errors. This is mostly because they rely on the so-called attribute value independence and join uniformity assumptions. Therefore, multidimensional methods have been proposed to capture dependencies between two or more attributes both within and across relations. However, these methods require a large computational cost which makes them unusable in practice. We propose a method based on Bayesian networks that is able to capture cross-relation attribute value dependencies with little overhead. Our proposal is based on the assumption that dependencies between attributes are preserved when joins are involved. Furthermore, we introduce a parameter for trading between estimation accuracy and computational cost. We validate our work by comparing it with other relevant methods on a large workload derived from the JOB and TPC-DS benchmarks. Our results show that our method is an order of magnitude more efficient than existing methods, whilst maintaining a high level of accuracy.
Max Halford, Philippe Saint-Pierre, Franck Morvan
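
To make the attribute value independence (AVI) assumption concrete, the following Python sketch compares three selectivity figures for a conjunctive predicate on a tiny table with two correlated attributes: the AVI estimate, the estimate from the chain-rule factorization P(make) · P(model | make), and the true selectivity. The table and attribute names are hypothetical, and the two-node factorization is only an illustration of why modelling dependencies helps; it is not the linked Bayesian network method proposed in the paper.

```python
# Hypothetical (make, model) rows in which the model determines the make.
rows = [
    ("Toyota", "Corolla"), ("Toyota", "Corolla"), ("Toyota", "Yaris"),
    ("Honda", "Civic"), ("Honda", "Civic"), ("Honda", "Accord"),
    ("Ford", "Focus"), ("Ford", "Fiesta"),
]
n = len(rows)

# Predicate: make = 'Honda' AND model = 'Civic'
n_make = sum(make == "Honda" for make, _ in rows)
n_model = sum(model == "Civic" for _, model in rows)
n_both = sum(row == ("Honda", "Civic") for row in rows)

sel_make = n_make / n    # P(make = Honda)
sel_model = n_model / n  # P(model = Civic)

# AVI treats the two predicates as independent and multiplies them.
avi_estimate = sel_make * sel_model

# Chain rule with a conditional distribution: P(make) * P(model | make).
dependency_estimate = sel_make * (n_both / n_make)

true_selectivity = n_both / n

print(f"AVI estimate:        {avi_estimate:.4f}")
print(f"Dependency estimate: {dependency_estimate:.4f}")
print(f"True selectivity:    {true_selectivity:.4f}")
```

On this data the AVI estimate is 3/8 × 2/8 ≈ 0.094, while the true selectivity is 2/8 = 0.25, which the conditional factorization recovers. The gap is the kind of underestimation that arises when correlated attributes are treated as independent.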

### Backmatter
