main-content

## Über dieses Buch

This, the 24th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains extended and revised versions of seven papers presented at the 25th International Conference on Database and Expert Systems Applications, DEXA 2014, held in Munich, Germany, in September 2014. Following the conference, and two further rounds of reviewing and selection, six extended papers and one invited keynote paper were chosen for inclusion in this special issue. Topics covered include systems modeling, similarity search, bioinformatics, data pricing, k-nearest neighbor querying, database replication, and data anonymization.

## Inhaltsverzeichnis

### Reflective Constraint Writing

A Symbolic Viewpoint of Modeling Languages
Abstract
In this article we show how to extend object constraint languages by reflection. We choose OCL (Object Constraint Language) and extend it by operators for reification and reflection. We show how to give precise semantics to the extended language OCL$$_{R}$$ by elaborating the necessary type derivation rules and value specifications. A driving force for the introduction of reflection capabilities into a constraint language is the investigation of semantics and pragmatics of modeling constructs. We exploit the resulting reflective constraint language in modeling domains including sets of sets of domain objects. We give precise semantics to UML power types. We carve out the notion of sustainable constraint writing which is about making models robust against unwanted updates. Reflective constraints are an enabler for sustainable constraint writing. We discuss the potential of sustainable constraint writing for emerging tools and technologies. For this purpose, we need to introduce a symbolic viewpoint of information system modeling.
Dirk Draheim

### PPP-Codes for Large-Scale Similarity Searching

Abstract
Many current applications need to organize data with respect to mutual similarity between data objects. A typical general strategy to retrieve objects similar to a given sample is to access and then refine a candidate set of objects. We propose an indexing and search technique that can significantly reduce the candidate set size by combination of several space partitionings. Specifically, we propose a mapping of objects from a generic metric space onto main memory codes using several pivot spaces; our search algorithm first ranks objects within each pivot space and then aggregates these rankings producing a candidate set reduced by two orders of magnitude while keeping the same answer quality. Our approach is designed to well exploit contemporary HW: (1) larger main memories allow us to use rich and fast index, (2) multi-core CPUs well suit our parallel search algorithm, and (3) SSD disks without mechanical seeks enable efficient selective retrieval of candidate objects. The gain of the significant candidate set reduction is paid by the overhead of the candidate ranking algorithm and thus our approach is more advantageous for datasets with expensive candidate set refinement, i.e. large data objects or expensive similarity function. On real-life datasets, the search time speedup achieved by our approach is by factor of two to five.
David Novak, Pavel Zezula

### Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Abstract
Heterogeneity of data and data formats in bioinformatics entail mismatches between inputs and outputs of different services, making it difficult to compose them into workflows. To reduce those mismatches, bioinformatics platforms propose ad’hoc converters, called shims. When shims are written by hand, they are time-consuming to develop, and cannot anticipate all needs. When shims are automatically generated, they miss transformations, for example data composition from multiple parts, or parallel conversion of list elements.
This article proposes to systematically detect convertibility from output types to input types. Convertibility detection relies on a rule system based on abstract types, close to XML Schema. Types allow to abstract data while precisely accounting for their composite structure. Detection is accompanied by an automatic generation of converters between input and output XML data. We show the applicability of our approach by abstracting concrete bioinformatics types (e.g., complex biosequences) for a number of bioinformatics services (e.g., blast). We illustrate how our automatically generated converters help to resolve data mismatches when composing workflows. We conducted an experiment on bioinformatics services and datatypes, using an implementation of our approach, as well as a survey with domain experts. The detected convertibilities and produced converters were validated as relevant from a biological point of view. Furthermore the automatically produced graph of potentially compatible services exhibited a connectivity higher than with the ad’hoc approaches. Indeed, the experts discovered unknown possible connexions.
Mouhamadou Ba, Sébastien Ferré, Mireille Ducassé

### A Framework for Sampling-Based XML Data Pricing

Abstract
While price and data quality should define the major trade-off for consumers in data markets, prices are usually prescribed by vendors and data quality is not negotiable. In this paper we study a model where data quality can be traded for a discount. We focus on the case of XML documents and consider completeness as the quality dimension.
In our setting, the data provider offers an XML document, and sets both the price of the document and a weight to each node of the document, depending on its potential worth. The data consumer proposes a price. If the proposed price is lower than that of the entire document, then the data consumer receives a sample, i.e., a random rooted subtree of the document whose selection depends on the discounted price and the weight of nodes. By requesting several samples, the data consumer can iteratively explore the data in the document.
We present a pseudo-polynomial time algorithm to select a rooted subtree with prescribed weight uniformly at random, but show that this problem is unfortunately intractable. Yet, we are able to identify several practical cases where our algorithm runs in polynomial time. The first case is uniform random sampling of a rooted subtree with prescribed size rather than weights; the second case restricts to binary weights.
As a more challenging scenario for the sampling problem, we also study the uniform sampling of a rooted subtree of prescribed weight and prescribed height. We adapt our pseudo-polynomial time algorithm to this setting and identify tractable cases.
Ruiming Tang, Antoine Amarilli, Pierre Senellart, Stéphane Bressan

### kdANN+: A Rapid AkNN Classifier for Big Data

Abstract
A k-nearest neighbor (kNN) query determines the k nearest points, using distance metrics, from a given location. An all k-nearest neighbor (AkNN) query constitutes a variation of a kNN query and retrieves the k nearest points for each point inside a database. Their main usage resonates in spatial databases and they consist the backbone of many location-based applications and not only. In this work, we propose a novel method for classifying multidimensional data using an AkNN algorithm in the MapReduce framework. Our approach exploits space decomposition techniques for processing the classification procedure in a parallel and distributed manner. To our knowledge, we are the first to study the kNN classification of multidimensional objects under this perspective. Through an extensive experimental evaluation we prove that our solution is efficient, robust and scalable in processing the given queries.
Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsoumakos, Giannis Tzimas

### Optimizing Inter-data-center Large-Scale Database Parallel Replication with Workload-Driven Partitioning

Abstract
Zhen Gao, Hong Min, Xiao Li, Jie Huang, Yi Jin, An Lei, Serge Bourbonnais, Miao Zheng, Gene Fuh

### Anonymization of Data Sets with NULL Values

Abstract
Releasing, publishing or transferring microdata is restricted by the necessity to protect the privacy of data owners. k-anonymity is one of the most widespread concepts for anonymizing microdata but it does not explicitly cover NULL values which are nevertheless frequently found in microdata. We study the problem of NULL values (missing values, non-applicable attributes, etc.) for anonymization in detail, present a set of new definitions for k-anonymity explicitly considering NULL values and analyze which definition protects from which attacks. We show that an adequate treatment of missing values in microdata can be easily achieved by an extension of generalization algorithms. In particular, we show how the proposed treatment of NULL values was incorporated in the anonymization tool ANON, which implements generalization and tuple suppression with an application specific definition of information loss. With a series of experiments we show that NULL aware generalization algorithms have less information loss than standard algorithms.
Margareta Ciglic, Johann Eder, Christian Koncilia

### Backmatter

Weitere Informationen