
About this Book

This, the 38th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains extended and revised versions of six papers selected from the 68 contributions presented at the 27th International Conference on Database and Expert Systems Applications, DEXA 2016, held in Porto, Portugal, in September 2016. Topics covered include query personalization in databases, data anonymization, similarity search, computational methods for entity resolution, array-based computations in big data analysis, and pattern mining.



Bound-and-Filter Framework for Aggregate Reverse Rank Queries

Finding top-ranked products based on a given user’s preference is a user-view rank model that helps users find their desired products. Recently, another query processing problem, named the reverse rank query, has attracted significant research interest. The reverse rank query is a manufacturer-view model that finds users based on a given product. It can help to target potential users or to find the placement of a specific product in marketing analysis.
Unfortunately, previous reverse rank queries only consider one product, so they cannot identify the users for product bundling, which is a common sales strategy. To address this limitation, we propose a new query, named the aggregate reverse rank query, to find matching users for a set of products. Three aggregate rank functions (SUM, MIN, MAX) are proposed to evaluate a given product bundling in a variety of ways and to target different users. To process these queries efficiently, we propose a novel bound-and-filter framework. In the bound phase, two points are found that bound the query set, so that candidates outside the bounds can be excluded. In the filter phase, two tree-based methods are implemented with the bounds: the tree pruning method (TPM) and the double-tree method (DTM). Theoretical analysis and experimental results demonstrate the efficacy of the proposed methods.
Yuyang Dong, Hanxiong Chen, Kazutaka Furuse, Hiroyuki Kitagawa
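The three aggregate rank functions can be illustrated with a brute-force sketch (not the paper's bound-and-filter algorithm). It assumes a linear scoring model in which a user is a weight vector, a product is an attribute vector, and a product's rank for a user is the number of products the user scores strictly lower; the aggregate reverse rank query then returns the k users with the smallest aggregate rank for the bundle.

```python
def score(user, product):
    """Linear preference score: dot product of weights and attributes (lower is better)."""
    return sum(w * a for w, a in zip(user, product))

def rank(user, product, all_products):
    """Rank of `product` for `user`: number of products scored strictly lower."""
    s = score(user, product)
    return sum(1 for p in all_products if score(user, p) < s)

def aggregate_rank(user, query_set, all_products, agg=sum):
    """Aggregate the ranks of the bundle's products with SUM, MIN, or MAX."""
    return agg(rank(user, q, all_products) for q in query_set)

def aggregate_reverse_rank(users, query_set, all_products, k, agg=sum):
    """Return the k users for whom the bundle's aggregate rank is smallest."""
    return sorted(users,
                  key=lambda u: aggregate_rank(u, query_set, all_products, agg))[:k]
```

Passing `agg=sum`, `min`, or `max` corresponds to the three proposed functions; the bound-and-filter framework exists precisely to avoid this sketch's exhaustive ranking of every user against every product.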

Syntactic Anonymisation of Shared Datasets in Resource Constrained Environments

Resource constrained environments (RCEs) describe remote or rural developing-world regions where a lack of specialised expertise and limited computational processing power hinder data analytics operations. Outsourcing to third-party data analytics service providers offers a cost-effective management solution. However, a necessary pre-processing step is to anonymise the data before it is shared, to protect against privacy violations. Syntactic anonymisation algorithms (k-anonymisation, l-diversity, and t-closeness) are an attractive solution for RCEs because the generated data is not use-case specific. These algorithms have, however, been shown to be NP-hard, and as such need to be re-factored to run efficiently with limited processing power. In previous work [23], we presented a method of extending the standard k-anonymisation and l-diversity algorithms to satisfy both data utility and privacy. We used a multi-objective optimisation scheme to minimise information loss and maximise privacy. Our results showed that the extended l-diversity algorithm incurs higher information loss than the extended k-anonymisation algorithm, but offers better privacy in terms of protection against inferential disclosure. The additional information loss (7%) was negligible and did not negatively affect data utility. As a further step, in this paper we extend this result with a modified t-closeness algorithm based on the notion of clustering. The aim is to provide a performance-efficient algorithm that maintains the low information-loss levels of our extended k-anonymisation and l-diversity algorithms, while also providing protection against skewness and similarity attacks.
Anne V. D. M. Kayem, C. T. Vester, Christoph Meinel
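The two base properties the paper extends can be checked with the standard textbook definitions (this is not the authors' extended algorithm, only the underlying conditions): a table is k-anonymous if every combination of quasi-identifier values is shared by at least k records, and l-diverse if each such equivalence class contains at least l distinct sensitive values.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Every quasi-identifier combination must occur in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

def is_l_diverse(records, quasi_ids, sensitive, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    classes = {}
    for r in records:
        classes.setdefault(tuple(r[q] for q in quasi_ids), set()).add(r[sensitive])
    return all(len(vals) >= l for vals in classes.values())
```

An anonymisation algorithm generalises quasi-identifier values (e.g. exact ZIP codes to prefixes, ages to ranges) until these predicates hold; the information loss discussed in the abstract measures how much generalisation that took.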

Towards Faster Similarity Search by Dynamic Reordering of Streamed Queries

The current era of digital data explosion calls for content-based similarity search techniques, since traditional searchable metadata such as annotations are not always available. In our work, we focus on a scenario where similarity search is used in the context of stream processing, one of the suitable approaches to dealing with huge amounts of data. Our goal is to maximize the throughput of processed queries while a slight delay is acceptable. We propose a technique that dynamically reorders the queries coming from the stream so that our caching mechanism can be used more effectively in huge data spaces. We achieved significantly higher throughput compared to the baseline with no reordering and no caching. Moreover, our proposal does not incur any additional precision loss in the similarity search, as opposed to some other caching techniques. In addition to throughput maximization, we also study the potential of trading off throughput for low delays (waiting times). The proposed technique can be parameterized by the amount of throughput that may be sacrificed.
Filip Nalepa, Michal Batko, Pavel Zezula
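The core idea of reordering a query stream for cache locality can be sketched as follows. This is an illustrative greedy reordering, not the authors' method: it buffers a bounded window of incoming queries and always emits next the buffered query most similar to the one just processed, so that cached intermediate results stay useful. The buffer size plays the role of the delay/throughput trade-off parameter mentioned in the abstract.

```python
def reorder_stream(queries, dist, buffer_size=8):
    """Greedily reorder a query stream so consecutive queries are similar.

    `dist` is any dissimilarity function over queries. A larger
    `buffer_size` allows better orderings at the cost of longer waits.
    """
    buffer, out, last = [], [], None

    def emit_closest():
        nonlocal last
        # With nothing processed yet, fall back to arrival order.
        nxt = buffer[0] if last is None else min(buffer, key=lambda c: dist(last, c))
        buffer.remove(nxt)
        out.append(nxt)
        last = nxt

    for q in queries:
        buffer.append(q)
        if len(buffer) >= buffer_size:
            emit_closest()
    while buffer:          # drain the buffer when the stream ends
        emit_closest()
    return out
```

With one-dimensional queries and absolute difference as the distance, `reorder_stream([5, 1, 6, 2], lambda a, b: abs(a - b), 2)` keeps similar queries adjacent, reducing the total inter-query distance the cache has to absorb.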

SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms

A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. Similarity join is widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is strictly a post-processing step, which slows down overall performance, and final results are produced only at the end of the whole process. Motivated by this observation, in this article we propose and experimentally evaluate SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity join carried out by the join algorithm. An optimization step is further applied on top of this framework. Results from an extensive experimental campaign show that we outperform previous approaches by an order of magnitude in most settings.
Leonardo Andrade Ribeiro, Alfredo Cuzzocrea, Karen Aline Alves Bezerra, Ben Hur Bahia do Nascimento
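The notion of a cluster representation that is merged during the join can be sketched with a simple one-pass clustering over token sets (an illustration of the idea, not the SjClust algorithm). Here each cluster keeps the union of its members' tokens as one possible representation, and a record is absorbed by the first cluster whose representative is Jaccard-similar enough.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_by_similarity(records, threshold):
    """Greedy one-pass clustering over set-valued records.

    Each record joins the first cluster whose representative set is at
    least `threshold`-similar; otherwise it starts a new cluster. The
    representative here is the union of member token sets -- one simple
    choice among the cluster representations SjClust-style frameworks vary.
    """
    clusters = []  # list of (representative_set, member_list) pairs
    for rec in records:
        tokens = set(rec)
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                rep |= tokens          # merge representation incrementally
                members.append(rec)
                break
        else:
            clusters.append((tokens, [rec]))
    return [members for _, members in clusters]
```

Because clusters form while pairs are compared, duplicates can be reported incrementally instead of only after a separate post-processing pass, which is the performance argument the abstract makes.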

A Query Processing Framework for Large-Scale Scientific Data Analysis

Current scientific applications must analyze enormous amounts of array data using complex mathematical data processing methods. This paper describes a distributed query processing framework for large-scale scientific data analysis that captures array-based computations using SQL-like queries and optimizes and evaluates these computations using state-of-the-art parallel processing algorithms. Instead of providing a library of concrete distributed algorithms that implement certain matrix operations efficiently, we generalize these algorithms by making them parametric in such a way that the same efficient implementations that apply to the concrete algorithms can also apply to their generic counterparts. By specifying matrix operations as generic algebraic operators, we are able to perform inter-operator optimizations, such as fusing matrix transpose with matrix multiplication, resulting in new instantiations of the generic algebraic operators, without having to introduce new efficient algorithms on the fly. We report on a prototype implementation of our framework on three Big Data platforms: Hadoop MapReduce, Apache Spark, and Apache Flink, using Apache MRQL, which is a query processing and optimization system for large-scale, distributed data analysis. Finally, we evaluate the effectiveness of our framework through experiments on three queries: a matrix multiplication query, a simple query that combines matrix multiplication with matrix transpose, and a complex iterative query for matrix factorization.
Leonidas Fegaras
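The fusion of transpose with multiplication mentioned above can be shown in miniature (a local, single-machine sketch, not the distributed MRQL operators): instead of materialising Aᵀ and then multiplying, the fused operator reuses the multiplication loop structure and simply reads A column-wise.

```python
def multiply(A, B):
    """Plain matrix multiplication C = A * B over lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def multiply_transposed(A, B):
    """Fused A^T * B: same generic loop as `multiply`, with A's indices
    swapped, so the intermediate transposed matrix is never built."""
    n, m, p = len(A[0]), len(A), len(B[0])
    return [[sum(A[k][i] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]
```

In a distributed setting the saved transpose is an entire data shuffle, which is why expressing both operations as instances of one generic algebraic operator, and fusing them at optimization time, pays off.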

Discovering Periodic-Correlated Patterns in Temporal Databases

Support and periodicity are two important dimensions for determining the interestingness of a pattern in a dataset. Periodic-frequent patterns are an important class of regularities that exist in a dataset with respect to these two dimensions. Most previous models of periodic-frequent pattern mining have focused on finding all patterns in a transactional database that satisfy user-specified minimum support (minSup) and maximum periodicity (maxPer) constraints. These models suffer from two obstacles: (i) current periodic-frequent pattern models cannot handle datasets in which multiple transactions share a common time stamp and/or transactions occur at irregular time intervals; (ii) the use of a single minSup and maxPer for finding the patterns leads to the rare item problem. This paper addresses these two obstacles by proposing a novel model to discover periodic-correlated patterns in a temporal database. Considering the input data as a temporal database addresses the first obstacle, while finding periodic-correlated patterns addresses the second. The proposed model employs the all-confidence measure to prune uninteresting patterns in the support dimension. A new measure, called periodic-all-confidence, is proposed to filter out uninteresting patterns in the periodicity dimension. A pattern-growth algorithm for finding periodic-correlated patterns is also presented. Experimental results show that the proposed model is efficient.
J. N. Venkatesh, R. Uday Kiran, P. Krishna Reddy, Masaru Kitsuregawa
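The basic measures the model builds on can be sketched for a small temporal database, taken as a list of (timestamp, itemset) pairs (the definitions of support, periodicity, and all-confidence below follow the standard pattern-mining literature; the paper's periodic-all-confidence measure is not reproduced here).

```python
def occurrences(db, pattern):
    """Timestamps of the transactions that contain every item of `pattern`."""
    return [ts for ts, items in db if pattern <= items]

def support(db, pattern):
    """Number of transactions containing the pattern."""
    return len(occurrences(db, pattern))

def periodicity(db, pattern):
    """Maximum gap between consecutive occurrences, with the first and
    last timestamps of the database as boundary points."""
    stamps = [db[0][0]] + occurrences(db, pattern) + [db[-1][0]]
    return max(b - a for a, b in zip(stamps, stamps[1:]))

def all_confidence(db, pattern):
    """sup(X) / max item support: close to 1 only when the items occur
    together almost whenever any single one of them occurs, which is
    what makes the measure robust to the rare item problem."""
    return support(db, pattern) / max(support(db, frozenset([i])) for i in pattern)
```

A periodic-frequent model would keep a pattern only if its support is high enough and its periodicity low enough; replacing the raw support threshold with all-confidence (and, in the paper, raw periodicity with periodic-all-confidence) removes the need for one global minSup/maxPer pair.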
