
2017 | Book

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII

Special Issue on Big Data Analytics and Knowledge Discovery

Edited by: Abdelkader Hameurlain, Josef Küng, Prof. Dr. Roland Wagner, Sanjay Madria, Prof. Takahiro Hara

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. Current decentralized systems still focus on data and knowledge as their main resource. Feasibility of these systems relies basically on P2P (peer-to-peer) techniques and the support of agent systems with scaling and decentralized control. Synergy between grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments.

This volume, the 32nd issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, focuses on Big Data Analytics and Knowledge Discovery, and contains extended and revised versions of five papers selected from the 17th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2015, held in Valencia, Spain, during September 1-4, 2015. The five papers focus on the exact detection of information leakage, the binary shapelet transform for multiclass time series classification, a discrimination-aware association rule classifier for decision support (DAAR), new word detection and tagging on Chinese Twitter, and on-demand snapshot maintenance in data warehouses using incremental ETL pipelines, respectively.


Table of Contents

Frontmatter
Exact Detection of Information Leakage: Decidability and Complexity
Abstract
Elaborate security policies often require organizations to restrict user data access in a fine-grained manner, instead of traditional table- or column-level access control. Not surprisingly, managing fine-grained access control in software is rather challenging. In particular, if access is not configured carefully, information leakage may happen: Users may infer sensitive information through the data explicitly accessible to them.
In this paper we formalize this information-leakage problem, by modeling sensitive information as answers to “secret queries,” and by modeling access-control rules as views. We focus on the scenario where sensitive information can be deterministically derived by adversaries. We review a natural data-exchange based inference model for detecting information leakage, and show its capabilities and limitation. We then introduce and formally study a new inference model, view-verified data exchange, that overcomes the limitation for the query language under consideration. Our formal study provides correctness and complexity results for the proposed inference model in the context of queries belonging to a frequent realistic query type and common types of integrity constraints on the data.
Rada Chirkova, Ting Yu
Binary Shapelet Transform for Multiclass Time Series Classification
Abstract
Shapelets have recently been proposed as a new primitive for time series classification. Shapelets are subseries of series that best split the data into its classes. In the original research, shapelets were found recursively within a decision tree through enumeration of the search space. Subsequent research indicated that using shapelets as the basis for transforming datasets leads to more accurate classifiers. Both these approaches evaluate how well a shapelet splits all the classes. However, a shapelet is often most useful in distinguishing members of the class of the series it was drawn from from all others. To assess this conjecture, we evaluate a one-vs-all encoding scheme. This technique simplifies the quality assessment calculations, speeds up execution by facilitating more frequent early abandon, and increases accuracy for multi-class problems. We also propose an alternative shapelet evaluation scheme, which we demonstrate significantly speeds up the full search.
Aaron Bostrom, Anthony Bagnall
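As an illustration of two ideas named in the abstract above, the following is a minimal sketch and not the authors' implementation. It assumes a plain squared Euclidean distance without the subsequence normalization that shapelet methods commonly apply, and the function names `shapelet_distance` and `one_vs_all` are hypothetical. It shows how a distance computation can be abandoned early once it cannot beat the best window found so far, and how multiclass labels can be recoded into a binary one-vs-all problem for a given class.

```python
from typing import Hashable, List, Sequence


def shapelet_distance(shapelet: Sequence[float], series: Sequence[float]) -> float:
    """Smallest squared Euclidean distance between the shapelet and any
    same-length window of the series (no normalization: an assumption)."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = float("inf")
    for start in range(len(series) - m + 1):
        total = 0.0
        for offset in range(m):
            diff = series[start + offset] - shapelet[offset]
            total += diff * diff
            if total >= best:
                # Early abandon: this window can no longer improve on `best`.
                break
        else:
            # The inner loop finished without abandoning, so this window is better.
            best = total
    return best


def one_vs_all(labels: Sequence[Hashable], target: Hashable) -> List[int]:
    """Recode multiclass labels as 1 (target class) versus 0 (all other classes)."""
    return [1 if label == target else 0 for label in labels]


if __name__ == "__main__":
    print(shapelet_distance([1.0, 2.0], [0.0, 1.0, 2.0, 5.0]))
    print(one_vs_all(["a", "b", "a", "c"], "a"))
```

With a binary encoding, the quality of a candidate shapelet only has to be judged against two groups, which is one plausible reason the abstract reports simpler quality calculations.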
DAAR: A Discrimination-Aware Association Rule Classifier for Decision Support
Abstract
Undesirable correlations between sensitive attributes (such as race, gender or personal status) and the class label (such as a recruitment decision or the approval of a credit card) may lead to biased decisions in data analytics. In this paper, we investigate how to build discrimination-aware models even when the available training set is intrinsically discriminating based on the sensitive attributes. We propose a new classification method called Discrimination-Aware Association Rule classifier (DAAR), which integrates a new discrimination-aware measure and an association rule mining algorithm. We evaluate the performance of DAAR on three real datasets from different domains and compare DAAR with two non-discrimination-aware classifiers (a standard association rule classification algorithm and the state-of-the-art association rule algorithm SPARCCC), and also with a recently proposed discrimination-aware decision tree method. Our comprehensive evaluation is based on three measures: predictive accuracy, discrimination score and inclusion score. The results show that DAAR is able to effectively filter out the discriminatory rules and decrease the discrimination severity on all datasets with insignificant impact on the predictive accuracy. We also find that DAAR generates a small set of rules that are easy for users to understand and apply, helping them make discrimination-free decisions.
Ling Luo, Wei Liu, Irena Koprinska, Fang Chen
New Word Detection and Tagging on Chinese Twitter Stream
Abstract
Twitter has become one of the critical channels for disseminating up-to-date information. The volume of tweets can be huge, so an automatic system to analyze them is desirable. The obstacle is that Twitter users often invent new words using non-standard rules, and these words appear in bursts within a short period of time. Existing new word detection methods are not able to identify them effectively. Even if the new words can be identified, it is difficult to understand their meanings. In this paper, we focus on Chinese Twitter. There are no natural word delimiters in a Chinese sentence, which makes the problem more difficult. To solve the problem, we first introduce a method of detecting new words in Chinese Twitter using a statistical approach that does not rely on training data, whose availability is limited. Then, we derive two tagging algorithms based on two aspects, namely word distance and word vector angle, to tag these new words using known words, which provides a basis for subsequent automatic interpretation. We show the effectiveness of our algorithms using real Twitter data. Although we focus on Chinese, the approach could be applied to other Kanji-based languages.
Yuzhi Liang, Pengcheng Yin, S. M. Yiu
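The abstract above mentions tagging new words with known words based on the word vector angle. The following minimal sketch shows one way such a ranking could work and is not the authors' algorithm. It assumes each word is already represented by a numeric vector (for example, context co-occurrence counts); the function names, the toy vectors, and the parameter `k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tag_new_word(
    new_vector: Sequence[float],
    known_vectors: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the k known words whose vectors form the smallest angle
    (largest cosine similarity) with the new word's vector."""
    ranked = sorted(
        known_vectors.items(),
        key=lambda item: cosine_similarity(new_vector, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy vectors, invented purely for illustration.
    known = {"sports": [1.0, 0.0], "music": [0.0, 1.0], "event": [1.0, 1.0]}
    print(tag_new_word([2.0, 0.1], known, k=2))
```

A smaller angle means the new word occurs in contexts similar to those of the known word, so the returned known words can serve as rough tags for the new word.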
On-Demand Snapshot Maintenance in Data Warehouses Using Incremental ETL Pipeline
Abstract
Multi-version concurrency control is now widely used in data warehouses to give OLAP queries and ETL maintenance flows concurrent access. A snapshot is taken on existing warehouse tables to answer a given query independently of concurrent updates. In this work, we extend the snapshot in the data warehouse with the deltas that reside at the source side of ETL flows. Before a query that accesses the warehouse tables is answered, the relevant tables are first refreshed with exactly those source deltas that were captured up to the arrival of the query and have not yet been synchronized with the tables (called on-demand maintenance). Snapshot maintenance is done by an incremental recomputation pipeline that is flushed by a set of consecutive, non-overlapping delta batches; the delta streams are split into these batches according to the sequence of incoming queries. A workload scheduler is used to achieve a serializable schedule of concurrent maintenance jobs and OLAP queries. Performance has been examined using read-heavy and update-heavy workloads.
Weiping Qu, Stefan Dessloch
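The splitting of a delta stream into consecutive, non-overlapping batches according to query arrivals, as described in the abstract above, might be sketched as follows. This is an illustrative assumption and not the authors' pipeline: deltas and queries are reduced to sorted timestamps, a delta is taken to belong to a query's batch if it was captured at or before that query's arrival, and the function name `split_deltas` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def split_deltas(
    delta_times: Sequence[float], query_times: Sequence[float]
) -> List[List[float]]:
    """Split a sorted delta stream into one batch per query.

    Batch i holds the deltas captured after the previous query's arrival
    and no later than query i's arrival, so the batches are consecutive
    and never overlap. Deltas newer than the last query stay unassigned.
    """
    batches: List[List[float]] = []
    start = 0
    for arrival in query_times:
        # Index of the first delta captured strictly after this query arrived.
        end = bisect_right(delta_times, arrival, lo=start)
        batches.append(list(delta_times[start:end]))
        start = end
    return batches


if __name__ == "__main__":
    # Delta capture times and query arrival times, invented for illustration.
    print(split_deltas([1, 2, 5, 7, 9], [2, 6, 6, 8]))
```

In this sketch a query that arrives with no new deltas gets an empty batch, which corresponds to answering it from the already refreshed tables.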
Backmatter
Metadata
Title
Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII
Edited by
Abdelkader Hameurlain
Josef Küng
Prof. Dr. Roland Wagner
Sanjay Madria
Prof. Takahiro Hara
Copyright year
2017
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-55608-5
Print ISBN
978-3-662-55607-8
DOI
https://doi.org/10.1007/978-3-662-55608-5