
About this book

This book constitutes the refereed proceedings of the 17th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2015, held in Valencia, Spain, in September 2015.

The 31 revised full papers presented were carefully reviewed and selected from 90 submissions. The papers are organized in topical sections on similarity measure and clustering; data mining; social computing; heterogeneous networks and data; data warehouses; stream processing; applications of big data analysis; and big data.

Table of Contents

Frontmatter

Similarity Measure and Clustering

Frontmatter

Determining Query Readiness for Structured Data

The outcomes and quality of organizational decisions depend on the characteristics of the data available for making the decisions and on the value of the data in the decision-making process. Toward enabling management of these aspects of data in analytics, we introduce and investigate Data Readiness Level (DRL), a quantitative measure of the value of a piece of data at a given point in a processing flow. Our DRL proposal is a multidimensional measure that takes into account the relevance, completeness, and utility of data with respect to a given analysis task. This study provides a formalization of DRL in a structured-data scenario, and illustrates how knowledge of rules and facts, both within and outside the given data, can be used to identify those transformations of the data that improve its DRL.

Farid Alborzi, Rada Chirkova, Jon Doyle, Yahya Fathi

Efficient Cluster Detection by Ordered Neighborhoods

Detecting cluster structures seems to be a simple task, i.e. separating similar from dissimilar objects. However, given today’s complex data, (dis-)similarity measures and traditional clustering algorithms are not reliable in separating clusters from each other. For example, when too many dimensions are considered simultaneously, objects become unique and (dis-)similarity does not provide meaningful information to detect clusters anymore. While the (dis-)similarity measures might be meaningful for individual dimensions, algorithms fail to combine this information for cluster detection. In particular, it is the severe issue of a combinatorial search space that results in inefficient algorithms.

In this paper we propose a cluster detection method based on ordered neighborhoods. By considering such ordered neighborhoods in each dimension individually, we derive properties that allow us to detect clustered objects in dimensions in linear time. Our algorithm exploits the ordered neighborhoods in order to find both the similar objects and the dimensions in which these objects show high similarity. Evaluation results show that our method is scalable with both database size and dimensionality and enhances cluster detection w.r.t. state-of-the-art clustering techniques.

Emin Aksehirli, Bart Goethals, Emmanuel Müller

Unsupervised Semantic and Syntactic Based Classification of Scientific Citations

In recent years, the number of scientific publications has increased substantially. A way to measure the impact of a publication is to count the number of citations to the paper. Thus, citations are being used as a proxy for a researcher’s contribution and influence in a field. Citation classification can provide context to the citations. To perform citation classification, supervised techniques are normally used. To the best of our knowledge, there is no research that performs this task in an unsupervised manner. In this paper we present two novel techniques to cluster citations automatically, without human intervention, according to their contents (semantic) and the citation sentence styles (syntactic). The techniques are validated using external test sets from existing supervised citation classification studies.

Mohammad Abdullatif, Yun Sing Koh, Gillian Dobbie

Data Mining

Frontmatter

HI-Tree: Mining High Influence Patterns Using External and Internal Utility Values

We propose an efficient algorithm, called HI-Tree, for mining high influence patterns for an incremental dataset. In traditional pattern mining, one would find the complete set of patterns and then apply a post-pruning step to it. The size of the complete mining results is typically prohibitively large, despite the fact that only a small percentage of high utility patterns are interesting. Thus it is inefficient to wait for the mining algorithm to complete and then apply feature selection to post-process the large number of resulting patterns. Instead of generating the complete set of frequent patterns we are able to directly mine patterns with high utility values in an incremental manner. In this paper we propose a novel utility measure called an influence factor using the concepts of external utility and internal utility of an item. The influence factor for an item takes into consideration its connectivity with its neighborhood as well as its importance within a transaction. The measure is especially useful in problem domains utilizing network or interaction characteristics amongst items such as in a social network or web click-stream data. We compared our technique against state of the art incremental mining techniques and show that our technique has better rule generation and runtime performance.

Yun Sing Koh, Russel Pears

Balancing Tree Size and Accuracy in Fast Mining of Uncertain Frequent Patterns

To mine frequent patterns from uncertain data, many existing algorithms (e.g., UF-growth) directly calculate the expected support of a pattern. Consequently, they require a significant amount of storage space to capture all existential probability values among the items in the data. To reduce the amount of required storage space, some existing algorithms (e.g., PUF-growth) combine nodes with the same item by storing an upper bound on expected support. Consequently, they lead to many false positives in the intermediate mining step. There is a trade-off between storage space and accuracy. In this paper, we introduce a new algorithm called MUF-growth for achieving a tighter upper bound on expected support than PUF-growth while balancing the storage space requirement. We evaluate the trade-off between storing more information to further tighten the bound and its effect on the performance of the algorithm. Our experimental results reveal a diminishing return on performance as the bound is increasingly tightened, allowing us to make a recommendation on the most effective use of extra storage towards increasing the efficiency of the algorithm.

Carson Kai-Sang Leung, Richard Kyle MacKinnon
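
For readers unfamiliar with the term, the expected support mentioned in the abstract above is commonly defined as the sum, over all transactions, of the product of the existential probabilities of the pattern's items. The following is a minimal Python sketch of that standard definition only; it is not UF-growth, PUF-growth, or MUF-growth. The function name, the dictionary-based transaction layout, and the toy data are assumptions made for illustration.

```python
from math import prod


def expected_support(pattern, transactions):
    """Expected support of `pattern` in an uncertain transaction database.

    Each transaction is a dict mapping item -> existential probability.
    Items are assumed to be independent within a transaction.
    """
    total = 0.0
    for transaction in transactions:
        # A transaction contributes only if it contains every item of the pattern.
        if all(item in transaction for item in pattern):
            total += prod(transaction[item] for item in pattern)
    return total


# Hypothetical toy database, not taken from the paper.
toy_db = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6},
    {"a": 0.5, "b": 0.4, "c": 1.0},
]
```

Calling `expected_support({"a", "b"}, toy_db)` adds the contributions of the first and third transactions; the second is skipped because it does not contain `b`.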

Secure Outsourced Frequent Pattern Mining by Fully Homomorphic Encryption

With the advent of the big data era, outsourcing data storage together with data mining tasks to cloud service providers is becoming a trend, which however incurs security and privacy issues. To address the issues, this paper proposes two protocols for mining frequent patterns securely on the cloud by employing fully homomorphic encryption. One protocol requires little communication between the client and the cloud service provider; the other incurs less computation cost. Moreover, a new privacy notion, namely $\alpha$-pattern uncertainty, is proposed to reinforce the second protocol. Our scenario has two advantages: one is stronger privacy protection, and the other is that the outsourced data can be used in different mining tasks. Experimental evaluation demonstrates that the proposed protocols provide a feasible solution to the issues.

Junqiang Liu, Jiuyong Li, Shijian Xu, Benjamin C.M. Fung

Supervised Evaluation of Top-k Itemset Mining Algorithms

A major mining task for binary matrixes is the extraction of approximate top-k patterns that are able to concisely describe the input data. The top-k pattern discovery problem is commonly stated as an optimization one, where the goal is to minimize a given cost function, e.g., the accuracy of the data description.

In this work, we review several greedy state-of-the-art algorithms, namely Asso, Hyper+, and PaNDa$^{+}$, and propose a methodology to compare the patterns extracted. In evaluating the set of mined patterns, we aim at overcoming the usual assessment methodology, which only measures the given cost function to minimize. Thus, we evaluate how good the extracted models/patterns are at unveiling supervised knowledge on the data. To this end, we test algorithms and diverse cost functions on several datasets from the UCI repository. As a contribution, we show that PaNDa$^{+}$ performs best in the majority of the cases, since the classifiers built over the mined patterns used as dataset features are the most accurate.

Claudio Lucchese, Salvatore Orlando, Raffaele Perego

Finding Banded Patterns in Data: The Banded Pattern Mining Algorithm

The concept of banded pattern mining is concerned with the identification of “bandings” within zero-one data. A zero-one data set is said to be fully banded if all the “ones” can be arranged along the leading diagonal. The discovery of a banded pattern is of interest in its own right, at least in a data analysis context, because it tells us something about the data. Banding has also been shown to enhance the efficiency of matrix manipulation algorithms. In this paper the exact N-dimensional Banded Pattern Mining (BPM) algorithm is presented together with a full evaluation of its operation. To illustrate the utility of the banded pattern concept, a case study using the Great Britain (GB) Cattle movement database is also presented.

Fatimah B. Abdullahi, Frans Coenen, Russell Martin

Discrimination-Aware Association Rule Mining for Unbiased Data Analytics

A discriminatory dataset refers to a dataset with undesirable correlation between sensitive attributes and the class label, which often leads to biased decision making in data analytics processes. This paper investigates how to build discrimination-aware models even when the available training set is intrinsically discriminating based on some sensitive attributes, such as race, gender or personal status. We propose a new classification method called Discrimination-Aware Association Rule classifier (DAAR), which integrates a new discrimination-aware measure and an association rule mining algorithm. We evaluate the performance of DAAR on three real datasets from different domains and compare it with two non-discrimination-aware classifiers (a standard association rule classification algorithm and the state-of-the-art association rule algorithm SPARCCC), and also with a recently proposed discrimination-aware decision tree method. The results show that DAAR is able to effectively filter out the discriminatory rules and decrease the discrimination on all datasets with insignificant impact on the predictive accuracy.

Ling Luo, Wei Liu, Irena Koprinska, Fang Chen

Social Computing

Frontmatter

Big Data Analytics of Social Networks for the Discovery of “Following” Patterns

In the current era of big data, high volumes of valuable data can be easily collected and generated. Social networks are examples of sources that generate such big data. Users (or social entities) in these social networks are often linked by some interdependency such as friendship or “following” relationships. As these big social networks keep growing, there are situations in which individual users or businesses want to find those frequently followed groups of social entities so that they can follow the same groups. In this paper, we present a big data analytics solution that uses the MapReduce model to mine social networks for discovering groups of frequently followed social entities. Evaluation results show the efficiency and practicality of our big data analytics solution in discovering “following” patterns from social networks.

Carson Kai-Sang Leung, Fan Jiang
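
The idea of counting frequently followed groups with a map step and a reduce step can be illustrated with a small, single-process Python simulation. This is only a sketch under stated assumptions: it is not the authors' algorithm, it does not use a real MapReduce framework, and the function names, the fixed group size, and the toy network are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_phase(follower_lists, group_size):
    """Emit (group, 1) for every group of `group_size` entities a user follows."""
    for followees in follower_lists.values():
        # Sorting gives each group one canonical key regardless of input order.
        for group in combinations(sorted(set(followees)), group_size):
            yield group, 1


def reduce_phase(pairs, min_followers):
    """Sum the counts per group and keep groups followed by enough users."""
    counts = Counter()
    for group, one in pairs:
        counts[group] += one
    return {group: n for group, n in counts.items() if n >= min_followers}


def frequently_followed_groups(follower_lists, group_size, min_followers):
    """Chain the two phases, as a MapReduce job would."""
    return reduce_phase(map_phase(follower_lists, group_size), min_followers)


# Hypothetical toy network: user -> entities that the user follows.
toy_network = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C"],
}
```

With `group_size=2` and `min_followers=2`, the pairs `("A", "B")` and `("B", "C")` are kept because two users follow each of them, while `("A", "C")` is dropped.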

Sentiment Extraction from Tweets: Multilingual Challenges

Every day, users of social networks and microblogging services share their point of view about products, companies, movies and their emotions on a variety of topics. As social networks and microblogging services become more popular, the need to mine and analyze their content grows. We study the task of sentiment analysis in the well-known social network Twitter (https://twitter.com/). We present a case study on tweets written in Greek and propose an effective method that categorizes Greek tweets as positive, negative and neutral according to their sentiment. We validate our method’s effectiveness on both Greek and English to check its robustness on multilingual challenges, and present the first multilingual comparative study with three pre-existing state-of-the-art techniques for Twitter sentiment extraction on English tweets. Last but not least, we examine the importance of different preprocessing techniques in different languages. Our technique outperforms two out of the three methods we compared against and is on a par with the best of those methods, but it needs significantly less time for prediction and training.

Nantia Makrynioti, Vasilis Vassalos

TiDE: Template-Independent Discourse Data Extraction

The problem of Discourse Data Extraction focuses on identifying comments and reviews from social networking websites. Existing approaches for Discourse Data Extraction are either template-dependent or limited to comment-posting-structure discovery. We are not aware of any technique that extracts detailed comment information such as comment text, commenter and discussion structure from the comment page. In this paper, we present a template-independent two-step approach, namely TiDE, which extracts discourse data such as comments, reviews, posts and the structural relationships among them. In the first step, we parse the input comment page to prepare a Document Object Model tree and then find the location of discourse data in the tree using the concept of Path-Strings. The outputs of the first step are Comment Blocks, and these Comment Blocks are leveraged in the second step to extract the comments, commenter and discussion structure. Experimental studies on 19 well-known Discourse websites having different templates show that our Comment Block discovery is more adaptable than the existing posting-structure discovery technique. We are able to extract 97 % of comment text and 79 % of commenter information, which is significant compared to state-of-the-art techniques. We also show the usefulness of TiDE by building a news comment crawler.

Jayendra Barua, Dhaval Patel, Vikram Goyal

Heterogeneous Networks and Data

Frontmatter

A New Relevance Measure for Heterogeneous Networks

Measuring relatedness between objects (nodes) in a heterogeneous network is a challenging and interesting problem. Many people transform a heterogeneous network into a homogeneous network before applying a similarity measure. However, such a transformation results in information loss as path semantics are lost. In this paper, we study the problem of measuring relatedness between objects in a heterogeneous network using only link information and propose a novel meta-path based measure for relevance measurement in a general heterogeneous network with a specified network schema. The proposed measure is semi-metric and incorporates the path semantics by following the specified meta-path. For relevance measurement, using the specified meta-path, the given heterogeneous network is converted into a bipartite network consisting only of source and target type objects between which relatedness is to be measured. In order to validate the effectiveness of the proposed measure, we compared its performance with existing relevance measures which are semi-metric and applicable to heterogeneous networks. To show the viability and the effectiveness of the proposed measure, experiments were performed on the real-world bibliographic dataset DBLP. Experimental results show that the proposed measure effectively measures the relatedness between objects in a heterogeneous network and that it outperforms earlier measures in clustering and query tasks.

Mukul Gupta, Pradeep Kumar, Bharat Bhasker

UFOMQ: An Algorithm for Querying for Similar Individuals in Heterogeneous Ontologies

The chief challenge in identifying similar individuals across multiple ontologies is the high computational cost of evaluating similarity between every pair of entities. We present an approach to querying for similar individuals across multiple ontologies that makes use of the correspondences discovered during ontology alignment in order to reduce this cost. The query algorithm is designed using the framework of fuzzy logic and extends fuzzy ontology alignment. The algorithm identifies entities that are related to the given entity directly from a single alignment link or by following multiple alignment links. We evaluate the approach using both publicly available ontologies and an enterprise-scale dataset. Experiments show that it is possible to trade off a small decrease in the precision of the query results for a large saving in execution time.

Yinuo Zhang, Anand Panangadan, Viktor K. Prasanna

Semantics-Based Multidimensional Query Over Sparse Data Marts

Measurement of Performance Indicators (PIs) in highly distributed environments, especially in networked organisations, is particularly critical because of heterogeneity issues and sparsity of data. In this paper we present a semantics-based approach for dynamic calculation of PIs in the context of sparse distributed data marts. In particular, we propose to enrich the multidimensional model with the formal description of the structure of an indicator given in terms of its algebraic formula and aggregation function. Upon such a model, a set of reasoning-based functionalities is capable of mathematically manipulating formulas for dynamic aggregation of data and computation of indicators on-the-fly, by means of recursive application of rewriting rules based on logic programming.

Claudia Diamantini, Domenico Potena, Emanuele Storti

Data Warehouses

Frontmatter

Automatically Tailoring Semantics-Enabled Dimensions for Movement Data Warehouses

This paper proposes an automatic approach to build tailored dimensions for movement data warehouses based on views of existing hierarchies of objects (and their respective classes) used to semantically annotate movement segments. It selects the objects (classes) that annotate at least a given number of segments of a movement dataset to delineate hierarchy views for deriving tailored analysis dimensions for that movement dataset. Dimensions produced in this way can be considerably smaller than the hierarchies from which they are extracted, leading to efficiency gains, among other potential benefits. Results of experiments with tweets semantically enriched with points of interest taken from linked open data collections show the viability of the proposed approach.

Juarez A. P. Sacenti, Fabio Salvini, Renato Fileto, Alessandra Raffaetà, Alessandro Roncato

Real-Time Snapshot Maintenance with Incremental ETL Pipelines in Data Warehouses

Multi-version concurrency control is now widely used in data warehouses to provide OLAP queries and ETL maintenance flows with concurrent access. A snapshot is taken on existing warehouse tables to answer a certain query independently of concurrent updates. In this work, we extend this snapshot with the deltas which reside at the source side of ETL flows. Before answering a query, relevant tables are first refreshed with the exact source deltas which are captured at the time this query arrives (the so-called query-driven policy). Snapshot maintenance is done by an incremental recomputation pipeline which is flushed by a set of consecutive deltas belonging to a sequence of incoming queries. A workload scheduler is thereby used to achieve a serializable schedule of concurrent maintenance tasks and OLAP queries. Performance has been examined by using read-/update-heavy workloads.

Weiping Qu, Vinanthi Basavaraj, Sahana Shankar, Stefan Dessloch

Eco-Processing of OLAP Complex Queries

With the era of Big Data and the spectacular development of High-Performance Computing, organizations and countries spend considerable effort and money to control and reduce energy consumption. In data-centric applications, DBMSs are among the major energy consumers when executing complex queries. As a consequence, integrating energy aspects into advanced database design becomes an economic necessity. To predict this energy, the development of mathematical cost models is one of the avenues worth exploring. In this paper, we propose a cost model for estimating the energy required to execute a workload. This estimation is obtained by means of statistical regression techniques that consider three types of parameters related to the query execution strategies, the deployment platform used and the characteristics of the data warehouses. To evaluate the quality of our cost model, we conduct two types of experiments: one using our mathematical cost model and another using a real DBMS with datasets from the TPC-H and TPC-DS benchmarks. The obtained results show the quality of our cost model.

Amine Roukh, Ladjel Bellatreche

Materializing Baseline Views for Deviation Detection Exploratory OLAP

Alert-raising and deviation detection in OLAP and exploratory search concern calling the user’s attention to variations and non-uniform data distributions, or directing the user to the most interesting exploration of the data. In this paper, we are interested in the ability of a data warehouse to continuously monitor new data and to update accordingly a particular type of materialized views recording statistics, called baselines. It should be possible to detect deviations at various levels of aggregation, and baselines should be fully integrated into the database. We propose Multi-level Baseline Materialized Views (BMV), including the mechanisms to build, refresh and detect deviations. We also propose an incremental approach and formula for refreshing baselines efficiently. An experimental setup proves the concept and shows its efficiency.

Pedro Furtado, Sergi Nadal, Veronika Peralta, Mahfoud Djedaini, Nicolas Labroche, Patrick Marcel
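
The abstract above does not state the refresh formula, so the following Python sketch only illustrates the general idea of refreshing a statistical baseline incrementally, without rescanning old data. It uses Welford's well-known running mean and variance update as a stand-in; the class name `Baseline`, its methods, and the three-standard-deviation threshold are assumptions, not the BMV mechanism of the paper.

```python
import math


class Baseline:
    """Running count, mean and variance for one cell of a baseline view."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the current mean

    def refresh(self, value):
        """Fold one new measure value into the baseline (Welford's update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / self.count) if self.count else 0.0

    def is_deviation(self, value, threshold=3.0):
        """Flag values more than `threshold` standard deviations from the mean."""
        sd = self.stddev()
        return sd > 0 and abs(value - self.mean) > threshold * sd
```

A typical use would call `is_deviation` on each incoming value before folding it into the baseline with `refresh`, so that an outlier is reported before it shifts the statistics.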

Stream Processing

Frontmatter

Binary Shapelet Transform for Multiclass Time Series Classification

Shapelets have recently been proposed as a new primitive for time series classification. Shapelets are subseries of series that best split the data into its classes. In the original research, shapelets were found recursively within a decision tree through enumeration of the search space. Subsequent research indicated that using shapelets as the basis for transforming datasets leads to more accurate classifiers.

Both these approaches evaluate how well a shapelet splits all the classes. However, often a shapelet is most useful in distinguishing members of the class of the series it was drawn from from all others. To assess this conjecture, we evaluate a one-vs-all encoding scheme. This technique simplifies the quality assessment calculations, speeds up the execution by facilitating more frequent early abandoning, and increases accuracy for multi-class problems. We also propose an alternative shapelet evaluation scheme which we demonstrate significantly speeds up the full search.

Aaron Bostrom, Anthony Bagnall
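
The early abandoning mentioned in the abstract above is a general technique in shapelet distance computation: a candidate window is dropped as soon as its partial distance can no longer beat the best window found so far. The Python sketch below shows that idea on plain lists. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, no normalization is applied, and the one-vs-all quality measure of the paper is not reproduced.

```python
def shapelet_distance(shapelet, series):
    """Smallest squared Euclidean distance between `shapelet` and any
    equal-length window of `series`, computed with early abandoning."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = float("inf")
    for start in range(len(series) - m + 1):
        partial = 0.0
        for offset in range(m):
            diff = series[start + offset] - shapelet[offset]
            partial += diff * diff
            # Early abandon: this window can no longer beat the best one.
            if partial >= best:
                break
        else:
            # The loop finished without abandoning, so this window is the new best.
            best = partial
    return best
```

For instance, a shapelet that occurs verbatim inside the series yields a distance of zero, and every later window is then abandoned after its first point unless it also matches exactly.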

StreamXM: An Adaptive Partitional Clustering Solution for Evolving Data Streams

A challenge imposed by continuously arriving data streams is to analyze them and to modify the models that explain them as new data arrives. We propose StreamXM, a stream clustering technique that does not require an arbitrary selection of number of clusters, repeated and expensive heuristics or in-depth prior knowledge of the data to create an informed clustering that relates to the data. It allows a clustering that can adapt its number of classes to those present in the underlying distribution. In this paper, we propose two different variants of StreamXM and compare them against a current, state-of-the-art technique, StreamKM. We evaluate our proposed techniques using both synthetic and real world datasets. From our results, we show StreamXM and StreamKM run in similar time and with similar accuracy when running with similar numbers of clusters. We show our algorithms can provide superior stream clustering if true clusters are not known or if emerging or disappearing concepts will exist within the data stream.

Robert Anderson, Yun Sing Koh

Data Stream Mining with Limited Validation Opportunity: Towards Instrument Failure Prediction

A data stream mining mechanism for predicting instrument failure, founded on the concept of time series analysis, is presented. The objective is to build a model that can predict instrument failure so that some mitigation can be invoked so as to prevent the failure. The proposed mechanism therefore features the interesting characteristic that there is only a limited opportunity to validate the model. The mechanism is fully described and evaluated using single and multiple attribute scenarios.

Katie Atkinson, Frans Coenen, Phil Goddard, Terry Payne, Luke Riley

Distributed Classification of Data Streams: An Adaptive Technique

Mining data streams is a critical task of current Big Data applications. Usually, data stream mining algorithms work in resource-constrained environments, which call for novel requirements like availability of resources and adaptivity. Following this main trend, in this paper we propose a distributed data stream classification technique that has been tested on a real sensor network platform, namely, Sun SPOT. The proposed technique shows several points of research innovation, which are also confirmed by its effectiveness and efficiency assessed in our experimental campaign.

Alfredo Cuzzocrea, Mohamed Medhat Gaber, Ary Mazharuddin Shiddiqi

New Word Detection and Tagging on Chinese Twitter Stream

Twitter has become one of the critical channels for disseminating up-to-date information. The volume of tweets can be huge. It is desirable to have an automatic system to analyze tweets. The obstacle is that Twitter users usually invent new words using non-standard rules, and these words appear in a burst within a short period of time. Existing new word detection methods are not able to identify them effectively. Even if the new words can be identified, it is difficult to understand their meanings. In this paper, we focus on Chinese Twitter. There are no natural word delimiters in a sentence, which makes the problem more difficult. To solve the problem, we derive an unsupervised new word detection framework without relying on training data. Then, we introduce automatic tagging for new word annotation, which tags the new words using known words according to our proposed tagging algorithm.

Yuzhi Liang, Pengcheng Yin, S. M. Yiu

Applications of Big Data Analysis

Frontmatter

Text Categorization for Deriving the Application Quality in Enterprises Using Ticketing Systems

Today’s enterprise services and business applications are often centralized in a small number of data centers. Employees located at branches and side offices access the computing infrastructure via the internet using thin client architectures. The task of providing good application quality to employees using a multitude of different applications and access networks has thus become complex. Enterprises have to be able to identify resource bottlenecks and poorly performing applications quickly in order to take appropriate countermeasures and enable good application quality for their employees. Ticketing systems within an enterprise use large databases for collecting complaints and problems of the users over a long period of time and thus are an interesting starting point to identify performance problems. However, manual categorization of tickets comes with a high workload.

In this paper, we analyze in a case study the applicability of supervised learning algorithms for the automatic identification of relevant tickets, i.e., tickets indicating problematic applications. In that regard, we evaluate different classification algorithms using 12,000 manually annotated tickets accumulated in July 2013 at the ticketing system of a nation-wide operating enterprise. In addition to traditional machine learning metrics, we also analyze the performance of the different classifiers on business-relevant metrics.

Thomas Zinner, Florian Lemmerich, Susanna Schwarzmann, Matthias Hirth, Peter Karg, Andreas Hotho

MultiSpot: Spotting Sentiments with Semantic Aware Multilevel Cascaded Analysis

Given a textual resource (e.g. post, review, comment), how can we spot the expressed sentiment? What will be the core information to be used for accurately capturing sentiment given a number of textual resources? Here, we introduce an approach for extracting and aggregating information from different text-levels, namely words and sentences, in an effort to improve the capturing of documents’ sentiments in relation to the state-of-the-art approaches. Our main contributions are: (a) the proposal of two semantic aware approaches for enhancing the cascaded phase of a sentiment analysis process; and (b) MultiSpot, a multilevel sentiment analysis approach which combines word and sentence level features. We present experiments on two real-world datasets containing movie reviews.

Despoina Chatzakou, Nikolaos Passalis, Athena Vakali

Online Urban Mobility Detection Based on Velocity Features

The study of the mobility models that arise from city dynamics has become instrumental in providing new urban services. In this context, many proposals applied off-line learning to historical data. However, at the dawn of the Big Data era, there is an increasing need for systems and architectures able to process data in a timely manner. The present work introduces a novel approach for online mobility model detection along with a new concept for trajectory abstraction based on velocity features. Finally, the proposal is evaluated with a real-world dataset.

Fernando Terroso-Saenz, Mercedes Valdes-Vela, Antonio F. Skarmeta-Gomez

Big Data

Frontmatter

Partition and Conquer: Map/Reduce Way of Substructure Discovery

Transactional data mining (decision trees, association rules, etc.) has been used to discover non-trivial patterns in unstructured data. For applications that have an inherent structure (such as social networks, phone networks, etc.), graph mining is useful, as mapping such data into an unstructured representation will lead to loss of relationships. Graph mining finds use in a plethora of applications: analysis of fraud detection in transaction networks and finding friendships and other characteristics, to name a few. Finding interesting and frequent substructures is central to graph mining in all of these applications. Until now, graph mining has been addressed using main memory, disk-based as well as database-oriented approaches to deal with progressively larger sizes of applications.

This paper presents two algorithms using the Map/Reduce paradigm for mining interesting and repetitive patterns from a partitioned input graph. A general form of graphs, including directed edges and cycles, is handled by our approach. Our primary goal is to address scalability and to solve difficult and computationally expensive problems like duplicate elimination, canonical labeling and isomorphism detection in the Map/Reduce framework, without loss of information. Our analysis and experiments show that graphs with hundreds of millions of edges can be handled with acceptable speedup by the algorithm and the approach presented in this paper.

Soumyava Das, Sharma Chakravarthy

Implementation of Multidimensional Databases with Document-Oriented NoSQL

NoSQL (Not Only SQL) systems are becoming popular due to known advantages such as horizontal scalability and elasticity. In this paper, we study the implementation of data warehouses with document-oriented NoSQL systems. We propose mapping rules that transform the multidimensional data model to logical document-oriented models. We consider three different logical translations and we use them to instantiate multidimensional data warehouses. We focus on data loading, model-to-model conversion and cuboid computation.

M. Chevalier, M. El Malki, A. Kopliku, O. Teste, R. Tournier

A Graph-Based Concept Discovery Method for n-Ary Relations

Concept discovery is a multi-relational data mining task for inducing definitions of a specific relation in terms of other relations in the data set. Such learning tasks usually have to deal with large search spaces and hence have efficiency and scalability issues. In this paper, we present a hybrid approach that combines association rule mining methods and graph-based approaches to cope with these issues. The proposed method inputs the data in relational format, converts it into a graph representation, and traverses the graph to find the concept descriptors. Graph traversal and pruning are guided by association rule mining techniques. The proposed method differs from state-of-the-art methods in that it can work on n-ary relations, uses path-finding queries to extract concepts, and can handle numeric values. Experimental results show that the method is superior to state-of-the-art methods in terms of the accuracy and coverage of the induced concept descriptors and the running time.

Nazmiye Ceren Abay, Alev Mutlu, Pinar Karagoz

Exact Detection of Information Leakage in Database Access Control

Elaborate security policies often require organizations to restrict user data access in a fine-grained manner, instead of traditional table- or column-level access control. Not surprisingly, managing fine-grained access control in software is rather challenging. In particular, if access is not configured carefully, information leakage may happen: Users may infer sensitive information through the data explicitly accessible to them in centralized systems or in the cloud.

In this paper we formalize this information-leakage problem by modeling sensitive information as answers to “secret queries,” and by modeling access-control rules as views. We focus on the scenario where sensitive information can be deterministically derived by adversaries. We review a natural data-exchange based inference model for detecting information leakage, and show its capabilities and limitation. We then introduce and formally study a new inference model, view-verified data exchange, that overcomes the limitation for the query language under consideration.

Farid Alborzi, Rada Chirkova, Ting Yu

Backmatter
