2006 | Book

Data Warehousing and Knowledge Discovery

8th International Conference, DaWaK 2006, Krakow, Poland, September 4-8, 2006. Proceedings

Editors: A Min Tjoa, Juan Trujillo

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

For more than a decade, data warehousing and knowledge discovery technology have together made up the key technology for the decision-making process in companies. Since 1999, due to the relevant role of these technologies in academia and industry, the Data Warehousing and Knowledge Discovery (DaWaK) conference series has become an international forum for both practitioners and researchers to share their findings, publish their relevant results and debate in depth research issues and experiences on data warehousing and knowledge discovery systems and applications.

The 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2006) continued the series of successful conferences dedicated to these topics. In this edition, DaWaK aimed at providing the right and logical balance between data warehousing and knowledge discovery. In data warehousing, the papers cover different research problems, such as advanced techniques in OLAP visualization and multidimensional modelling, innovation of ETL processes and integration problems, materialized view optimization, very large data warehouse processing, integration of data warehouses and data mining applications, and data warehousing for real-life applications, e.g., medical and spatial applications. In data mining and knowledge discovery, the papers focus on a variety of topics, from data stream analysis and mining, ontology-based mining techniques, mining frequent itemsets, clustering, association and classification, to patterns and more.

These proceedings contain the technical papers that were selected for presentation at the conference. We received 198 abstracts, and finally 146 papers from 36 countries.

Table of Contents

Frontmatter

ETL Processing

ETLDiff: A Semi-automatic Framework for Regression Test of ETL Software

Modern software development methods such as Extreme Programming (XP) favor the use of frequently repeated tests, so-called regression tests, to catch new errors when software is updated or tuned, by checking that the software still produces the right results for a reference input. Regression testing is also very valuable for Extract-Transform-Load (ETL) software, as ETL software tends to be very complex and error-prone. However, regression testing of ETL software is currently cumbersome and requires large manual efforts. In this paper, we describe a novel, easy-to-use, and efficient semi-automatic test framework for regression test of ETL software. By automatically analyzing the schema, the tool detects how tables are related, and uses this knowledge, along with optional user specifications, to determine exactly what data warehouse (DW) data should be identical across test ETL runs, leaving out change-prone values such as surrogate keys. The framework also provides tools for quickly detecting and displaying differences between the current ETL results and the reference results. In summary, manual work for test setup is reduced to a minimum, while still ensuring an efficient testing procedure.

Christian Thomsen, Torben Bach Pedersen
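The core comparison step sketched in the abstract above, checking that data warehouse contents are identical across ETL runs while ignoring change-prone columns such as surrogate keys, can be pictured with a small sketch. This is not the authors' ETLDiff tool; the table layout, the `_key` naming convention, and the helper names are assumptions made purely for illustration.

```python
# Minimal sketch: compare a reference ETL run with a current run, ignoring
# change-prone columns such as surrogate keys (assumed to end in "_key").

def comparable_rows(rows, ignored_suffix="_key"):
    """Strip surrogate-key columns and return rows in a canonical, sortable form."""
    cleaned = [
        tuple(sorted((col, val) for col, val in row.items()
                     if not col.endswith(ignored_suffix)))
        for row in rows
    ]
    return sorted(cleaned)

def diff_etl_runs(reference, current):
    """Return rows present only in the reference run and rows present only in the current run."""
    ref, cur = comparable_rows(reference), comparable_rows(current)
    missing = [r for r in ref if r not in cur]       # expected but absent
    unexpected = [r for r in cur if r not in ref]    # new or changed rows
    return missing, unexpected

if __name__ == "__main__":
    reference = [{"sale_key": 1, "product": "book", "amount": 10.0}]
    current = [{"sale_key": 7, "product": "book", "amount": 10.0}]  # only the surrogate key differs
    print(diff_etl_runs(reference, current))  # ([], []) -> the two runs are considered identical
```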
Applying Transformations to Model Driven Data Warehouses

In the past few years, several conceptual approaches have been proposed for the specification of the main multidimensional (MD) properties of the data warehouse (DW) repository. However, these approaches often fail in providing mechanisms to univocally and automatically derive a logical representation of the MD conceptual model. To overcome this limitation, we present an approach to align the MD modeling of the DW repository with the Model Driven Architecture (MDA) by formally defining a set of Query/View/Transformation (QVT) transformation rules which allow us to obtain a logical representation of the MD conceptual model in an automatic way. Finally, we show how to implement our approach in an MDA-compliant commercial tool.

Jose-Norberto Mazón, Jesús Pardillo, Juan Trujillo
Bulk Loading a Linear Hash File

We study the problem of bulk loading a linear hash file; the problem is that a good hash function is able to distribute records into random locations in the file; however, performing a random disk access for each record can be costly and this cost increases with the size of the file. We propose a bulk loading algorithm that can avoid random disk accesses by reducing multiple accesses to the same location into a single access and reordering the accesses such that the pages are accessed sequentially. Our analysis shows that our algorithm is near-optimal with a cost roughly equal to the cost of sorting the dataset, thus the algorithm can scale up to very large datasets. Our experiments show that our method can improve upon the Berkeley DB load utility, in terms of running time, by two orders of magnitude and the improvements scale up well with the size of the dataset.

Davood Rafiei, Cheng Hu
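The reordering idea described in the abstract above, collapsing repeated accesses to the same location and turning random writes into a sequential pass, can be illustrated with a toy sketch. It ignores linear-hashing splits, page sizes and buffering, and the function names are assumptions for illustration only.

```python
# Sketch: bulk-load records by sorting them on their target hash bucket first,
# so each bucket ("page") is written once, sequentially, instead of via random accesses.
from itertools import groupby

def bulk_load(records, num_buckets, key=lambda r: r):
    """Yield (bucket_id, records) pairs in ascending bucket order."""
    tagged = sorted(((hash(key(r)) % num_buckets, r) for r in records),
                    key=lambda t: t[0])
    for bucket_id, group in groupby(tagged, key=lambda t: t[0]):
        yield bucket_id, [r for _, r in group]  # one sequential "page write" per bucket

if __name__ == "__main__":
    data = [f"record-{i}" for i in range(10)]
    for bucket, recs in bulk_load(data, num_buckets=4):
        print(bucket, recs)
```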

Materialized View

Dynamic View Selection for OLAP

Due to the increasing size of data warehouses it is often infeasible to materialize all possible aggregate views for Online Analytical Processing. View selection, the task of selecting a subset of views to materialize based on knowledge of the incoming queries and updates, is an important and challenging problem. In this paper we explore Dynamic View Selection, in which the distribution of queries changes over time, and a subset of a materialized view set is updated to better serve the incoming queries.

Michael Lawrence, Andrew Rau-Chaplin
Preview: Optimizing View Materialization Cost in Spatial Data Warehouses

One of the major challenges facing a data warehouse is to improve the query response time while keeping the maintenance cost to a minimum. Recent solutions to tackle this problem suggest to selectively materialize certain views and compute the remaining views on-the-fly, so that the cost is optimized. Unfortunately, in case of a spatial data warehouse, both the view materialization cost and the on-the-fly computation cost are often extremely high. This is due to the fact that spatial data are larger in size and spatial operations are more complex and expensive than the traditional relational operations. In this paper, we propose a new notion, called preview, for which both the materialization and on-the-fly costs are significantly smaller than those of the traditional views. Essentially, to achieve these cost savings, a preview pre-processes the non-spatial part of the query, and maintains pointers to the spatial data. In addition, it exploits the hierarchical relationships among the different views by maintaining a universal composite lattice, and mapping each view onto it. We optimally decompose a spatial query into three components, the preview part, the materialized view part and the on-the-fly computation part, so that the total cost is minimized. We demonstrate the cost savings with realistic query scenarios.

Songmei Yu, Vijayalakshmi Atluri, Nabil Adam
Preprocessing for Fast Refreshing Materialized Views in DB2

Materialized views (MVs) are used in databases and data warehouses to greatly improve query performance. In this context, a great challenge is to exploit commonalities among the views and to employ multi-query optimization techniques in order to derive an efficient global evaluation plan for refreshing the MVs concurrently.

IBM DB2® Universal Database™ (DB2 UDB) provides two query matching techniques, query stacking and query sharing, to exploit commonalities among the MVs, and to construct an efficient global evaluation plan. When the number of MVs is large, memory and time restrictions prevent us from using both query matching techniques in constructing efficient global plans. We suggest an approach that applies the query stacking and query sharing techniques in different steps. The query stacking technique is applied first, and the outcome is exploited to define groups of MVs. The number of MVs in each group is restricted. This allows the query sharing technique to be applied only within groups in a second step. Finally, the query stacking technique is used again to determine an efficient global evaluation plan. An experimental evaluation shows that the execution time of the plan generated by our approach is very close to that of the plan generated using both query matching techniques without restriction. This result is valid no matter how big the database is.

Wugang Xu, Calisto Zuzarte, Dimitri Theodoratos, Wenbin Ma

Multidimensional Design

A Multiversion-Based Multidimensional Model

This paper addresses the problem of how to specify changes in multidimensional databases. These changes may be motivated by evolutions of user requirements as well as changes of operational sources. The multiversion-based multidimensional model we provide supports both data and structure changes. The approach consists in storing star versions according to relevant structure changes whereas data changes are recorded through dimension instances and fact instances in a star version. The model is able to integrate mapping functions to populate multiversion-based multidimensional databases.

Franck Ravat, Olivier Teste, Gilles Zurfluh
Towards Multidimensional Requirement Design

The main objective of data warehouses (DW) is to facilitate decision-making. Thus their development has to take into account the requirements of DW project actors. While much recent research has focused on the design of multidimensional conceptual models, only a few works have been done on models and tools to analyze these requirements. Despite the specificities of these requirements (OLAP, historicization, ...), most previous works used E/R or UML schemas, which do not allow designers to represent these specific principles. A main property of DWs is that they are derived from existing data sources. Therefore, Extraction-Transformation-Loading (ETL) processes from these sources to the DW are very useful to define a reliable DW.

In this paper, we fill this gap by showing how to systematically derive a conceptual schema of actor requirements using a rule-based mechanism. We provide a conceptual model which tries to stay close to the users' vision of data by extending an object-oriented multidimensional model. Hence, in the early steps of DW engineering, designers can formalize actor requirements about information and ETL processes at the same time, easing understanding, confrontation and validation by all DW actors.

Estella Annoni, Franck Ravat, Olivier Teste, Gilles Zurfluh
Multidimensional Design by Examples

In this paper we present a method to validate user multidimensional requirements expressed in terms of SQL queries. Furthermore, our approach automatically generates and proposes the set of multidimensional schemas satisfying the user requirements, derived from the organizational operational schemas. If no multidimensional schema is generated for a query, we can state that the requirement is not multidimensional.

Oscar Romero, Alberto Abelló

OLAP and Multidimensional Model

Extending Visual OLAP for Handling Irregular Dimensional Hierarchies

Comprehensive data analysis has become indispensable in a variety of environments. Standard OLAP (On-Line Analytical Processing) systems, designed for satisfying the reporting needs of the business, tend to perform poorly or even fail when applied in non-business domains such as medicine, science, or government. The underlying multidimensional data model is restricted to aggregating only over summarizable data, i.e. where each dimensional hierarchy is a balanced tree. This limitation, obviously too rigid for a number of applications, has to be overcome in order to provide adequate OLAP support for novel domains.

We present a framework for querying complex multidimensional data, with the major effort at the conceptual level being to transform irregular hierarchies so that they can be navigated in a uniform manner. We provide a classification of various behaviors in dimensional hierarchies, followed by our two-phase modeling method that proceeds by eliminating irregularities in the data, with subsequent transformation of a complex hierarchical schema into a set of well-behaved sub-dimensions.

Mapping of the data to a visual OLAP browser relies solely on meta-data which captures the properties of facts and dimensions as well as the relationships across dimensional levels. Visual navigation is schema-based, i.e., users interact with dimensional levels and the data instances are displayed on-demand. The power of our approach is exemplified using a real-world study from the domain of academic administration.

Svetlana Mansmann, Marc H. Scholl
A Hierarchy-Driven Compression Technique for Advanced OLAP Visualization of Multidimensional Data Cubes

In this paper, we investigate the problem of visualizing multidimensional data cubes, and propose a novel technique for supporting advanced OLAP visualization of such data structures. Building on very efficient data compression solutions for two-dimensional data domains, the proposed technique relies on generating “semantics-aware” compressed representations of two-dimensional OLAP views extracted from multidimensional data cubes via the so-called OLAP dimension flattening process. A wide set of experimental results conducted on several kinds of synthetic two-dimensional OLAP views clearly confirms the effectiveness and the efficiency of our technique, also in comparison with state-of-the-art proposals.

Alfredo Cuzzocrea, Domenico Saccà, Paolo Serafino
Analysing Multi-dimensional Data Across Autonomous Data Warehouses

Business cooperations frequently require analysing data across enterprises, where there is no central authority to combine and manage cross-enterprise data. Thus, rather than integrating independent data warehouses into a Distributed Data Warehouse (DDWH) for cross-enterprise analyses, this paper introduces a multi data warehouse OLAP language for integrating, combining, and analysing data from several, independent data warehouses (DWHs). The approach may be best compared to multi-database query languages for database integration. The key difference to these prior works is that they do not consider the multi-dimensional organisation of data warehouses.

The major problems addressed and solutions provided are: (1) a classification of DWH schema and instance heterogeneities at the fact and dimension level, (2) a methodology to combine independent data cubes taking into account the special characteristics of conceptual DWH schemata, i.e., OLAP dimension hierarchies and facts, and (3) a novel query language for bridging these heterogeneities in cross-DWH OLAP queries.

Stefan Berger, Michael Schrefl
What Time Is It in the Data Warehouse?

Though in most data warehousing applications no relevance is given to the time when events are recorded, some domains call for a different behavior. In particular, whenever late registrations of events take place, and particularly when the events registered are subject to further updates, the traditional design solutions fail in preserving accountability and query consistency. In this paper we discuss the alternative design solutions that can be adopted, in presence of late registrations, to support different types of queries that enable meaningful historical analysis. These solutions are based on the enforcement of the distinction between transaction time and valid time within the model that represents the fact of interest. In particular, we show how late registrations can be differently supported depending on the flow or stock semantics given to events.

Stefano Rizzi, Matteo Golfarelli

Cubes Processing

Computing Iceberg Quotient Cubes with Bounding

In complex data warehouse applications, high dimensional data cubes can become very big. The quotient cube is attractive in that it not only summarizes the original cube but also keeps the roll-up and drill-down semantics between cube cells. In this paper we study the problem of semantic summarization of iceberg cubes, which comprise only cells that satisfy given aggregation constraints. We propose a novel technique for identifying groups of cells based on bounding aggregates and an efficient algorithm for computing iceberg quotient cubes for monotone functions. Our experiments show that iceberg quotient cubes can reduce data cube sizes and our iceberg quotient cubing algorithm can be over 10-fold more efficient than the current approach.

Xiuzhen Zhang, Pauline Lienhua Chou, Kotagiri Ramamohanarao
An Effective Algorithm to Extract Dense Sub-cubes from a Large Sparse Cube

A data cube provides aggregate information to support a class of queries such as the range-sum query. To process those queries efficiently, some auxiliary information, i.e. prefix sums, is pre-computed and maintained. In reality, however, most high-dimensional data cubes are very sparse, causing a serious space overhead. In this paper, we investigate an algorithm that extracts dense sub-cubes from a large sparse cube based on a density function. Instead of maintaining a large prefix-sum cube, a few dense sub-cubes are maintained to reduce the space overhead and to restrict update propagation. We present an iterative method that identifies dense intervals in each dimension and constructs sub-cubes based on the intervals found. We show the effectiveness of our method through analytic comparison and experiments on various data sets and dimensions.

Seok-Lyong Lee
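The interval-finding step mentioned in the abstract above can be pictured with a small sketch that scans one dimension of a sparse cube and reports contiguous runs of coordinates whose projected cell counts exceed a threshold. The projection, the threshold, and the function name are illustrative assumptions, not the paper's actual density function.

```python
# Sketch: find contiguous "dense" intervals along one dimension of a sparse cube,
# given the number of non-empty cells projected onto each coordinate value.

def dense_intervals(counts, threshold):
    """counts[i] = non-empty cells at coordinate i; return (start, end) runs with counts >= threshold."""
    intervals, start = [], None
    for i, c in enumerate(counts):
        if c >= threshold and start is None:
            start = i
        elif c < threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(counts) - 1))
    return intervals

if __name__ == "__main__":
    projection = [0, 1, 8, 9, 7, 0, 0, 5, 6, 0]
    print(dense_intervals(projection, threshold=5))  # [(2, 4), (7, 8)]
```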
On the Computation of Maximal-Correlated Cuboids Cells

The main idea of iceberg data cubing methods relies on optimization techniques for computing only the cuboid cells above a certain minimum support threshold. Even with such an approach the curse of dimensionality remains: given the large number of cuboids to compute, the output is still huge. More recently, some efforts have been made to compute only closed cuboids. Nevertheless, for some of the dense databases considered in this paper, even the set of all closed cuboids is too large. An alternative would be to compute only the maximal cuboids. However, a pure maximal approach implies losing some information: one can generate the complete set of cuboid cells from the maximal ones, but without their respective aggregation values. To control this loss of information we add an interestingness measure that we call the correlated value of a cuboid cell. In this paper, we propose a new notion for reducing cuboid aggregation by means of computing only the maximal-correlated cuboid cells, and present the M3C-Cubing algorithm that brings out those cuboids. Our evaluation study shows that the method followed is a promising candidate for scalable data cubing, reducing the number of cuboids by at least an order of magnitude in comparison with that of closed ones.

Ronnie Alves, Orlando Belo

Data Warehouse Applications

Warehousing Dynamic XML Documents

Due to the increased popularity of using XML documents in exchanging information between diverse types of applications or in representing semi-structured data, the issue of warehousing large collections of XML documents has become strongly imperative. Furthermore, an efficient XML warehouse needs to be able to answer various user queries about its content or about the history of the warehoused documents. In this paper, we propose a methodology for warehousing dynamic XML documents, which allows a low degree of redundancy of the warehoused data, while preserving critical information.

Laura Irina Rusu, Wenny Rahayu, David Taniar
Integrating Different Grain Levels in a Medical Data Warehouse Federation

Healthcare organizations practicing evidence-based medicine strive to unite their data resources in order to achieve a wider knowledge base for sophisticated research and matured decision support service. The central point of such an integrated system is a data warehouse, to which all participants have access. In order to ensure better protection of highly sensitive healthcare data, the warehouse is not created physically, but as a federated system. The paper describes the conceptual design of a health insurance data warehouse federation (HEWAF) aimed at supporting evidence-based medicine. We address a major domain-specific conceptual design issue: the integration of low-grained, time-segmented data into the traditional warehouse, whose basic grain level is higher than that of the time-segmented data. The conceptual model is based on a widely accepted international healthcare standard. We use ontologies of the data warehouse domain, as well as of the healthcare and pharmacy domains, to provide schema matching between the federation and the component warehouses.

Marko Banek, A Min Tjoa, Nevena Stolba
A Versioning Management Model for Ontology-Based Data Warehouses

More and more integration systems use ontologies to solve the problem of semantic heterogeneities between autonomous databases. To automate the integration process, a number of these systems suppose the existence of a shared domain ontology a priori referenced by the local ontologies embedded in the various sources. When the shared ontology evolves over time, the evolution may concern (1) the ontology level, (2) the local schema level, and/or (3) the contents of sources. Since sources are autonomous and may evolve independently, managing the evolution of the integrated system turns into an asynchronous versioning problem. In this paper, we propose an approach and a model to deal with this problem in the context of a materialized integration system. To manage the changes of contents and schemas of sources, we adapt existing solutions proposed for traditional databases. To support ontology changes, we propose the principle of ontological continuity. It supposes that an evolution of an ontology should not falsify an axiom that was previously true. This principle allows the management of each old instance using the new version of the ontology. With this assumption, we propose an approach, called the floating version model, that fully automates the whole integration process. Our proposed work has been validated by a prototype using the ECCO environment and the EXPRESS language.

Dung Nguyen Xuan, Ladjel Bellatreche, Guy Pierra
Data Warehouses in Grids with High QoS

Data warehouses are repositories of large amounts of historical data and are used primarily for decision support purposes. On the other hand, grids consist of the aggregation of distributed computational resources and the presentation of these as a single service with a common interface. The deployment of distributed data warehouses on a grid architecture with QoS control strategies could lead to high levels of flexibility, scalability, reliability and efficiency. However, due to grid characteristics, it could also lead to great challenges. In this paper we investigate an efficient architecture for deploying large data warehouses in grids with high availability and good load balancing. We propose the architecture and present experimental results.

Rogério Luís de Carvalho Costa, Pedro Furtado

Mining Techniques (1)

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

This paper describes a prediction problem based on direct marketing data coming from the Nationwide Products and Services Questionnaire (NPSQ) prepared by the Polish division of Acxiom Corporation. The problem that we analyze is stated as prediction of Internet accessibility. The unit of analysis corresponds to a group of individuals in a certain age category living in a certain building located in Poland. We used several machine learning methods to build our prediction models. In particular, we applied ensembles of weak learners and the ModLEM algorithm, which is based on the rough set approach. A comparison of the results generated by these methods is included in the paper. We also report some of the problems that we encountered during the analysis.

Jerzy Błaszczyński, Krzysztof Dembczyński, Wojciech Kotłowski, Mariusz Pawłowski
Efficient Mining of Dissociation Rules

Association rule mining is one of the most popular data mining techniques. Significant work has been done to extend the basic association rule framework to allow for mining rules with negation. Negative association rules indicate the presence of negative correlation between items and can reveal valuable knowledge about the examined dataset. Unfortunately, the sparsity of the input data significantly reduces the practical usability of negative association rules, even if additional pruning of discovered rules is performed. In this paper we introduce the concept of dissociation rules. Dissociation rules present a significant simplification over the sophisticated negative association rule framework, while keeping the set of returned patterns concise and actionable. A new formulation of the problem allows us to present an efficient algorithm for mining dissociation rules. Experiments conducted on synthetic datasets prove the effectiveness of the proposed solution.

Mikołaj Morzy
Optimized Rule Mining Through a Unified Framework for Interestingness Measures

The large number of association rules resulting from a KDD process makes the exploitation of the patterns embedded in the database difficult or even impossible. In order to address this problem, various interestingness measures have been proposed for selecting the most relevant rules. Nevertheless, the choice of an appropriate measure remains a hard task and the use of several measures may lead to conflicting information. In this paper, we propose a unified framework for a set of interestingness measures $\mathcal{M}$ and prove that most of the usual objective measures behave in a similar way. In the context of classification rules, we show that each measure of $\mathcal{M}$ admits a lower bound on the condition that a minimal frequency threshold and a maximal number of exceptions are considered. Furthermore, our framework makes it possible to characterize the whole collection of rules simultaneously optimizing all the measures of $\mathcal{M}$. We finally provide a method to mine a rule cover of this collection.

Céline Hébert, Bruno Crémilleux
An Information-Theoretic Framework for Process Structure and Data Mining

Process-oriented systems have been increasingly attracting the data mining community, due to the opportunities that the application of inductive process mining techniques to log data can open to both the analysis of complex processes and the design of new process models. Currently, these techniques focus on structural aspects of the process and disregard data that are kept by many real systems, such as information about activity executors, parameter values, and time-stamps.

In this paper, an enhanced process mining approach is presented, where different process variants (use cases) can be discovered by clustering log traces, based on both structural aspects and performance measures. To this aim, an information-theoretic framework is used, where the structural information as well as performance measures are represented by a proper domain, which is correlated to the “central domain” of logged process instances. Then, the clustering of log traces is performed synergically with that of the correlated domains. Eventually, each cluster is equipped with a specific model, so providing the analyst with a compact and handy description of the execution paths characterizing each process variant.

Antonio D. Chiaravalloti, Gianluigi Greco, Antonella Guzzo, Luigi Pontieri

Mining Techniques (2)

Mixed Decision Trees: An Evolutionary Approach

In the paper, a new evolutionary algorithm (EA) for mixed tree learning is proposed. In non-terminal nodes of a mixed decision tree different types of tests can be placed, ranging from a typical univariate inequality test up to a multivariate test based on a splitting hyperplane. In contrast to classical top-down methods, our system searches for an optimal tree in a global manner, i.e. it learns a tree structure and tests in one run of the EA. Specialized genetic operators allow for generating new sub-trees, pruning existing ones as well as changing the node type and the tests. The proposed approach was experimentally verified on both artificial and real-life data and preliminary results are promising.

Marek Krȩtowski, Marek Grześ
ITER: An Algorithm for Predictive Regression Rule Extraction

Various benchmarking studies have shown that artificial neural networks and support vector machines have a superior performance when compared to more traditional machine learning techniques. The main resistance against these newer techniques is based on their lack of interpretability: it is difficult for the human analyst to understand the motivation behind these models’ decisions. Various rule extraction techniques have been proposed to overcome this opacity restriction. However, most of these extraction techniques are devised for classification and only few algorithms can deal with regression problems.

In this paper, we present ITER, a new algorithm for pedagogical regression rule extraction. Based on a trained ‘black box’ model, ITER is able to extract human-understandable regression rules. Experiments show that the extracted model performs well in comparison with CART regression trees and various other techniques.

Johan Huysmans, Bart Baesens, Jan Vanthienen
COBRA: Closed Sequential Pattern Mining Using Bi-phase Reduction Approach

In this work, we study the problem of closed sequential pattern mining. We propose a novel approach which extends a frequent sequence with closed itemsets instead of single items. The motivation is that closed sequential patterns are composed only of closed itemsets. Hence, unnecessary item extensions which generate non-closed sequential patterns can be avoided. Experimental evaluation shows that the proposed approach is two orders of magnitude faster than previous works, with a modest memory cost.

Kuo-Yu Huang, Chia-Hui Chang, Jiun-Hung Tung, Cheng-Tao Ho

Frequent Itemsets

A Greedy Approach to Concurrent Processing of Frequent Itemset Queries

We consider the problem of concurrent execution of multiple frequent itemset queries. If such data mining queries operate on overlapping parts of the database, then their overall I/O cost can be reduced by integrating their dataset scans. The integration requires that data structures of many data mining queries are present in memory at the same time. If the memory size is not sufficient to hold all the data mining queries, then the queries must be scheduled into multiple phases of loading and processing. Since finding the optimal assignment of queries to phases is infeasible for large batches of queries due to the size of the search space, heuristic algorithms have to be applied. In this paper we formulate the problem of assigning the queries to phases as a particular case of hypergraph partitioning. To solve the problem, we propose and experimentally evaluate two greedy optimization algorithms.

Pawel Boinski, Marek Wojciechowski, Maciej Zakrzewicz
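One greedy flavour of the phase-assignment problem described above can be sketched as a memory-budgeted packing of queries into phases. This first-fit-decreasing sketch is only an illustration: it ignores the dataset overlap between queries that the paper's hypergraph formulation captures, and the cost model is an assumption.

```python
# Sketch: greedily assign frequent-itemset queries to processing phases so that the
# data structures held in memory at the same time never exceed the memory budget.

def assign_to_phases(query_sizes, memory_budget):
    """query_sizes: {query_id: estimated memory}; returns a list of phases (lists of query ids)."""
    phases, loads = [], []
    # First-fit decreasing: place large queries first, into the first phase with room.
    for qid, size in sorted(query_sizes.items(), key=lambda kv: -kv[1]):
        for i in range(len(phases)):
            if loads[i] + size <= memory_budget:
                phases[i].append(qid)
                loads[i] += size
                break
        else:
            phases.append([qid])
            loads.append(size)
    return phases

if __name__ == "__main__":
    sizes = {"q1": 40, "q2": 35, "q3": 30, "q4": 20, "q5": 10}
    print(assign_to_phases(sizes, memory_budget=60))  # e.g. [['q1', 'q4'], ['q2', 'q5'], ['q3']]
```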
Two New Techniques for Hiding Sensitive Itemsets and Their Empirical Evaluation

Many privacy preserving data mining algorithms attempt to selectively hide what database owners consider as sensitive. Specifically, in the association-rules domain, many of these algorithms are based on item-restriction methods; that is, removing items from some transactions in order to hide sensitive frequent itemsets.

This area is still in its infancy: clear methods have not yet been produced, and the few available ones have not been evaluated. However, determining what is most effective in protecting sensitive itemsets, while not hiding non-sensitive ones as a side effect, remains a crucial research issue. This paper introduces two new techniques that deal with scenarios where many itemsets of different sizes are sensitive. We empirically evaluate our two sanitization techniques and compare their efficiency as well as which has the minimum effect on the non-sensitive frequent itemsets.

Ahmed HajYasien, Vladimir Estivill-Castro
EStream: Online Mining of Frequent Sets with Precise Error Guarantee

In data stream applications, a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Of course, the quality of the approximation is as important as its timely delivery. Unfortunately, algorithms capable of online processing do not conform strictly to a precise error guarantee. Since online processing is essential and so is the precision of the error, it is necessary that stream algorithms meet both criteria. Yet, this is not the case for mining frequent sets in data streams. We present EStream, a novel algorithm that allows online processing while producing results strictly within the error bound. Our theoretical and experimental results show that EStream is a better candidate for finding frequent sets in data streams when both constraints need to be satisfied.

Xuan Hong Dang, Wee-Keong Ng, Kok-Leong Ong

Mining Data Streams

Granularity Adaptive Density Estimation and on Demand Clustering of Concept-Drifting Data Streams

Clustering data streams has found a few important applications. While many previous studies focus on clustering objects arriving in a data stream, in this paper, we consider the novel problem of on demand clustering of concept-drifting data streams. In order to characterize concept-drifting data streams, we propose an effective method to estimate densities of data streams. One unique feature of our new method is that its granularity of estimation is adaptive to the available computation resources, which is critical for processing data streams of unpredictable input rates. Moreover, we can apply any clustering method to on-demand clustering of data streams using their density estimations. A performance study on synthetic data sets is reported to verify our design, which clearly shows that our method obtains results comparable to CluStream [3] on clustering a single stream, and much better results than COD [8] when clustering multiple streams.

Weiheng Zhu, Jian Pei, Jian Yin, Yihuang Xie
Classification of Hidden Network Streams

Traffic analysis is an important issue for network monitoring and security. We focus on identifying protocols for network traffic by analysing the size, timing and direction of network packets. By using these network stream characteristics, we propose a technique for modelling the behaviour of various TCP protocols. This model can be used for recognising protocols even when running under encrypted tunnels. This is complemented with experimental evaluation on real world network data.

Matthew Gebski, Alex Penev, Raymond K. Wong
Adaptive Load Shedding for Mining Frequent Patterns from Data Streams

Most algorithms that focus on discovering frequent patterns from data streams assume that the machinery is capable of managing all the incoming transactions without any delay, or without the need to drop transactions. However, this assumption is often impractical due to the inherent characteristics of data stream environments. Especially under high load conditions, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that, in turn, affect the applicability of the data mining models produced, which often have a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested our algorithm on both synthetic and real-life datasets to verify its feasibility.

Xuan Hong Dang, Wee-Keong Ng, Kok-Leong Ong
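The mechanism described above, probabilistically dropping transactions when the system is overloaded, can be pictured with a short sketch. The overload signal (queue length against a target) and the linear drop probability are assumptions for illustration; the paper derives its own probabilistic model.

```python
# Sketch: adaptive load shedding for a transaction stream. When the input queue grows
# beyond its target length, drop transactions with a probability that rises with the overload.
import random

def shed(transactions, queue_len, target_len, rng=random.random):
    """Yield the transactions that survive shedding under the current load."""
    overload = max(0.0, (queue_len - target_len) / max(queue_len, 1))
    drop_prob = min(1.0, overload)   # 0 when under target, grows with the backlog
    for t in transactions:
        if rng() >= drop_prob:       # keep with probability 1 - drop_prob
            yield t

if __name__ == "__main__":
    batch = list(range(1000))
    kept = list(shed(batch, queue_len=2000, target_len=1000))
    print(len(kept), "of", len(batch), "transactions kept")  # roughly half under 2x overload
```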
An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams

A data stream is an unbounded sequence of data elements generated at a rapid rate, providing a dynamic environment for collecting data sources. It is likely that the knowledge embedded in a data stream will change quickly as time goes by. Therefore, catching the recent trend of data is an important issue when mining frequent itemsets from data streams. Although the sliding window model provides a good solution for this problem, the occurrence information of the patterns within the sliding window has to be maintained completely in the traditional approach. In this paper, in order to estimate the approximate supports of patterns within the current sliding window, two data structures are proposed to maintain the average time stamps and frequency changing points of patterns, respectively. The experimental results show that our approach reduces run-time memory usage significantly. Moreover, the proposed FCP algorithm achieves high accuracy of mining results and guarantees no false dismissals.

Jia-Ling Koh, Shu-Ning Shin

Ontology-Based Mining

Learning Classifiers from Distributed, Ontology-Extended Data Sources

There is an urgent need for sound approaches to integrative and collaborative analysis of large, autonomous (and hence, inevitably semantically heterogeneous) data sources in several increasingly data-rich application domains. In this paper, we precisely formulate and solve the problem of learning classifiers from such data sources, in a setting where each data source has a hierarchical ontology associated with it and semantic correspondences between data source ontologies and a user ontology are supplied. The proposed approach yields algorithms for learning a broad class of classifiers (including Bayesian networks, decision trees, etc.) from semantically heterogeneous distributed data with strong performance guarantees relative to their centralized counterparts. We illustrate the application of the proposed approach in the case of learning Naive Bayes classifiers from distributed, ontology-extended data sources.

Doina Caragea, Jun Zhang, Jyotishman Pathak, Vasant Honavar
A Coherent Biomedical Literature Clustering and Summarization Approach Through Ontology-Enriched Graphical Representations

In this paper, we introduce a coherent biomedical literature clustering and summarization approach that employs a graphical representation method for text using a biomedical ontology. The key of the approach is to construct document cluster models as semantic chunks capturing the core semantic relationships in the ontology-enriched scale-free graphical representation of documents. These document cluster models are used for both document clustering and text summarization by constructing a Text Semantic Interaction Network (TSIN). Our extensive experimental results indicate that our approach shows 45% cluster quality improvement and 72% clustering reliability improvement, in terms of misclassification index, over Bisecting K-means, a leading document clustering approach. In addition, our approach provides concise but rich text summaries in key concepts and sentences. The primary contribution of this paper is a coherent biomedical literature clustering and summarization approach that takes advantage of ontology-enriched graphical representations, significantly improving the quality of document clusters and the understandability of documents through summaries.

Illhoi Yoo, Xiaohua Hu, Il-Yeol Song
Automatic Extraction for Creating a Lexical Repository of Abbreviations in the Biomedical Literature

The sheer volume of biomedical text is growing at an exponential rate. This growth creates challenges for both human readers and automatic text processing algorithms. One such challenge arises from common and uncontrolled usage of abbreviations in the biomedical literature. This, in turn, requires that biomedical lexical ontologies be continuously updated. In this paper, we propose a hybrid approach combining lexical analysis techniques and the Support Vector Machine (SVM) to create an automatically generated and maintained lexicon of abbreviations. The proposed technique is differentiated from others in the following aspects: 1) it incorporates lexical analysis techniques into supervised learning for extracting abbreviations; 2) it makes use of text chunking techniques to identify the long forms of abbreviations; 3) it significantly improves recall compared to other techniques. The experimental results show that our approach outperforms the leading abbreviation algorithms, ExtractAbbrev and ALICE, by at least 6% and 13.9%, respectively, in both precision and recall on the Gold Standard Development corpus.

Min Song, Il-Yeol Song, Ki Jung Lee

Clustering

Priority-Based k-Anonymity Accomplished by Weighted Generalisation Structures

Biobanks are gaining in importance by storing large collections of patients' clinical data (e.g. disease history, laboratory parameters, diagnosis, life style) together with biological materials such as tissue samples, blood or other body fluids. When releasing these patient-specific data for medical studies, privacy protection has to be guaranteed for ethical and legal reasons. k-anonymity may be used to ensure privacy by generalising and suppressing attributes in order to release sufficient data twins that mask patients' identities. However, data transformation techniques like generalisation may produce anonymised data unusable for medical studies because some attributes become too coarse-grained. We propose a priority-driven anonymisation technique that allows the degree of acceptable information loss to be specified for each attribute separately. We use generalisation and suppression of attributes together with a weighting scheme for quantifying generalisation steps. Our approach handles both numerical and categorical attributes and provides data anonymisation based on priorities and weights. The anonymisation algorithm described in this paper has been implemented and tested on a carcinoma data set. We discuss some general privacy protecting methods for medical data and show some medically relevant use cases that benefit from our anonymisation technique.

Konrad Stark, Johann Eder, Kurt Zatloukal
Achieving k-Anonymity by Clustering in Attribute Hierarchical Structures

Individual privacy will be at risk if a published data set is not properly de-identified. k-anonymity is a major technique to de-identify a data set. A more general view of k-anonymity is clustering with a constraint of the minimum number of objects in every cluster. Most existing approaches to achieving k-anonymity by clustering are for numerical (or ordinal) attributes. In this paper, we study achieving k-anonymity by clustering in attribute hierarchical structures. We define generalisation distances between tuples to characterise distortions by generalisations and discuss the properties of the distances. We conclude that the generalisation distance is a metric distance. We propose an efficient clustering-based algorithm for k-anonymisation. We experimentally show that the proposed method is more scalable and causes significantly less distortion than an optimal global recoding k-anonymity method.

Jiuyong Li, Raymond Chi-Wing Wong, Ada Wai-Chee Fu, Jian Pei
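One way to picture a generalisation distance over attribute hierarchies is to count how many levels two values must be generalised to reach a common ancestor, summed over the attributes of a tuple. The tiny postcode-style hierarchy and the exact formula below are illustrative assumptions and not the distance defined in the paper.

```python
# Sketch: a generalisation distance between attribute values in a hierarchy,
# measured as the number of generalisation steps both values need to meet.

HIERARCHY = {  # illustrative postcode-style hierarchy: value -> parent ("****" is the root)
    "4350": "43**", "4351": "43**", "4370": "43**",
    "5400": "54**", "43**": "****", "54**": "****", "****": None,
}

def ancestors(value, hierarchy=HIERARCHY):
    chain = [value]
    while hierarchy.get(chain[-1]) is not None:
        chain.append(hierarchy[chain[-1]])
    return chain

def generalisation_distance(v1, v2, hierarchy=HIERARCHY):
    """Levels both values must climb to reach their lowest common ancestor."""
    a1, a2 = ancestors(v1, hierarchy), ancestors(v2, hierarchy)
    common = next(x for x in a1 if x in a2)
    return a1.index(common) + a2.index(common)

if __name__ == "__main__":
    print(generalisation_distance("4350", "4351"))  # 2: both generalise to "43**"
    print(generalisation_distance("4350", "5400"))  # 4: both generalise to "****"
```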
Calculation of Density-Based Clustering Parameters Supported with Distributed Processing

In today's world of data-mining applications there is a strong need for processing spatial data. Spatial object clustering is often a crucial operation in applications such as traffic-tracking systems or telemetry-oriented systems. Our current research is focused on providing an efficient caching structure for a telemetric data warehouse. We perform spatial object clustering for every level of the structure. For this purpose we employ a density-based clustering algorithm. However efficient and scalable, the algorithm requires a user-defined parameter Eps. As we cannot get the Eps from the user for every level of the structure, we propose a heuristic approach for calculating the Eps parameter. The Automatic Eps Calculation (AEC) algorithm analyzes pairs of points, defining two quantities: the distance between the points and the density of the stripe between the points. In this paper we describe in detail the algorithm's operation and the interpretation of its results. The AEC algorithm was implemented in both a centralized and a distributed version. The included test results compare the two versions and verify the AEC algorithm's correctness on various datasets.

Marcin Gorawski, Rafal Malczok
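To give a feel for what an automatic Eps estimate looks like, the sketch below uses a generic k-distance heuristic: the average distance to each point's k-th nearest neighbour. This is deliberately not the AEC algorithm (which analyses pairs of points by their distance and the density of the stripe between them); it is a stand-in illustration under assumed parameters.

```python
# Sketch: a generic heuristic for estimating a density-based clustering Eps without
# user input, via the average distance to each point's k-th nearest neighbour.
import math

def kth_neighbour_distances(points, k=4):
    dists = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(d[min(k, len(d)) - 1])
    return dists

def estimate_eps(points, k=4):
    """Average k-th nearest-neighbour distance as a rough Eps candidate."""
    d = kth_neighbour_distances(points, k)
    return sum(d) / len(d)

if __name__ == "__main__":
    pts = [(float(x), float(y)) for x in range(5) for y in range(5)]  # a small regular grid
    print(round(estimate_eps(pts, k=4), 3))
```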
Cluster-Based Sampling Approaches to Imbalanced Data Distributions

For classification problems, the training data significantly influences the classification accuracy. When the data set is highly unbalanced, classification algorithms tend to degenerate by assigning all cases to the most common outcome. Hence, it is important to select suitable training data for classification in the imbalanced class distribution problem. In this paper, we propose cluster-based under-sampling approaches for selecting representative data as training data to improve the classification accuracy in an imbalanced class distribution environment. A neural network model is used as the basic classification algorithm. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques used in previous studies.

Show-Jane Yen, Yue-Shi Lee
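The under-sampling idea described above, clustering the majority class and keeping only representative examples, can be sketched as follows. The use of k-means and the choice of the sample nearest to each centroid are assumptions made for illustration; the paper evaluates several specific cluster-based selection variants.

```python
# Sketch: cluster-based under-sampling of the majority class. Cluster the majority
# examples and keep one representative per cluster (the point nearest the centroid),
# so the kept samples cover the majority class instead of being an arbitrary subset.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def undersample(majority, k):
    """Keep the majority example closest to each of the k cluster centroids."""
    centroids = kmeans(majority, k)
    return [min(majority, key=lambda p: math.dist(p, c)) for c in centroids]

if __name__ == "__main__":
    majority = [(random.random() * 10, random.random() * 10) for _ in range(200)]
    minority = [(0.0, 0.0)] * 20
    balanced = undersample(majority, k=len(minority)) + minority
    print(len(balanced), "training samples after under-sampling")
```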

Advanced Mining Techniques

Efficient Mining of Large Maximal Bicliques

Many real world applications rely on the discovery of maximal biclique subgraphs (complete bipartite subgraphs). However, existing algorithms for enumerating maximal bicliques are not very efficient in practice. In this paper, we propose an efficient algorithm to mine large maximal biclique subgraphs from undirected graphs. Our algorithm uses a divide-and-conquer approach. It effectively uses the size constraints on both vertex sets to prune unpromising bicliques and to reduce the search space iteratively during the mining process. The time complexity of the proposed algorithm is O(ndN), where n is the number of vertices, d is the maximal degree of the vertices and N is the number of maximal bicliques. Our performance study shows that the proposed algorithm outperforms previous work significantly.

Guimei Liu, Kelvin Sim, Jinyan Li
Automatic Image Annotation by Mining the Web

Automatic image annotation has become an attractive research subject. Most current image annotation methods are based on training techniques, whose major weaknesses include a limited annotation vocabulary and labor-intensive involvement. Web images, however, come with a lot of text, which provides rich annotations. Therefore, this paper presents a novel image annotation method that mines the Web: term-image correlations are obtained from the Web rather than by learning. Naturally, this relation contains a lot of noise, and some cleaning is necessary. In the system, entropy weighting and an image clustering technique are employed. Our experimental results show that the solution achieves satisfactory performance.

Zhiguo Gong, Qian Liu, Jingbai Zhang
Privacy Preserving Spatio-Temporal Clustering on Horizontally Partitioned Data

Time-stamped location information is regarded as spatio-temporal data and, by its nature, such data is highly sensitive from the perspective of privacy. In this paper, we propose a privacy preserving spatio-temporal clustering method for horizontally partitioned data which, to the best of our knowledge, has not been done before. Our method is based on building the dissimilarity matrix through a series of secure multi-party trajectory comparisons managed by a third party. Our trajectory comparison protocol complies with most trajectory comparison functions, and complexity analysis of our methods shows that our protocol does not introduce extra overhead when constructing the dissimilarity matrix, compared to the centralized approach.

Ali İnan, Yücel Saygın

Association Rules

Discovering Semantic Sibling Associations from Web Documents with XTREEM-SP

The semi-automatic extraction of semantics for ontology enhancement or semantic-based information retrieval encompasses several open challenges. There are many findings on the identification of vertical relations among concepts, but much less on indirect, horizontal relations among concepts that share a common, a priori unknown parent, such as Co-Hyponyms and Co-Meronyms. We propose the method XTREEM-SP (Xhtml TREE Mining for Sibling Pairs) for the discovery of such binary "sibling"-relations between concepts of a given vocabulary. While conventional methods process an appropriately prepared corpus, XTREEM-SP operates upon an arbitrarily heterogeneous Web Document Collection on a given topic and returns sibling relations between concepts associated to it. XTREEM-SP is independent of domain and language and does not rely on linguistic preprocessing nor on background knowledge beyond the ontology it is asked to enhance. We present our evaluation results with two gold standard ontologies and show that XTREEM-SP performs well, while being computationally inexpensive.

Marko Brunzel, Myra Spiliopoulou
Difference Detection Between Two Contrast Sets

Mining group differences is useful in many applications, such as medical research, social network analysis and link discovery. The differences between groups can be measured from either a statistical or a data mining perspective. In this paper, we propose an empirical likelihood (EL) based strategy for building confidence intervals for the mean and distribution differences between two contrasting groups. In our approach we take into account the (semi-parametric) structure of the groups, and we experimentally evaluate the proposed approach using both simulated and real-world data. The results demonstrate that our approach is effective in building confidence intervals for group differences such as the mean and the distribution function.

Hui-jing Huang, Yongsong Qin, Xiaofeng Zhu, Jilian Zhang, Shichao Zhang
EGEA: A New Hybrid Approach Towards Extracting Reduced Generic Association Rule Set (Application to AML Blood Cancer Therapy)

Association rule sets extracted by classical methods are often unmanageably large and of low precision, which makes the perusal of the obtained knowledge ineffective; extracting and exploiting compact and informative generic bases of association rules is therefore becoming a must. Moreover, such bases provide a powerful verification technique for hampering gene mis-annotation or bad clustering in the Unigene library. However, the extracted generic bases are still oversized and their exploitation is impractical. Thus, providing critical nuggets of added-value knowledge is a compelling issue. To tackle this drawback, we propose in this paper a novel approach, called EGEA (Evolutionary Gene Extraction Approach), which aims to considerably reduce the quantity of knowledge, extracted from a gene expression dataset, that is presented to an expert. We first use a genetic algorithm to select the most predictive set of genes related to patient situations. Once the relevant attributes (genes) have been selected, they serve as input for the second stage of the approach, i.e., extracting generic association rules from this reduced set of genes. The notable decrease in the cardinality of the generic association rules extracted from the selected gene set improves the quality of knowledge exploitation. Experiments carried out on a benchmark dataset pointed out that this set contains previously unknown prognosis-associated genes, which may serve as molecular targets for new therapeutic strategies to repress the relapse of pediatric acute myeloid leukemia (AML).

M. A. Esseghir, G. Gasmi, S. Ben Yahia, Y. Slimani

Miscellaneous Applications

AISS: An Index for Non-timestamped Set Subsequence Queries

In many recent applications of database management systems, data may be stored in user-defined complex data types (such as sequences). However, efficient querying of such data is not supported by commercially available database management systems, and therefore efficient indexing schemes for complex data types need to be developed. In this paper we focus primarily on the indexing of non-timestamped sequences of sets of categorical data, specifically indexing for set subsequence queries. We address both the logical structure and the implementation issues of such indexes. Our main contributions are threefold. First, we specify the logical structure of the index and propose algorithms for set subsequence query execution which utilize the index structure. Second, we provide a proposal for the implementation of such an index, which uses means available in all off-the-shelf database management systems. Finally, we experimentally evaluate the performance of the index.

Witold Andrzejewski, Tadeusz Morzy
A Method for Feature Selection on Microarray Data Using Support Vector Machine

The data collected from a typical microarray experiment usually consists of tens of samples and thousands of genes (i.e., features). Usually only a small subset of features is relevant and non-redundant to differentiate the samples. Identifying an optimal subset of relevant genes is crucial for accurate classification of samples. In this paper, we propose a method for relevant gene subset selection for microarray gene expression data. Our method is based on the gap tolerant classifier, a variation of the support vector machine, and uses a hill-climbing search strategy. Unlike most other hill-climbing approaches, where classification accuracies are used as a criterion for feature selection, the proposed method uses a mixture of accuracy and SVM margin to select features. Our experimental results show that this strategy is effective both in selecting relevant and in eliminating redundant features.

Xiao Bing Huang, Jian Tang
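The search loop described above, greedy hill-climbing over gene subsets scored by a mixture of accuracy and SVM margin, might look like the following sketch. The `evaluate` callback, the linear mixing weight, and the toy data are placeholders; in the paper's setting it would train a gap-tolerant classifier on the selected genes and return its accuracy and margin.

```python
# Sketch: greedy hill-climbing feature (gene) selection driven by a score that mixes
# classification accuracy and SVM margin. `evaluate(features)` is an assumed callback
# returning (accuracy, margin) for a classifier trained on the given feature subset.

def score(accuracy, margin, alpha=0.8):
    return alpha * accuracy + (1 - alpha) * margin  # illustrative mixture of the two criteria

def hill_climb(all_features, evaluate, alpha=0.8):
    selected, best = set(), float("-inf")
    improved = True
    while improved:
        improved = False
        for f in all_features - selected:
            acc, margin = evaluate(selected | {f})
            s = score(acc, margin, alpha)
            if s > best:
                best, best_feature, improved = s, f, True
        if improved:
            selected.add(best_feature)
    return selected, best

if __name__ == "__main__":
    # Toy evaluator: pretend genes "g7" and "g13" are the informative ones.
    def fake_evaluate(feats):
        informative = len(feats & {"g7", "g13"})
        return 0.5 + 0.2 * informative - 0.01 * len(feats), 0.1 * informative
    genes = {f"g{i}" for i in range(20)}
    print(hill_climb(genes, fake_evaluate))  # selects {"g7", "g13"}
```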
Providing Persistence for Sensor Data Streams by Remote WAL

Rapidly changing environments such as robots, sensor networks, or medical services are emerging. To deal with them, a DBMS should be able to persist sensor data streams instantaneously, which requires accelerating the data persisting process. Although write ahead logging (WAL) acceleration is essential for this purpose, only a few studies have been conducted.

To accelerate the data persisting process, this paper proposes a remote WAL with asynchronous checkpointing technique, and designs and implements it. To evaluate the technique, this paper conducts experiments on an object relational DBMS called KRAFT.

The results of the experiments show that remote WAL clearly outperforms disk-based WAL. In the throughput evaluation, the best policy shows about 12 times better performance compared with disk-based WAL. As for logging time, the policy stays below 1000 microseconds, which is the period of motor data acquisition on conventional robots.

Hideyuki Kawashima, Michita Imai, Yuichiro Anzai

Classification

Support Vector Machine Approach for Fast Classification

In this study, we propose a new technique to integrate support vector machines and association rule mining in order to implement a fast and efficient classification algorithm that overcomes the drawbacks of machine learning and association rule-based classification algorithms. The reported test results demonstrate the applicability, efficiency and effectiveness of the proposed approach.

Keivan Kianmehr, Reda Alhajj
Document Representations for Classification of Short Web-Page Descriptions

Motivated by applying Text Categorization to sorting Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers – Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts represent short Web-page descriptions from the dmoz Open Directory Web-page ontology. Different transformations of input data: stemming, normalization, logtf and idf, together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics – accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.

Miloš Radovanović, Mirjana Ivanović
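A minimal sketch of the logtf and idf transformations examined in the study is given below, assuming the common formulations logtf = log(1 + tf) and idf = log(N / df); the exact variants used by the authors may differ, and stemming and normalization are omitted.

```python
# Sketch: bag-of-words weighting combining the logtf and idf transformations
# (assumed as logtf = log(1 + tf) and idf = log(N / df); other variants exist).
import math
from collections import Counter

def logtf_idf(documents):
    """documents: list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({
            term: math.log(1 + count) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

if __name__ == "__main__":
    docs = [["open", "directory", "web", "page"],
            ["web", "page", "classification"],
            ["naive", "bayes", "classification"]]
    for vec in logtf_idf(docs):
        print({t: round(w, 3) for t, w in vec.items()})
```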
GARC: A New Associative Classification Approach

Many studies in data mining have proposed a new classification approach called associative classification. According to several reports, associative classification achieves higher classification accuracy than traditional classification approaches. However, associative classification suffers from a major drawback: it is based on the use of a very large number of classification rules, and consequently takes considerable effort to select the best ones in order to construct the classifier. To overcome this drawback, we propose a new associative classification method called Garc that exploits a generic basis of association rules in order to reduce the number of association rules without jeopardizing the classification accuracy. Moreover, Garc introduces a new selection criterion, called score, which improves the selection of the best rules during classification. Experiments carried out on 12 benchmark data sets indicate that Garc is highly competitive in terms of accuracy in comparison with popular associative classification methods.

I. Bouzouita, S. Elloumi, S. Ben Yahia
Conceptual Modeling for Classification Mining in Data Warehouses

Classification is a data mining (DM) technique that generates classes allowing one to predict and describe the behavior of a variable based on the characteristics of a dataset. Frequently, DM analysts need to classify large amounts of data using many attributes. Thus, data warehouses (DW) can play an important role in the DM process, because they can easily manage huge quantities of data. There are two approaches used to model mining techniques: the Common Warehouse Model (CWM) and the Predictive Model Markup Language (PMML), focused on metadata interchange and sharing, respectively. These standards do not take advantage of the underlying semantically rich multidimensional (MD) model, which could save development time and cost. In this paper, we present a conceptual model for Classification and a UML profile that allows the design of Classification on MD models. Our goal is to facilitate the design of these mining models in a DW context by employing an expressive conceptual model that can be used on top of an MD model. Finally, using the designed profile, we implement a case study in a standard database system and show the results.

Jose Zubcoff, Juan Trujillo
Backmatter
Metadata

Title: Data Warehousing and Knowledge Discovery
Editors: A Min Tjoa, Juan Trujillo
Copyright Year: 2006
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-37737-5
Print ISBN: 978-3-540-37736-8
DOI: https://doi.org/10.1007/11823728
