
2005 | Book

Data Warehousing and Knowledge Discovery

7th International Conference, DaWaK 2005, Copenhagen, Denmark, August 22-26, 2005. Proceedings

Editors: A Min Tjoa, Juan Trujillo

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

For more than a decade, data warehousing and knowledge discovery technologies have been developing into key technologies for decision-making processes in companies. Since 1999, due to the relevant role of these technologies in academia and industry, the Data Warehousing and Knowledge Discovery (DaWaK) conference series has become an international forum where both practitioners and researchers share their findings, publish their results and discuss in depth research issues and experiences on data warehousing and knowledge discovery systems and applications. The 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2005) continued the series of successful conferences dedicated to these topics. In this edition, the conference aimed to provide the right balance between data warehousing and knowledge discovery. Regarding data warehousing, the papers cover relevant and still open research problems, such as the modelling of ETL processes and integration problems, designing OLAP technologies from XML documents, modelling data warehouses and data mining applications together, improvements in query processing, partitioning and implementations. With regard to data mining, a variety of papers were presented on subjects including data mining techniques, clustering, classification, text documents and classification, and patterns. These proceedings contain the technical papers that were selected for presentation at the conference. We received 196 abstracts, followed by 162 full papers from 38 countries, of which the Program Committee selected 51, an acceptance rate of 31.4% of submitted papers.

Table of Contents

Frontmatter

Data Warehouse I

A Tree Comparison Approach to Detect Changes in Data Warehouse Structures

We present a technique for discovering and representing changes between versions of data warehouse structures. We select a tree comparison algorithm, adapt it for the particularities of multidimensional data structures and extend it with a module for the detection of node renamings. The result of these algorithms is a so-called edit script, consisting of transformation operations which, when executed in sequence, transform the earlier version into the later one, and thus show the relationships between the elements of different versions of data warehouse structures. This procedure helps data warehouse administrators to register changes. We describe a prototypical implementation of the concept, which imports multidimensional structures from Hyperion Essbase data warehouses, compares these versions and generates a list of differences.

Johann Eder, Christian Koncilia, Karl Wiggisser
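
The edit-script idea can be pictured with a toy diff over two versions of a dimension hierarchy. The sketch below is an illustration only, not the authors' adapted tree comparison algorithm; the node identifiers and operation vocabulary are assumptions:

    # Hypothetical sketch: dimension structures as {node_id: (name, parent_id)}
    # maps; the diff is a list of operations which, applied in sequence,
    # turn the earlier version into the later one.
    def edit_script(old, new):
        ops = []
        for nid in old:
            if nid not in new:
                ops.append(("DELETE", nid))
        for nid, (name, parent) in new.items():
            if nid not in old:
                ops.append(("INSERT", nid, name, parent))
            else:
                old_name, old_parent = old[nid]
                if old_name != name:
                    ops.append(("RENAME", nid, old_name, name))
                if old_parent != parent:
                    ops.append(("MOVE", nid, old_parent, parent))
        return ops

    v1 = {"t": ("Time", None), "y": ("Year", "t"), "m": ("Month", "y")}
    v2 = {"t": ("Time", None), "y": ("FiscalYear", "t"), "q": ("Quarter", "y")}
    print(edit_script(v1, v2))
    # [('DELETE', 'm'), ('RENAME', 'y', 'Year', 'FiscalYear'),
    #  ('INSERT', 'q', 'Quarter', 'y')]
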
Extending the UML for Designing Association Rule Mining Models for Data Warehouses

Association rules (AR) are one of the most popular data mining techniques for searching databases for frequently occurring patterns. In this paper, we present a novel approach to accomplishing the conceptual design of data warehouses together with data mining association rules, allowing us to implement the association rules defined in the conceptual modelling phase. The great advantage of our approach is that the association rules are specified from the early stages of a data warehouse project, based on the final users' main requirements and data warehouse goals, instead of being specified on the final database implementation structures such as tables, rows or columns. Finally, to show the benefit of our approach, we implement the specified association rules on a commercial data warehouse management server.

José Jacobo Zubcoff, Juan Trujillo
Event-Feeded Dimension Solution

From the point of view of a data warehouse system, the part that collects and receives information from other systems is crucial for all subsequent business intelligence applications. The incoming information can generally be classified into two types: state-snapshot data, and state-change or event data, usually called transactional data, which contains information about the change processes applied to instances of information objects. On the way towards active data warehouses it becomes ever more important to provide complete data with minimal latency. In this paper we focus on dimensional data provided by any data-master application. The information transfer is done via messages containing the change information of the dimension instances. The receiving data warehouse system is able to validate the event messages, reconstruct the complete history of the dimension and provide a readily applicable "comprehensive slowly changing dimension" (cSCD) interface for well-performing queries on the historical and current states of the dimension. A prototype implementation of the "active integration" of a data warehouse is proposed.

Tho Manh Nguyen, Jaromir Nemec, Martin Windisch
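
As a rough illustration of replaying event messages into a dimension history, here is a minimal Python sketch; the cSCD interface itself is not reproduced, and the event format shown is a hypothetical one:

    # Sketch only: replay change events, ordered by timestamp, into
    # validity intervals in the style of a slowly changing dimension.
    def build_history(events):
        # events: iterable of (timestamp, member_key, attributes)
        history = {}
        for ts, key, attrs in sorted(events, key=lambda e: (e[0], e[1])):
            versions = history.setdefault(key, [])
            if versions:
                versions[-1][1] = ts                # close the open version
            versions.append([ts, None, attrs])      # new version, open-ended
        return history

    events = [
        ("2005-06-01", 42, {"region": "West"}),
        ("2005-01-01", 42, {"region": "North"}),
    ]
    print(build_history(events)[42])
    # [['2005-01-01', '2005-06-01', {'region': 'North'}],
    #  ['2005-06-01', None, {'region': 'West'}]]
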
XML-OLAP: A Multidimensional Analysis Framework for XML Warehouses

Recently, a large number of XML documents have become available on the Internet. This trend has motivated many researchers to analyze them multidimensionally in the same way as relational data. In this paper, we propose a new framework for the multidimensional analysis of XML documents, which we call XML-OLAP. We base XML-OLAP on XML warehouses, where both fact data and dimension data are stored as XML documents. We build XML cubes from XML warehouses. We propose a new multidimensional expression language for XML cubes, which we call XML-MDX. XML-MDX statements target XML cubes and use XQuery expressions to designate the measure data. They specify text mining operators for aggregating the text constituting the measure data. We evaluate XML-OLAP by applying it to a U.S. patent XML warehouse, using XML-MDX queries that demonstrate XML-OLAP is effective for multidimensional analysis of U.S. patents.

Byung-Kwon Park, Hyoil Han, Il-Yeol Song

Data Warehouse II

Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates

Extract-Transform-Load (ETL) workflows are data-centric workflows responsible for transferring, cleaning, and loading data from their respective sources to the warehouse. In this paper, we build upon existing graph-based modeling techniques that treat ETL workflows as graphs by (a) extending the activity semantics to incorporate negation, aggregation and self-joins, (b) complementing querying semantics with insertions, deletions and updates, and (c) transforming the graph to allow zoom-in/out at multiple levels of abstraction (i.e., passing from the detailed description of the graph at the attribute level to more compact variants involving programs, relations and queries, and vice versa).

Alkis Simitsis, Panos Vassiliadis, Manolis Terrovitis, Spiros Skiadopoulos
Extending UML 2 Activity Diagrams with Business Intelligence Objects

Data Warehouse (DWH) information is accessed by business processes. Today, no conceptual models exist that make the relationship between the DWH and the business processes transparent. In this paper, we extend a business process modeling diagram, namely the UML 2 activity diagram, with a UML profile that makes this relationship explicit. The model is tested with example business processes.

Veronika Stefanov, Beate List, Birgit Korherr
Automatic Selection of Bitmap Join Indexes in Data Warehouses

The queries defined on data warehouses are complex and use several join operations that induce an expensive computational cost. This cost becomes even more prohibitive when queries access very large volumes of data. To improve response time, data warehouse administrators generally use indexing techniques such as star join indexes or bitmap join indexes. This task is nevertheless complex and tedious. Our solution lies in the field of data warehouse auto-administration. In this framework, we propose an automatic index selection strategy. We exploit a data mining technique, more precisely frequent itemset mining, in order to determine a set of candidate indexes from a given workload. Then, we propose several cost models for creating an index configuration composed of the indexes providing the best benefit. These models evaluate the cost of accessing data using bitmap join indexes, and the cost of updating and storing these indexes.

Kamel Aouiche, Jérôme Darmont, Omar Boussaïd, Fadila Bentayeb
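
To illustrate the candidate-generation step, frequent itemset mining over a query workload can be sketched naively (exhaustive enumeration rather than a tuned Apriori implementation; the workload format is an assumption of this illustration):

    # Sketch only: frequently co-occurring dimension attributes in the
    # workload become candidate bitmap join indexes, to be ranked by a
    # cost model afterwards.
    from collections import Counter
    from itertools import combinations

    def candidate_indexes(workload, min_support):
        # workload: list of sets of dimension attributes referenced per query
        counts = Counter()
        for attrs in workload:
            for size in range(1, len(attrs) + 1):
                for combo in combinations(sorted(attrs), size):
                    counts[combo] += 1
        return sorted(c for c, n in counts.items() if n >= min_support)

    workload = [{"customer.city", "time.month"},
                {"customer.city", "time.month", "product.brand"},
                {"customer.city"}]
    print(candidate_indexes(workload, min_support=2))
    # [('customer.city',), ('customer.city', 'time.month'), ('time.month',)]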

Evaluating Data Warehouses and Tools

A Survey of Open Source Tools for Business Intelligence

The industrial use of open source Business Intelligence (BI) tools is not yet common. It is therefore of interest to explore which possibilities are available for open source BI and compare the tools.

In this survey paper, we consider the capabilities of a number of open source tools for BI. In the paper, we consider three Extract-Transform-Load (ETL) tools, three On-Line Analytical Processing (OLAP) servers, two OLAP clients, and four database management systems (DBMSs). Further, we describe the licenses that the products are released under.

It is argued that the ETL tools are still not very mature for use in industry while the DBMSs are mature and applicable to real-world projects. The OLAP servers and clients are not as powerful as commercial solutions but may be useful in less demanding projects.

Christian Thomsen, Torben Bach Pedersen
DWEB: A Data Warehouse Engineering Benchmark

Data warehouse architectural choices and optimization techniques are critical to decision support query performance. To facilitate these choices, the performance of the designed data warehouse must be assessed. This is usually done with the help of benchmarks, which can either help system users compare the performance of different systems, or help system engineers test the effect of various design choices. While the TPC standard decision support benchmarks address the first point, they are not tuneable enough to address the second one and fail to model different data warehouse schemas. By contrast, our Data Warehouse Engineering Benchmark (DWEB) can generate various ad hoc synthetic data warehouses and workloads. DWEB is fully parameterized to fulfill data warehouse design needs. However, two levels of parameterization keep it relatively easy to tune. Finally, DWEB is implemented as free software in Java that can be interfaced with most existing relational database management systems. A sample usage of DWEB is also provided in this paper.

Jérôme Darmont, Fadila Bentayeb, Omar Boussaïd
A Set of Quality Indicators and Their Corresponding Metrics for Conceptual Models of Data Warehouses

The quality of data warehouses is highly relevant for organizations in the decision-making process. The sooner quality can be addressed (i.e., at the conceptual modelling stage), the better the chances of achieving a data warehouse (DW) of high quality. From our point of view, there is a lack of objective indicators (metrics) to guide the designer in accomplishing an outstanding model that allows us to guarantee the quality of these data warehouses. However, in some cases, the goals and purposes of the proposed metrics are not very clear on their own. Lately, quality indicators have been proposed to properly define the goals of a measurement process and to group quality measures in a coherent way. In this paper, we present a framework for designing metrics in which each metric is part of a quality indicator we wish to measure. In this way, our method allows us to define theoretically validated metrics that are valid and measure our goals precisely, as they are defined together with a set of well-defined quality indicators.

Gema Berenguer, Rafael Romero, Juan Trujillo, Manuel Serrano, Mario Piattini
Design and Development of a Tool for Integrating Heterogeneous Data Warehouses

In this paper we describe the design of a tool supporting the integration of independently developed data warehouses, a problem that arises in several common scenarios. The basic facility of the tool is a test of the validity of a matching between heterogeneous dimensions, according to a number of desirable properties. Two strategies are then provided to perform the actual integration. The first approach refers to a scenario of loosely coupled integration, in which we just need to identify the common information between sources and perform drill-across queries over them. The goal of the second approach is the derivation of a materialized view built by merging the sources, and refers to a scenario of tightly coupled integration in which queries are performed against the view. We illustrate the architecture and functionality of the tool and the underlying techniques that implement the two integration strategies.

Riccardo Torlone, Ivan Panella

Schema Transformations

An Evolutionary Approach to Schema Partitioning Selection in a Data Warehouse

The problem of selecting an optimal fragmentation schema of a data warehouse is more challenging than in relational and object databases. This challenge is due to the several choices for partitioning star or snowflake schemas. Data partitioning is beneficial if and only if the fact table is fragmented based on the partitioning schemas of the dimension tables. This may increase the number of fragments of the fact table dramatically and make their maintenance very costly. Therefore, the right selection of fragmentation schemas is important for better performance of OLAP queries. In this paper, we present a genetic algorithm for the schema partitioning selection problem. The proposed algorithm gives better solutions since the search space is constrained by the schema partitioning. We conduct several experimental studies using the APB-1 release II benchmark to validate the proposed algorithm.

Ladjel Bellatreche, Kamel Boukhalfa
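
A generic genetic-algorithm skeleton for this kind of selection problem might look as follows. This is a sketch under assumed encodings, with estimate_cost standing in for the paper's cost model:

    import random

    def evolve(n_dims, n_schemes, estimate_cost, pop=20, gens=50):
        # chromosome: one candidate partitioning scheme per dimension table
        # (0 = leave the dimension unpartitioned); requires n_dims >= 2
        population = [[random.randrange(n_schemes) for _ in range(n_dims)]
                      for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=estimate_cost)        # lower cost = fitter
            parents = population[:pop // 2]
            children = []
            while len(children) < pop - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_dims)     # one-point crossover
                child = a[:cut] + b[cut:]
                if random.random() < 0.1:             # occasional mutation
                    child[random.randrange(n_dims)] = random.randrange(n_schemes)
                children.append(child)
            population = parents + children
        return min(population, key=estimate_cost)

    # toy, made-up cost model: penalize unpartitioned dimensions and
    # overly aggressive fragmentation
    print(evolve(4, 3, lambda c: c.count(0) + 0.1 * sum(c)))
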
Using Schema Transformation Pathways for Incremental View Maintenance

With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Incrementally maintaining the integrated data is one of the problems being addressed in data warehousing research. This paper presents an incremental view maintenance approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations.

Hao Fan
Data Mapper: An Operator for Expressing One-to-Many Data Transformations

Transforming data is a fundamental operation in application scenarios involving data integration, legacy data migration, data cleaning, and extract-transform-load processes. Data transformations are often implemented as relational queries that aim at leveraging the optimization capabilities of most RDBMSs. However, relational query languages like SQL are not expressive enough to specify an important class of data transformations that produce several output tuples for a single input tuple. This class of data transformations is required for solving the data heterogeneities that occur when source data represents an aggregation of target data.

In this paper, we propose and formally define the data mapper operator as an extension of the relational algebra to address one-to-many data transformations. We supply an algebraic rewriting technique that enables the optimization of data transformation expressions that combine filters expressed as standard relational operators with mappers. Furthermore, we identify the two main factors that influence the expected optimization gains.

Paulo Carreira, Helena Galhardas, João Pereira, Antónia Lopes
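
The class of transformations in question can be pictured with a small example: one denormalized input row yields several normalized output tuples. The sketch below is illustrative only and is not the paper's formal operator:

    # Sketch of one-to-many behaviour: a row holding quarterly totals in
    # separate columns yields up to four output tuples in normalized form.
    def quarter_mapper(rows):
        for row in rows:
            for q in ("q1", "q2", "q3", "q4"):
                if row[q] is not None:
                    yield {"product": row["product"], "quarter": q,
                           "sales": row[q]}

    src = [{"product": "p1", "q1": 10, "q2": 25, "q3": None, "q4": 5}]
    for out in quarter_mapper(src):
        print(out)        # three output tuples from one input tuple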

Materialized Views

Parallel Consistency Maintenance of Materialized Views Using Referential Integrity Constraints in Data Warehouses

Data warehouses can be considered as materialized views which maintain the online analytical information extracted from distributed data sources. When data sources change, materialized views should be maintained correspondingly to keep data sources and materialized views consistent. If a view is defined by joining several source relations, an update in one source relation invokes a set of join subqueries, so view maintenance takes considerable processing time. In this paper, we propose a view maintenance algorithm that processes these join subqueries in parallel by using referential integrity constraints over the source relations. A relation which has several foreign keys can be joined with the referenced relations independently. The proposed algorithm processes these join operations in parallel and then merges their results. With this parallel processing, the algorithm can maintain materialized views efficiently. We show the superiority of the proposed algorithm using an analytical cost model.

Jinho Kim, Byung-Suk Lee, Yang-Sae Moon, Soo-Ho Ok, Wookey Lee
Selective View Materialization in a Spatial Data Warehouse

A spatial data warehouse (SDW) consists of a set of materialized views defined over the source relations, either conventional, spatial, or both. Often, when compared to traditional data warehouses, the cost of view materialization is higher with respect to both computation and space. This is because spatial data is typically larger in size, which leads to high maintenance cost, and spatial operations are more expensive to process. In this paper, we address the issue of optimizing the view materialization cost in an SDW. We build a cost model to measure the on-the-fly computation cost versus the space cost for spatial queries. We show that a spatial query can be represented in the form of a query graph and propose three transformation rules, edge-elimination, query-splitting and query-joining, to selectively materialize spatial views. We present a greedy algorithm for materialized view selection so that local cost optimality can be achieved.

Songmei Yu, Vijayalakshmi Atluri, Nabil Adam
PMC: Select Materialized Cells in Data Cubes

The QC-Tree is one of the most storage-efficient structures for data cubes in a MOLAP system. Although QC-Trees can achieve a high compression ratio, a QC-Tree is still a fully materialized data cube. In this paper, we present an improved structure, PMC, which allows us to partially materialize the cells of a QC-Tree. There is a sharp contrast between our partial materialization algorithm and the extensively studied materialized view selection algorithms. If a view is selected by a traditional algorithm, then all cells in this selected view are materialized. Our algorithm, however, selects and materializes data by cells. Experimental results show that PMC can further reduce the storage space occupied by the data cube, and can shorten the time needed to update the cube. Along with the reduced space and update cost, our algorithm ensures stable query performance.

Hongsong Li, Houkuan Huang, Shijin Liu

Aggregates

Progressive Ranking of Range Aggregates

Ranking-aware queries have recently been gaining much attention in applications such as search engines and data streams. They are, however, not restricted to such applications, and are also very useful in OLAP. In this paper, we introduce aggregation ranking queries in OLAP data cubes, motivated by an online advertisement tracking data warehouse application. These queries aggregate information over a specified range and then return the ranked order of the aggregated values. They differ from range aggregate queries, which are mainly concerned with applying an aggregate operator such as SUM or MIN/MAX over the selected ranges of all dimensions of the data cube. Existing techniques for range aggregate queries cannot process aggregation ranking queries efficiently. Hence, we propose new algorithms for this problem. The essence of the proposed algorithms is to use both ranking and cumulative information to progressively rank aggregation results. We empirically evaluate our techniques, and the experimental results show that query cost is improved significantly.

Hua-Gang Li, Hailing Yu, Divyakant Agrawal, Amr El Abbadi
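
The cumulative-information ingredient can be illustrated with one-dimensional prefix sums: each range SUM then costs O(1), and the aggregates are ranked afterwards. The paper's algorithms rank progressively rather than sorting at the end; this sketch only shows the baseline idea:

    from itertools import accumulate

    def aggregation_ranking(values, ranges):
        prefix = [0] + list(accumulate(values))   # cumulative information

        def range_sum(lo, hi):                    # inclusive bounds, O(1)
            return prefix[hi + 1] - prefix[lo]

        return sorted(ranges, key=lambda r: range_sum(*r), reverse=True)

    values = [3, 1, 4, 1, 5, 9, 2]
    print(aggregation_ranking(values, [(0, 2), (2, 4), (4, 6)]))
    # [(4, 6), (2, 4), (0, 2)]  ranked by SUM over each range
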
On Efficient Storing and Processing of Long Aggregate Lists

In this paper we present a solution called Materialized Aggregate List designed for the efficient storing and processing of long aggregate lists. An aggregate list contains aggregates, calculated from the data stored in the database. In our approach, once created, the aggregates are materialized for further use. The list structure contains a table divided into pages. We present three different page-filling algorithms used when the list is browsed. We present test results and we use them for estimating the best combination of the configuration parameters: number of pages, size of a single page and number of available database connections. The Materialized Aggregate List can be applied on every aggregation level in various indexing structures, such as, an aR-tree.

Marcin Gorawski, Rafal Malczok

Data Warehouse Queries and Database Processing Issues

Ad Hoc Star Join Query Processing in Cluster Architectures

Processing of large amounts of data in data warehouses is increasingly being done in cluster architectures to achieve scalability. In this paper we look into the problem of ad hoc star join query processing in cluster architectures. We propose a new technique, the Star Hash Join (SHJ), which exploits a combination of multiple bit filter strategies in such architectures. SHJ is a generalization of Pushed Down Bit Filters for clusters. The objectives of the technique are to reduce (i) the amount of data communicated, (ii) the amount of data spilled to disk during the execution of intermediate joins in the query plan, and (iii) the amount of memory used by auxiliary data structures such as bit filters.

Josep Aguilar-Saborit, Victor Muntés-Mulero, Calisto Zuzarte, Josep-L. Larriba-Pey
A Precise Blocking Method for Record Linkage

Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Thus, duplicate detection is usually performed in two phases: an efficient blocking phase that determines a few potential candidate duplicates based on simple criteria, followed by a second phase performing an in-depth comparison of the candidate duplicates. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special-purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window whose size depends dynamically on a maximum distance, rather than a window of fixed size.

Patrick Lehti, Peter Fankhauser
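
A minimal sketch of a dynamically sized comparison window over sorted records, assuming (as sorted-neighborhood methods do) that the key distance grows along the sort order; the key and distance functions here are placeholders, not the paper's:

    def blocking_pairs(records, key, distance, max_dist):
        recs = sorted(records, key=key)
        for i, r in enumerate(recs):
            for s in recs[i + 1:]:
                if distance(key(r), key(s)) > max_dist:
                    break                  # the window ends here, dynamically
                yield (r, s)

    names = ["jon", "john", "johnny", "zoe"]

    def prefix_dist(a, b):                 # toy distance on 3-char prefixes
        return sum(c1 != c2 for c1, c2 in zip(a[:3].ljust(3), b[:3].ljust(3)))

    print(list(blocking_pairs(names, key=str.lower, distance=prefix_dist,
                              max_dist=1)))
    # [('john', 'johnny'), ('john', 'jon'), ('johnny', 'jon')]
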
Flexible Query Answering in Data Cubes

This paper presents a new approach toward approximate query answering in data warehouses. The approach is based on an adaptation of rough set theory to multidimensional data, and offers cube exploration and mining facilities.

Since data in a data warehouse come from multiple heterogeneous sources with various degrees of reliability and various data formats, users tend to be more tolerant in a data warehouse environment and prone to accept some information loss and some discrepancy between the actual data and the data they manipulate.

The objective of this work is to integrate approximation mechanisms and associated operators into data cubes in order to produce views that can then be explored using OLAP or data mining techniques. The integration of data approximation capabilities with OLAP techniques offers additional facilities for cube exploration and analysis.

The proposed approach allows the user to work either in a restricted mode using a cube lower approximation or in a relaxed mode using a cube upper approximation. The former mode is useful when the query output is large, and hence allows the user to focus on a reduced set of fully matching tuples. The latter is useful when a query returns an empty or small answer set, and hence helps relax the query conditions so that a superset of the answer is returned. In addition, the proposed approach generates classification and characteristic rules for prediction, classification and association purposes.

Sami Naouali, Rokia Missaoui
An Extendible Array Based Implementation of Relational Tables for Multi Dimensional Databases

A new implementation scheme for relational tables in multidimensional databases is proposed and evaluated. The scheme implements a relational table by employing a multidimensional array. Using multidimensional arrays provides many advantages, but also suffers from some problems. In our scheme, these problems are solved by an efficient record encoding based on the notion of an extendible array. Our scheme exhibits good performance in space and time costs compared with conventional implementations.

K. M. Azharul Hasan, Masayuki Kuroda, Naoki Azuma, Tatsuo Tsuji, Ken Higuchi
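
The underlying idea of locating records through array linearization can be sketched in a few lines; the paper's extendible-array encoding, which also handles dynamically growing dimensions, is not reproduced here:

    def offset(coords, shape):
        # row-major linearization of n-dimensional coordinates: a record's
        # dimension values map to coordinates, and coordinates to one offset
        pos = 0
        for c, size in zip(coords, shape):
            pos = pos * size + c
        return pos

    shape = (12, 50, 200)              # e.g. |month| x |store| x |product|
    print(offset((3, 7, 42), shape))   # 3*50*200 + 7*200 + 42 = 31442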

Data Mining Algorithms and Techniques

Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data

In this paper, we present a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to use index-striping that vertically splits the data space into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized. Our experiments confirm the validity of our cost model and evaluate the performance of our approach.

Evangelos Dellis, Bernhard Seeger, Akrivi Vlachou
A Machine Learning Approach to Identifying Database Sessions Using Unlabeled Data

In this paper, we describe a novel co-training based algorithm for identifying database user sessions from database traces. The algorithm learns to identify positive data (session boundaries) and negative data (non-session boundaries) incrementally by using two methods interactively over several iterations. In each iteration, previously identified positive and negative data are used to build better models, which in turn can label some new data and improve the performance of further iterations. We also present experimental results.

Qingsong Yao, Xiangji Huang, Aijun An
Hybrid System of Case-Based Reasoning and Neural Network for Symbolic Features

Case-based reasoning is one of the most frequently used tools in data mining. Though it has proved useful in many problems, it is known to have shortcomings such as the feature weighting problem. In previous research, we proposed a hybrid system of case-based reasoning and neural network. In that system, the feature weights are extracted from the trained neural network and used to improve the retrieval accuracy of case-based reasoning. However, that system worked best in domains in which all features had numeric values. When the feature values are symbolic, nearest neighbor methods typically resort to much simpler metrics, such as counting the features that match. A more sophisticated treatment of the feature space is required in symbolic domains. We propose another hybrid system of case-based reasoning and neural network, which uses the value difference metric (VDM) for symbolic features. The proposed system is validated on datasets in symbolic domains.

Kwang Hyuk Im, Tae Hyun Kim, Sang Chan Park
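
The VDM itself has a standard form: two symbolic values are close when they induce similar class distributions. A compact sketch follows; the exponent q and the handling of unseen values are choices of this illustration, not the paper's:

    from collections import Counter

    def vdm(feature_values, labels, v1, v2, q=2):
        # distance between symbolic values = difference of the class
        # distributions they induce; assumes both values occur in the data
        n1 = Counter(l for f, l in zip(feature_values, labels) if f == v1)
        n2 = Counter(l for f, l in zip(feature_values, labels) if f == v2)
        t1, t2 = sum(n1.values()), sum(n2.values())
        return sum(abs(n1[c] / t1 - n2[c] / t2) ** q for c in set(labels))

    colors = ["red", "red", "blue", "blue", "green"]
    labels = ["spam", "ok", "ok", "ok", "ok"]
    print(vdm(colors, labels, "red", "blue"))    # 0.5: differing profiles
    print(vdm(colors, labels, "blue", "green"))  # 0.0: identical profiles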

Data Mining

Spatio–temporal Rule Mining: Issues and Techniques

Recent advances in communication and information technology, such as the increasing accuracy of GPS technology and the miniaturization of wireless communication devices, pave the way for Location-Based Services (LBS). To achieve high quality for such services, spatio-temporal data mining techniques are needed. In this paper, we describe experiences with spatio-temporal rule mining in a Danish data mining company. First, a number of real-world spatio-temporal data sets are described, leading to a taxonomy of spatio-temporal data. Second, the paper describes a general methodology that transforms the spatio-temporal rule mining task into the traditional market basket analysis task and applies it to the described data sets, enabling traditional association rule mining methods to discover spatio-temporal rules for LBS. Finally, unique issues in spatio-temporal rule mining are identified and discussed.

Győző Gidófalvi, Torben Bach Pedersen
Hybrid Approach to Web Content Outlier Mining Without Query Vector

Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through the dynamic, unstructured, and ever-growing web data for outliers. This paper presents HyCOQ, a hybrid algorithm that draws on the power of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary, irrespective of the type of data used (words, n-grams, or hybrid). Also, there is a remarkable improvement in recall with hybrid documents compared to using raw words and n-grams without a domain dictionary.

Malik Agyemang, Ken Barker, Reda Alhajj
Incremental Data Mining Using Concurrent Online Refresh of Materialized Data Mining Views

Data mining is an iterative process. Users issue series of similar data mining queries, in each consecutive run slightly modifying either the definition of the mined dataset, or the parameters of the mining algorithm. This model of processing is most suitable for incremental mining algorithms that reuse the results of previous queries when answering a given query. Incremental mining algorithms require the results of previous queries to be available. One way to preserve those results is to use materialized data mining views. Materialized data mining views store the mined patterns and refresh them as the underlying data change.

Data mining and knowledge discovery often take place in a data warehouse environment. There can be many relatively small materialized data mining views defined over the data warehouse. Separate refresh of each materialized view can be expensive if the refresh process has to re-discover patterns in the original database. In this paper we present a novel approach to the materialized data mining view refresh process. We show that the concurrent on-line refresh of a set of materialized data mining views is more efficient than the sequential refresh of individual views. We present a framework for the integration of the data warehouse refresh process with the maintenance of materialized data mining views. Finally, we prove the feasibility of our approach by conducting several experiments on synthetic data sets.

Mikołaj Morzy, Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz
A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases

Data mining and machine learning must confront the problem of pattern maintenance, because data updating is a fundamental operation in data management. Most existing data mining algorithms assume that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new data. While there are many efficient mining techniques for handling additions to databases, in this paper we propose a decremental algorithm for pattern discovery when data is deleted from databases. We conduct extensive experiments to evaluate this approach, and illustrate that the proposed algorithm can model and capture useful interactions within the data well as the data shrinks.

Shichao Zhang, Xindong Wu, Jilian Zhang, Chengqi Zhang

Association Rules

Discovering Richer Temporal Association Rules from Interval-Based Data

Temporal association rule mining promises the ability to discover time-dependent correlations or patterns between events in large volumes of data. To date, most temporal data mining research has focused on events existing at a point in time rather than over a temporal interval. In comparison to static rules, mining with respect to time points provides semantically richer rules. However, accommodating temporal intervals offers rules that are richer still. In this paper we outline a new algorithm to discover frequent temporal patterns and to generate richer interval-based temporal association rules.

Edi Winarko, John F. Roddick
Semantic Query Expansion Combining Association Rules with Ontologies and Information Retrieval Techniques

Query expansion techniques are used to find the desired set of query terms to improve retrieval performance. One of the limitations of query expansion techniques is that a query is often expanded only by the linguistic features of terms. This paper presents a novel semantic query expansion technique that combines association rules with ontologies and information retrieval techniques. We propose to use association rule discovery to find good candidate terms to improve retrieval performance. These candidate terms are automatically derived from collections and added to the original query. Our method is differentiated from others in that 1) it utilizes the semantic as well as linguistic properties of an unstructured text corpus and 2) it makes use of the contextual properties of important terms discovered by association rules. Experiments conducted on a subset of TREC collections give quite encouraging results. We achieve from 15.49% to 20.98% improvement in terms of P@20 with TREC5 ad hoc queries.

Min Song, Il-Yeol Song, Xiaohua Hu, Robert Allen
Maintenance of Generalized Association Rules Under Transaction Update and Taxonomy Evolution

Mining generalized association rules among items in the presence of taxonomies has been recognized as an important model in data mining. Earlier work on mining generalized association rules ignores the fact that the taxonomies of items cannot be kept static while new transactions are continuously added to the original database. How to effectively update the discovered generalized association rules to reflect database change along with taxonomy evolution and transaction updates is a crucial task. In this paper, we examine this problem and propose a novel algorithm, called IDTE, which can incrementally update the discovered generalized association rules when the taxonomy of items evolves as new transactions are inserted into the database. Empirical evaluations show that our algorithm maintains its performance even with large numbers of incremental transactions and high degrees of taxonomy evolution, and is more than an order of magnitude faster than applying the best generalized association mining algorithms to the whole updated database.

Ming-Cheng Tseng, Wen-Yang Lin, Rong Jeng
Prince: An Algorithm for Generating Rule Bases Without Closure Computations

The problem of the relevance and the usefulness of extracted association rules is becoming of primary importance, since an overwhelming number of association rules may be derived, even from reasonably sized databases. To overcome such a drawback, the extraction of reduced-size generic bases of association rules seems promising. Using the concept of minimal generator, we propose an algorithm, called Prince, allowing a shrewd extraction of generic bases of rules. To this end, Prince builds the partial order. Its originality is that this partial order is maintained between minimal generators and no longer between closed itemsets. A structure called the minimal generator lattice is then built, from which the derivation of the generic association rules becomes straightforward. An intensive experimental evaluation, carried out on benchmark sparse and dense datasets, showed that Prince largely outperforms the pioneering level-wise algorithms, i.e., Close, A-Close and Titanic.

T. Hamrouni, S. Ben Yahia, Y. Slimani

Text Processing and Classification

Efficient Compression of Text Attributes of Data Warehouse Dimensions

This paper proposes the compression of data in Relational Database Management Systems (RDBMS) using existing text compression algorithms. Although the proposed technique is general, we believe it is particularly advantageous for the compression of medium-size and large dimension tables in data warehouses. In fact, dimensions usually have a high number of text attributes, and a reduction in their size has a big impact on the execution time of queries that join dimensions with fact tables. In general, the high complexity and long execution time of most data warehouse queries make the compression of dimension text attributes (and of possible text attributes in the fact table, such as false facts) an effective approach to speeding up query response time. The proposed approach has been evaluated using the well-known TPC-H benchmark, and the results show that speed improvements greater than 40% can be achieved for most of the queries.

Jorge Vieira, Jorge Bernardino, Henrique Madeira
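
As a back-of-the-envelope illustration outside any DBMS, a repetitive text attribute compresses well with a stock algorithm (here zlib; the sample strings are made up):

    import zlib

    # 10,000 made-up, highly repetitive attribute values
    descriptions = [f"Customer segment AUTOMOBILE, nation {i % 25}, priority HIGH"
                    for i in range(10_000)]
    raw = "\n".join(descriptions).encode()
    packed = zlib.compress(raw, 6)
    print(f"{len(raw)} -> {len(packed)} bytes "
          f"({100 * len(packed) / len(raw):.1f}% of the original size)")
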
Effectiveness of Document Representation for Classification

Conventionally, document classification research focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features used in the representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of high dimensionality, but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By looking at documents in this way, we can clearly identify the contributions made by different features toward document classification. Some experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.

Ding-Yi Chen, Xue Li, Zhao Yang Dong, Xia Chen
2-PS Based Associative Text Classification

Recent studies reveal that associative classification can achieve higher accuracy than traditional approaches. The main drawback of this approach is that it generates a huge number of rules, which makes it difficult to select a subset of rules for accurate classification. In this study, we propose a novel association-based approach especially suitable for text classification. The approach first builds a classifier through a two-phase (2-PS) method. The first phase prunes rules locally, i.e., rules mined within every category are pruned by a sentence-level constraint, which makes the rules more semantically correlated and less redundant. In the second phase, all the remaining rules are compared and selected with a global view, i.e., training examples from different categories are merged together to evaluate these rules. Moreover, when labeling a new document, the multiple sentence-level appearances of a rule are taken into account. Experimental results on well-known text corpora show that our method can achieve higher accuracy than many well-known methods. In addition, the performance study shows that our method is quite efficient in comparison with other classification methods.

Tieyun Qian, Yuanzhen Wang, Hao Long, Jianlin Feng

Miscellaneous Applications

Intrusion Detection via Analysis and Modelling of User Commands

Since computers have become a mainstay of everyday life, techniques and methods for detecting intrusions as well as protecting systems and data from unwanted parties have received significant attention recently. We focus on detecting improper use of computer systems through the analysis of user command data. Our approach looks at the structure of the commands used and generates a model which can be used to test new commands. This is accompanied by an analysis of the performance of the proposed approach. Although we focus on commands, the techniques presented in this paper can be extended to allow analysis of other data, such as system calls.

Matthew Gebski, Raymond K. Wong
Dynamic Schema Navigation Using Formal Concept Analysis

This paper introduces a framework for relational schema navigation via a Web-based browser application that uses Formal Concept Analysis as the metaphor for analysis and interaction. Formal Concept Analysis is a rich framework for data analysis based on applied lattice and order theory. The application we develop, D-SIFT, is intended to provide users untrained in Formal Concept Analysis with practical and intuitive access to the core functionality of Formal Concept Analysis for the purpose of exploring relational database schemas. D-SIFT is a Web-based information system architecture that supports natural search processes over a preexisting database schema and its content.

Jon Ducrou, Bastian Wormuth, Peter Eklund

Security and Privacy Issues

FMC: An Approach for Privacy Preserving OLAP

Preserving private information while providing thorough analysis is one of the significant issues in OLAP systems. One of the challenges is to prevent the inference of sensitive values from more aggregated, non-sensitive data. This paper presents a novel algorithm, FMC, that eliminates the inference problem by hiding additional data besides the sensitive information itself, and proves that this additional information is both necessary and sufficient. Thus, the approach provides as much information as possible for users while preserving security. The strategy does not impact the online performance of the OLAP system. Systematic analysis and experimental comparison are provided to show the effectiveness and feasibility of FMC.

Ming Hua, Shouzhi Zhang, Wei Wang, Haofeng Zhou, Baile Shi
Information Driven Evaluation of Data Hiding Algorithms

Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. Due to the large number of possible techniques that can be used to achieve this goal, it is necessary to provide some standard evaluation metrics to determine the best algorithms for a specific application or context. Currently, however, there is no common set of parameters that can be used for this purpose. This paper explores the problem of PPDM algorithm evaluation, starting from the key goal of preserving data quality. To achieve this goal, we propose a formal definition of data quality specifically tailored for use in the context of PPDM algorithms, a set of evaluation parameters and an evaluation algorithm. The resulting evaluation core process is then presented as part of a more general three-step evaluation framework, which also takes into account other aspects of algorithm evaluation such as efficiency, scalability and level of privacy.

Elisa Bertino, Igor Nai Fovino

Patterns

Essential Patterns: A Perfect Cover of Frequent Patterns

The extraction of frequent patterns often yields extremely voluminous results which are difficult to handle. Computing a concise representation, or cover, of the frequent pattern set is thus an interesting alternative investigated by various approaches. The work presented in this article fits in this trend. We introduce the concept of essential pattern and propose a new cover based on this concept. Such a cover makes it possible to decide whether a pattern is frequent or not, to compute its frequency and, in contrast with related work, to infer its disjunction and negation frequencies. A levelwise algorithm with a pruning step that uses the maximal frequent patterns to compute the essential patterns is proposed. Experiments show that when the number of frequent patterns is very high (strongly correlated data), the defined cover is significantly smaller than the cover considered until now to be minimal: the frequent closed patterns.

Alain Casali, Rosine Cicchetti, Lotfi Lakhal
Processing Sequential Patterns in Relational Databases

Database integration of data mining has gained popularity and its significance is well recognized. However, the performance of SQL-based data mining is known to fall behind that of specialized implementations, owing to the prohibitive cost associated with extracting knowledge as well as the lack of suitable declarative query language support. Recent studies have found that, for association rule mining and sequential pattern mining, carefully tuned SQL formulations can achieve performance comparable to systems that cache the data in files outside the DBMS. However, most previous pattern mining methods follow the method of Apriori, which still encounters problems when a sequential database is large and/or when the sequential patterns to be mined are numerous and long.

In this paper, we present a novel SQL-based approach that we recently proposed, called Prospad (PROjection Sequential PAttern Discovery). Prospad fundamentally differs from an Apriori-like candidate set generation-and-test approach: it is a pattern-growth approach without candidate generation. It grows longer patterns from shorter ones by successively projecting the sequential table into subsequential tables. Since a projected table for a sequential pattern i contains all and only the information necessary for mining the sequential patterns that can grow from i, the size of the projected table usually shrinks quickly as mining proceeds to longer patterns. Moreover, to avoid the cost of creating and dropping temporary tables, a depth-first approach is used to facilitate the projection process.

Xuequn Shang, Kai-Uwe Sattler
Optimizing a Sequence of Frequent Pattern Queries

Discovery of frequent patterns is a very important data mining problem with numerous applications. Frequent pattern mining is often regarded as advanced querying where a user specifies the source dataset and pattern constraints using a given constraint model. A significant amount of research on efficient processing of frequent pattern queries has been done in recent years, focusing mainly on constraint handling and reusing results of previous queries. In this paper we tackle the problem of optimizing a sequence of frequent pattern queries, submitted to the system as a batch. Our solutions are based on previously proposed techniques of reusing results of previous queries, and exploit the fact that knowing a sequence of queries a priori gives the system a chance to schedule and/or adjust the queries so that they can use results of queries executed earlier. We begin with simple query scheduling and then consider other transformations of the original batch of queries.

Mikołaj Morzy, Marek Wojciechowski, Maciej Zakrzewicz
A General Effective Framework for Monotony and Tough Constraint Based Sequential Pattern Mining

Sequential pattern mining has become an important data mining problem. For many practical applications, users may only be interested in those sequential patterns satisfying constraints expressing their interest. The proposed constraints can in general be categorized into four classes, among which monotony and tough constraints are the most difficult to process. Many of the available algorithms, however, are designed for particular classes of constraints, and are thus difficult to adapt to other classes. In this paper we propose a new general framework, called CBPSAlgm, based on the projection-based pattern growth principle. Under this framework, pruning strategies for ineffective items are designed and integrated to construct effective algorithms for monotony and tough constraint based sequential pattern mining. Experimental results show that our proposed methods outperform other algorithms.

Enhong Chen, Tongshu Li, Phillip C-y Sheu

Cluster and Classification I

Hiding Classification Rules for Data Sharing with Privacy Preservation

In this paper, we propose a method of hiding sensitive classification rules from data mining algorithms for categorical datasets. Our approach is to reconstruct a dataset according to the classification rules that have been checked and approved by the data owner for release to data sharing. Unlike other heuristic modification approaches, our method first classifies a given dataset. Subsequently, a set of classification rules is shown to the data owner to identify the sensitive rules that should be hidden. After that, we build a new decision tree constituted only of non-sensitive rules. Finally, a new dataset is reconstructed. Our experiments show that the sensitive rules can be hidden completely in the reconstructed datasets, while non-sensitive rules can still be discovered without any side effects. Moreover, our method preserves high usability of the reconstructed datasets.

Juggapong Natwichai, Xue Li, Maria Orlowska
Clustering-Based Histograms for Multi-dimensional Data

A new technique for constructing multi-dimensional histograms is proposed. This technique first invokes a density-based clustering algorithm to locate dense and sparse regions of the input data. Then the data distribution inside each of these regions is summarized by partitioning it into non-overlapping blocks laid onto a grid. The granularity of this grid is chosen depending on the underlying data distribution: the more homogeneous the data, the coarser the grid. Our approach is compared with state-of-the-art histograms on both synthetic and real-life data and is shown to be more effective.

Filippo Furfaro, Giuseppe M. Mazzeo, Cristina Sirangelo
Weighted K-Means for Density-Biased Clustering

Clustering is the task of grouping data based on similarity. The popular k-means algorithm groups data by first assigning all data points to the closest clusters, then determining the cluster means. The algorithm repeats these two steps until it has converged. We propose a variation called weighted k-means to improve clustering scalability. To speed up the clustering process, we develop reservoir-biased sampling as an efficient data reduction technique, since it performs a single scan over the data set. Our algorithm has been designed to group data from mixture models. We present an experimental evaluation of the proposed method.

Kittisak Kerdprasop, Nittaya Kerdprasop, Pairote Sattayatham
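
Classic single-scan reservoir sampling (Algorithm R) is the building block; the density-biasing the authors add on top is not reproduced in this sketch:

    import random

    def reservoir_sample(stream, k):
        # after item i, each item seen so far is in the sample with
        # probability k / (i + 1); one scan, O(k) memory
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = random.randrange(i + 1)
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), k=5))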

Cluster and Classification II

A New Approach for Cluster Detection for Large Datasets with High Dimensionality

The study of the use of computers through human computer interfaces (HCI) is essential to improving productivity in any computer application environment. HCI analysts use a number of techniques to build models that are faithful to actual computer use. A key technique is eye tracking, in which the region of the screen being examined is recorded in order to determine key areas of use. Clustering techniques allow these regions to be grouped to help facilitate usability analysis. Historically, approaches such as Expectation Maximization (EM) and the k-means algorithm have performed well. Unfortunately, these approaches require the number of clusters k to be known beforehand; in many real-world situations, this hampers the effectiveness of the analysis. We propose a novel algorithm that is well suited to cluster discovery for HCI data: we do not require the number of clusters to be specified a priori, and our approach scales very well for both large datasets and high dimensionality. Experiments have demonstrated that our approach works well for real data from HCI applications.

Matthew Gebski, Raymond K. Wong
Gene Expression Biclustering Using Random Walk Strategies

A biclustering algorithm, based on a greedy technique and enriched with a local search strategy to escape poor local minima, is proposed. The algorithm starts with an initial random solution and searches for a locally optimal solution by successive transformations that improve a gain function, combining the mean squared residue, the row variance, and the size of the bicluster. Different strategies to escape local minima are introduced and compared. Experimental results on yeast and lymphoma microarray data sets show that the method is able to find significant biclusters.

Fabrizio Angiulli, Clara Pizzuti
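
The mean squared residue component of the gain function follows its usual definition; a NumPy sketch (the row-variance and size terms of the gain function are omitted here):

    import numpy as np

    def mean_squared_residue(A, rows, cols):
        sub = A[np.ix_(rows, cols)]
        # residue of each entry w.r.t. its row mean, column mean and the
        # overall mean of the bicluster
        resid = (sub - sub.mean(axis=1, keepdims=True)
                 - sub.mean(axis=0, keepdims=True) + sub.mean())
        return float((resid ** 2).mean())

    A = np.random.rand(20, 15)
    print(mean_squared_residue(A, rows=[0, 2, 5], cols=[1, 3, 4, 7]))
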
Spectral Kernels for Classification

Spectral methods, as an unsupervised technique, have been used with success in data mining, e.g., LSI in information retrieval, HITS and PageRank in Web search engines, and spectral clustering in machine learning. The essence of their success in these applications is the spectral information that captures the semantics inherent in the large amounts of data processed during unsupervised learning. In this paper, we ask whether spectral methods can also be used in supervised learning, e.g., classification. In an attempt to answer this question, our research reveals a novel kernel in which spectral clustering information can be easily exploited and extended to new incoming data during classification tasks. Our experimental results show that the proposed spectral kernel speeds up classification tasks without compromising accuracy.

Wenyuan Li, Kok-Leong Ong, Wee-Keong Ng, Aixin Sun
Data Warehousing and Knowledge Discovery: A Chronological View of Research Challenges

Data warehousing and knowledge discovery have been widely accepted as key technologies enabling enterprises to improve their abilities in data analysis, decision support, and the automatic extraction of knowledge from data. Historically, the phrase knowledge discovery in databases was coined at the first KDD (Knowledge Discovery and Data Mining) workshop in 1989 to emphasize that knowledge is the end product of a data-driven discovery process. Since then, much research has been accomplished in this field. This paper, written as an epilogue to the DaWaK 2005 proceedings by the programme committee chairpersons together with Tho Manh Nguyen, reflects the past development of DaWaK results and other significant research outcomes in the area and, above all, delivers a rough sketch of current developments and possible future work.

Tho Manh Nguyen, A Min Tjoa, Juan Trujillo
Backmatter
Metadata
Title
Data Warehousing and Knowledge Discovery
Editors
A Min Tjoa
Juan Trujillo
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-31732-6
Print ISBN
978-3-540-28558-8
DOI
https://doi.org/10.1007/11546849
