Data Warehouse Design

The Design and Development of a Logical System for OLAP

We report on the design and development of a novel architecture for data warehousing. This architecture adds a “logical” level of abstraction to the traditional data warehousing framework, to guarantee an independence of OLAP applications from the physical storage structure of the data warehouse. This property allows users and applications to manipulate multidimensional data ignoring implementation details. Also, it supports the integration of multidimensional data stored in heterogeneous OLAP servers. We propose MD, a simple data model for multidimensional databases, as the reference for the logical layer. We then describe the design of a system, called MDS, that supports the above logical architecture.

Luca Cabibbo, Riccardo Torlone

Applying Vertical Fragmentation Techniques in Logical Design of Multidimensional Databases

In the context of multidimensional databases implemented on relational DBMSs through star schemes, the most effective technique to enhance performances consists of materializing redundant aggregates called views. In this paper we investigate the problem of vertical fragmentation of views aimed at minimizing the workload response time. Each view includes several measures which not necessarily are always requested together; thus, the system performance may be increased by partitioning the views into smaller tables. On the other hand, drill-across queries involve measures taken from two or more views; in this case the access costs may be decreased by unifying these views into larger tables. After formalizing the fragmentation problem as a 0–1 integer linear programming problem, we define a cost function and outline a branch-and-bound algorithm to minimize it. Finally, we demonstrate the usefulness of our approach by presenting a set of experimental results based on the TPC-D benchmark.

Matteo Golfarelli, Dario Maio, Stefano Rizzi

Space-Efficient Data Cubes for Dynamic Environments

Data cubes provide aggregate information to support the analysis of the contents of data warehouses and databases. An important tool to analyze data in data cubes is the range query. For range queries that summarize large regions of massive data cubes, computing the query result on-the-fly can result in non-interactive response times. To speed up range queries, values that summarize regions of the data cube are precomputed and stored. This faster response time results in more expensive updates and/or space overhead. While the emphasis is typically on low query and update costs, growing data collections increase the demand for space-efficient approaches. In this paper two techniques are presented that have the same update and query costs as earlier approaches, without introducing any space overhead.

Mirek Riedewald, Divyakant Agrawal, Amr El Abbadi, Renato Pajarola

On Making Data Warehouses Active

Data warehouses, which are the core elements of On-Line Analytical Processing (OLAP) systems, are passive since all tasks related to analyzing and making decisions must be carried out manually. This paper introduces a novel architecture, the active data warehouse, which applies the idea of event-condition-action rules (ECA rules) from active database systems to automize repetitive analysis and decision tasks in data warehouses. The work of an analyst is mimicked by analysis rules, which extend the capabilities of conventional ECA rules in order to support multidimensional analysis and decision making.

Michael Schrefl, Thomas Thalhammer

Materialized Views

Supporting Hot Spots with Materialized Views

Since data warehousing has become a major field of research there has been a lot of interest in the selection of materialized views for query optimization. The problem is to find the set of materialized views which yields the highest cost savings for a given set of queries under a certain space constraint. The analytical perspective results in queries which on the one hand require aggregations but on the other hand are quite restrictive with regard to the fact data. Usually there are “hot spots”, i.e. regions which are requested very frequently, like the current period or the most important product group. However, most algorithms in literature do not consider restrictions of queries and therefore generate only views containing all summary data at a certain aggregation level although the space it occupies could better be used for other, more beneficial views. This article presents an algorithm for the selection of restricted views. The cost savings using this algorithm have been experimentally evaluated to be up to 80% by supplying only 5% additional space.

J. Albrecht, A. Bauer, M. Redert

Evaluation of Materialized View Indexing in Data Warehousing Environments

Index selection is one of the most important decisions in designing a data warehouse (DW). In this paper, we present a framework for a class of graph join indices used for indexing queries defined on materialized views. We develop storage cost needed for these indices, and query processing strategies using them. We formulate the graph join index selection problem, and present algorithms which can provide good query performance under limited storage space for the indices. We also evaluate these algorithms to show their utilities by using an example taken from Informix white paper.

Ladjel Bellatreche, Kamalakar Karlapalem, Qing Li

View Derivation Graph with Edge Fitting for Adaptive Data Warehousing

In this paper we propose adaptive data warehouse maintenance, based on the optimistic partial replication of base local computation at the view as well as communication with the outside sources, and lowers the execution load on the base sources, which leads to a more up-to-date state of the data warehouse view.

Ioana Stanoi, Divyakant Agrawal, Amr El Abbadi

On the Importance of Tuning in Incremental View Maintenance: An Experience Case Study

Extended Abstract

We describe the tuning of a gigabyte-scale TPC-D database for investigation of incremental maintenance of a materialized view. We find that incremental maintenance is feasible over a wide range of update sizes (granularities), that intuitive SQL formulations perform poorly, but that there are better alternatives. We show that these results can be meaningfully evaluated and presented only in the context of reasonable instance tuning.

K. O’Gorman, D. Agrawal, A. El Abbadi

Warehouse Data Creation and Maintenance

BEDAWA - A Tool for Generating Sample Data for Data Warehouses

The generation of representative sample data for data warehouses for henchmarking, testing and presentation purposes is a challenging complex task. Difficulties often arise in producing familiar, complete and consistent sample data of any scale. Producing sample data manually often causes problems, which can be avoided by an automatic generation tool producing consistent and statistical plausible data. The data generated by the tool is based on the linear statistical model of a hierarchical n-way classification with fixed effects. We present an approach for a flexible generation of “real world” sample data using the BEDAWA (BEnchmarks forDAta WArehouses) tool.

Thanh N. Huynh, Binh T. Nguyen, J. Schiefer, A. M. Tjoa

DyDa: Dynamic Data Warehouse Maintenance in a Fully Concurrent Environment

Data warehouses (DW) are an emerging technology to support high-level decision making by gathering information from several distributed information sources (ISs) into one materialized repository. In dynamic environments such as the web, DWs must be maintained in order to stay up-to-date. Recent maintenance algorithms tackle this problem of DW management under concurrent data updates (DU), whereas the EVE system is the first to handle (non-concurrent schema changes) (SC) of ISs. However, the concurrency of schema changes by different ISs as well as the concurrency of interleaved SC and DU still remain unexplored problems. In this paper, we propose a solution framework called DyDa that successfully addresses both problems. The DyDa framework detects concurrent SCs by the broken query scheme and conflicting concurrent DUs by a local timestamp scheme. The two-layered architecture of the DyDa framework separates the concerns for concurrent DU and concurrent SC handling without imposing any restrictions on the autonomy nor on the concurrent execution of the ISs. This DyDa solution is currently being implemented within the EVE data warehousing system.

Xin Zhang, Elke A. Rundensteiner

Scalable Maintenance of Multiple Interrelated Data Warehousing Systems

The maintenance of data warehouses(DWs) is becoming an increasingly important topic due to the growing use, derivation and integration of digital information. Most previous work has dealt with one centralized data warehouse only. In this paper, we now focus on environments with multiple DWs that are possibly derived from other DWs. In such a large-scale environment, data updates from base sources may arrive in individual data warehouses in different orders, thus resulting in inconsistent data warehouse extents. We propose to address this problem by employing a registry agent responsible for establishing one unique order for the propagation of updates from the base sources to the DWs. With this solution, individual DW managers can still maintain their respective extents autonomously and independently from each other, thus allowing them to apply any existing incremental maintenance algorithm from the literature. We demonstrate that this registry-based coordination approach (RyCo) indeed achieves consistency across all DWs.

Lingli Ding, Xin Zhang, Elke A. Rundensteiner

View Maintenance for Hierarchical Semistructured Data

The problem of incremental view maintenance in materialized data warehouses has been studied extensively for relational selectproject-join (SPJ) views. Many new data sources, however, are highly irregular and views often perform complex restructuring operations. This paper describes WHAX (Warehouse Architecture for XML), an architecture for defining and maintaining views over hierarchical semistructured data sources with key constraints. The WHAX model is a variant of the deterministic model [5], but is more reminiscent of XML. The view definition language is a variation of XML-QL and supports selections, joins, and important restructuring operations such as regrouping and aggregation. The incremental maintenance is based on the notion of multi-linearity and generalizes well-known techniques from SPJ-views.

Hartmut Liefke, Susan B. Davidson

Maintaining Horizontally Partitioned Warehouse Views

Data warehouses usually store large amounts of information, representing an integration of base data from different data sources over a long time period. Aggregate views can be stored as a set of its horizontal fragments for the purposes of reducing warehouse query response time and maintenance cost.This paper proposes a scheme that efficiently maintains horizontally partitioned data warehouse views. Using the proposed scheme, only one view fragment holding the relevant subset of tuples of the view is accessed for each update. The scheme also includes an approach to reduce the refresh time for maintaining views that compute aggregate functions MIN and MAX.

Mei Xu, C. I. Ezeife

Invited Talk

Funding Research in Data Warehousing and Knowledge Discovery EPROS: The European Plan for Research in Official Statistics

This paper discusses: (1) the challenges that the European Statistical System (ESS) faces as the result of the recent appearance of phenomena such as the information society and the new economy, and (2) the extent to which new technological developments in data warehousing, knowledge discovery and extensive use of the internet can contribute to successfully meeting these challenges. Two specific issues are considered: the network nature of the ESS, and the new ways of producing statistics that reinforce the needs for research applied to statistics in domains such as data integration, distributed databases, EDI, automated data capture, analytical tools and dissemination. A historical overview is given of research activities financed by the European Commission as well as their relevance for DaWaK2000. A primary goal of this paper is to provide information about relevant research within the European Statistical System, and to invite the scientific community to participate actively in upcoming calls for proposals and calls for tender financed under the IST programme to solve the urgent needs for timely and high-quality statistics.

Jean-Louis Mercy, Harald Sonnberger

Warehouse Views Selection and Evolution

Elimination of Redundant Views in Multidimensional Aggregates

On-line analytical processing provides multidimensional data analysis, through extensive computation based on aggregation, along many dimensions and hierarchies. To accelerate query-response time, pre-computed results are often stored for later retrieval. This adds a prohibitive storage overhead when applied to the whole set of aggregates. In this paper we describe a novel approach which provides the means for the efficient selection, computation and storage of multidimensional aggregates. The approach identifies redundant aggregates, by inspection, thus allowing only distinct aggregates to be computed and stored. We propose extensions to relational theory and also present new algorithms for implementing the approach, providing a solution which is both scalable and low in complexity. The experiments were conducted using real and synthetic datasets and demonstrate that significant savings in computation time and storage space can be achieved when redundant aggregates are eliminated. Savings have also been shown to increase as dimensionality increases. Finally, the implications of this work affect the indexing and maintenance of views and the user interface.

Nikolaos Kotsis, Douglas R. McGregor

Data Cube Compression with QuantiCubes

Data warehouses typically store a multidimensional fact representation of the data that can be used in any type of analysis. Many applications materialize data cubes as multidimensional arrays for fast, direct and random access to values. Those data cubes are used for exploration, with operations such as roll-up, drill-down, slice and dice. The data cubes can become very large, increasing the amount of I/O significantly due to the need to retrieve a large number of cube chunks. The large cost associated with I/O leads to degraded performance. The data cube can be compressed, but traditional compression techniques do not render it queriable, as they compress and decompress reasonably large blocks and have large costs associated with the decompression and indirect access. For this reason they are mostly used for archiving. This paper uses the QuantiCubes compression strategy that replaces the data cube by a smaller representation while maintaining full queriability and random access to values. The results show that the technique produces large improvement in performance.

Pedro Furtadoand, Henrique Madeira

History-Driven View Synchronization

Views over distributed information sources rely on the stability of the schemas of these underlying sources. In the event of meta data changes in the sources, such as the deletion of a table or column, such views may become undefined. Using meta data about information redundancy and user preferences, views can be evolved as necessary to be defined on a modified information space after a source meta data change, while assuring the best possible compliance to preferences that users may have.Previous work in view synchronization focused only on deletions of schema elements. We now offer the first approach that makes use of additions also. The algorithm is based partly on returning view definitions to previous versions by “backtracking” in the history of views and meta data. This technology enables us to adapt views to temporary meta data changes by cancelling out opposite changes and allows undo/redooperations on meta data without deteriorating the quality of the view. The mechanism described in this paper will therefore improve the quality of evolved views.

Andreas Koeller, Elke A. Rundensteiner

A Logical Model for Data Warehouse Design and Evolution

A data warehouse is a software infrastructure which supports OLAP applications by providing a collection of tools for data extraction and cleaning, data integration and aggregation, and data organization into multidimensional structures. At the design level, a data warehouse is defined as a hierarchy of view expressions whose ultimate nodes are queries on data sources. In this paper, we propose a logical model for a data warehouse representation which consists of a hierarchy of views, namely the base views, the intermediate views and the users views. This schema can be used for different design purposes, as the evolution of a data warehouse which is also the focus of this paper.

Mokrane Bouzeghoub, Zoubida Kedad

OLAP System Design and Query Analysis

An Alternative Relational OLAP Modeling Approach

Schema design is one of the fundamentals in database theory and practice as well. In this paper, we discuss the problem of locally valid dimensional attributes in a classification hierarchy of a typical OLAP scenario. In a first step, we show that the traditional star and snowflake schema approach is not feasible in this very natural case of a hierarchy. Therefore, we sketch two alternative modeling approaches resulting in practical solutions and a seamless extension of the traditional star and snowflake schema approach: In a pure relational approach, we replace each dimension table of a star / snowflake schema by a set of views directly reflecting the classification hierarchy. The second approach takes advantage of the object-relational extensions. Using object-relational techniques in the context for the relational representation of a multidimensional OLAP scenario is a novel approach and promises a clean and smooth schema design.

Andreas Bauer, Wolfgang Hümmer, Wolfgang Lehner

Functional Dependencies in Controlling Sparsity of OLAP Cubes

We will study how relational dependency information can be applied to OLAP cube design. We use dependency information to control sparsity, since functional dependencies between dimensions clearly increase sparsity. Our method helps the user in finding dimensions and hierarchies, identifying sparsity risks, and finally changing the design in order to get a more suitable result. Sparse raw data, a large amount of pre-calculated aggregations, and many dimensions may expand the need of the storage space so rapidly that the problem cannot be solved by increasing the capacity of the system. We give two methods to construct suitable OLAP cubes. In the synthesis method, attributes are divided into equivalence classes according to dependencies in which they participate. Each equivalence class may form a dimension. The decomposition method is applied when candidates for dimensions exist. We decompose dimensions based on conflicts, and construct new cubes for removed dimensions until no conflicts between dimensions exist.

Tapio Niemi, Jyrki Nummenmaa, Peter Thanisch

An OLAP-based Scalable Web Access Analysis Engine

Collecting and mining web log records (WLRs) from e-commerce web sites has become increasingly important for targeted marketing, promotions, and traffic analysis. In this paper, we describe a scalable data warehousing and OLAP-based engine for analyzing WLRs. We have to address several scalability and performance challenges in developing such a framework. Because an active web site may generate hundreds of millions of WLRs daily, we have to deal with huge data volumes and data flow rates. To support fine-grained analysis, e.g., individual users’ access profiles, we end up with huge, sparse data cubes defined over very large-sized dimensions (there may be hundreds of thousands of visitors to the site and tens of thousands of pages). While OLAP servers store sparse cubes quite efficiently, rolling up a very large cube can take prohibitively long. We have applied several non-traditional approaches to deal with this problem, which allow us to speed up WLR analysis by 3 orders of magnitude. Our framework supports multilevel and multidimensional pattern extraction, analysis and feature ranking, and in addition to the typical OLAP operations, supports data mining operations such as extended multilevel and multidimensional association rules.

Qiming Chen, Umeshwar Dayal, Meichun Hsu

PROMISE: Predicting Query Behavior to Enable Predictive Caching Strategies for OLAP Systems

This paper discusses the PROMISE (Predicting User Behavior in Multidimensional Information System Environments) approach, that deploys information about characteristic patterns in the user’s multidimensional data access in order to improve caching algorithms of OLAP systems. The paper motivates this approach by presenting results of an analysis of the user behavior in a real-world OLAP environment. Further contributions of this paper are a model for characteristic OLAP query patterns based on Markov Models and a corresponding OLAP query prediction algorithm.

Carsten Sapia

OLAP Query Evaluation

Supporting Online Queries in ROLAP

Data warehouses are becoming a powerful tool to analyze enterprise data. A critical demand imposed by the users of data warehouses is that the time to get an answer (latency) after posing a query is to be as short as possible. It is arguable that a quick, albeit approximate, answer that can be refined over time is much better than a perfect answer for which a user has to wait a long time. In this paper we addressed the issue of online support for data warehouse queries, meaning the ability to reduce the latency of the answer at the expense of having an approximate answer that can be refined as the user is looking at it. Previous work has address the online support by using sampling techniques. We argue that a better way is to preclassify the cells of the data cube into error bins and bring the target data for a query in “waves,” i.e., by fetching the data in those bins one after the other. The cells are classified into bins by means of the usage of a data model (e.g., linear regression, log-linear models) that allows the system to obtain an approximate value for each of the data cube cells. The difference between the estimated value and the true value is the estimation error, and its magnitude determines to which bin the cell belongs. The estimated value given by the model serves to give a very quick, yet approximate answer, that will be refined online by bringing cells from the error bins. Experiments show that this technique is a good way to support online aggregation.

Daniel Barbará, Xintao Wu

Optimal Multidimensional Query Processing Using Tree Striping

In this paper, we propose a new technique for multidimensional query processing which can be widely applied in database systems. Our new technique, called tree striping, generalizes the well-known inverted lists and multidimensional indexing approaches. A theoretical analysis of our generalized technique shows that both, inverted lists and multidimensional indexing approaches, are far from being optimal. A consequence of our analysis is that the use of a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree striping is to use the optimal number k of lower dimensional indexes determined by our theoretical analysis for efficient query processing. We confirm our theoretical results by an experimental evaluation on large amounts of real and synthetic data. The results show a speed-up of up to 310% over the multidimensional indexing approach and a speed-up factor of up to 123 (12,300%) over the inverted-lists approach.

Stefan Berchtold, Christian Böhm, Daniel A. Keim, Hans-Peter Kriegel, Xiaowei Xu

Enhancing Preprocessing in Data-Intensive Domains using Online-Analytical Processing

The application of data mining algorithms needs a goal-oricntcd preprocessing of the data. In practical applications the preprocessing task is very time consuming and has an important influence on the quality of the generated models. In this paper we describe a new approach for data preprocessing. Combining database technology with classical data mining systems using an OLAP engine as interface we outline an architecture for OLAP-based preprocessing that enables interactive and iterative processing of data. This high level of interaction between human and database system enables efficient understanding and preparing of data for building scalable data mining applications. Our case study taken from the data-intensive telecommunication domain applies the proposed methodology for deriving user communication profiles. These user profiles are given as input to data mining algorithms for clustering customers with similar behavior.

Alexander Maedche, Andreas Hotho, Markus Wiese

Meta-queries - Computation and Evaluation

Metaquery (also known as metapattern) is a datamining tool useful for learning rules involving more than one relation in the database. A metaquery is a template, or a second-order proposition in a language that describes the type of pattern to be discovered. In an earlier paper we discussed the efficient computation of support for Meta-queries. In this paper we extend this work by comparing several support computation techniques.We also give real-life examples of meaningful rules which were derived by our method, and discuss briey the software environment in which the meta-queries were run (the FLEXIMINE system). Finally we compare Meta-queries to Association rules and discuss their differences.

Rachel Ben-Eliyahu-Zohary, Ehud Gudes

Partitioning Algorithms for the Computation of Average Iceberg Queries

Iceberg queries are to compute aggregate functions over an attribute (or set of attributes) to find aggregate values above some specified threshold. It’s difficult to execute these queries because the number of unique data is greater than the number of counter buckets in memory. However, previous research has the limitation that average functions were out of consideration among aggregate functions. So, in order to compute average iceberg queries efficiently we introduce the theorem to select candidates by means of partitioning, and propose POP algorithm based on it. The characteristics of this algorithm are to partition a relation logically and to postpone partitioning to use memory efficiently until all buckets are occupied with candidates. Experiments show that proposed algorithm is affected by memory size, data order, and the distribution of data set.

Jinuk Bae, Sukho Lee

Invited Talk

Security in Data Warehousing

Invited Talk

Data warehouse [2, 4, 5, 6] is an integrated repository derived from multiple source (operational and legacy) databases. The data warehouse is created by either replicating the different source data or transforming them to new representation. This process involves reading, cleaning, aggregating and storing the data in the warehouse model. The software tools are used to access the warehouse for strategic analysis, decision-making, marketing types of applications. It can be used for inventory control of shelf stock in many departmental stores.

Bharat Bhargava

Association Rules

Mining of Association Rules in Text Databases Using Inverted Hashing and Pruning

In this paper, we propose a new algorithm named Inverted Hashing and Pruning (IHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently, because of the large number of itemsets (i.e., words) that need to be counted. Two well-known mining algorithms, the Apriori algorithm [1] and Direct Hashing and Pruning (DHP) algorithm [5], are evaluated in the context of mining text databases, and are compared with the proposed IHP algorithm. It has been shown that the IHP algorithm has better performance for large text databases.

John D. Holt, Soon M. Chung

SQL Based Association Rule Mining Using Commercial RDBMS (IBM DB2 UDB EEE)

Data mining is becoming increasingly important since the size of databascs grows even larger and the need to explore hidden rules from the databases becomes widely recognized. Currently database systems arc dominated by relational database and the ability to perform data mining using standard SQL queries will definitely case implementation of data mining. However the performance of SQL based data mining is known to fall behind specialized implementation and expensive mining tools being on sale. In this paper we present an evaluation of SQL based data mining on commercial RDBMS (IBM DB2 UDB EEE). We examine some techniques to reduce I/O cost by using View and Subquery. Those queries can be more than 6 times faster than SETM SQL query reported previously. In addition, we have made performance evaluation on parallel database environment and compared the performance result with commercial data mining tool (IBM Intelligent Miner). We prove that SQL based data mining can achieve sufficient performance by the utilization of SQL query customization and database tuning. Keywords: data mining, parallel SQL, query optimization, commercial RDBMS

Takeshi Yoshizawa, Iko Pramudiono, Masaru Kitsuregawa

On Supporting Interactive Association Rule Mining

We investigate ways to support interactive mining sessions, in the setting of association rule mining. In such sessions, users specify conditions (filters) on the associations to be generated. Our approach is a combination of the incorporation of filtering conditions inside the mining phase, and the filtering of already generated associations. We present several concrete algorithms and compare their performance.

Bart Goethals, Jan Van den Bussche

Temporal Association Rules

Discovering Temporal Patterns for Interval-based Events

In many daily transactions, the time when an event takes place is known and stored in databases. Examples range from sales records, stock exchange, patient records, to scientific databases in geophysics and astronomy. Such databases incorporate the concept of time which describes when an event starts and ends as historical records [9]. The temporal nature of data provides us with a better understanding of the trend or pattern over time. In market-basket data, we can have a pattern like “75% of customers buy peanuts when butter starts to be in big sales and before bread is sold out”. We observe that there may be some correlations among peanuts, butter and bread so that we can have better planning for marketing strategy. Knowledge discovery in temporal databases thus catches the attention of researchers [8, 4].

Po-shan Kam, Ada Wai-chee Fu

An Integrated Query and Mining System for Temporal Association Rules

In real world the knowledge used for aiding decision-making is always time varying. Most existing data mining approaches assume that discovered knowledge is valid indefinitely. Temporal features of the knowledge are not taken into account in mining models or processes. As a consequence, people who expect to use the discovered knowledge may not know when it became valid or whether it is still valid. This limits the usability of discovered knowledge. In this paper, temporal features are considered as important components of association rules for better decision-making. The concept of temporal association rules is formally defined and the problems of mining these rules are addressed. These include identification of valid time periods and identification of periodicities of an association rule, and mining of association rules with a specific temporal feature. A system has been designed and implemented for supporting the iterative process of mining temporal association rules, along with an interactive query and mining interface with an SQL-like mining language.

Xiaodong Chen, Ilias Petrounias

Mining Changes for Real-Life Applications

Much of the data mining research has been focused on devising techniques to build accurate models and to discover rules from databases. Relatively little attention has been paid to mining changes in databases collected over time. For businesses, knowing what is changing and how it has changed is of crucial importance because it allows businesses to provide the right products and services to suit the changing market needs. If undesirable changes are detected, remedial measures need to be implemented to stop or to delay such changes. In many applications, mining for changes can be more important than producing accurate models for prediction. A model, no matter how accurate, can only predict based on patterns mined in the old data. That is, a model requires a stable environment, otherwise it will cease to be accurate. However, in many business situations, constant human intervention (i.e., actions) to the environment is a fact of life. In such an environment, building a predictive model is of limited use. Change mining becomes important for understanding the behaviors of customers. In this paper, we study change mining in the contexts of decision tree classification for real-life applications.

Bing Liu, Wynne Hsu, Heng -Siew Han, Yiyuan Xia

AIM: Approximate Intelligent Matching for Time Series Data

Time-series data mining presents many challenges due to the intrinsic large scale and high dimensionality of the data sets. Sub-sequence similarity matching has been an active research area driven by the need to analyse large data sets in the financial, biomedical and scientific databases. In this paper, we investigate an intelligent subsequence similarity matching of time series queries based on efficient graph traversal. We introduce a new problem, the approximate partial matching of a query sequence in a time series database. Our system can address such queries with high specificity and minimal time and space overhead. The performance bottleneck of the current methods were analysed and we show our method can improve the performance of the time series queries significantly. It is general and exible enough to find the best approximate match query without specifying a tolerance ε parameter.

Edward D. Kim, Joyce M. W. Lam, Jiawei Han

Mining Complex Databases

Cofe: A Scalable Method for Feature Extraction from Complex Objects

Feature Extraction, also known as Multidimensional Scaling, is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. We consider the problem of feature extraction when the data-points are complex and the distance evaluation function is very expensive to evaluate. Examples of expensive distance evaluations include those for computing the Hausdor. distance between polygons in a spatial database, or the edit distance between macromolecules in a DNA or protein database.We propose Cofe, a method for sparse feature extraction which is based on novel random non-linear projections. We evaluate Cofe on real data and find that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time.We further propose Cofe-GR, which matches Cofe in terms of distance evaluations and run-time, but outperforms it in terms of quality of features extracted.

Gabriela Hristescu, Martin Farach-Colton

The Pruning Power: Theory and Heuristics for Mining Databases with Multiple k-Nearest-Neighbor Queries

Numerous data mining algorithms rely heavily on similarity queries. Although many or even all of the performed queries do not depend on each other, the algorithms process them in a sequential way. Recently, a novel technique for efficiently processing multiple similarity queries issued simultaneously has been introduced. It was shown that multiple similarity queries substantially speed-up query intensive data mining applications. For the important case of multiple k-nearest neighbor queries on top of a multidimensional index structure the problem of scheduling directory pages and data pages arises. This aspect has not been addressed so far. In this paper, we derive the theoretic foundation of this scheduling problem. Additionally, we propose several scheduling algorithms based on our theoretical results. In our experimental evaluation, we show that considering the maximum priority of pages clearly outperforms other scheduling approaches.

Christian Böhm, Bernhard Braunmüller, Hans-Peter Kriegel

Data Mining Support in Database Management Systems

The most popular data mining techniques consist in searching databases for frequently occurring patterns, e.g. association rules, sequential patterns. We argue that in contrast to today’s loosely-coupled tools, data mining should be regarded as advanced database querying and supported by Database Management Systems (DBMSs). In this paper we describe our research prototype system, which logically extends DBMS functionality, offering extensive support for pattern discovery, storage and management. We focus on the system architecture and novel SQL-based data mining query language, which serves as the user interface to the system.

Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz

Decision trees for probabilistic data

We propose an algorithm to build decision trees when the observed data are probability distributions. This is of interest when one deals with massive database or with probabilistic models. We illustrate our method with a dataset describing districts of Great Britain. Our decision tree yields rules which explain the unemployment rate.The decision tree in our case is built by replacing the test X > α, which is used to split the nodes in the usual case of real numbers, by the test P(X > α) < β, where α and β are determined through an algorithm based on probabilistic split evaluation criteria.

Jean-Pascal Aboa, Richard Emilion

Mining Frequent Binary Expressions

In data mining, searching for frequent patterns is a common basic operation. It forms the basis of many interesting decision support processes. In this paper we present a new type of patterns, binary expressions. Based on the properties of a specified binary test, such as reflexivity, transitivity and symmetry, we construct a generic algorithm that mines all frequent binary expressions.We present three applications of this new type of expressions: mining for rules, for horizontal decompositions, and in intensional database relations.Since the number of binary expressions can become exponentially large, we use data mining techniques to avoid exponential execution times. We present results of the algorithm that show an exponential gain in time due to a well chosen pruning technique.

Toon Calders, Jan Paredaens

A Fast Algorithm for Hierarchical Text Classification

Text classification is becoming more important with the proliferation of the Internet and the huge amount of data it transfers. We present an efficient algorithm for text classification using hierarchical classifiers based on a concept hierarchy. The simple TFIDF classifier is chosen to train sample data and to classify other new data. Despite its simplicity, results of experiments on Web pages and TV closed captions demonstrate high classification accuracy. Application of feature subset selection techniques improves the performance. Our algorithm is computationally efficient being bounded by O(n log n) for n samples.

Wesley T. Chuang, Asok Tiyyagura, Jihoon Yang, Giovanni Giuffrida

A Hybrid Technique for Data Mining on Balance-Sheet Data

Recent rapid growth in the ability to generate and store data by more powerful Database Management Systems and hardware architecture, leads to a question: how can we take advantage of this large amount of information? Traditional methods for querying and reporting are inadequate because they can only manipulate data and the information content derived is very low. Obtaining new relationships among data and new hypotheses about them is the aim of Knowledge Discovery in Databases (KDD) which makes use of Data Mining techniques. These techniques have interesting applications for business data such as market basket analysis, financial resource planning, fraud detection and the scheduling of production processes. In this work we consider the application of Data Mining techniques for the analysis of the balance-sheets of Italian companies.

G. Dattilo, S. Greco, E. Masciari, L. Pontieri

Mondou: Information Navigator with Visual Interface

Since 1995, we have been developing web search engine “Mondou” using data mining techniques (http://www.kuamp.kyoto-u.ac.jp/labs/ infocom/mondou/index_e.html) in order to discover the helpful information for web search operations. In our previous works, we focus on the computing cost to derive associative keywords, we propose the method to determine system parameters, such as Minsup and Minconf threshold values. Moreover we evaluate the ROC performance of derived keywords by weighted association algorithms. In this paper, we try to implement two kinds of Java applets in our Mondou system, such as ROC graph for selecting associative keywords and documents clustering. This visual interface shows characteristics of associative rules on the ROC graph with the Minsup values. It also provides the function of document clustering in order to visualize retrieved documents.

Hiroyuki Kawano, Minoru Kawahara

Vmhist: Efficient Multidimensional Histograms with Improved Accuracy

Data warehouses must be able to process and analyze large amounts of information quickly and efficiently. Small summaries provide a very efficient way to obtain fast approximate answers to complex queries that run for too long. This paper proposes an efficient hierarchical partitioning strategy vmhist achieving a large improvement in the accuracy of the summary while maintaining all scalability. This is achieved by pre-computation, localized updating and additivity of the error measures used in the partitioning process. Evaluation reveals that a significant accuracy improvement is obtained for summaries produced with vmhist without significant increase in histogram construction time cost.

Pedro Furtado, Henrique Madeira

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Data Warehouse Design

The Design and Development of a Logical System for OLAP

Applying Vertical Fragmentation Techniques in Logical Design of Multidimensional Databases

Space-Efficient Data Cubes for Dynamic Environments

On Making Data Warehouses Active

Materialized Views

Supporting Hot Spots with Materialized Views

Evaluation of Materialized View Indexing in Data Warehousing Environments

View Derivation Graph with Edge Fitting for Adaptive Data Warehousing

On the Importance of Tuning in Incremental View Maintenance: An Experience Case Study

Warehouse Data Creation and Maintenance

BEDAWA - A Tool for Generating Sample Data for Data Warehouses

DyDa: Dynamic Data Warehouse Maintenance in a Fully Concurrent Environment

Scalable Maintenance of Multiple Interrelated Data Warehousing Systems

View Maintenance for Hierarchical Semistructured Data

Maintaining Horizontally Partitioned Warehouse Views

Invited Talk

Funding Research in Data Warehousing and Knowledge Discovery EPROS: The European Plan for Research in Official Statistics

Warehouse Views Selection and Evolution

Elimination of Redundant Views in Multidimensional Aggregates

Data Cube Compression with QuantiCubes

History-Driven View Synchronization

A Logical Model for Data Warehouse Design and Evolution

OLAP System Design and Query Analysis

An Alternative Relational OLAP Modeling Approach

Functional Dependencies in Controlling Sparsity of OLAP Cubes

An OLAP-based Scalable Web Access Analysis Engine

PROMISE: Predicting Query Behavior to Enable Predictive Caching Strategies for OLAP Systems

OLAP Query Evaluation

Supporting Online Queries in ROLAP

Optimal Multidimensional Query Processing Using Tree Striping

Enhancing Preprocessing in Data-Intensive Domains using Online-Analytical Processing

Meta-queries - Computation and Evaluation

Partitioning Algorithms for the Computation of Average Iceberg Queries

Invited Talk

Security in Data Warehousing

Association Rules

Mining of Association Rules in Text Databases Using Inverted Hashing and Pruning

SQL Based Association Rule Mining Using Commercial RDBMS (IBM DB2 UDB EEE)

On Supporting Interactive Association Rule Mining

Temporal Association Rules

Discovering Temporal Patterns for Interval-based Events

An Integrated Query and Mining System for Temporal Association Rules

Mining Changes for Real-Life Applications

AIM: Approximate Intelligent Matching for Time Series Data

Mining Complex Databases

Cofe: A Scalable Method for Feature Extraction from Complex Objects

The Pruning Power: Theory and Heuristics for Mining Databases with Multiple k-Nearest-Neighbor Queries

Data Mining Support in Database Management Systems

Decision trees for probabilistic data

Mining Frequent Binary Expressions

A Fast Algorithm for Hierarchical Text Classification

A Hybrid Technique for Data Mining on Balance-Sheet Data

Mondou: Information Navigator with Visual Interface

Vmhist: Efficient Multidimensional Histograms with Improved Accuracy

Backmatter