
2002 | Book

Data Warehousing and Knowledge Discovery

4th International Conference, DaWaK 2002 Aix-en-Provence, France, September 4–6, 2002 Proceedings

Edited by: Yahiko Kambayashi, Werner Winiwarter, Masatoshi Arikawa

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

Within the last few years Data Warehousing and Knowledge Discovery technology has established itself as a key technology for enterprises that wish to improve the quality of the results obtained from data analysis, decision support, and the automatic extraction of knowledge from data. The Fourth International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2002) continues a series of successful conferences dedicated to this topic. Its main objective is to bring together researchers and practitioners to discuss research issues and experience in developing and deploying data warehousing and knowledge discovery systems, applications, and solutions. The conference focuses on the logical and physical design of data warehousing and knowledge discovery systems. The scope of the papers covers the most recent and relevant topics in the areas of association rules, clustering, Web mining, security, data mining techniques, data cleansing, applications, data warehouse design and maintenance, and OLAP. These proceedings contain the technical papers selected for presentation at the conference. We received more than 100 papers from over 20 countries, and the program committee finally selected 32 papers. The conference program included one invited talk: “Text Mining Applications of a Shallow Parser” by Walter Daelemans, University of Antwerp, Belgium. We would like to thank the DEXA 2002 Workshop General Chair (Roland Wagner) and the organizing committee of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002) for their support and their cooperation.

Table of contents

Frontmatter

Association Rules

A Comparison between Query Languages for the Extraction of Association Rules

Recently, inductive databases (IDBs) have been proposed to address the problem of knowledge discovery from huge databases. With an IDB, the user/analyst performs a set of very different operations on data using a special-purpose language, powerful enough to perform all the required manipulations, such as data preprocessing, pattern discovery, and pattern post-processing. In this paper we present a comparison between query languages (MSQL, DMQL, and MINE RULE) that have been proposed for association rule extraction in recent years and discuss their common features and differences. We present them using a set of examples taken from the real practice of data mining. This allows us to define language design guidelines, with particular attention to the open issues on IDBs.

Marco Botta, Jean-Francois Boulicaut, Cyrille Masson, Rosa Meo
Learning from Dissociations*

Standard association rules encapsulate the positive relationship between two sets of items: the presence of X is a good predictor for the simultaneous presence of Y. We argue that the absence of an association rule conveys valuable information as well. Dissociation rules are rules that capture the negative relationship between two sets of items: the presence of X is not a good predictor for the presence of Y. We develop a representation for augmenting standard association rules with dissociation information, and present experimental results suggesting that such augmented rules can improve the quality of the associations obtained, both in terms of rule accuracy and in terms of using these rules as a guide to making decisions.

Choh Man Teng
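
The abstract describes dissociations only at a high level; as a rough illustration of the contrast between a rule and its dissociation (not the authors' representation), the Python sketch below computes the confidence of X -> Y on a toy transaction set and reads 1 - conf(X -> Y) as the strength of "X without Y". All data and thresholds are invented for illustration.

    # Toy illustration of association vs. dissociation between itemsets.
    # This is a hedged sketch, not the representation proposed in the paper.

    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def confidence(x, y, transactions):
        """Confidence of the rule X -> Y, i.e. P(Y | X)."""
        sx = support(x, transactions)
        return support(set(x) | set(y), transactions) / sx if sx else 0.0

    transactions = [
        {"bread", "butter"}, {"bread", "butter", "milk"},
        {"bread", "jam"}, {"beer", "chips"}, {"beer", "chips", "bread"},
    ]

    x, y = {"bread"}, {"butter"}
    assoc = confidence(x, y, transactions)     # how well X predicts Y
    dissoc = 1.0 - assoc                       # how often X appears WITHOUT Y
    print(f"conf(X -> Y)     = {assoc:.2f}")
    print(f"conf(X -> not Y) = {dissoc:.2f}  (dissociation strength)")
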
Mining Association Rules from XML Data

The eXtensible Markup Language (XML) has rapidly emerged as a standard for representing and exchanging information. The fast-growing amount of available XML data creates a pressing need for languages and tools to manage collections of XML documents, as well as to mine interesting information from them. Although the data mining community has not yet rushed into the use of XML, there have been some proposals to exploit XML. In practice, however, these proposals mainly rely on more or less traditional relational databases with an XML interface. In this paper, we introduce association rules from native XML documents and discuss the new challenges and opportunities that this topic poses to the data mining community. More specifically, we introduce an extension of XQuery for mining association rules. This extension is used throughout the paper to better define association rule mining within XML and to emphasize its implications in the XML context.

Daniele Braga, Alessandro Campi, Mika Klemettinen, PierLuca Lanzi
Estimating Joint Probabilities from Marginal Ones*

Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, minAB and prodAB, to estimate joint probabilities. Both methods are based on a lightweight structure, the partition support. The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate the joint probabilities of itemsets of higher cardinality. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also demonstrate a case study of using the estimation methods in the Apriori algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining/learning tasks [9]. Experimental results show the effectiveness of the estimation methods.

Tao Li, Shenghuo Zhu, Mitsunori Ogihara, Yinhe Cheng
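
The exact definitions of minAB and prodAB are given in the paper; the hedged Python sketch below only conveys the flavour of estimating a joint probability from per-partition counts, assuming that a min-style estimator takes the per-partition minimum (an upper bound on the joint count) and a product-style estimator assumes independence within each partition. Data and names are invented.

    # Hedged sketch of estimating the joint support of two itemsets A and B
    # from per-partition ("partition support") counts.  The exact formulas
    # here are assumptions: a min-style per-partition upper bound and a
    # prod-style per-partition independence estimate.

    def partition_counts(itemset, partitions):
        """Number of transactions containing `itemset`, per partition."""
        itemset = set(itemset)
        return [sum(1 for t in part if itemset <= t) for part in partitions]

    def estimate_joint(a, b, partitions, method="min"):
        ca, cb = partition_counts(a, partitions), partition_counts(b, partitions)
        est = 0.0
        for count_a, count_b, part in zip(ca, cb, partitions):
            n = len(part)
            if n == 0:
                continue
            if method == "min":     # joint count in a partition <= min of the two counts
                est += min(count_a, count_b)
            else:                   # assume independence within the partition
                est += count_a * count_b / n
        total = sum(len(p) for p in partitions)
        return est / total          # estimated joint probability

    partitions = [
        [{"a", "b"}, {"a"}, {"b", "c"}],
        [{"a", "b", "c"}, {"c"}, {"a", "b"}],
    ]
    print(estimate_joint({"a"}, {"b"}, partitions, "min"))
    print(estimate_joint({"a"}, {"b"}, partitions, "prod"))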

Clustering

Self-Tuning Clustering: An Adaptive Clustering Method for Transaction Data

In this paper, we devise an efficient algorithm for clustering market-basket data items. Market-basket data analysis has been well addressed in mining association rules for discovering the set of large items, i.e., the frequently purchased items among all transactions. In essence, clustering is meant to divide a set of data items into proper groups in such a way that items in the same group are as similar to one another as possible. In view of the nature of clustering market-basket data, we present a measurement, called the small-large (SL) ratio, which is in essence the ratio of the number of small items to that of large items. Clearly, the smaller the SL ratio of a cluster, the more similar to one another the items in the cluster are. Then, by utilizing a self-tuning technique for adaptively tuning the input and output SL ratio thresholds, we develop an efficient clustering algorithm, algorithm STC (standing for Self-Tuning Clustering), for clustering market-basket data. The objective of algorithm STC is: “Given a database of transactions, determine a clustering such that the average SL ratio is minimized.” We conduct several experiments on real data and synthetic workloads for performance studies. Our experimental results show that, by utilizing the self-tuning technique to adaptively minimize the input and output SL ratio thresholds, algorithm STC performs very well. Specifically, algorithm STC not only incurs an execution time that is significantly smaller than that of prior works but also leads to clustering results of very good quality.

Ching-Huang Yun, Kun-Ta Chuang, Ming-Syan Chen
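
As a minimal sketch of the small-large (SL) ratio described above, the snippet below treats an item as "large" in a cluster when its support within the cluster reaches a user-given threshold and as "small" otherwise; the threshold and data are illustrative assumptions, and the self-tuning machinery of STC is not reproduced.

    # Minimal sketch of the small-large (SL) ratio of a cluster of transactions.

    def sl_ratio(cluster, min_support=0.5):
        """cluster: list of transactions (sets of items)."""
        n = len(cluster)
        items = set().union(*cluster) if cluster else set()
        large = {i for i in items
                 if sum(1 for t in cluster if i in t) / n >= min_support}
        small = items - large
        return len(small) / len(large) if large else float("inf")

    cluster = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "bread", "jam"}]
    print(sl_ratio(cluster))   # a low ratio -> transactions in the cluster are similar
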
CoFD: An Algorithm for Non-distance Based Clustering in High Dimensional Spaces*

The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high dimensional spaces. In this paper, we propose the CoFD algorithm, a non-distance based clustering algorithm for high dimensional spaces. Based on the maximum likelihood principle, CoFD optimizes parameters to maximize the likelihood between the data points and the model generated by the parameters. Experimental results on both synthetic data sets and a real data set show the efficiency and effectiveness of CoFD.

Shenghuo Zhu, Tao Li, Mitsunori Ogihara
An Efficient K-Medoids-Based Algorithm Using Previous Medoid Index, Triangular Inequality Elimination Criteria, and Partial Distance Search

Clustering in data mining is a discovery process that groups similar objects into the same cluster. Various clustering algorithms have been designed to fit various requirements and constraints of applications. In this paper, we study several k-medoids-based algorithms, including the PAM, CLARA, and CLARANS algorithms. A novel and efficient approach is proposed to reduce the computational complexity of such k-medoids-based algorithms by using the previous medoid index, triangular inequality elimination criteria, and partial distance search. Experimental results based on elliptic, curve, and Gauss-Markov databases demonstrate that the proposed algorithm applied to CLARANS may reduce the number of distance calculations by 67% to 92% while retaining the same average distance per object. In terms of running time, the proposed algorithm may reduce computation time by 38% to 65% compared with the CLARANS algorithm.

Shu-Chuan Chu, John F. Roddick, J. S. Pan
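
As a hedged sketch of two of the pruning ideas named above, the snippet below applies partial distance search (abandon a distance computation once it exceeds the current best) and a triangular inequality test between medoids to the assignment step of a k-medoids pass; the surrounding PAM/CLARA/CLARANS machinery and the previous-medoid-index optimization are omitted.

    # Hedged sketch of partial distance search and triangular inequality
    # elimination in the assignment step of a k-medoids pass.
    import math

    def partial_dist(x, y, best_sq):
        """Squared Euclidean distance, abandoned early once it exceeds best_sq."""
        acc = 0.0
        for xi, yi in zip(x, y):
            acc += (xi - yi) ** 2
            if acc > best_sq:
                return math.inf
        return acc

    def assign(point, medoids, medoid_dists):
        """Index of the closest medoid; medoid_dists[i][j] = d(m_i, m_j)."""
        best, best_sq = 0, partial_dist(point, medoids[0], math.inf)
        for j in range(1, len(medoids)):
            # Triangular inequality: if d(m_best, m_j) >= 2 * d(x, m_best),
            # then m_j cannot be closer than m_best, so skip it.
            if medoid_dists[best][j] >= 2.0 * math.sqrt(best_sq):
                continue
            d = partial_dist(point, medoids[j], best_sq)
            if d < best_sq:
                best, best_sq = j, d
        return best

    medoids = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
    medoid_dists = [[math.dist(a, b) for b in medoids] for a in medoids]
    print(assign((1.0, 1.2), medoids, medoid_dists))   # -> 0 (closest medoid)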

Web Mining and Security

A Hybrid Approach to Web Usage Mining

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web has become an important research area. Web usage mining, which is the main topic of this paper, focuses on knowledge discovery from the clicks in the web log for a given site (the so-called click-stream), especially on analysis of sequences of clicks. Existing techniques for analyzing click sequences have different drawbacks, i.e., either huge storage requirements, excessive I/O cost, or scalability problems when additional information is introduced into the analysis. In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) and Click Fact Table approaches. The approach allows for additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. The development is driven by experiences gained from industry collaboration. A prototype has been implemented and experiments are presented that show that the hybrid approach performs well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks.

Søren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen
Building and Exploiting Ad Hoc Concept Hierarchies for Web Log Analysis

Web usage mining aims at the discovery of interesting usage patterns from Web server log files. “Interestingness” relates to the business goals of the site owner. However, business goals refer to business objects rather than the page hits and script invocations recorded by the site server. Hence, Web usage analysis requires a preparatory mechanism that incorporates the business goals, the concepts reflecting them and the expert’s background knowledge on them into the mining process. To this purpose, we present a methodology and a mechanism for the establishment and exploitation of application-oriented concept hierarchies in Web usage analysis. We demonstrate our approach on a real data set and show how it can substantially improve both the search for interesting patterns by the mining algorithm and the interpretation of the mining results by the analyst.

Carsten Pohle, Myra Spiliopoulou
Authorization Based on Evidence and Trust*

Developing authorization mechanisms for secure information access by a large community of users in an open environment is challenging. Current research efforts grant privileges to a user based on her objective properties, which are demonstrated by digital credentials (evidence). However, holding credentials is not sufficient to certify that a user is trustworthy. Therefore, we propose using the notion of trust to characterize the probability that a user will not harm an information system. We present a trust-enhanced role-mapping server, which cooperates with RBAC (Role-Based Access Control) mechanisms to jointly implement authorization based on evidence and trust. A prerequisite for this is our proposed formalization of trust and evidence.

Bharat Bhargava, Yuhui Zhong
An Algorithm for Building User-Role Profiles in a Trust Environment

A good direction towards building secure systems that operate efficiently in large-scale environments (like the World Wide Web) is the deployment of Role Based Access Control (RBAC) methods. RBAC architectures do not deal with each user separately, but with discrete roles that users can acquire in the system. The goal of this paper is to present a classification algorithm that, during its training phase, classifies the roles of users into clusters. The behavior of each user that enters the system holding a specific role is traced via audit trails, and any misbehavior is detected and reported (classification phase). This algorithm will be incorporated into the Role Server architecture, currently under development, enhancing its ability to dynamically adjust the amount of trust placed in each user and to update the corresponding role assignments.

Evimaria Terzi, Yuhui Zhong, Bharat Bhargava, Pankaj, Sanjay Madria

Data Mining Techniques

Neural-Based Approaches for Improving the Accuracy of Decision Trees

Decision-tree learning algorithms, e.g., C5, are good at dataset classification. However, those algorithms usually work with only one attribute at a time, and the dependencies among attributes are not considered. Unfortunately, in the real world, most datasets contain attributes that are dependent. Generally, these dependencies are classified into two types: categorical-type and numerical-type dependencies. Thus, it is very important to construct a model that discovers the dependencies among attributes and improves the accuracy of decision-tree learning algorithms. A neural network model is a good choice for handling these two types of dependencies. In this paper, we propose a Neural Decision Tree (NDT) model to deal with the problems described above. The NDT model combines neural network technologies and traditional decision-tree learning capabilities to handle complicated and real cases. The experimental results show that the NDT model can significantly improve the accuracy of C5.

Yue-Shi Lee, Show-Jane Yen
Approximate k-Closest-Pairs with Space Filling Curves

An approximate algorithm to efficiently solve the k-Closest-Pairs problem in high-dimensional spaces is presented. The method is based on dimensionality reduction of the space ℝ^d through the Hilbert space-filling curve and performs at most d+1 scans of the data set. After each scan, those points whose contribution to the solution has already been analyzed are eliminated from the data set. The pruning is lossless; in fact, the remaining points, along with the approximate solution found, can be used for the computation of the exact solution. Although we are able to guarantee an O(d^(1+1/t)) approximation to the solution, where t = 1,…,∞ denotes the L_t metric used, experimental results give the exact k-Closest-Pairs for all the data sets considered and show that the pruning of the search space is effective.

Fabrizio Angiulli, Clara Pizzuti
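
The following sketch only illustrates the space-filling-curve idea above: sort the points by a curve key and compare only points that are close in that one-dimensional order. A Morton (Z-order) key is used as a simple stand-in for the Hilbert curve of the paper, a single windowed scan replaces the d+1-scan scheme, and coordinates are assumed to lie in [0, 1], so this is a flavour of the approach rather than the authors' algorithm.

    # Rough illustration: approximate k closest pairs via a space-filling-curve order.
    import heapq, math, random

    def morton_key(point, bits=10):
        """Interleave the bits of the (normalized) coordinates."""
        scale = (1 << bits) - 1
        coords = [int(max(0.0, min(1.0, c)) * scale) for c in point]  # coords assumed in [0, 1]
        key = 0
        for b in range(bits):
            for c in coords:
                key = (key << 1) | ((c >> (bits - 1 - b)) & 1)
        return key

    def approx_k_closest_pairs(points, k=3, window=8):
        order = sorted(range(len(points)), key=lambda i: morton_key(points[i]))
        heap = []  # max-heap (by negated distance) of the k best pairs seen so far
        for pos, i in enumerate(order):
            for j in order[pos + 1: pos + 1 + window]:
                d = math.dist(points[i], points[j])
                heapq.heappush(heap, (-d, i, j))
                if len(heap) > k:
                    heapq.heappop(heap)
        return sorted((-d, i, j) for d, i, j in heap)

    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(300)]
    for d, i, j in approx_k_closest_pairs(pts, k=3):
        print(round(d, 4), i, j)
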
Optimal Dimension Order: A Generic Technique for the Similarity Join

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis, and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs whose distance does not exceed a given parameter ε. Although the similarity join is clearly CPU bound, most previous publications propose strategies that primarily improve the I/O performance, and only little effort has been devoted to CPU aspects. In this paper, we show that most of the computational overhead is dedicated to the final distance computations between the feature vectors. Consequently, we propose a generic technique to reduce the response time of a large number of basic algorithms for the similarity join. It is applicable to index-based join algorithms as well as to most join algorithms based on hashing or sorting. Our technique, called Optimal Dimension Order, is able to avoid and accelerate distance calculations between feature vectors by a careful ordering of the dimensions. The order is determined according to a probability model. In the experimental evaluation, we show that our technique yields high performance improvements for various underlying similarity join algorithms such as the R-tree similarity join, the breadth-first R-tree join, the Multipage Index Join, and the ε-Grid-Order.

Christian Böhm, Florian Krebs, Hans-Peter Kriegel
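
As a hedged sketch of the "order the dimensions, then abandon early" idea, the snippet below sorts the dimensions by a crude variance heuristic (the paper derives the order from a probability model) and evaluates the ε-join with early abandonment inside a plain nested-loop join; the index-, hash-, and sort-based join algorithms the technique is meant to plug into are omitted.

    # Hedged sketch of dimension ordering plus early abandonment for the epsilon join.
    import statistics

    def dimension_order(points):
        d = len(points[0])
        spread = [statistics.pvariance([p[i] for p in points]) for i in range(d)]
        # Most discriminating dimensions first: a large spread is likely to exceed eps early.
        return sorted(range(d), key=lambda i: -spread[i])

    def within_eps(p, q, eps_sq, order):
        acc = 0.0
        for i in order:
            acc += (p[i] - q[i]) ** 2
            if acc > eps_sq:      # early abandon: this pair cannot join
                return False
        return True

    def similarity_join(r, s, eps):
        order = dimension_order(r + s)
        eps_sq = eps * eps
        return [(i, j) for i, p in enumerate(r)
                       for j, q in enumerate(s)
                       if within_eps(p, q, eps_sq, order)]

    r = [(0.1, 5.0), (0.2, 1.0)]
    s = [(0.15, 5.1), (3.0, 3.0)]
    print(similarity_join(r, s, eps=0.5))   # -> [(0, 0)]
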
Fast Discovery of Sequential Patterns by Memory Indexing

Mining sequential patterns is an important but computationally complex task in discovering temporal patterns from sequences. Current mining approaches either require many database scans or generate several intermediate databases. As databases may now fit into ever-increasing main memory, efficient memory-based discovery of sequential patterns becomes possible. In this paper, we propose a memory indexing approach for fast sequential pattern mining, named MEMISP. During the whole process, MEMISP scans the sequence database only once to read data sequences into memory. The find-then-index technique recursively finds the items which constitute a frequent sequence and constructs a compact index set which indicates the set of data sequences for further exploration. Through effective index advancing, fewer and shorter data sequences need to be processed in MEMISP as the discovered patterns get longer. Moreover, the maximum size of the total memory required, which is independent of the minimum support threshold, can be estimated. The experiments indicate that MEMISP outperforms both the GSP and PrefixSpan algorithms. MEMISP also has good linear scalability, even with very low minimum support. When the database is too large to fit in memory in a batch, we partition the database, mine patterns in each partition, and validate the true patterns in a second pass of database scanning. Therefore, MEMISP can efficiently mine databases of any size, for any minimum support value.

Ming-Yen Lin, Suh-Yin Lee
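
The sketch below is a much simplified illustration of a find-then-index style recursion over in-memory sequences: an index set of (sequence id, position) pairs stands in for a projected database. Sequences are treated as plain lists of single items, and most MEMISP details (index advancing, memory estimation, partitioned mining) are omitted.

    # Simplified find-then-index style sequential pattern mining sketch.
    from collections import defaultdict

    def mine(sequences, min_support):
        patterns = []
        start = [(sid, -1) for sid in range(len(sequences))]  # index the whole database
        _grow([], start, sequences, min_support, patterns)
        return patterns

    def _grow(prefix, index, sequences, min_support, patterns):
        # Find items occurring after the indexed position in each sequence.
        occurrences = defaultdict(list)                 # item -> list of (sid, first position)
        for sid, pos in index:
            seen = {}
            for p in range(pos + 1, len(sequences[sid])):
                item = sequences[sid][p]
                if item not in seen:
                    seen[item] = p
            for item, p in seen.items():
                occurrences[item].append((sid, p))
        for item, occ in occurrences.items():
            if len(occ) >= min_support:                 # frequent extension found
                pattern = prefix + [item]
                patterns.append((pattern, len(occ)))
                _grow(pattern, occ, sequences, min_support, patterns)

    db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c", "c"]]
    print(mine(db, min_support=3))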

Data Cleansing

Dynamic Similarity for Fields with NULL Values

One of the most important tasks in data cleansing is to de-duplicate records, which requires comparing records to determine their equivalence. However, existing comparison methods, such as Record Similarity and Equational Theory, implicitly assume that the values in all fields are known; NULL values are treated as empty strings, which results in a loss of correct duplicate records. In this paper, we solve this problem by proposing a simple yet efficient method, Dynamic Similarity, which dynamically adjusts the similarity for fields with NULL values. Performance results on real and synthetic datasets show that the Dynamic Similarity method can identify more correct duplicate records without introducing more false positives, as compared with Record Similarity. Furthermore, the percentage of correct duplicate records obtained by Dynamic Similarity but not by Record Similarity increases as the number of fields with NULL values increases.

Li Zhao, Sung Sam Yuan, Qi Xiao Yang, Sun Peng
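
As one plausible reading of the idea above (not necessarily the paper's exact adjustment), the sketch below computes a weighted record similarity that skips NULL fields and renormalises the remaining weights instead of comparing NULLs against empty strings. Field names, weights, and the string similarity function are illustrative assumptions.

    # Hedged sketch: record similarity that adjusts dynamically for NULL fields.
    import difflib

    def field_sim(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    def record_sim(r1, r2, weights):
        num = den = 0.0
        for f, w in weights.items():
            v1, v2 = r1.get(f), r2.get(f)
            if v1 is None or v2 is None:   # NULL: skip the field and renormalise below
                continue
            num += w * field_sim(v1, v2)
            den += w
        return num / den if den else 0.0

    r1 = {"name": "John Smith", "city": "Boston", "phone": None}
    r2 = {"name": "Jon Smith",  "city": "Boston", "phone": "555-0100"}
    print(record_sim(r1, r2, {"name": 0.5, "city": 0.3, "phone": 0.2}))
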
Outlier Detection Using Replicator Neural Networks

We consider the problem of finding outliers in large multivariate databases. Outlier detection can be applied during the data cleansing process of data mining to identify problems with the data itself, and to fraud detection where groups of outliers are often of particular interest. We use replicator neural networks (RNNs) to provide a measure of the outlyingness of data records. The performance of the RNNs is assessed using a ranked score measure. The effectiveness of the RNNs for outlier detection is demonstrated on two publicly available databases.

Simon Hawkins, Hongxing He, Graham Williams, Rohan Baxter
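
As a minimal sketch of the replicator idea, the snippet below trains a small network to reproduce its input and scores each record by its reconstruction error; a single tanh hidden layer trained with plain gradient descent on synthetic data stands in for the multi-layer replicator neural networks and the ranked score measure of the paper.

    # Minimal replicator-style outlier scoring: reconstruction error per record.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, size=(200, 4))
    X[:3] += 6.0                                   # plant a few obvious outliers

    n, d, h, lr = X.shape[0], X.shape[1], 2, 0.05
    W1 = rng.normal(0, 0.1, (d, h)); b1 = np.zeros(h)
    W2 = rng.normal(0, 0.1, (h, d)); b2 = np.zeros(d)

    for _ in range(2000):
        H = np.tanh(X @ W1 + b1)                   # encode
        Y = H @ W2 + b2                            # replicate the input
        dY = (Y - X) / n                           # gradient of 0.5 * mean squared error
        dW2, db2 = H.T @ dY, dY.sum(axis=0)
        dZ = (dY @ W2.T) * (1 - H ** 2)
        dW1, db1 = X.T @ dZ, dZ.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    scores = ((X - (np.tanh(X @ W1 + b1) @ W2 + b2)) ** 2).mean(axis=1)
    print("highest-scoring records:", np.argsort(scores)[-3:])   # likely the planted records 0, 1, 2
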
The Closed Keys Base of Frequent Itemsets

In data mining, concise representations are useful and necessary for apprehending the voluminous results of data processing. Recently, many different concise representations of frequent itemsets have been investigated. In this paper, we present yet another concise representation of frequent itemsets, called the closed keys representation, with the following characteristics: (i) it allows one to determine whether an itemset is frequent, and if so, the support of the itemset is immediately available, and (ii) based on the closed keys representation, it is straightforward to determine all frequent key itemsets and all frequent closed itemsets. An efficient algorithm for computing the closed keys representation is offered. We show that our approach has many advantages over existing approaches, in terms of efficiency, conciseness, and information inference.

Viet Phan Luong
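
The snippet below naively computes, over a toy transaction set, two notions the abstract relies on: the closure of an itemset (the intersection of all transactions containing it, so an itemset is closed iff it equals its closure) and a key, i.e. an itemset all of whose proper subsets have strictly larger support. The paper's closed keys representation and its efficient algorithm are not reproduced here.

    # Naive illustration of closures and keys over a toy transaction set.
    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]

    def support(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    def closure(itemset):
        covering = [t for t in transactions if set(itemset) <= t]
        return set.intersection(*covering) if covering else set(itemset)

    def is_key(itemset):
        s = support(itemset)
        return all(support(sub) > s
                   for r in range(len(itemset))
                   for sub in combinations(itemset, r))

    print(closure({"b"}))            # -> {'a', 'b'}: {'b'} is not closed
    print(is_key({"a", "b"}))        # -> False: support({'b'}) == support({'a', 'b'})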

Applications

New Representation and Algorithm for Drawing RNA Structure with Pseudoknots*

Visualization of a complex molecular structure is a valuable tool for understanding the structure. A drawing of an RNA pseudoknot structure is a graph (possibly nonplanar) with inner cycles within a pseudoknot as well as possible outer cycles formed between a pseudoknot and other structural elements. Thus, drawing RNA pseudoknot structures is computationally more difficult than depicting RNA secondary structures. Although several algorithms have been developed for drawing RNA secondary structures, none of these can be used to draw RNA pseudoknots, and thus visualizing RNA pseudoknots relies on a significant amount of manual work. Visualizing RNA pseudoknots by manual work becomes more difficult and yields worse results as the size and complexity of the RNA structures increase. We have developed a new representation method and an algorithm for visualizing RNA pseudoknots as a two-dimensional drawing and implemented the algorithm in a program. The new representation produces uniform and clear drawings with no edge crossings for all kinds of pseudoknots, including H-type and other complex types. Given RNA structure data, we represent the whole structure as a tree rather than as a graph by hiding the inner cycles as well as the outer cycles in the nodes of the abstract tree. Once the top-level RNA structure is represented as a tree, the nodes of the tree are placed and drawn in increasing order of their depth values. Experimental results demonstrate that the algorithm generates a clear and aesthetically pleasing drawing of large-scale RNA structures containing any number of pseudoknots. This is the first algorithm for automatically drawing RNA structures with pseudoknots.

Yujin Lee, Wootaek Kim, Kyungsook Han
Boosting Naive Bayes for Claim Fraud Diagnosis

In this paper we apply the weight of evidence reformulation of AdaBoosted naive Bayes scoring due to Ridgeway et al. (1998) for the diagnosis of insurance claim fraud. The method effectively combines the advantages of boosting and the modelling power and representational attractiveness of the probabilistic weight of evidence scoring framework. We present the results of an experimental comparison with an emphasis on both discriminatory power and calibration of probability estimates. The data on which we evaluate the method consists of a representative set of closed personal injury protection automobile insurance claims from accidents that occurred in Massachusetts during 1993. The findings of the study reveal the method to be a valuable contribution to the design of effective, intelligible, accountable, and efficient fraud detection support.

Stijn Viaene, Richard Derrig, Guido Dedene
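
As a hedged illustration of weight-of-evidence scoring with naive Bayes (the boosting layer of the paper is omitted), the sketch below scores a claim by its prior log odds of fraud plus, for each attribute value, the log-likelihood ratio of that value under fraud versus non-fraud, with Laplace smoothing. The tiny claim data set and attribute names are invented.

    # Hedged sketch of weight-of-evidence naive Bayes scoring on invented claims.
    import math
    from collections import Counter

    claims = [  # (attributes, is_fraud)
        ({"injury": "soft_tissue", "lawyer": "yes"}, 1),
        ({"injury": "soft_tissue", "lawyer": "yes"}, 1),
        ({"injury": "fracture",    "lawyer": "no"},  0),
        ({"injury": "fracture",    "lawyer": "yes"}, 0),
        ({"injury": "soft_tissue", "lawyer": "no"},  0),
    ]

    def cond_prob(attr, value, label, alpha=1.0):
        """Laplace-smoothed P(attr = value | label)."""
        rows = [c for c, y in claims if y == label]
        counts = Counter(c[attr] for c in rows)
        vocab = {c[attr] for c, _ in claims}
        return (counts[value] + alpha) / (len(rows) + alpha * len(vocab))

    def weight_of_evidence_score(claim):
        n_fraud = sum(y for _, y in claims)
        prior = math.log(n_fraud / (len(claims) - n_fraud))
        woe = sum(math.log(cond_prob(a, v, 1) / cond_prob(a, v, 0))
                  for a, v in claim.items())
        return prior + woe           # log posterior odds of fraud

    print(weight_of_evidence_score({"injury": "soft_tissue", "lawyer": "yes"}))
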
Optimization of Association Word Knowledge Base through Genetic Algorithm

Query expansion based on a knowledge base in an information retrieval system requires the knowledge base to take semantic relations between words into account. Since the Apriori algorithm extracts association words without taking user preference into account, recall is improved but accuracy is reduced. This paper shows how to establish an optimized association word knowledge base with improved accuracy, including only those association words that users prefer among the association words reflecting semantic relations between words. Toward this end, web documents related to computers are classified into eight classes, and nouns are extracted from the web documents of each class. Association words are extracted from these nouns through the Apriori algorithm, and association words that users do not favor are excluded from the knowledge base through a genetic algorithm.

Su-Jeong Ko, Jung-Hyun Lee
Mining Temporal Patterns from Health Care Data*

This paper describes temporal data mining techniques for extracting information from temporal health records consisting of a time series of elderly diabetic patients’ tests. We propose a data mining procedure to analyse these time sequences in three steps to identify patterns from any longitudinal data set. The first step is a structure-based search using wavelets to find pattern structures. The second step employs a value-based search over the discovered patterns using the statistical distribution of data values. The third step combines the results from the first two steps to form a hybrid model. The hybrid model has the expressive power of both wavelet analysis and the statistical distribution of the values. Global patterns are therefore identified.

Weiqiang Lin, Mehmet A. Orgun, Graham J. Williams

Data Warehouse Design

Adding a Performance-Oriented Perspective to Data Warehouse Design

Data warehouse design is clearly dominated by the business perspective, and quite often data warehouse administrators are led to data models with little room for performance improvement. However, the increasing demand for interactive response times from users makes query performance one of the central problems of data warehousing today. In this paper we argue that data warehouse design must take into account both the business and the performance perspective from the beginning, and we propose extending typical design methodologies to include performance concerns in the early design steps. Specific analysis of the predicted data warehouse usage profile and of meta-data is proposed as a new input for improving the transition from the logical to the physical schema. The proposed approach is illustrated and discussed using the TPC-H performance benchmark, and it is shown that significant performance improvements can be achieved without jeopardizing the business view required for data warehouse models.

Pedro Bizarro, Henrique Madeira
Cost Modeling and Estimation for OLAP-XML Federations

The ever-changing data requirements of today’s dynamic businesses are not handled well by current OLAP systems. Physical integration of data into OLAP systems is a time-consuming process, making logical federations the better choice in many cases. The increasing use of XML suggests that the required data will often be available in XML format. Thus, federations of OLAP and XML databases will be very attractive in many situations. In an efficient implementation of OLAP-XML federations, cost-based optimization is a must, creating a need for an effective cost model for OLAP-XML federations. In this paper we present a cost model for OLAP-XML federations, and outline techniques for estimating the cost model parameters in a federated OLAP-XML environment. The paper also outlines the cost models for the OLAP and XML components in the federation on which the federation cost model is built. The cost model has been used as the basis for effective cost-based query optimization in OLAP-XML federations. Experiments show that the cost model is precise enough to make a substantial difference in the query optimization process.

Dennis Pedersen, Karsten Riis, Torben Bach Pedersen
Constraint-Free Join Processing on Hyperlinked Web Data

In this paper, we introduce the concept of the web join for combining hyperlinked Web data. A web join is one of the web algebraic operators in our web warehousing system called Whoweda (WareHouse Of WEb DAta). It can be used to gather useful, composite information from two web tables. Web join and outer web join (a derivative of the web join) operations can be used to detect and represent changes to hyperlinked Web data. We discuss the syntax, semantics, and algorithms of the web join and outer web join operators. We also present how to detect and represent changes to hyperlinked Web data using these two operations.

Sourav S. Bhowmick, Wee Keong Ng, Sanjay Madria, Mukesh Mohania
Focusing on Data Distribution in the WebD2W System

The WebD2W system is a distributed client-server data warehousing environment, which is aimed not only at data warehouse distribution, but also at distributed access to these data using Web technology as an infrastructure. In this paper, we introduce the WebD2W system, focusing on one of its main objectives: data warehouse distribution. The system is presented in terms of its main components and their respective functionalities. The paper also describes the algorithm for horizontally fragmenting the warehouse data, which is used as a basis for the WebD2W system.

Cristina Dutra de Aguiar Ciferri, Fernando da Fonseca de Souza

OLAP

A Decathlon in Multidimensional Modeling: Open Issues and Some Solutions

The concept of multidimensional modeling has proven extremely successful in the area of Online Analytical Processing (OLAP), one of many applications running on top of a data warehouse installation. Although many different modeling techniques expressed in extended multidimensional data models have been proposed in the recent past, we feel that many hot issues are not properly reflected. In this paper we address ten common problems, ranging from defects within dimensional structures, over multidimensional structures, to new analytical requirements and more.

W. Hümmer, W. Lehner, A. Bauer, L. Schlesinger
Modeling and Imputation of Large Incomplete Multidimensional Datasets

The presence of missing or incomplete data is commonplace in large real-world databases. In this paper, we study the problem of missing values that occur in the measure dimension of a data cube. We propose a two-part mixture model, which combines a logistic model and a loglinear model, to predict and impute the missing values. The logistic model is applied to predict missing positions, while the loglinear model is applied to compute the estimates. Experimental results on real datasets and synthetic datasets are presented.

Xintao Wu, Daniel Barbará
PartJoin: An Efficient Storage and Query Execution for Data Warehouses

The performance of OLAP queries can be improved drastically if the warehouse data is properly selected and indexed. The problems of selecting and materializing views and indexing data have been studied extensively in the data warehousing environment. On the other hand, data partitioning can also greatly increase the performance of queries. Data partitioning has an advantage over data selection and indexing, since it does not require additional storage. In this paper, we show that it is beneficial to integrate the data partitioning and indexing (join indexes) techniques for improving the performance of data warehousing queries. We present a data warehouse tuning strategy, called PartJoin, that decomposes the fact and dimension tables of a star schema and then selects join indexes. This solution takes advantage of both techniques, i.e., data partitioning and indexing. Finally, we present the results of an experimental evaluation that demonstrates the effectiveness of our strategy in reducing the query processing cost and providing an economical utilisation of the storage space.

Ladjel Bellatreche, Michel Schneider, Mukesh Mohania, Bharat Bhargava

Data Warehouse Maintenance

A Transactional Approach to Parallel Data Warehouse Maintenance

Data warehousing is becoming an increasingly important technology for information integration and data analysis. Given the dynamic nature of modern distributed environments, both source data and schema changes are likely to occur autonomously and even concurrently in different sources. We have thus developed a comprehensive solution approach, called TxnWrap, that successfully maintains the warehouse views under any type of concurrent source updates. In this work, we overcome TxnWrap’s restriction that maintenance is processed one by one for each source update, which limits performance. To overcome this limitation, we exploit the transactional approach of TxnWrap to achieve parallel data warehouse maintenance. For this, we first identify the read/write conflicts among the different warehouse maintenance processes. We then propose a parallel maintenance scheduler (PMS) that generates legal schedules resolving these conflicts. PMS has been implemented and incorporated into our TxnWrap system. The experimental results confirm that our parallel maintenance scheduler significantly improves the performance of data warehouse maintenance.

Bin Liu, Songting Chen, Elke A. Rundensteiner
Striving towards Near Real-Time Data Integration for Data Warehouses

The amount of information available to large-scale enterprises is growing rapidly. While operational systems are designed to meet well-specified (short) response time requirements, the focus of data warehouses is generally the strategic analysis of business data integrated from heterogeneous source systems. The decision making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source system to the data warehouse in time. A real-time data warehouse aims at decreasing the time it takes to make business decisions and tries to attain zero latency between the cause and effect of a business decision. In this paper we present an architecture of an ETL environment for real-time data warehouses, which supports continual near real-time data propagation. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near real-time ETL environment. Instead of using vendor-proprietary ETL (extraction, transformation, loading) solutions, which are often hard to scale and often do not support an optimization of allocated time frames for data extracts, we propose in our approach ETLets (pronounced “et-lets”) and Enterprise Java Beans (EJB) for the ETL processing tasks.

Robert M. Bruckner, Beate List, Josef Schiefer
Time-Interval Sampling for Improved Estimations in Data Warehouses

In large data warehouses it is possible to return very fast approximate answers to user queries using pre-computed sampling summaries well suited to all types of exploratory analysis. However, their usage is constrained by the fact that there must be a representative number of samples in grouping intervals to yield acceptable accuracy. In this paper we propose and evaluate a technique that deals with this representativeness issue by using time interval-biased stratified samples (TISS). The technique is able to deliver fast, accurate analysis to the user by taking advantage of the importance of the time dimension in most user analyses. It is designed as a transparent middle layer, which analyzes and rewrites the query to use a summary instead of the base data warehouse. The estimations and error bounds returned using the technique are compared to those of traditional sampling summaries, showing that it achieves significant improvements in accuracy.

Pedro Furtado, João Pedro Costa
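
As a rough sketch of time-interval-biased stratified sampling (the allocation rule here is an assumption, not the paper's TISS design), the snippet below partitions fact rows by month, draws a per-month sample whose size grows for recent months, and answers an aggregate query by scaling each stratum's sample mean by the stratum size.

    # Hedged sketch of time-interval stratified sampling for approximate aggregates.
    import random
    random.seed(1)

    # fact rows: (month, amount); synthetic data for illustration only
    facts = [(m, random.gauss(100 + m, 10)) for m in range(1, 13) for _ in range(1000)]

    def build_summary(facts, budget=1200):
        strata = {}
        for m, amount in facts:
            strata.setdefault(m, []).append(amount)
        months = sorted(strata)
        total_w = sum(range(1, len(months) + 1))   # bias the budget towards recent months
        summary = {}
        for w, m in enumerate(months, start=1):
            k = max(1, round(budget * w / total_w))
            rows = strata[m]
            summary[m] = (len(rows), random.sample(rows, min(k, len(rows))))
        return summary

    def estimate_sum(summary, months):
        # Stratified estimator: stratum size times stratum sample mean.
        return sum(n * (sum(sample) / len(sample))
                   for m, (n, sample) in summary.items() if m in months)

    summary = build_summary(facts)
    print("estimated Q4 total:", round(estimate_sum(summary, {10, 11, 12})))
    print("exact Q4 total:    ", round(sum(a for m, a in facts if m >= 10)))
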
Backmatter
Metadata
Title
Data Warehousing and Knowledge Discovery
Edited by
Yahiko Kambayashi
Werner Winiwarter
Masatoshi Arikawa
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-46145-6
Print ISBN
978-3-540-44123-6
DOI
https://doi.org/10.1007/3-540-46145-0