
2003 | Book

Data Warehousing and Knowledge Discovery

5th International Conference, DaWaK 2003, Prague, Czech Republic, September 3-5, 2003. Proceedings

Edited by: Yahiko Kambayashi, Mukesh Mohania, Wolfram Wöß

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 5th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2003, held in Prague, Czech Republic in September 2003.

The 41 revised full papers presented were carefully reviewed and selected from more than 130 submissions. The papers are organized in topical sections on data cubes and queries, multidimensional data models, Web warehousing, change detection, Web mining and association rules, association rules and decision trees, clustering, association rule mining, data analysis and discovery, ontologies and improving data quality, queries and data patterns, improving database query engines, and sampling and vector classification.

Table of contents

Frontmatter

Invited Talk

XML for Data Warehousing Chances and Challenges

The prospects of XML for data warehousing are staggering. As a primary purpose of data warehouses is to store non-operational data in the long term, i.e., to exchange them over time, the key reasons for the overwhelming success of XML as an exchange format also hold for data warehouses.

Peter Fankhauser, Thomas Klement

Data Cubes and Queries

CPM: A Cube Presentation Model for OLAP

On-Line Analytical Processing (OLAP) is a trend in database technology, based on the multidimensional view of data. In this paper we introduce the Cube Presentation Model (CPM), a presentational model for OLAP data which, to the best of our knowledge, is the only formal presentational model for OLAP found in the literature until today. First, our proposal extends a previous logical model for cubes, to handle more complex cases. Then, we present a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space. Moreover, we show how the logical and the presentational models are integrated smoothly. Finally, we describe how typical OLAP operations can be easily mapped to the CPM.

Andreas Maniatis, Panos Vassiliadis, Spiros Skiadopoulos, Yannis Vassiliou
Computation of Sparse Data Cubes with Constraints

For a data cube there are always constraints between dimensions or between attributes in a dimension, such as functional dependencies. We introduce the problem of how to use functional dependencies, when they exist, to speed up the computation of sparse data cubes, and present a new algorithm, CFD, to meet this demand. CFD determines the order of dimensions by jointly considering their cardinalities and the functional dependencies between them. It makes dimensions with functional dependencies adjacent and lets their codes satisfy a monotonic mapping, thus reducing the number of partitions for such dimensions. It also combines bottom-up partitioning with top-down aggregate computation to speed up the computation further. In addition, CFD can efficiently compute a data cube with hierarchies from the smallest granularity to the coarsest one, with at most one attribute per dimension taking part in the computation each time. Experiments show that CFD yields a significant performance improvement.

Changqing Chen, Jianlin Feng, Longgang Xiang
Answering Joint Queries from Multiple Aggregate OLAP Databases

Given an OLAP query expressed over multiple source OLAP databases, we study the problem of evaluating the result OLAP target database. The problem arises when it is not possible to derive the result from a single database. The method we use is the linear indirect estimator, commonly used for statistical estimation. We examine two obvious computational methods for computing such a target database, called the “Full-cross-product” (F) and the “Pre-aggregation” (P) methods. We study the accuracy and computational complexity of these methods. While the method F provides a more accurate estimate, it is more expensive computationally than P. Our contribution is in proposing a third new method, called the “Partial-Pre-aggregation” method (PP), which is significantly less expensive than F, but is just as accurate.

Elaheh Pourabbas, Arie Shoshani
An Approach to Enabling Spatial OLAP by Aggregating on Spatial Hierarchy

Investigation shows that a huge amount of spatial data exists in current business databases. Traditional data warehousing and OLAP, however, cannot exploit this spatial information to gain deeper insight into the business data for decision making. In this paper, we propose a novel approach to enabling spatial OLAP by aggregating on the spatial hierarchy. A spatial index mechanism is employed to derive the spatial hierarchy for pre-aggregation and materialization, which in turn are leveraged by the OLAP system to efficiently answer spatial OLAP queries. Our prototype system shows that the proposed approach can be integrated easily into existing data warehouse and OLAP systems to support spatial analysis. Preliminary experimental results are also presented.

Long Zhang, Ying Li, Fangyan Rao, Xiulan Yu, Ying Chen, Dong Liu

Multidimensional Data Model

A Multidimensional Aggregation Object (MAO) Framework for Computing Distributive Aggregations

Multidimensional aggregation plays an important role in decision-making systems. A conceptual Multidimensional Aggregation Object (MAO), which consists of measures, scopes and an aggregation function, is introduced to represent relationships among aggregators on addressable subsets of data. In the MAO model, aggregations of low-level (intermediate) data can be reused for aggregations on high-level data along the same dimension. Experimental results show that caching intermediate aggregated data can significantly improve performance. Incremental compensating and full recomputing cache-updating approaches are proposed. Execution plans for deriving the aggregations from MAOs are presented. The proposed data aggregation technique can be applied to data warehousing, OLAP, and data mining tasks.

Meng-Feng Tsai, Wesley Chu
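The reuse of lower-level aggregates described above relies on aggregation functions such as SUM being distributive. As a hedged illustration of that general principle only (not of the MAO framework itself), the following Python sketch rolls hypothetical daily totals up to monthly and then yearly totals, reusing the cached monthly sums instead of rescanning the base data.

    from collections import defaultdict

    # Hypothetical daily sales facts: (year, month, day) -> amount.
    daily = {
        (2003, 1, 1): 120.0, (2003, 1, 2): 80.0,
        (2003, 2, 1): 200.0, (2004, 1, 1): 50.0,
    }

    # Aggregate once at the intermediate level kept in the cache ...
    monthly = defaultdict(float)
    for (year, month, _day), amount in daily.items():
        monthly[(year, month)] += amount

    # ... then reuse the cached monthly sums for the coarser level,
    # instead of rescanning the daily data (SUM is distributive).
    yearly = defaultdict(float)
    for (year, _month), amount in monthly.items():
        yearly[year] += amount

    print(dict(monthly))  # {(2003, 1): 200.0, (2003, 2): 200.0, (2004, 1): 50.0}
    print(dict(yearly))   # {2003: 400.0, 2004: 50.0}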
The GMD Data Model for Multidimensional Information: A Brief Introduction

In this paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed by Cabibbo et al. (EDBT-98). The aim of this work is not to propose yet another multidimensional data model, but to find a general, precise formalism encompassing all the proposals for a logical data model in the data warehouse field. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the differences between the models in the literature and the study of formal properties or extensions of such data models. Starting with a logic-based definition of the semantics of the GMD data model and of the basic algebraic operations over it, we show how the most important approaches in DW modelling can be captured by it. The star and snowflake schemas, Gray's cube, Agrawal's and Vassiliadis' models, MD, and other multidimensional conceptual data models can be captured uniformly by GMD. In this way it is possible to formally understand the real differences in expressivity of the various models, their limits, and their potential.

Enrico Franconi, Anand Kamble
An Application of Case-Based Reasoning in Multidimensional Database Architecture

A concept of a decision support system is considered in this paper. It provides all levels of management with the data needed for fast, precise and good business decision making. The aim of the project is the development of a new online analytical processing approach oriented towards case-based reasoning (CBR), where previous experience is taken into account for every new problem. Methodological aspects have been tested in practice as part of the management information system development project of "Novi Sad Fair". A case study of an application of CBR to the prediction of future payments is discussed in the paper.

Dragan Simić, Vladimir Kurbalija, Zoran Budimac

Web Warehousing

MetaCube XTM: A Multidimensional Metadata Approach for Semantic Web Warehousing Systems

Providing access and search across multiple, heterogeneous, distributed and autonomous data warehouses has become one of the main issues in current research. In this paper, we propose to integrate data warehouse schema information by using metadata represented in XTM (XML Topic Maps) to bridge possible semantic heterogeneity. A detailed description of an architecture that enables the efficient processing of user queries involving data from heterogeneous warehouses is presented. As a result, interoperability is accomplished by a schema integration approach based on XTM. Furthermore, important implementation aspects of the MetaCube-XTM prototype, which makes use of the Meta Data Interchange Specification (MDIS) and the Open Information Model, complete the presentation of our approach.

Thanh Binh Nguyen, A Min Tjoa, Oscar Mangisengi
Designing Web Warehouses from XML Schemas

Web warehousing plays a key role in providing managers with up-to-date and comprehensive information about their business domain. On the other hand, since XML is now a de facto standard for the exchange of semi-structured data, integrating XML data into web warehouses is a hot topic. In this paper we propose a semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas. In the proposed methodology, design is carried out by first creating a schema graph, then navigating its arcs in order to derive a correct multidimensional representation. Differently from previous approaches in the literature, particular relevance is given to the problems of detecting shared hierarchies and convergences of dependencies, and of modeling many-to-many relationships. The approach is implemented in a prototype that reads an XML Schema and outputs the logical schema of the warehouse.

Boris Vrdoljak, Marko Banek, Stefano Rizzi
Building XML Data Warehouse Based on Frequent Patterns in User Queries

With the proliferation of XML-based data sources available across the Internet, it is increasingly important to provide users with a data warehouse of XML data sources to facilitate decision-making processes. Due to the extremely large amount of XML data available on the Web, unguided warehousing of XML data turns out to be highly costly and usually cannot accommodate users' needs in XML data acquirement well. In this paper, we propose an approach to materialize XML data warehouses based on frequent query patterns discovered from historical queries issued by users. The schemas of integrated XML documents in the warehouse are built using these frequent query patterns, represented as Frequent Query Pattern Trees (FreqQPTs). Using a hierarchical clustering technique, the integration approach in the data warehouse is flexible with respect to obtaining and maintaining XML documents. Experiments show that the overall processing of the same queries issued against the global schema becomes much more efficient by using the XML data warehouse built than by directly searching the multiple data sources.

Ji Zhang, Tok Wang Ling, Robert M. Bruckner, A Min Tjoa

Change Detection

A Temporal Study of Data Sources to Load a Corporate Data Warehouse

The input data of the corporate data warehouse is provided by the data sources that are integrated. In the temporal database research area, a bitemporal database is a database supporting valid time and transaction time. Valid time is the time when the fact is true in the modeled reality, while transaction time is the time when the fact is stored in the database. Defining a data warehouse as a bitemporal database containing integrated and subject-oriented data in support of the decision-making process, transaction time in the data warehouse can always be obtained, because it is internal to a given storage system. When an event is loaded into the data warehouse, its valid time is transformed into a bitemporal element by adding transaction time, generated by the database management system of the data warehouse. However, depending on whether or not the data sources manage transaction time and valid time, we may or may not be able to obtain the valid time for the data warehouse. The aim of this paper is to present a temporal study of the different kinds of data sources used to load a corporate data warehouse, using a bitemporal storage structure.

Carme Martin, Alberto Abelló
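To make the valid-time/transaction-time distinction concrete, here is a minimal, hypothetical Python sketch of a bitemporal record as described above; the field names are illustrative and do not reflect the storage structure used in the paper.

    from dataclasses import dataclass
    from datetime import date, datetime

    @dataclass
    class BitemporalFact:
        key: str
        value: float
        valid_from: date   # when the fact is true in the modeled reality
        valid_to: date
        tx_time: datetime  # when the fact is stored in the warehouse

    # Valid time must come from the source (if the source records it);
    # transaction time is always available, generated at load time.
    fact = BitemporalFact(
        key="customer-42/credit-limit",
        value=5000.0,
        valid_from=date(2003, 1, 1),
        valid_to=date(2003, 6, 30),
        tx_time=datetime.now(),  # assigned by the warehouse DBMS at load
    )
    print(fact)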
Automatic Detection of Structural Changes in Data Warehouses

Data warehouses provide sophisticated tools for analyzing complex data online, in particular by aggregating data along dimensions spanned by master data. Changes to these master data are a frequent threat to the correctness of OLAP results, in particular for multi-period data analysis, trend calculations, etc. As dimension data might change in underlying data sources without the data warehouse being notified, we explore the application of data mining techniques for detecting such changes and thus contribute to avoiding incorrect results of OLAP queries.

Johann Eder, Christian Koncilia, Dieter Mitsche
Performance Tests in Data Warehousing ETLM Process for Detection of Changes in Data Origin

In a data warehouse (DW) environment, when the operational environment does not possess or does not want to provide information about the changes that occurred to its data, controls have to be implemented to enable detection of these changes and to reflect them in the DW environment. The main scenarios are: i) the impossibility of instrumenting the DBMS (triggers, transaction log, stored procedures, replication, materialized views, old and new versions of data, etc.) due to security policies, data ownership or performance issues; ii) the lack of instrumentation resources in the DBMS; iii) the use of legacy technologies such as file systems or semi-structured data; iv) application-proprietary databases and ERP systems. In another article [1], we presented the development and implementation of a technique derived for the comparison of database snapshots, in which we use signatures to mark and detect changes. The technique is simple and can be applied to all four scenarios above. To prove the efficiency of our technique, in this article we carry out comparative performance tests between these approaches. We performed two benchmarks: the first using synthetic data and the second using real data from a case study in the data warehouse project developed for Rio Sul Airlines, a regional aviation company belonging to the Brazil-based Varig group. We also describe the main approaches to solving the detection of changes in data origin.

Rosana L. A. Rocha, Leonardo Figueiredo Cardoso, Jano Moreira de Souza

Web Mining and Association Rule

Recent Developments in Web Usage Mining Research

Web Usage Mining is that area of Web Mining which deals with the extraction of interesting knowledge from logging information produced by web servers. In this paper, we present a survey of the recent developments in this area that is receiving increasing attention from the Data Mining community.

Federico Michele Facca, Pier Luca Lanzi
Parallel Vector Computing Technique for Discovering Communities on the Very Large Scale Web Graph

The study of authoritative pages and community discovery from enormous Web contents has attracted many researchers. One of the link-based analyses, the HITS algorithm, calculates authority scores as the eigenvector of an adjacency matrix created from the Web graph. It was considered impossible to compute this eigenvector for a very large-scale Web graph using previous techniques, because the calculation requires enormous memory space. We make it possible using data compression and parallel computation.

Kikuko Kawase, Minoru Kawahara, Takeshi Iwashita, Hiroyuki Kawano, Masanori Kawazawa
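For readers unfamiliar with HITS, the following minimal Python sketch computes authority and hub scores by power iteration on a tiny adjacency matrix. It only illustrates the eigenvector computation the abstract refers to, not the authors' compressed, parallel implementation; the graph is made up.

    import numpy as np

    # Tiny directed Web graph: A[i, j] = 1 if page i links to page j.
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)

    authority = np.ones(A.shape[0])
    hub = np.ones(A.shape[0])

    # Power iteration: authority converges to the dominant eigenvector
    # of A^T A, hub to the dominant eigenvector of A A^T.
    for _ in range(100):
        authority = A.T @ hub
        hub = A @ authority
        authority /= np.linalg.norm(authority)
        hub /= np.linalg.norm(hub)

    print("authority scores:", np.round(authority, 3))
    print("hub scores:      ", np.round(hub, 3))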

Association Rules and Decision Trees

Ordinal Association Rules towards Association Rules

Intensity of inclination, an objective rule-interest measure, allows us to extract implications from databases without having to go through the step of transforming the initial set of attributes into binary attributes, thereby avoiding a prohibitive number of rules of little significance and with many redundancies. This new kind of rule, ordinal association rules, reveals the overall behavior of the population, and it is vital that this study be extended by exploring specific ordinal association rules in order to refine our analysis and to extract behaviors of sub-sets. This paper focuses on the mining of association rules based on extracted ordinal association rules in order to, on the one hand, remove the discretization step for numeric attributes and the step of complete disjunctive coding, and, on the other hand, obtain a variable discretization of numeric attributes, i.e. one dependent on the association of attributes. The study ends with an evaluation on an application to some banking data.

Sylvie Guillaume
Rough Set Based Decision Tree Model for Classification

The decision tree, a commonly used classification model, is constructed recursively following a top-down approach (from general concepts to particular examples) by repeatedly splitting the training data set. ID3 is a greedy algorithm that considers one attribute at a time for splitting at a node. In C4.5, all attributes, barring the nominal attributes used at the parent nodes, are retained for further computation. This leads to extra overheads of memory and computational effort. Rough Set theory (RS) simplifies the search for dominant attributes in information systems. In this paper, a Rough set based Decision Tree (RDT) model, combining RS tools with classical DT capabilities, is proposed to address the issue of computational overheads. The experiments compare the performance of RDT with the RS approach and the ID3 algorithm. RDT is observed to perform better than the RS approach in accuracy and rule complexity, while RDT and ID3 are comparable.

Sonajharia Minz, Rajni Jain
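Since the abstract positions RDT against ID3, a short reminder of ID3's splitting criterion may help. The sketch below computes information gain on a made-up four-row data set; it illustrates plain ID3 only, not the proposed rough-set-based model.

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(rows, attr, target):
        """ID3's splitting criterion: entropy reduction from splitting on attr."""
        base = entropy([row[target] for row in rows])
        remainder = 0.0
        for value in {row[attr] for row in rows}:
            subset = [row[target] for row in rows if row[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    # Hypothetical toy training data.
    rows = [
        {"outlook": "sunny", "windy": "no",  "play": "no"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rain",  "windy": "no",  "play": "yes"},
        {"outlook": "rain",  "windy": "yes", "play": "yes"},
    ]
    print(information_gain(rows, "outlook", "play"))  # 1.0 (perfect split)
    print(information_gain(rows, "windy", "play"))    # 0.0 (uninformative)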
Inference Based Classifier: Efficient Construction of Decision Trees for Sparse Categorical Attributes

Classification is an important problem in data mining and machine learning, and the decision tree approach has been identified as an efficient means for classification. According to our observation on real data, the distribution of attributes with respect to information gain is very sparse, because only a few attributes are major discriminating attributes, where a discriminating attribute is an attribute by whose value we are likely to distinguish one tuple from another. In this paper, we propose an efficient decision tree classifier for categorical attributes with sparse distribution. In essence, the proposed Inference Based Classifier (abbreviated as IBC) can alleviate the "overfitting" problem of conventional decision tree classifiers. Also, IBC has the advantage of deciding the splitting number automatically based on the generated partitions. IBC is empirically compared to C4.5, SLIQ and K-means based classifiers. The experimental results show that IBC significantly outperforms the companion methods in execution efficiency for datasets with categorical attributes of sparse distribution while attaining approximately the same classification accuracies. Consequently, IBC is considered an accurate and efficient classifier for sparse categorical attributes.

Shih-Hsiang Lo, Jian-Chih Ou, Ming-Syan Chen
Generating Effective Classifiers with Supervised Learning of Genetic Programming

A new approach to learning classifiers using genetic programming has been developed recently. Most previous research generates classification rules to classify data. However, the generation of rules is time consuming and the recognition accuracy is limited. In this paper, an approach to learning classification functions by genetic programming is proposed for classification. Since a classification function deals with numerical attributes only, the proposed scheme first transforms the nominal data into numerical values by rough membership functions. Then, the learning technique of genetic programming is used to generate classification functions. For the purpose of improving the accuracy of classification, we propose an adaptive interval fitness function. Combining the learned classification functions with training samples, an effective classification method is presented. A number of data sets selected from the UCI Machine Learning Repository are used to show the effectiveness of the proposed method and to compare it with other classifiers.

Been-Chian Chien, Jui-Hsiang Yang, Wen-Yang Lin

Clustering I

Clustering by Regression Analysis

In data clustering, many approaches have been proposed, such as the K-means method and hierarchical methods. One of the problems is that the results depend heavily on initial values and on the criterion for combining clusters. In this investigation, we propose a new method to avoid this deficiency. We assume that there exist aspects of local regression in the data. We then develop our theory to combine clusters using $\mathcal{F}$ values from regression analysis as the criterion. We report experiments and show how well the theory works.

Masahiro Motoyoshi, Takao Miura, Isamu Shioya
Handling Large Workloads by Profiling and Clustering

View materialization is recognized to be one of the most effective ways to increase the Data Warehouse performance; nevertheless, due to the computational complexity of the techniques aimed at choosing the best set of views to be materialized, this task is mainly carried out manually when large workloads are involved. In this paper we propose a set of statistical indicators that can be used by the designer to characterize the workload of the Data Warehouse, thus driving the logical and physical optimization tasks; furthermore we propose a clustering algorithm that allows the cardinality of the workload to be reduced and uses these indicators for measuring the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled.

Matteo Golfarelli
Incremental OPTICS: Efficient Computation of Updates in a Hierarchical Cluster Ordering

Data warehouses are a challenging field of application for data mining tasks such as clustering. Usually, updates are collected and applied to the data warehouse periodically in a batch mode. As a consequence, all mined patterns discovered in the data warehouse (e.g. clustering structures) have to be updated as well. In this paper, we present a method for incrementally updating the clustering structure computed by the hierarchical clustering algorithm OPTICS. We determine the parts of the cluster ordering that are affected by update operations and develop efficient algorithms that incrementally update an existing cluster ordering. A performance evaluation of incremental OPTICS based on synthetic datasets as well as on a real-world dataset demonstrates that incremental OPTICS gains significant speed-up factors over OPTICS for update operations.

Hans-Peter Kriegel, Peer Kröger, Irina Gotlibovich

Clustering II

On Complementarity of Cluster and Outlier Detection Schemes

We are interested in the problem of outlier detection, which is the discovery of data that deviate a lot from other data patterns. Hawkins [7] characterizes an outlier in a quite intuitive way as follows: An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

Zhixiang Chen, Ada Wai-Chee Fu, Jian Tang
Cluster Validity Using Support Vector Machines

Gaining confidence that a clustering algorithm has produced meaningful results and not an accident of its usually heuristic optimization is central to data analysis. This is the issue of validity and we propose here a method by which Support Vector Machines are used to evaluate the separation in the clustering results. However, we not only obtain a method to compare clustering results from different algorithms or different runs of the same algorithm, but we can also filter noise and outliers. Thus, for a fixed data set we can identify what is the most robust and potentially meaningful clustering result. A set of experiments illustrates the steps of our approach.

Vladimir Estivill-Castro, Jianhua Yang
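A minimal sketch of the underlying idea, assuming scikit-learn and NumPy are available: fit a linear SVM on the labels a clustering run produced and read off the geometric margin as a rough separation score. This only illustrates the principle; it is not the validity procedure proposed in the paper, and the data are synthetic.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Two synthetic clusters and the labels a clustering algorithm assigned.
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
    labels = np.array([0] * 50 + [1] * 50)

    # Fit a linear SVM between the two clusters; for a linear kernel the
    # geometric margin is 2 / ||w||, so larger values indicate better separation.
    clf = SVC(kernel="linear", C=1.0).fit(X, labels)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"separation margin: {margin:.3f}")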
FSSM: Fast Construction of the Optimized Segment Support Map

Computing the frequency of a pattern is one of the key operations in data mining algorithms. Recently, the Optimized Segment Support Map (OSSM) was introduced as a simple but powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. However, the construction cost to obtain the ideal OSSM is high, and makes it less attractive in practice. In this paper, we propose the FSSM, a novel algorithm that constructs the OSSM quickly using a FP-Tree. Given a user-defined segment size, the FSSM is able to construct the OSSM at a fraction of the time required by the algorithm previously proposed. More importantly, this fast construction time is achieved without compromising the quality of the OSSM. Our experimental results confirm that the FSSM is a promising solution for constructing the best OSSM within user given constraints.

Kok-Leong Ong, Wee-Keong Ng, Ee-Peng Lim

Association Rule Mining

Using a Connectionist Approach for Enhancing Domain Ontologies: Self-Organizing Word Category Maps Revisited

In this paper, we present an approach based on neural networks for organizing words of a specific domain according to their semantic relations. The terms, which are extracted from domain-specific text documents, are mapped onto a two-dimensional map to provide an intuitive interface displaying semantically similar words in spatially similar regions. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident.

Michael Dittenbach, Dieter Merkl, Helmut Berger
Parameterless Data Compression and Noise Filtering Using Association Rule Mining

The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique, FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.

Yew-Kwong Woon, Xiang Li, Wee-Keong Ng, Wen-Feng Lu
Performance Evaluation of SQL-OR Variants for Association Rule Mining

In this paper, we focus on the SQL-OR approaches. We study several additional optimizations for the SQL-OR approaches (Vertical Tid, Gather-join, and Gather count) and evaluate them using DB2 and Oracle RDBMSs. We evaluate the approaches analytically and compare their performance on large data sets. Finally, we summarize the results and indicate the conditions for which the individual optimizations are useful.

P. Mishra, S. Chakravarthy

Data Analysis and Discovery

A Distance-Based Approach to Find Interesting Patterns

One of the major problems in knowledge discovery is producing too many trivial and uninteresting patterns. The measurement of interestingness is divided into subjective and objective measures and is used to address this problem. In this paper, we propose a novel method to discover interesting patterns by incorporating the domain user's preconceived knowledge. The prior knowledge constitutes a set of hypotheses about the domain. A new parameter, called the distance, is proposed to measure the gap between the user's existing hypotheses and system-generated knowledge. To evaluate the practicality of our approach, we apply the proposed approach to some real-life data sets and present our findings.

Chen Zheng, Yanfen Zhao
Similarity Search in Structured Data

Recently, structured data is getting more and more important in database applications, such as molecular biology, image retrieval or XML document retrieval. Attributed graphs are a natural model for the structured data in those applications. For the clustering and classification of such structured data, a similarity measure for attributed graphs is necessary. All known similarity measures for attributed graphs are either limited to a special type of graph or computationally extremely complex, i.e. NP-complete, and are, therefore, unsuitable for data mining in large databases. In this paper, we present a new similarity measure for attributed graphs, called matching distance. We demonstrate, how the matching distance can be used for efficient similarity search in attributed graphs. Furthermore, we propose a filter-refinement architecture and an accompanying set of filter methods to reduce the number of necessary distance calculations during similarity search. Our experiments show that the matching distance is a meaningful similarity measure for attributed graphs and that it enables efficient clustering of structured data.

Hans-Peter Kriegel, Stefan Schönauer

Ontologies and Improving Data Quality

Using an Interest Ontology for Improved Support in Rule Mining

This paper describes the use of a concept hierarchy for improving the results of association rule mining. Given a large set of tuples with demographic information and personal interest information, association rules that associate ages and gender with interests can be derived. However, there are two problems. First, some data sets are too sparse to produce rules with high support. Secondly, some data sets with abstract interests do not represent the actual interests well. To overcome these problems, we preprocess the data tuples using an ontology of interests. Thus, interests within tuples that are very specific are replaced by more general interests retrieved from the interest ontology. This results in many more tuples at a more general level. Feeding those tuples to an association rule miner results in rules that have better support and that better represent reality.

Xiaoming Chen, Xuan Zhou, Richard Scherl, James Geller
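The preprocessing step described above, replacing very specific interests with more general concepts before mining, can be sketched in a few lines of Python. The tiny hierarchy and tuples below are hypothetical, not the paper's ontology or data.

    # Hypothetical interest hierarchy: child concept -> parent concept.
    parent = {
        "jazz": "music", "opera": "music",
        "soccer": "sports", "tennis": "sports",
    }

    def generalize(interest, levels=1):
        """Walk up the hierarchy a fixed number of levels, if possible."""
        for _ in range(levels):
            interest = parent.get(interest, interest)
        return interest

    tuples = [("25-34", "f", "jazz"), ("25-34", "m", "opera"),
              ("18-24", "m", "soccer"), ("18-24", "f", "tennis")]

    # After generalization more tuples share the same interest value,
    # so rules such as (age 25-34 -> music) gain support.
    generalized = [(age, sex, generalize(i)) for age, sex, i in tuples]
    print(generalized)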
Fraud Formalization and Detection

A fraudster can be an impersonator or a swindler. An impersonator is an illegitimate user who steals resources from the victims by “taking over” their accounts. A swindler is a legitimate user who intentionally harms the system or other users by deception. Previous research efforts in fraud detection concentrate on identifying frauds caused by impersonators. Detecting frauds conducted by swindlers is a challenging issue. We propose an architecture to catch swindlers. It consists of four components: profile-based anomaly detector, state transition analysis, deceiving intention predictor, and decision-making component. Profile-based anomaly detector outputs fraud confidence indicating the possibility of fraud when there is a sharp deviation from usual patterns. State transition analysis provides state description to users when an activity results in entering a dangerous state leading to fraud. Deceiving intention predictor discovers malicious intentions. Three types of deceiving intentions, namely uncovered deceiving intention, trapping intention, and illusive intention, are defined. A deceiving intention prediction algorithm is developed. A user-configurable risk evaluation function is used for decision making. A fraud alarm is raised when the expected risk is greater than the fraud investigation cost.

Bharat Bhargava, Yuhui Zhong, Yunhua Lu
Combining Noise Correction with Feature Selection

Polishing is a noise correction mechanism which makes use of the inter-relationship between attribute and class values in the data set to identify and selectively correct components that are noisy. We applied polishing to a data set of amino acid sequences and associated information of point mutations of the gene COLIA1 for the classification of the phenotypes of the genetic collagenous disease Osteogenesis Imperfecta (OI). OI is associated with mutations in one or both of the genes COLIA1 and COLIA2. There are at least four known phenotypes of OI, of which type II is the severest and often lethal. Preliminary results of polishing suggest that it can lead to a higher classification accuracy. We further investigated the use of polishing as a scoring mechanism for feature selection, and the effect of the features so derived on the resulting classifier. Our experiments on the OI data set suggest that combining polishing and feature selection is a viable mechanism for improving data quality.

Choh Man Teng

Queries and Data Patterns

Pre-computing Approximate Hierarchical Range Queries in a Tree-Like Histogram

Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, OLAP applications, and so on. This paper presents a new histogram based on a hierarchical decomposition of the original data distribution kept in a complete binary tree. This tree, thus containing a set of pre-computed hierarchical queries, is encoded in a compressed form using bit saving in representing integer numbers. The approach, extending a recently proposed technique based on the application of such a decomposition to the buckets of a pre-existing histogram, is shown by several experiments to improve the accuracy of the state-of-the-art histograms.

Francesco Buccafurri, Gianluca Lax
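To make the idea of pre-computed hierarchical range queries concrete, here is a hedged sketch of a complete binary tree of partial sums over a small frequency vector; it ignores the paper's bit-saving encoding and approximation and only shows how a range sum is assembled from pre-computed nodes.

    # Build a complete binary tree of sums over a frequency vector
    # (length assumed to be a power of two for simplicity).
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    n = len(data)
    tree = [0] * (2 * n)
    tree[n:] = data
    for i in range(n - 1, 0, -1):   # each internal node stores the sum of its children
        tree[i] = tree[2 * i] + tree[2 * i + 1]

    def range_sum(lo, hi):
        """Sum of data[lo:hi], assembled from pre-computed tree nodes."""
        total, lo, hi = 0, lo + n, hi + n
        while lo < hi:
            if lo & 1:
                total += tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += tree[hi]
            lo //= 2
            hi //= 2
        return total

    print(range_sum(2, 6))  # 4 + 1 + 5 + 9 = 19
    print(sum(data[2:6]))   # same answer from the raw data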
Comprehensive Log Compression with Frequent Patterns

In this paper we present a comprehensive log compression (CLC) method that uses frequent patterns and their condensed representations to identify repetitive information from large log files generated by communications networks. We also show how the identified information can be used to separate and filter out frequently occurring events that hide other, unique or only a few times occurring events. The identification can be done without any prior knowledge about the domain or the events. For example, no pre-defined patterns or value combinations are needed. This separation makes it easier for a human observer to perceive and analyse large amounts of log data. The applicability of the CLC method is demonstrated with real-world examples from data communication networks.

Kimmo Hätönen, Jean François Boulicaut, Mika Klemettinen, Markus Miettinen, Cyrille Masson
Non-recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations

Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. One major problem is the high memory dependency: the gigantic data structures built are assumed to fit in main memory; in addition, the recursive mining process used to mine these structures is also too voracious in its memory requirements. This paper proposes a new association rule mining algorithm based on the frequent pattern tree data structure. Our algorithm does not use much more memory over and above the memory used by the data structure. For each frequent item, a relatively small independent tree called a COFI-tree is built summarizing co-occurrences. Finally, a simple and non-recursive mining process mines the COFI-trees. Experimental studies reveal that our approach is efficient and allows the mining of larger datasets than those limited by FP-Tree-based algorithms.

Mohammad El-Hajj, Osmar R. Zaïane
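As background for readers new to frequency counting over transactions, the sketch below counts frequent items and then co-occurrences among them on a made-up transaction set. It is deliberately naive: the paper's COFI-tree approach avoids exactly this kind of pair-by-pair counting, but the frequent itemsets it produces are the same.

    from collections import Counter
    from itertools import combinations

    # Hypothetical transaction data.
    transactions = [
        {"bread", "milk"}, {"bread", "beer", "milk"},
        {"beer", "milk"}, {"bread", "milk", "eggs"},
    ]
    min_support = 2

    # Frequent single items first, then co-occurrence counts restricted
    # to those items (candidate frequent 2-itemsets).
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t & frequent_items), 2):
            pair_counts[pair] += 1

    print({p: c for p, c in pair_counts.items() if c >= min_support})
    # {('bread', 'milk'): 3, ('beer', 'milk'): 2}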

Improving Database Query Engine

A New Computation Model for Rough Set Theory Based on Database Systems

We propose a new computation model for rough set theory using relational algebra operations in this paper. We present the necessary and sufficient conditions on data tables under which an attribute is a core attribute and those under which a subset of condition attributes is a reduct, respectively. With this model, two algorithms for core attributes computation and reduct generation are suggested. The correctness of both algorithms is proved and their time complexity is analyzed. Since relational algebra operations have been efficiently implemented in most widely-used database systems, the algorithms presented can be extensively applied to these database systems and adapted to a wide range of real-life applications with very large data sets.

Jianchao Han, Xiaohua Hu, T. Y. Lin
Computing SQL Queries with Boolean Aggregates

We introduce a new method for optimization of SQL queries with nested subqueries. The method is based on the idea of Boolean aggregates, aggregates that compute the conjunction or disjunction of a set of conditions. When combined with grouping, Boolean aggregates allow us to compute all types of non-aggregated subqueries in a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer.

Antonio Badia
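Engines without a native Boolean aggregate can emulate one by aggregating a 0/1 condition; the sqlite3 sketch below rewrites an EXISTS-style nested query as a grouped MAX over a comparison. This is only an illustration of the general idea of combining grouping with a Boolean-style aggregate, not the rewrite procedure of the paper; the tables are made up.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers(id INTEGER, name TEXT);
        CREATE TABLE orders(customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
        INSERT INTO orders VALUES (1, 50), (1, 150), (2, 80);
    """)

    # Nested form with a correlated EXISTS subquery ...
    nested = """
        SELECT name FROM customers c
        WHERE EXISTS (SELECT 1 FROM orders o
                      WHERE o.customer_id = c.id AND o.amount > 100)
    """

    # ... and an unnested form using grouping plus a Boolean-style aggregate
    # (SQLite has no BOOL_OR, so MAX over a 0/1 condition emulates it).
    unnested = """
        SELECT c.name
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        HAVING MAX(o.amount > 100) = 1
    """

    print(conn.execute(nested).fetchall())    # [('ann',)]
    print(conn.execute(unnested).fetchall())  # [('ann',)]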
Fighting Redundancy in SQL

Many SQL queries with aggregated subqueries exhibit redundancy (overlap in FROM and WHERE clauses). We propose a method, called the for-loop, to optimize such queries by ensuring that redundant computations are done only once. We specify a procedure to build a query plan implementing our method, give an example of its use and argue that it offers performance advantages over traditional approaches.

Antonio Badia, Dev Anand

Sampling and Vector Classification

“On-the-fly” VS Materialized Sampling and Heuristics

Aggregation queries can take hours to return answers in large data warehouses (DW). Users interested in exploring data in several iterative steps using decision support or data mining tools may be frustrated by such long response times. The ability to return fast approximate answers accurately and efficiently is important to these applications. Samples for use in query answering can be obtained "on-the-fly" (OS) or from a materialized summary of samples (MS). While MS are typically faster than OS summaries, they have the limitation that sampling rates are predefined upon construction. This paper analyzes the use of OS versus MS for approximate answering of aggregation queries and proposes a sampling heuristic that chooses the appropriate sampling rate to provide answers as fast as possible while simultaneously guaranteeing accuracy targets. The experimental section compares OS to MS, analyzing response time and accuracy (TPC-H benchmark), and shows the heuristic strategy in action.

Pedro Furtado
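A hedged sketch of the basic principle behind sample-based approximate answering (not the OS/MS heuristic of the paper): compute an aggregate on a uniform sample, scale it by the inverse sampling rate, and compare it with the exact answer. The data are synthetic.

    import random

    random.seed(7)

    # Hypothetical fact table: one sales amount per row.
    amounts = [random.uniform(1, 100) for _ in range(100_000)]

    rate = 0.01                      # 1% uniform "on-the-fly" sample
    sample = [a for a in amounts if random.random() < rate]

    exact = sum(amounts)
    estimate = sum(sample) / rate    # scale up by the inverse sampling rate

    print(f"exact SUM:     {exact:,.0f}")
    print(f"estimated SUM: {estimate:,.0f}  "
          f"(relative error {abs(estimate - exact) / exact:.2%})")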
Incremental and Decremental Proximal Support Vector Classification using Decay Coefficients

This paper presents an efficient approach for supporting decremental learning for incremental proximal support vector machines (SVM). The presented decremental algorithm based on decay coefficients is compared with an existing window-based decremental algorithm, and is shown to perform at a similar level in accuracy, but providing significantly better computational performance.

Amund Tveit, Magnus Lie Hetland, Håvard Engum
Backmatter
Metadata
Title
Data Warehousing and Knowledge Discovery
Edited by
Yahiko Kambayashi
Mukesh Mohania
Wolfram Wöß
Copyright year
2003
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45228-7
Print ISBN
978-3-540-40807-9
DOI
https://doi.org/10.1007/b11825