
2004 | Book

Data Warehousing and Knowledge Discovery

6th International Conference, DaWaK 2004, Zaragoza, Spain, September 1-3, 2004. Proceedings

Edited by: Yahiko Kambayashi, Mukesh Mohania, Wolfram Wöß

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Data Warehousing Design

Conceptual Design of XML Document Warehouses

EXtensible Markup Language (XML) has emerged as the dominant standard for describing and exchanging data among heterogeneous data sources. XML, with its self-describing hierarchical structure and its associated XML Schema (XSD), provides the flexibility and the manipulative power needed to accommodate complex, disconnected, heterogeneous data. The growing volume of such data warrants investigating XML Document Warehouses. But due to XML's non-scalar, set-based, semi-structured nature, traditional data design models lack the ability to represent XML design-level constructs in an abstract, implementation-independent form, which is crucial for designing complex domains such as data marts and data warehouses, and also during their operational and maintenance phases. We utilize Object-Oriented (OO) concepts to develop a conceptual model for XML Document Warehouses. In this paper we propose a conceptual design formalism to build meaningful XML Document Warehouses (XDW). Our focus includes: (1) conceptually designing and building a meaningful XML (warehouse) repository (xFACT) using OO concepts integrated with XML Schema constructs; (2) conceptually modelling and designing virtual dimensions using XML conceptual views [10a] [10b] to satisfy warehouse end-user requirements; and (3) using UML package diagrams to logically group and build hierarchical conceptual views that enhance the semantics and expressiveness of the XDW.

Vicky Nassis, R. Rajugan, Tharam S. Dillon, Wenny Rahayu
Bringing Together Partitioning, Materialized Views and Indexes to Optimize Performance of Relational Data Warehouses

There has been a lot of work on optimizing the performance of relational data warehouses. Three major techniques can be used for this objective: enhanced index schemes (join indexes, bitmap indexes), materialized views, and data partitioning. Existing research prototypes and products use materialized views alone, indexes alone, or a combination of the two, but none uses all three techniques together to optimize the performance of relational data warehouses. In this paper we show by a systematic experimental evaluation that the combination of these three techniques reduces the query processing cost and the maintenance overhead significantly. We conduct several experiments and analyse the situations where data partitioning gives better performance than materialized views and indexes. Based on these rigorous experiments, we recommend tuning parameters for better utilization of data partitioning, join indexes and materialized views to optimize the total cost.

Ladjel Bellatreche, Michel Schneider, Hervé Lorinquer, Mukesh Mohania
GeoDWFrame: A Framework for Guiding the Design of Geographical Dimensional Schemas

A Data Warehouse (DW) is a dimensional database for providing decision support by means of on-line analytical processing (OLAP) techniques. Another technology also used to provide decision support is the Geographical Information System (GIS). Much research aims at integrating these technologies, but some questions remain open, particularly regarding the design of a geographical dimensional data schema. This paper discusses related work and proposes GeoDWFrame, a framework based on the star schema and specified as guidance for designing geographical dimensional schemas. Some experimental results are also given.

Robson N. Fidalgo, Valéria C. Times, Joel Silva, Fernando F. Souza
Workload-Based Placement and Join Processing in Node-Partitioned Data Warehouses

Data warehouses (DW) with enormous quantities of data pose major performance and scalability challenges. The Node-Partitioned Data Warehouse (NPDW) partitions the DW over inexpensive computer nodes for scalability. Partitioning and data placement strategies are crucial to the performance of complex queries on the NPDW. In this paper we propose a partitioning, placement and join processing strategy to boost the performance of costly joins in the NPDW, compare alternative strategies using the TPC-H performance evaluation benchmark, and draw conclusions.

Pedro Furtado

Knowledge Discovery Framework and XML Data Mining

Novelty Framework for Knowledge Discovery in Databases

Knowledge Discovery in Databases (KDD) is an iterative process that aims at extracting interesting, previously unknown and hidden patterns from huge databases. The use of objective measures of interestingness in popular data mining algorithms often leads to another data mining problem, although of reduced complexity. Reducing the volume of discovered rules is desirable in order to improve the efficiency of the overall KDD process, and subjective measures of interestingness are required to achieve this. In this paper we study the novelty of discovered rules as a subjective measure of interestingness. We propose a framework to quantify the novelty of discovered rules in terms of their deviations from known rules, computed using the importance that the user assigns to different deviations. The computed degree of novelty is then compared with a user-given threshold to report novel rules to the user. We implement the proposed framework and experiment with some public datasets; the experimental results are quite promising.

Ahmed Sultan Al-Hegami, Vasudha Bhatnagar, Naveen Kumar
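To make the deviation-and-threshold idea concrete, here is a minimal sketch in Python, assuming hypothetical names (deviation, novelty) and a toy weighting scheme; it illustrates the shape of the computation, not the authors' implementation:

    # Illustrative novelty scoring (hypothetical names, not the authors' code).
    # A rule is modelled as (antecedent, consequent) frozensets of items.

    def deviation(rule, known_rule, weights):
        """Weighted symmetric difference between a discovered and a known rule."""
        ant_dev = len(rule[0] ^ known_rule[0])   # items differing in the antecedents
        con_dev = len(rule[1] ^ known_rule[1])   # items differing in the consequents
        return weights["antecedent"] * ant_dev + weights["consequent"] * con_dev

    def novelty(rule, known_rules, weights):
        """Degree of novelty = deviation from the closest known rule."""
        return min(deviation(rule, kr, weights) for kr in known_rules)

    weights = {"antecedent": 1.0, "consequent": 2.0}   # user-given importance
    known = [(frozenset({"bread"}), frozenset({"butter"}))]
    discovered = (frozenset({"bread", "jam"}), frozenset({"honey"}))
    threshold = 2.0
    if novelty(discovered, known, weights) > threshold:
        print("novel rule:", discovered)
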
Revisiting Generic Bases of Association Rules

As a side effect of the unprecedented digitization of data, classical retrieval tools find themselves unable to go much beyond the tip of the iceberg. Data mining, in conjunction with Formal Concept Analysis, is a clear promise to furnish adequate tools to do so, and especially to derive concise, generic and easily understandable bases of "hidden" knowledge that can be relied on in a decision-making process. In this paper, we propose to revisit the notion of association rule redundancy and to present sound inference axioms for deriving all association rules from generic bases of association rules.

S. Ben Yahia, E. Mephu Nguifo
Mining Maximal Frequently Changing Subtree Patterns from XML Documents

Due to the dynamic nature of online information, XML documents typically evolve over time. The changes to the data values or structures of an XML document may exhibit particular patterns. In this paper, we focus on the sequence of changes to the structures of an XML document to find out which subtrees in the XML structure frequently change together; we call these Frequently Changing Subtree Patterns (FCSPs). In order to keep the discovered patterns more concise, we further define the problem of mining maximal FCSPs. An algorithm derived from FP-growth is developed to mine the set of maximal FCSPs. Experimental results show that our algorithm is substantially faster than the naive algorithm and that it scales well with respect to the size of the XML structure.

Ling Chen, Sourav S. Bhowmick, Liang-Tien Chia
Discovering Pattern-Based Dynamic Structures from Versions of Unordered XML Documents

Existing work on XML data mining deals with snapshot XML data only, while XML data is dynamic in real applications. In this paper, we discover knowledge from XML data by taking its dynamic nature into account. We present a novel approach to extract pattern-based dynamic structures from versions of unordered XML documents. With the proposed dynamic metrics, the pattern-based dynamic structures are expected to summarize and predict interesting change trends of certain structures based on their past behaviors. Two types of pattern-based dynamic structures, increasing dynamic structures and decreasing dynamic structures, are considered. Based on our proposed data model, the SMH-Tree, an algorithm is presented for mining such pattern-based dynamic structures with only two scans of the XML sequence. Experimental results show that the proposed algorithm can extract the pattern-based dynamic structures efficiently with good scalability.

Qiankun Zhao, Sourav S. Bhowmick, Sanjay Madria

Data Cubes and Queries

Space-Efficient Range-Sum Queries in OLAP

In this paper, we present a fast algorithm to answer range-sum queries in OLAP data cubes. Our algorithm supports constant-time queries while maintaining sub-linear time updates and using minimum space. Furthermore, we study the trade-off between query time and update time. The complexity is $O(2^\ell d)$ for queries and $O((2^\ell \sqrt[2^\ell]{n})^d)$ for updates on a data cube of $n^d$ elements, where $\ell$ is a trade-off parameter. Our algorithm improves over the previous best known results.

Fredrik Bengtsson, Jingsen Chen
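The constant-time query idea builds on the classic prefix-sum technique; the following sketch shows only that one-dimensional base case (the paper's structure adds the trade-off parameter ℓ and the extension to d dimensions):

    # 1-D prefix-sum array: O(n) space, O(1) range-sum queries, O(n) worst-case update.
    # This is only the classic base case that range-sum structures build on.

    class PrefixSum:
        def __init__(self, data):
            self.prefix = [0]
            for x in data:
                self.prefix.append(self.prefix[-1] + x)

        def range_sum(self, lo, hi):
            """Sum of data[lo..hi], inclusive, in constant time."""
            return self.prefix[hi + 1] - self.prefix[lo]

    ps = PrefixSum([3, 1, 4, 1, 5, 9, 2, 6])
    print(ps.range_sum(2, 5))  # 4 + 1 + 5 + 9 = 19
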
Answering Approximate Range Aggregate Queries on OLAP Data Cubes with Probabilistic Guarantees

Approximate range aggregate queries are among the most frequent and useful kinds of queries for Decision Support Systems (DSS). Traditionally, sampling-based techniques have been proposed to tackle this problem. However, their effectiveness degrades when the underlying data distribution is skewed. Another approach, based on outlier management, can limit the effect of data skew but fails to address other requirements of approximate range aggregate queries, such as error guarantees and query processing efficiency. In this paper, we present a technique that provides approximate answers to range aggregate queries on OLAP data cubes efficiently and with theoretical error guarantees. Our basic idea is to build different data structures for the outliers and for the rest of the data. Experimental results verify the effectiveness of the proposed methods.

Alfredo Cuzzocrea, Wei Wang, Ugo Matrangolo
Computing Complex Iceberg Cubes by Multiway Aggregation and Bounding

Iceberg cubing is a valuable technique in data warehouses. The efficiency of iceberg cube computation comes from efficient aggregation and effective pruning for constraints. In advanced applications, iceberg constraints are often non-monotone and complex, for example, “Average cost in the range [δ1, δ2] and standard deviation of cost less than β”. The current cubing algorithms either are efficient in aggregation but weak in pruning for such constraints, or can prune for non-monotone constraints but are inefficient in aggregation. The best algorithm of the former, Star-cubing, computes aggregations of cuboids simultaneously but its pruning is specific to only monotone constraints such as “COUNT(*) ≥ δ”. In the latter case, the Divide and Approximate pruning technique can prune for non-monotone constraints but is limited to bottom-up single-group aggregation. We propose a solution that exhibits both efficiency in aggregation and generality and effectiveness in pruning for complex constraints. Our bounding techniques are as general as the Divide and Approximate pruning techniques for complex constraints and yet our multiway aggregation is as efficient as Star-cubing.

LienHua Pauline Chou, Xiuzhen Zhang
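For orientation, the semantics of an iceberg cube with a complex constraint can be stated as a brute-force computation, as in the sketch below (illustrative names and data; it performs none of the pruning that is the paper's contribution):

    # Naive iceberg cube: enumerate every group-by of the dimensions and keep only
    # cells satisfying a complex, non-monotone constraint such as
    # "average cost in [d1, d2] and standard deviation of cost < beta".
    from itertools import combinations
    from statistics import mean, pstdev
    from collections import defaultdict

    rows = [  # (dim_a, dim_b, dim_c, cost)
        ("a1", "b1", "c1", 10.0), ("a1", "b1", "c2", 12.0),
        ("a1", "b2", "c1", 50.0), ("a2", "b1", "c1", 11.0),
    ]
    dims, d1, d2, beta = (0, 1, 2), 9.0, 13.0, 2.0

    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):      # one cuboid per dimension subset
            cells = defaultdict(list)
            for r in rows:
                cells[tuple(r[i] for i in group_by)].append(r[3])
            for cell, costs in cells.items():
                if d1 <= mean(costs) <= d2 and pstdev(costs) < beta:
                    print(group_by, cell, costs)
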

Multidimensional Schema and Data Aggregation

An Aggregate-Aware Retargeting Algorithm for Multiple Fact Data Warehouses

The use of pre-computed aggregate data is a common technique to address performance in Data Warehouse systems (DW), but the adoption of aggregates must be transparent. An aggregate retargeting mechanism redirects a query to available aggregate table(s) whenever possible. In this work, we present an aggregate retargeting mechanism that deals with multiple fact table schemas (MFTS), which are composed of many fact tables related by conformed dimensions. The algorithm provides two types of transparency: a) aggregate unawareness, and b) MFTS join complexity unawareness. The algorithm requires simple metadata, and it is non-intrusive. The paper presents the retargeting algorithm, required metadata and performance experiments.

Karin Becker, Duncan Dubugras Ruiz
A Partial Pre-aggregation Scheme for HOLAP Engines

This paper describes a scheme for partial pre-aggregation to speed up the response time of queries posed through an array-like interface, subject to the constraint that all pre-computed aggregates must fit into storage of a pre-determined size. The target query workload consists of all base and aggregate cells that are stored in a multidimensional array (i.e., cube); these queries are in effect range queries pre-defined by users. Due to the huge number of possible aggregate cells, the emphasis of our scheme is on reducing the overhead of query compilation. An efficient and effective query decomposition method is devised, which works well with a pre-aggregation scheme whereby pre-computed aggregates form a sub-cube of the full cube. A greedy algorithm is devised to derive such a sub-cube. A HOLAP engine which implements this partial pre-aggregation scheme is described. Experimental results using both synthetic and real-life datasets demonstrate that the partial pre-aggregation scheme is viable and, for some complex queries, accelerates query execution by close to 300 times.

Wo-Shun Luk, Chao Li
Discovering Multidimensional Structure in Relational Data

On-Line Analytical Processing (OLAP) systems based on multidimensional databases are essential elements of decision support. However, most existing data is stored in “ordinary” relational OLTP databases, i.e., data has to be (re-)modeled as multidimensional cubes before the advantages of OLAP tools are available. In this paper we present an approach for the automatic construction of multidimensional OLAP database schemas from existing relational OLTP databases, enabling easy OLAP design and analysis for most existing data sources. This is achieved through a set of practical and effective algorithms for discovering multidimensional schemas from relational databases. The algorithms take a wide range of available metadata into account in the discovery process, including functional and inclusion dependencies, and key and cardinality information.

Mikael R. Jensen, Thomas Holmgren, Torben Bach Pedersen

Inductive Databases and Temporal Rules

Inductive Databases as Ranking

Most research in data mining has focused on developing novel algorithms for specific data mining tasks. However, finding the theoretical foundations of data mining has recently been recognized as an even greater concern. One promising candidate for a solid basis of data mining is known as inductive databases: databases with tight integration of data mining facilities. However, it is not clear what inductive databases actually are, what they should be, and whether inductive databases differ notably from usual databases with slightly broader notions of queries and data objects. In this paper we aim to show that the viewpoint offered by inductive databases does differ from that of usual databases: inductive databases can be seen as databases with the ability to rank data manipulation operations. We describe how several central data mining tasks can be naturally defined by this approach and show that the proposed inductive databases framework offers conceptual benefits by clarifying and unifying the central data mining tasks. We also discuss some challenges of inductive databases based on query ranking and grading.

Taneli Mielikäinen
Inductive Databases of Polynomial Equations

Inductive databases (IDBs) contain both data and patterns. Here we consider IDBs where patterns are polynomial equations. We present a constraint-based approach to answering inductive queries in this domain. The approach is based on heuristic search through the space of polynomial equations and can use subsumption and evaluation constraints on polynomial equations. We evaluate this approach on standard regression problems. We finally consider IDBs containing patterns in the form of polynomial equations as well as molecular fragments, where the two are combined in order to derive QSAR (Quantitative Structure-Activity Relationships) models.

Sašo Džeroski, Ljupčo Todorovski, Peter Ljubič
From Temporal Rules to Temporal Meta-rules

In this article we define a formalism to discover knowledge in the form of temporal rules, inferred from databases of events with a temporal dimension. The theoretical framework is based on first-order temporal logic and allows the main temporal data mining notions (event, temporal rule, constraint) to be defined in a formal way. The concepts of consistent linear time structure and general interpretation are fundamental in the design of algorithms for inferring higher-order temporal rules (called temporal meta-rules) from local sets of temporal rules.

Paul Cotofrei, Kilian Stoffel

Industrial Track

How Is BI Used in Industry?: Report from a Knowledge Exchange Network

After being born in industry, the topic of business intelligence (BI) has become an active research topic, spanning research areas such as data warehousing, On-Line Analytical Processing (OLAP), data mining, and data streams. Ideally, BI research should be driven by industry needs. However, little information on how BI is used in industry is currently available to the BI research community. This industrial paper presents the experiences that the author has gained from running a number of knowledge exchange activities related to BI, e.g., network meetings and workshops, with partners in industry. The paper presents the network participants, the topics of the knowledge exchange activities, and the lessons learned over the last couple of years. For example, lessons relate to the popularity of OLAP and data mining tools, the time spent on different tasks in BI projects, and the importance of “soft” people-related issues. Finally, the paper recommends a number of future research topics that would be useful to industry.

Torben Bach Pedersen
Towards an Adaptive Approach for Mining Data Streams in Resource Constrained Environments

Mining data streams in resource-constrained environments has emerged as a challenging research issue for the data mining community in the past two years. Several approaches have been proposed to tackle the challenge of the limited capabilities of small devices that generate or receive data streams. These approaches try to approximate the mining results with acceptable accuracy and efficiency in space and time complexity. However, these approaches are not resource-aware. In this paper, a thorough discussion of the state of the art in mining data streams is presented, followed by a formalization of our Algorithm Output Granularity (AOG) approach to mining data streams. The incorporation of AOG within a generic ubiquitous data mining system architecture is shown and discussed, and industrial applications of AOG-based mining techniques are given.

Mohamed Medhat Gaber, Arkady Zaslavsky, Shonali Krishnaswamy

Data Clustering

Exploring Possible Adverse Drug Reactions by Clustering Event Sequences

Historically, the identification of adverse drug reactions has relied on manual processes whereby doctors and hospitals report incidents to a central agency. In this paper we suggest a data mining approach using administrative pharmaceutical usage data linked with hospital admissions data. Patients, represented by temporal sequences of drug usage, are clustered using unsupervised learning techniques. Such techniques rely on a distance measure, and we propose one for comparing drug usage sequences, built on an event-type hierarchy derived from the hierarchical drug classification system. Although developed for a specific domain, the measure is applicable in other applications where event types form a hierarchical structure, such as telecommunications. The approach modifies the Uniform Kernel K-Nearest Neighbour Clustering algorithm to constrain the merging of clusters to those within a specified distance. This avoids losing clusters that are less dense yet far apart, as would otherwise occur; such clusters are typical of the applications we are interested in, where outliers are important. We demonstrate the algorithm through a successful application exploring possible adverse drug events, in particular hospital admissions for severe angioedema resulting from the usage of certain drugs and drug combinations. The interesting clusters thus identified have given medical researchers clues for further investigation.

Hongxing He, Graham Williams, Jie Chen, Simon Hawkins, Chris Kelman
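A hierarchy-based distance of the general kind described can be sketched as follows, using hypothetical dot-separated classification codes; the authors' actual measure over the drug classification system may differ:

    # Hierarchy-based distance between two event types (illustrative form only):
    # events sharing a deeper common ancestor in the classification tree are
    # considered more similar.

    def lca_depth(a, b):
        """Number of leading levels two dot-separated codes share."""
        shared = 0
        for x, y in zip(a.split("."), b.split(".")):
            if x != y:
                break
            shared += 1
        return shared

    def event_distance(a, b, max_depth=3):
        return 1.0 - lca_depth(a, b) / max_depth

    print(event_distance("C09.AA.02", "C09.AA.05"))  # ~0.33: close in hierarchy
    print(event_distance("C09.AA.02", "N02.BE.01"))  # 1.0: unrelated classes
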
SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes

Clustering is a difficult problem, especially when we consider the task in the context of a data stream of categorical attributes. In this paper, we propose SCLOPE, a novel algorithm based on CLOPE's intuitive observation about cluster histograms. Unlike CLOPE, however, our algorithm is very fast and operates within the constraints of a data stream environment. In particular, we designed SCLOPE according to the recent CluStream framework. Our evaluation of SCLOPE shows very promising results: it consistently outperforms CLOPE in speed and scalability tests on our data sets while maintaining high cluster purity, and it supports cluster analysis that other algorithms in its class do not.

Kok-Leong Ong, Wenyuan Li, Wee-Keong Ng, Ee-Peng Lim
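CLOPE's cluster-histogram intuition can be summarized in a few lines: a cluster is described by its histogram of item occurrences, with size S (total occurrences), width W (distinct items), and height H = S/W, and tall, narrow histograms indicate cohesive clusters. A minimal sketch of these statistics (not CLOPE's full criterion, nor SCLOPE's streaming machinery):

    # Cluster histogram statistics as used in CLOPE-style clustering:
    # size S = total item occurrences, width W = distinct items, height H = S / W.
    from collections import Counter

    def histogram_stats(cluster):
        """cluster: list of transactions, each a list of categorical items."""
        hist = Counter(item for txn in cluster for item in txn)
        size = sum(hist.values())         # S
        width = len(hist)                 # W
        return size, width, size / width  # S, W, H

    cluster = [["apple", "banana"], ["apple", "cherry"], ["apple"]]
    print(histogram_stats(cluster))       # (5, 3, 1.666...)
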
Novel Clustering Approach that Employs Genetic Algorithm with New Representation Scheme and Multiple Objectives

In this paper, we propose a new encoding scheme for genetic algorithms (GAs) and employ multiple objectives in handling the clustering problem. The proposed encoding scheme uses links so that the objects to be clustered form a linear pseudo-graph. We use two objectives: 1) minimizing the Total Within Cluster Variation (TWCV); and 2) minimizing the number of clusters in a partition. Our approach obtains the optimal partitions for all possible numbers of clusters in the Pareto-optimal set returned by a single GA run. The performance of the proposed approach has been tested using two well-known data sets, Iris and Ruspini. The obtained results demonstrate improvement over classical approaches.

Jun Du, Erkan Korkmaz, Reda Alhajj, Ken Barker
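The first objective, TWCV, is the standard quantity: the sum over clusters of the squared distances from each object to its cluster centroid. A short sketch (illustrative, separate from the GA encoding itself):

    # Total Within Cluster Variation: sum over clusters of squared distances
    # from each object to its cluster centroid. One of the two GA objectives
    # (the other simply counts the clusters in the partition).

    def twcv(clusters):
        """clusters: list of clusters, each a list of equal-length numeric vectors."""
        total = 0.0
        for cluster in clusters:
            centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
            total += sum(sum((x - c) ** 2 for x, c in zip(p, centroid))
                         for p in cluster)
        return total

    print(twcv([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))  # 2.0
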

Data Visualization and Exploration

Categorical Data Visualization and Clustering Using Subjective Factors

A common issue in cluster analysis is that there is no single correct answer for the number of clusters, since cluster analysis involves human subjective judgement. Interactive visualization is one method by which users can decide on proper clustering parameters. In this paper, a new clustering approach called CDCS (Categorical Data Clustering with Subjective factors) is introduced, together with a visualization tool for clustered categorical data in which the result of adjusting parameters is instantly reflected. Experiments show that CDCS generates high-quality clusters compared to other typical algorithms.

Chia-Hui Chang, Zhi-Kai Ding
Multidimensional Data Visual Exploration by Interactive Information Segments

Visualization techniques play an outstanding role in the KDD process for data analysis and mining. However, a single image does not always successfully convey the inherent information of high-dimensional, very large databases. In this paper we introduce VSIS (Visual Set of Information Segments), an interactive tool to visually explore multidimensional, very large, numerical data. Within supervised learning, our proposal approaches the problem of classification by searching for meaningful intervals of the most relevant attributes. These intervals are displayed as multi-colored bars in which the degree of impurity with respect to class membership can be easily perceived. Such bars can be re-explored interactively with new values of user-defined parameters. A case study applying VSIS to some UCI repository data sets shows the usefulness of our tool in supporting the exploration of multidimensional, very large data.

Francisco J. Ferrer-Troyano, Jesús S. Aguilar-Ruiz, José C. Riquelme
Metadata to Support Transformations and Data & Metadata Lineage in a Warehousing Environment

Data warehousing is a collection of concepts and tools which aim at providing and maintaining a set of integrated data (the data warehouse, DW) for business decision support within an organization. Data warehousing tools extract data from different operational data sources and, after some cleansing and transformation procedures, integrate and load the data into a central repository to enable analysis and mining. Data and metadata lineage are important processes for data analysis: the former allows users to trace warehouse data items back to the original source items from which they were derived, and the latter shows which operations were performed to achieve that target data. This work proposes integrating metadata captured during transformation processes using the CWM metadata standard in order to enable data and metadata lineage. Additionally, it presents a tool specially developed for performing this task.

Aurisan Souza de Santana, Ana Maria de Carvalho Moura

Data Classification, Extraction and Interpretation

Classification Based on Attribute Dependency

Decision tree learning algorithms, e.g., C5, are good at dataset classification, but they usually work with only one attribute at a time, and the dependencies among attributes are not considered. Thus, it is very important to construct a model that discovers the dependencies among attributes and improves the accuracy of decision tree learning algorithms. Association mining is a good choice for addressing attribute dependencies. Generally, these dependencies are classified into three types: categorical-type, numerical-type, and categorical-numerical-mixed dependencies. This paper proposes a CAM (Classification based on Association Mining) model to deal with these kinds of dependencies. The CAM model combines association mining technologies with traditional decision-tree learning capabilities to handle complicated, real cases. According to experiments on fifteen datasets from the UCI repository, the CAM model can significantly improve both the accuracy and the rule size of C5. At the same time, the CAM model also outperforms existing association-based classification models, i.e., ADT (Association-based Decision Tree) and CBA (Classification Based on Association).

Yue-Shi Lee, Show-Jane Yen
OWDEAH: Online Web Data Extraction Based on Access History

Web data extraction systems are the kernel of information mediators between users and heterogeneous Web data resources. How to extract structured data from semi-structured documents has been a problem of active research. Supervised and unsupervised methods have been devised to learn extraction rules from training sets. However, preparing training sets (and especially annotating them for supervised methods) is very time-consuming. We propose a framework for Web data extraction that logs users' access histories and exploits them to assist automatic training-set generation. We cluster accessed Web documents according to their structural details, define criteria to measure the importance of sub-structures, and then generate extraction rules. We also propose a method to adjust the rules according to historical data. Our experiments confirm the viability of our proposal.

Zhao Li, Wee-Keong Ng, Kok-Leong Ong
Data Mining Approaches to Diffuse Large B-Cell Lymphoma Gene Expression Data Interpretation

This paper presents a comprehensive study of gene expression patterns originating from a diffuse large B-cell lymphoma (DLBCL) database. It focuses on the implementation of feature selection and classification techniques. Thus, it firstly tackles the identification of relevant genes for the prediction of DLBCL types. It also allows the determination of key biomarkers to differentiate two subtypes of DLBCL samples: Activated B-Like and Germinal Centre B-Like DLBCL. Decision trees provide knowledge-based models to predict types and subtypes of DLBCL. This research suggests that the data may be insufficient to accurately predict DLBCL types or even detect functionally relevant genes. However, these methods represent reliable and understandable tools to start thinking about possible interesting non-linear interdependencies.

Jesús S. Aguilar-Ruiz, Francisco Azuaje, José C. Riquelme

Data Semantics

Deriving Multiple Topics to Label Small Document Regions

Information retrieval can be greatly enhanced if the semantics of document contents are made explicit as labels that can be queried by markup-sensitive languages. We focus on labelling small text fragments, such as parts of sentences or paragraphs, with frequent topics. We propose WORDtrain, a sequence miner that builds topics for small document regions, such as sentences with many subsentences. WORDtrain splits regions in such a way that non-overlapping fragments are built and the topics derived for them are frequent. WORDtrain discovers frequent topics rather than choosing from a pre-defined reference list, which raises the issue of evaluating the quality of its results. To this purpose, we have designed two evaluation schemes, one requiring expert involvement and one automatic. Our first experiments with these schemes show that WORDtrain yields promising results.

Henner Graubitz, Myra Spiliopoulou
Deriving Efficient SQL Sequences via Read-Aheads

Modern information system architectures place applications in an application server and persistent objects in a relational database. In this setting, we consider the problem of improving application throughput; our proposed solution uses data prefetching (read-aheads) to minimize the total data-access time of an application, in a manner that affects neither the application code nor the backend DBMS. Our methodology is based on analyzing and automatically merging SQL queries to produce query sequences with low total response time, in ways that exploit the application's data-access patterns. The proposed approach is independent of the application domain and can be viewed as a component of container-managed persistence that can be implemented in middleware. This paper describes our proposed framework for using generic data-access patterns to improve application throughput and reports preliminary experimental results on discovering key parameters that influence the trade-offs in producing efficient merged SQL queries. The approach is evaluated in the context of a financial domain, which yields the kinds of natural conceptual relationships where our approach is valuable.

A. Soydan Bilgin, Rada Y. Chirkova, Timo J. Salo, Munindar P. Singh
Diversity in Random Subspacing Ensembles

Ensembles of learnt models constitute one of the main current directions in machine learning and data mining. It was shown experimentally and theoretically that in order for an ensemble to be effective, it should consist of classifiers having diversity in their predictions. A number of ways are known to quantify diversity in ensembles, but little research has been done about their appropriateness. In this paper, we compare eight measures of the ensemble diversity with regard to their correlation with the accuracy improvement due to ensembles. We conduct experiments on 21 data sets from the UCI machine learning repository, comparing the correlations for random subspacing ensembles with different ensemble sizes and with six different ensemble integration methods. Our experiments show that the greatest correlation of the accuracy improvement, on average, is with the disagreement, entropy, and ambiguity diversity measures, and the lowest correlation, surprisingly, is with the Q and double fault measures. Normally, the correlation decreases linearly as the ensemble size increases. Much higher correlation values can be seen with the dynamic integration methods, which are shown to better utilize the ensemble diversity than their static analogues.

Alexey Tsymbal, Mykola Pechenizkiy, Pádraig Cunningham
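As a reference point, the pairwise disagreement measure mentioned above is simple to state: for a pair of classifiers, it is the fraction of instances on which exactly one of the two predicts correctly, averaged over all pairs. A sketch:

    # Pairwise disagreement diversity: fraction of instances on which exactly one
    # of two classifiers is correct, averaged over all classifier pairs.
    from itertools import combinations

    def disagreement(correct):
        """correct: matrix [classifier][instance] of booleans (prediction correct?)."""
        pairs = list(combinations(range(len(correct)), 2))
        total = 0.0
        for i, j in pairs:
            diff = sum(a != b for a, b in zip(correct[i], correct[j]))
            total += diff / len(correct[i])
        return total / len(pairs)

    # Three classifiers, four instances:
    print(disagreement([[True, True, False, True],
                        [True, False, False, True],
                        [False, True, True, True]]))  # 0.5
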

Association Rule Mining

Partitioned Approach to Association Rule Mining over Multiple Databases

Database mining is the process of extracting interesting and previously unknown patterns and correlations from data stored in Database Management Systems (DBMSs). Association rule mining is the process of discovering items which tend to occur together in transactions. If the data to be mined are stored as relations in multiple databases, a partitioned approach is appropriate instead of moving data from one database to another. This paper addresses the partitioned approach to association rule mining for data stored in multiple relational DBMSs. We propose an approach that is very effective for partitioned databases as compared to the main-memory partitioned approach. Our approach uses an SQL-based K-way join algorithm and its optimizations. A second alternative that trades accuracy for performance is also presented; our results indicate that beyond a certain data set size, accuracy is preserved in addition to the performance improvement. Extensive experiments have been performed, and results are presented for the two partitioned approaches using IBM DB2/UDB and Oracle 8i.

Himavalli Kona, Sharma Chakravarthy
A Tree Partitioning Method for Memory Management in Association Rule Mining

All methods of association rule mining require the frequent sets of items, those that occur together sufficiently often to be the basis of potentially interesting rules, to be computed first. The cost of this increases in proportion to the database size, and also with its density. Densely-populated databases can give rise to very large numbers of candidates that must be counted. Both these factors cause performance problems, especially when the data structures involved become too large for primary memory. In this paper we describe a method of partitioning that organises the data into tree structures that can be processed independently. We present experimental results showing that the method scales well for increasing dimensions of data and performs significantly better than alternatives, especially when dealing with dense data and low support thresholds.

Shakil Ahmed, Frans Coenen, Paul Leng
Mining Interesting Association Rules for Prediction in the Software Project Management Area

Association and classification are two data mining techniques traditionally used for solving different kinds of problems: association has been applied in knowledge discovery, and classification in predictive tasks. Recent studies have shown that knowledge discovery algorithms can be successfully used for prediction in classification problems. The improvement of association rule algorithms is the subject of many works in the literature, whereas little research has been done concerning their classification aspect. On the other hand, methods for solving the problems of association rules must be tailored to the particularities of the application domain. This work deals with the predictive use of association rules and addresses the problem of reducing the number of rules generated in the software project management field. We propose an algorithm for refining association rules based on incremental knowledge discovery. It provides managers with strong rules for decision making without the need for domain knowledge.

María N. Moreno García, Francisco J. García Peñalvo, M. José Polo Martín

Mining Event Sequences

PROWL: An Efficient Frequent Continuity Mining Algorithm on Event Sequences

Mining association rules in event sequences is an important data mining problem with many applications. Most previous studies on association rules concern mining intra-transaction associations, which consider only relationships among items in the same transaction. However, intra-transaction association rules are not suitable for trend prediction. Inter-transaction association was therefore introduced, which considers relationships among the itemsets of multiple time instants. In this paper, we present PROWL, an efficient algorithm for mining inter-transaction rules. By using a projected window method and a depth-first enumeration approach, we can discover all frequent patterns quickly. An extensive experimental evaluation on a number of real and synthetic databases shows that PROWL significantly outperforms previous methods.

Kuo-Yu Huang, Chia-Hui Chang, Kuo-Zui Lin
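A common way to make inter-transaction patterns minable, shown in the generic sketch below (this is the standard windowing idea, not PROWL's projected-window algorithm), is to tag each item with its offset inside a sliding window, so that ordinary intra-transaction mining applies to the extended transactions:

    # Sliding-window conversion for inter-transaction mining (generic technique).
    # Each item is tagged with its offset inside a window of `maxspan` time instants,
    # turning inter-transaction patterns into ordinary itemsets over extended items.

    def extended_transactions(sequence, maxspan):
        """sequence: list of itemsets, one per time instant."""
        out = []
        for start in range(len(sequence) - maxspan + 1):
            window = sequence[start:start + maxspan]
            out.append({(offset, item)
                        for offset, items in enumerate(window)
                        for item in items})
        return out

    seq = [{"a"}, {"b"}, {"a"}, {"b"}]
    for txn in extended_transactions(seq, 2):
        print(sorted(txn))  # {(0,'a'), (1,'b')} appears twice: "a then b" pattern
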
Algorithms for Discovery of Frequent Superset, Rather Than Frequent Subset

In this paper, we propose a novel mining task: mining frequent supersets from a database of itemsets, which is useful in bioinformatics, e-learning systems, job-shop scheduling, and so on. A frequent superset is an itemset that contains (as subsets) more transactions than the minimum support threshold. Intuitively, following the Apriori algorithm, the level-wise discovery starts from 1-itemsets, 2-itemsets, and so forth. However, such steps cannot utilize the Apriori property to reduce the search space, because even if an itemset is not frequent, its supersets may be frequent. In order to solve this problem, we propose three methods. The first is an Apriori-based approach, called Apriori-C. The second is an Eclat-based, depth-first approach, called Eclat-C. The last is the proposed data complement technique (DCT), which reuses existing frequent itemset mining approaches to mine frequent supersets. The experimental study compares the performance of the three methods, considering the effects of the number of transactions, the average transaction length, the number of distinct items, and the minimum support.

Zhung-Xun Liao, Man-Kwan Shan
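One reading of the data complement technique: a transaction t is covered by a candidate X exactly when the complement of X is contained in the complement of t, so frequent-superset mining reduces to ordinary frequent itemset mining on the complemented database. A hedged sketch of that reduction:

    # Data-complement reduction (our reading of DCT, illustrative only):
    # t <= X  iff  (I \ X) <= (I \ t), so frequent supersets of the original
    # database correspond to frequent itemsets of the complemented database.

    ITEMS = frozenset("abcde")                     # the item universe I
    db = [frozenset("ab"), frozenset("abc"), frozenset("de")]

    def complement(itemset):
        return ITEMS - itemset

    def superset_support(x, database):
        """Number of transactions contained in candidate x."""
        return sum(t <= x for t in database)

    comp_db = [complement(t) for t in db]
    x = frozenset("abc")
    # Support of x as a superset equals support of complement(x) as an itemset:
    assert superset_support(x, db) == sum(complement(x) <= t for t in comp_db)
    print(superset_support(x, db))                 # 2: covers "ab" and "abc"
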

Pattern Mining

Improving Direct Counting for Frequent Itemset Mining

During the last ten years, many algorithms have been proposed to mine frequent itemsets. In order to fairly evaluate their behavior, the IEEE/ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03) was recently organized. According to its analysis, kDCI++ is a state-of-the-art algorithm. However, the FIMI'03 experiments show that its efficient behavior does not extend to low minimum supports, especially on sparse databases. Aiming to improve kDCI++ and make it even more competitive, we present the kDCI-3 algorithm. This proposal directly accesses candidates not only in the first iterations but especially in the third one, which in general represents the highest computational cost of kDCI++ for low minimum supports. Results show that kDCI-3 outperforms kDCI++ in the conducted experiments. Compared to other important algorithms, kDCI-3 increases the number of cases in which the kDCI approach presents the best behavior.

Adriana Prado, Cristiane Targa, Alexandre Plastino
Mining Sequential Patterns with Item Constraints

Mining sequential patterns aims to discover the sequential purchasing behaviors of most customers from a large number of customer transactions. Past transaction data can be analyzed to discover customer purchasing behaviors. However, the transaction database can be very large, making it very time-consuming to find all sequential patterns, and users may be interested in only some items. Moreover, the criteria for the discovered sequential patterns may differ according to user requirements, and traditional mining methods can generate many sequential patterns that are uninteresting with respect to those requirements. Hence, a data mining language needs to be provided so that users can query only the knowledge that interests them from a large database of customer transactions. In this paper, such a data mining language is presented. With it, users can specify the items of interest and the criteria of the sequential patterns to be discovered. An efficient data mining technique is also proposed to extract the sequential patterns according to the users' requests.

Show-Jane Yen, Yue-Shi Lee
Mining Borders of the Difference of Two Datacubes

In this paper we use the novel concept of minimal cube transversals on the cube lattice of a categorical database relation for mining the borders of the difference of two datacubes. The problem of finding cube transversals is a sub-problem of hypergraph transversal discovery, since there exists an order-embedding from the cube lattice to the power set lattice of binary attributes. Based on this result, we propose a levelwise algorithm, and an optimization which uses the frequency of the disjunction, for mining minimal cube transversals. Using cube transversals, we introduce a new OLAP functionality: discovering the difference of two uni-compatible datacubes, or the most frequent elements in the difference. Finally, we propose a merging algorithm for mining the boundary sets of the difference without computing the two related datacubes. Provided with such a difference of two datacubes capturing similar information but computed at different dates, a user can focus on what is new or, more generally, on how the previously observed trends evolve.

Alain Casali
Mining Periodic Patterns in Sequence Data

Periodic pattern mining is the problem of discovering temporal regularities. It has many emerging applications, including web usage recommendation, weather prediction, computer networks, and biological data. In this paper, we propose a Progressive Timelist-Based Verification (PTV) method for mining periodic patterns from a sequence of event sets. The parameter min_rep is employed to specify the minimum number of repetitions required for a valid segment of non-disrupted pattern occurrences. We also describe a partitioning approach to handle extra-large or long data sequences. The experiments demonstrate good performance and scalability with large frequent patterns.

Kuo-Yu Huang, Chia-Hui Chang
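The validity test at the heart of timelist-based verification is easy to state: a candidate period p yields a valid segment when the event recurs at constant gap p for at least min_rep repetitions without disruption. A minimal check over one event's timelist (illustrative; not the full progressive PTV machinery):

    # Timelist-based check for one event and one candidate period (illustrative).
    # A segment is valid when the event recurs with constant gap `period`
    # for at least `min_rep` consecutive repetitions.

    def valid_segments(timelist, period, min_rep):
        if not timelist:
            return []
        segments, run = [], [timelist[0]]
        for t in timelist[1:]:
            if t - run[-1] == period:
                run.append(t)
            else:
                if len(run) >= min_rep:
                    segments.append((run[0], run[-1]))
                run = [t]
        if len(run) >= min_rep:
            segments.append((run[0], run[-1]))
        return segments

    # Event observed at these time points; period 3, at least 3 repetitions:
    print(valid_segments([1, 4, 7, 10, 12, 15, 18], 3, 3))  # [(1, 10), (12, 18)]
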
Backmatter
Metadata
Title
Data Warehousing and Knowledge Discovery
Edited by
Yahiko Kambayashi
Mukesh Mohania
Wolfram Wöß
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-30076-2
Print ISBN
978-3-540-22937-7
DOI
https://doi.org/10.1007/b99817