nach oben

2004 | Buch

Kapitel lesen Erstes Kapitel lesen

Database Support for Data Mining Applications

Discovering Knowledge with Inductive Queries

herausgegeben von: Rosa Meo, Pier Luca Lanzi, Mika Klemettinen

Verlag: Springer Berlin Heidelberg

Buchreihe : Lecture Notes in Computer Science

Enthalten in: Professional Book Archive

Einloggen, um Zugang zu erhalten

Über dieses Buch

Data mining from traditional relational databases as well as from non-traditional ones such as semi-structured data, Web data, and scientific databases housing biological, linguistic, and sensor data has recently become a popular way of discovering hidden knowledge.

This book on database support for data mining is developed to approaches exploiting the available database technology, declarative data mining, intelligent querying, and associated issues, such as optimization, indexing, query processing, languages, and constraints. Attention is also paid to the solution of data preprocessing problems, such as data cleaning, discretization, and sampling.

The 16 reviewed full papers presented were carefully selected from various workshops and conferences to provide complete and competent coverage of the core issues. Some papers were developed within an EC funded project on discovering knowledge with inductive queries.

Inhaltsverzeichnis

Frontmatter

Database Languages and Query Execution

Inductive Databases and Multiple Uses of Frequent Itemsets: The cInQ Approach

Abstract

Inductive databases (IDBs) have been proposed to afford the problem of knowledge discovery from huge databases. With an IDB the user/analyst performs a set of very different operations on data using a query language, powerful enough to perform all the required elaborations, such as data preprocessing, pattern discovery and pattern post-processing. We present a synthetic view on important concepts that have been studied within the cInQ European project when considering the pattern domain of itemsets. Mining itemsets has been proved useful not only for association rule mining but also feature construction, classification, clustering, etc. We introduce the concepts of pattern domain, evaluation functions, primitive constraints, inductive queries and solvers for itemsets. We focus on simple high-level definitions that enable to forget about technical details that the interested reader will find, among others, in cInQ publications.

Jean-François Boulicaut

Query Languages Supporting Descriptive Rule Mining: A Comparative Study

Abstract

Recently, inductive databases (IDBs) have been proposed to tackle the problem of knowledge discovery from huge databases. With an IDB, the user/analyst performs a set of very different operations on data using a query language, powerful enough to support all the required manipulations, such as data preprocessing, pattern discovery and pattern post-processing. We provide a comparison between three query languages (MSQL, DMQL and MINE RULE) that have been proposed for descriptive rule mining and discuss their common features and differences. These query languages look like extensions of SQL. We present them using a set of examples, taken from the real practice of rule mining. In the paper we discuss also OLE DB for Data Mining and Predictive Model Markup Language, two recent proposals that like the first three query languages respectively provide native support to data mining primitives and provide a description in a standard language of statistical and data mining models.

Marco Botta, Jean-François Boulicaut, Cyrille Masson, Rosa Meo

Declarative Data Mining Using SQL3

Abstract

Researchers convincingly argue that the ability to declaratively mine and analyze relational databases using SQL for decision support is a critical requirement for the success of the acclaimed data mining technology. Although there have been several encouraging attempts at developing methods for data mining using SQL, simplicity and efficiency still remain significant impediments for further development. In this article, we propose a significantly new approach and show that any object relational database can be mined for association rules without any restructuring or preprocessing using only basic SQL3 constructs and functions, and hence no additional machineries are necessary. In particular, we show that the cost of computing association rules for a given database does not depend on support and confidence thresholds. More precisely, the set of large items can be computed using one simple join query and an aggregation once the set of all possible meets (least fixpoint) of item set patterns in the input table is known. We believe that this is an encouraging discovery especially compared to the well known SQL based methods in the literature. Finally, we capture the functionality of our proposed mining method in a mine by SQL3 operator for general use in any relational database.

Hasan M. Jamil

Towards a Logic Query Language for Data Mining

Abstract

We present a logic database language with elementary data mining mechanisms to model the relevant aspects of knowledge discovery, and to provide a support for both the iterative and interactive features of the knowledge discovery process. We adopt the notion of user-defined aggregate to model typical data mining tasks as operations unveiling unseen knowledge. We illustrate the use of aggregates to model specific data mining tasks, such as frequent pattern discovery, classification, data discretization and clustering, and show how the resulting data mining query language allows the modeling of typical steps of the knowledge discovery process, that range from data preparation to knowledge extraction and evaluation.

Fosca Giannotti, Giuseppe Manco, Franco Turini

A Data Mining Query Language for Knowledge Discovery in a Geographical Information System

Abstract

Spatial data mining is a process used to discover interesting but not explicitly available, highly usable patterns embedded in both spatial and non-spatial data, which are possibly stored in a spatial database. An important application of spatial data mining methods is the extraction of knowledge from a Geographic Information System (GIS). INGENS (INductive GEographic iNformation System) is a prototype GIS which integrates data mining tools to assist users in their task of topographic map interpretation. The spatial data mining process is aimed at a user who controls the parameters of the process by means of a query written in a mining query language. In this paper, we present SDMOQL (Spatial Data Mining Object Query Language), a spatial data mining query language used in INGENS, whose design is based on the standard OQL (Object Query Language). Currently, SDMOQL supports two data mining tasks: inducing classification rules and discovering association rules. For both tasks the language permits the specification of the task-relevant data, the kind of knowledge to be mined, the background knowledge and the hierarchies, the interestingness measures and the visualization for discovered patterns. Some constraints on the query language are identified by the particular mining task. The syntax of the query language is described and the application to a real repository of maps is briefly reported.

Donato Malerba, Annalisa Appice, Michelangelo Ceci

Towards Query Evaluation in Inductive Databases Using Version Spaces

Abstract

An inductive query specifies a set of constraints that patterns should satisfy. We study a novel type of inductive query that consists of arbitrary boolean expressions over monotonic and anti-monotonic primitives. One such query asks for all patterns that have a frequency of at least 50 on the positive examples and of at most 3 on the negative examples.

We investigate the properties of the solution spaces of boolean inductive queries. More specifically, we show that the solution space w.r.t. a conjunctive query is a version space, which can be represented by its border sets, and that the solution space w.r.t. an arbitrary boolean inductive query corresponds to a union of version spaces. We then discuss the role of operations on version spaces (and their border sets) in computing the solution space w.r.t. a given query. We conclude by formulating some thoughts on query optimization.

Luc De Raedt

The GUHA Method, Data Preprocessing and Mining

Abstract

The paper surveys basic principles and foundations of the GUHA method, relation to some well-known data mining systems, main publications, existing implementations and future plans.

Petr Hájek, Jan Rauch, David Coufal, Tomáš Feglar

Constraint Based Mining of First Order Sequences in SeqLog

Abstract

A logical language, SeqLog, for mining and querying sequential data and databases is presented. In SeqLog, data takes the form of a sequence of logical atoms, background knowledge can be specified using Datalog style clauses and sequential queries or patterns correspond to subsequences of logical atoms. SeqLog is then used as the representation language for the inductive database mining system MineSeqLog. Inductive queries in MineSeqLog take the form of a conjunction of a monotonic and an anti-monotonic constraint on sequential patterns. Given such an inductive query, MineSeqLog computes the borders of the solution space. MineSeqLog uses variants of the famous level-wise algorithm together with ideas from version spaces to realize this. Finally, we report on a number of experiments in the domains of user-modelling that validate the approach.

Sau Dan Lee, Luc De Raedt

Support for KDD-Process

Interactivity, Scalability and Resource Control for Efficient KDD Support in DBMS

Abstract

The conflict between resource consumption and query performance in the data mining context often has no satisfactory solution. This is in sharp contrast to the needs of the analysts for interactive response times and has rendered the seamless integration of data mining operators into common multiuser database systems a difficult and (so far) not very successful task. This paper describes an approach that allows to combine preprocessing and data mining operators into one common KDD-aware implementation algebra such that interactivity, scalability and resource efficiency can simultaneously be achieved. The basic idea of our framework is pipelining. However, since there is a danger of blocking pipelines, we introduce controlled ordering-, cardinality- and special-value-properties of the data stream across the whole query tree up to the complex data mining operators. The framework builds on a spezialized index that is basically an extension of the UB-Tree and efficiently provides various data orderings. These orderings and the remaining properties are then exploited by the KDD-algebra operators to release results and internal data structures early enough to allow pipelined, resource-efficient query processing with interactive response times. This paper describes the framework and demonstrates its benefits in preprocessing and in the parallel and interactive detection of outliers.

Matthias Gimbel, Michael Klein, P. C. Lockemann

Frequent Itemset Discovery with SQL Using Universal Quantification

Abstract

Algorithms for finding frequent itemsets fall into two broad categories: algorithms that are based on non-trivial SQL statements to query and update a database, and algorithms that employ sophisticated in-memory data structures, where the data is stored in flat files. Most performance experiments have shown that SQL-based approaches are inferior to main-memory algorithms. However, the current trend of database vendors to integrate analysis functionalities into their query execution and optimization components, i.e., “closer to the data,” suggests to revisit these results and to search for new, potentially better solutions.

We investigate approaches based on SQL-92 and present a new approach called Quiver that employs universal and existential quantifications. In the table schema for itemsets of our approach, a group of tuples represents a single itemset. Such a “vertical” layout is similar to the popular layout used for the transaction table, which is the input of frequent itemset discovery. We show that current DBMS do not provide efficient query processing strategies for dealing with quantified queries, mostly due to the lack of an adequate SQL syntax for set containment tests. Performance tests using a query processor prototype and a novel query operator, called set containment division, promise an improved performance for quantified queries like those used for Quiver.

Ralf Rantzau

Deducing Bounds on the Support of Itemsets

Abstract

Mining Frequent Itemsets is the core operation of many data mining algorithms. This operation however, is very data intensive and sometimes produces a prohibitively large output. In this paper we give a complete set of rules for deducing tight bounds on the support of an itemset if the supports of all its subsets are known. Based on the derived bounds [l,u] on the support of a candidate itemset I, we can decide not to access the database to count the support of I if l is larger than the support threshold (I will certainly be frequent), or if u is below the threshold (I will certainly fail the frequency test). We can also use the deduction rules to reduce the size of an adequate representation of the collection of frequent sets; all itemsets I with bounds [l,u], where l =u, do not need to be stored explicitly. To assess the usability in practice, we implemented the deduction rules and we present experiments on real-life data sets.

Toon Calders

Model-Independent Bounding of the Supports of Boolean Formulae in Binary Data

Abstract

Data mining algorithms such as the Apriori method for finding frequent sets in sparse binary data can be used for efficient computation of a large number of summaries from huge data sets. The collection of frequent sets gives a collection of marginal frequencies about the underlying data set. Sometimes, we would like to use a collection of such marginal frequencies instead of the entire data set (e.g. when the original data is inaccessible for confidentiality reasons) to compute other interesting summaries. Using combinatorial arguments, we may obtain tight upper and lower bounds on the values of inferred summaries. In this paper, we consider a class of summaries wider than frequent sets, namely that of frequencies of arbitrary Boolean formulae. Given frequencies of a number of any different Boolean formulae, we consider the problem of finding tight bounds on the frequency of another arbitrary formula. We give a general formulation of the problem of bounding formula frequencies given some background information, and show how the bounds can be obtained by solving a linear programming problem. We illustrate the accuracy of the bounds by giving empirical results on real data sets.

Artur Bykowski, Jouni K. Seppänen, Jaakko Hollmén

Condensed Representations for Sets of Mining Queries

Abstract

In this paper, we propose a general framework for condensed representations of sets of mining queries. To this end, we adapt the standard notions of maximal, closed and key patterns introduced in previous works, including those dealing with condensed representations. Whereas these previous works concentrate on condensed representations of the answer to a single mining query, we consider the more general case of sets of mining queries defined by monotonic and anti-monotonic selection predicates.

Arnaud Giacometti, Dominique Laurent, Cheikh Talibouya Diop

One-Sided Instance-Based Boundary Sets

Abstract

Instance retraction is a difficult problem for concept learning by version spaces. This chapter introduces a family of version-space representations called one-sided instance-based boundary sets. They are correct and efficiently computable representations for admissible concept languages. Compared to other representations, they are the most efficient useful version-space representations for instance retraction.

Evgueni N. Smirnov, Ida G. Sprinkhuizen-Kuyper, H Jaap van den Herik

Domain Structures in Filtering Irrelevant Frequent Patterns

Abstract

Events are used to monitor many types of processes in several technical domains. Computers and efficient electronic communication networks make it very easy to increase the accuracy and amount of logged details. While the size of logs is growing, the collection and analysis of them are becoming harder all the time. Frequent episodes offer one possible method to structure and find information hidden in logs. Unfortunately, as events reflecting simultaneous independent processes are stored to central monitoring points, signs of several unrelated phenomena get mixed with each other. This makes the algorithm searching for frequent episodes to produce accidental and irrelevant results. As a solution to this problem, we introduce here a notion of domain constraints that are based on distance measures, which can be defined in terms of domain structure and used taxonomies. We also show how these constraints can be used to prune irrelevant event combinations.

Kimmo Hätönen, Mika Klemettinen

Integrity Constraints over Association Rules

Abstract

In this paper, we propose to investigate the notion of integrity constraints in inductive databases. We advocate that integrity constraints can be used in this context as an abstract concept to encompass common data mining tasks such as the detection of corrupted data or of patterns that contradict the expert beliefs. To illustrate this possibility we propose a form of constraints called association map constraints to specify authorized confidence variations among the association rules. These constraints are easy to read and thus can be used to write clear specifications. We also present experiments showing that their satisfaction can be tested in practice.

Artur Bykowski, Thomas Daurel, Nicolas Méger, Christophe Rigotti

Backmatter

Titel: Database Support for Data Mining Applications
herausgegeben von: Rosa Meo
Pier Luca Lanzi
Mika Klemettinen
Verlag: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-44497-8
Print ISBN: 978-3-540-22479-2
DOI: https://doi.org/10.1007/b99016

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Database Languages and Query Execution

Inductive Databases and Multiple Uses of Frequent Itemsets: The cInQ Approach

Query Languages Supporting Descriptive Rule Mining: A Comparative Study

Declarative Data Mining Using SQL3

Towards a Logic Query Language for Data Mining

A Data Mining Query Language for Knowledge Discovery in a Geographical Information System

Towards Query Evaluation in Inductive Databases Using Version Spaces

The GUHA Method, Data Preprocessing and Mining

Constraint Based Mining of First Order Sequences in SeqLog

Support for KDD-Process

Interactivity, Scalability and Resource Control for Efficient KDD Support in DBMS

Frequent Itemset Discovery with SQL Using Universal Quantification

Deducing Bounds on the Support of Itemsets

Model-Independent Bounding of the Supports of Boolean Formulae in Binary Data

Condensed Representations for Sets of Mining Queries

One-Sided Instance-Based Boundary Sets

Domain Structures in Filtering Irrelevant Frequent Patterns

Integrity Constraints over Association Rules

Backmatter