
1999 | Book

Principles of Data Mining and Knowledge Discovery

Third European Conference, PKDD’99, Prague, Czech Republic, September 15-18, 1999. Proceedings

Editors: Jan M. Żytkow, Jan Rauch

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the Third European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD'99, held in Prague, Czech Republic in September 1999.
The 28 revised full papers and 48 poster presentations were carefully reviewed and selected from 106 full papers submitted. The papers are organized in topical sections on time series, applications, taxonomies and partitions, logic methods, distributed and multirelational databases, text mining and feature selection, rules and induction, and interesting and unusual issues.

Table of Contents

Frontmatter

Session 1A - Time Series

Scaling up Dynamic Time Warping to Massive Datasets

There has been much recent interest in adapting data mining algorithms to time series databases. Many of these algorithms need to compare time series. Typically, some variation or extension of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations; however, it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher-level abstraction of the data, in particular a piecewise linear representation. We demonstrate that our approach allows us to outperform DTW by one to three orders of magnitude. We experimentally evaluate our approach on medical, astronomical and sign language data.

Eamonn J. Keogh, Michael J. Pazzani
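
To make the cost the abstract refers to concrete, here is a minimal, illustrative sketch of classic dynamic time warping in Python (assuming NumPy); it is not the piecewise-linear variant proposed in the paper, only the quadratic-time baseline that variant is designed to speed up.

```python
# Minimal dynamic time warping (DTW) sketch; illustrative only, not the
# piecewise-linear variant proposed in the paper.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 1.0, 0.0]
    y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
    print(dtw_distance(x, y))   # 0.0: y is a time-warped version of x
```
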
The Haar Wavelet Transform in the Time Series Similarity Paradigm

Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the rapid evaluation of similarity between time series in large databases. It is based on the orthonormal decomposition of the time series into the Haar basis. We demonstrate that this approach is capable of providing estimates of the local slope of the time series in the sequence of multi-resolution steps. The Haar representation and a number of related representations derived from it are suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations closely corresponds to the subjective feeling of similarity between the time series. In order to test the validity of subjective criteria, we test the records of currency exchanges, finding convincing levels of correlation.

Zbigniew R. Struzik, Arno Siebes
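
As a small illustration of the decomposition the abstract builds on, the following Python sketch (assuming NumPy and an input length that is a power of two) computes the orthonormal Haar coefficients of a series; the detail coefficients at each level carry the local-slope information mentioned above. The similarity measures of the paper are not reproduced.

```python
# Illustrative level-by-level Haar decomposition of a time series.
import numpy as np

def haar_transform(x):
    """Return orthonormal Haar wavelet coefficients of x (len(x) a power of two)."""
    coeffs = []
    approx = np.asarray(x, dtype=float)
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)  # local slope information
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)  # coarser-level averages
        coeffs.append(detail)
    coeffs.append(approx)
    return np.concatenate(coeffs[::-1])   # coarsest coefficients first

if __name__ == "__main__":
    series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    print(haar_transform(series))
```
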
Rule Discovery in Large Time-Series Medical Databases

Since hospital information systems were introduced in large hospitals, a large amount of data, including laboratory examinations, has been stored as temporal databases. The characteristics of these temporal databases are: (1) each record is inhomogeneous with respect to the time series, including both short-term and long-term effects; (2) each record has more than 1000 attributes when a patient is followed for more than one year; (3) when a patient is admitted for a long time, a large amount of data is stored in a very short term. Even medical experts cannot deal with these large databases, yet interest in mining useful information from the data is growing. In this paper, we introduce a combination of an extended moving average method and a rule induction method, called CEARI, to discover new knowledge in temporal databases. CEARI was applied to a medical dataset on Motor Neuron Diseases; the results show that interesting knowledge is discovered from each database.

Shusaku Tsumoto

Session 1B - Applications

Simultaneous Prediction of Multiple Chemical Parameters of River Water Quality with TILDE

Environmental studies form an increasingly popular application domain for machine learning and data mining techniques. In this paper we consider two applications of decision tree learning in the domain of river water quality: a) the simultaneous prediction of multiple physico-chemical properties of the water from its biological properties using a single decision tree (as opposed to learning a different tree for each different property) and b) the prediction of past physico-chemical properties of the river water from its current biological properties. We discuss some experimental results that we believe are interesting both to the application domain experts and to the machine learning community.

Hendrik Blockeel, Sašo Džeroski, Jasna Grbović
Applying Data Mining Techniques to Wafer Manufacturing

In this paper we report an experience from the use of data mining techniques in the area of semiconductor fabrication. The specific application we dealt with is the analysis of data concerning the wafer production process, with the goal of determining possible causes of errors resulting in lots of faulty wafers. Even though our application is very specific and deals with a particular manufacturing sector (i.e. semiconductor fabrication), we believe that our experience can be relevant to other manufacturing sectors and provide significant feedback on the use of data mining techniques.

Elisa Bertino, Barbara Catania, Eleonora Caglio
An Application of Data Mining to the Problem of the University Students’ Dropout Using Markov Chains

A new application of data mining to the problem of university dropout is presented. A new modeling technique, based on Markov chains, has been developed to mine information from data about the university students' behavior. The information extracted by means of the proposed technique has been used to understand the dropout problem in depth, to identify the high-risk population, and to drive the design of suitable policies to reduce dropout. To represent the behavior of the students, the available data have been modeled as a Markov chain, and the associated transition probabilities have been used as a basis for extracting the aforesaid behavioral patterns. The developed technique is general and can be successfully used to study a large range of decision problems dealing with data in the form of events or time series. The results of applying the proposed technique to the students' data are presented.

S. Massa, P. P. Puliafito
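
A minimal sketch of the basic ingredient described above, estimating transition probabilities from sequences of student states; the state names are hypothetical, and the paper's full modeling and dropout analysis are not reproduced.

```python
# Hedged sketch: estimating a Markov transition matrix from sequences of
# student states (state names here are hypothetical, not from the paper).
from collections import defaultdict

def transition_probabilities(sequences):
    """Estimate P(next_state | state) from observed state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[state] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return probs

if __name__ == "__main__":
    histories = [
        ["enrolled", "year1", "year2", "graduated"],
        ["enrolled", "year1", "dropout"],
        ["enrolled", "year1", "year2", "dropout"],
    ]
    print(transition_probabilities(histories)["year1"])
    # {'year2': 0.666..., 'dropout': 0.333...}
```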

Session 2A - Taxonomies and Partitions

Discovering and Visualizing Attribute Associations Using Bayesian Networks and Their Use in KDD

In this paper we describe a way to discover attribute associations and a way to present them to users using Bayesian networks. We describe a three-dimensional visualization to present them effectively to users. Furthermore we discuss two applications of attribute associations to the KDD process. One application involves using them to support feature selection. The result of our experiment shows that feature selection using visualized attribute associations works well in 17 data sets out of the 24 that were used. The other application uses them to support the selection of data mining methods. We discuss the possibility of using attribute associations to help in deciding if a given data set is suited to learning decision trees. We found 3 types of structural characteristics in Bayesian networks obtained from the data. The characteristics have strong relevance to the results of learning decision trees.

Gou Masuda, Rei Yano, Norihiro Sakamoto, Kazuo Ushijima
Taxonomy Formation by Approximate Equivalence Relations, Revisited

Unsupervised classification of objects involves the formation of classes and the construction of one or more taxonomies that include those classes. Meaningful classes can be formed in feedback with the acquisition of knowledge about each class. We demonstrate how contingency tables can be used to construct one-level taxonomy elements by relying only on approximate equivalence relations between attribute pairs, and how multi-level taxonomy formation can be guided by a partition utility function. Databases with different types of attributes and a large number of records can be dealt with.

F. A. El-Mouadib, J. Koronacki, J. M. Żytkow
On the Use of Self-Organizing Maps for Clustering and Visualization

We show that the number of output units used in a self-organizing map (SOM) influences its applicability for either clustering or visualization. By reviewing the appropriate literature and theory and our own empirical results, we demonstrate that SOMs can be used for clustering or visualization separately, for simultaneous clustering and visualization, and even for clustering via visualization. For all these different kinds of application, SOM is compared to other statistical approaches. This shows SOM to be a flexible tool which can be used for various forms of explorative data analysis, but it also becomes obvious that this flexibility comes at a price in terms of impaired performance. The usage of SOM in the data mining community is covered by discussing its application in the data mining tools CLEMENTINE and WEBSOM.

Arthur Flexer
Speeding Up the Search for Optimal Partitions

Numerical value range partitioning is an inherent part of inductive learning. In classification problems, a common partition ranking method is to use an attribute evaluation function to assign a goodness score to each candidate. Optimal cut point selection constitutes a potential efficiency bottleneck, which is often circumvented by using heuristic methods. This paper aims at improving the efficiency of optimal multisplitting. We analyze convex and cumulative evaluation functions, which account for the majority of commonly used goodness criteria. We derive an analytical bound which lets us filter out, when searching for the optimal multisplit, all partitions containing a specific subpartition as their prefix. Thus, the search space of the algorithm can be restricted without losing optimality. We compare the partition candidate pruning algorithm with the best existing optimization algorithms for multisplitting. The number of partition candidates it evaluates is, on average, only approximately 25% and 50% of those evaluated by the comparison methods, a saving that amounts to up to 50% less evaluation time per attribute.

Tapio Elomaa, Juho Rousu
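
For orientation, the toy Python sketch below scores candidate binary cut points of a numeric attribute with weighted class entropy, a typical convex, cumulative goodness criterion of the kind analyzed in the paper; the analytical bound and multisplit pruning that constitute the paper's contribution are not reproduced.

```python
# Toy sketch: scoring candidate binary cut points with class entropy.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Return the cut point minimizing the weighted class entropy of the two sides."""
    data = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(data)):
        if data[i - 1][0] == data[i][0]:
            continue                      # only boundaries between distinct values
        cut = (data[i - 1][0] + data[i][0]) / 2.0
        left = [l for v, l in data[:i]]
        right = [l for v, l in data[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
        if score < best[1]:
            best = (cut, score)
    return best

if __name__ == "__main__":
    vals = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
    cls = ["a", "a", "a", "b", "b", "b"]
    print(best_binary_cut(vals, cls))     # the cut at 2.5 separates the classes perfectly
```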

Session 2B - Logic Methods

Experiments in Meta-level Learning with ILP

When considering new datasets for analysis with machine learning algorithms, we encounter the problem of choosing the algorithm which is best suited for the task at hand. The aim of meta-level learning is to relate the performance of different machine learning algorithms to the characteristics of the dataset. The relation is induced on the basis of empirical data about the performance of machine learning algorithms on the different datasets. In the paper, an Inductive Logic Programming (ILP) framework for meta-level learning is presented. The performance of three machine learning algorithms (the tree learning system C4.5, the rule learning system CN2 and the k-NN nearest neighbour classifier) was measured on twenty datasets from the UCI repository in order to obtain the dataset for meta-learning. The results of applying ILP to this meta-learning problem are presented and discussed.

Ljupčo Todorovski, Sašo Džeroski
Boolean Reasoning Scheme with Some Applications in Data Mining

We present a general encoding scheme for a wide class of problems (including, among others, data reduction, feature selection, feature extraction, decision rule generation, pattern extraction from data, and conflict resolution in multi-agent systems) and we show how to combine it with propositional (Boolean) reasoning to develop efficient heuristics searching for (approximate) solutions of these problems. We illustrate our approach with examples, show some experimental results and compare them with those reported in the literature. We also show that association rule generation is strongly related to reduct approximation.

Andrzej Skowron, Hung Son Nguyen
On the Correspondence between Classes of Implicational and Equivalence Quantifiers

Relations between two Boolean attributes derived from data can be quantified by truth functions defined on four-fold tables corresponding to pairs of the attributes. In the paper, several classes of such quantifiers (implicational, double implicational, equivalence ones) with truth values in the unit interval are investigated. The method of construction of the logically nearest double implicational and equivalence quantifiers to a given implicational quantifier (and vice versa) is described and approved.

Jiří Ivánek
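
As a concrete illustration of truth functions on a four-fold table, the sketch below evaluates one standard implicational and one standard equivalence quantifier from the cell counts a, b, c, d; these are generic examples, not necessarily the quantifiers constructed in the paper.

```python
# Sketch of quantifiers defined on a four-fold table (a, b, c, d), where
#   a = #(X and Y), b = #(X and not Y), c = #(not X and Y), d = #(neither).
# Standard examples only; the paper's "logically nearest" construction is not shown.

def implicational(a, b, c, d):
    """Confidence-style quantifier for X => Y (depends only on a and b)."""
    return a / (a + b) if a + b else 0.0

def equivalence(a, b, c, d):
    """Agreement-style quantifier for X <=> Y (rewards both a and d)."""
    n = a + b + c + d
    return (a + d) / n if n else 0.0

if __name__ == "__main__":
    # four-fold table for two Boolean attributes observed on 100 objects
    print(implicational(40, 10, 20, 30))  # 0.8
    print(equivalence(40, 10, 20, 30))    # 0.7
```
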
Querying Inductive Databases via Logic-Based User-Defined Aggregates

We show how a logic-based database language can support the various steps of the KDD process by providing a high degree of expressiveness, the ability to formalize the overall KDD process, and the capability of separating the concerns between the specification level and the mapping to the underlying databases and data mining tools. We generalize the notion of Inductive Databases proposed in [4,12] to the case of Deductive Databases. In our proposal, deductive databases resemble relational databases, while user-defined aggregates provided by the deductive database language resemble the mining functions and results. In the paper we concentrate on association rules and show how the mechanism of user-defined aggregates allows one to specify the mining evaluation functions and the returned patterns.

Fosca Giannotti, Giuseppe Manco

Session 3A - Distributed and Multirelational Databases

Peculiarity Oriented Multi-database Mining

The paper proposes a way of mining peculiarity rules from multiple statistical and transaction databases. We introduce peculiarity rules as a new type of association rule, which can be discovered from a relatively small number of peculiar data points by searching for relevance among the peculiar data. We argue that peculiarity rules represent typically unexpected, interesting regularities hidden in statistical and transaction databases. We describe how to mine peculiarity rules in the multi-database environment and how to use the RVER (Reverse Variant Entity-Relationship) model to represent the result of multi-database mining. Our approach is based on the database reverse engineering methodology and granular computing techniques.

Ning Zhong, Y. Y. Yao, Setsuo Ohsuga
Knowledge Discovery in Medical Multi-databases: A Rough Set Approach

Since the early 1980s, due to the rapid growth of hospital information systems (HIS), electronic patient records have been stored as huge databases at many hospitals. One of the most important problems is that the rules induced from each hospital may be different from those induced from other hospitals, which are very difficult even for medical experts to interpret. In this paper, we introduce rough set based analysis in order to solve this problem. Rough set based analysis interprets the conflicts between rules from the viewpoint of supporting sets, which are closely related to Dempster-Shafer theory (evidence theory), and outputs interpretations of rules with evidential degrees. The proposed method was evaluated on two medical databases; the experimental results show that several interesting relations between rules, including interpretations of differences and the resolution of conflicts between induced rules, are discovered.

Shusaku Tsumoto
Automated Discovery of Rules and Exceptions from Distributed Databases Using Aggregates

Large amounts of data pose special problems for Knowledge Discovery in Databases. More efficient means are required to ease this problem, and one possibility is the use of sufficient statistics or “aggregates”, rather than low level data. This is especially true for Knowledge Discovery from distributed databases. The data of interest is of a similar type to that found in OLAP data cubes and the Data Warehouse. This data is numerical and is described in terms of a number of categorical attributes (Dimensions). Few algorithms to date carry out knowledge discovery on such data. Using aggregate data and accompanying meta-data returned from a number of distributed databases, we use statistical models to identify and highlight relationships between a single numerical attribute and a number of Dimensions. These are initially presented to the user via a graphical interactive middle-ware, which allows drilling down to a more detailed level. On the basis of these relationships, we induce rules in conjunctive normal form. Finally, exceptions to these rules are discovered.

Rónán Páircéir, Sally McClean, Bryan Scotney

Session 3B - Text Mining and Feature Selection

Text Mining via Information Extraction

Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining via information extraction, in which knowledge discovery takes place on a more focused collection of events and phrases that are extracted from and label each document. These events plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This approach was implemented in the Textoscope system. Textoscope consists of a document retrieval module, which converts retrieved documents from their native formats into SGML documents used by Textoscope; an information extraction engine, based on a powerful attribute grammar augmented by rich background knowledge; a taxonomy-creation tool by which the user can help specify higher-level entities that inform the knowledge-discovery process; and a set of knowledge-discovery tools for the resulting event-labeled documents. We evaluate our approach on a collection of newswire stories extracted by Textoscope's own agent. Our results confirm that text mining via information extraction serves as an accurate and powerful technique with which to manage knowledge encapsulated in large document collections.

Ronen Feldman, Yonatan Aumann, Moshe Fresko, Orly Liphstat, Binyamin Rosenfeld, Yonatan Schler
TopCat: Data Mining for Topic Identification in a Text Corpus

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized “ground truth” news corpus showing this technique is effective in identifying topics in collections of news articles.

Chris Clifton, Robert Cooley
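
A quick, hedged sketch of the frequent-itemset step mentioned above, using plain support counting over documents represented as sets of key entities (the example entities are invented); the hypergraph-partitioning clustering step is not reproduced.

```python
# Simple support counting of entity pairs over documents represented as item sets.
from itertools import combinations
from collections import Counter

def frequent_pairs(documents, min_support):
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(doc), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    docs = [{"nato", "kosovo", "serbia"},
            {"nato", "kosovo", "airstrike"},
            {"kosovo", "serbia"},
            {"election", "senate"}]
    print(frequent_pairs(docs, min_support=2))
    # {('kosovo', 'nato'): 2, ('kosovo', 'serbia'): 2}
```
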
Selection and Statistical Validation of Features and Prototypes

Feature and prototype selection are two major problems in data mining, especially for machine learning algorithms. The goal of both selections is to reduce storage complexity, and thus computational costs, without sacrificing accuracy. In this article, we present two incremental algorithms using geometrical neighborhood graphs and a new statistical test to select, step by step, relevant features and prototypes for supervised learning problems. The feature selection procedure we present can be applied before any machine learning algorithm is used.

M. Sebban, D. A. Zighed, S. Di Palma

Session 4A - Rules and Induction

Taming Large Rule Models in Rough Set Approaches

In knowledge discovery from uncertain data we usually wish to obtain models that have good predictive properties when applied to unseen objects. In several applications, it is also desirable to synthesize models that in addition have good descriptive properties. The ultimate goal, therefore, is to maximize both properties, i.e. to obtain models that are amenable to human inspection and that have high predictive performance. Models consisting of decision or classification rules, such as those produced with rough sets [19], can exhibit both properties. In practice, however, the induced models are often too large to be inspected. This paper reports on two basic approaches to obtaining manageable rule-based models that do not sacrifice their predictive qualities: a priori and a posteriori pruning. The methods are discussed in the context of rough sets, but several of the results are applicable to rule-based models in general. Algorithms realizing these approaches have been implemented in the Rosetta system. Predictive performance of the models has been estimated using accuracy and receiver operating characteristics (ROC). The methods have been tested on real-world data sets, with encouraging results.

Thomas Ågotnes, Jan Komorowski, Terje Løken
Optimizing Disjunctive Association Rules

We analyze several problems of optimizing disjunctive association rules. The problems have important applications in data mining, allowing users to focus on interesting rules extracted from databases. We consider association rules of the form $\bigwedge_{j=1}^{n}(A_{i_j}=v_j)\rightarrow C_2$, where $\{A_{i_1},A_{i_2},\ldots,A_{i_n}\}$ is a subset of the categorical attributes of the underlying relation $R$, and $C_2$ is any fixed condition defined over the attributes of the relation $R$. An instantiation of the rule binds the variables $v_j$ to values from the corresponding attribute domains. We study several problems in which we seek a collection of instantiations of a given rule that satisfy certain optimality constraints. Each of the problems can be re-interpreted as looking for one optimized disjunctive association rule. We exhibit efficient algorithms for solving the optimized support and optimized confidence problems, the weighted support/confidence problem, and the shortest rule problem. We discuss the time and space complexity of the designed algorithms and show how they can be improved by allowing for approximate solutions.

Dmitry Zelenko
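
To make the terminology concrete, the sketch below computes support and confidence for each instantiation $(A=v)\rightarrow C$ of a one-attribute rule template; the `condition` predicate is a hypothetical stand-in for the fixed consequent, and the optimization algorithms themselves are not reproduced.

```python
# Illustrative only: support/confidence of each instantiation of a rule template.
from collections import defaultdict

def instantiation_stats(rows, attribute, condition):
    counts = defaultdict(lambda: [0, 0])          # value v -> [cover count, positive count]
    for row in rows:
        entry = counts[row[attribute]]
        entry[0] += 1
        entry[1] += int(condition(row))
    n = len(rows)
    return {v: {"support": cov / n, "confidence": pos / cov}
            for v, (cov, pos) in counts.items()}

if __name__ == "__main__":
    rows = [{"region": "north", "churn": True},
            {"region": "north", "churn": True},
            {"region": "south", "churn": False},
            {"region": "south", "churn": True}]
    print(instantiation_stats(rows, "region", lambda r: r["churn"]))
    # north: support 0.5, confidence 1.0; south: support 0.5, confidence 0.5
```
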
Contribution of Boosting in Wrapper Models

We describe a new way to deal with feature selection when boosting is used to assess the relevancy of feature subsets. In the context of wrapper models, accuracy is replaced as a performance function by a particular exponential criterion, usually optimized in boosting algorithms. A first experimental study brings to the fore the relevance of our approach. However, this new "boosted" strategy requires the construction of many learners at each step, leading to high computational costs. In a second part, we therefore focus on how to speed up boosting convergence in order to reduce this complexity. We propose a new update of the instance distribution, which is the core of a boosting algorithm. We exploit these results to implement a new forward selection algorithm which converges much faster using overbiased distributions over learning instances. The speed-up is achieved by reducing the number of weak hypotheses when many identical observations are shared by different classes. A second experimental study on the UCI repository shows significant speed improvements with our new update, without altering the feature subset selection.

Marc Sebban, Richard Nock
Experiments on a Representation-Independent “Top-Down and Prune” Induction Scheme

Recently, some methods for the induction of decision trees have received much theoretical attention. While some of these works focused on efficient top-down induction algorithms, others investigated the pruning of large trees to obtain small and accurate formulae. This paper discusses the practical possibility of combining and generalizing both approaches, to use them on various classes of concept representations, not strictly restricted to decision trees or formulae built from decision trees. The algorithm, WIREI, is able to produce decision trees, decision lists, simple rules, disjunctive normal form formulae, a variant of multilinear polynomials, and more. This shifting ability reduces the risk of deviating from valuable concepts during the induction. As an example, on a previously used simulated noisy dataset, the algorithm systematically managed to find the target concept itself when using an adequate concept representation. Further experiments on twenty-two readily available datasets show the ability of WIREI to build small and accurate concept representations, which lets the user choose the formalism best suited to his interpretation needs, in particular for mining purposes.

Richard Nock, Marc Sebban, Pascal Jappy

Session 5A - Interesting and Unusual

Heuristic Measures of Interestingness

The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.

Robert J. Hilderman, Howard J. Hamilton
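
As a small illustration, the sketch below computes two of the well-known diversity measures such heuristics draw on (Shannon entropy and the Simpson index) from the tuple counts of a summary; the full set of sixteen heuristics is not reproduced.

```python
# Two standard diversity measures applied to the tuple counts of a summary.
import math

def shannon(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def simpson(counts):
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

if __name__ == "__main__":
    uniform = [25, 25, 25, 25]        # evenly distributed summary
    skewed = [85, 5, 5, 5]            # one dominant tuple
    print(shannon(uniform), shannon(skewed))   # 2.0  vs ~0.85
    print(simpson(uniform), simpson(skewed))   # 0.25 vs ~0.73
```
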
Enhancing Rule Interestingness for Neuro-fuzzy Systems

Data mining algorithms extract patterns from large amounts of data. But these patterns will yield knowledge only if they are interesting, i.e. valid, new, potentially useful, and understandable. Unfortunately, during pattern search most data mining algorithms focus on validity only, which also holds true for neuro-fuzzy systems. In this paper we introduce a method to enhance the interestingness of a rule base as a whole. In the first step, we aggregate the rule base through amalgamation of adjacent rules and elimination of redundant attributes. Supplementing this rather technical approach, we next sort rules with regard to their performance, as measured by their evidence. Finally, we compute reduced evidences, which penalize rules that are very similar to rules with a higher evidence. Rules sorted on reduced evidence are fed into an integrated rule browser, to allow for manual rule selection according to personal and situation-dependent preference. This method was applied successfully to two real-life classification problems, target group selection for a retail bank and fault diagnosis for a large car manufacturer. Explicit reference is made to the NEFCLASS algorithm, but the procedure is easily generalized to other systems.

Thomas Wittmann, Johannes Ruhland, Matthias Eichholz
Unsupervised Profiling for Identifying Superimposed Fraud

Many fraud analysis applications try to detect "probably fraudulent" usage patterns, and to discover these patterns in historical data. This paper builds on a different detection concept; there are no fixed "probably fraudulent" patterns, but any significant deviation from the normal behavior indicates a potential fraud. In order to detect such deviations, a comprehensive representation of "customer behavior" must be used. This paper presents such a representation, and discusses issues derived from it: a distance function and a clustering algorithm for probability distributions.

Uzi Murad, Gadi Pinkas
OPTICS-OF: Identifying Local Outliers

For many KDD applications finding the outliers, i.e. the rare events, is more interesting and useful than finding the common cases, e.g. detecting criminal activities in E-commerce. Being an outlier, however, is not just a binary property. Instead, it is a property that applies to a certain degree to each object in a data set, depending on how ‘isolated’ this object is, with respect to the surrounding clustering structure. In this paper, we formally introduce a new notion of outliers which bases outlier detection on the same theoretical foundation as density-based cluster analysis. Our notion of an outlier is ‘local’ in the sense that the outlier-degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. We demonstrate that this notion of an outlier is more appropriate for detecting different types of outliers than previous approaches, and we also present an algorithm for finding them. Furthermore, we show that by combining the outlier detection with a density-based method to analyze the clustering structure, we can get the outliers almost for free if we already want to perform a cluster analysis on a data set.

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander
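
A heavily simplified sketch of a "local" outlier degree, the ratio between a point's k-NN density and the average density of its neighbours (assuming NumPy); it mirrors the spirit of the notion described above but is not the paper's exact outlier-factor definition.

```python
# Simplified local outlier degree: density of a point vs. density of its neighbours.
import numpy as np

def knn_distances(points, k):
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    idx = np.argsort(dists, axis=1)[:, :k]
    return dists, idx

def local_outlier_degree(points, k=3):
    dists, idx = knn_distances(points, k)
    kth = np.take_along_axis(dists, idx[:, -1:], axis=1).ravel()   # k-distance
    density = 1.0 / (kth + 1e-12)
    neighbour_density = density[idx].mean(axis=1)
    return neighbour_density / density          # > 1 means sparser than its neighbours

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 0.5, size=(30, 2))
    outlier = np.array([[5.0, 5.0]])
    scores = local_outlier_degree(np.vstack([cluster, outlier]))
    print(scores[-1], scores[:5])   # the appended point gets a much larger score
```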

Posters

Selective Propositionalization for Relational Learning

A number of Inductive Logic Programming (ILP) systems have addressed the problem of learning First Order Logic (FOL) discriminant definitions by first reformulating the problem expressed in a FOL framework into an attribute-value problem and then applying efficient algebraic learning techniques. The complexity of such propositionalization methods is now in the size of the reformulated problem, which can be exponential. We propose a method that selectively propositionalizes the FOL training set by interleaving boolean reformulation and algebraic resolution. It avoids, as much as possible, the generation of redundant boolean examples, and still ensures that explicit correct and complete definitions are learned.

Érick Alphonse, Céline Rouveirol
Circle Graphs: New Visualization Tools for Text-Mining

The proliferation of digitally available textual data necessitates automatic tools for analyzing large textual collections. Thus, in analogy to data mining for structured databases, text mining is defined for textual collections. A central tool in text-mining is the analysis of concept relationship, which discovers connections between different concepts, as reflected in the collection. However, discovering these relationships is not sufficient, as they also have to be presented to the user in a meaningful and manageable way. In this paper we introduce a new family of visualization tools, which we coin circle graphs, which provide means for visualizing concept relationships mined from large collections. Circle graphs allow for instant appreciation of multiple relationships gathered from the entire collection. A special type of circle-graphs, called Trend Graphs, allows tracking of the evolution of relationships over time.

Yonatan Aumann, Ronen Feldman, Yaron Ben Yehuda, David Landau, Orly Liphstat, Yonatan Schler
On the Consistency of Information Filters for Lazy Learning Algorithms

A common practice when filtering a case-base is to employ a filtering scheme that decides which cases to delete, as well as how many cases to delete, such that the storage requirements are minimized and the classification competence is preserved or improved. We introduce an algorithm that rivals the most successful existing algorithm in the average case when filtering 30 classification problems. Neither algorithm consistently outperforms the other, with each performing well on different problems. Consistency over many domains, we argue, is very hard to achieve when deploying a filtering algorithm.

Henry Brighton, Chris Mellish
Using Genetic Algorithms to Evolve a Rule Hierarchy

This paper describes the implementation and the functioning of RAGA (Rule Acquisition with a Genetic Algorithm), a genetic-algorithm-based data mining system suitable for both supervised and certain types of unsupervised knowledge extraction from large and possibly noisy databases. The genetic engine is modified through the addition of several methods tuned specifically for the task of association rule discovery. A set of genetic operators and techniques are employed to efficiently search the space of potential rules. During this process, RAGA evolves a default hierarchy of rules, where the emphasis is placed on the group rather than each individual rule. Rule sets of this type are kept simple in both individual rule complexity and the total number of rules that are required. In addition, the default hierarchy deals with the problem of over-fitting, particularly in classification tasks. Several data mining experiments using RAGA are described.

Robert Cattral, Franz Oppacher, Dwight Deugo
Mining Temporal Features in Association Rules

In real world applications, the knowledge that is used for aiding decision-making is always time-varying. However, most of the existing data mining approaches rely on the assumption that discovered knowledge is valid indefinitely. People who expect to use the discovered knowledge may not know when it became valid, or whether it still is valid in the present, or if it will be valid sometime in the future. For supporting better decision making, it is desirable to be able to actually identify the temporal features with the interesting patterns or rules. The major concerns in this paper are the identification of the valid period and periodicity of patterns and more specifically association rules.

Xiaodong Chen, Ilias Petrounias
The Improvement of Response Modeling: Combining Rule-Induction and Case-Based Reasoning

Direct mail is a typical application of response modeling. In order to decide which people will receive the mailing, the potential customers are divided into two groups or classes (buyers and non-buyers) and a response model is created. Since the improvement of response modeling is the purpose of this paper, we suggest a combined approach of rule induction and case-based reasoning. The initial classification of buyers and non-buyers is done by means of the C5 algorithm. To improve the ranking of the classified cases, we introduce rule-predicted typicality. The combination of these two approaches is tested for synergy on a direct mail example.

F. Coenen, G. Swinnen, K. Vanhoof, G. Wets
Analyzing an Email Collection Using Formal Concept Analysis

We demonstrate the use of a data analysis technique called formal concept analysis (FCA) to explore information stored in a set of email documents. The user extends, with her own specialized classifiers, a pre-defined taxonomy of classifiers designed to extract information from email documents. The classifiers extract information from (i) the email headers, providing structured information such as the date received and the from:, to: and cc: lists, (ii) the email body containing free English text, and (iii) conjunctions of the two sources.

Richard Cole, Peter Eklund
Business Focused Evaluation Methods: A Case Study

Classification accuracy or other similar metrics have long been the measures used by researchers in machine learning and data mining research to compare methods and show the usefulness of such methods. Although these metrics are essential to show the predictability of the methods, they are not sufficient. In a business setting other business processes must be taken into consideration. This paper describes additional evaluations we provided potential users of our churn prediction prototype, CHAMP, to better define the characteristics of its predictions.

Piew Datta
Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions

We present a project for probabilistic reasoning based on the concept of maximum entropy and the induction of probabilistic knowledge from data. The basic knowledge source is a database of 15000 patient records which we use to compute probabilistic rules. These rules are combined with explicit probabilistic rules from medical experts which cover cases not represented in the database. Based on this set of rules the inference engine PIT (Probability Induction Tool), which uses the well-known principle of Maximum Entropy [5], provides a unique probability model while keeping the necessary additional assumptions as minimal and clear as possible. PIT is used in the medical diagnosis project Lexmed [4] for the identification of acute appendicitis. Based on the probability distribution computed by PIT, the expert system proposes treatments with minimal average cost. First clinical performance results are very encouraging.

Wolfgang Ertel, Manfred Schramm
Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation?

In many applications of data mining a – sometimes considerable – part of the data values is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, which is based on filling-in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, allowing for different data mining algorithms to be applied to the completed data sets. We compare the use of imputation to surrogate splits, such as used in CART, to handle missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results as it profits from variance reduction in much the same way as procedures such as bagging.

Ad Feelders
Rough Dependencies as a Particular Case of Correlation: Application to the Calculation of Approximative Reducts

Rough Sets Theory provides a sound basis for the extraction of qualitative knowledge (dependencies) from very large relational databases. Dependencies may be expressed by means of formulas (implications) in the following way: $\{x_1, \ldots, x_n\} \Rightarrow_\rho \{y\}$, where $\{x_1, \ldots, x_n\}$ are attributes that induce partitions into equivalence classes on the underlying population. The coefficient $\rho$ is the dependency degree; it establishes the percentage of objects that can be correctly assigned to classes of $y$, taking into account the classification induced by $\{x_1, \ldots, x_n\}$. Dealing with decision tables, it is important to determine $\rho$ and to eliminate redundant attributes from $\{x_1, \ldots, x_n\}$, in order to obtain minimal reducts having the same classification power as the original set. The problem of reduct extraction is NP-hard, so approximative reducts are often determined; these have the same classification power as the original set of attributes but quite often contain redundant attributes. The main idea developed in this paper is that attributes considered as random variables related by means of a dependency are also correlated (the opposite, in general, is not true). From this fact we try to find, making use of well-stated and widely used statistical methods, only the most significant variables, that is to say, the variables that contribute the most (in a quantitative sense) to determining $y$. The set of attributes (in general a subset of $\{x_1, x_2, \ldots, x_n\}$) obtained by means of well-founded, sound statistical methods could be considered a good approximation of a reduct.

María C. Fernandez-Baizán, Ernestina Menasalvas Ruiz, José M. Peña Sánchez, Socorro Millán, Eloina Mesa
A Fuzzy Beam-Search Rule Induction Algorithm

This paper proposes a fuzzy beam search rule induction algorithm for the classification task. The use of fuzzy logic and fuzzy sets not only provides us with a powerful, flexible approach to cope with uncertainty, but also allows us to express the discovered rules in a representation more intuitive and comprehensible for the user, by using linguistic terms (such as low, medium, high) rather than continuous, numeric values in rule conditions. The proposed algorithm is evaluated in two public domain data sets.

Cristina S. Fertig, Alex A. Freitas, Lucia V. R. Arruda, Celso Kaestner
An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining

A variety of techniques have been developed to scale decision tree classifiers in data mining to extract valuable knowledge. However, these approaches either cause a loss of accuracy or cannot effectively uncover the data structure. We explore a more promising GA-based decision tree classifier, OOGASC4.5, which integrates the strengths of decision tree algorithms with statistical sampling and genetic algorithms. The proposed program not only enhances classification accuracy but also has the potential advantage of extracting valuable rules. Computational results are provided along with analysis and conclusions.

Zhiwei Fu
Extension to C-means Algorithm for the Use of Similarity Functions

The C-Means algorithm has motivated many extensions since its first publications. The extensions so far mainly consider the following aspects: the selection of initial seeds (centers), the determination of the optimal number of clusters, and the use of different functionals to generate the clusters. In this paper an extension to the C-Means algorithm is proposed which considers descriptions of the objects (data) with quantitative and qualitative features, and also handles missing data. These types of descriptions are very frequent in soft sciences such as Medicine, Geology, Sociology, Marketing, etc., so the scope of application of the proposed algorithm is very wide. The proposed algorithm uses similarity functions that may be composed of partial similarity functions, which allows objects to be compared by analyzing their subdescriptions. Results using standard public databases [2] are shown. In addition, a comparison with the classical C-Means algorithm [7] is provided.

Javier Raymundo García-Serrano, José Francisco Martínez-Trinidad
Predicting Chemical Carcinogenesis Using Structural Information Only

This paper reports on the application of the Strongly Typed Evolutionary Programming System (STEPS) to the PTE2 challenge, which consists of predicting the carcinogenic activity of chemical compounds from their molecular structure and the outcomes of a number of laboratory analyses. Most contestants so far have relied heavily on results of short term toxicity (STT) assays. Using both types of information made available, most models incorporate attributes that make them strongly dependent on STT results. Although such models may prove to be accurate and informative, the use of toxicological information requires time and cost, and in some cases substantial use of laboratory animals. If toxicological information only makes explicit properties implicit in the molecular structure of chemicals, then, provided a sufficiently expressive representation language is used, accurate solutions may be obtained from the structural information only. Such solutions may offer more tangible insight into the mechanistic paths and features that govern chemical toxicity, as well as prediction based on virtual chemistry for the universe of compounds.

Claire J. Kennedy, Christophe Giraud-Carrier, Douglas W. Bristol
LA – A Clustering Algorithm with an Automated Selection of Attributes, Which is Invariant to Functional Transformations of Coordinates

A clustering algorithm called LA is described. The algorithm is based on comparison of the n-dimensional density of the data points in various regions of the space of attributes, $p(x_1,\ldots,x_n)$, with an expected homogeneous density obtained as a simple product of the corresponding one-dimensional densities $p_i(x_i)$. The regions with a high value of the ratio $\frac{p(x_1,\ldots,x_n)}{p_1(x_1)\ldots p_n(x_n)}$ are considered to contain clusters. A set of attributes which provides the most contrasting clustering is selected automatically. The results obtained with the help of the LA algorithm are invariant to any reparametrization of the clustering space coordinates, i.e. to one-dimensional monotonic functional transformations $x' = f(x)$. Another valuable property of the algorithm is the weak dependence of the computational time on the number of data points.

Mikhail V. Kiselev, Sergei M. Ananyan, Sergey B. Arseniev
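
A rough sketch of the density-ratio idea described above, comparing a joint histogram with the product of its one-dimensional marginals (assuming NumPy); the automated attribute selection and the invariance machinery of LA are not reproduced.

```python
# Joint-density / product-of-marginals ratio over a coarse grid; high-ratio
# cells are cluster candidates in the sense of the abstract.
import numpy as np

def density_ratio(points, bins=5):
    joint, edges = np.histogramdd(points, bins=bins)
    joint = joint / joint.sum()
    marginals = []
    for d in range(points.shape[1]):
        h, _ = np.histogram(points[:, d], bins=edges[d])
        marginals.append(h / h.sum())
    expected = np.ones_like(joint)
    for d, m in enumerate(marginals):      # outer product of the marginals
        shape = [1] * joint.ndim
        shape[d] = -1
        expected = expected * m.reshape(shape)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(expected > 0, joint / expected, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    blob1 = rng.normal([0, 0], 0.3, size=(200, 2))
    blob2 = rng.normal([3, 3], 0.3, size=(200, 2))
    ratios = density_ratio(np.vstack([blob1, blob2]))
    print(ratios.max())   # clearly above 1 in the cells containing the two blobs
```
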
Association Rule Selection in a Data Mining Environment

Data mining methods easily produce large collections of rules, so that the usability of the methods is hampered by the sheer size of the rule set. One way of limiting the size of the result set is to provide the user with tools to help in finding the truly interesting rules. We use this approach in a case study where we search for association rules in NCHS health care data, and select interesting subsets of the result by using a simple query language implemented in the KESO data mining system. Our results emphasize the importance of the explorative approach supported by efficient selection tools.

Mika Klemettinen, Heikki Mannila, A. Inkeri Verkamo
Multi-relational Decision Tree Induction

Discovering decision trees is an important set of techniques in KDD, both because of their simple interpretation and the efficiency of their discovery. One disadvantage is that they do not take the structure of the data into account. By going from the standard single-relation approach to the multi-relational approach as in ILP this disadvantage is removed. However, the straightforward generalisation loses the efficiency. In this paper we present a framework that allows for efficient discovery of multi-relational decision trees through exploitation of domain knowledge encoded in the data model of the database.

Arno J. Knobbe, Arno Siebes, Daniël van der Wallen
Learning of Simple Conceptual Graphs from Positive and Negative Examples

A learning model is considered in terms of formal concept analysis (FCA). This model is generalized for objects represented by sets of graphs with partially ordered labels of vertices and edges (these graphs can be considered as simple conceptual graphs). An algorithm that computes all concepts and the linear (Hasse) diagram of the concept lattice in time linear with respect to the number of concepts is presented. The linear diagram gives the structure of the set of all concepts with respect to the partial order on them and provides a useful tool for browsing or discovery of implications (associations) in data.

Sergei O. Kuznetsov
An Evolutionary Algorithm Using Multivariate Discretization for Decision Rule Induction

We describe EDRL-MD, an evolutionary algorithm-based system for learning decision rules from databases. The main novelty of our approach lies in dealing with continuous-valued attributes. Most decision rule learners use univariate discretization methods, which search for threshold values for one attribute at a time. In contrast to them, EDRL-MD simultaneously searches for threshold values for all continuous-valued attributes when inducing decision rules. We call this approach multivariate discretization. Since multivariate discretization is able to capture interdependencies between attributes, it may improve the accuracy of the obtained rules. The evolutionary algorithm uses problem-specific operators and variable-length chromosomes, which allows it to search for complete rulesets rather than single rules. Preliminary results of experiments on some real-life datasets are presented.

Wojciech Kwedlo, Marek Krętowski
ZigZag, a New Clustering Algorithm to Analyze Categorical Variable Cross-Classification Tables

This paper proposes ZigZag, a new clustering algorithm that works on categorical-variable cross-classification tables. ZigZag simultaneously creates two partitions, of row and of column categories, in accordance with the equivalence relation "to have the same conditional mode". These two partitions are associated one-to-one and onto, creating in that way row-column clusters. Thus, we have an efficient KDD tool which we can apply to any database. Moreover, ZigZag visualizes predictive association for nominal data in the sense of Guttman, Goodman and Kruskal. Accordingly, the prediction rule for a nominal variable Y conditionally on another variable X consists in choosing the conditionally most probable category of Y when knowing X, and the power of this rule is evaluated by the mean proportional reduction in error, denoted by $\lambda_{Y/X}$. It would appear then that the mapping furnished by ZigZag plays for nominal data the same role as the scatter diagram and the curves of conditional means or the regression line play for quantitative data, the first augmented with the values of $\lambda_{Y/X}$ and $\lambda_{X/Y}$, the second with the correlation ratio or the $R^2$.

Stéphane Lallich
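
For reference, a small sketch of the Goodman-Kruskal lambda mentioned above, computed from a cross-classification table with NumPy; the ZigZag clustering itself is not reproduced.

```python
# Goodman-Kruskal lambda (proportional reduction in error) from a
# cross-classification table (rows = categories of X, columns = categories of Y).
import numpy as np

def goodman_kruskal_lambda(table):
    """lambda_{Y|X}: error reduction when predicting Y from X vs. from the mode of Y."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    errors_without_x = n - table.sum(axis=0).max()        # always predict the modal column
    errors_with_x = n - table.max(axis=1).sum()           # predict the modal column per row
    return (errors_without_x - errors_with_x) / errors_without_x if errors_without_x else 0.0

if __name__ == "__main__":
    # hypothetical 3x3 contingency table
    t = [[30,  5,  5],
         [ 5, 25, 10],
         [ 5,  5, 10]]
    print(goodman_kruskal_lambda(t))   # about 0.42
```
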
Efficient Mining of High Confidence Association Rules without Support Thresholds

Association rules describe the degree of dependence between items in transactional datasets by their confidences. In this paper, we first introduce the problem of mining top rules, namely those association rules with 100% confidence. Traditional approaches to this problem need a minimum support (minsup) threshold and then can discover the top rules with supports ≥ minsup; such approaches, however, rely on minsup to help avoid examining too many candidates and they miss those top rules whose supports are below minsup. The low support top rules (e.g. some unusual combinations of some factors that have always caused some disease) may be very interesting. Fundamentally different from previous work, our proposed method uses a dataset partitioning technique and two border-based algorithms to efficiently discover all top rules with a given consequent, without the constraint of support threshold. Importantly, we use borders to concisely represent all top rules, instead of enumerating them individually. We also discuss how to discover all zero-confidence rules and some very high (say 90%) confidence rules using approaches similar to mining top rules. Experimental results using the Mushroom, the Cleveland heart disease, and the Boston housing datasets are reported to evaluate the efficiency of the proposed approach.

Jinyan Li, Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao, Qun Sun
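
A minimal sketch of the confidence computation behind top rules: a rule X -> c is a top rule when every transaction containing X also contains c, whatever its support; the border-based mining algorithms of the paper are not reproduced.

```python
# Confidence of an association rule over a list of transactions (item sets).

def confidence(transactions, antecedent, consequent):
    covering = [t for t in transactions if antecedent <= t]
    if not covering:
        return 0.0
    return sum(1 for t in covering if consequent in t) / len(covering)

if __name__ == "__main__":
    data = [
        {"a", "b", "disease"},
        {"a", "b", "disease"},
        {"a", "c"},
        {"b", "c", "disease"},
    ]
    print(confidence(data, {"a", "b"}, "disease"))   # 1.0 -> top rule, support only 2/4
    print(confidence(data, {"c"}, "disease"))        # 0.5
```
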
A Logical Approach to Fuzzy Data Analysis

In this paper, we investigate the extraction of fuzzy rules from data tables based on possibility theory. A possibilistic decision language is used to represent the extracted fuzzy rules. The algorithm for rule extraction is presented and a complexity analysis is carried out. Since the results of the rule induction process strongly depend on the representation language, we also discuss an approach for dynamic adjustment of the language based on the data.

Churn-Jung Liau, Duen-Ren Liu
AST: Support for Algorithm Selection with a CBR Approach

Providing user support for the application of data mining algorithms in the field of Knowledge Discovery in Databases (KDD) is an important issue. Based on ideas from the fields of statistics, machine learning and knowledge engineering, we have provided a general framework for defining user support. The general framework contains a combined top-down and bottom-up strategy to tackle this problem. In the current paper we describe the Algorithm Selection Tool (AST), which is one component in our framework. AST is designed to support algorithm selection in the knowledge discovery process with a case-based reasoning approach. We discuss the architecture of AST and explain the basic components. We present the evaluation of our approach in a systematic analysis of the case retrieval behaviour, and thus of the selection support offered by our system.

Guido Lindner, Rudi Studer
Efficient Shared Near Neighbours Clustering of Large Metric Data Sets

Very few clustering methods are capable of clustering data without assuming the availability of operations which are defined only in strongly structured spaces, such as vector spaces. We propose an efficient data clustering method based on the shared near neighbours approach, which requires only a distance definition and is capable of discovering clusters of any shape. Using efficient data structures for querying metric data and a scheme for partitioning and sampling the data, the method can cluster effectively and efficiently data sets whose size exceeds the internal memory size.

Stefano Lodi, Luisella Reami, Claudio Sartori
Discovery of “Interesting” Data Dependencies from a Workload of SQL Statements

Discovering data dependencies consists in producing the whole set of a given class of data dependencies holding in a database, the task of selecting the interesting ones usually being left to an expert user. In this paper we take another look at the problems of discovering inclusion and functional dependencies in relational databases. We rigorously define the so-called logical navigation from a workload of SQL statements. This assumption leads us to devise tractable algorithms for discovering "interesting" inclusion and functional dependencies.

S. Lopes, J-M. Petit, F. Toumani
Learning from Highly Structured Data by Decomposition

This paper addresses the problem of learning from highly structured data. Specifically, it describes a procedure, called decomposition, that allows a learner to access automatically the subparts of examples represented as closed terms in a higher-order language. This procedure maintains a clear distinction between the structure of an individual and its properties. A learning system based on decomposition is also presented and several examples of its use are described.

René Mac Kinney-Romero, Christophe Giraud-Carrier
Combinatorial Approach for Data Binarization

This paper addresses the problem of transforming arbitrary data into binary data. This is intended as preprocessing for a supervised classification task. As a binary mapping compresses the total information of the dataset, the goal here is to design a mapping that maintains most of the information relevant to the classification problem. Most of the existing approaches to this problem are based on correlation or entropy measures between one individual binary variable and the partition into classes. On the contrary, the approach proposed here is based on a global study of the combinatorial properties of a set of binary variables.

Eddy Mayoraz, Miguel Moreira
Extending Attribute-Oriented Induction as a Key-Preserving Data Mining Method

Attribute-Oriented Induction (AOI) is a set-oriented data mining technique used to discover descriptive patterns in large databases. The classical AOI method drops attributes that either possess a large number of distinct values or have no concept hierarchies, which includes keys to relational tables. This implies that the final rule(s) produced have no direct link to the tuples that form them. Therefore the discovered knowledge cannot be used to efficiently query specific data pertaining to this knowledge in a relation different from the learning relation. This paper presents the key-preserving AOI algorithm (AOI-KP) with two implementation approaches. The order complexity of the algorithm is O(np), which is the same as for the enhanced AOI algorithm, where n and p are the numbers of input and generalised tuples respectively. An application of the method is illustrated, and prototype tool support and initial results are outlined along with possible improvements.

Maybin K. Muyeba, John A. Keane
Automated Discovery of Polynomials by Inductive Genetic Programming

This paper presents an approach to automated discovery of high-order multivariate polynomials by inductive Genetic Programming (iGP). Evolutionary search is used for learning polynomials represented as non-linear multivariate trees. Optimal search performance is pursued by balancing the statistical bias and the variance of iGP. We reduce the bias by extending the set of basis polynomials for better agreement with the examples. Possible overfitting due to the reduced bias is counteracted by a variance component, implemented as a regularizing factor of the error in an MDL fitness function. Experimental results demonstrate that regularized iGP discovers accurate, parsimonious, and predictive polynomials when trained on practical data mining tasks.

Nikolay Nikolaev, Hitoshi Iba
Diagnosing Acute Appendicitis with Very Simple Classification Rules

A medical database with 257 patients thought to have acute appendicitis has been analyzed. Binary classifiers composed of very simple univariate if-then classification rules (1R rules) were synthesized, and are shown to perform well for determining the true disease status. Discriminatory performance was measured by the area under the receiver operating characteristic (ROC) curve. Although a 1R classifier seemingly performs slightly better than a team of experienced physicians when only readily available clinical variables are employed, an analysis of cross-validated simulations shows that this perceived improvement is not statistically significant (p < 0.613). However, further addition of biochemical test results to the model yields a 1R classifier that is significantly better than both the physicians (p < 0.03) and a 1R classifier based on clinical variables only (p < 0.0003).

Aleksander Øhrn, Jan Komorowski
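
A minimal sketch of 1R induction over categorical attributes follows: for each attribute, map every attribute value to its majority class and keep the attribute whose rules misclassify the fewest training examples. The toy data are hypothetical, not the appendicitis data set of the paper.

```python
# Minimal 1R sketch: one rule set per attribute, keep the attribute
# with the lowest training error. Ties are broken arbitrarily.

from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rule[v]] for v, c in by_value.items())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

if __name__ == "__main__":
    rows = [
        {"fever": "yes", "pain": "right",   "diag": "app"},
        {"fever": "no",  "pain": "right",   "diag": "app"},
        {"fever": "no",  "pain": "diffuse", "diag": "other"},
        {"fever": "yes", "pain": "diffuse", "diag": "other"},
    ]
    attr, rule, errors = one_r(rows, ["fever", "pain"], "diag")
    print(attr, rule, errors)   # expect the 'pain' rule with 0 errors
```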
Rule Induction in Cascade Model Based on Sum of Squares Decomposition

A cascade model is a rule induction methodology using levelwise expansion of an itemset lattice, where the explanatory power of a rule set and its constituent rules is quantitatively expressed. The sum of squares for a categorical variable is decomposed into within-group and between-group sums of squares, where the latter provides a good representation of the power concept in a cascade model. Using the model, we can readily derive discrimination and characteristic rules that explain as much of the sum of squares as possible. Multiple rule sets are derived from the core to the outskirts of knowledge. The sum of squares criterion can be applied in any rule induction system. The cascade model was implemented as DISCAS. Its algorithms are shown and an applied example is provided for illustration purposes.

Takashi Okada
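
The following hedged sketch shows one standard way to decompose the sum of squares of a categorical variable using 0/1 indicator coding of the classes (so the total equals n times the Gini impurity); whether this matches the paper's exact formulation is an assumption, and the data and grouping are hypothetical.

```python
# Within/between decomposition of the sum of squares for a categorical
# variable under indicator coding: total SS = n * Gini impurity, and
# the between-group part measures how much a partition (e.g. a rule's
# coverage vs. the rest) explains.

from collections import Counter

def gini_ss(labels):
    n = len(labels)
    counts = Counter(labels)
    return n * (1.0 - sum((c / n) ** 2 for c in counts.values()))

def decompose(labels, groups):
    """groups: list of index lists partitioning the data set."""
    total = gini_ss(labels)
    within = sum(gini_ss([labels[i] for i in g]) for g in groups)
    between = total - within          # explanatory power of the partition
    return total, within, between

if __name__ == "__main__":
    y = ["a", "a", "b", "b", "b", "a"]
    covered, rest = [0, 1, 5], [2, 3, 4]      # e.g. tuples matching a rule body
    print(decompose(y, [covered, rest]))      # (3.0, 0.0, 3.0)
```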
Maintenance of Discovered Knowledge

The paper addresses the well-known bottleneck of knowledge based system design and implementation – the issue of knowledge maintenance and knowledge evolution throughout the lifecycle of the system. Different machine learning methodologies can support the necessary knowledge-base revision. This process has to be studied along two independent dimensions. The first is concerned with the complexity of the revision process itself, while the second evaluates the quality of decision-making corresponding to the revised knowledge base. The presented case study is an attempt to analyse the relevant questions for a specific problem of industrial configuration of TV transmitters. Inductive Logic Programming (ILP) and Explanation Based Generalisation (EBG), within the Decision Planning (DP) knowledge representation methodology, have been studied, compared, and tested on this example.

Michal Pěchouček, Olga Štěpánková, Petr Mikšovský
A Divisive Initialisation Method for Clustering Algorithms

A method for the initialisation step of clustering algorithms is presented. It is based on the concept of a cluster as a high-density region of points. The search space is modelled as a set of d-dimensional cells. A sample of points is chosen and located into the appropriate cells. Cells are iteratively split as the number of points they receive increases. The regions of the search space having a higher density of points are considered good candidates to contain the true centers of the clusters. Preliminary experimental results show the good quality of the estimated centroids with respect to a random choice of points. The accuracy of the clusters obtained by running the K-Means algorithm with the two different initialisation techniques – random starting centers chosen uniformly on the datasets and centers found by our method – is evaluated, and K-Means is shown to produce better results with our initialisation method.

Clara Pizzuti, Domenico Talia, Giorgio Vonella
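
A simplified, hedged sketch of the idea is given below: a fixed grid replaces the iterative cell splitting of the paper, the centroids of the k densest cells are taken as starting centres, and K-Means is run from them. Data, grid resolution and library choice are illustrative assumptions.

```python
# Density-based initialisation sketch: bin a sample of points into a
# fixed grid, use the centroids of the k densest cells as K-Means
# starting centres.

import numpy as np
from sklearn.cluster import KMeans

def dense_cell_centers(X, k, bins=10):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * bins).astype(int)
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    densest = np.argsort(counts)[::-1][:k]
    return np.array([X[inverse == c].mean(axis=0) for c in densest])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.3, size=(200, 2)) for loc in ((0, 0), (3, 3), (0, 3))])
    centers = dense_cell_centers(X, k=3)
    km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
    print(km.cluster_centers_)
```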
A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series

The aim of this paper is to compare the influence of different model selection criteria on the performance of ARMA- and VAR-models in predicting turning points in nine financial time series. As the true data generating process (DGP) is in general unknown, so is the model that mimics the DGP. In order to find the model which fits the data best, we conduct data mining by estimating a multitude of models and selecting the one that optimizes a well-defined model selection criterion. The focus is on two simple in-sample criteria (AIC, SIC) and a more elaborate out-of-sample model selection procedure. We apply Analysis of Variance to assess which selection criterion produces the best forecasts. Our results indicate that there are no differences in predictive quality when alternative model selection criteria are used.

Thorsten Poddig, Claus Huber
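
As a hedged illustration of in-sample selection with AIC and SIC, the sketch below fits autoregressive models of increasing order by least squares on a synthetic series and reports both criteria per order; the paper's ARMA/VAR setting and out-of-sample procedure are not reproduced.

```python
# AIC / SIC (Schwarz) comparison for AR(p) models fit by least squares
# on a synthetic AR(2) series. Constants are omitted from the criteria,
# which does not affect the ranking of models.

import numpy as np

def fit_ar(y, p):
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)] + [np.ones(len(y) - p)])
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = float(np.sum((target - X @ beta) ** 2))
    return rss, len(target)

def criteria(y, max_p=6):
    out = []
    for p in range(1, max_p + 1):
        rss, n = fit_ar(y, p)
        k = p + 1                                # AR coefficients + intercept
        aic = n * np.log(rss / n) + 2 * k
        sic = n * np.log(rss / n) + k * np.log(n)
        out.append((p, aic, sic))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = np.zeros(300)
    for t in range(2, 300):                      # true AR(2) process
        y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
    for p, aic, sic in criteria(y):
        print(f"p={p}  AIC={aic:.1f}  SIC={sic:.1f}")
```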
Mining Lemma Disambiguation Rules from Czech Corpora

Lemma disambiguation means finding a basic word form, typically the nominative singular for nouns or the infinitive for verbs. In Czech corpora it was observed that 10% of word positions have at least 2 lemmata. We developed a method for lemma disambiguation, based on a combination of ILP and kNN techniques, for the case when no expert domain knowledge is available. We propose a way to use lemma disambiguation rules learned with the ILP system Progol to minimise the number of incorrectly disambiguated words. We present results for the most important subtasks of lemma disambiguation for Czech. Although no knowledge of Czech grammar has been used, the accuracy reaches 93% with a small fraction of words remaining ambiguous.

Luboš Popelínský, Tomáš Pavelek
Adding Temporal Semantics to Association Rules

The development of systems for knowledge discovery in databases, including the use of association rules, has become a major research issue in recent years. Although initially motivated by the desire to analyse large retail transaction databases, the general utility of association rules makes them applicable to a wide range of different learning tasks. However, association rules do not accommodate the temporal relationships that may be intrinsically important within some application domains. In this paper, we present an extension to association rules to accommodate temporal semantics. By finding associated items first and then looking for temporal relationships between them, it is possible to incorporate potentially valuable temporal semantics. Our approach to temporal reasoning accommodates both point-based and interval-based models of time simultaneously. In addition, the use of a generalized taxonomy of temporal relationships supports the generalization of temporal relationships and their specification at different levels of abstraction. This approach also facilitates the possibility of reasoning with incomplete or missing information.

Chris P. Rainsford, John F. Roddick
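
As a hedged, coarse illustration of interval-based temporal semantics between associated items, the sketch below classifies the relation between two time intervals using a small subset of the usual interval relations; the paper's generalized taxonomy and point-based model are not reproduced, and the items are hypothetical.

```python
# Classify the temporal relation between two intervals with a coarse
# taxonomy (a subset of Allen's interval relations).

def interval_relation(a, b):
    """a, b are (start, end) tuples with start <= end."""
    if a[1] < b[0]:
        return "before"
    if a[1] == b[0]:
        return "meets"
    if a[0] == b[0] and a[1] == b[1]:
        return "equals"
    if a[0] >= b[0] and a[1] <= b[1]:
        return "during"
    if a[0] < b[0] < a[1] < b[1]:
        return "overlaps"
    return "other"

if __name__ == "__main__":
    purchase = (3, 5)       # interval during which an associated item was bought
    promotion = (1, 8)      # interval of a related promotion
    print(interval_relation(purchase, promotion))   # -> "during"
```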
Studying the Behavior of Generalized Entropy in Induction Trees Using a M-of-N Concept

This paper studies splitting criteria in decision trees from three original points of view. First, we propose a unified formalization for association measures based on entropy of type beta; this formalization includes popular measures such as the Gini index and Shannon entropy. Second, we generate artificial data from M-of-N concepts whose complexity and class distribution are controlled. Third, our experiment allows us to study the behavior of the measures on datasets of growing complexity. The results show that the differences in performance between measures, which are significant when there is no noise in the data, disappear as the level of noise increases.

R. Rakotomalala, S. Lallich, S. Di Palma
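
The sketch below implements one common "entropy of type beta" family (the Daróczy/Havrda–Charvát entropy), which recovers Shannon entropy as beta tends to 1 and twice the Gini index at beta = 2; whether this matches the paper's exact parameterisation is an assumption.

```python
# Generalized entropy of degree beta:
#   H_beta(p) = (sum_i p_i^beta - 1) / (2^(1-beta) - 1)   for beta != 1,
# with the Shannon entropy (base 2) as the beta -> 1 limit.

import numpy as np

def entropy_beta(p, beta):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if abs(beta - 1.0) < 1e-9:                   # Shannon limit
        return float(-(p * np.log2(p)).sum())
    return float((np.sum(p ** beta) - 1.0) / (2.0 ** (1.0 - beta) - 1.0))

if __name__ == "__main__":
    dist = [0.5, 0.3, 0.2]
    gini = 1.0 - sum(q * q for q in dist)
    print(entropy_beta(dist, 1.0))               # Shannon entropy in bits
    print(entropy_beta(dist, 2.0), 2 * gini)     # equal up to rounding
```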
Discovering Rules in Information Trees

The notion of an information tree was formulated and investigated in 1982-83 by K. Chen and Z. Ras [1,2,3]. In [1] we defined the notion of an optimal tree (one with a minimal number of edges) in a class of semantically equivalent trees and showed how optimal trees can be constructed. Rules can be discovered almost automatically from information trees. In this paper we propose a formalized language used to manipulate information trees, and give its semantics and a complete and sound set of axioms. This complete and sound set of axioms is needed for discovering rules in information trees under the assumption that the conditional attributes and the decision attribute are not arbitrary but are provided by the user.

Zbigniew W. Ras
Mining Text Archives: Creating Readable Maps to Structure and Describe Document Collections

With the ever-growing amount of unstructured textual data on the web, mining these text collections is of increasing importance for the understanding of document archives. The self-organizing map in particular has been shown to be very well suited for this task. However, the interpretation of the resulting document maps still requires a tremendous effort, especially as far as the analysis of the features learned and the characteristics of identified text clusters are concerned. In this paper we present the LabelSOM method which, based on the features learned by the map, automatically assigns a set of keywords to the units of the map to describe the concepts of the underlying text clusters, thus making the characteristics of the various topical areas on the map explicit.

Andreas Rauber, Dieter Merkl
Neuro-fuzzy Data Mining for Target Group Selection in Retail Banking

Data Mining Algorithms are capable of ‘pressing the crude data coal into diamonds of knowledge’. Neuro-Fuzzy Systems (NFS) in particular promise to combine the benefits of both fuzzy systems and neural networks, and are thus able to learn IF-THEN rules, which are easy to interpret, from data. Hence, they are a very promising Data Mining approach. In this case study we describe how to support a bank’s new direct mailing campaign with a Neuro-Fuzzy System, based on data about the bank's customers and their reactions to a past campaign. We will describe how Neuro-Fuzzy Systems can be used as Data Mining tools to extract descriptions of interesting target groups for this bank. We will also show which preprocessing and postprocessing steps are indispensable to make this Neuro-Fuzzy Data Mining kernel work.

Johannes Ruhland, Thomas Wittmann
Mining Possibilistic Set-Valued Rules by Generating Prime Disjunctions

We describe the problem of mining possibilistic set-valued rules in large relational tables containing categorical attributes taking a finite number of values. An example of such a rule might be “IF HOUSEHOLDSIZE={Two OR Three} AND OCCUPATION={Professional OR Clerical} THEN PAYMENT_METHOD={CashCheck (Max=249) OR DebitCard (Max=175)}”. The table semantics is assumed to be represented by a frequency distribution, which is interpreted, with the help of minimum and maximum operations, as a possibility distribution over the corresponding finite multidimensional space. This distribution is approximated by a number of possibilistic prime disjunctions, which represent the strongest patterns. We present an original formal framework generalising the conventional boolean approach to the case of (i) finite-valued variables and (ii) continuous-valued semantics, and propose a new algorithm, called Optimist, for the computationally difficult dual transformation which generates all the strongest prime disjunctions (possibilistic patterns) given a table of data. The algorithm consists of generation, absorption and filtration parts. The generated prime disjunctions can then be used to build rules or for prediction purposes.

Alexandr A. Savinov
Towards Discovery of Information Granules

The amount of electronic data available is growing very fast, and this explosive growth in databases has generated a need for new techniques and tools that can intelligently and automatically extract implicit, previously unknown, hidden and potentially useful information and knowledge from these data. Such tools and techniques are the subject of the field of Knowledge Discovery in Databases. Information granulation is a very natural concept, and appears (under different names) in many methods related to, e.g., data compression, divide and conquer, interval computations, neighborhood systems, and rough sets, among others. In this paper we discuss information granulation in knowledge discovery. The notions related to information granules are their syntax and semantics as well as the inclusion and closeness (similarity) of granules. We discuss some problems of KDD assuming knowledge is represented in the form of information granules. In particular, we use information granules to deal with the problem of stable (robust) pattern extraction.

Andrzej Skowron, Jaroslaw Stepaniuk
Classification Algorithms Based on Linear Combinations of Features

We provide theoretical and algorithmic tools for finding new features which enable better classification of new cases. We propose to search for such features as linear combinations of continuously valued conditions. Regardless of the choice of classification algorithm itself, such an approach compresses the information concerning dependencies between conditional and decision features. The presented results show that properly derived combinations of attributes, treated as new elements of the condition set, may significantly improve the performance of well-known classification algorithms, such as k-NN and rough set based approaches.

Dominik Ślęzak, Jakub Wróblewski
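
A hedged sketch of the idea follows: a new feature is formed as a linear combination of numeric conditions and evaluated by leave-one-out 1-NN accuracy on that single projected feature; the random-direction search used here is only illustrative, not the paper's derivation procedure, and the data are synthetic.

```python
# Search for a linear combination of features that works well as a
# single input to 1-NN, scored by leave-one-out accuracy.

import numpy as np

def loo_1nn_accuracy(z, y):
    d = np.abs(z[:, None] - z[None, :])
    np.fill_diagonal(d, np.inf)
    return float(np.mean(y[np.argmin(d, axis=1)] == y))

def best_linear_feature(X, y, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_acc = None, -1.0
    for _ in range(trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        acc = loo_1nn_accuracy(X @ w, y)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)   # class depends on a linear combination
    w, acc = best_linear_feature(X, y)
    print(w, acc)
```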
Managing Interesting Rules in Sequence Mining

The goal of sequence mining is the discovery of interesting sequences of events. Conventional sequence miners, however, discover only frequent sequences. This limits the applicability of sequence mining in domains such as error detection and web usage analysis. We propose a framework for discovering and maintaining interesting rules and beliefs in the context of sequence mining. We transform frequent sequences discovered by a conventional miner into sequence rules, remove redundant rules and organize the remaining ones into interestingness categories, from which unexpected rules and new beliefs are derived.

Myra Spiliopoulou
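
The sketch below illustrates, in a hedged way, the first step of such a framework: frequent sequences (with hypothetical supports, as they would come from a conventional miner) are turned into sequence rules prefix → remainder with a confidence value; redundancy removal and the interestingness categories are not modelled.

```python
# Turn frequent sequences into sequence rules: every proper prefix
# yields a rule prefix -> remainder with confidence
# support(sequence) / support(prefix).

def sequence_rules(frequent, min_conf=0.5):
    rules = []
    for seq, supp in frequent.items():
        for i in range(1, len(seq)):
            prefix, suffix = seq[:i], seq[i:]
            if prefix in frequent:
                conf = supp / frequent[prefix]
                if conf >= min_conf:
                    rules.append((prefix, suffix, conf))
    return rules

if __name__ == "__main__":
    frequent = {
        ("login",): 0.9,
        ("login", "search"): 0.6,
        ("login", "search", "buy"): 0.3,
    }
    for prefix, suffix, conf in sequence_rules(frequent):
        print(prefix, "->", suffix, f"conf={conf:.2f}")
```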
Support Vector Machines for Knowledge Discovery

In this paper, we apply the support vector machine (SVM) to knowledge discovery (KD) and confirm its effectiveness with a benchmark data set. SVM has been successfully applied to problems in various domains; however, its effectiveness as a KD method is unknown. We propose SVM for KD, which deals with a binary classification problem, by rescaling each attribute based on z-scores. SVM for KD can sort attributes with respect to their effectiveness in discriminating classes. Moreover, SVM for KD can discover examples that are crucial for discrimination. We set six discovery tasks on the meningoencephalitis data set, which is a benchmark data set in KD. A domain expert ranked the discovery outcomes of SVM for KD from one to five with respect to several criteria. The attributes selected in the six tasks are all valid and useful: their average scores are 3.8-4.0. Discovering the order of attributes with respect to usefulness is a challenging problem; nevertheless, our method achieved a score of at least 4.0 in three tasks. Besides, the examples found to be crucial for discrimination and typical for each class agree with medical knowledge. These promising results demonstrate the effectiveness of our approach.

Shinsuke Sugaya, Einoshin Suzuki, Shusaku Tsumoto
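
The following hedged sketch shows the general flavour of the approach described above: attributes are rescaled to z-scores, a linear SVM is trained, attributes are ranked by the magnitude of their weights, and examples closest to the separating hyperplane are reported as candidate crucial examples. The data and attribute names are synthetic stand-ins, not the meningoencephalitis data set.

```python
# z-score rescaling + linear SVM; rank attributes by |weight| and list
# the examples nearest the decision boundary.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (1.5 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=150) > 0).astype(int)
names = ["attr_a", "attr_b", "attr_c", "attr_d"]          # hypothetical attributes

Xz = StandardScaler().fit_transform(X)                    # z-score rescaling
svm = SVC(kernel="linear").fit(Xz, y)

weights = svm.coef_.ravel()
ranking = sorted(zip(names, np.abs(weights)), key=lambda t: -t[1])
print("attribute ranking:", ranking)

margin_dist = np.abs(svm.decision_function(Xz))
print("closest (crucial) examples:", np.argsort(margin_dist)[:5])
```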
Regression by Feature Projections

This paper describes a machine learning method, called Regression by Feature Projections (RFP), for predicting a real-valued target feature. In RFP, training is based on simply storing the projections of the training instances on each feature separately. Prediction is computed through two approximation procedures. The first finds the individual predictions of the features by using the k-nearest neighbor algorithm (KNN); the second combines the predictions of all features. During the first step, each feature is associated with a weight that reflects its prediction ability at the local query point. The weights, found for each local query point, are used in the second step and give the method an adaptive, context-sensitive nature. We have compared RFP with the KNN algorithm. Results on real data sets show that RFP is much faster than KNN, while its prediction accuracy is comparable with that of KNN.

İlhan Uysal, H. Altay Güvenir
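
A hedged sketch in the spirit of RFP follows: each feature predicts the target via k nearest neighbours on that feature's projection alone, and the per-feature predictions are combined with weights reflecting local agreement. The weighting scheme used here is a simple stand-in, not the paper's, and the data are synthetic.

```python
# Per-feature kNN prediction followed by a weighted combination.

import numpy as np

def rfp_predict(X, y, query, k=5):
    preds, weights = [], []
    for j in range(X.shape[1]):
        idx = np.argsort(np.abs(X[:, j] - query[j]))[:k]   # neighbours on feature j only
        local = y[idx]
        preds.append(local.mean())
        weights.append(1.0 / (local.var() + 1e-6))         # trust features with consistent neighbours
    preds, weights = np.array(preds), np.array(weights)
    return float(np.sum(weights * preds) / np.sum(weights))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.uniform(-1, 1, size=(300, 3))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)    # only feature 0 is informative
    print(rfp_predict(X, y, query=np.array([0.5, 0.0, 0.0])))
```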
Generating Linguistic Fuzzy Rules for Pattern Classification with Genetic Algorithms

This paper presents a new genetic-based approach to automatically extracting classification knowledge from numerical data by means of premise learning. A genetic algorithm is utilized to search for premise structure in combination with parameters of membership functions of input fuzzy sets to yield optimal conditions of classification rules. The consequent under a specific condition is determined by choosing, from all possible candidates, the class which leads to a maximal truth value of the rule. The major advantage of our work is that a parsimonious knowledge base with a low number of classification rules is made possible. The effectiveness of the proposed method is demonstrated by simulation results on the Iris data.

N. Xiong, L. Litz

Tutorials

Data Mining for Robust Business Intelligence Solutions

Data Mining has quickly matured from isolated, small-scale, PC-based, single-algorithm techniques to robust analytical solutions which combine various artificial intelligence algorithms, massively parallel technology, direct two-way access to relational databases and open systems with published Application Programming Interfaces. For an organization, this opens new opportunities, but it also generates a number of integration challenges. This tutorial will introduce Data Mining from the perspective of a large organization. It will describe why Data Mining is essential for new business development and will present its enablers and drivers. The typical, historically diversified organizational structure of Business Intelligence competency in a large organization will be introduced, and the challenges posed to a large-scale implementation of Data Mining will be portrayed.

Jan Mrazek
Query Languages for Knowledge Discovery in Databases

Discovering knowledge from data is a complex, iterative and interactive process containing many steps: understanding the data, preparing the data set, discovering potentially interesting patterns (mining phase), postprocessing of discovered patterns and finally putting the results to use. Different kinds of patterns might be used and therefore different data mining techniques are needed (e.g., association and episode rules for alarm analysis, clusters and decision trees for sales data analysis, inclusion and functional dependencies for database reverse engineering, etc.). This tutorial addresses the challenge of supporting KDD processes following a querying approach. Following Imielinski and Mannila [6], second-generation data mining systems might support the whole process by means of powerful query languages. We not only present the state of the art in that field but also introduce a research perspective, the so-called inductive database framework [3]. It is out of the scope of our presentation to consider coupling architectures between data mining algorithms and database management systems. Instead, we focus on user-written queries that capture the users' needs during a KDD process. The popular association rule mining process is used to illustrate most of the concepts.

Jean-François Boulicaut
The ESPRIT Project CreditMine and its Relevance for the Internet Market

Although data mining technology is considered state-of-the-art, it is still an embryonic technology employed only by a group of early adopters. Until now, very few organisations have made the large investment of bringing together their corporate data in a data-warehouse-type environment for the purpose of data mining. Only large and wealthy companies have experimented with such methodologies, and the outcomes of their efforts were not disclosed, as they were considered confidential material. Thus there is currently a lack of information concerning the effectiveness of such investments. The aim of the project is to provide decision support for companies, especially in the banking sector, for investment in data mining.

Susanne Köhler, Michael Krieger
Logics and Statistics for Association Rules and Beyond
Abstract of Tutorial

The aim of the tutorial is four-fold:
1. To present a very natural class of logical systems suitable for formalizing, generating and evaluating statements on dependences found in given data. The popular association rules form a particular, but by far not the only, example. Technically, our logical systems are monadic observational predicate calculi, i.e. calculi with generalized quantifiers, only finite models and effective semantics. Be not shocked by these terms; no non-trivial knowledge of predicate logic will be assumed. Logic includes deductive properties, i.e. the possibility to deduce the truth of a sentence from other sentences already found true. Transparent deduction rules are very useful for systematic data mining. Special attention will be paid to sentences formalizing expressions of the form “many objects having a certain combination A of attributes also have a combination B” and, more generally, “combinations A, B of attributes are associated (dependent, correlated etc.) in a precisely defined manner”.
2. To show how suitable observational sentences are related to statistical hypothesis testing (an aspect that sometimes appears unjustly neglected in data mining). A general pattern of statistical inference will be presented in logical terms (theoretical calculi and their relation to observational ones). In particular, the statistical meaning of two variants of “associational rules” as well as of some “symmetric associations” will be explained.
3. To present a short history of the GUHA method of automatic generation of hypotheses (General Unary Hypothesis Automaton). It is an original Czech method of exploratory data analysis, one of the oldest approaches to KDD or data mining. Its principle was formulated long before the advent of data mining. Its theoretical foundations (as presented in [5] and later publications) lead to the theory described in points (1) and (2) above.
4. Finally, to show how modern fuzzy logic (in the narrow sense of the word, i.e. particular many-valued symbolic logic) may enter the domain of KDD and fruitfully generalize the field. This fourth point will concentrate on open problems and research directions.
The tutorial will be complemented by demonstrations of two recent implementations of the GUHA method.

Petr Hájek, Jan Rauch
Data Mining for the Web

The web is being increasingly used as a borderless marketplace for the purchase and exchange of goods, the most prominent among them being information. To remain competitive, companies must recognize the interests and demands of their users and adjust their products, services and sites accordingly. In this tutorial, we discuss data mining for the web. The relevant research can be categorized into (i) web usage mining, focusing on the analysis of the navigational activities of web users, (ii) web text mining, concentrating on the information being acquired by the users, and (iii) user modelling, which exploits all information and knowledge available about the web users to build user profiles.

Myra Spiliopoulou
Relational Learning and Inductive Logic Programming Made Easy
Abstract of Tutorial

The tutorial will provide answers to the basic questions about inductive logic programming. In particular: What is inductive logic programming and relational learning? What are the differences between attribute-value learning and inductive logic programming? For what types of applications do we need inductive logic programming?

Luc De Raedt, Hendrik Blockeel
Backmatter
Metadata
Title
Principles of Data Mining and Knowledge Discovery
Editors
Jan M. Żytkow
Jan Rauch
Copyright Year
1999
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-48247-5
Print ISBN
978-3-540-66490-1
DOI
https://doi.org/10.1007/b72280