
2002 | Book

Principles of Data Mining and Knowledge Discovery

6th European Conference, PKDD 2002 Helsinki, Finland, August 19–23, 2002 Proceedings

Edited by: Tapio Elomaa, Heikki Mannila, Hannu Toivonen

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Contributed Papers

Optimized Substructure Discovery for Semi-structured Data

In this paper, we consider the problem of discovering interesting substructures from a large collection of semi-structured data in the framework of optimized pattern discovery. We model semi-structured data and patterns with labeled ordered trees, and present an efficient algorithm that discovers the best labeled ordered trees that optimize a given statistical measure, such as the information entropy and the classification accuracy, in a collection of semi-structured data. We give theoretical analyses of the computational complexity of the algorithm for patterns with bounded and unbounded size. Experiments show that the algorithm performs well and discovers interesting patterns on real datasets.

Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa
Fast Outlier Detection in High Dimensional Spaces

In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its k nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after d* ≪ d + 1 steps and it scales linearly both in the dimensionality and the size of the data set.

Fabrizio Angiulli, Clara Pizzuti
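The weight definition in the abstract above is simple to state directly. A minimal brute-force sketch in Python, for illustration only (the paper's contribution is the Hilbert space filling curve linearization that avoids exactly this quadratic computation):

```python
import numpy as np

def knn_weights(points, k):
    """Weight of a point = sum of distances to its k nearest neighbours (brute force)."""
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))   # full pairwise distance matrix
    dists.sort(axis=1)                          # column 0 is the zero self-distance
    return dists[:, 1:k + 1].sum(axis=1)

def top_n_outliers(points, k, n):
    """Indices of the n points with the largest weight, i.e. the reported outliers."""
    return np.argsort(knn_weights(points, k))[-n:][::-1]
```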
Data Mining in Schizophrenia Research — Preliminary Analysis

We describe methods used and some results in a study of schizophrenia in a population of affected and unaffected participants, called patients and controls. The subjects are characterized by diagnosis, genotype, brain anatomy (MRI), laboratory tests on blood samples, and basic demographic data. The long term goal is to identify the causal chains of processes leading to disease. We describe a number of preliminary findings, which confirm earlier results on deviations of brain tissue volumes in schizophrenia patients, and also indicate new effects that are presently under further investigation. More importantly, we discuss a number of issues in selection of methods from the very large set of tools in data mining and statistics.

Stefan Arnborg, Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall
Fast Algorithms for Mining Emerging Patterns

Emerging Patterns are itemsets whose supports change significantly from one dataset to another. They are useful as a means of discovering distinctions inherently present amongst a collection of datasets and have been shown to be a powerful technique for constructing accurate classifiers. The task of finding such patterns is challenging though, and efficient techniques for their mining are needed. In this paper, we present a new mining method for a particular type of emerging pattern known as a jumping emerging pattern. The basis of our algorithm is the construction of trees, whose structure specifically targets the likely distribution of emerging patterns. The mining performance is typically around 5 times faster than earlier approaches. We then examine the problem of computing a useful subset of the possible emerging patterns. We show that such patterns can be mined even more efficiently (typically around 10 times faster), with little loss of precision.

James Bailey, Thomas Manoukian, Kotagiri Ramamohanarao
On the Discovery of Weak Periodicities in Large Time Series

The search for weak periodic signals in time series data is an active topic of research. Given that a real-world dataset is rarely perfectly periodic, this paper approaches this problem in terms of data mining, trying to discover weak periodic signals in time series databases when no period length is known in advance. In existing time series mining algorithms, the period length is user-specified. We propose an algorithm for finding approximate periodicities in large time series data, utilizing the autocorrelation function and the FFT. This algorithm is an extension to the partial periodicity detection algorithm presented in a previous paper of ours. We provide some mathematical background as well as experimental results.

Christos Berberidis, Ioannis Vlahavas, Walid G. Aref, Mikhail Atallah, Ahmed K. Elmagarmid
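The FFT-based autocorrelation step mentioned in the abstract above can be sketched as follows (a hedged illustration; the paper's partial-periodicity detection and candidate-period filtering are not reproduced here):

```python
import numpy as np

def autocorrelation(x):
    """Autocorrelation of a series via the FFT (Wiener-Khinchin theorem)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    spectrum = np.fft.rfft(x, n=2 * len(x))        # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(np.abs(spectrum) ** 2)[:len(x)]
    return acf / acf[0]                            # normalize so acf[0] == 1

def candidate_periods(x, threshold=0.5):
    """Lags whose autocorrelation exceeds a threshold are candidate period lengths."""
    acf = autocorrelation(x)
    return [lag for lag in range(1, len(acf)) if acf[lag] >= threshold]
```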
The Need for Low Bias Algorithms in Classification Learning from Large Data Sets

This paper reviews the appropriateness for application to large data sets of standard machine learning algorithms, which were mainly developed in the context of small data sets. Sampling and parallelisation have proved useful means for reducing computation time when learning from large data sets. However, such methods assume that algorithms that were designed for use with what are now considered small data sets are also fundamentally suitable for large data sets. It is plausible that optimal learning from large data sets requires a different type of algorithm to optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm — the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective when using an algorithm that places greater emphasis on bias management, rather than variance management.

Damien Brain, Geoffrey I. Webb
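For orientation, the additive decomposition the abstract above refers to is commonly written as follows (this is the generic textbook form, not necessarily the exact definition used in the paper):

```latex
% Expected classification error at a point x, decomposed into irreducible noise,
% (squared) bias and variance of the learned classifier over training samples:
\mathbb{E}\bigl[\mathrm{error}(x)\bigr] \;=\; \mathrm{noise}(x) \;+\; \mathrm{bias}^2(x) \;+\; \mathrm{variance}(x)
```

Algorithms that manage bias aggressively trade off against variance; the paper's claim is that this trade-off shifts as the data set grows.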
Mining All Non-derivable Frequent Itemsets

Recent studies on frequent itemset mining algorithms resulted in significant performance improvements. However, if the minimal support threshold is set too low, or the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets. The main goal of this paper is to identify redundancies in the set of all frequent itemsets and to exploit these redundancies in order to reduce the result of a mining operation. We present deduction rules to derive tight bounds on the support of candidate itemsets. We show how the deduction rules allow for constructing a minimal representation for all frequent itemsets. We also present connections between our proposal and recent proposals for concise representations and we give the results of experiments on real-life datasets that show the effectiveness of the deduction rules. In fact, the experiments even show that in many cases, first mining the concise representation, and then creating the frequent itemsets from this representation outperforms existing frequent set mining algorithms.

Toon Calders, Bart Goethals
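The deduction rules are inclusion-exclusion bounds on the support of an itemset in terms of the supports of its proper subsets. In the usual formulation (our rendering; the paper's notation may differ slightly) the rule for an itemset I and a subset X ⊆ I reads:

```latex
% Bound on supp(I) derived from the supports of all J with X \subseteq J \subsetneq I:
\mathrm{supp}(I) \;\le\; \sum_{X \subseteq J \subsetneq I} (-1)^{\,|I \setminus J| + 1}\,\mathrm{supp}(J)
  \quad\text{if } |I \setminus X| \text{ is odd},
\qquad
\mathrm{supp}(I) \;\ge\; \sum_{X \subseteq J \subsetneq I} (-1)^{\,|I \setminus J| + 1}\,\mathrm{supp}(J)
  \quad\text{if } |I \setminus X| \text{ is even}.
```

An itemset whose resulting lower and upper bounds coincide is derivable: its support follows from its subsets and need not be counted or stored.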
Iterative Data Squashing for Boosting Based on a Distribution-Sensitive Distance

This paper proposes, for boosting, a novel method which prevents deterioration of accuracy inherent to data squashing methods. Boosting, which constructs a highly accurate classification model by combining multiple classification models, requires long computational time. Data squashing, which speeds up a learning method by abstracting the training data set to a smaller data set, typically lowers accuracy. Our SB (Squashing-Boosting) loop, based on a distribution-sensitive distance, alternates data squashing and boosting, and iteratively refines an SF (Squashed-Feature) tree, which provides an appropriately squashed data set. Experimental evaluation with artificial data sets and the KDD Cup 1999 data set clearly shows superiority of our method compared with conventional methods. We have also empirically evaluated our distance measure as well as our SF tree, and found them superior to alternatives.

Yuta Choki, Einoshin Suzuki
Finding Association Rules with Some Very Frequent Attributes

A key stage in the discovery of Association Rules in binary databases involves the identification of the “frequent sets”, i.e. those sets of attributes that occur together often enough to invite further attention. This stage is also the most computationally demanding, because of the exponential scale of the search space. Particular difficulty is encountered in dealing with very densely-populated data. A special case of this is that of, for example, demographic or epidemiological data, which includes some attributes with very frequent instances, because large numbers of sets involving these attributes will need to be considered. In this paper we describe methods to address this problem, using methods and heuristics applied to a previously-presented generic algorithm, Apriori-TFP. The results we present demonstrate significant performance improvements over the original Apriori-TFP in datasets which include subsets of very frequently-occurring attributes.

Frans Coenen, Paul Leng
Unsupervised Learning: Self-aggregation in Scaled Principal Component Space

We demonstrate that data clustering amounts to a dynamic process of self-aggregation in which data objects move towards each other to form clusters, revealing the inherent pattern of similarity. Self-aggregation is governed by connectivity and occurs in a space obtained by a nonlinear scaling of principal component analysis (PCA). The method combines dimensionality reduction with clustering into a single framework. It can apply to both square similarity matrices and rectangular association matrices.

Chris Ding, Xiaofeng He, Hongyuan Zha, Horst Simon
A Classification Approach for Prediction of Target Events in Temporal Sequences

Learning to predict significant events from sequences of data with categorical features is an important problem in many application areas. We focus on events for system management, and formulate the problem of prediction as a classification problem. We perform co-occurrence analysis of events by means of Singular Value Decomposition (SVD) of the examples constructed from the data. This process is combined with Support Vector Machine (SVM) classification, to obtain efficient and accurate predictions. We conduct an analysis of statistical properties of event data, which explains why SVM classification is suitable for such data, and perform an empirical study using real data.

Carlotta Domeniconi, Chang-shing Perng, Ricardo Vilalta, Sheng Ma
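A hedged sketch of the overall pipeline described above (SVD of co-occurrence features followed by SVM classification); the construction of features from raw event logs, which is the substance of the paper, is stood in for here by random counts:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(300, 50)).astype(float)   # stand-in event co-occurrence counts
y = (X[:, :5].sum(axis=1) > 6).astype(int)            # stand-in "target event" labels

# Reduce the co-occurrence features with an SVD, then classify with a linear SVM.
model = make_pipeline(TruncatedSVD(n_components=10, random_state=0),
                      LinearSVC(C=1.0, max_iter=5000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```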
Privacy-Oriented Data Mining by Proof Checking

This paper shows a new method which promotes ownership of data by people about whom the data was collected. The data owner may preclude the data from being used for some purposes, and allow it to be used for other purposes. We show an approach, based on checking the proofs of program properties, which implements this idea and provides a tool for a verifiable implementation of the Use Limitation Principle. The paper discusses in detail a scheme which implements data privacy following the proposed approach, presents the technical components of the solution, and shows a detailed example. We also discuss a mechanism by which the proposed method could be introduced in industrial practice.

Amy Felty, Stan Matwin
Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification

Good feature selection is essential for text classification to make it tractable for machine learning, and to improve classification performance. This study benchmarks the performance of twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc. using Support Vector Machines. The results are analyzed for various objectives. For best accuracy, F-measure or recall, the findings reveal an outstanding new feature selection metric, “Bi-Normal Separation” (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose one or two metrics to try that are most likely to have the best performance for the single dataset at hand. This analysis determined, for example, that IG and Chi-Squared have correlated failures for precision, and that IG paired with BNS is a better choice.

George Forman
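Bi-Normal Separation measures how far apart a feature's true-positive and false-positive rates sit under an inverse standard-normal transform. A small sketch (function and parameter names are ours):

```python
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation of a binary feature:
    |F^-1(tpr) - F^-1(fpr)|, with F^-1 the inverse standard normal CDF.
    Rates are clipped away from 0 and 1 so the inverse CDF stays finite."""
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# Example: a word occurring in 80 of 100 positive and 5 of 900 negative documents.
print(bns(tp=80, fp=5, pos=100, neg=900))
```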
Generating Actionable Knowledge by Expert-Guided Subgroup Discovery

This paper discusses actionable knowledge generation. Actionable knowledge is explicit symbolic knowledge, typically presented in the form of rules, that allows the decision maker to recognize some important relations and to perform an action, such as targeting a direct marketing campaign, or planning a population screening campaign aimed at targeting individuals with high disease risk. The disadvantages of using standard classification rule learning for this task are discussed, and a subgroup discovery approach proposed. This approach uses a novel definition of rule quality which is extensively discussed.

Dragan Gamberger, Nada Lavrač
Clustering Transactional Data

In this paper we present a partitioning method capable of managing transactions, namely tuples of variable size of categorical data. We adapt the standard definition of mathematical distance used in the K-Means algorithm to represent dissimilarity among transactions, and redefine the notion of cluster centroid. The cluster centroid is used as the representative of the common properties of cluster elements. We show that using our concept of cluster centroid together with Jaccard distance we obtain results that are comparable in quality with the most used transactional clustering approaches, but substantially improve their efficiency.

Fosca Giannotti, Cristian Gozzi, Giuseppe Manco
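The Jaccard distance between two transactions, and one simple way to realise a transactional cluster centroid, can be sketched as follows (the centroid shown is only an illustration of the idea; the paper's redefined centroid may differ in detail):

```python
from collections import Counter

def jaccard_distance(t1, t2):
    """Jaccard distance between two transactions (sets of categorical items)."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        return 0.0
    return 1.0 - len(t1 & t2) / len(union)

def cluster_centroid(cluster, min_frequency=0.5):
    """Illustrative transactional centroid: the items occurring in at least a
    given fraction of the cluster's transactions."""
    counts = Counter(item for t in cluster for item in set(t))
    return {item for item, c in counts.items() if c / len(cluster) >= min_frequency}
```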
Multiscale Comparison of Temporal Patterns in Time-Series Medical Databases

This paper presents a method for analyzing time-series data on laboratory examinations based on phase-constraint multiscale matching and rough clustering. Multiscale matching compares two subsequences throughout various scales of view. It has the advantage of preserving connectivity of subsequences even if the subsequences are represented at different scales. Rough clustering groups objects according not to topographic measures such as the center or deviance of objects in a cluster but to the relative similarity and indiscernibility of objects. We use multiscale matching to obtain the similarity of sequences and rough clustering to cluster the sequences according to the obtained similarity. We slightly modified the dissimilarity measure in multiscale matching so that it suppresses excessive shift of phase that may cause incorrect matching of the sequences. Experimental results on the hepatitis dataset show that the proposed method successfully clustered similar sequences into an independent cluster, and that the correspondence of subsequences is also successfully captured.

Shoji Hirano, Shusaku Tsumoto
Association Rules for Expressing Gradual Dependencies

Data mining methods originally designed for binary attributes can generally be extended to quantitative attributes by partitioning the related numeric domains. This procedure, however, comes along with a loss of information and, hence, has several disadvantages. This paper shows that fuzzy partitions can overcome some of these disadvantages. Particularly, fuzzy partitions allow for the representation of association rules expressing a tendency, that is, a gradual dependence between attributes. This type of rule is introduced and investigated from a conceptual as well as a computational point of view. The evaluation and representation of a gradual association is based on linear regression analysis. Furthermore, a complementary type of association, expressing absolute deviations rather than tendencies, is discussed in this context.

Eyke Hüllermeier
Support Approximations Using Bonferroni-Type Inequalities

The purpose of this paper is to examine the usability of Bonferroni-type combinatorial inequalities to estimation of support of itemsets as well as general Boolean expressions. Families of inequalities for various types of Boolean expressions are presented and evaluated experimentally.

Szymon Jaroszewicz, Dan A. Simovici
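The classical Bonferroni inequalities behind this idea bracket the probability of a union by truncated inclusion-exclusion sums; reading each P(A_i) as the relative support of an item, the same truncated sums bound the support of a disjunction using only the supports of small itemsets:

```latex
% With S_k = \sum_{i_1 < \dots < i_k} P(A_{i_1} \cap \dots \cap A_{i_k}),
% the probability of a union satisfies, for every m:
\sum_{k=1}^{2m} (-1)^{k-1} S_k \;\le\; P\!\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \;\le\; \sum_{k=1}^{2m+1} (-1)^{k-1} S_k .
```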
Using Condensed Representations for Interactive Association Rule Mining

Association rule mining is a popular data mining task. It has an interactive and iterative nature, i.e., the user has to refine his mining queries until he is satisfied with the discovered patterns. To support such an interactive process, we propose to optimize sequences of queries by means of a cache that stores information from previous queries. Unlike related works, we use condensed representations like free and closed itemsets for both data mining and caching. This results in a much more efficient mining technique in highly correlated data and a much smaller cache than in previous approaches.

Baptiste Jeudy, Jean-François Boulicaut
Predicting Rare Classes: Comparing Two-Phase Rule Induction to Cost-Sensitive Boosting

Learning good classifier models of rare events is a challenging task. On such problems, the recently proposed two-phase rule induction algorithm, PNrule, outperforms other non-meta methods of rule induction. Boosting is a strong meta-classifier approach, and has been shown to be adaptable to skewed class distributions. PNrule’s key feature is to identify the relevant false positives and to collectively remove them. In this paper, we qualitatively argue that this ability is not guaranteed by the boosting methodology. We simulate learning scenarios of varying difficulty to demonstrate that this fundamental qualitative difference in the two mechanisms results in existence of many scenarios in which PNrule achieves comparable or significantly better performance than AdaCost, a strong cost-sensitive boosting algorithm. Even a comparable performance by PNrule is desirable because it yields a more easily interpretable model over an ensemble of models generated by boosting. We also show similar supporting results on real-world and benchmark datasets.

Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar
Dependency Detection in MobiMine and Random Matrices

This paper describes a novel approach to detect correlation from data streams in the context of MobiMine — an experimental mobile data mining system. It presents a brief description of the MobiMine and identifies the problem of detecting dependencies among stocks from incrementally observed financial data streams. This is a non-trivial problem since the stock-market data is inherently noisy and small incremental volumes of data make the estimation process more vulnerable to noise. This paper presents EDS, a technique to estimate the correlation matrix from data streams by exploiting some properties of the distribution of eigenvalues for random matrices. It separates the “information” from the “noise” by comparing the eigen-spectrum generated from the observed data with that of random matrices. The comparison immediately leads to a decomposition of the covariance matrix into two matrices: one capturing the “noise” and the other capturing useful “information.” The paper also presents experimental results using Nasdaq 100 stock data.

Hillol Kargupta, Krishnamoorthy Sivakumar, Samiran Ghosh
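A hedged sketch of the random-matrix idea described above: eigenvalues of an empirical correlation matrix that fall inside the band expected for a purely random matrix are treated as noise, and only the components above that band are kept as information. This uses the standard Marchenko-Pastur upper edge and is illustrative only, not the paper's EDS algorithm itself:

```python
import numpy as np

def informative_components(streams):
    """streams: T observations (rows) of N parallel data streams (columns).
    Returns the eigenvalues/eigenvectors of the correlation matrix lying above
    the Marchenko-Pastur upper edge expected for a random matrix of this shape."""
    T, N = streams.shape
    corr = np.corrcoef(streams, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    lambda_max = (1.0 + np.sqrt(N / T)) ** 2    # random-matrix "noise" upper edge
    keep = eigvals > lambda_max
    return eigvals[keep], eigvecs[:, keep]
```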
Long-Term Learning for Web Search Engines

This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium sized collection of webpages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.

Charles Kemp, Kotagiri Ramamohanarao
Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database

SubgroupMiner is an advanced subgroup mining system supporting multirelational hypotheses, efficient data base integration, discovery of causal subgroup structures, and visualization based interaction options. When searching for dependencies between subgroups and a target group, spatial subgroups with multirelational descriptions are explored. Search strategies of data mining algorithms are efficiently integrated with queries in an object-relational query language and executed in a database to enable scalability for spatial data.

Willi Klösgen, Michael May
Involving Aggregate Functions in Multi-relational Search

The fact that data is scattered over many tables causes many problems in the practice of data mining. To deal with this problem, one either constructs a single table by propositionalisation, or uses a Multi-Relational Data Mining algorithm. In either case, one has to deal with the non-determinacy of one-to-many relationships. In propositionalisation, aggregate functions have already proven to be powerful tools to handle this non-determinacy. In this paper we show how aggregate functions can be incorporated in the dynamic construction of patterns of Multi-Relational Data Mining.

Arno J. Knobbe, Arno Siebes, Bart Marseille
Information Extraction in Structured Documents Using Tree Automata Induction

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work for IE from structured documents formatted in HTML or XML uses techniques for IE from strings, such as grammar and automata induction. However, such documents have a tree structure. Hence it is natural to investigate methods that are able to recognise and exploit this tree structure. We do this by exploring the use of tree automata for IE in structured documents. Experimental results on benchmark data sets show that our approach compares favorably with previous approaches.

Raymond Kosala, Jan Van den Bussche, Maurice Bruynooghe, Hendrik Blockeel
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

With the availability of large scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique based on a variant of semi-discrete matrix decomposition (SDD), which is capable of compressing large discrete-valued datasets in an error bounded fashion. We show that this process of compression can be thought of as identifying dominant patterns in underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.

Mehmet Koyutürk, Ananth Grama, Naren Ramakrishnan
Geography of Differences between Two Classes of Data

Easily comprehensible ways of capturing main differences between two classes of data are investigated in this paper. In addition to examining individual differences, we also consider their neighbourhood. The new concepts are applied to three gene expression datasets to discover diagnostic gene groups. Based on the idea of prediction by collective likelihoods (PCL), a new method is proposed to classify testing samples. Its performance is competitive with several state-of-the-art algorithms.

Jinyan Li, Limsoon Wong
Rule Induction for Classification of Gene Expression Array Data

Gene expression array technology has rapidly become a standard tool for biologists. Its use within areas such as diagnostics, toxicology, and genetics, calls for good methods for finding patterns and prediction models from the generated data. Rule induction is one promising candidate method due to several attractive properties such as high level of expressiveness and interpretability. In this work we investigate the use of rule induction methods for mining gene expression patterns from various cancer types. Three different rule induction methods are evaluated on two public tumor tissue data sets. The methods are shown to obtain as good prediction accuracy as the best current methods, at the same time allowing for straightforward interpretation of the prediction models. These models typically consist of small sets of simple rules, which associate a few genes and expression levels with specific types of cancer. We also show that information gain is a useful measure for ranked feature selection in this domain.

Per Lidén, Lars Asker, Henrik Boström
Clustering Ontology-Based Metadata in the Semantic Web

The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Recently, different applications based on this vision have been designed, e.g. in the fields of knowledge management, community web portals, e-learning, multimedia retrieval, etc. It is obvious that the complex metadata descriptions generated on the basis of pre-defined ontologies serve as perfect input data for machine learning techniques. In this paper we propose an approach for clustering ontology-based metadata. Main contributions of this paper are the definition of a set of similarity measures for comparing ontology-based metadata and an application study using these measures within a hierarchical clustering algorithm.

Alexander Maedche, Valentin Zacharias
Iteratively Selecting Feature Subsets for Mining from High-Dimensional Databases

We propose a new data mining method that is effective for mining from extremely high-dimensional databases. Our proposed method iteratively selects a subset of features from a database and builds a hypothesis with the subset. Our selection of a feature subset has two steps, i.e. selecting a subset of instances from the database, to which predictions by multiple hypotheses previously obtained are most unreliable, and then selecting a subset of features, the distribution of whose values in the selected instances varies the most from that in all instances of the database. We empirically evaluate the effectiveness of the proposed method by comparing its performance with those of two other methods, including the method of Xing et al., one of the latest feature subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the performance of the proposed method exceeds those of the other methods, both in terms of the final predictive accuracy and the precision attained at a recall given by Xing et al.’s method. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels.

Hiroshi Mamitsuka
SVM Classification Using Sequences of Phonemes and Syllables

In this paper we use SVMs to classify spoken and written documents. We show that classification accuracy for written material is improved by the utilization of strings of sub-word units with dramatic gains for small topic categories. The classification of spoken documents for large categories using sub-word units is only slightly worse than for written material, with a larger drop for small topic categories. Finally it is possible, without loss, to train SVMs on syllables generated from written material and use them to classify audio documents. Our results confirm the strong promise that SVMs hold for robust audio document classification, and suggest that SVMs can compensate for speech recognition error to an extent that allows a significant degree of topic independence to be introduced into the system.

Gerhard Paaß, Edda Leopold, Martha Larson, Jörg Kindermann, Stefan Eickeler
A Novel Web Text Mining Method Using the Discrete Cosine Transform

Fourier Domain Scoring (FDS) has been shown to give a 60% improvement in precision over the existing vector space methods, but its index requires a large storage space. We propose a new Web text mining method using the discrete cosine transform (DCT) to extract useful information from text documents and to provide improved document ranking, without having to store excessive data. While the new method preserves the performance of the FDS method, it gives a 40% improvement in precision over the established text mining methods when using only 20% of the storage space required by FDS.

Laurence A. F. Park, Marimuthu Palaniswami, Kotagiri Ramamohanarao
A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases

Many data mining tasks can be seen as an instance of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been achieved in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms which cannot give absolute quality guarantees, the scalability of existing approaches to this problem is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms’ precise ε-δ quality guarantees. We present a constant memory algorithm for this task and prove that it possesses the required properties. In an empirical comparison, we compare variable memory and constant memory sampling.

Tobias Scheffer, Stefan Wrobel
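The ε-δ guarantee mentioned above is of the usual sequential-sampling kind. A standard Hoeffding-style bound, shown here for orientation and not as the paper's exact analysis, relates the required sample size to the two parameters:

```latex
% For a utility estimate \hat{f} averaged over n samples of a quantity bounded in [0,1]:
\Pr\bigl(|\hat{f} - f| \ge \epsilon\bigr) \;\le\; 2\exp\!\bigl(-2 n \epsilon^{2}\bigr),
\qquad\text{so } \Pr\bigl(|\hat{f} - f| \ge \epsilon\bigr) \le \delta
\text{ once } n \;\ge\; \frac{1}{2\epsilon^{2}}\,\ln\frac{2}{\delta}.
```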
Answering the Most Correlated N Association Rules Efficiently

Many algorithms have been proposed for computing association rules using the support-confidence framework. One drawback of this framework is its weakness in expressing the notion of correlation. We propose an efficient algorithm for mining association rules that uses statistical metrics to determine correlation. The simple application of conventional techniques developed for the support-confidence framework is not possible, since functions for correlation do not meet the antimonotonicity property that is crucial to traditional methods. In this paper, we propose heuristics for the vertical decomposition of a database, for pruning unproductive itemsets, and for traversing a set-enumeration tree of itemsets that is tailored to the calculation of the N most significant association rules, where N can be specified by the user. We experimentally compared the combination of these three techniques with the previous statistical approach. Our tests confirmed that the computational performance improves by several orders of magnitude.

Jun Sese, Shinichi Morishita
Mining Hierarchical Decision Rules from Clinical Databases Using Rough Sets and Medical Diagnostic Model

One of the most important problems on rule induction methods is that they cannot extract rules, which plausibly represent experts’ decision processes. On one hand, rule induction methods induce probabilistic rules, the description length of which is too short, compared with the experts’ rules. On the other hand, construction of Bayesian networks generates too lengthy rules. In this paper, the characteristics of experts’ rules are closely examined and a new approach to extract plausible rules is introduced, which consists of the following three procedures. First, the characterization of decision attributes (given classes) is extracted from databases and the classes are classified into several groups with respect to the characterization. Then, two kinds of sub-rules, characterization rules for each group and discrimination rules for each class in the group are induced. Finally, those two parts are integrated into one rule for each decision attribute. The proposed method was evaluated on a medical database, the experimental results of which show that induced rules correctly represent experts’ decision processes.

Shusaku Tsumoto
Efficiently Mining Approximate Models of Associations in Evolving Databases

Much of the existing work in machine learning and data mining has relied on devising efficient techniques to build accurate models from the data. Research on how the accuracy of a model changes as a function of dynamic updates to the databases is very limited. In this work we show that extracting this information (knowing which aspects of the model are changing, and how they are changing as a function of data updates) can be very effective for interactive data mining purposes, where response time is often more important than model quality as long as model quality is not too far off the best (exact) model. In this paper we consider the problem of generating approximate models within the context of association mining, a key data mining task. We propose a new approach to incrementally generate approximate models of associations in evolving databases. Our approach is able to detect how patterns evolve over time (an interesting result in its own right), and uses this information in generating approximate models with high accuracy at a fraction of the cost (of generating the exact model). Extensive experimental evaluation on real databases demonstrates the effectiveness and advantages of the proposed approach.

Adriano Veloso, Bruno Gusmão, Wagner Meira Jr., Marcio Carvalho, Srini Parthasarathy, Mohammed Zaki
Explaining Predictions from a Neural Network Ensemble One at a Time

This paper introduces a new method for explaining the predictions of ensembles of neural networks on a case by case basis. The approach of explaining individual examples differs from much of the current research which focuses on producing a global model of the phenomenon under investigation. Explaining individual results is accomplished by modelling each of the networks as a rule-set and computing the resulting coverage statistics for each rule given the data used to train the network. This coverage information is then used to choose the rule or rules that best describe the example under investigation. This approach is based on the premise that ensembles perform an implicit problem space decomposition with ensemble members specialising in different regions of the problem space. Thus explaining an ensemble involves explaining the ensemble members that best fit the example.

Robert Wall, Pádraig Cunningham, Paul Walsh
Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD

Domain-specific documents often share an inherent, though undocumented structure. This structure should be made explicit to facilitate efficient, structure-based search in archives as well as information integration. Inferring a semantically structured XML DTD for an archive and subsequently transforming its texts into XML documents is a promising method to reach these objectives. Based on the KDD-driven DIAsDEM framework, we propose a new method to derive an archive-specific structured XML document type definition (DTD). Our approach utilizes association rule discovery and sequence mining techniques to structure a previously derived flat, i.e. unstructured DTD. We introduce the notion of a probabilistic DTD that is derived by discovering associations among and frequent sequences of XML tags, respectively.

Karsten Winkler, Myra Spiliopoulou
Separability Index in Supervised Learning

We propose a new statistical approach for characterizing the class separability degree in R^p. This approach is based on a nonparametric statistic called “the Cut Edge Weight”. We show in this paper the principle and the experimental applications of this statistic. First, we build a geometrical connected graph like the Relative Neighborhood Graph of Toussaint on all examples of the learning set. Second, we cut all edges between two examples of a different class. Third, we calculate the relative weight of these cut edges. If the relative weight of the cut edges is in the expected interval of a random distribution of the labels on all the neighborhood graph’s vertices, then no neighborhood-based method will give a reliable prediction model. We will say then that the classes to predict are non-separable.

Djamel A. Zighed, Stéphane Lallich, Fabrice Muhlenbach

Invited Papers

Finding Hidden Factors Using Independent Component Analysis

Independent Component Analysis (ICA) is a computational technique for revealing hidden factors that underlie sets of measurements or signals. ICA assumes a statistical model whereby the observed multivariate data, typically given as a large database of samples, are assumed to be linear or nonlinear mixtures of some unknown latent variables. The mixing coefficients are also unknown. The latent variables are nongaussian and mutually independent, and they are called the independent components of the observed data. By ICA, these independent components, also called sources or factors, can be found. Thus ICA can be seen as an extension to Principal Component Analysis and Factor Analysis. ICA is a much richer technique, however, capable of finding the sources when these classical methods fail completely. In many cases, the measurements are given as a set of parallel signals or time series. Typical examples are mixtures of simultaneous sounds or human voices that have been picked up by several microphones, brain signal measurements from multiple EEG sensors, several radio signals arriving at a portable phone, or multiple parallel time series obtained from some industrial process. The term blind source separation is used to characterize this problem. The lecture will first cover the basic idea of demixing in the case of a linear mixing model and then take a look at the recent nonlinear demixing approaches. Although ICA was originally developed for digital signal processing applications, it has recently been found that it may be a powerful tool for analyzing text document data as well, if the documents are presented in a suitable numerical form. A case study on analyzing dynamically evolving text is covered in the talk.

Erkki Oja
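A minimal sketch of the linear mixing model and its blind inversion, using scikit-learn's FastICA (purely illustrative; the lecture also covers nonlinear demixing, which this does not show):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two nongaussian sources, mixed by an (in practice unknown) matrix.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.2]])
observed = sources @ mixing.T          # what the "microphones" actually record

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)  # estimated independent components (up to order/scale)
```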
Reasoning with Classifiers

Research in machine learning concentrates on the study of learning single concepts from examples. In this framework the learner attempts to learn a single hidden function from a collection of examples, assumed to be drawn independently from some unknown probability distribution. However, in many cases — as in most natural language and visual processing situations — decisions depend on the outcomes of several different but mutually dependent classifiers. The classifiers’ outcomes need to respect some constraints that could arise from the sequential nature of the data or other domain specific conditions, thus requiring a level of inference on top of the predictions. We will describe research and present challenges related to Inference with Classifiers — a paradigm in which we address the problem of using the outcomes of several different classifiers in making coherent inferences — those that respect constraints on the outcome of the classifiers. Examples will be given from the natural language domain.

Dan Roth
A Kernel Approach for Learning from Almost Orthogonal Patterns

In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.

Bernhard Schölkopf, Jason Weston, Eleazar Eskin, Christina Leslie, William Stafford Noble
Learning with Mixture Models: Concepts and Applications

Probabilistic mixture models have been used in statistics for well over a century as flexible data models. More recently these techniques have been adopted by the machine learning and data mining communities in a variety of application settings. We begin this talk with a review of the basic concepts of finite mixture models: what can they represent? how can we learn them from data? and so on. We will then discuss how the traditional mixture model (defined in a fixed dimensional vector space) can be usefully generalized to model non-vector data, such as sets of sequences and sets of curves. A number of real-world applications will be used to illustrate how these techniques can be applied to large-scale real-world data exploration and prediction problems, including clustering of visitors to a Web site based on their sequences of page requests, modeling of sparse high-dimensional “market basket” data for retail forecasting, and clustering of storm trajectories in atmospheric science.

Padhraic Smyth
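A minimal finite-mixture example in the fixed-dimensional vector-space setting the talk starts from (illustrative only; the sequence and curve mixtures discussed later require non-Gaussian component models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic clusters, fit with a two-component Gaussian mixture via EM.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(5.0, 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(data)         # hard cluster assignments
posteriors = gmm.predict_proba(data)   # soft memberships P(component | x)
```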
Backmatter
Metadata
Title
Principles of Data Mining and Knowledge Discovery
Edited by
Tapio Elomaa
Heikki Mannila
Hannu Toivonen
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45681-0
Print ISBN
978-3-540-44037-6
DOI
https://doi.org/10.1007/3-540-45681-3