2009 | Book

Advances in Intelligent Data Analysis VIII

8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, France, August 31 - September 2, 2009. Proceedings

Edited by: Niall M. Adams, Céline Robardet, Arno Siebes, Jean-François Boulicaut

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Invited Papers

Intelligent Data Analysis in the 21st Century

When IDA began, data sets were small and clean, data provenance and management were not significant issues, workflows and grid computing and cloud computing didn't exist, and the world was not populated with billions of cellphone and computer users. The original conception of intelligent data analysis — automating some of the reasoning of skilled data analysts — has not been updated to account for the dramatic changes in what skilled data analysis means today. IDA might update its mission to address pressing problems in areas such as climate change, habitat loss, education, and medicine. It might anticipate data analysis opportunities five to ten years out, such as customizing educational trajectories to individual students, and personalizing medical protocols. Such developments will elevate the conference and our community by shifting our focus from arbitrary measures of the performance of isolated algorithms to the practical, societal value of intelligent data analysis systems.

Paul Cohen, Niall Adams
Analyzing the Localization of Retail Stores with Complex Systems Tools

Measuring the spatial distribution of locations of many entities (trees, atoms, economic activities, ...), and, more precisely, the deviations from purely random configurations, is a powerful method to unravel their underlying interactions. I study here the spatial organization of retail commercial activities. From pure location data, network analysis leads to a community structure that closely follows the commercial classification of the US Department of Labor. The interaction network allows one to build a 'quality' index of optimal location niches for stores, which has been empirically tested.

Pablo Jensen

Selected Contributions 1 (Long Talks)

Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams

Data streams are dynamic, with frequent distributional changes. In this paper, we propose a statistical approach to detecting distributional shifts in multi-dimensional data streams. We use relative entropy, also known as the Kullback-Leibler distance, to measure the statistical distance between two distributions. In the context of a multi-dimensional data stream, the distributions are generated by data from two sliding windows. We maintain a sample of the data from the stream inside the windows to build the distributions.

Our algorithm is streaming, nonparametric, and requires no distributional or model assumptions. It employs the statistical theory of hypothesis testing and bootstrapping to determine whether the distributions are statistically different. We provide a full suite of experiments on synthetic data to validate the method and demonstrate its effectiveness on data from real-life applications.
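
A minimal sketch may make this concrete (a one-dimensional illustration under assumed details such as the bin count and window handling; the paper addresses multi-dimensional streams and maintains samples inside the windows): estimate the KL distance between histograms built from two windows, then bootstrap a null distribution from the pooled data to judge significance.

    import numpy as np

    def kl_distance(p, q, eps=1e-12):
        """KL(p || q) for two discrete distributions (histogram counts)."""
        p = p / p.sum()
        q = q / q.sum()
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def detect_shift(window_a, window_b, bins=20, n_boot=500, alpha=0.05):
        """Flag a distributional shift between two data windows via bootstrap."""
        edges = np.histogram_bin_edges(np.concatenate([window_a, window_b]), bins)
        observed = kl_distance(np.histogram(window_a, edges)[0].astype(float),
                               np.histogram(window_b, edges)[0].astype(float))
        # Bootstrap: under the null hypothesis, both windows come from the pooled data.
        pooled = np.concatenate([window_a, window_b])
        rng = np.random.default_rng(0)
        null = []
        for _ in range(n_boot):
            resample = rng.choice(pooled, size=pooled.size, replace=True)
            a, b = resample[:window_a.size], resample[window_a.size:]
            null.append(kl_distance(np.histogram(a, edges)[0].astype(float),
                                    np.histogram(b, edges)[0].astype(float)))
        p_value = float(np.mean(np.asarray(null) >= observed))
        return observed, p_value, p_value < alpha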

Tamraparni Dasu, Shankar Krishnan, Dongyu Lin, Suresh Venkatasubramanian, Kevin Yi
Exploiting Data Missingness in Bayesian Network Modeling

This paper proposes a framework built on the use of Bayesian networks (BN) for representing statistical dependencies between the existing random variables and additional dummy boolean variables, which represent the presence/absence of the respective random variable value. We show how augmenting the BN with these additional variables helps pinpoint the mechanism through which missing data contributes to the classification task. The missing data mechanism is thus explicitly taken into account to predict the class variable using the data at hand. Extensive experiments on synthetic and real-world incomplete data sets reveal that the missingness information improves classification accuracy.

Sérgio Rodrigues de Morais, Alex Aussem
DEMScale: Large Scale MDS Accounting for a Ridge Operator and Demographic Variables

In this paper, a method called DEMScale is introduced for large-scale MDS. DEMScale can be used to decompose MDS problems into manageable sub-problems, which are then scaled separately. The MDS items can be split into sub-problems using demographic variables in order to choose the sections of the data with optimal and sub-optimal mappings. The lower-dimensional solutions from the scaled sub-problems are recombined by taking sample points from each sub-problem, scaling the sample points, and using an affine mapping with a ridge operator to map the non-sample points. DEMScale builds upon the methods of distributional scaling and FastMDS, which are used to split and recombine MDS mappings. The use of a ridge regression parameter enables DEMScale to achieve stronger solution stability than the basic distributional scaling and FastMDS techniques. The DEMScale method is general, and is independent of the MDS technique and optimization method used.
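
The recombination step lends itself to a short illustration. Below is a minimal sketch (an assumption about the mechanics, not the authors' code) of fitting a ridge-regularized affine map from a sub-problem's coordinates of the shared sample points to their coordinates in the combined scaling, then applying it to the non-sample points; all names and the penalty parameter lam are illustrative.

    import numpy as np

    def ridge_affine_map(y_sub_samples, y_global_samples, lam=1e-2):
        """Return f(Y) = Y @ W + b fitted with ridge penalty lam on W."""
        n, d = y_sub_samples.shape
        X = np.hstack([y_sub_samples, np.ones((n, 1))])   # affine: append bias column
        reg = lam * np.eye(d + 1)
        reg[-1, -1] = 0.0                                 # do not penalize the bias
        coef = np.linalg.solve(X.T @ X + reg, X.T @ y_global_samples)
        W, b = coef[:-1], coef[-1]
        return lambda Y: Y @ W + b

    # Usage: map a sub-problem's non-sample points into the global solution.
    # f = ridge_affine_map(y_sub[sample_idx], y_glob_samples, lam=0.1)
    # y_global_nonsamples = f(y_sub[nonsample_idx])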

Stephen L. France, J. Douglas Carroll
How to Control Clustering Results? Flexible Clustering Aggregation

One of the most important and challenging questions in the area of clustering is how to choose the best-fitting algorithm and parameterization to obtain an optimal clustering for the considered data. The clustering aggregation concept tries to bypass this problem by generating a set of separate, heterogeneous partitionings of the same data set, from which an aggregate clustering is derived. As of now, almost every existing aggregation approach combines given crisp clusterings on the basis of pair-wise similarities. In this paper, we consider an input set of soft clusterings and show that it contains additional information that is efficiently usable for the aggregation. Our approach introduces an expansion of the mentioned pair-wise similarities, allowing control and adjustment of the aggregation process and its result. Our experiments show that our flexible approach offers adaptive results, improved identification of structures and high usability.

Martin Hahmann, Peter B. Volk, Frank Rosenthal, Dirk Habich, Wolfgang Lehner
Compensation of Translational Displacement in Time Series Clustering Using Cross Correlation

Although k-means clustering is often applied to time series clustering, the underlying Euclidean distance measure is very restrictive in comparison to the human perception of time series. A time series and its translated copy appear dissimilar under the Euclidean distance (because the comparison is made pointwise), whereas a human would perceive both series as similar. As human perception is tolerant to translational effects, the cross correlation distance is a better choice than the Euclidean distance. We show how to modify a k-means variant such that it operates correctly with the cross correlation distance. The resulting algorithm may also be used for meaningful clustering of time series subsequences, a task that yields meaningless results with the Euclidean or Pearson distance.
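
For illustration, here is a minimal sketch (an assumed formulation of such a distance, not necessarily the authors' exact one) of a translation-tolerant dissimilarity: both series are normalized, and the best normalized cross correlation over all circular shifts, computed via the FFT, is subtracted from one.

    import numpy as np

    def cross_correlation_distance(x, y):
        """1 - max normalized circular cross correlation of two series."""
        x = (x - x.mean()) / (x.std() * np.sqrt(len(x)))   # zero mean, unit norm
        y = (y - y.mean()) / (y.std() * np.sqrt(len(y)))
        # Circular cross correlation for every shift at once, via the FFT.
        cc = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real
        return 1.0 - cc.max()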

Frank Höppner, Frank Klawonn
Context-Based Distance Learning for Categorical Data Clustering

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance is obtained. We also propose a solution to the critical point of choosing the attributes A_j. We validate our approach on various real-world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches.
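
The intuition translates directly into a small computation. Below is a minimal sketch (hedged: the paper's exact formulation may differ) that scores two values of an attribute by how similarly the context attributes are distributed among the objects taking each value, using total variation distance; all names are illustrative.

    import numpy as np
    import pandas as pd

    def context_distance(df, a_i, v1, v2, context_attrs):
        """Average distance between conditional distributions of the context
        attributes A_j given A_i = v1 versus A_i = v2."""
        dists = []
        for a_j in context_attrs:
            p = df.loc[df[a_i] == v1, a_j].value_counts(normalize=True)
            q = df.loc[df[a_i] == v2, a_j].value_counts(normalize=True)
            support = p.index.union(q.index)
            p = p.reindex(support, fill_value=0)
            q = q.reindex(support, fill_value=0)
            dists.append(0.5 * np.abs(p.values - q.values).sum())  # total variation
        return float(np.mean(dists))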

Dino Ienco, Ruggero G. Pensa, Rosa Meo
Semi-supervised Text Classification Using RBF Networks

Semi-supervised text classification has numerous applications and is particularly applicable to problems where large quantities of unlabeled data are readily available while only a small number of labeled training samples are accessible. The paper proposes a semi-supervised classifier that integrates a clustering-based Expectation Maximization (EM) algorithm into radial basis function (RBF) neural networks and can learn to classify effectively from a very small number of labeled training samples and a large pool of unlabeled data. A generalized centroid clustering algorithm is also investigated in this work to balance predictive values between labeled and unlabeled training data and to improve classification accuracy. Experimental results with three popular text classification corpora show that the proper use of additional unlabeled data in this semi-supervised approach can reduce classification errors by up to 26%.

Eric P. Jiang
Improving k-NN for Human Cancer Classification Using the Gene Expression Profiles

The k Nearest Neighbor (k-NN) classifier has been applied to the identification of cancer samples using gene expression profiles with encouraging results. However, k-NN usually relies on Euclidean distances, which often fail to reflect the sample proximities accurately. Non-Euclidean dissimilarities focus on different features of the data and should be integrated in order to reduce the misclassification errors.

In this paper, we learn a linear combination of dissimilarities using a regularized kernel alignment algorithm. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting.

The experimental results suggest that the proposed method outperforms other metric learning strategies and improves on the classical k-NN algorithm based on a single dissimilarity.

Manuel Martín-Merino, Javier De Las Rivas
Subgroup Discovery for Test Selection: A Novel Approach and Its Application to Breast Cancer Diagnosis

We propose a new approach to test selection based on the discovery of subgroups of patients sharing the same optimal test, and present its application to breast cancer diagnosis. Subgroups are defined in terms of background information about the patient. We automatically determine the best t subgroups a patient belongs to, and decide for the test proposed by their majority. We introduce the concept of prediction quality to measure how accurately the test outcome reflects the disease status. The quality of a subgroup is then the best mean prediction quality of its members (choosing the same test for all). Incorporating the quality computation into the search heuristic enables a significant reduction of the search space. In experiments on breast cancer diagnosis data, we show that our approach is faster than the baseline algorithm APRIORI-SD while preserving its accuracy.

Marianne Mueller, Rómer Rosales, Harald Steck, Sriram Krishnan, Bharat Rao, Stefan Kramer
Trajectory Voting and Classification Based on Spatiotemporal Similarity in Moving Object Databases

We propose a method for trajectory classification based on trajectory voting in Moving Object Databases (MOD). Trajectory voting is performed based on local trajectory similarity. This is a relatively new topic in the spatial and spatiotemporal database literature, with a variety of applications like trajectory summarization, classification, searching and retrieval. In this work, we have used moving object databases, acquiring spatiotemporal 3-D trajectories that consist of the 2-D geographic location and the 1-D time information. Each trajectory is modelled by sequential 3-D line segments. The global voting method is applied for each segment of the trajectory, forming a local trajectory descriptor. By analyzing this descriptor, the representative paths of the trajectory can be detected, which can be used to visualize a MOD. Our experimental results verify that the proposed method efficiently classifies trajectories and their sub-trajectories based on a robust voting method.

Costas Panagiotakis, Nikos Pelekis, Ioannis Kopanakis
Leveraging Call Center Logs for Customer Behavior Prediction

Most major businesses use business process outsourcing to perform a process or a part of a process, including financial services like mortgage processing, loan origination, finance and accounting, and transaction processing. Call centers are used for the purpose of receiving and transmitting a large volume of requests through outbound and inbound calls to customers on behalf of a business. In this paper we deal specifically with call center notes from banks. Banks, as financial institutions, provide loans to non-financial businesses and individuals. Their call centers act as the nuclei of their client service operations and log the transactions between the customer and the bank. This crucial information can be exploited to predict a customer's behavior, which will in turn help these businesses decide on the next action to be taken. The banks thus save considerable time and effort in tracking delinquent customers, ensuring a minimum of subsequent defaulters. The call center notes are usually very concise and brief, and often the notes are misspelled and use many domain-specific acronyms. In this paper we introduce a novel domain-specific spelling correction algorithm which corrects the misspelled words in the call center logs to meaningful ones. We also discuss a procedure that builds behavioral history sequences for the customers by categorizing the logs into one of several predefined behavioral states. We then describe a pattern-based predictive algorithm that uses temporal behavioral patterns mined from these sequences to predict the customer's next behavioral state.

Anju G. Parvathy, Bintu G. Vasudevan, Abhishek Kumar, Rajesh Balakrishnan
Condensed Representation of Sequential Patterns According to Frequency-Based Measures

Condensed representations of patterns are at the core of many data mining works, and there are many contributions handling data described by items. In this paper, we tackle sequential data and define an exact condensed representation for sequential patterns according to frequency-based measures. These measures are often used, typically to evaluate classification rules. Furthermore, we show how to infer the best patterns according to these measures, i.e., the patterns which maximize them. These patterns are immediately obtained from the condensed representation, so that this approach is easily usable in practice. Experiments conducted on various datasets demonstrate the feasibility and the interest of our approach.

Marc Plantevit, Bruno Crémilleux
ART-Based Neural Networks for Multi-label Classification

Multi-label classification is an active and rapidly developing research area of data analysis. It becomes increasingly important in such fields as gene function prediction, text classification or web mining. This task corresponds to classification of instances labeled by multiple classes rather than just one. Traditionally, it was solved by learning independent binary classifiers for each class and combining their outputs to obtain multi-label predictions. Alternatively, a classifier can be directly trained to predict a label set of an unknown size for each unseen instance. Recently, several direct multi-label machine learning algorithms have been proposed. This paper presents a novel approach based on ART (Adaptive Resonance Theory) neural networks. The Fuzzy ARTMAP and ARAM algorithms were modified in order to improve their multi-label classification performance and were evaluated on benchmark datasets. Comparison of experimental results with the results of other multi-label classifiers shows the effectiveness of the proposed approach.

Elena P. Sapozhnikova
Two-Way Grouping by One-Way Topic Models

We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. The same applies for documents in the case of new users. We have shown earlier that if there are both new users and new documents, two-way generalization becomes necessary, and introduced a probabilistic Two-Way Model for the task. The task of finding a two-way grouping is a non-trivial combinatorial problem, which makes it computationally difficult. We suggest approximating the Two-Way Model with two URP models: one that groups users and one that groups documents. Their two predictions are combined using a product-of-experts model. This combination of two one-way models achieves even better prediction performance than the original Two-Way Model.
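
As an aside, the product-of-experts combination admits a one-line illustration. The sketch below (an assumption about the combination step, not the authors' model) multiplies the two one-way models' predictive distributions over the rating scale element-wise and renormalizes.

    import numpy as np

    def product_of_experts(p_user_group, p_doc_group):
        """Combine two predictive distributions over the same rating scale."""
        combined = np.asarray(p_user_group) * np.asarray(p_doc_group)
        return combined / combined.sum()   # renormalize the element-wise product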

Eerika Savia, Kai Puolamäki, Samuel Kaski
Selecting and Weighting Data for Building Consensus Gene Regulatory Networks

Microarrays are the major source of data for gene expression activity, allowing the expression of thousands of genes to be measured simultaneously. Gene regulatory networks (GRNs) describe how the expression levels of genes affect the expression of other genes. Modelling GRNs from expression data is a topic of great interest in current bioinformatics research. Previously, we took advantage of publicly available gene expression datasets generated by similar biological studies by drawing together a richer and/or broader collection of data, in order to produce GRN models that are more robust, have greater confidence and place less reliance on a single dataset. In this paper, a new approach, Weighted Consensus Bayesian Networks, introduces the use of weights in order to place more influence on certain input networks or to remove the least reliable networks from the input, with encouraging results on both synthetic data and real-world yeast microarray datasets.

Emma Steele, Allan Tucker
Incremental Bayesian Network Learning for Scalable Feature Selection

Our aim is to solve the feature subset selection problem with thousands of variables using an incremental procedure. The procedure combines incrementally the outputs of non-scalable search-and-score Bayesian network structure learning methods that are run on much smaller sets of variables. We assess the scalability, the performance and the stability of the procedure through several experiments on synthetic and real databases scaling up to 139 351 variables. Our method is shown to be efficient in terms of both running time and accuracy.

Grégory Thibault, Alex Aussem, Stéphane Bonnevay
Feature Extraction and Selection from Vibration Measurements for Structural Health Monitoring

Structural Health Monitoring (SHM) aims at monitoring buildings or other structures and assessing their condition, alerting about new defects in the structure when necessary. For instance, vibration measurements can be used for monitoring the condition of a bridge. We investigate the problem of extracting features from lightweight wireless acceleration sensors. On-line algorithms for frequency domain monitoring are considered, and the resulting features are combined to form a large bank of candidate features. We explore the feature space by selecting random sets of features and estimating probabilistic classifiers for damage detection purposes. We assess the relevance of the features in a large population of classifiers. The methods are assessed with real-life data from a wooden bridge model, where structural problems are simulated with small added weights.

Janne Toivola, Jaakko Hollmén
Zero-Inflated Boosted Ensembles for Rare Event Counts

Two linked ensembles are used for a supervised learning problem with rare-event counts. With many target instances of zero, more traditional loss functions (such as squared error and class error) are often not relevant, and a statistical model leads to a likelihood with two related parameters from a zero-inflated Poisson (ZIP) distribution. In a new approach, a linked pair of gradient boosted tree ensembles is developed to handle the multiple parameters in a manner that can be generalized to other problems. The result is a unique learner that extends machine learning methods to data with nontraditional structures. We compare empirically on two real data sets and two artificial data sets against a single-tree approach (ZIP-tree) and a statistical generalized linear model.
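
The ZIP likelihood that the two linked ensembles would descend can be written compactly. The sketch below shows the likelihood only (not the authors' boosting machinery): each count is zero with extra probability pi, otherwise Poisson with rate lam.

    import numpy as np
    from scipy.special import gammaln

    def zip_neg_log_likelihood(y, pi, lam):
        """Negative log-likelihood of counts y under ZIP(pi, lam)."""
        y = np.asarray(y, float)
        pi = np.asarray(pi, float)
        lam = np.asarray(lam, float)
        log_pois = -lam + y * np.log(lam) - gammaln(y + 1)
        ll = np.where(
            y == 0,
            np.log(pi + (1 - pi) * np.exp(-lam)),   # structural or Poisson zero
            np.log(1 - pi) + log_pois,              # positive counts: Poisson part
        )
        return float(-ll.sum())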

Alexander Borisov, George Runger, Eugene Tuv, Nuttha Lurponglukana-Strand

Selected Contributions 2 (Short Talks)

Mining the Temporal Dimension of the Information Propagation

In the last decade, Social Network Analysis has been a field in which the effort devoted by researchers in the Data Mining area has grown very quickly. Among the related topics, the study of information propagation in a network has attracted the interest of many researchers, also from the industrial world. However, only a few answers to the questions "How does information propagate over a network, why, and how fast?" have been found so far. On the other hand, these answers are of great interest, since they help in the tasks of finding experts in a network, assessing viral marketing strategies, and identifying fast or slow paths of information inside a collaborative network. In this paper we study the problem of finding frequent patterns in a network with the help of two different techniques: TAS (Temporally Annotated Sequences) mining, aimed at extracting sequential patterns where each transition between two events is annotated with a typical transition time that emerges from the input data, and Graph Mining, which is helpful for locally analyzing the nodes of the network with their properties. Finally, we show preliminary results in the direction of mining information propagation over a network, obtained on two well-known email datasets, that show the power of combining these two approaches.

Michele Berlingerio, Michele Coscia, Fosca Giannotti
Adaptive Learning from Evolving Data Streams

We propose and illustrate a method for developing algorithms that can adaptively learn from data streams that drift over time. As an example, we take the Hoeffding Tree, an incremental decision tree inducer for data streams, and use it as a basis to build two new methods that can deal with distribution and concept drift: a sliding window-based algorithm, Hoeffding Window Tree, and an adaptive method, Hoeffding Adaptive Tree. Our methods are based on placing change detectors and estimator modules at the right places; we choose implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm. A main advantage of our methods is that they require no guess about how fast or how often the stream will drift; other methods typically have several user-defined parameters to this effect.

In our experiments, the new methods never do worse, and in some cases do much better, than CVFDT, a well-known method for tree induction on data streams with drift.

Albert Bifet, Ricard Gavaldà
An Application of Intelligent Data Analysis Techniques to a Large Software Engineering Dataset

Within the development of large software systems, there is significant value in being able to predict changes. If we can predict the likely changes that a system will undergo, then we can estimate likely developer effort and allocate resources appropriately. Within object-oriented software development, these changes are often identified as refactorings. Very few studies have explored the prediction of refactorings on a wide scale. Within this paper we aim to do just this, by applying intelligent data analysis techniques to a uniquely large and comprehensive software engineering time series dataset. Our analysis shows extremely promising results, allowing us to predict the occurrence of future large changes.

James Cain, Steve Counsell, Stephen Swift, Allan Tucker
Which Distance for the Identification and the Differentiation of Cell-Cycle Expressed Genes?

This paper addresses the clustering and classification of active genes during the process of cell division. Cell division ensures the proliferation of cells, but becomes drastically aberrant in cancer cells. The studied genes are described by their expression profiles (i.e. time series) during the cell division cycle. This work focuses on evaluating the efficiency of four major metrics for clustering and classifying gene expression profiles. The study is based on a random-periods model for the expression of cell-cycle genes. The model accounts for the observed attenuation in cycle amplitude or duration, variations in the initial amplitude, and drift in the expression profiles.

Alpha Diallo, Ahlame Douzal-Chouakria, Francoise Giroud
Ontology-Driven KDD Process Composition

One of the most interesting challenges in the Knowledge Discovery in Databases (KDD) field is giving support to users in the composition of tools for forming a valid and useful KDD process. Such an activity implies that users have both to choose tools suitable to their knowledge discovery problem, and to compose them for designing the KDD process. To this end, they need expertise and knowledge about the functionalities and properties of all KDD algorithms implemented in the available tools. In order to support users in this demanding activity, in this paper we introduce a goal-driven procedure for automatically composing algorithms. The proposed procedure is based on the exploitation of KDDONTO, an ontology formalizing the domain of KDD algorithms, allowing us to generate valid and non-trivial processes.

Claudia Diamantini, Domenico Potena, Emanuele Storti
Mining Frequent Gradual Itemsets from Large Databases

Mining gradual rules plays a crucial role in many real-world applications where huge volumes of complex numerical data must be handled, e.g., biological databases, survey databases, data streams or sensor readings. Gradual rules highlight complex order correlations of the form "The more/less X, then the more/less Y". Such rules have been studied since the early 70's, mostly in the fuzzy logic domain, where the main efforts have focused on how to model and use such rules. However, mining gradual rules remains challenging because of the exponential combination space to explore. In this paper, we tackle the particular problem of handling huge volumes by proposing scalable methods. First, we formally define gradual association rules and propose an original lattice-based approach. The GRITE algorithm is proposed for extracting gradual itemsets in an efficient manner. An experimental study on large-scale synthetic and real datasets shows the efficiency and interest of our approach.
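
To make the notion concrete, here is a minimal sketch (one common way to evaluate such patterns; the paper's GRITE algorithm is lattice-based and far more efficient) of the support of a gradual itemset as the fraction of object pairs whose orderings agree on every attribute of the itemset.

    import numpy as np
    from itertools import combinations

    def gradual_support(data, itemset):
        """data: dict attribute -> array of values; itemset: list of
        (attribute, direction) pairs with direction +1 (more) or -1 (less)."""
        n = len(next(iter(data.values())))
        concordant = 0
        for i, j in combinations(range(n), 2):
            # A pair supports the itemset if all attributes vary in the stated
            # directions, reading the pair in either order.
            fwd = all(np.sign(data[a][j] - data[a][i]) == d for a, d in itemset)
            rev = all(np.sign(data[a][i] - data[a][j]) == d for a, d in itemset)
            concordant += fwd or rev
        return concordant / (n * (n - 1) / 2)

    # Example: support of "the more X, the more Y" on five objects.
    # gradual_support({'X': np.array([1, 2, 3, 4, 5]),
    #                  'Y': np.array([1, 3, 2, 5, 6])}, [('X', 1), ('Y', 1)])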

Lisa Di-Jorio, Anne Laurent, Maguelonne Teisseire
Selecting Computer Architectures by Means of Control-Flow-Graph Mining

Deciding which computer architecture provides the best performance for a certain program is an important problem in hardware design and benchmarking. While previous approaches require expensive simulations or program executions, we propose an approach which solely relies on program analysis. We correlate substructures of the control-flow graphs representing the individual functions with the runtime on certain systems. This leads to a prediction framework based on graph mining, classification and classifier fusion. In our evaluation with the SPEC CPU 2000 and 2006 benchmarks, we predict the faster system out of two with high accuracy and achieve significant speedups in execution time.

Frank Eichinger, Klemens Böhm
Visualization-Driven Structural and Statistical Analysis of Turbulent Flows

Knowledge extraction from data volumes of ever increasing size requires ever more flexible tools to facilitate interactive query. Interactivity enables real-time hypothesis testing and scientific discovery, but can generally not be achieved without some level of data reduction. The approach described in this paper combines multi-resolution access, region-of-interest extraction, and structure identification in order to provide interactive spatial and statistical analysis of a terascale data volume. Unique aspects of our approach include the incorporation of both local and global statistics of the flow structures, and iterative refinement facilities, which combine geometry, topology, and statistics to allow the user to effectively tailor the analysis and visualization to the science. Working together, these facilities allow a user to focus the spatial scale and domain of the analysis and perform an appropriately tailored multivariate visualization of the corresponding data. All of these ideas and algorithms are instantiated in a deployed visualization and analysis tool called VAPOR, which is in routine use by scientists internationally. In data from a 1024³ simulation of a forced turbulent flow, VAPOR allowed us to perform a visual data exploration of the flow properties at interactive speeds, leading to the discovery of novel scientific properties of the flow, in the form of two distinct vortical structure populations. These structures would have been very difficult (if not impossible) to find with statistical overviews or other existing visualization-driven analysis approaches. This kind of intelligent, focused analysis/refinement approach will become even more important as computational science moves towards petascale applications.

Kenny Gruchalla, Mark Rast, Elizabeth Bradley, John Clyne, Pablo Mininni
Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework

Searching for interesting patterns in binary matrices plays an important role in data mining and, in particular, in formal concept analysis and related disciplines. Several algorithms for computing particular patterns represented by maximal rectangles in binary matrices have been proposed, but their major drawback is their computational complexity, which limits their application to relatively small datasets. In this paper we introduce a scalable distributed algorithm for computing maximal rectangles that uses the map-reduce approach to data processing.
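
For readers new to the terminology, the sketch below (a minimal single-machine illustration of the underlying notion, not the paper's distributed algorithm) shows how a maximal rectangle, i.e., a formal concept, arises from closing an object set with the two derivation operators.

    import numpy as np

    def intent(matrix, objects):
        """Attributes shared by all given objects (rows of a 0/1 matrix)."""
        if not objects:
            return set(range(matrix.shape[1]))
        return set(np.flatnonzero(matrix[list(objects)].all(axis=0)))

    def extent(matrix, attributes):
        """Objects possessing all given attributes (columns)."""
        if not attributes:
            return set(range(matrix.shape[0]))
        return set(np.flatnonzero(matrix[:, list(attributes)].all(axis=1)))

    def concept_from(matrix, objects):
        """Close an object set into a formal concept (maximal rectangle)."""
        b = intent(matrix, objects)
        return extent(matrix, b), b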

Petr Krajca, Vilem Vychodil
Multi-Optimisation Consensus Clustering

Ensemble Clustering has been developed to provide an alternative way of obtaining more stable and accurate clustering results. It aims to avoid the biases of individual clustering algorithms. However, it is still a challenge to develop an efficient and robust method for Ensemble Clustering. Based on an existing ensemble clustering method, Consensus Clustering (CC), this paper introduces an advanced Consensus Clustering algorithm called Multi-Optimisation Consensus Clustering (MOCC), which utilises an optimised Agreement Separation criterion and a Multi-Optimisation framework to improve the performance of CC. Fifteen different data sets are used for evaluating the performance of MOCC. The results reveal that MOCC can generate more accurate clustering results than the original CC algorithm.

Jian Li, Stephen Swift, Xiaohui Liu
Improving Time Series Forecasting by Discovering Frequent Episodes in Sequences

This work aims to improve an existing time series forecasting algorithm (LBF) by applying frequent episodes techniques as a complementary step to the model. When real-world time series are forecasted, there exist many samples whose values may be especially unexpected. By combining frequent episodes and the LBF algorithm, the new procedure does not make better predictions over these outliers but, on the contrary, is able to predict the appearance of such atypical samples with great accuracy. In short, this work shows how to detect the occurrence of anomalous samples in time series, thus improving the general forecasting scheme. Moreover, this hybrid approach has been successfully tested on electricity-related time series.

Francisco Martínez-Álvarez, Alicia Troncoso, José C. Riquelme
Measure of Similarity and Compactness in Competitive Space

This work is devoted to measures of similarity used in the discovery of empirical regularities (knowledge). The function of competitive (rival) similarity (FRiS) is proposed as a similarity measure for classification and pattern recognition applications. This function allows one to design effective algorithms for solving all basic data mining tasks, obtain quantitative estimates of the compactness of patterns and the informativeness of feature spaces, and construct easily interpretable decision rules. The method is suitable for any number of patterns, regardless of the nature of their distributions and the conditionality of training samples (the ratio of the numbers of objects and features). The usefulness of FRiS is shown by solving problems in molecular biology.

Nikolay Zagoruiko
Bayesian Solutions to the Label Switching Problem

The label switching problem, the unidentifiability of the permutation of clusters or, more generally, latent variables, makes interpretation of results computed with MCMC sampling difficult. We introduce a fully Bayesian treatment of the permutations which performs better than alternatives. The method can even be used to compute summaries of the posterior samples for nonparametric Bayesian methods, for which no good solutions exist so far. Although approximate in that case, the results are very promising. The summaries are intuitively appealing: a summarized cluster is defined as a set of points for which the likelihood of being in the same cluster is maximized.
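
One permutation-invariant quantity behind such summaries admits a short illustration. The sketch below (a common label-switching-invariant computation, offered as background rather than as the authors' method) estimates, from MCMC samples of cluster assignments, the posterior probability that each pair of points shares a cluster.

    import numpy as np

    def posterior_similarity(label_samples):
        """label_samples: (n_samples, n_points) array of cluster assignments.
        Returns the matrix of pairwise co-clustering probabilities, which is
        unchanged under any relabeling of the clusters within a sample."""
        S = np.asarray(label_samples)
        n = S.shape[1]
        sim = np.zeros((n, n))
        for z in S:
            sim += (z[:, None] == z[None, :])
        return sim / S.shape[0]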

Kai Puolamäki, Samuel Kaski
Efficient Vertical Mining of Frequent Closures and Generators

The effective construction of many association rule bases requires the computation of both frequent closed itemsets and frequent generator itemsets (FCIs/FGs). However, only a few miners address both concerns, typically by applying levelwise breadth-first traversal. As depth-first traversal is known to be superior, we examine here depth-first FCI/FG mining. The proposed algorithm, Touch, deals with both tasks separately, i.e., it uses a well-known vertical method, Charm, to extract FCIs and a novel one, Talky-G, to extract FGs. The respective outputs are matched in a post-processing step. Experimental results indicate that Touch is highly efficient and outperforms its levelwise competitors.

Laszlo Szathmary, Petko Valtchev, Amedeo Napoli, Robert Godin
Isotonic Classification Trees

We propose a new algorithm for learning isotonic classification trees. It relabels non-monotone leaf nodes by performing isotonic regression on the collection of leaf nodes. In case two leaf nodes with a common parent have the same class after relabeling, the tree is pruned in the parent node. Since we consider problems with ordered class labels, all results are evaluated on the basis of the L1 prediction error. We experimentally compare the performance of the new algorithm with standard classification trees.
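
The relabeling step rests on standard isotonic regression, which the sketch below illustrates (a textbook pool-adjacent-violators routine for a single ordered sequence, not the paper's tree algorithm): adjacent violating blocks are pooled until the fitted values are non-decreasing.

    def isotonic_regression(y, w=None):
        """Weighted least-squares isotonic (non-decreasing) fit of sequence y."""
        w = [1.0] * len(y) if w is None else list(w)
        blocks = [[v, wt, 1] for v, wt in zip(y, w)]   # [mean, weight, size]
        i = 0
        while i < len(blocks) - 1:
            if blocks[i][0] > blocks[i + 1][0]:        # violator: pool the pair
                v1, w1, n1 = blocks[i]
                v2, w2, n2 = blocks.pop(i + 1)
                blocks[i] = [(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2, n1 + n2]
                i = max(i - 1, 0)                      # pooling may create a new
            else:                                      # violation to the left
                i += 1
        out = []
        for v, _, n in blocks:
            out.extend([v] * n)
        return out

    # isotonic_regression([3, 1, 2, 5, 4])  ->  [2.0, 2.0, 2.0, 4.5, 4.5]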

Rémon van de Kamp, Ad Feelders, Nicola Barile
Backmatter
Metadata
Title
Advances in Intelligent Data Analysis VIII
Edited by
Niall M. Adams
Céline Robardet
Arno Siebes
Jean-François Boulicaut
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-03915-7
Print ISBN
978-3-642-03914-0
DOI
https://doi.org/10.1007/978-3-642-03915-7
