
2003 | Book

Advances in Intelligent Data Analysis V

5th International Symposium on Intelligent Data Analysis, IDA 2003, Berlin, Germany, August 28-30, 2003. Proceedings

Edited by: Michael R. Berthold, Hans-Joachim Lenz, Elizabeth Bradley, Rudolf Kruse, Christian Borgelt

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

We are glad to present the proceedings of the 5th biennial conference in the Intelligent Data Analysis series. The conference took place in Berlin, Germany, August 28–30, 2003. IDA has by now clearly grown up. Started as a small side symposium of a larger conference in 1995 in Baden-Baden (Germany), it quickly attracted more interest (both submission- and attendance-wise), and moved from London (1997) to Amsterdam (1999), and two years ago to Lisbon. Submission rates along with the ever improving quality of papers have enabled the organizers to assemble increasingly consistent and high-quality programs. This year we were again overwhelmed by yet another record-breaking submission rate of 180 papers. At the Program Chairs meeting we were – based on roughly 500 reviews – in the lucky position of carefully selecting 17 papers for oral and 42 for poster presentation. Poster presenters were given the opportunity to summarize their papers in 3-minute spotlight presentations. The oral, spotlight and poster presentations were then scheduled in a single-track, 2.5-day conference program, summarized in this book. In accordance with the goal of IDA, “to bring together researchers from diverse disciplines,” we achieved a nice balance of presentations from the more theoretical side (both statistics and computer science) as well as more application-oriented areas that illustrate how these techniques can be used in practice. Work presented in these proceedings ranges from theoretical contributions dealing, for example, with data cleaning and compression all the way to papers addressing practical problems in the areas of text classification and sales-rate predictions. A considerable number of papers also center around the currently so popular applications in bioinformatics.

Table of contents

Frontmatter

Machine Learning

Pruning for Monotone Classification Trees

For classification problems with ordinal attributes, the class attribute should very often increase with each or some of the explanatory attributes. These are called classification problems with monotonicity constraints. Standard classification tree algorithms such as CART or C4.5 are not guaranteed to produce monotone trees, even if the data set is completely monotone. We look at pruning-based methods to build monotone classification trees from monotone as well as non-monotone data sets. We develop a number of fixing methods that make a non-monotone tree monotone through additional pruning steps. These fixing methods can be combined with existing pruning techniques to obtain a sequence of monotone trees. The performance of the new algorithms is evaluated through experimental studies on artificial as well as real-life data sets. We conclude that the monotone trees have slightly better predictive performance and are considerably smaller than trees constructed by the standard algorithm.

Ad Feelders, Martijn Pardoel
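
The monotonicity constraint above can be made concrete with a small check. The sketch below is illustrative rather than the authors' code (all names are ours): it counts constraint violations of any fitted classifier on a sample, assuming ordinal integer labels. A tree is monotone on the sample exactly when the count is zero, which is what the fixing methods enforce by further pruning.

    import numpy as np

    def monotonicity_violations(predict, X):
        # Count pairs (i, j) with X[i] <= X[j] componentwise but a
        # higher predicted label for i -- violations of monotonicity.
        X = np.asarray(X, dtype=float)
        y = np.asarray(predict(X))
        violations = 0
        for i in range(len(X)):
            for j in range(len(X)):
                if np.all(X[i] <= X[j]) and y[i] > y[j]:
                    violations += 1
        return violations
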
Regularized Learning with Flexible Constraints

By its very nature, inductive inference performed by machine learning methods is mainly data-driven. Still, the consideration of background knowledge – if available – can help to make inductive inference more efficient and to improve the quality of induced models. Fuzzy set-based modeling techniques provide a convenient tool for making expert knowledge accessible to computational methods. In this paper, we exploit such techniques within the context of the regularization (penalization) framework of inductive learning. The basic idea is to express knowledge about an underlying data-generating model in terms of flexible constraints and to penalize those models violating these constraints. Within this framework, an optimal model is one that achieves an optimal trade-off between fitting the data and satisfying the constraints.

Eyke Hüllermeier
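
A minimal sketch of the penalization idea, assuming a callable model, a fuzzy constraint function returning a degree of violation in [0, 1] per data point, and a trade-off weight lam; every name here is illustrative, not from the paper.

    import numpy as np

    def penalized_objective(model, X, y, violation, lam=1.0):
        # Data fit (squared error) plus a penalty for violating the
        # flexible constraints; minimizing this trades off the two.
        residuals = np.asarray(y) - model(X)
        return np.mean(residuals ** 2) + lam * np.mean(violation(model, X))
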
Learning to Answer Emails

Many individuals, organizations, and companies have to answer large amounts of emails. Often, most of these emails contain variations of relatively few frequently asked questions. We address the problem of predicting which of several frequently used answers a user will choose to respond to an email. Our approach effectively utilizes the data that is typically available in this setting: inbound and outbound emails stored on a server. We take into account that there are no explicit references between inbound and corresponding outbound mails on the server. We map the problem to a semi-supervised classification problem that can be addressed by the transductive Support Vector Machine. We evaluate our approach using emails sent to a corporate customer service department.

Michael Kockelkorn, Andreas Lüneburg, Tobias Scheffer
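
The transductive SVM itself is not part of the common toolkits, so the sketch below substitutes a plain self-training loop around scikit-learn's SVC to convey how unlabeled mails can be folded into training. It assumes already vectorized mails; the confidence threshold, names and stopping rule are illustrative, and this is a stand-in, not the authors' method.

    import numpy as np
    from sklearn.svm import SVC

    def self_training_svm(X_lab, y_lab, X_unl, threshold=0.9, rounds=10):
        # Repeatedly fit an SVM and move confidently predicted
        # unlabeled mails into the labeled set (a simple stand-in
        # for the transductive SVM used in the paper).
        X_lab, y_lab = np.asarray(X_lab, dtype=float), np.asarray(y_lab)
        X_unl = np.asarray(X_unl, dtype=float)
        clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)
        for _ in range(rounds):
            if len(X_unl) == 0:
                break
            proba = clf.predict_proba(X_unl)
            confident = proba.max(axis=1) >= threshold
            if not confident.any():
                break
            new_y = clf.classes_[proba[confident].argmax(axis=1)]
            X_lab = np.vstack([X_lab, X_unl[confident]])
            y_lab = np.concatenate([y_lab, new_y])
            X_unl = X_unl[~confident]
            clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)
        return clf
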
A Semi-supervised Method for Learning the Structure of Robot Environment Interactions

For a mobile robot to act autonomously, it must be able to construct a model of its interaction with the environment. Oates et al. developed an unsupervised learning method that produces clusters of robot experiences based on the dynamics of the interaction, rather than on static features. We present a semi-supervised extension of their technique that uses information about the controller and the task of the robot to (i) segment the stream of experiences, (ii) optimise the final number of clusters, and (iii) automatically select the individual sensors to feed to the clustering process. The technique is evaluated on a Pioneer 2 robot navigating obstacles and passing through doors in an office environment. We show that the technique is able to classify high-dimensional robot time series several times the length of those previously handled, with an accuracy of 91%.

Axel Großmann, Matthias Wendt, Jeremy Wyatt
Using Domain Specific Knowledge for Automated Modeling

In this paper, we present a knowledge-based approach to the automated modeling of dynamic systems based on equation discovery. The approach allows for the integration of domain-specific modeling knowledge into the modeling process. A formalism for encoding knowledge is proposed that is accessible to mathematical modelers from the domain of use. Given a specification of an observed system, the encoded knowledge can be easily transformed into an operational grammar that specifies the space of candidate models of the observed system. Then, the equation discovery method Lagramge is used to search the space of candidate models and find the one that best fits the measured data. The use of the automated modeling framework is illustrated on the example domain of population dynamics. The performance and robustness of the framework are evaluated on the task of reconstructing known models of a simple aquatic ecosystem.

Ljupčo Todorovski, Sašo Džeroski
Resolving Rule Conflicts with Double Induction

When applying an unordered set of classification rules, the rules may assign more than one class to a particular example. Previous methods of resolving such conflicts between rules include using the most frequent class in the conflicting rules (as done in CN2) and using naïve Bayes to calculate the most probable class. An alternative way of solving this problem is presented in this paper: by generating new rules from the examples covered by the conflicting rules. These newly induced rules are then used for classification. Experiments on a number of domains show that this method significantly outperforms both the CN2 approach and naïve Bayes.

Tony Lindgren, Henrik Boström
A Novel Partial-Memory Learning Algorithm Based on Grey Relational Structure

In instance-based learning, the storage required grows with the number of training instances. In addition, classifying an unseen instance usually takes considerable time because all training instances have to be considered in determining the ‘nearness’ between instances. This paper proposes a novel partial-memory learning method based on the grey relational structure; that is, only some of the training instances are retained for classification. The relationships among instances are first determined according to the grey relational structure. In this relational structure, the inward edges of each training instance, indicating how many times the instance is used as a nearest neighbor in determining the class labels of other instances, can be found. The approach excludes training instances with no or few inward edges from learning. In this way, new instances can be classified with only a few training instances. Five datasets are used to demonstrate the performance of the proposed approach. Experimental results indicate that classification accuracy can be maintained when most of the training instances are pruned prior to learning, while the number of retained training instances is comparable to that of other existing pruning techniques.

Chi-Chun Huang, Hahn-Ming Lee
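
A sketch of the inward-edge idea, using plain Euclidean k-nearest neighbors as a stand-in for the paper's grey relational structure: instances that rarely or never serve as a neighbor of others are pruned before learning. Parameter names are ours.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def prune_by_inward_edges(X, k=3, min_edges=1):
        # Keep only training instances that act as a nearest neighbor
        # of at least min_edges other instances.  The paper derives
        # neighborhoods from a grey relational structure; Euclidean
        # k-NN is used here purely as an illustrative substitute.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)            # column 0 is the point itself
        inward = np.zeros(len(X), dtype=int)
        for neighbors in idx[:, 1:]:         # skip self-neighborhood
            inward[neighbors] += 1
        return np.flatnonzero(inward >= min_edges)
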
Constructing Hierarchical Rule Systems

Rule systems have failed to attract much interest in large data analysis problems because they tend to be too simplistic to be useful or consist of too many rules for human interpretation. We present a method that constructs a hierarchical rule system, with only a small number of rules at each stage of the hierarchy. Lower levels in this hierarchy focus on outliers or areas of the feature space where only weak evidence for a rule was found in the data. Rules further up, at higher levels of the hierarchy, describe increasingly general and strongly supported aspects of the data. We demonstrate the proposed method’s usefulness on several classification benchmark data sets using a fuzzy rule induction process as the underlying learning algorithm. The results demonstrate how the rule hierarchy allows much smaller rule systems to be built and how the model—especially at higher levels of the hierarchy—remains interpretable. The presented method can be applied to a variety of local learning systems in a similar fashion.

Thomas R. Gabriel, Michael R. Berthold
Text Categorization Using Hybrid Multiple Model Schemes

Automatic text categorization techniques using inductive machine learning methods have been employed in various applications. In this paper, we review the characteristics of the existing multiple model schemes, which include bagging, boosting, and stacking. Multiple model schemes try to optimize predictive accuracy by combining predictions of diverse models derived from different versions of training examples or learning algorithms. In this study, we develop hybrid schemes which combine the techniques of existing multiple model schemes to improve the accuracy of text categorization, and conduct experiments to evaluate the performance of the proposed schemes on MEDLINE, Usenet news, and Web document collections. The experiments demonstrate the effectiveness of the hybrid multiple model schemes. Boosted stacking algorithms, a kind of extended stacking algorithm proposed in this study, yield higher accuracies than the conventional multiple model schemes and single model schemes.

In-Cheol Kim, Soon-Hee Myoung

Probability and Topology

Learning Dynamic Bayesian Networks from Multivariate Time Series with Changing Dependencies

Many examples exist of multivariate time series where dependencies between variables change over time. If these changing dependencies are not taken into account, any model that is learnt from the data will average over the different dependency structures. Paradigms that try to explain underlying processes and observed events in multivariate time series must explicitly model these changes in order to allow non-experts to analyse and understand such data. In this paper we have developed a method for generating explanations in multivariate time series that takes into account changing dependency structure. We make use of a dynamic Bayesian network model with hidden nodes. We introduce a representation and search technique for learning such models from data and test it on synthetic time series and real-world data from an oil refinery, both of which contain changing underlying structure. We compare our method to an existing EM-based method for learning structure. Results are very promising for our method and we include sample explanations, generated from models learnt from the refinery dataset.

Allan Tucker, Xiaohui Liu
Topology and Intelligent Data Analysis

A broad range of mathematical techniques, ranging from statistics to fuzzy logic, have been used to great advantage in intelligent data analysis. Topology—the fundamental mathematics of shape—has to date been conspicuously absent from this repertoire. This paper shows how topology, properly reformulated for a finite-precision world, can be useful in intelligent data analysis tasks.

V. Robins, J. Abernethy, N. Rooney, E. Bradley
Coherent Conditional Probability as a Measure of Information of the Relevant Conditioning Events

We continue our study of coherent conditional probability viewed as a general non-additive “uncertainty” measure ϕ(.) = P(E|.) of the conditioning events. In particular, we proved in [6] (see also the book [5]) that ϕ can be interpreted as a possibility measure. In a previous paper [7] we gave a relevant characterization, showing that ϕ is a capacity if and only if it is a possibility. In this paper we give a characterization of ϕ as an (antimonotone) information measure in the sense of Kampé de Fériet and Forte.

Giulianella Coletti, Romano Scozzafava
Very Predictive Ngrams for Space-Limited Probabilistic Models

In sequential prediction tasks, one repeatedly tries to predict the next element in a sequence. A classical way to solve these problems is to fit an order-n Markov model to the data, but fixed-order models are often bigger than they need to be. In a fixed-order model, all predictors are of length n, even if a shorter predictor would work just as well. We present a greedy algorithm, vpr, for finding variable-length predictive rules. Although vpr is not optimal, we show that on English text, it performs similarly to fixed-order models but uses fewer parameters.

Paul R. Cohen, Charles A. Sutton
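
The sketch below is a simplified variable-order predictor in the same spirit, not the authors' vpr algorithm: it stores next-symbol counts for every context up to a maximum order over a string of symbols and, at prediction time, backs off from the longest matching context.

    from collections import defaultdict

    def train_contexts(text, max_order=4):
        # Count next-symbol frequencies for every context of length
        # 0..max_order (a simplified variable-order Markov model).
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(text)):
            for n in range(max_order + 1):
                if i - n < 0:
                    break
                counts[text[i - n:i]][text[i]] += 1
        return counts

    def predict_next(counts, history, max_order=4):
        # Back off from the longest matching context to shorter ones;
        # the empty context falls back to unigram frequencies.
        for n in range(min(max_order, len(history)), -1, -1):
            ctx = history[len(history) - n:]
            if ctx in counts:
                nxt = counts[ctx]
                return max(nxt, key=nxt.get)
        return None
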
Interval Estimation Naïve Bayes

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier called naïve Bayes is competitive with state-of-the-art classifiers. This simple approach rests on the assumption of conditional independence among features given the class. In this paper a new naïve Bayes classifier called Interval Estimation naïve Bayes is proposed. Interval Estimation naïve Bayes proceeds in two phases. First, an interval estimate of each probability needed to specify the naïve Bayes model is calculated. In the second phase, the best combination of values inside these intervals is sought using a heuristic search guided by classifier accuracy. The values found in the search become the new parameters of the naïve Bayes classifier. The new approach has proven quite competitive with simple naïve Bayes. Experimental tests have been carried out with 21 data sets from the UCI repository.

V. Robles, P. Larrañaga, J. M. Peña, E. Menasalvas, M. S. Pérez
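
A compact sketch of the two phases for binary 0/1 features: normal-approximation intervals around each estimated conditional probability, then random search inside the intervals scored by training accuracy. This simplifies the paper's heuristic search, and all names are ours.

    import numpy as np

    def interval_estimation_nb(X, y, z=1.96, samples=200, seed=0):
        # Phase 1: point estimates and confidence intervals for each
        # P(feature=1 | class).  Phase 2: random search inside the
        # intervals, keeping the parameters with best training accuracy.
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        classes = np.unique(y)
        priors = np.array([(y == c).mean() for c in classes])
        p = np.array([X[y == c].mean(axis=0) for c in classes])
        n = np.array([[np.sum(y == c)] for c in classes], dtype=float)
        half = z * np.sqrt(p * (1 - p) / n)
        lo = np.clip(p - half, 1e-3, 1 - 1e-3)
        hi = np.clip(p + half, 1e-3, 1 - 1e-3)

        def accuracy(theta):
            ll = (np.log(priors)[:, None]
                  + np.log(theta) @ X.T + np.log(1 - theta) @ (1 - X.T))
            return (classes[ll.argmax(axis=0)] == y).mean()

        best = np.clip(p, 1e-3, 1 - 1e-3)
        best_acc = accuracy(best)
        for _ in range(samples):
            cand = rng.uniform(lo, hi)
            acc = accuracy(cand)
            if acc > best_acc:
                best, best_acc = cand, acc
        return classes, priors, best
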
Mining Networks and Central Entities in Digital Libraries. A Graph Theoretic Approach Applied to Co-author Networks

Based on graph-theoretic concepts, the paper introduces the notion of a social network for mining central entities in bibliographic Digital Libraries. For this purpose, several concepts of centrality in social networks are employed that evaluate the strategic position of actors in a social network. Moreover, the paper discusses a structural technique (MPA*) that optimizes network propagation and the evaluation of actor centrality without imposing a low depth threshold. The proposed models have been applied to co-author networks for finding human experts in scientific domains. The expressiveness of our network model is demonstrated by testing the precision of document sets ranked by author centrality.

Peter Mutschke
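
A minimal illustration with networkx of ranking authors by centrality in a co-author graph; the author pairs are invented, and the paper's MPA* propagation strategy is not reproduced here.

    import networkx as nx

    # Hypothetical co-author pairs extracted from bibliographic records.
    pairs = [("Smith", "Jones"), ("Smith", "Lee"), ("Lee", "Jones"),
             ("Lee", "Kim"), ("Kim", "Park")]

    G = nx.Graph()
    G.add_edges_from(pairs)

    # Two standard centrality notions used to rank actors in a
    # social network; higher betweenness suggests a broker position.
    degree = nx.degree_centrality(G)
    between = nx.betweenness_centrality(G)

    experts = sorted(G.nodes, key=lambda a: between[a], reverse=True)
    print(experts[:3])
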

Classification and Pattern Recognition

Learning Linear Classifiers Sensitive to Example Dependent and Noisy Costs

Learning algorithms from the fields of artificial neural networks and machine learning typically do not take any costs into account, or allow only costs that depend on the classes of the examples used for learning. As an extension of class-dependent costs, we consider costs that are example dependent, i.e., feature and class dependent. We derive a cost-sensitive perceptron learning rule for non-separable classes that can be extended to multi-modal classes (DIPOL), and present a natural cost-sensitive extension of the support vector machine (SVM).

Peter Geibel, Ulf Brefeld, Fritz Wysotzki
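
One natural reading of a cost-sensitive perceptron rule, sketched for labels in {-1, +1}: the update for a misclassified example is scaled by that example's own cost. This is an illustration of the idea, not the exact rule derived in the paper.

    import numpy as np

    def cost_sensitive_perceptron(X, y, cost, lr=0.1, epochs=50):
        # Perceptron where each example carries its own
        # misclassification cost; updates are scaled by that cost.
        X = np.asarray(X, dtype=float)
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, yi, ci in zip(X, y, cost):
                if yi * (xi @ w + b) <= 0:       # misclassified
                    w += lr * ci * yi * xi
                    b += lr * ci * yi
        return w, b
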
An Effective Associative Memory for Pattern Recognition

Neural models of associative memory provide a new and promising technology for reliable data storage and pattern recognition. However, even when the patterns are uncorrelated, the efficiency of most known models of associative memory is low. We developed a new version of associative memory with record storage capacity and noise immunity which, in addition, is effective when recognizing correlated patterns.

Boris Kryzhanovsky, Leonid Litinskii, Anatoly Fonarev
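
Since the authors' model is a new variant, the sketch below shows the classical Hopfield associative memory (Hebbian storage, synchronous recall) that such work builds on and improves; bipolar +1/-1 patterns are assumed.

    import numpy as np

    def hopfield_store(patterns):
        # Hebbian weight matrix for bipolar patterns, zero diagonal.
        P = np.asarray(patterns, dtype=float)
        W = P.T @ P / P.shape[1]
        np.fill_diagonal(W, 0.0)
        return W

    def hopfield_recall(W, probe, steps=20):
        # Synchronous updates until a fixed point or the step limit.
        s = np.asarray(probe, dtype=float)
        for _ in range(steps):
            s_new = np.sign(W @ s)
            s_new[s_new == 0] = 1
            if np.array_equal(s_new, s):
                break
            s = s_new
        return s
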
Similarity Based Classification

We describe general conditions for data classification which can serve as a unifying framework in the study of kernel-based machine learning algorithms. From these conditions we derive a new algorithm called SBC (for Similarity-Based Classification), which has attractive theoretical properties regarding underfitting, overfitting, power of generalization, computational complexity, and robustness. Compared to classical algorithms such as Parzen windows and non-linear perceptrons, SBC can be seen as an optimized version of them. Finally, it is a conceptually simpler and more efficient alternative to Support Vector Machines for an arbitrary number of classes. Its practical significance is illustrated on a number of benchmark classification problems.

Axel E. Bernal, Karen Hospevian, Tayfun Karadeniz, Jean-Louis Lassez
Numerical Attributes in Decision Trees: A Hierarchical Approach

Decision trees are probably the most popular and commonly-used classification model. They are recursively built following a top-down approach (from general concepts to particular examples) by repeated splits of the training dataset. When this dataset contains numerical attributes, binary splits are usually performed by choosing the threshold value which minimizes the impurity measure used as splitting criterion (e.g. C4.5 gain ratio criterion or CART Gini’s index). In this paper we propose the use of multi-way splits for continuous attributes in order to reduce the tree complexity without decreasing classification accuracy. This can be done by intertwining a hierarchical clustering algorithm with the usual greedy decision tree learning.

Fernando Berzal, Juan-Carlos Cubero, Nicolás Marín, Daniel Sánchez
Similarity-Based Neural Networks for Applications in Computational Molecular Biology

This paper presents an alternative to distance-based neural networks. A distance measure is the underlying property on which many neural models rely, for example self-organizing maps or neural gas. However, a distance measure implies some requirements on the data which are not always easy to satisfy in practice. This paper shows that a weaker measure, the similarity measure, is sufficient in many cases. As an example, similarity-based networks for strings are presented. Although a metric can also be defined on strings, similarity is the established measure in string-intensive research, like computational molecular biology. Similarity-based neural networks process data based on the same criteria as other tools for analyzing DNA or amino-acid sequences.

Igor Fischer
Combining Pairwise Classifiers with Stacking

Pairwise classification is the technique that deals with multi-class problems by converting them into a series of binary problems, one for each pair of classes. The predictions of the binary classifiers are typically combined into an overall prediction by voting and predicting the class that received the largest number of votes. In this paper we try to generalize the voting procedure by replacing it with a trainable classifier, i.e., we propose the use of a meta-level classifier that is trained to arbiter among the conflicting predictions of the binary classifiers. In our experiments, this yielded substantial gains on a few datasets, but no gain on others. These performance differences do not seem to depend on quantitative parameters of the datasets, like the number of classes.

Petr Savicky, Johannes Fürnkranz
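
A sketch of the proposal, assuming integer class labels and array inputs: one binary classifier per class pair, whose predictions form meta-features for a trainable arbiter. For brevity the meta-level is trained on resubstitution predictions rather than the cross-validated predictions proper stacking would use, and the base and meta learners are arbitrary choices.

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    def fit_pairwise_stacking(X, y):
        # One binary classifier per class pair; a meta-classifier
        # then arbitrates among their (possibly conflicting) votes.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        base = {}
        for a, b in combinations(np.unique(y), 2):
            mask = np.isin(y, [a, b])
            base[(a, b)] = DecisionTreeClassifier().fit(X[mask], y[mask])
        meta_X = np.column_stack([clf.predict(X) for clf in base.values()])
        meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
        return base, meta

    def predict_pairwise_stacking(base, meta, X):
        meta_X = np.column_stack([clf.predict(X) for clf in base.values()])
        return meta.predict(meta_X)
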
APRIORI-SD: Adapting Association Rule Learning to Subgroup Discovery

This paper presents APRIORI-SD, a subgroup discovery algorithm developed by adapting association rule learning to subgroup discovery. This was achieved by taking the classification rule learner APRIORI-C and enhancing it with a novel post-processing mechanism, a new quality measure for induced rules (weighted relative accuracy), and probabilistic classification of instances. Results of APRIORI-SD are similar to those of the subgroup discovery algorithm CN2-SD, while experimental comparisons with CN2, RIPPER and APRIORI-C demonstrate that APRIORI-SD produces substantially smaller rule sets, where individual rules have higher coverage and significance.

Branko Kavšek, Nada Lavrač, Viktor Jovanoski
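
The weighted relative accuracy measure named above has a standard form, WRAcc(cond -> class) = p(cond) * (p(class | cond) - p(class)); a direct computation over boolean example masks:

    import numpy as np

    def wracc(cond, target):
        # cond: boolean mask of examples covered by the rule body;
        # target: boolean mask of examples in the rule's class.
        cond = np.asarray(cond, dtype=bool)
        target = np.asarray(target, dtype=bool)
        p_cond = cond.mean()
        if p_cond == 0:
            return 0.0
        return p_cond * (target[cond].mean() - target.mean())
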
Solving Classification Problems Using Infix Form Genetic Programming

A new evolutionary technique, called Infix Form Genetic Programming (IFGP), is proposed in this paper. The IFGP individuals are strings encoding complex mathematical expressions. The IFGP technique is used for solving several classification problems. All test problems are taken from PROBEN1 and contain real-world data. IFGP is compared to Linear Genetic Programming (LGP) and Artificial Neural Networks (ANNs). Numerical experiments show that IFGP solves the considered test problems with classification error equal to, and sometimes lower than, that obtained by LGP and ANNs.

Mihai Oltean, Crina Groşan

Clustering

What Is Fuzzy about Fuzzy Clustering? Understanding and Improving the Concept of the Fuzzifier

The step from the well-known c-means clustering algorithm to the fuzzy c-means algorithm and its vast number of sophisticated extensions and generalisations involves an additional clustering parameter, the so-called fuzzifier. This fuzzifier controls how much the clusters may overlap. It also has some negative effects, causing problems for clusters with varying data density, noisy data, and large data sets with a higher number of clusters. In this paper we take a closer look at the general principle underlying the fuzzifier. Based on these investigations, we propose an improved, more general framework that avoids the undesired effects of the fuzzifier.

Frank Klawonn, Frank Höppner
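
For reference, a minimal standard fuzzy c-means in which the role of the fuzzifier m is visible in the membership update; the generalized framework proposed in the paper replaces exactly this mechanism and is not reproduced here.

    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, iters=100, eps=1e-6, seed=0):
        # The fuzzifier m > 1 controls how much clusters may overlap;
        # m close to 1 approaches crisp c-means.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))    # memberships
        for _ in range(iters):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            inv = np.maximum(d, 1e-12) ** (-2.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=1, keepdims=True)
            if np.abs(U_new - U).max() < eps:
                U = U_new
                break
            U = U_new
        return centers, U
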
A Mixture Model Approach for Binned Data Clustering

In some data analysis problems, the available data take the form of a histogram; such data are also called binned data. This paper addresses the problem of clustering binned data using mixture models. A specific EM algorithm was proposed by Cadez et al. [2] to deal with such data, but it has the disadvantage of being computationally expensive. In this paper, a classification version of this algorithm is proposed, which is much faster. The two approaches are compared using simulated data. The simulation results show that both algorithms generate comparable solutions in terms of the resulting partition if the histogram is accurate enough.

Allou Samé, Christophe Ambroise, Gérard Govaert
Fuzzy Clustering Based Segmentation of Time-Series

The segmentation of time-series is a constrained clustering problem: the data points should be grouped by their similarity, but with the constraint that all points in a cluster must come from successive time points. The changes in the variables of a time-series are usually vague and not tied to any particular time point, so it is not practical to define crisp bounds for the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation. This paper proposes a clustering algorithm for the simultaneous identification of the fuzzy sets that represent the segments in time and the local PCA models used to measure the homogeneity of the segments. The algorithm is applied to the monitoring of the production of high-density polyethylene.

Janos Abonyi, Balazs Feil, Sandor Nemeth, Peter Arva
An Iterated Local Search Approach for Minimum Sum-of-Squares Clustering

Since minimum sum-of-squares clustering (MSSC) is an NP-hard combinatorial optimization problem, applying techniques from global optimization appears to be promising for reliably clustering numerical data. In this paper, concepts of combinatorial heuristic optimization are considered for approaching the MSSC: An iterated local search (ILS) approach is proposed which is capable of finding (near-)optimum solutions very quickly. On gene expression data resulting from biological microarray experiments, it is shown that ILS outperforms multi–start k-means as well as three other clustering heuristics combined with k-means.

Peter Merz
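
A sketch of iterated local search for MSSC with scikit-learn's k-means as the local search; the perturbation used here (re-seeding one center on a random data point) is one simple choice and need not match the paper's.

    import numpy as np
    from sklearn.cluster import KMeans

    def ils_kmeans(X, k=5, iters=20, seed=0):
        # Iterated local search: perturb the current centers, re-run
        # k-means (the local search), accept if the SSE improves.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        best = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        for _ in range(iters):
            centers = best.cluster_centers_.copy()
            j = rng.integers(k)                      # perturb one center
            centers[j] = X[rng.integers(len(X))]
            cand = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
            if cand.inertia_ < best.inertia_:        # inertia_ is the SSE
                best = cand
        return best
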
Data Clustering in Tolerance Space

This paper studies an abstract data clustering model in which similarity is explicitly represented by a tolerance relation. Three basic types of clusters are defined from each tolerance relation: maximal complete similarity clusters, representative clusters, and closure clusters. Heuristic methods for computing the corresponding clusterings are introduced and an experiment on two real-world datasets is discussed. This paper provides a different view of the study of data clustering, where clusters are derived from a given similarity and different clusters may have non-empty intersections.

Chun-Hung Tzeng, Fu-Shing Sun
Refined Shared Nearest Neighbors Graph for Combining Multiple Data Clusterings

We recently introduced the idea of solving cluster ensembles using a Weighted Shared nearest neighbors Graph (WSnnG). Preliminary experiments have shown promising results in terms of integrating different clusterings into a combined one, such that the natural cluster structure of the data can be revealed. In this paper, we further study and extend the basic WSnnG. First, we introduce the use of fixed number of nearest neighbors in order to reduce the size of the graph. Second, we use refined weights on the edges and vertices of the graph. Experiments show that it is possible to capture the similarity relationships between the data patterns on a compact refined graph. Furthermore, the quality of the combined clustering based on the proposed WSnnG surpasses the average quality of the ensemble and that of an alternative clustering combining method based on partitioning of the patterns’ co-association matrix.

Hanan Ayad, Mohamed Kamel
Clustering Mobile Trajectories for Resource Allocation in Mobile Environments

The recent developments in computer and communication technologies gave rise to Personal Communication Systems. Due to the nature of the PCS, the bandwidth allocation problem arises, which is based on the notion of bandwidth-on-demand. We deal with the problem of how to predict the position of a mobile client. We propose a new algorithm, called DCP, to discover user mobility patterns from collections of recorded mobile trajectories and use them for the prediction of movements and dynamic allocation of resources. The performance of the proposed algorithm is examined against two baseline algorithms. The simulation results illustrate that the proposed algorithm achieves recall that is comparable to that of the baseline algorithms and substantial improvement in precision. This improvement guarantees very good predictions for resource allocation with the advantage of very low resource consumption.

Dimitrios Katsaros, Alexandros Nanopoulos, Murat Karakaya, Gokhan Yavas, Özgür Ulusoy, Yannis Manolopoulos
Fuzzy Clustering of Short Time-Series and Unevenly Distributed Sampling Points

This paper proposes a new algorithm in the fuzzy-c-means family, which is designed to cluster time-series and is particularly suited for short time-series and those with unevenly spaced sampling points. Short time-series, which do not allow a conventional statistical model, and unevenly sampled time-series appear in many practical situations. The algorithm developed here is motivated by common experiments in molecular biology. Conventional clustering algorithms based on the Euclidean distance or the Pearson correlation coefficient are not able to include the temporal information in the distance metric. The temporal order of the data and the varying length of sampling intervals are important and should be considered in clustering time-series. The proposed short time-series (STS) distance is able to measure similarity of shapes which are formed by the relative change of amplitude and the corresponding temporal information. We develop a fuzzy time-series (FSTS) clustering algorithm by incorporating the STS distance into the standard fuzzy clustering scheme. An example is provided to demonstrate the performance of the proposed algorithm.

Carla S. Möller-Levet, Frank Klawonn, Kwang-Hyun Cho, Olaf Wolkenhauer
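
Following the description above, the short time-series distance compares the slopes between successive sampling points, so the temporal order and the uneven spacing enter the metric directly. A sketch for two series x and y sampled at times t, under that reading of the STS distance:

    import numpy as np

    def sts_distance(x, y, t):
        # Euclidean distance between the slope sequences of the two
        # series; shape (relative change) and sampling intervals both
        # contribute, unlike plain Euclidean distance on the values.
        x, y, t = (np.asarray(a, dtype=float) for a in (x, y, t))
        dt = np.diff(t)
        slopes_x = np.diff(x) / dt
        slopes_y = np.diff(y) / dt
        return np.sqrt(np.sum((slopes_x - slopes_y) ** 2))
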
Combining and Comparing Cluster Methods in a Receptor Database

Biological data such as DNA and protein sequences can be analyzed using a variety of methods. This paper combines phylogenetic trees, experience-based classification, and self-organizing maps for cluster analysis of G protein-coupled receptors (GPCRs), a class of pharmacologically relevant transmembrane proteins with specific characteristics. This combination allows additional insights to be gained.

Elena V. Samsonova, Thomas Bäck, Margot W. Beukers, Ad P. Ijzerman, Joost N. Kok

Applications

Selective Sampling with a Hierarchical Latent Variable Model

We present a new method that combines a hierarchical stochastic latent variable model and a selective sampling strategy for learning from co-occurrence events, a fundamental issue in intelligent data analysis. The hierarchical stochastic latent variable model enables us to use existing background knowledge of observable co-occurrence events as a latent variable. The selective sampling strategy alternates between selecting plausible non-noise examples from a given data set and training a component stochastic model, thereby improving the predictive performance of the component model. Combining the model and the strategy is expected to be effective for enhancing the performance of learning from real-world co-occurrence events. We empirically tested the performance of our method on a real data set of protein-protein interactions, a typical data set of co-occurrence events. The experimental results show that the presented method significantly outperformed an existing approach and other machine learning methods compared, and that it is highly effective for unsupervised learning from co-occurrence events.

Hiroshi Mamitsuka
Obtaining Quality Microarray Data via Image Reconstruction

This paper introduces a novel method for processing spotted microarray images, inspired from image reconstruction. Instead of the usual approach that focuses on the signal when removing the noise, the new method focuses on the noise itself, performing a type of interpolation. By recreating the image of the microarray slide, as it would have been with all the genes removed, the gene ratios can be calculated with more precision and less influence from outliers and other artefacts that would normally make the analysis of this data more difficult. The new technique is also beneficial, as it does not rely on the accurate fitting of a region to each gene, with its only requirement being an approximate coordinate. In experiments conducted the new method was tested against one of the mainstream methods of processing spotted microarray images. Our method is shown to produce much less variation in gene measurements. This evidence is supported by clustering results that show a marked improvement in accuracy.

Paul O’Neill, George D. Magoulas, Xiaohui Liu
Large Scale Mining of Molecular Fragments with Wildcards

The main task of drug discovery is to find novel bioactive molecules, i.e., chemical compounds that, for example, protect human cells against a virus. One way to support solving this task is to analyze a database of known and tested molecules with the aim to build a classifier that predicts whether a novel molecule will be active or inactive, so that future chemical tests can be focused on the most promising candidates. In [1] an algorithm for constructing such a classifier was proposed that uses molecular fragments to discriminate between active and inactive molecules. In this paper we present two extensions of this approach: A special treatment of rings and a method that finds fragments with wildcards based on chemical expert knowledge.

Heiko Hofer, Christian Borgelt, Michael R. Berthold
Genome-Wide Prokaryotic Promoter Recognition Based on Sequence Alignment Kernel

In this paper an application of the Sequence Alignment Kernel to the recognition of prokaryotic promoters with transcription start sites (TSS) is presented. An algorithm for computing this kernel in square time is described. Using this algorithm, a “promoter map” of the E. coli genome has been computed. This is a curve reflecting the likelihood of every base of a given genomic sequence being a TSS. A viewer showing the likelihood curve with the positions of known and putative genes and known TSS has also been developed and made available online. Although visual analysis of the promoter regions is very intuitive, we propose an automatic genome-wide promoter prediction scheme that simplifies the routine of checking the whole promoter map visually. Computational efficiency and speed issues are also discussed.

Leo Gordon, Alexey Ya. Chervonenkis, Alex J. Gammerman, Ilham A. Shahmuradov, Victor V. Solovyev
Towards Automated Electrocardiac Map Interpretation: An Intelligent Contouring Tool Based on Spatial Aggregation

The interpretation of cardiac potential maps is essential for the localization of the anomalous excitation sites associated with electrical conduction pathologies, such as arrhythmia, and its automation would make clinical practice both realistic and of great impact on health care. Conventional contouring tools allow us to efficiently visualize patterns of electrical potential distribution, but they do not facilitate the automated extraction of the general rules necessary to infer the correlation between pathophysiological patterns and wavefront structure and propagation. The Spatial Aggregation (SA) approach, which aims at interpreting a numeric input field, is potentially capable of capturing structural information about the underlying physical phenomenon and of identifying its global patterns and the causal relations between events. These characteristics are due to its hierarchical strategy of aggregating objects at higher and higher levels. However, when dealing with a complex domain geometry and/or non-uniform data meshes, as in our context, the original version of SA may perform contouring unsoundly. Isocurve entanglement and/or segmentation phenomena may occur due to metric-based adjacency relations and a scarce density of isopoints. This paper discusses the problems above and presents definitions and algorithms for a sound representation of the spatial contiguity between isopoints and for the construction of isocurves.

Liliana Ironi, Stefania Tentoni
Study of Canada/US Dollar Exchange Rate Movements Using Recurrent Neural Network Model of FX-Market

Understanding exchange rate movements has long been an extremely challenging and important task. Unsatisfactory results produced by time series regression models have led to the claim by several authors that in foreign exchange markets, past movements of the price of a given currency have no predictive power in forecasting future movements of the currency price. In this paper, we build a recurrent neural network model for FX-market to explain exchange rate movements. Asset prices are discovered in the marketplace by the interaction of market design and agents’ behaviour. The interaction is simulated by integrating 1) the FX-market mechanism; 2) an economic framework; and 3) the embedding of both tasks in neural network architectures. The results indicate that both macroeconomic and microeconomic variables are useful to forecast exchange rate changes. Results from regression model based on neural-fuzzy forecasting system are also included for comparison.

Ashwani Kumar, D. P. Agrawal, S. D. Joshi
Gaussian Mixture Density Estimation Applied to Microarray Data

Several publications have focused on fitting a specific distribution to overall microarray data. Due to a number of biological features the distribution of overall spot intensities can take various shapes. It appears to be impossible to find a specific distribution fitting all experiments even if they are carried out perfectly. Therefore, a probabilistic representation that models a mixture of various effects would be suitable. We use a Gaussian mixture model to represent signal intensity profiles. The advantage of this approach is the derivation of a probabilistic criterion for expressed and non-expressed genes. Furthermore our approach does not involve any prior decision on the number of model parameters. We properly fit microarray data of various shapes by a mixture of Gaussians using the EM algorithm and determine the complexity of the mixture model by the Bayesian Information Criterion (BIC). Finally, we apply our method to simulated data and to biological data.

Christine Steinhoff, Tobias Müller, Ulrike A. Nuber, Martin Vingron
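
A minimal sketch of the approach using scikit-learn: fit Gaussian mixtures of increasing size to the (typically log-transformed) intensity values and let the Bayesian Information Criterion pick the model complexity. The function name and component range are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_intensity_mixture(intensities, max_components=6):
        # Fit mixtures with 1..max_components Gaussians and keep the
        # one with the lowest BIC, mirroring the model-selection step.
        X = np.asarray(intensities, dtype=float).reshape(-1, 1)
        best, best_bic = None, np.inf
        for k in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
            bic = gmm.bic(X)
            if bic < best_bic:
                best, best_bic = gmm, bic
        return best
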
Classification of Protein Localisation Patterns via Supervised Neural Network Learning

Many classification methods exist, drawn from diverse fields including statistics, machine learning, and pattern recognition, and new methods claiming superior performance over classical ones are invented constantly. It has become increasingly difficult for practitioners to choose the right kind of method for their applications. This paper therefore does not suggest yet another classification algorithm, but rather conveys the message that some existing algorithms, if properly used, can lead to better solutions to some challenging real-world problems. The paper looks at important problems in bioinformatics for which the best solutions are known, and shows that improvement over those solutions can be achieved with a form of feed-forward neural network by applying more advanced schemes for supervised network learning. The results are evaluated against those from other commonly used classifiers, such as K nearest neighbours with cross-validation, and their statistical significance is assessed using the nonparametric Wilcoxon test.

Aristoklis D. Anastasiadis, George D. Magoulas, Xiaohui Liu
Applying Intelligent Data Analysis to Coupling Relationships in Object-Oriented Software

Class coupling lies at the heart of object-oriented (OO) programming systems, since most objects need to communicate with other objects in any OO system as part of the system’s functions. Minimisation of coupling in a system is a goal of every software developer, since overly-coupled systems are complex and difficult to maintain. There are various ways of coupling classes in OO systems. However, very little is known about the different forms of coupling and their relationships. In this paper, three data analysis techniques, namely Bayesian networks, association rules, and clustering, were used to identify coupling relationships in three C++ systems. Encouraging results were obtained with the Bayesian network approach, reinforcing existing knowledge and highlighting new features of the three systems. Results for the other two techniques were disappointing. With association rules, it was clear that only a very general relationship could be discovered from the data. The clustering approach produced inconsistent results, casting doubt on whether such a technique can provide any insight into module discovery when applied to this type of system.

Steve Counsell, Xiaohui Liu, Rajaa Najjar, Stephen Swift, Allan Tucker
The Smaller the Better: Comparison of Two Approaches for Sales Rate Prediction

We describe a system to predict the daily sales rates of newspapers. We present a mathematical model and its implementation, a data cleaning approach, and a way to augment the training sets using similar time series. The results are compared with a neural prediction system currently in use.

Martin Lauer, Martin Riedmiller, Thomas Ragg, Walter Baum, Michael Wigbers

Modeling

A Multiagent-Based Constructive Approach for Feedforward Neural Networks

In this paper, a new constructive approach for the automatic definition of feedforward neural networks (FNNs) is introduced. Such approach (named MASCoNN) is multiagent-oriented and, thus, can be regarded as a kind of hybrid (synergetic) system. MASCoNN centers upon the employment of a two-level hierarchy of agent-based elements for the progressive allocation of neuronal building blocks. By this means, an FNN can be considered as an architectural organization of reactive neural agents, orchestrated by deliberative coordination entities via synaptic interactions. MASCoNN was successfully applied to implement nonlinear dynamic systems identification devices and some comparative results, involving alternative proposals, are analyzed here.

Clodoaldo A. M. Lima, André L. V. Coelho, Fernando J. Von Zuben
Evolutionary System Identification via Descriptive Takagi Sugeno Fuzzy Systems

System identification is used to identify relevant input-output space relations. In this article the relations are used to model a descriptive Takagi-Sugeno fuzzy system. Basic terms of system identification, fuzzy systems and evolutionary computation are briefly reviewed. These concepts are used to present the implementation of an evolutionary algorithm which identifies (sub)optimal descriptive Takagi-Sugeno fuzzy systems according to given data. The proposed evolutionary algorithm is tested on the well known gas furnace data set and results are presented.

Ingo Renners, Adolf Grauel
Minimum Message Length Criterion for Second-Order Polynomial Model Selection Applied to Tropical Cyclone Intensity Forecasting

This paper outlines a body of work that tries to merge polynomial model selection research and tropical cyclone forecasting research. The contributions of the work are four-fold. First, a new criterion based on the Minimum Message Length principle specifically formulated for the task of polynomial model selection up to the second order is presented. Second, a programmed optimisation search algorithm for second-order polynomial models that can be used in conjunction with any model selection criterion is developed. Third, critical examinations of the differences in performance of the various criteria when applied to artificial vis-a-vis to real tropical cyclone data are conducted. Fourth, a novel strategy which uses a synergy between the new criterion built based on the Minimum Message Length principle and other model selection criteria namely, Minimum Description Length, Corrected Akaike’s Information Criterion, Structured Risk Minimization and Stochastic Complexity is proposed. The forecasting model developed using this new automated strategy has better performance than the benchmark models SHIFOR (Statistical HurrIcane FORcasting) [4] and SHIFOR94 [8] which are being used in operation in the Atlantic basin.

Grace W. Rumantir, Chris S. Wallace
On the Use of the GTM Algorithm for Mode Detection

The problem of detecting the modes of the multivariate continuous distribution generating the data is of central interest in various areas of modern statistical analysis. The popular self-organizing map (SOM) structure provides a rough estimate of that underlying density and can therefore be brought to bear on this problem. In this paper we consider the recently proposed, mixture-based generative topographic mapping (GTM) algorithm for SOM training. Our long-term goal is to develop, from a map appropriately trained via GTM, a fast, integrated and reliable strategy involving just a few key statistics. Preliminary simulations with Gaussian data highlight various interesting aspects of our working strategy.

Susana Vegas-Azcárate, Jorge Muruzábal
Regularization Methods for Additive Models

This paper tackles the problem of model complexity in the context of additive models. Several methods have been proposed to estimate smoothing parameters, as well as to perform variable selection. However, these procedures are inefficient or computationally expensive in high dimensions. To address this problem, the lasso technique has been adapted to additive models, but its experimental performance has not been analyzed. We propose a modified lasso for additive models that performs variable selection. A benchmark is developed to examine its practical behavior, comparing it with forward selection. Our simulation studies suggest that the proposed method is able to carry out model selection. The lasso technique performs better than forward selection in the most complex situations. The computing time of the modified lasso is considerably smaller, since it does not depend on the number of relevant variables.

Marta Avalos, Yves Grandvalet, Christophe Ambroise
Automated Detection of Influenza Epidemics with Hidden Markov Models

We present a method for automated detection of influenza epidemics. The method uses Hidden Markov Models with an Exponential-Gaussian mixture to characterize the non-epidemic and epidemic dynamics in a time series of influenza-like illness incidence rates. Our evaluation on real data shows a reduction in the number of false detections compared to previous approaches and increased robustness to variations in the data.

Toni M. Rath, Maximo Carreras, Paola Sebastiani
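
A simplified two-state detector in the same spirit, using hmmlearn with plain Gaussian emissions instead of the paper's Exponential-Gaussian mixture; the state with the higher mean incidence rate is labeled epidemic. Names and settings are illustrative.

    import numpy as np
    from hmmlearn import hmm

    def detect_epidemics(rates, seed=0):
        # rates: weekly influenza-like-illness incidence, shape (T, 1).
        model = hmm.GaussianHMM(n_components=2, covariance_type="full",
                                n_iter=200, random_state=seed)
        model.fit(rates)
        states = model.predict(rates)
        # Call the state with the higher mean rate "epidemic".
        epidemic = int(np.argmax(model.means_.ravel()))
        return states == epidemic
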
Guided Incremental Construction of Belief Networks

Because uncertain reasoning is often intractable, it is hard to reason with a large amount of knowledge. One solution to this problem is to specify a set of possible models, some simple and some complex, and choose which to use based on the problem. We present an architecture for interpreting temporal data, called AIID, that incrementally constructs belief networks based on data that arrives asynchronously. It synthesizes the opportunistic control of the blackboard architecture with recent work on constructing belief networks from fragments. We have implemented this architecture in the domain of military analysis.

Charles A. Sutton, Brendan Burns, Clayton Morrison, Paul R. Cohen
Distributed Regression for Heterogeneous Data Sets

Existing meta-learning based distributed data mining approaches do not explicitly address context heterogeneity across individual sites. This limitation constrains their applications where distributed data are not identically and independently distributed. Modeling heterogeneously distributed data with hierarchical models, this paper extends the traditional meta-learning techniques so that they can be successfully used in distributed scenarios with context heterogeneity.

Yan Xing, Michael G. Madden, Jim Duggan, Gerard J. Lyons

Data Preprocessing

A Logical Formalisation of the Fellegi-Holt Method of Data Cleaning

The Fellegi-Holt method automatically “corrects” data that fail some predefined requirements. Computer implementations of the method were used in many national statistics agencies but are less used now because they are slow. We recast the method in propositional logic, and show that many of its results are well-known results in propositional logic. In particular we show that the Fellegi-Holt method of “edit generation” is essentially the same as a technique for automating logical deduction called resolution. Since modern implementations of resolution are capable of handling large problems efficiently, they might lead to more efficient implementations of the Fellegi-Holt method.

Agnes Boskovitz, Rajeev Goré, Markus Hegland
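
The propositional resolution step that the paper identifies with Fellegi-Holt edit generation, sketched over clauses represented as frozensets of (variable, polarity) literals:

    def resolve(c1, c2):
        # All resolvents of two clauses: resolving on a variable that
        # occurs positively in one clause and negatively in the other
        # yields a new clause with that variable eliminated.
        resolvents = set()
        for var, pol in c1:
            if (var, not pol) in c2:
                resolvents.add((c1 - {(var, pol)}) | (c2 - {(var, not pol)}))
        return resolvents

    # Example: (x or y) and (not x or z) resolve to (y or z).
    c1 = frozenset({("x", True), ("y", True)})
    c2 = frozenset({("x", False), ("z", True)})
    print(resolve(c1, c2))   # {frozenset({('y', True), ('z', True)})}
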
Compression Technique Preserving Correlations of a Multivariate Temporal Sequence

The application of data analysis methods to temporal sequences is limited by the inherently high temporal dimension of the data. One promising solution consists in performing dimensionality reduction before analysing the sequences. We propose a new technique that reduces the temporal dimension of a multivariate sequence while preserving correlations between the sequence’s descriptive features, a highly important property for many data analysis processes. We illustrate the effectiveness of the proposed technique on a multivariate temporal sequence from an intensive care unit application.

Ahlame Chouakria-Douzal
Condensed Representations in Presence of Missing Values

Missing values are an old problem that is very common in real data bases. We describe the damage caused by missing values to condensed representations of patterns extracted from large data bases. This is important because condensed representations are very useful for increasing the efficiency of the extraction and enable new uses of frequent patterns (e.g., rules with minimal body, clustering, classification). We show that, unfortunately, such condensed representations are unreliable in the presence of missing values. We present a method of treating missing values in condensed representations based on δ-free or closed patterns, which are the most common condensed representations; this method provides an adequate condensed representation of these patterns. We show the soundness of our approach, both from a formal point of view and experimentally. Experiments are performed with our prototype MVminer (for Missing Values miner), which computes the collection of appropriate δ-free patterns.

François Rioult, Bruno Crémilleux
Measures of Rule Quality for Feature Selection in Text Categorization

Text categorization is the process of assigning documents to a set of previously fixed categories. A lot of research is going on with the goal of automating this time-consuming task. Several different algorithms have been applied, and Support Vector Machines have shown very good results. In this paper we propose a new family of measures taken from the machine learning environment and apply them to the feature reduction task. The experiments are performed on two different corpora (Reuters and Ohsumed). The results show that the new family of measures performs better than the traditional information theory measures.

Elena Montañés, Javier Fernández, Irene Díaz, Elías F. Combarro, José Ranilla
Genetic Approach to Constructive Induction Based on Non-algebraic Feature Representation

The aim of constructive induction (CI) is to transform the original data representation of hard concepts with complex interaction into one that outlines the relation among attributes. CI methods based on greedy search suffer from the local optima problem because of high variation in the search space of hard learning problems. To reduce the local optima problem, we propose a CI method based on genetic (evolutionary) algorithms. The method comprises two integrated genetic algorithms to construct functions over subsets of attributes in order to highlight regularities for the learner. Using non-algebraic representation for constructed functions assigns an equal degree of complexity to functions. This reduces the difficulty of constructing complex features. Experiments show that our method is comparable with and in some cases superior to existing CI methods.

Leila S. Shafti, Eduardo Pérez
Active Feature Selection Based on a Very Limited Number of Entities

In data analysis, the necessary data are not always prepared in a database in advance. If the precision of extracted classification knowledge is insufficient, gathering additional data is sometimes necessary. In particular, if some critical attributes for the classification are missing from the database, it is very important to identify such missing attributes effectively in order to improve the precision. In this paper, we propose a new method to identify the attributes that will improve the precision of Support Vector Classifiers (SVC) based solely on the values of candidate attributes for a very limited number of entities. In experiments, we show that the incremental addition of attributes by the proposed method effectively improves the precision of SVC using only a very small number of entities.

Yasusi Sinohara, Teruhisa Miura
Backmatter
Metadata
Title
Advances in Intelligent Data Analysis V
Edited by
Michael R. Berthold
Hans-Joachim Lenz
Elizabeth Bradley
Rudolf Kruse
Christian Borgelt
Copyright year
2003
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45231-7
Print ISBN
978-3-540-40813-0
DOI
https://doi.org/10.1007/978-3-540-45231-7