
2002 | Book

Classification, Clustering, and Data Analysis

Recent Advances and Applications

Edited by: Prof. Krzysztof Jajuga, Prof. Andrzej Sokołowski, Prof. Hans-Hermann Bock

Publisher: Springer Berlin Heidelberg

Book series: Studies in Classification, Data Analysis, and Knowledge Organization

About this book

The present volume contains a selection of papers presented at the Eighth Conference of the International Federation of Classification Societies (IFCS), which was held in Cracow, Poland, July 16-19, 2002. All originally submitted papers were subject to a reviewing process by two independent referees, a procedure which resulted in the selection of the 53 articles presented in this volume. These articles relate to theoretical investigations as well as to practical applications and cover a wide range of topics in the broad domain of classification, data analysis and related methods. If we try to classify the wealth of problems, methods and approaches into some representative (partially overlapping) groups, we find in particular the following areas:

• Clustering
• Cluster validation
• Discrimination
• Multivariate data analysis
• Statistical methods
• Symbolic data analysis
• Consensus trees and phylogeny
• Regression trees
• Neural networks and genetic algorithms
• Applications in economics, medicine, biology, and psychology.

Given the international orientation of IFCS conferences and the leading role of IFCS in the scientific world of classification, clustering and data analysis, this volume collects a representative selection of current research and modern applications in this field and serves as an up-to-date information source for statisticians, data analysts, data mining specialists and computer scientists.

Table of contents

Frontmatter

Clustering and Discrimination

Frontmatter

Clustering

Some Thoughts about Classification

The paper contains some general remarks on the high art of data analysis, some philosophical thoughts about classification, a partial review of outliers and robustness from the point of view of applications, including a discussion of the problem of model choice, and a review of several aspects of robust estimation of covariance matrices, including the pragmatic choice of a weight function based on empirical and theoretical evidence. Several sections contain new (or at least original) ideas: There are some proposals for incorporating robustness into Bayesian practice and theory, including weighted log likelihoods and Bayes’ theorem for weighted data. Some small ideas refer to artificial classification in a continuum, to a “robust” (Prohorov-type) metric for high-dimensional data, and to the use of multiple minimum spanning trees. A promising but difficult research idea for clustering on the real line, based on a new smoothing method, concludes the paper.

Frank Hampel
Partial Defuzzification of Fuzzy Clusters

Two methods of partial defuzzification of fuzzy clusters in a fuzzy k-partition are proposed. The first method is based on sharpening of fuzzy clusters with respect to the threshold 1/k. Sharpening is accomplished by application of the generalized operator of contrast intensification of fuzzy sets to k - 1 fuzzy clusters. The second method utilizes the idea of strong sharpening. In this approach the very small membership grades in each fuzzy cluster are reduced to zero. It is explained how these methods can lead to the well-known approximation of a fuzzy partition by its crisp maximum membership partition.

Slavka Bodjanova
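
The thresholding idea can be illustrated with a minimal sketch, assuming a generic contrast-intensification operator rather than the paper's exact one: memberships above 1/k are pushed toward 1, those below toward 0, and each object's memberships are renormalized.

```python
import numpy as np

def sharpen(U, alpha=2.0):
    """Partially defuzzify a fuzzy k-partition U (n x k, rows sum to 1).

    Memberships above the 1/k threshold are amplified and those below
    attenuated before renormalizing each row.  This is a generic
    contrast-intensification sketch, not Bodjanova's exact operator.
    """
    n, k = U.shape
    t = 1.0 / k
    V = np.where(U >= t,
                 t + (1 - t) * ((U - t) / (1 - t)) ** (1 / alpha),
                 t * (U / t) ** alpha)
    return V / V.sum(axis=1, keepdims=True)
```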
A New Clustering Approach, Based on the Estimation of the Probability Density Function, for Gene Expression Data

Many techniques have already been suggested for handling and analyzing the large and high-dimensional data sets produced by newly developed gene expression experiments. These techniques include supervised classification and unsupervised agglomerative or hierarchical clustering techniques. Here, we present an alternative approach that does not make assumptions about the shape, size, or volume of the clusters. The technique is based on the estimation of the probability density function (pdf). Once the pdf is estimated with the Parzen technique (with the right amount of smoothing), the parameter space is partitioned according to methods inherited from image processing, namely the skeleton by influence zones and the watershed. We show some advantages of this suggested approach.

Noël Bonnet, Michel Herbin, Jérôme Cutrona, Jean-Marie Zahm
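
A minimal sketch of the density-estimation step, using scipy's Gaussian kernel density estimator as the Parzen window (the data and bandwidth are purely illustrative; the watershed partitioning step is only indicated in a comment):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy two-cluster "expression profiles" projected to 2-D.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Parzen (kernel) density estimate; the bandwidth plays the role of
# "the right amount of smoothing" mentioned in the abstract.
kde = gaussian_kde(pts.T, bw_method=0.3)

# Evaluate the pdf on a grid; a watershed / skeleton-by-influence-zones
# step (e.g. skimage.segmentation.watershed applied to -density) would
# then partition the grid into one basin per mode.
xx, yy = np.meshgrid(np.linspace(-4, 9, 120), np.linspace(-4, 9, 120))
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
```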
Two-mode Partitioning: Review of Methods and Application of Tabu Search

As the special contribution of this paper, we deal with an application of tabu search to the problem of two-mode clustering, minimizing the two-mode variance criterion. States are two-mode partitions, neighborhoods are defined by transfers of a single row- or column-mode object into a new class, and the tabu list contains the values of the criterion. We compare the results obtained with those of other methods, such as alternating exchanges, k-means, simulated annealing and a fuzzy-set approach.

William Castillo, Javier Trejos
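
The two-mode variance criterion can be sketched as follows; the tabu moves themselves (transferring one row- or column-mode object and recording criterion values in the tabu list) are only indicated in a comment:

```python
import numpy as np

def two_mode_variance(X, rows, cols, K, L):
    """Two-mode variance criterion: sum of squared deviations of each
    entry of the data matrix X from the mean of its block, where blocks
    are induced by the row partition `rows` (labels 0..K-1) and the
    column partition `cols` (labels 0..L-1)."""
    crit = 0.0
    for k in range(K):
        for l in range(L):
            block = X[np.ix_(rows == k, cols == l)]
            if block.size:
                crit += ((block - block.mean()) ** 2).sum()
    return crit

# A tabu-search step would evaluate this criterion for every transfer of
# a single row/column object to a new class, pick the best non-tabu move,
# and append the resulting criterion value to the tabu list.
```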
Dynamical Clustering of Interval Data: Optimization of an Adequacy Criterion Based on Hausdorff Distance

In order to extend the dynamical clustering algorithm to interval data sets, we define the prototype of a cluster by optimization of a classical adequacy criterion based on the Hausdorff distance. Once this class prototype is properly defined, we give a simple and convergent algorithm for this new type of interval data.

Marie Chavent, Yves Lechevallier
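
For univariate intervals, the Hausdorff distance reduces to max(|a1 - a2|, |b1 - b2|), which equals the sum of the midpoint difference and the half-length difference; under a sum-of-distances adequacy criterion the optimal prototype then takes the median midpoint and the median half-length. A minimal sketch under that reading (not necessarily the paper's exact criterion):

```python
import numpy as np

def hausdorff(i1, i2):
    """Hausdorff distance between intervals i = (lower, upper):
    max(|a1 - a2|, |b1 - b2|) = |midpoint diff| + |half-length diff|."""
    return max(abs(i1[0] - i2[0]), abs(i1[1] - i2[1]))

def prototype(intervals):
    """Interval minimizing the sum of Hausdorff distances to a cluster
    (under the L1-type reading above): its midpoint and half-length are
    the medians of the cluster's midpoints and half-lengths."""
    a, b = np.asarray(intervals, dtype=float).T
    m, l = np.median((a + b) / 2), np.median((b - a) / 2)
    return (m - l, m + l)
```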
Removing Separation Conditions in a 1 against 3-Components Gaussian Mixture Problem

In this paper we address the problem of testing homogeneity against a three-component Gaussian mixture in the univariate case. As the alternative we consider a contamination model. The likelihood ratio test statistic (LRTS) is derived up to an o_P(1) term, and two separation conditions are removed. An example with real data is discussed.

Bernard Garel, Franck Goussanou
Obtaining Partitions of a Set of Hard or Fuzzy Partitions

The paper presents methodology for obtaining a (secondary) partition of a set of (hard or fuzzy) primary partitions, specifying a consensus fuzzy partition for each of the classes in the secondary partition.

Allan D. Gordon, Maurizio Vichi
Clustering for Prototype Selection using Singular Value Decomposition

Data clustering is an important technique for exploratory data analysis. The speed, reliability and consistency with which a clustering algorithm can organize large amounts of data constitute reasons to use it in applications like data mining, document retrieval, signal compression, coding and pattern classification. In this paper, we use clustering for efficient large-scale pattern classification; more specifically, we achieve it by selecting appropriate prototypes and features using Singular Value Decomposition (SVD). It is found that the SVD-based clustering not only selects better prototypes, but also reduces the memory and computational requirements by 98% over the conventional Nearest Neighbour Classifier (NNC) (T. M. Cover and P. E. Hart (1967)), on OCR data.

A. K. V. Sai Jayram, M. Narasimha Murty
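
A minimal sketch of the dimensionality-reduction step via SVD; the prototype-selection rule itself is the paper's contribution and is only hinted at in a comment:

```python
import numpy as np

def svd_scores(X, r):
    """Project patterns X (n x d) onto the top-r right singular vectors,
    reducing the feature dimension from d to r before prototypes are
    chosen.  The actual selection rule is specific to the paper and is
    not reproduced here."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:r].T          # n x r reduced representation

# Prototypes could then be picked per class in the reduced space (e.g.
# cluster centroids), shrinking the reference set an NNC must store.
```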
Clustering in High-dimensional Data Spaces

By high-dimensional we mean dimensionality of the same order as the number of objects or observations to cluster, and the latter in the range of thousands upwards. Bellman’s “curse of dimensionality” applies to many widely-used data analysis methods in high-dimensional spaces. One way to address this problem is by array permuting methods, involving row/column reordering. Such methods are closely related to dimensionality reduction methods such as principal components analysis. An imposed order on an array is beneficial not only for visualization but also for the use of a vast range of image processing methods. For example, in this context clustering becomes image feature detection.

Fionn Murtagh
Quantization of Models: Local Approach and Asymptotically Optimal Partitions

In this paper we review algorithmic aspects related to maximum-support-plane partitions. These partitions have been defined in Bock (1992) and analyzed in Pötzelberger and Strasser (2001). The local approach to inference leads to a certain subclass of partitions. They are obtained from quantizing the distribution of the score function. We propose a numerical method to compute these partitions B approximately, in the sense that they are asymptotically optimal for increasing sizes |B|. These findings are based on recent results on the asymptotic distribution of sets of prototypes.

Klaus Pötzelberger
The Performance of an Autonomous Clustering Technique

Recently, multi-agent systems have been discussed in the field of adaptive control. Autonomous clustering can be viewed as a multi-agent system which constructs clusters by moving each pair of objects closer or farther according to their relative similarity to all of the objects. In this system, the objects correspond to autonomous agents, and the similarity relation is regarded as the environment. By defining a suitable action rule for the agents, the clusters are constructed automatically. In this paper, we discuss the ability of the autonomous clustering technique to detect clusters through concrete examples.

Yoshiharu Sato
Cluster Analysis by Restricted Random Walks

A new graph-theoretical method is introduced to analyze a data set of objects described by a dissimilarity matrix (d_ij). This method is based on the generation of a series of random walks in the data set. We define a random walk in a data set by moving in each time step from one object to another one at random. In order to make the random walk reflect the structure of the data, a restriction depending on the previous moves is imposed during its generation, so that the walk is attracted by clusters formed of similar objects. We define a hierarchical set of graphs, consisting of all connections of a series of random walks at a certain time step, to detect the structure of the data and to derive a similarity measure between the objects. An application of the method is shown in an example.

Joachim Schöll, Elisabeth Paschinger
Missing Data in Hierarchical Classification of Variables — a Simulation Study

Building on earlier work, we study the effect of missing data in hierarchical classification of variables with respect to the following factors: amount of missing data, imputation technique, similarity coefficient, and aggregation criterion. We have used two methods of imputation: a regression method based on OLS and an EM algorithm. For the similarity matrices we have used the basic affinity coefficient and Pearson's correlation coefficient. As aggregation criteria we apply the average linkage, single linkage and complete linkage methods. To compare the structure of the hierarchical classifications, Spearman's coefficient between the associated ultrametrics has been used. We present here simulation experiments in five multivariate normal cases.

Ana Lorga da Silva, Helena Bacelar-Nicolau, Gilbert Saporta

Cluster Validation

Representation and Evaluation of Partitions

Many methods build a partition of a finite set, given a metric. In this text we propose some criteria to evaluate the quality of each class, and other parameters to compare several partitions of the same data set. Then, we indicate how to represent a partition graphically as a tree of boxes, each one containing information about the quality of a class.

Alain Guénoche, Henri Garreta
Assessing the Number of Clusters of the Latent Class Model

When partitioning the data is the main concern, it is implicitly assumed that each cluster can be approximately regarded as a sample from one component of a mixture model. Thus, the clustering problem can be viewed as an estimation problem for the parameters of the mixture. Setting this problem under the maximum likelihood and classification likelihood approaches, we first study the clustering of objects described by categorical attributes using the latent class model, and we concentrate our attention on the problem of the number of components. To this end, we use three criteria derived within a Bayesian framework to tackle this problem. These criteria, based on approximations of the integrated likelihood and of the integrated classification likelihood, have recently been compared for Gaussian mixtures. In this work, we propose to extend these comparisons to the latent class model.

François-Xavier Jollois, Mohamed Nadif, Gérard Govaert
Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion

In this paper we present a linearization of the Lerman clustering index for determining the number of clusters in a data set. Our goal was to apply the linearized index to large data sets containing both numerical and categorical values. The initial index, which was based on the set of pairs of objects, had a complexity of O(n^2). In this work its complexity is reduced to O(n), and so we can apply it to the large data sets frequently encountered in Data Mining applications. The clustering algorithm used is an extension of the k-means algorithm to domains with mixed numerical and categorical values (Huang (1998)). The quality of the index is empirically evaluated on some data sets, both artificial and real.

Israel Lerman, Joaquim Pinto da Costa, Helena Silva
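
The mixed-value extension of k-means cited above (Huang (1998)) combines a numerical and a categorical part in its dissimilarity measure; a minimal sketch of such a Huang-style measure, with gamma weighting the categorical mismatches:

```python
import numpy as np

def mixed_dissim(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Huang-style dissimilarity between an object and a cluster
    prototype: squared Euclidean distance on the numerical attributes
    plus gamma times the number of categorical mismatches."""
    return np.sum((x_num - proto_num) ** 2) + gamma * np.sum(x_cat != proto_cat)
```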

Discrimination

Effect of Feature Selection on Bagging Classifiers Based on Kernel Density Estimators

A combination of classification rules (classifiers) is known as an Ensemble, and in general it is more accurate than the individual classifiers used to build it. One method to construct an Ensemble is Bagging, introduced by Breiman (1996). This method relies on resampling techniques to obtain different training sets for each of the classifiers. Previous work has shown that Bagging is very effective for unstable classifiers. In this paper we present some results on the application of Bagging to classifiers where the class conditional density is estimated using kernel density estimators. The effect of feature selection in Bagging is also considered.

Edgar Acuña, Alex Rojas, Frida Coaquira
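
A rough sketch of the general scheme, assuming Gaussian kernels and equal class priors (not the authors' exact setup): each bootstrap replicate fits one kernel density estimate per class and votes for the class with the highest estimated density at the query point.

```python
import numpy as np
from scipy.stats import gaussian_kde

def bagged_kde_predict(X, y, x_new, B=25, seed=0):
    """Bagging for a kernel-density classifier: B bootstrap replicates,
    one KDE per class each, combined by majority vote.  Assumes every
    class keeps enough points in each bootstrap sample for a KDE fit."""
    rng = np.random.default_rng(seed)
    classes, votes = np.unique(y), []
    for _ in range(B):
        idx = rng.integers(0, len(y), len(y))          # bootstrap sample
        Xb, yb = X[idx], y[idx]
        dens = [gaussian_kde(Xb[yb == c].T)(x_new)[0] for c in classes]
        votes.append(classes[int(np.argmax(dens))])
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]                      # majority vote
```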
Biplot Methodology for Discriminant Analysis Based upon Robust Methods and Principal Curves

Biplots not only are useful graphical representations of multidimensional data, but formulating discriminant analysis in terms of biplot methodology can lead to several novel extensions. In this paper it is shown that incorporating both principal curves and robust canonical variate analysis algorithms in biplot methodology often leads to superior classification.

Sugnet Gardner, Niel le Roux
Bagging Combined Classifiers

Aggregated classifiers have proven to be successful in reducing misclassification error in a wide range of classification problems. One of the most popular approaches is bagging. But often simple procedures perform comparably in specific applications. For example, linear discriminant analysis (LDA) provides efficient classifiers if the underlying class structure is linear in the predictors. We suggest bagging for a combination of tree classifiers and LDA. The out-of-bag sample is used as an independent learning sample for the computation of linear discriminant functions. The corresponding discriminant variables of the bootstrap sample are used as additional predictors for a classification tree. We illustrate the proposal by a glaucoma classification with laser scanning image data. Moreover, we analyse the properties with a simulation study and benchmark data sets. In summary, our proposal has misclassification error comparable to LDA when LDA performs best and comparable to bagged trees when bagged trees perform best.

Torsten Hothorn, Berthold Lausen
Application of Bayesian Decision Theory to Constrained Classification Networks

A classification network for classifying subjects as either suitable or unsuitable for a treatment, followed by classifying accepted subjects as either a master or a nonmaster, will be formalized in the case of several relevant subpopulations. It will further be assumed that only a fixed number of subjects can be accepted for the treatment. The purpose of this paper is to optimize this constrained classification network simultaneously using Bayesian decision theory. In doing so, important distinctions will be made between weak and strong as well as monotone and nonmonotone decision rules. Also, a theorem will be given stating under what conditions optimal (weak) rules are monotone.

Hans J. Vos

Multivariate Data Analysis and Statistics

Frontmatter

Multivariate Data Analysis

Quotient Dissimilarities, Euclidean Embeddability, and Huygens’ Weak Principle

We introduce a broad class of categorical dissimilarities, the quotient dissimilarities, for which aggregation invariance is automatically satisfied. This class contains the chi-square, ratio, Kullback-Leibler and Hellinger dissimilarities, as well as presumably new “power” and “threshold” dissimilarity families. For a large sub-class of the latter, the product dissimilarities, we show that the Euclidean embeddability property on one hand and the weak Huygens’ principle on the other hand are mutually exclusive, the only exception being provided by the chi-square dissimilarity. Various suggestions are presented, aimed at generalizing Factorial Correspondence Analysis beyond the chi-square metric, by non-linear distortion of departures from independence. In particular, the central inertia appearing in one formulation precisely amounts to the mutual information of Information Theory.

François Bavaud
Conjoint Analysis and Stimulus Presentation — a Comparison of Alternative Methods

The rapid development of the multimedia industry has led to improved possibilities to realistically present new product concepts to potential buyers even before prototypical realizations of the new products are available. Especially in conjoint studies, where product concepts are presented as stimuli with systematically varying features, the usage of pictures, sounds, animations, mock-ups or even virtual reality should result in a reduction of respondents' uncertainty with respect to (w.r.t.) innovative features and (hopefully) in an improved validity of the collected preferential responses. This paper examines differences between three stimulus presentation methods: verbal, multimedia, and real.

Michael Brusch, Daniel Baier, Antje Treppa
Grade Correspondence-cluster Analysis Applied to Separate Components of Reversely Regular Mixtures

The paper shows how the method called grade correspondence-cluster analysis (GCCA) can extract data subtables which are characterized by specifically regular, distinctly different data structures. A short review of the basic ideas underlying GCCA is given in Sec. 1. A description of straight and reverse regularity of data tables transformed by the GCCA is given in Sec. 2. These concepts are illustrated on a real data example, which describes development factors and economic status of Polish small business service firms. In the following sections, this data table is analyzed and the effects of the method are demonstrated.

Alicja Ciok
Obtaining Reducts with a Genetic Algorithm

Given a data table associated with p qualitative attributes measured on n objects, classified into k classes or categories, a reduct of that table is a subset of the p attributes that produces the same classification of the individuals as all the attributes together. In this paper we present a genetic algorithm to calculate some reducts of such a data table. We propose a fitness function and an evolutionary program, with modifications to ensure convergence. Finally, we present some numerical results.

José Luis Espinoza
A Projection Algorithm for Regression with Collinearity

Principal component regression (PCR) is often used in regression with multicollinearity. Although this method avoids the problems which can arise in the least squares (LS) approach, it is not optimized with respect to the ability to predict the response variable. We propose a method which combines the two steps in the PCR procedure, namely finding the principal components (PCs) and regression of the response variable on the PCs. The resulting method aims at maximizing the coefficient of determination for a selected number of predictor variables, and therefore the number of predictor variables can be reduced compared to PCR. An important feature of the proposed method is that it can easily be robustified using robust measures of correlation.

Peter Filzmoser, Christophe Croux
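
For reference, the two-step PCR baseline that the proposed method merges into a single optimization can be sketched as follows (plain least squares on the first principal component scores; an illustrative sketch, not the paper's combined algorithm):

```python
import numpy as np

def pcr(X, y, n_comp):
    """Plain principal component regression: (1) compute the PCs of X,
    (2) regress y on the first n_comp scores, (3) map the coefficients
    back to the original predictors.  Returns (beta, intercept)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_comp].T                      # component scores
    g, *_ = np.linalg.lstsq(T, yc, rcond=None)  # LS fit on the scores
    beta = Vt[:n_comp].T @ g                    # back to X coordinates
    return beta, y.mean() - X.mean(axis=0) @ beta
```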
Confronting Data Analysis with Constructivist Philosophy

This paper develops some ideas from the confrontation of data analysis with constructivist philosophy. This epistemology considers reality as dependent only on its observers. Objective reality can never be observed. Perceptions are not considered as representations of objective reality, but as a means of the self-organization of humans. In data analysis, this leads to thoughts about the impact of data gathering on reality, the necessity of subjective decisions and their frank discussion, the nature of statistical predictions, and the role of probability models (frequentist and epistemic). An example from market segmentation is discussed.

Christian Hennig

Statistical Methods

Maximum Likelihood Clustering with Outliers

Suppose that we are given a list of n ℝ^p-valued observations and a natural number r ≤ n. Further, assume that r of them arise from any one of g normally distributed populations, whereas the other n - r observations are assumed to be contaminations. We develop estimators which simultaneously detect n - r outliers and partition the remaining r observations into g clusters. We analyze under which conditions these estimators are maximum likelihood estimators. Finally, we propose algorithms that approximate these estimators.

María Teresa Gallegos
An Improved Method for Estimating the Modes of the Probability Density Function and the Number of Classes for PDF-based Clustering

This paper proposes an improvement of the estimation of the modes of the Probability Density Function (PDF) in clustering procedures. The k nearest neighbours are excluded when the PDF is estimated, in order to avoid parasitic modes in the PDF estimate. The number of detected modes is analyzed using the bootstrap technique. The number of clusters is set equal to the most frequent number of modes obtained when resampling the data (if the frequency is greater than 50%). The method to estimate the number of clusters is illustrated by an example in image processing.

Michel Herbin, Noel Bonnet
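
A minimal sketch of the bootstrap step for univariate data, using a plain kernel density estimate (the paper's k-nearest-neighbour exclusion is omitted here):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmax

def bootstrap_n_modes(x, B=100, seed=0):
    """Estimate the number of clusters as the most frequent number of
    KDE modes over B bootstrap resamples, kept only if that count
    occurs in more than 50% of the resamples (as in the abstract)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(x.min(), x.max(), 512)
    counts = []
    for _ in range(B):
        xb = rng.choice(x, size=len(x), replace=True)
        f = gaussian_kde(xb)(grid)
        counts.append(len(argrelmax(f)[0]))   # number of local maxima
    vals, freq = np.unique(counts, return_counts=True)
    best = np.argmax(freq)
    return vals[best] if freq[best] / B > 0.5 else None
```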
Maximization of Measure of Allowable Sample Sizes Region in Stratified Sampling

This paper concerns the problem of optimal stratified sampling when the purpose is to simultaneously estimate the means of many variables. The goal is to stratify the population and allocate the sample in ways that minimize survey costs and ensure that all variances of the mean estimators are sufficiently small. The paper proposes a solution method for the case of imprecisely determined unit sampling costs. The method is based on maximization of the volume of a certain subset of the set of admissible sample sizes.

Marcin Skibicki
On Estimation of Population Averages on the Basis of Cluster Sample

Clustering methods are applied to improve the precision of estimation of means in a fixed and finite population. The particular case of estimation of population means on the basis of a sample drawn from a population partitioned into strata by means of clustering algorithms is considered by Huisman (2000) and Wywiał (1995, 1998, 2001). Here an estimator of a vector of population averages from a cluster sample is considered. The accuracy of the estimator increases when the degree of within-cluster spread increases. The precision of the vector estimator is compared with the precision of simple sample means. The problem is to find a partition of the population into mutually disjoint clusters such that the vector estimator from the cluster sample is a more precise estimator of the population averages than the vector of simple sample means. A clustering algorithm is proposed.

Janusz Wywiał

Symbolic Data Analysis

Symbolic Regression Analysis

Billard and Diday (2000) developed procedures for fitting a regression equation to symbolic interval-valued data. The present paper compares that approach with several possible alternative models using classical techniques; the symbolic regression approach is preferred. Thence, a regression approach is provided for symbolic histogram-valued data. The results are illustrated with a medical data set.

Lynne Billard, Edwin Diday
Modelling Memory Requirement with Normal Symbolic Form

Symbolic objects can deal with domain knowledge expressed by dependency rules between variables. However, taking these rules into account when analyzing symbolic data can lead to exponential computation time. This is why we introduced an approach called the Normal Symbolic Form (NSF), which leads to a polynomial computation time but may sometimes bring about an exponential explosion of the memory space required. In a previous paper we studied this possible memory requirement and saw how it is bounded according to the nature of the variables and rules. The aim of this paper is to model this memory space requirement. The proposed modelling of this problem is related to the Maxwell-Boltzmann statistics used in thermodynamics.

Marc Csernel, Francisco A. T. de Carvalho
Mixture Decomposition of Distributions by Copulas

Each unit is described by a vector of p distributions, each one associated with one of p variables. Our aim is to find simultaneously a “good” partition of these units and a model using “copulas” associated with each class of this partition. Different copula models are recalled. The mixture decomposition problem is settled in this general case. It extends the standard mixture decomposition problem to the case where each unit is described by a vector of distributions instead of, as usual, by a vector of single (categorical or numerical) values. Several generalizations of standard algorithms are suggested. All these results are first considered in the case of a single variable and then extended to the case of a vector of p variables by using a top-down binary tree approach.

Edwin Diday
Determination of the Number of Clusters for Symbolic Objects Described by Interval Variables

One of the important problems in cluster analysis is the objective assessment of the validity of the clusters found by a clustering algorithm. The problem of the determination of the “best” number of clusters has often been called the central problem of cluster validation. Numerous methods for the determination of the number of clusters have been proposed, but most of them are applicable only to classical data (qualitative, quantitative). In this paper we investigate the problem of the determination of the number of clusters for symbolic objects described by interval variables. We define a notion of convex hull for a set of symbolic objects of interval type. We obtain classical quantitative data, and consequently the classical rules for the determination of the number of clusters can be used. We consider the Hypervolumes test and the best stopping rules from the Milligan and Cooper (1985) study. Two symbolic clustering methods are also used: Scluster (Bock et al. (2001)), a dynamic partitioning procedure, and a monothetic divisive clustering algorithm (Chavent (1997)). Two data sets illustrate the methods. The first one is an artificially generated data set. The second one is a real data set.

André Hardy, Pascale Lallemand
Symbolic Data Analysis Approach to Clustering Large Datasets

The paper builds on the representation of units/clusters with a special type of symbolic object that consists of distributions of variables. Two compatible clustering methods are developed: a leaders method, which reduces a large dataset to a smaller set of symbolic objects (clusters), and a hierarchical clustering method, which is applied to this smaller set to reveal its internal structure. The proposed approach is illustrated on the USDA Nutrient Database.

Simona Korenjak-Černe, Vladimir Batagelj
Symbolic Class Descriptions

Our aim is to describe a class of a partition by a conjunction of characteristic properties. We use a stepwise top-down binary tree method. At each step we select the best variable and its optimal splitting to optimize simultaneously a discrimination criterion given by a prior partition and a homogeneity criterion. Moreover, this method deals with a data table in which each cell contains a histogram of nominal categories, but the method can be extended or reduced to other types of data. The method is illustrated on both simulated data and real data.

Mathieu Vrac, Edwin Diday, Suzanne Winsberg, Mohamed Mehdi Limam

Consensus Trees and Phylogenetics

A Comparison of Alternative Methods for Detecting Reticulation Events in Phylogenetic Analysis

A growing concern in phylogenetic analysis is our ability to detect events of reticulate evolution (e.g. hybridization) that deviate from the strictly branching pattern depicted by phylogenetic trees. Although algorithms for estimating networks rather than trees are available, no formal evaluation of their ability to detect actual reticulations has been performed. In this paper, we evaluate the performance of reticulograms and split decomposition graphs (or splitsgraphs) in the identification of known hybridization events in a phylogeny. Our results show that neither technique permits unambiguous identification of hybrids. We thus introduce a quartet-based approach used in combination with these two methods and show that quartet analysis of splitsgraphs leads to a near-perfect identification of hybrids. We also suggest ways in which the reticulogram reconstruction algorithm could be improved.

Olivier Gauthier, François-Joseph Lapointe
Hierarchical Clustering of Multiple Decision Trees

Decision tree learning is relatively non-robust: a small change in the training set may significantly change the structure of the induced decision tree. This paper presents a decision tree construction method in which the domain model is constructed by consensus clustering of N decision trees induced in N-fold cross-validation. Experimental results show that consensus decision trees are simpler than C4.5 decision trees, indicating that they may be a more stable approximation of the intended domain model than decision trees constructed from the entire set of training instances.

Branko Kavšek, Nada Lavrač, Anuška Ferligoj
Multiple Consensus Trees

A consensus function takes as input a profile of trees representing the relationships among overlapping or identical sets of objects and returns a tree that is in some sense closest to the entire profile. A single tree is obtained by most consensus functions, although this solution may not always be representative of the profile of input trees. If a unique consensus tree is not suitable, how many consensus trees are needed? In this paper, we propose a procedure for the computation of multiple consensus trees to summarize a profile of input trees. Different stopping rules are discussed to determine the optimal number of consensus trees. This procedure is developed in the special case of the average consensus method for weighted trees. An application to phylogenetic trees is presented.

François-Joseph Lapointe, Guy Cucumel
A Family of Average Consensus Methods for Weighted Trees

Consensus techniques represent useful tools in phylogenetic analysis, particularly for combining trees derived from different data sets. In the present paper, a family of average consensus methods for weighted trees is presented; the mean and median procedures are compared and applied to combine phylogenetic trees while taking into account their branch lengths. We also provide some recommendations about the use of these consensus techniques in relation to the more classical methods based on topological relationships alone.

Claudine Levasseur, François-Joseph Lapointe
Comparison of Four Methods for Inferring Additive Trees from Incomplete Dissimilarity Matrices

The problem of inferring an additive tree from an incomplete dissimilarity matrix is known to be very delicate. As a solution to this problem, it has been suggested either to estimate the missing entries of a given partial dissimilarity matrix prior to tree reconstruction (De Soete, 1984 and Landry et al., 1997) or to reconstruct an additive tree directly from incomplete data (Makarenkov and Leclerc, 1999 and Guénoche and Leclerc, 2001). In this paper, I propose a new method, based on least-squares approximation, for inferring additive trees from partial dissimilarity matrices. The capacity of the new method to recover a true tree structure is compared to those of three well-known techniques for tree reconstruction from partial data. The new method is shown to work better than the widely used Ultrametric and Additive reconstruction techniques, as well as the recently proposed Triangle method, on incomplete dissimilarity matrices of different sizes and under different noise conditions.

Vladimir Makarenkov
Quartet Trees as a Tool to Reconstruct Large Trees from Sequences

We suggest a method to reconstruct phylogenetic trees from a large set of aligned sequences. Our approach is based upon a modified version of the quartet puzzling algorithm and does not require the computation of all $\binom{n}{4}$ quartets if n sequences are aligned. We show that the phylogenetic resolution of the resulting tree depends on the number of quartets one is willing to analyze. As a biological example we study the alignment of 215 red algae sequences obtained from the European ssu rRNA Database. Finally, we discuss several modifications and extensions of the algorithm.

Heiko A. Schmidt, Arndt von Haeseler
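
To see why avoiding the full quartet set matters: the number of four-taxon subsets grows as O(n^4), so for the 215 red algae sequences of the example an exhaustive analysis is already enormous.

```python
from math import comb

n = 215                 # aligned sequences in the paper's example
print(comb(n, 4))       # 86,567,815 quartets if all were computed
```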

Regression Trees

Regression Trees for Longitudinal Data with Time-Dependent Covariates

In this paper the problem of longitudinal data modelling in the presence of time-dependent covariates is addressed. A solution based on a recursive partitioning method is proposed. The key points of this solution are the definition of a suitable split function Φ(s,g), able to account for the autocorrelation structure typical of longitudinal data, the definition of splits on time-dependent covariates, and the estimation procedure for the value of the step function on each element of the partition of the covariate space induced by the solution itself. The performance of the proposed strategy is studied in a simulation experiment.

Giuliano Galimberti, Angela Montanari
Tree-based Models in Statistics: Three Decades of Research

The interest in tree-structured methods has been growing rapidly in statistics. In fact, all commercial statistical packages and Data Mining tools have been equipped with tree-building modules. The research in this field has its roots in the early 1970s, when the first papers on recursive partitioning of the feature space (whose result has the form of a tree) were published in statistical journals. These papers initiated intensive research in nonparametric statistical methods for classification, regression, survival analysis etc. The aim of this paper is to summarize the achievements of this research and point out some still open problems.

Eugeniusz Gatnar
Computationally Efficient Linear Regression Trees

This paper describes a method for obtaining regression trees using linear regression models in the leaves in a computationally efficient way that allows the use of this method on large data sets. This work is focused on deriving a set of formulae with the goal of allowing an efficient evaluation of all candidate tests that are considered during tree growth.

Luis Torgo
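
Torgo's actual formulae are in the paper; as a generic illustration of the idea, candidate splits on a sorted predictor can be scanned while the left node's sufficient statistics are updated incrementally and the right node's are obtained by subtraction, so no linear model is refit from scratch:

```python
import numpy as np

def best_split(X, y, ridge=1e-8):
    """Scan all split points on predictor 0.  Sufficient statistics
    (XtX, Xty, yty) of the left node are updated in O(d^2) per
    observation; the right node's are derived by subtracting from the
    totals, so each candidate costs two small solves, not two refits."""
    n, d = X.shape
    order = np.argsort(X[:, 0])
    Xs, ys = X[order], y[order]
    XtX_tot, Xty_tot, yty_tot = Xs.T @ Xs, Xs.T @ ys, ys @ ys
    XtX, Xty, yty = np.zeros((d, d)), np.zeros(d), 0.0
    I = ridge * np.eye(d)                 # tiny ridge for stability
    best_sse, best_cut = np.inf, None
    for i in range(n - 1):
        x, yi = Xs[i], ys[i]
        XtX += np.outer(x, x)
        Xty += yi * x
        yty += yi * yi
        if d <= i + 1 <= n - d:           # both children large enough
            bL = np.linalg.solve(XtX + I, Xty)
            bR = np.linalg.solve(XtX_tot - XtX + I, Xty_tot - Xty)
            sse = (yty - Xty @ bL) + (yty_tot - yty - (Xty_tot - Xty) @ bR)
            if sse < best_sse:
                best_sse, best_cut = sse, Xs[i, 0]
    return best_sse, best_cut
```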

Neural Networks and Genetic Algorithms

A Clustering Based Procedure for Learning the Hidden Unit Parameters in Elliptical Basis Function Networks

Radial basis function (RBF) neural networks can be used for a wide range of applications, as they can be regarded as universal approximators and their training is faster than that of multilayer perceptrons. A more general version of these neural networks is referred to as elliptical basis function (EBF) networks. In this paper a robust method for EBF parameter estimation is proposed, based on hyperellipsoidal clustering and on multivariate sign and rank concepts. A simulation study on a classification problem has shown that this method represents a valid learning scheme, particularly in the presence of outlying data.

Marilena Pillati, Daniela G. Calò
Multi-layer Perceptron on Interval Data

We study in this paper several methods that allow one to use interval data as inputs for Multi-layer Perceptrons. We show that interesting results can be obtained by using two methods together: the extremal values method, which is based on a complete description of intervals, and the simulation method, which is based on a probabilistic understanding of intervals. Both methods can be easily implemented on top of existing neural network software.

Fabrice Rossi, Brieuc Conan-Guez
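
Both methods lend themselves to a very small sketch (helper names are hypothetical; `lower` and `upper` are n x p arrays of interval bounds):

```python
import numpy as np

def extremal_inputs(lower, upper):
    """Extremal-values idea: describe each interval variable by its two
    bounds, so an n x p table of intervals becomes an n x 2p crisp
    table that any standard MLP can consume."""
    return np.hstack([lower, upper])

def simulated_inputs(lower, upper, draws, rng=np.random.default_rng(0)):
    """Simulation idea: sample crisp vectors uniformly inside each
    interval and train the MLP on the augmented sample."""
    n, p = lower.shape
    u = rng.uniform(size=(draws, n, p))
    return (lower + u * (upper - lower)).reshape(draws * n, p)
```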

Applications

Frontmatter
Textual Analysis of Customer Statements for Quality Control and Help Desk Support

Several business-to-customer applications, such as the analysis of customer feedback and inquiries, can be improved by text mining approaches. These approaches give new insights into customers' needs and desires by automatically processing their messages. Previously unknown facts and relations can be detected, and organizations as well as employees profit from these document and knowledge management tools. The techniques used are rather simple but robust: they are derived from basic distance calculation between feature vectors in the vector space model.

Ulrich Bohnacker, Lars Dehning, Jürgen Franke, Ingrid Renz
AHP as Support for Strategy Decision Making in Banking

This article is an approach to the problem of group decision making in which one of the decision makers plays a decisive role. In our research we focus on the banking sector and a possible choice concerning marketing strategy under a very limited cost budget. Our solution to the problem is an adjusted Analytic Hierarchy Process (AHP) application model.

Czesław Domański, Jarosław Kondrasiuk
Bioinformatics and Classification: The Analysis of Genome Expression Data

The availability of microarray data has caused new interest in clustering and classification methods. DNA microarrays are likely to play an important role for diagnosis and prognosis in clinical practice. Using the example of gene expression in diffuse large B-cell lymphoma, I introduce and review proposals for the experimental design and pattern recognition problems of gene expression experiments, the supervised learning or classification problem, the unsupervised learning or clustering problem, and the potential for improving prognostic models. Moreover, I suggest bootstrap methods to estimate the stability of the hierarchical clustering used. The proposal is applied to prognostic factors derived from microarray data for censored survival data.

Berthold Lausen
Glaucoma Diagnosis by Indirect Classifiers

Medical decision making is based on various diagnostic measurements and the evaluation of anamnestic information. For example, glaucoma diagnosis is based on several direct and indirect assessments of the eye. We discuss possibilities to use a definition of glaucoma to improve supervised glaucoma classification by laser scanning image data. The learning sample consists of laser scanning image data and other diagnostic measurements. We discuss indirect supervised classification as proposed by Hand et al. (2001), which provides a framework to incorporate the full information of the learning sample as intermediate variables. We compare direct classifiers like linear discriminant analysis, classification trees and bagged classification trees with indirect classification. Comparing bagged direct and bagged indirect classifiers, we achieve a reduction of estimated misclassification error from 18.3% to 15.4%.

Andrea Peters, Torsten Hothorn, Berthold Lausen
A Cluster Analysis of the Importance of Country and Sector on Company Returns

Does a company’s country of incorporation or the sector of its activity have a greater influence on its equity returns? Rather than using dummy variable regressions or factor models on pre-determined characteristics, this study finds naturally occurring groups of companies based on the company returns themselves. The resulting groups are then examined in terms of their country and sector composition. The cluster analysis outcome indicates that companies group by country rather than by sector and that this effect is not diminishing.

Clifford W. Sell
Problems of Classification in Investigative Psychology

Problems of classification in the field of Investigative Psychology are defined and examples of each problem class are introduced. The problems addressed are behavioural differentiation, discrimination among alternatives, and prioritisation of investigative options. Contemporary solutions to these problems are presented that use smallest space analysis, receiver operating characteristic analysis, and probability functions.

Paul J. Taylor, Craig Bennell, Brent Snook
Backmatter
Metadata
Title
Classification, Clustering, and Data Analysis
Edited by
Prof. Krzysztof Jajuga
Prof. Andrzej Sokołowski
Prof. Hans-Hermann Bock
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-56181-8
Print ISBN
978-3-540-43691-1
DOI
https://doi.org/10.1007/978-3-642-56181-8