2004 | Book

Classification, Clustering, and Data Mining Applications

Proceedings of the Meeting of the International Federation of Classification Societies (IFCS), Illinois Institute of Technology, Chicago, 15–18 July 2004

Editors: Dr. David Banks, Dr. Frederick R. McMorris, Dr. Phipps Arabie, Prof. Dr. Wolfgang Gaul

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Classification, Data Analysis, and Knowledge Organization

About this book

Modern data analysis stands at the interface of statistics, computer science, and discrete mathematics. This volume describes new methods in this area, with special emphasis on classification and cluster analysis. Those methods are applied to problems in information retrieval, phylogeny, medical diagnosis, microarrays, and other active research areas.

Table of Contents

Frontmatter

New Methods in Cluster Analysis

Frontmatter
Thinking Ultrametrically

The triangular inequality is a defining property of a metric space, while the stronger ultrametric inequality is a defining property of an ultrametric space. Ultrametric distance is defined from p-adic valuation. It is known that ultrametricity is a natural property of spaces that are sparse. Here we look at the quantification of ultrametricity. We also look at data compression based on a new ultrametric wavelet transform. We conclude with computational implications of prevalent and perhaps ubiquitous ultrametricity.
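For reference, the two inequalities contrasted here, written out: the ordinary triangle inequality of a metric, and the stronger ultrametric inequality, which forces every triangle to be isosceles with its two longest sides equal.

\[
d(x,z) \;\le\; d(x,y) + d(y,z) \quad\text{(triangle inequality)},
\qquad
d(x,z) \;\le\; \max\{\, d(x,y),\, d(y,z) \,\} \quad\text{(ultrametric inequality)}.
\]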

Fionn Murtagh
Clustering by Vertex Density in a Graph

In this paper we introduce a new principle for two classical problems in clustering: obtaining a set of partial classes and a partition on a set X of n elements. These structures are built from a distance D and a threshold value σ giving a threshold graph on X with maximum degree δ. The method is based on a density function De : X → R which is computed first from D. Then, the number of classes, the classes, and the partitions are established using only this density function and the graph edges, with a computational complexity of O(nδ). Monte Carlo simulations, from random Euclidean distances, validate the method.
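The abstract does not give the exact form of the density function De, so the sketch below only illustrates the general construction it describes (a threshold graph built from a distance matrix, and a per-vertex density computed from D), with an illustrative density definition that is an assumption, not the author's:

```python
import numpy as np

def threshold_graph(D, sigma):
    """Adjacency matrix of the threshold graph: i ~ j iff D[i, j] <= sigma."""
    return (D <= sigma) & ~np.eye(len(D), dtype=bool)

def vertex_density(D, A):
    """Hypothetical density De: inverse of the mean distance to graph neighbours.
    (The paper's actual De may differ; this is an illustrative choice.)"""
    dens = np.zeros(len(D))
    for i in range(len(D)):
        nbrs = np.where(A[i])[0]
        dens[i] = 1.0 / D[i, nbrs].mean() if len(nbrs) else 0.0
    return dens

# toy example with random points in the unit square
rng = np.random.default_rng(0)
X = rng.random((20, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
A = threshold_graph(D, sigma=0.3)
print(vertex_density(D, A))
```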

Alain Guénoche
Clustering by Ant Colony Optimization

We use the ant colony optimization heuristic in the partitioning problem to improve solutions of the k-means method (MacQueen (1967)). Each ant in the algorithm is associated with a partition, which is modified according to the principles of the heuristic; that is, by the random selection of an element, and the assignment of another element which is chosen according to a probability that depends on the pheromone trail (related to the overall criterion: the maximization of the between-classes variance) and a local criterion (the distance between objects). The pheromone trail is reinforced for those objects that belong to the same class. We present some preliminary results, compared to results of other techniques, such as simulated annealing, genetic algorithms, tabu search, and k-means. Our results are as good as the best of the above methods.
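A hedged sketch of the kind of pheromone-weighted move described above: pick a random object, choose a guide object with probability increasing in pheromone and decreasing in distance, and move the first object into the guide's class. The exponents and the reinforcement update are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def aco_step(labels, D, pher, alpha=1.0, beta=2.0):
    """One illustrative ant move: reassign a random object i to the class of an
    object j drawn with probability ~ pheromone[i, j]**alpha * (1 / D[i, j])**beta."""
    labels = np.asarray(labels)
    n = len(labels)
    i = rng.integers(n)
    others = np.array([j for j in range(n) if j != i])
    w = pher[i, others] ** alpha * (1.0 / (D[i, others] + 1e-12)) ** beta
    j = rng.choice(others, p=w / w.sum())
    labels[i] = labels[j]
    # reinforce pheromone between objects that now share a class (illustrative update)
    same = labels == labels[i]
    pher[i, same] += 0.1
    pher[same, i] += 0.1
    return labels, pher
```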

Javier Trejos, Alex Murillo, Eduardo Piza
A Dynamic Cluster Algorithm Based on Lr Distances for Quantitative Data

Dynamic clustering methods aim to obtain both a single partition of the input data into a fixed number of clusters and the identification of a suitable representation of each cluster simultaneously. In its adaptive version, at each iteration of these algorithms there is a different distance for the comparison of each cluster with its representation. In this paper, we present a dynamic cluster method based on Lr distances for quantitative data.
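For reference, the Lr (Minkowski) distance between two quantitative vectors x, y in R^p, the family of distances on which the method is based:

\[
d_r(x,y) \;=\; \Bigl( \sum_{j=1}^{p} \lvert x_j - y_j \rvert^{\,r} \Bigr)^{1/r},
\qquad r \ge 1 .
\]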

Francisco de A. T. de Carvalho, Yves Lechevallier, Renata M. C. R. de Souza
The Last Step of a New Divisive Monothetic Clustering Method: the Gluing-Back Criterion

Pirçon and Rasson (2003) propose a divisive monothetic clustering method. In that work, the selection of relevant variables is performed simultaneously with the formation of clusters. The method treats only one variable at each stage; a single variable is chosen to split a cluster into two sub-clusters. Then the sub-clusters are successively split until a stopping criterion is satisfied. The original splitting criterion is the solution of a maximum likelihood problem conditional on the fact that data are generated by a nonhomogeneous Poisson process on two disjoint intervals. The criterion splits at the place of the maximum integrated intensity between two consecutive points. This paper extends Pirçon and Rasson (2003) by developing a “gluing-back” criterion. The previous work explained the new method for combining trees and nonhomogeneous Poisson processes and the splitting criterion. It was shown that the maximum likelihood criterion reduces to minimization of the integrated intensity on the domain containing all of the points. This method of clustering is indexed, divisive, monothetic and hierarchical, but its performance can be improved through a gluing-back criterion. That criterion is developed in this paper, after a brief review of the main ideas.

Jean-Yves Pirçon, Jean-Paul Rasson
Standardizing Variables in K-means Clustering

Several standardization methods are investigated in conjunction with the K-means algorithm under various conditions. We find that traditional standardization methods (i.e., z-scores) are inferior to alternative standardization methods. Future suggestions concerning the combination of standardization and variable selection are considered.
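A small sketch contrasting the traditional z-score standardization with range scaling before k-means; the specific alternatives the paper recommends are not restated here, and the data and cluster count below are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

def range_scale(X):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

rng = np.random.default_rng(1)
# third variable has a much larger raw scale and would dominate unscaled k-means
X = np.hstack([rng.normal(0, 1, (100, 2)), rng.normal(0, 100, (100, 1))])

for name, Z in [("z-score", zscore(X)), ("range", range_scale(X))]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(name, np.bincount(labels))
```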

Douglas Steinley
A Self-Organizing Map for Dissimilarity Data

Treatment of complex data (for example symbolic data, semi-structured data, or functional data) cannot be easily done by clustering methods that are based on calculating the center of gravity. We present in this paper an extension of self-organizing maps to dissimilarity data. This extension allows the algorithm to be applied to numerous types of data in a convenient way.

Aïcha El Golli, Brieuc Conan-Guez, Fabrice Rossi
Another Version of the Block EM Algorithm

While most clustering procedures aim to construct an optimal partition of objects or, sometimes, of variables, there are other methods, called block clustering methods, which consider simultaneously the two sets and organize the data into homogeneous blocks. Recently, we have proposed a new mixture model called a block mixture model that addresses this situation. Our model allows one to embed simultaneous clustering of objects and variables through a mixture approach. We use maximum likelihood (ML) to implement the method, and have developed a new EM algorithm to estimate the parameters of this model. This requires an approximation of the likelihood and we propose an alternating-optimization algorithm, which is compared to another version of EM based on an interpretation given by Neal and Hinton. The comparison is performed through numerical experiments on simulated binary data.

Mohamed Nadif, Gérard Govaert
Controlling the Level of Separation of Components in Monte Carlo Studies of Latent Class Models

The paper proposes a new method to control the level of separation of components using a single parameter. An illustration for the latent class model (mixture of conditionally independent multinomial distributions) is provided. Further extensions to other finite mixture models are discussed.

José G. Dias
Fixing Parameters in the Constrained Hierarchical Classification Method: Application to Digital Image Segmentation

We analyse the influence of the parameters of the likelihood of the maximal link criteria family (the LLA hierarchical classification method) in the context of the CAHCVR algorithm (Classification Ascendante Hiérarchique sous Contrainte de contiguïté et par agrégation des Voisins Réciproques). The results are compared to those obtained with the inertia criterion (Ward) in the context of digital image segmentation. New strategies concerning multiple aggregation in class formation and the notion of contiguity are positively evaluated in terms of algorithmic complexity.

Kaddour Bachar, Israël-César Lerman
New Approaches for Sum-of-Diameters Clustering

It was shown recently that by employing ideas from incremental graph connectivity the asymptotic complexity of sum-of-diameters bipartitioning could be reduced. This article further exploits this idea to develop simpler algorithms and reports on an experimental comparison of all algorithms for this problem.

Sarnath Ramnath, Mynul H. Khan, Zubair Shams
Spatial Pyramidal Clustering Based on a Tessellation

Indexed hierarchies and indexed clustering pyramids are based on an underlying order for which each cluster is connected. In this paper our aim is to extend standard pyramids and their underlying one-to-one correspondence with Robinsonian dissimilarities to spatial pyramids where each cluster is “compatible” with a spatial network given by a kind of tessellation called “m/k-network”. We focus on convex spatial pyramids and we show that they are in one-to-one correspondence with a new kind of dissimilarities. We give a building algorithm for convex spatial pyramids illustrated by an example. We show that spatial pyramids can converge towards geometrical pyramids. We indicate finally that spatial pyramids can give better results than Kohonen mappings and can produce a geometrical representation of conceptual lattices.

Edwin Diday

Modern Nonparametrics

Frontmatter
Relative Projection Pursuit and its Application

In this paper, we propose a new method of projection pursuit, relative projection pursuit (RPP), which finds ‘interesting’ low dimensional spaces different from reference data sets predefined by the user. In addition, as an application of the method, we develop a new dimension reduction method: sliced inverse regression with relative projection pursuit.

Masahiro Mizuta, Shintaro Hiro
Priors for Neural Networks

Neural networks are commonly used for classification and regression. The Bayesian approach may be employed, but choosing a prior for the parameters presents challenges. This paper reviews several priors in the literature and introduces Jeffreys priors for neural network models. The effect on the posterior is demonstrated through an example.

Herbert K. H. Lee
Combining Models in Discrete Discriminant Analysis Through a Committee of Methods

In this paper we are concerned with combining models in discrete discriminant analysis in the small sample setting. Our approach consists of an adaptation to discriminant analysis of a method for combining models introduced in the neural computing literature (Bishop, 1995). For combining models we consider a single coefficient obtained in a fuzzy classification rule context and we evaluate its performance through numerical experiments (Sousa Ferreira, 2000).

Ana Sousa Ferreira
Phoneme Discrimination with Functional Multi-Layer Perceptrons

In many situations, high dimensional data can be considered as sampled functions. We recall in this paper how to implement a Multi-Layer Perceptron (MLP) on such data by approximating a theoretical MLP on functions thanks to basis expansion. We illustrate the proposed method on a phoneme discrimination problem.

Brieuc Conan-Guez, Fabrice Rossi
PLS Approach for Clusterwise Linear Regression on Functional Data

The Partial Least Squares (PLS) approach is used for the clusterwise linear regression algorithm when the set of predictor variables forms an L2-continuous stochastic process. The number of clusters is treated as unknown and the convergence of the clusterwise algorithm is discussed. The approach is compared with other methods via an application to stock-exchange data.

Cristian Preda, Gilbert Saporta
On Classification and Regression Trees for Multiple Responses

The tree method can be extended to multivariate responses, such as repeated measures and longitudinal data, by modifying the split function so as to accommodate multiple responses. Recently, some decision trees for multiple responses have been constructed by other researchers. However, their methods have limitations on the type of response; that is, they allow only continuous or only binary responses. Also, there is no tree method to analyze polytomous and ordinal responses. In this paper, we modify the tree procedure for univariate responses and suggest a new tree-based method that can analyze any type of multiple response by using Generalized Estimating Equations (GEE) techniques.

Seong Keon Lee
Subsetting Kernel Regression Models Using Genetic Algorithm and the Information Measure of Complexity

Recently in statistical data mining and knowledge discovery, kernel-based methods have attracted the attention of many researchers. As a result, many kernel-based methods have been developed; in particular, a family of regularized least squares regression models in a Reproducing Kernel Hilbert Space (RKHS) has been developed (Aronszajn, 1950). The RKHS family includes kernel principal component regression K-PCR (Rosipal et al. 2000, 2001), kernel ridge regression K-RR (Saunders et al., 1998, Cristianini and Shawe-Taylor, 2000) and, most recently, kernel partial least squares K-PLSR (Rosipal and Trejo 2001, Bennett and Embrechts 2003). Rosipal et al. (2001) compared the K-PLSR, K-PCR and K-RR techniques using conventional statistical procedures and demonstrated through computational experiments that “K-PLSR achieves the same results as K-PCR, but uses significantly fewer and qualitatively different components”.
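Of the RKHS family listed above, kernel ridge regression has the simplest closed form, solving (K + λI)α = y in the dual; a minimal illustrative sketch with a Gaussian kernel (the bandwidth and ridge values below are arbitrary):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    # dual coefficients: (K + lambda * I) alpha = y
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(Xtrain, alpha, Xnew, gamma=1.0):
    return rbf_kernel(Xnew, Xtrain, gamma) @ alpha

X = np.random.rand(50, 3)
y = np.sin(X.sum(axis=1))
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, alpha, X[:5]))
```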

Xinli Bao, Hamparsum Bozdogan
Cherry-Picking as a Robustness Tool

When there are problems with data quality, it often happens that a reasonably large fraction is good data, and expresses a clear statistical signal, while a smaller fraction is bad data that shows little signal. If it were possible to identify the subset of the data that collectively expresses a strong signal, then one would have a robust tool for uncovering structure in problematic datasets. This paper describes a search strategy for finding large subsets of data with strong signals. The methodology is illustrated for problems in regression. This work is part of a year-long program in statistical data mining that has been organized by SAMSI, the new National Science Foundation center for research at the interface of statistics and applied mathematics. Recently, high dimensional datasets such as microarray gene data and point-of-sale data have become important. It is generally difficult to see the structure of data when the dimension of data is high. Therefore, many studies have invented methods that reduce high dimensional data to lower dimensional data. Among these methods, projection pursuit was developed by Friedman and Tukey (1974) in order to search for an ‘interesting’ linear projection of multidimensional data. They defined the degree of ‘interestingness’ as the difference between the distribution of the projected data and the normal distribution. We call this measure a projection index. However, projection indices that measure the difference from the normal distribution do not always reveal interesting structure, because interesting structure depends on the purpose of the analysis. According to the scientific situation that motivates the data analysis, ‘uninteresting’ structure is not always the normal distribution. Relative projection pursuit allows the user to predefine a reference data set that represents ‘uninteresting’ structure. The projection index for relative projection pursuit measures the distance between the distribution of the projected target data set and that of the projected reference data set. We show the effectiveness of RPP with numerical examples and actual data.

Leanna L. House, David Banks

Classification and Dimension Reduction

Frontmatter
Academic Obsessions and Classification Realities: Ignoring Practicalities in Supervised Classification

Supervised classification methods have been the focus of a vast amount of research in recent decades, within a variety of intellectual disciplines, including statistics, machine learning, pattern recognition, and data mining. Highly sophisticated methods have been developed, using the full power of recent advances in computation. Many of these methods would have been simply inconceivable to earlier generations. However, most of these advances have largely taken place within the context of the classical supervised classification paradigm of data analysis. That is, a classification rule is constructed based on a given ‘design sample’ of data, with known and well-defined classes, and this rule is then used to classify future objects. This paper argues that this paradigm is often, perhaps typically, an over-idealisation of the practical realities of supervised classification problems. Furthermore, it is also argued that the sequential nature of the statistical modelling process means that the large gains in predictive accuracy are achieved early in the modelling process. Putting these two facts together leads to the suspicion that the apparent superiority of the highly sophisticated methods is often illusory: simple methods are often equally effective or even superior in classifying new data points.

David J. Hand
Modified Biplots for Enhancing Two-Class Discriminant Analysis

When applied to discriminant analysis (DA), biplot methodology leads to useful graphical displays for describing and quantifying multidimensional separation and overlap among classes. The principles of ordinary scatterplots are extended in these plots by adding information on all variables to the plot. However, we show that there are fundamental differences between two-class DA problems and the case J > 2: describing overlap in the two-class situation is relatively straightforward using density estimates, but adding information by way of multiple axes to the plot can be ambiguous unless care is taken. Contrary to this, describing overlap for J > 2 classes is relatively more complicated, but the fitting of multiple calibrated axes to biplots is well defined. We propose modifications to existing biplot methodology leading to useful biplots for use in the important case of two-class DA problems.

Sugnet Gardner, Niel le Roux
Weighted Likelihood Estimation of Person Locations in an Unfolding Model for Polytomous Responses

It is well known that there are no meaningful sufficient statistics for the person locations in a single peaked unfolding response model. The bias in the estimates of person locations of the general unfolding model for polytomous responses (Luo 2001) with conventional Maximum Likelihood Estimation (MLE) is likely to accumulate with various algorithms proposed in the literature. With the main aim of preventing the bias in the estimates of person locations in the equi-distant unfolding model when the values of item parameters are given, this paper derives the Weighted Likelihood Estimation (WLE) equations, following the approach of Warm (1989). A preliminary simulation study is also included.

Guanzhong Luo, David Andrich
Classification of Geospatial Lattice Data and their Graphical Representation

Statistical analysis of spatial data is an important problem in many fields. Lattice data are synoptic observations covering an entire spatial region, like cancer rates broken out by each county in a state. There are few approaches to cluster analysis for spatial data, but echelons are a useful technique for studying the topological structure of such data. In this paper, we explore cluster analysis for geospatial lattice data based on echelon analysis. We also provide new definitions of the neighbors and families of spatial data in order to support the clustering procedure. In addition, the spatial cluster structure is demonstrated by a hierarchical graphical representation with several examples. Regional features are also shown in this dendrogram.

Koji Kurihara
Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Dimension reduction techniques based on principal component analysis (PCA) and factor analysis are commonly used in statistical data analysis. The effectiveness of these methods is limited by their global nature. Recent efforts have focused on relaxing global restrictions in order to identify subsets of data that are concentrated on lower dimensional subspaces. In this paper, we propose an adaptive local dimension reduction method, called the Degenerate Expectation-Maximization Algorithm (DEM). This method is based on the finite mixture model. We demonstrate that the DEM yields significantly better results than the local PCA (LPCA) and other related methods in a variety of synthetic and real datasets. The DEM algorithm can be used in various applications ranging from clustering to information retrieval.

Xiaodong Lin, Yu Zhu
A Dimension Reduction Technique for Local Linear Regression

We offer a refinement of the technique of local regression which allows for dimension reduction and data compression.

Joseph Lucas
Reducing the Number of Variables Using Implicative Analysis

The interpretation of complex data with data mining techniques is often a difficult task. Nevertheless, this task may be simplified by reducing the variables which could be considered as equivalent. The aim of this paper is to describe a new method for reducing the number of variables in a large set of data. Implicative analysis, which builds association rules with a measure more powerful than conditional probability, is used to detect quasi-equivalent variables. This technique has some advantages over traditional similarity analysis.

Raphaël Couturier, Régis Gras, Fabrice Guillet
Optimal Discretization of Quantitative Attributes for Association Rules

Association rules for objects with quantitative attributes require the discretization of these attributes to limit the size of the search space. As each such discretization might collapse attribute levels that need to be distinguished for finding association rules, optimal discretization strategies are of interest. In 1996 Srikant and Agrawal formulated an information loss measure called measure of partial completeness and claimed that equidepth partitioning (i.e. discretization based on base intervals of equal support) minimizes this measure. We prove that in many cases equidepth partitioning is not an optimal solution of the corresponding optimization problem. In simple cases an exact solution can be calculated, but in general optimization techniques have to be applied to find good solutions.
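Equidepth (equal-support) partitioning, the strategy whose optimality the paper re-examines, simply cuts a quantitative attribute at its empirical quantiles; a minimal sketch:

```python
import numpy as np

def equidepth_bins(values, k):
    """Cut points giving k base intervals of (approximately) equal support."""
    qs = np.linspace(0, 1, k + 1)[1:-1]          # interior quantile levels
    return np.quantile(values, qs)

def discretize(values, cuts):
    return np.digitize(values, cuts)             # interval index for each value

x = np.random.exponential(size=1000)
cuts = equidepth_bins(x, 4)
print(cuts, np.bincount(discretize(x, cuts)))    # roughly 250 values per interval
```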

Stefan Born, Lars Schmidt-Thieme

Symbolic Data Analysis

Frontmatter
Clustering Methods in Symbolic Data Analysis

We present an overview of the clustering methods developed in Symbolic Data Analysis to partition a set of conceptual data into a fixed number of classes. The proposed algorithms are based on a generalization of the classical Dynamical Clustering Algorithm (DCA, the méthode des Nuées Dynamiques). The criterion optimized in DCA is a measure of the fit between the partition and the representation of the classes. The prototype is a model of the representation of a class, and it can be an element of the same representation space as the symbolic data to be clustered or, according to the nature of the data, a higher-order object which generalizes the characteristics of the elements belonging to the class. The allocation function for the assignment of the objects to the classes depends on the nature of the variables which describe the symbolic objects. The choice of such a function must be related to the particular type of prototype preferred as the representation model of the classes.

Rosanna Verde
Dependencies in Bivariate Interval-Valued Symbolic Data

This paper looks at measures of dependence for symbolic interval-valued data. A method is given to calculate an empirical copula for a bivariate interval-valued variable. This copula is then used to determine an empirical formula for calculating Spearman’s rho for such data. The methodology is illustrated with a set of hematocrit-hemoglobin data, and the results are compared with Pearson’s product-moment correlation coefficient.

L. Billard
Clustering of Symbolic Objects Described by Multi-Valued and Modal Variables

In this paper we investigate the problem of the determination of the number of clusters for symbolic objects described by multi-valued and modal variables. Three dissimilarity measures are selected in order to define distances on the set of symbolic objects. Methods for the determination of the number of clusters are applied to hierarchies of partitions produced by four hierarchical clustering methods, and to sets of partitions given by the symbolic clustering procedure SCLUST. Two real data sets are analysed.

André Hardy, Pascale Lallemand
A Hausdorff Distance Between Hyper-Rectangles for Clustering Interval Data

The Hausdorff distance between two sets is used in this paper to compare hyper-rectangles. An explicit formula for the optimum class prototype is found in the particular case of the Hausdorff distance for the L∞ norm. When used for dynamical clustering of interval data, this prototype ensures that the clustering criterion decreases at each iteration.
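For hyper-rectangles given by vectors of lower and upper bounds, the Hausdorff distance with respect to the L∞ norm reduces to a coordinate-wise maximum of bound discrepancies; the sketch below illustrates that distance (it is not code from the paper):

```python
import numpy as np

def hausdorff_linf(lo1, up1, lo2, up2):
    """Hausdorff distance (L-infinity ground metric) between two hyper-rectangles
    given by their vectors of lower and upper bounds."""
    lo1, up1, lo2, up2 = map(np.asarray, (lo1, up1, lo2, up2))
    return np.max(np.maximum(np.abs(lo1 - lo2), np.abs(up1 - up2)))

# two 2-dimensional rectangles
print(hausdorff_linf([0, 0], [1, 2], [0.5, 1], [1.5, 4]))   # -> 2.0
```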

Marie Chavent
Kolmogorov-Smirnov for Decision Trees on Interval and Histogram Variables

With advances in technology, data sets often contain a very large number of observations. Symbolic data analysis treats new units that are underlying concepts of the given data base or that are found by clustering. In this way, it is possible to reduce the size of the data to be treated by transforming the initial classical variables into variables called symbolic variables. In symbolic data analysis, we consider, among other types, interval and histogram variables. The algebraic structure of these variables leads us to adapt dissimilarity measures to be able to study them. The Kolmogorov-Smirnov criterion is used as a test selection metric for decision tree induction. Our contribution in this paper is to adapt this criterion of Kolmogorov-Smirnov to these types of variables. We present an example to illustrate this approach.
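For ordinary numeric values, the two-sample Kolmogorov-Smirnov statistic used as the split-selection measure is the maximum gap between the empirical distribution functions of the two candidate child nodes; adapting it to interval and histogram variables is the contribution of the paper. A minimal sketch of the classical statistic:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

print(ks_statistic(np.random.normal(0, 1, 200), np.random.normal(1, 1, 200)))
```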

Chérif Mballo, Edwin Diday
Dynamic Cluster Methods for Interval Data Based on Mahalanobis Distances

Dynamic cluster methods for interval data are presented. Two methods are considered: the first method furnishes a partition of the input data and a corresponding prototype (a vector of intervals) for each class by optimizing an adequacy criterion which is based on Mahalanobis distances between vectors of intervals. The second is an adaptive version of the first method. Experimental results with artificial interval-valued data sets show the usefulness of these methods. In general, the adaptive method outperforms the non-adaptive one in terms of the quality of the clusters which are produced by the algorithms.

Renata M. C. R. de Souza, Francisco de A. T. de Carvalho, Camilo P. Tenório, Yves Lechevallier
A Symbolic Model-Based Approach for Making Collaborative Group Recommendations

In recent years, recommender systems have achieved great success. Popular sites give thousands of recommendations every day. However, despite the fact that many activities are carried out in groups, like going to the theater with friends, these systems focus on recommending items to individual users. This brings out the need for systems capable of making recommendations for groups of people, a domain that has received little attention in the literature. In this article we introduce a novel method of making collaborative recommendations for groups, based on models built using techniques from symbolic data analysis. Finally, we empirically evaluate the proposed method to assess its behaviour for groups of different sizes and degrees of homogeneity, and compare the results with a baseline methodology.

Sergio R. M. Queiroz, Francisco de A. T. de Carvalho
Probabilistic Allocation of Aggregated Statistical Units in Classification Trees for Symbolic Class Description

Consider a class of statistical units, in which each unit may be an aggregate of individual statistical units. Each unit is described by an interval of values for each variable. Our aim is to develop a partition of this class of aggregated statistical units in which each part of the partition is described by a conjunction of characteristic properties. We use a stepwise top-down binary tree method and we introduce a probabilistic approach to assign units to the nodes of the tree. At each step we select the best variable and its best split to optimize simultaneously a discrimination criterion given by a prior partition and a homogeneity criterion. Finally, we present an example using real data.

Mohamed Mehdi Limam, Edwin Diday, Suzanne Winsberg
Building Small Scale Models of Multi-Entity Databases By Clustering

A framework is proposed to build small scale models of very large databases describing several entities and their relationships. In the first part, it is shown that the use of sampling is not a good solution when several entities are stored in a database. In the second part, a model is proposed which is based on clustering all entities of the database and storing aggregates on the clusters and on the relationships between the clusters. The last part of the paper discusses the different problems which are raised by this approach. Some solutions are proposed: in particular, the link with symbolic data analysis is established.

Georges Hébrail, Yves Lechevallier

Taxonomy and Medicine

Frontmatter
Phylogenetic Closure Operations and Homoplasy-Free Evolution

Phylogenetic closure operations on partial splits and quartet trees turn out to be both mathematically interesting and computationally useful. Although these operations were defined two decades ago, until recently little had been established concerning their properties. Here we present some further new results and links between these closure operations, and show how they can be applied in phylogeny reconstruction and enumeration. Using the operations we study how effectively one may be able to reconstruct phylogenies from evolved multi-state characters that take values in a large state space (such as may arise with certain genomic data).

Tobias Dezulian, Mike Steel
Consensus of Classification Systems, with Adams’ Results Revisited

The problem of aggregating a profile of closure systems into a consensus closure system has interesting applications in classification. We first present an overview of the results obtained by a lattice approach. Then, we develop a more refined approach based on overhangings and implications that appears to be a generalization of Adams’ consensus tree algorithm. Adams’ uniqueness result is explained and generalized.

Florent Domenach, Bruno Leclerc
Symbolic Linear Regression with Taxonomies

This work deals with the extension of classical linear regression to symbolic data and constitutes a continuation of previous papers by Billard and Diday on linear regression with interval and histogram-valued data. In this paper, we present a method for regression with taxonomic variables. Taxonomic variables are variables organized in a tree with several levels of generality. For example, towns are aggregated into their regions, and regions are aggregated into their countries.

F. Afonso, L. Billard, E. Diday
Determining Horizontal Gene Transfers in Species Classification: Unique Scenario

The problem of species classification, taking into account the mechanisms of reticulate evolution such as horizontal gene transfer (HGT), species hybridization, or gene duplication, is very delicate. In this paper, we describe a new algorithm for determining a unique scenario of HGT events in a given additive tree (i.e., a phylogenetic tree) representing the evolution of a group of species. The algorithm first establishes differences between the topologies of the species and gene additive trees. Then it uses a least-squares optimization procedure to test for the possibility of horizontal gene transfers between any pair of edges of the species tree, considering all previously added HGTs in order to determine the next one. We show how the proposed algorithm can be used to represent possible ways in which the rubisco rbcL gene has spread in a species classification that includes plastids, cyanobacteria, and proteobacteria.

Vladimir Makarenkov, Alix Boc, Abdoulaye Baniré Diallo
Active and Passive Learning to Explore a Complex Metabolism Data Set

Metabolomics is the new ‘omics’ science that concerns biochemistry. A metabolomic dataset includes quantitative measurements of all small molecules, known as metabolites, in a biological sample. These datasets are rich in information regarding dynamic metabolic (or biochemical) networks that is unattainable using classical methods, and have great potential, conjointly with other omic data, in understanding the functionality of genes. Herein, we explore a complex metabolomic dataset with the goal of using the metabolic information to correctly classify individuals into different classes. Unfortunately, these datasets pose many statistical challenges: the number of samples is less than the number of metabolites; there are missing and non-normal data; and there are high correlations among the metabolites. Thus, we investigate the use of robust singular value decomposition (rSVD) and recursive partitioning to understand this metabolomic data set. The dataset consists of 63 samples, for which we know the status of the individuals (diseased or normal and, for the diseased individuals, whether they are on drug or not). Clustering using only the metabolite data is only modestly successful. Using distances generated from multiple-tree recursive partitioning is more successful.

Susan J. Simmons, Xiaodong Lin, Chris Beecher, Young Truong, S. Stanley Young
Mathematical and Statistical Modeling of Acute Inflammation

A mathematical model involving a system of ordinary differential equations has been developed with the goal of assisting the design of therapies directed against the inflammatory consequences of infection and trauma. Though the aim is to build a model of greater complexity, which would eventually overcome some existing limitations (such as a reduced subset of inflammatory interactions, the use of mass action kinetics, and calibration to circulating but not local levels of cytokines), the model can at this time simulate certain disease scenarios qualitatively as well as predicting the time course of cytokine levels in distinct paradigms of inflammation in mice. A parameter search algorithm is developed that aids in the identification of different regimes of behaviour of the model and helps with its calibration to data. Extending this mathematical model, with validation in humans, may lead to the in silico development of novel therapeutic approaches and real-time diagnostics.

Gilles Clermont, Carson C. Chow, Gregory M. Constantine, Yoram Vodovotz, John Bartels
Combining Functional MRI Data on Multiple Subjects

Worsley et al. (2002) propose a practical approach to multiple-subject functional MRI data analyses which uses the EM algorithm to estimate the between-subject variance component at each voxel. The main result of this article is a demonstration that the much more efficient Newton-Raphson algorithm can be reliably used for these calculations. This result follows from an extension of a simple algorithm proposed by Mandel and Paule (1970) for the one-way unbalanced ANOVA model, two variants of which have been shown to be equivalent to modified ML and REML, where the “modification” is that the within-subject variances are treated as known.

Mark G. Vangel
Classifying the State of Parkinsonism by Using Electronic Force Platform Measures of Balance

Several measures of balance obtained from quiet stance on an electronic force platform are described. These measures were found to discriminate patients with Parkinson’s disease (PD) from normal control subjects. First-degree relatives of patients with PD show greater variability on these measures. A primary goal is to develop sensitive measures that would be capable of identifying impaired balance in early stages of non-clinical PD. We developed a trinomial logistic model that classifies a subject as either normal, pre-parkinsonian, or parkinsonian taking as input the measures developed from the platform data.

Nicholas I. Bohnen, Marius G. Buliga, Gregory M. Constantine
Subject Filtering for Passive Biometric Monitoring

Biometric data can provide useful information about a person’s overall wellness. However, the invasiveness of the data collection process often prevents their wider exploitation. To alleviate this difficulty we are developing a biometric monitoring system that relies on nonintrusive biological traits such as speech and gait. We report on the development of the pattern recognition module of the system that is used to filter out non-subject data. Our system builds upon a number of signal processing and statistical machine learning techniques to process and filter the data, including Principal Component Analysis for feature reduction, the Naive Bayes classifier for the gait analysis, and a Mixture of Gaussians classifier for the voice analysis. The system achieves high accuracy in filtering non-subject data: more specifically, 84% accuracy on the gait channel and 98% accuracy on the voice signal. These results allow us to generate sufficiently accurate data streams for health monitoring purposes.

Vahan Grigoryan, Donald Chiarulli, Milos Hauskrecht

Text Mining

Frontmatter
Mining Massive Text Data and Developing Tracking Statistics

This paper outlines a systematic data mining procedure for exploring large free-style text datasets to discover useful features and develop tracking statistics, generally referred to as performance measures or risk indicators. The procedure includes text mining, risk analysis, classification for error measurements, and nonparametric multivariate analysis. Two aviation safety report repositories, PTRS from the FAA and AAS from the NTSB, will be used to illustrate applications of our research to aviation risk management and general decision-support systems. Some specific text analysis methodologies and tracking statistics will be discussed. Approaches to incorporating misclassified data or error measurements into tracking statistics will be discussed as well.

Daniel R. Jeske, Regina Y. Liu
Contributions of Textual Data Analysis to Text Retrieval

The aim of this paper is to show how Textual Data Analysis techniques, developed in Europe under the influence of the Analyse Multidimensionnelle des Données school, can improve the performance of the LSI retrieval method. A first improvement can be obtained by properly considering the data contained in a lexical table. LSI is based on Euclidean distance, which is not adequate for frequency data. By using the chi-squared metric, on which Correspondence Analysis is based, significant improvements can be achieved. Further improvements can be obtained by considering external information such as keywords, authors, etc. Here an approach to text retrieval with external information based on PLS regression is shown. The suggested strategies are applied in text retrieval experiments on medical journal abstracts.
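For reference, the chi-squared distance between two row (document) profiles i and i′ of a lexical table, in standard Correspondence Analysis notation (relative frequencies f_ij with row margins f_i. and column margins f_.j), which replaces the Euclidean distance used by LSI:

\[
d_{\chi^2}^{2}(i,i') \;=\; \sum_{j} \frac{1}{f_{\cdot j}}
\left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}.
\]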

Simona Balbi, Emilio Di Meglio
Automated Resolution of Noisy Bibliographic References

We describe a system used by the NASA Astrophysics Data System to identify bibliographic references obtained from scanned article pages by OCR methods with records in a bibliographic database. We analyze the process generating the noisy references and conclude that the three-step procedure of correcting the OCR results, parsing the corrected string and matching it against the database provides unsatisfactory results. Instead, we propose a method that allows a controlled merging of correction, parsing and matching, inspired by dependency grammars. We also report on the effectiveness of various heuristics that we have employed to improve recall.

Markus Demleitner, Michael Kurtz, Alberto Accomazzi, Günther Eichhorn, Carolyn S. Grant, Steven S. Murray
Choosing the Right Bigrams for Information Retrieval

After more than 30 years of research in information retrieval, the dominant paradigm remains the “bag-of-words,” in which query terms are considered independent of their co-occurrences with each other. Although there has been some work on incorporating phrases or other syntactic information into IR, such attempts have given modest and inconsistent improvements, at best. This paper is a first step at investigating more deeply the question of using bigrams for information retrieval. Our results indicate that only certain kinds of bigrams are likely to aid retrieval. We used linear regression methods on data from TREC 6, 7, and 8 to identify which bigrams are able to help retrieval at all. Our characterization was then tested through retrieval experiments using our information retrieval engine, AIRE, which implements many standard ranking functions and retrieval utilities.

Maojin Jiang, Eric Jensen, Steve Beitzel, Shlomo Argamon
A Mixture Clustering Model for Pseudo Feedback in Information Retrieval

In this paper, we present a new mixture model for performing pseudo feedback for information retrieval. The basic idea is to treat the words in each feedback document as observations from a two-component multinomial mixture model, where one component is a topic model anchored to the original query model through a prior and the other is fixed to some background word distribution. We estimate the topic model based on the feedback documents and use it as a new query model for ranking documents.
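A hedged sketch of EM updates for the two-component mixture described above: word counts pooled from the feedback documents, a fixed background distribution, and a topic distribution re-estimated at each iteration. The prior anchoring the topic model to the original query is omitted for brevity, and the mixing weight is an illustrative constant.

```python
import numpy as np

def estimate_feedback_topic(counts, background, lam=0.5, iters=50):
    """counts: word-count vector over the vocabulary, pooled from feedback docs.
    background: fixed background word distribution. Returns the topic distribution."""
    V = len(counts)
    topic = np.full(V, 1.0 / V)                     # uniform initialisation
    for _ in range(iters):
        # E-step: probability each word occurrence was generated by the topic component
        p_topic = lam * topic / (lam * topic + (1 - lam) * background + 1e-12)
        # M-step: re-estimate the topic distribution from the expected topic counts
        expected = counts * p_topic
        topic = expected / expected.sum()
    return topic
```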

Tao Tao, ChengXiang Zhai
Analysis of Cross-Language Open-Ended Questions Through MFACT

International surveys lead to multilingual open-ended responses. By grouping the answers for the same categories from different countries, sets of comparable category-documents are obtained. The methodology presented here, called multiple factor analysis of contingency tables (MFACT), provides a representation of all of the documents in a common reference space, on the one hand, and of all of the words in a common reference space, on the other hand; these two spaces are related by transition formulae. This methodology allows a direct generalized canonical analysis approach to analyzing multiple contingency tables as well as keeping correspondence analysis-like features. It works from the raw data, without any previous translation. An example, extracted from a large international survey in four countries, illustrates the potential of this approach.

Monica Bécue, Jérôme Pagés, Campo-Elías Pardo
Inferring User’s Information Context from User Profiles and Concept Hierarchies

The critical elements that make up a user’s information context include the user profiles that reveal long-term interests and trends, the short-term information need as might be expressed in a query, and the semantic knowledge about the domain being investigated. The next generation of intelligent information agents, which can seamlessly integrate these elements into a single framework, will be able to effectively locate and provide the most appropriate results for users’ information needs. In this paper we present one such framework for contextualized information access. We model the problem in the context of our client-side Web agent ARCH (Adaptive Retrieval based on Concept Hierarchies). In ARCH, the user profiles are generated using an unsupervised document clustering technique. These profiles, in turn, are used to automatically learn the semantic context of the user’s information need from a domain-specific concept hierarchy. Our experimental results show that implicit measures of user interests, combined with the semantic knowledge embedded in a concept hierarchy, can be used effectively to infer the user context and improve the results of information retrieval.

Ahu Sieg, Bamshad Mobasher, Robin Burke
Database Selection for Longer Queries

Given the enormous amount of information now available on the Web, search engines have become indispensable tools for finding desired information. These search engines can be classified into two broad categories by the extent of their coverage. In the first category, we have the general-purpose search engines, such as Google and Yahoo, which attempt to index the whole Web and provide a search capability for all Web documents. Unfortunately, these centralized search engines suffer from several serious limitations, such as poor scalability and the difficulty of maintaining the freshness of their contents (Hawking and Thistlewaite, 1999).

Wensheng Wu, Clement Yu, Weiyi Meng

Contingency Tables and Missing Data

Frontmatter
An Overview of Collapsibility

Collapsing over variables is a necessary procedure in much empirical research, yet its consequences are not always properly evaluated. In this paper, different definitions of collapsibility (simple, strict, strong, etc.) and the corresponding necessary and sufficient conditions are reviewed and evaluated. We point out the relevance and limitations of the main contributions within a unifying interpretative framework. We deem such work to be useful since the debate on the topic has often developed in terms that are neither focused nor clear.

Stefano De Cantis, Antonino M. Oliveri
Generalized Factor Analyses for Contingency Tables

Quotient dissimilarities constitute a broad aggregation-invariant family; among them, f-dissimilarities are Euclidean embeddable (Bavaud, 2002). We present a non-linear principal components analysis (NPA) applicable to any quotient dissimilarity, based upon the spectral decomposition of the central inertia. For f-dissimilarities, the same decomposition yields a non-linear correspondence analysis (NCA), permitting us to modulate as finely as wished the contributions of positive or negative deviations from independence. The resulting coordinates exactly reproduce the original dissimilarities between rows or between columns; however, Huygens’s weak principle is generally violated, as measured by a quantity we call ‘eccentricity’.

François Bavaud
A PLS Approach to Multiple Table Analysis

A situation where J blocks of variables are observed on the same set of individuals is considered in this paper. A factor analysis logic is applied to tables instead of individuals. The latent variables of each block should explain their own block well and at the same time the latent variables of the same rank should be as positively correlated as possible. In the first part of the paper we describe the hierarchical PLS path model and review the fact that it allows one to recover the usual multiple table analysis methods. In the second part we suppose that the number of latent variables can be different from one block to another and that these latent variables are orthogonal. PLS regression and PLS path modeling are used for this situation. This approach is illustrated by an example from sensory analysis.

Michel Tenenhaus
Simultaneous Row and Column Partitioning in Several Contingency Tables

This paper focuses on the simultaneous aggregation of modalities for more than two categorical variables. I propose to maximize an objective function closely similar to the criteria used in multivariate analysis. The algorithm I suggest is a greedy process which, at each step, merges the two most criterion-improving items in the nomenclature. As the solution is only quasi-optimal, I present a consolidation algorithm to improve on this solution, for a given number of clusters.

Vincent Loonis
Missing Data and Imputation Methods in Partition of Variables

We deal with the effect of missing data, under a “Missing at Random” model, on the classification of variables with non-hierarchical methods. The partitions are compared by the Rand index.
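For reference, the Rand index used for the comparison counts the item pairs on which two partitions agree (grouped together in both, or separated in both) out of all pairs; a minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Proportion of item pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))   # -> 0.8
```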

Ana Lorga da Silva, Gilbert Saporta, Helena Bacelar-Nicolau
The Treatment of Missing Values and its Effect on Classifier Accuracy

The presence of missing values in a dataset can affect the performance of a classifier constructed using that dataset as a training sample. Several methods have been proposed to treat missing data and the one used most frequently deletes instances containing at least one missing value of a feature. In this paper we carry out experiments with twelve datasets to evaluate the effect on the misclassification error rate of four methods for dealing with missing values: the case deletion method, mean imputation, median imputation, and the KNN imputation procedure. The classifiers considered were the Linear Discriminant Analysis (LDA) and the KNN classifier. The first one is a parametric classifier whereas the second one is a nonparametric classifier.
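A minimal sketch of the simpler imputation strategies compared in the paper (mean, median, and k-nearest-neighbour imputation for a numeric feature matrix); the kNN variant here uses only complete rows as donors, which is one common simplification and may differ from the paper's exact procedure:

```python
import numpy as np

def impute(X, method="mean", k=5):
    """Fill NaN entries column by column using the chosen strategy."""
    X = np.array(X, dtype=float)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        if method == "mean":
            X[miss, j] = np.nanmean(X[:, j])
        elif method == "median":
            X[miss, j] = np.nanmedian(X[:, j])
        elif method == "knn":
            donors = X[~np.isnan(X).any(axis=1)]          # complete rows only
            for i in np.where(miss)[0]:
                obs = ~np.isnan(X[i])                     # observed features of row i
                d = np.abs(donors[:, obs] - X[i, obs]).sum(axis=1)
                X[i, j] = donors[np.argsort(d)[:k], j].mean()
    return X
```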

Edgar Acuña, Caroline Rodriguez
Clustering with Missing Values: No Imputation Required

Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as being just as reliable as the observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.

Kiri Wagstaff
Metadata
Title
Classification, Clustering, and Data Mining Applications
Editors
Dr. David Banks
Dr. Frederick R. McMorris
Dr. Phipps Arabie
Prof. Dr. Wolfgang Gaul
Copyright Year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17103-1
Print ISBN
978-3-540-22014-5
DOI
https://doi.org/10.1007/978-3-642-17103-1