
2009 | Book

Similarity-Based Clustering

Recent Developments and Biomedical Applications

Editors: Michael Biehl, Barbara Hammer, Michel Verleysen, Thomas Villmann

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

Similarity-based learning methods have a great potential as an intuitive and flexible toolbox for mining, visualization, and inspection of large data sets. They combine simple and human-understandable principles, such as distance-based classification, prototypes, or Hebbian learning, with a large variety of different, problem-adapted design choices, such as a data-optimum topology, similarity measure, or learning mode. In medicine, biology, and medical bioinformatics, more and more data arise from clinical measurements such as EEG or fMRI studies for monitoring brain activity, mass spectrometry data for the detection of proteins, peptides and composites, or microarray profiles for the analysis of gene expressions. Typically, data are high-dimensional, noisy, and very hard to inspect using classic (e.g., symbolic or linear) methods. At the same time, new technologies ranging from the possibility of a very high resolution of spectra to high-throughput screening for microarray data are rapidly developing and carry the promise of an efficient, cheap, and automatic gathering of tons of high-quality data with large information potential. Thus, there is a need for appropriate machine learning methods which help to automatically extract and interpret the relevant parts of this information and which, eventually, help to enable understanding of biological systems, reliable diagnosis of faults, and therapy of diseases such as cancer based on this information. Moreover, these application scenarios pose fundamental and qualitatively new challenges to the learning systems because of the specifics of the data and learning tasks. Since these characteristics are particularly pronounced within the medical domain, but not limited to it and of principled interest, this research topic opens the way toward important new directions of algorithmic design and accompanying theory.

Table of Contents

Frontmatter

Chapter I: Dynamics of Similarity-Based Clustering

Statistical Mechanics of On-line Learning
Abstract
We introduce and discuss the application of statistical physics concepts in the context of on-line machine learning processes. The consideration of typical properties of very large systems makes it possible to perform averages over the randomness contained in the sequence of training data. This yields an exact mathematical description of the training dynamics in model scenarios. We present the basic concepts and results of the approach in terms of several examples, including the learning of linearly separable rules, the training of multilayer neural networks, and Learning Vector Quantization.
Michael Biehl, Nestor Caticha, Peter Riegler
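As a concrete illustration of the on-line prototype dynamics studied in this chapter, here is a minimal sketch of the LVQ1 update rule, one pattern per step. The learning rate, data, and initialization below are illustrative assumptions, not the model scenario analyzed by the authors.

```python
import numpy as np

def lvq1_online(stream, W, proto_labels, eta=0.05):
    """On-line LVQ1: per example, move the closest prototype toward
    the input if the class labels match, away from it otherwise."""
    for x, y in stream:
        k = np.argmin(np.linalg.norm(W - x, axis=1))   # winning prototype
        sign = 1.0 if proto_labels[k] == y else -1.0   # attract or repel
        W[k] += eta * sign * (x - W[k])
    return W

# Illustrative two-cluster stream (not the chapter's model scenario)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)
W = lvq1_online(zip(X, y), rng.normal(size=(2, 2)), proto_labels=[0, 1])
```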
Some Theoretical Aspects of the Neural Gas Vector Quantizer
Abstract
We investigate the neural gas quantizer in the light of statistical physics concepts. We show that this algorithm can be extended to a vector quantizer with a general differentiable similarity measure, offering greater flexibility. Further, we show that the neighborhood cooperativeness control parameter is not equivalent to an inverse temperature, as it is in the deterministic annealing vector quantizer introduced by K. Rose et al. Instead, an annealed variant of neural gas can be obtained using the formalism proposed by T. Heskes for self-organizing maps.
Thomas Villmann, Barbara Hammer, Michael Biehl
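To fix ideas, a minimal sketch of the rank-based neural gas update with squared-Euclidean distance follows; the generalization discussed in the chapter replaces this distance by an arbitrary differentiable similarity measure (and the update by the corresponding gradient). Parameter values are illustrative.

```python
import numpy as np

def neural_gas_step(x, W, lam=1.0, eta=0.05):
    """One neural gas update: every prototype moves toward the input,
    weighted by exp(-rank/lambda) of its distance rank; lambda is the
    neighborhood cooperativeness control parameter."""
    d = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(d))     # rank 0 = closest prototype
    W += eta * np.exp(-ranks / lam)[:, None] * (x - W)
    return W
```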
Immediate Reward Reinforcement Learning for Clustering and Topology Preserving Mappings
Abstract
We extend a reinforcement learning algorithm which has previously been shown to cluster data. Our extension involves creating an underlying latent space with some pre-defined structure, which enables us to create a topology preserving mapping. We investigate different forms of the reward function, all of which are designed to merge local and global information, thus avoiding one of the major difficulties with, e.g., K-means: its convergence to local optima depending on the initial values of its parameters. We also show that the method is quite general and can be used with the recently developed method of stochastic weight reinforcement learning [14].
Colin Fyfe, Wesam Barbakh
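To convey the flavor of immediate-reward reinforcement learning for clustering, here is a generic REINFORCE-style sketch; it is an illustration under simplifying assumptions, not one of the chapter's specific reward functions. A prototype is sampled from a softmax over negative distances, and the sampled choice is reinforced in proportion to an immediate reward.

```python
import numpy as np

def rl_cluster_step(x, W, rng, eta=0.1, beta=1.0):
    """One immediate-reward step: sample a prototype from a softmax
    over negative distances, score it by its closeness to x, and move
    it toward x in proportion to that reward (a simplified stand-in
    for the exact REINFORCE gradient)."""
    d = np.linalg.norm(W - x, axis=1)
    p = np.exp(-beta * d)
    p /= p.sum()                          # stochastic winner selection
    k = rng.choice(len(W), p=p)
    reward = np.exp(-d[k])                # immediate reward: closeness
    W[k] += eta * reward * (x - W[k])
    return W
```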

Chapter II: Information Representation

Advances in Feature Selection with Mutual Information
Abstract
The selection of features that are relevant for a prediction or classification problem is important in many domains involving high-dimensional data. Selecting features helps fight the curse of dimensionality, improves the performance of prediction or classification methods, and aids the interpretation of the application. In a nonlinear context, mutual information is widely used as a relevance criterion for features and sets of features. Nevertheless, it suffers from at least three major limitations: mutual information estimators depend on smoothing parameters, there is no theoretically justified stopping criterion in the greedy feature selection procedure, and the estimation itself suffers from the curse of dimensionality. This chapter shows how to deal with these problems. The first two are addressed by using resampling techniques that provide a statistical basis for selecting the estimator parameters and stopping the search procedure. The third is addressed by modifying the mutual information criterion into a measure of how complementary (and not only informative) features are for the problem at hand.
Michel Verleysen, Fabrice Rossi, Damien François
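As a baseline illustration only, the snippet below ranks features by a univariate kNN-based MI estimate using scikit-learn's mutual_info_classif; its n_neighbors argument is exactly the kind of smoothing parameter the abstract refers to. The chapter's multivariate, resampling-based procedure goes well beyond this univariate filter, which ignores feature complementarity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
# Univariate MI ranking; a fixed top-5 cut stands in for the
# statistically justified stopping criterion discussed above.
mi = mutual_info_classif(X, y, n_neighbors=3, random_state=0)
print("top-5 features by estimated MI:", np.argsort(mi)[::-1][:5])
```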
Unleashing Pearson Correlation for Faithful Analysis of Biomedical Data
Abstract
Pearson correlation is one of the standards for comparisons in biomedical analyses, yet it possesses untapped potential. Substantial value is added by transferring Pearson correlation into the framework of adaptive similarity measures and by exploiting properties of its mathematical derivatives. This opens access to optimization-based data models applicable to tasks of attribute characterization, clustering, classification, and visualization. Modern high-throughput measuring equipment creates high demand for the analysis of extensive biomedical data, including spectra and high-resolution gel-electrophoretic images. In this study, cDNA arrays are considered as the data sources of interest. Recent computational methods are presented for the characterization and analysis of these huge-dimensional data sets.
Marc Strickert, Frank-Michael Schleif, Thomas Villmann, Udo Seiffert
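The derivative-based use of Pearson correlation can be sketched in a few lines. The closed-form gradient below is standard calculus rather than code from the chapter, but it is what makes the measure usable inside optimization-based data models.

```python
import numpy as np

def pearson_and_grad(x, y):
    """Pearson correlation r(x, y) and its gradient with respect to x.
    Both centered vectors have zero mean, so the chain-rule projection
    term vanishes and the gradient takes this simple closed form."""
    xc, yc = x - x.mean(), y - y.mean()
    nx, ny = np.linalg.norm(xc), np.linalg.norm(yc)
    r = xc @ yc / (nx * ny)
    grad = yc / (nx * ny) - r * xc / nx**2   # d r / d x
    return r, grad
```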
Median Topographic Maps for Biomedical Data Sets
Abstract
Median clustering extends popular neural data analysis methods such as the self-organizing map or neural gas to general data structures given by a dissimilarity matrix only. This offers flexible and robust global data inspection methods which are particularly suited to the variety of data that occurs in biomedical domains. In this chapter, we give an overview of median clustering and its properties and extensions, with a particular focus on efficient implementations adapted to large-scale data analysis.
Barbara Hammer, Alexander Hasenfuss, Fabrice Rossi
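A minimal sketch of the core median step follows: prototypes are restricted to data points, and each cluster's prototype is the generalized median of its members with respect to the given dissimilarity matrix. The batch scheme below is a plain median k-means; the median SOM and median neural gas discussed in the chapter add neighborhood cooperation on top.

```python
import numpy as np

def median_kmeans(D, k, n_iter=50, seed=0):
    """Median clustering on an n x n dissimilarity matrix D: prototypes
    are data indices, alternating nearest-prototype assignment with a
    generalized-median step (the member minimizing its summed
    dissimilarity to the cluster)."""
    rng = np.random.default_rng(seed)
    protos = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, protos], axis=1)
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=0)
                protos[j] = members[np.argmin(within)]
    return protos, assign
```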
Visualization of Structured Data via Generative Probabilistic Modeling
Abstract
We propose a generative probabilistic approach to constructing topographic maps of sequences and tree-structured data. The model formulation specifies a low-dimensional manifold of local noise models on the structured data. The manifold of noise models is induced by a smooth mapping from a low-dimensional Euclidean latent space to the parameter space of local noise models. In this paper, we consider noise models endowed with hidden Markovian state space structure, namely Hidden Markov Tree Models (HMTM) and Hidden Markov Models (HMM). Compared with recursive extensions of the traditional Self-Organizing Map that can be used to visualize sequential or tree-structured data, topographic maps formulated within this framework possess a number of advantages, such as a well-defined cost function that drives the model optimization, the ability to test for overfitting, and the accommodation of alternative local noise models implicitly expressing different notions of structured data similarity. Additionally, using information geometry, one can calculate magnification factors on the constructed topographic maps. Magnification factors are a useful tool for cluster detection in non-linear topographic map formulations. We demonstrate the framework on two artificial data sets and chorals by J.S. Bach represented as sequences, as well as on images represented as trees.
Nikolaos Gianniotis, Peter Tiňo
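The heart of such a model, the smooth mapping from latent space to noise-model parameters, can be suggested with a GTM-style sketch. This is an illustrative simplification: real HMTM/HMM noise models have structured parameter sets, and training requires an EM-style procedure not shown here.

```python
import numpy as np

def latent_to_params(Z, centers, V, sigma=1.0):
    """Map latent grid points Z through fixed RBF basis functions and
    shared weights V to rows of a stochastic matrix, so that nearby
    latent points receive similar local noise-model parameters."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma**2))    # RBF basis activations
    logits = Phi @ V                      # smooth shared mapping
    P = np.exp(logits - logits.max(1, keepdims=True))
    return P / P.sum(1, keepdims=True)    # one parameter vector per latent point
```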

Chapter III: Particular Challenges in Applications

Learning Highly Structured Manifolds: Harnessing the Power of SOMs
Abstract
In this paper we elaborate on the challenges of learning manifolds that have many relevant clusters, and where the clusters can have widely varying statistics. We call such data manifolds highly structured. We describe approaches to structure identification through self-organized learning, in the context of such data. We present some of our recently developed methods to show that self-organizing neural maps contain a great deal of information that can be unleashed and put to use to achieve detailed and accurate learning of highly structured manifolds, and we also offer some comparisons with existing clustering methods on real data.
Erzsébet Merényi, Kadim Tasdemir, Lili Zhang
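For reference, here is the basic on-line SOM step from which such maps are built, with a Gaussian neighborhood on a 2-D grid; the decay schedules are common illustrative choices, and the chapter's contribution lies in what is extracted from the trained map rather than in this update.

```python
import numpy as np

def som_step(x, W, grid, t, T, sigma0=3.0, eta0=0.5):
    """One on-line SOM update: find the best-matching unit (BMU) and
    pull every unit toward x, weighted by a Gaussian of its grid
    distance to the BMU; neighborhood and rate shrink over time."""
    frac = t / T
    sigma = sigma0 * (0.5 / sigma0) ** frac       # shrinking neighborhood
    eta = eta0 * (0.01 / eta0) ** frac            # decaying learning rate
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))
    g = np.exp(-((grid - grid[bmu]) ** 2).sum(1) / (2 * sigma**2))
    W += eta * g[:, None] * (x - W)
    return W
```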
Estimation of Boar Sperm Status Using Intracellular Density Distribution in Grey Level Images
Abstract
In this work we review three methods proposed to estimate the fraction of live sperm cells in boar semen samples. Images of semen samples are acquired, preprocessed, and segmented in order to obtain images of single sperm heads. A model of the intracellular density distribution characteristic of live cells is computed by averaging a set of images of cells assumed to be alive by veterinary experts. We quantify the deviation of a cell's distribution from this model and use it for classification in three different ways. The first is based on a decision criterion for single-cell classification and gives a misclassification error of 20.40%. The other two methods estimate the fraction of live sperm in a sample instead of classifying single cells. One of them applies the least squares method, achieving an absolute error below 25% for 89% of the considered sample images. The other uses an iterative procedure to find an optimal decision criterion that equalizes the number of misclassifications of live and dead cells. It provides an estimate of the fraction of live cells that is within 8% of the actual value for 95% of the samples.
Lidia Sánchez, Nicolai Petkov
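The least-squares fraction estimate mentioned above admits a compact sketch; the model histograms here are placeholders, not the study's actual intracellular density models. The sample's mean intensity distribution is written as a convex mixture of a live-cell model and a dead-cell model, and the mixing weight is the one-dimensional least-squares solution.

```python
import numpy as np

def fraction_alive(sample_hist, alive_model, dead_model):
    """Least-squares estimate of the live fraction p under the model
    sample ~ p * alive + (1 - p) * dead; reduces to a 1-D projection."""
    a = alive_model - dead_model
    b = sample_hist - dead_model
    p = float(a @ b / (a @ a))
    return min(max(p, 0.0), 1.0)    # clip to a valid fraction
```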
HIV-1 Drug Resistance Prediction and Therapy Optimization: A Case Study for the Application of Classification and Clustering Methods
Abstract
This chapter provides a review of the challenges machine-learning specialists face when trying to assist virologists by generating an automatic prediction of an outcome of HIV therapy.
Optimizing HIV therapies is crucial since the virus rapidly develops mutations to evade drug pressure. Modern anti-HIV regimens comprise multiple drugs in order to prevent, or at least delay, the development of resistance mutations. In recent years, large databases have been collected to allow the automatic analysis of relations between the virus genome, other clinical and demographic information, and the failure or success of a therapy. The EuResist integrated database (EID), collected from about 18,500 patients and 65,000 different therapies, is probably one of the largest clinical genomic databases. Only one third of the therapies in the EID contain therapy response information, and only 5% of the therapy records have both response information and genotypic data. This leads to two specific challenges: (a) semi-supervised learning, a setting where many samples are available but only a small proportion of them are labeled, and (b) missing data.
We review a solution for the first setting: a novel dimensionality reduction framework that combines information-theoretic considerations with geometric constraints on the simplex. The framework is formulated to find an optimal low-dimensional geometric embedding of the simplex that preserves pairwise distances. This similarity-based clustering solution was tested on toy data and textual data. We show that, although it outperforms other methods and provides good results on a small sample of the EuResist data, it is impractical for the large EuResist dataset. In addition, we review a generative-discriminative prediction system that successfully overcomes the missing-value challenge.
Apart from a review of the EuResist project and related challenges, this chapter provides an overview of recent developments in the field of machine learning-based prediction methods for HIV drug resistance.
Michal Rosen-Zvi, Ehud Aharoni, Joachim Selbig
Backmatter
Metadata
Title
Similarity-Based Clustering
Editors
Michael Biehl
Barbara Hammer
Michel Verleysen
Thomas Villmann
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-01805-3
Print ISBN
978-3-642-01804-6
DOI
https://doi.org/10.1007/978-3-642-01805-3