
2008 | Book

Principal Manifolds for Data Visualization and Dimension Reduction

Edited by: Alexander N. Gorban, Balázs Kégl, Donald C. Wunsch, Andrei Y. Zinovyev

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computational Science and Engineering


About this book

In 1901, Karl Pearson invented Principal Component Analysis (PCA). Since then, PCA has served as a prototype for many other tools of data analysis, visualization and dimension reduction: Independent Component Analysis (ICA), Multidimensional Scaling (MDS), Nonlinear PCA (NLPCA), Self-Organizing Maps (SOM), etc. The book opens with a quotation of Pearson's classical definition of PCA and includes reviews of various methods: NLPCA, ICA, MDS, embedding and clustering algorithms, principal manifolds and SOM. New approaches to NLPCA, principal manifolds, branching principal components and topology-preserving mappings are described as well. The presentation of algorithms is supplemented by case studies, from engineering to astronomy, but mostly of biological data: the analysis of microarray and metabolite data. The volume ends with a tutorial, "PCA and K-means decipher genome". The book is meant to be useful for practitioners in applied data analysis in the life sciences, engineering, physics and chemistry; it will also be valuable to PhD students and researchers in computer science, applied mathematics and statistics.

Table of Contents

Frontmatter
1. Developments and Applications of Nonlinear Principal Component Analysis – a Review
Although linear principal component analysis (PCA) originates from the work of Sylvester [67] and Pearson [51], the development of nonlinear counterparts received attention only from the 1980s onwards. Work on nonlinear PCA, or NLPCA, can be divided into the utilization of autoassociative neural networks, principal curves and manifolds, kernel approaches, or combinations of these. This article reviews existing algorithmic work, shows how a given data set can be examined to determine whether a conceptually more demanding NLPCA model is required, and lists developments of NLPCA algorithms. Finally, the paper outlines problem areas and challenges that require future work for the NLPCA research field to mature.
Uwe Kruger, Junping Zhang, Lei Xie
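As a quick, hedged illustration of the kernel branch of NLPCA that this review covers, the sketch below applies scikit-learn's KernelPCA to a synthetic curved data set; the data and all parameter choices are assumptions for illustration, not taken from the chapter.

```python
# Minimal sketch of kernel PCA, one of the NLPCA families reviewed here.
# Dataset and parameters are illustrative assumptions, not from the chapter.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 500)
X = np.column_stack([t, t**2]) + rng.normal(scale=0.05, size=(500, 2))  # noisy parabola

linear = PCA(n_components=1).fit_transform(X)
nonlinear = KernelPCA(n_components=1, kernel="rbf", gamma=5.0).fit_transform(X)
# For curved data such as this, the RBF-kernel component can follow the
# parabola, while the linear component only projects onto a straight line.
```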
2. Nonlinear Principal Component Analysis: Neural Network Models and Applications
Nonlinear principal component analysis (NLPCA), as a nonlinear generalisation of standard principal component analysis (PCA), generalises the principal components from straight lines to curves. This chapter aims to provide an extensive description of the autoassociative neural network approach to NLPCA. Several network architectures will be discussed, including the hierarchical, the circular, and the inverse model, with special emphasis on missing data. Results are shown from applications in the field of molecular biology. This includes metabolite data analysis of a cold-stress experiment in the model plant Arabidopsis thaliana and gene expression analysis of the reproductive cycle of the malaria parasite Plasmodium falciparum within infected red blood cells.
Matthias Scholz, Martin Fraunholz, Joachim Selbig
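The following sketch illustrates the autoassociative ("bottleneck") network idea in its simplest form: a network trained to reproduce its own input, with the narrow middle layer carrying the nonlinear component. It uses scikit-learn's generic MLPRegressor rather than the authors' code, and the data and layer sizes are illustrative assumptions.

```python
# Sketch of the autoassociative (bottleneck) network approach to NLPCA.
# The network is trained to reproduce its input; the 1-unit middle layer
# then carries the nonlinear principal component. Sizes are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 300)
X = np.column_stack([np.cos(t), np.sin(t)]) + rng.normal(scale=0.05, size=(300, 2))

net = MLPRegressor(hidden_layer_sizes=(8, 1, 8), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(X, X)  # autoassociative training: target equals input

# Forward-propagate by hand up to the bottleneck layer.
h = np.tanh(X @ net.coefs_[0] + net.intercepts_[0])
code = np.tanh(h @ net.coefs_[1] + net.intercepts_[1])  # nonlinear PC score
```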
3. Learning Nonlinear Principal Manifolds by Self-Organising Maps
This chapter provides an overview of the self-organising map (SOM) in the context of manifold mapping. It first reviews the background of the SOM and issues concerning its cost function and topology measures. Then its variant, the visualisation induced SOM (ViSOM), proposed for preserving local metrics on the map, is introduced and reviewed for data visualisation. The relationships among the SOM, ViSOM, multidimensional scaling, and principal curves are analysed and discussed. Both the SOM and ViSOM produce a scaling and dimension-reduction mapping or manifold of the input space. The SOM is shown to be a qualitative scaling method, while the ViSOM is a metric scaling method that approximates a discrete principal curve/surface. Examples and applications of extracting data manifolds using SOM-based techniques are presented.
Hujun Yin
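A minimal SOM training loop, to make the mapping concrete; the grid size, learning-rate and neighbourhood schedules are illustrative assumptions, not the chapter's settings.

```python
# Minimal 2-D SOM training loop in the spirit of this chapter.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # toy data in R^3
gw, gh = 10, 10                          # map grid dimensions
W = rng.normal(size=(gw * gh, 3))        # codebook vectors
gy, gx = np.divmod(np.arange(gw * gh), gw)
grid = np.column_stack([gx, gy])         # node coordinates on the map

n_iter = 5000
for it in range(n_iter):
    frac = it / n_iter
    lr = 0.5 * (1 - frac)                # decaying learning rate
    sigma = 5.0 * (0.05 / 5.0) ** frac   # shrinking neighbourhood radius
    x = X[rng.integers(len(X))]
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))    # best-matching unit
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)     # map-space distances
    h = np.exp(-d2 / (2 * sigma ** 2))             # Gaussian neighbourhood
    W += lr * h[:, None] * (x - W)                 # pull nodes towards x
```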
4. Elastic Maps and Nets for Approximating Principal Manifolds and Their Application to Microarray Data Visualization
Principal manifolds are defined as lines or surfaces passing through “the middle” of the data distribution. Linear principal manifolds (Principal Components Analysis) are routinely used for dimension reduction, noise filtering and data visualization. Recently, methods for constructing non-linear principal manifolds were proposed, including our elastic maps approach, which is based on a physical analogy with elastic membranes. We have developed a general geometric framework for constructing “principal objects” of various dimensions and topologies with the simplest quadratic form of the smoothness penalty, which allows very effective parallel implementations. Our approach is implemented in three programming languages (C++, Java and Delphi) with two graphical user interfaces (the VidaExpert and ViMiDa applications). In this paper we review the method of elastic maps and present in detail one of its major applications: the visualization of microarray data in bioinformatics. We show that the method of elastic maps outperforms linear PCA in terms of data approximation, representation of between-point distance structure, preservation of local point neighbourhoods and representation of point classes in low-dimensional spaces.
Alexander N. Gorban, Andrei Y. Zinovyev
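A compact sketch of the elastic-map idea for a one-dimensional "principal object": with node assignments fixed, the quadratic energy (approximation term plus stretching and bending spring terms) is minimised by a single linear solve. The toy data and the weights lam and mu are assumptions; the authors' implementations (VidaExpert, ViMiDa) are far more general.

```python
# Sketch of a 1-D elastic map (elastic principal curve): a chain of nodes
# joined by springs is fitted by alternating nearest-node assignment with
# a linear solve of the quadratic energy.
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(-1, 1, 400))
X = np.column_stack([t, t ** 3]) + rng.normal(scale=0.05, size=(400, 2))

m, lam, mu = 20, 0.01, 0.05              # nodes, stretching, bending weights
Y = np.column_stack([np.linspace(-1, 1, m), np.zeros(m)])  # initial chain

E = np.diff(np.eye(m), axis=0)           # first-difference operator (edges)
S = np.diff(np.eye(m), n=2, axis=0)      # second-difference operator (ribs)
P = lam * E.T @ E + mu * S.T @ S         # elastic part of the quadratic form

for _ in range(30):
    k = np.argmin(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
    D = np.zeros((m, m)); R = np.zeros_like(Y)
    for j in range(m):
        mask = k == j
        D[j, j] = mask.sum() / len(X)
        R[j] = X[mask].sum(axis=0) / len(X)
    Y = np.linalg.solve(D + P, R)        # minimise the quadratic energy
```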
5. Topology-Preserving Mappings for Data Visualisation
We present a family of topology-preserving mappings similar to the Self-Organizing Map (SOM) and the Generative Topographic Map (GTM). These techniques can be considered as a non-linear projection from the input or data space to the output or latent space (usually 2D or 3D), plus a clustering technique that updates the centres. A common framework based on the GTM structure can be used with different clustering techniques, giving new properties to the algorithms.
Thus we have the topographic product of experts (ToPoE), with a Product of Experts substituting for the Mixture of Experts of the GTM; two versions of the Harmonic Topographic Mapping (HaToM) that utilise K-Harmonic Means (KHM) clustering; and the faster Topographic Neural Gas (ToNeGas), which includes Neural Gas in the inner loop. We also present the Inverse-weighted K-means Topology-Preserving Map (IKToM), based on the same structure for non-linear projection, which makes use of a new clustering technique called Inverse Weighted K-Means. We apply all the algorithms to a high-dimensional dataset and compare them, as well as the Self-Organizing Map, in terms of visualisation, clustering and topology preservation.
Marian Pena, Wesam Barbakh, Colin Fyfe
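As one concrete ingredient, here is a sketch of the standard Neural Gas update that ToNeGas places in its inner loop: every centre moves towards the sample by an amount that decays with its distance rank. The schedules are illustrative assumptions.

```python
# Sketch of the Neural Gas update: centres are ranked by distance to the
# sample and moved by a rank-decaying step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
W = rng.normal(size=(20, 2))             # 20 centres

n_iter = 10000
for it in range(n_iter):
    frac = it / n_iter
    eps = 0.5 * (0.005 / 0.5) ** frac    # decaying step size
    lam = 10.0 * (0.5 / 10.0) ** frac    # decaying rank width
    x = X[rng.integers(len(X))]
    order = np.argsort(((W - x) ** 2).sum(axis=1))
    rank = np.empty(len(W)); rank[order] = np.arange(len(W))
    W += eps * np.exp(-rank / lam)[:, None] * (x - W)
```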
6. The Iterative Extraction Approach to Clustering
The Iterative Extraction approach (ITEX) extends the one-by-one extraction techniques of Principal Component Analysis to other additive data models. We describe additive models for clustering entity-to-feature and similarity data and apply ITEX to derive computationally feasible clustering solutions. Specifically, two ITEX-derived clustering methods, iK-Means and ADDI-S, are presented, together with updated theoretical, experimental and application results for these methods.
Boris Mirkin
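A hedged sketch of the one-by-one extraction idea behind iK-Means, as I read it: after centring the data, the point farthest from the grand mean seeds an "anomalous" cluster, which is refined against the origin and then removed; the surviving centres can seed a final K-Means run. Thresholds and details here are assumptions, not the chapter's exact procedure.

```python
# One-by-one cluster extraction in the iK-Means spirit (details assumed).
import numpy as np

def extract_clusters(X, min_size=5, max_clusters=20):
    X = X - X.mean(axis=0)               # reference point = grand mean = origin
    remaining = X.copy()
    centres = []
    while len(remaining) > min_size and len(centres) < max_clusters:
        c = remaining[np.argmax((remaining ** 2).sum(axis=1))]  # farthest point
        for _ in range(100):
            # points closer to the tentative centre than to the origin
            mask = ((remaining - c) ** 2).sum(1) < (remaining ** 2).sum(1)
            new_c = remaining[mask].mean(axis=0)
            if np.allclose(new_c, c):
                break
            c = new_c
        if mask.sum() >= min_size:       # keep only large-enough clusters
            centres.append(c)
        remaining = remaining[~mask]     # extract and continue
    return np.array(centres)             # seeds for a final K-Means run
```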
7. Representing Complex Data Using Localized Principal Components with Application to Astronomical Data
Often the relation between the variables constituting a multivariate data space might be characterized by one or more of the terms “nonlinear”, “branched”, “disconnected”, “bent”, “curved”, “heterogeneous” or, more generally, “complex”. In these cases, simple principal component analysis (PCA) as a tool for dimension reduction can fail badly. Of the many alternative approaches proposed so far, local approximations of PCA are among the most promising. This paper gives a short review of localized versions of PCA, focusing on local principal curves and local partitioning algorithms. Furthermore, we discuss projections other than the local principal components. When performing local dimension reduction for regression or classification problems, it is important to focus not only on the manifold structure of the covariates, but also on the response variable(s). Local principal components only achieve the former, whereas localized regression approaches concentrate on the latter. Local projection directions derived from the partial least squares (PLS) algorithm offer an interesting trade-off between these two objectives.
We apply these methods to several real data sets. In particular, we consider simulated astrophysical data from the future Galactic survey mission Gaia.
Jochen Einbeck, Ludger Evers, Coryn Bailer-Jones
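A minimal sketch of localized PCA: partition the data (here with K-Means, one of several possible partitioning algorithms) and fit a principal component inside each region. The data and sizes are illustrative assumptions.

```python
# Localized PCA: one principal direction per data partition.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.uniform(0, 4 * np.pi, 800)
X = np.column_stack([t, np.sin(t)]) + rng.normal(scale=0.1, size=(800, 2))

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
local_pcs = []
for k in range(8):
    part = X[labels == k]
    pca = PCA(n_components=1).fit(part)
    local_pcs.append((part.mean(axis=0), pca.components_[0]))
# Each (centre, direction) pair is a local linear patch; stitched together,
# the patches follow curved structure a single global PC would miss.
```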
8. Auto-Associative Models, Nonlinear Principal Component Analysis, Manifolds and Projection Pursuit
Auto-associative models have been introduced as a new tool for building nonlinear Principal component analysis (PCA) methods. Such models rely on successive approximations of a dataset by manifolds of increasing dimensions. In this chapter, we propose a precise theoretical comparison between PCA and autoassociative models. We also highlight the links between auto-associative models, projection pursuit algorithms, and some neural network approaches. Numerical results are presented on simulated and real datasets.
Stéphane Girard, Serge Iovleff
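The sketch below illustrates the successive-approximation idea behind auto-associative models: at each step a pursuit direction is chosen (here simply the leading principal component of the residual), the residual is regressed on that one-dimensional index, and the fitted curve is subtracted. The polynomial smoother and the choice of direction are assumptions, not the authors' estimators.

```python
# Successive approximation of data by one-dimensional manifolds.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 500)
X = np.column_stack([t, t ** 2, t ** 3]) + rng.normal(scale=0.05, size=(500, 3))

R = X - X.mean(axis=0)                   # residual to be explained
model = []
for _ in range(2):                       # two one-dimensional manifolds
    a = np.linalg.svd(R, full_matrices=False)[2][0]   # pursuit direction
    idx = R @ a                                        # 1-D index
    coeffs = [np.polyfit(idx, R[:, j], deg=3) for j in range(R.shape[1])]
    fit = np.column_stack([np.polyval(c, idx) for c in coeffs])
    model.append((a, coeffs))
    R = R - fit                          # pass the residual to the next step
```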
9. Beyond The Concept of Manifolds: Principal Trees, Metro Maps, and Elastic Cubic Complexes
Multidimensional data distributions can have complex topologies and variable local dimensions. To approximate complex data, we propose a new type of low-dimensional “principal object”: a principal cubic complex. This complex is a generalization of linear and non-linear principal manifolds and includes them as a particular case. To construct such an object, we combine a method of topological grammars with the minimization of an elastic energy defined for its embedding into multidimensional data space. The whole complex is presented as a system of nodes and springs and as a product of one-dimensional continua (represented by graphs), and the grammars describe how these continua transform during the process of optimal complex construction.
Alexander N. Gorban, Neil R. Sumner, Andrei Y. Zinovyev
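To make the energy concrete, here is a hedged sketch of the elastic energy of a graph embedded in data space: an approximation term over the data plus spring terms over edges and over "ribs" (node triples). A grammar step would then apply the transformation (e.g. adding a node) that most reduces this energy; the weights and conventions here are assumptions.

```python
# Elastic energy of a graph (nodes + springs) embedded in data space.
import numpy as np

def elastic_energy(X, Y, edges, ribs, lam=0.01, mu=0.05):
    # X: data points; Y: node positions; edges: (i, j) pairs;
    # ribs: (i, j, k) triples with j the middle node.
    nearest = np.argmin(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
    u_approx = ((X - Y[nearest]) ** 2).sum() / len(X)        # data fit
    u_edge = sum(((Y[i] - Y[j]) ** 2).sum() for i, j in edges)       # stretching
    u_rib = sum(((Y[i] - 2 * Y[j] + Y[k]) ** 2).sum() for i, j, k in ribs)  # bending
    return u_approx + lam * u_edge + mu * u_rib
```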
10. Diffusion Maps - a Probabilistic Interpretation for Spectral Embedding and Clustering Algorithms
Spectral embedding and spectral clustering are common methods for non-linear dimensionality reduction and clustering of complex high-dimensional datasets. In this paper we provide a diffusion-based probabilistic analysis of algorithms that use the normalized graph Laplacian. Given the pairwise adjacency matrix of all points in a dataset, we define a random walk on the graph of points and a diffusion distance between any two points. We show that the diffusion distance is equal to the Euclidean distance in the space spanned by all eigenvectors of the normalized graph Laplacian. This identity shows that the characteristic relaxation times and processes of the random walk on the graph are the key concept governing the properties of these spectral clustering and spectral embedding algorithms. Specifically, for spectral clustering to succeed, a necessary condition is that the mean exit times from each cluster be significantly larger than the largest (slowest) of all relaxation times inside all of the individual clusters. For complex, multiscale data, this condition may not hold, and multiscale methods need to be developed to handle such situations.
Boaz Nadler, Stephane Lafon, Ronald Coifman, Ioannis G. Kevrekidis
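A minimal diffusion-map sketch along the lines analysed here: a Gaussian affinity matrix is row-normalised into a random-walk transition matrix, and the slow eigenvectors provide the embedding coordinates. The kernel scale and sizes are illustrative assumptions.

```python
# Diffusion-map embedding from the random walk on a data graph.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
eps = 1.0                                # kernel scale (assumption)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / eps)                    # pairwise adjacency (affinity)
P = W / W.sum(axis=1, keepdims=True)     # random-walk transition matrix

vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
vals, vecs = vals.real[order], vecs.real[:, order]
# Drop the trivial constant eigenvector; scale by eigenvalues (t = 1 here).
embedding = vecs[:, 1:4] * vals[1:4]
# Euclidean distances in this embedding approximate diffusion distances;
# the gap between slow and fast eigenvalues reflects the exit-time versus
# relaxation-time separation that the paper's clustering condition concerns.
```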
11. On Bounds for Diffusion, Discrepancy and Fill Distance Metrics
Finding criteria for optimally discretizing measurable sets in Euclidean space is a difficult and old problem, which relates directly to the problem of good numerical integration rules or finding points of low discrepancy. On the other hand, learning meaningful descriptions of a finite number of given points in a measure space is an exploding area of research, with applications as diverse as dimension reduction, data analysis, computer vision, critical infrastructure, complex networks, clustering, imaging, neural and sensor networks, wireless communications, financial marketing and dynamic programming. The purpose of this paper is to show that a general notion of extremal energy, as defined and studied recently by Damelin, Hickernell and Zeng on measurable sets X in Euclidean space, defines a diffusion metric on X which is equivalent to a discrepancy on X and at the same time bounds the fill distance on X for suitable measures with discrete support. The diffusion metric is used for learning via normalized graph Laplacian dimension reduction, and the discrepancy is used to discretize.
Steven B. Damelin
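As a small concrete companion to the fill-distance part of the discussion, the sketch below estimates the fill distance of a point set in the unit square by probing a dense uniform sample; the sets and sizes are assumptions.

```python
# Monte-Carlo estimate of the fill distance sup_{x in X} min_{p in P} |x - p|.
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(0, 1, size=(50, 2))           # the discretizing point set
probe = rng.uniform(0, 1, size=(20000, 2))    # dense sample of X = [0, 1]^2

nearest = np.sqrt(((probe[:, None, :] - P[None, :, :]) ** 2).sum(-1)).min(axis=1)
fill_distance = nearest.max()                 # radius of the largest "hole"
```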
12. Geometric Optimization Methods for the Analysis of Gene Expression Data
DNA microarrays provide such a huge amount of data that unsupervised methods are required to reduce the dimension of the data set and to extract meaningful biological information. This work shows that Independent Component Analysis (ICA) is a promising approach for the analysis of genome-wide transcriptomic data. The paper first presents an overview of the most popular algorithms to perform ICA. These algorithms are then applied on a microarray breast-cancer data set. Some issues about the application of ICA and the evaluation of biological relevance of the results are discussed. This study indicates that ICA significantly outperforms Principal Component Analysis (PCA).
Michel Journée, Andrew E. Teschendorff, Pierre-Antoine Absil, Simon Tavaré, Rodolphe Sepulchre
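A brief sketch of the ICA step on an expression-like matrix, using scikit-learn's FastICA as a stand-in for the algorithms surveyed; the synthetic data and component count are assumptions.

```python
# ICA on an expression-like matrix via FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))         # e.g. 200 samples x 1000 genes

ica = FastICA(n_components=10, random_state=0)
S = ica.fit_transform(X)                 # component activations per sample
A = ica.mixing_                          # gene loadings (genes x components)
# Each column of A can be inspected for coherent gene sets; in the paper the
# components are evaluated for biological relevance on breast-cancer data.
```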
13. Dimensionality Reduction and Microarray Data
Microarrays are currently used to measure the expression levels of thousands of genes simultaneously. They present new analytical challenges because they have a very high input dimension and a very low sample size. It is highly complex to analyse such multi-dimensional data with complex geometry and to identify low-dimensional “principal objects” that correspond to the optimal projection while losing the least amount of information. Several methods have been proposed for the dimensionality reduction of microarray data, including principal component analysis and principal manifolds. This article presents a comparative study of the performance of linear principal component analysis and the nonlinear local tangent space alignment (LTSA) principal manifold method on such a problem. Two microarray data sets are used in this study. A classification model is created using the fully dimensional and the dimensionality-reduced data sets. To measure the amount of information lost with the two dimensionality reduction methods, the performance of each method is measured in terms of the level of generalisation obtained by the classification models on previously unseen data sets. These results are compared with the ones obtained using the fully dimensional data sets.
David A. Elizondo, Benjamin N. Passow, Ralph Birkenhead, Andreas Huemer
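A hedged sketch of the comparison protocol described here: train the same classifier on the full data, on a PCA reduction and on an LTSA reduction, and compare accuracy on held-out data. The synthetic dataset, classifier and dimensions are assumptions (and, for simplicity, the embeddings are computed before the split, which a careful study would avoid).

```python
# Compare classifier generalisation on full vs. reduced representations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

reductions = {
    "full": X,
    "pca": PCA(n_components=10).fit_transform(X),
    "ltsa": LocallyLinearEmbedding(method="ltsa", n_neighbors=15,
                                   n_components=10).fit_transform(X),
}
for name, Z in reductions.items():
    Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.3, random_state=0)
    acc = LogisticRegression(max_iter=2000).fit(Ztr, ytr).score(Zte, yte)
    print(name, acc)                     # held-out accuracy per representation
```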
14. PCA and K-Means Decipher Genome
In this paper, we aim to give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a fragment of genetic text from a bacterial genome and analyze its structure. By means of principal component analysis they “discover” that the information in the genome is encoded in non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of the basic notions of bioinformatics. The Appendix contains program listings that accompany this exercise.
Alexander N. Gorban, Andrei Y. Zinovyev
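A condensed sketch of the tutorial's pipeline: cut the genome into fragments, count the 64 non-overlapping triplet frequencies per fragment, then inspect the PCA plot and cluster with K-Means. The random "genome" below is a stand-in for the bacterial sequence the tutorial uses, so it will not show the cluster structure a real genome does.

```python
# PCA + K-Means on triplet-frequency vectors of genome fragments.
import numpy as np
from itertools import product
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
genome = "".join(rng.choice(list("ACGT"), size=100_000))  # stand-in sequence

triplets = ["".join(p) for p in product("ACGT", repeat=3)]
index = {t: i for i, t in enumerate(triplets)}

frag_len = 300
rows = []
for start in range(0, len(genome) - frag_len, frag_len):
    frag = genome[start:start + frag_len]
    counts = np.zeros(64)
    for i in range(0, frag_len - 2, 3):   # non-overlapping triplets
        counts[index[frag[i:i + 3]]] += 1
    rows.append(counts)
F = np.array(rows)

coords = PCA(n_components=2).fit_transform(F)          # 2-D visualisation
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(F)
# On a real bacterial genome the PCA plot separates 7 clusters (six reading
# frames of coding text plus non-coding), which reveals the triplet code.
```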
Backmatter
Metadata
Title
Principal Manifolds for Data Visualization and Dimension Reduction
Edited by
Alexander N. Gorban
Balázs Kégl
Donald C. Wunsch
Andrei Y. Zinovyev
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-73750-6
Print ISBN
978-3-540-73749-0
DOI
https://doi.org/10.1007/978-3-540-73750-6
