2018 | Open Access | Book

Projection-Based Clustering through Self-Organization and Swarm Intelligence

Combining Cluster Analysis with the Visualization of High-Dimensional Data

About this book

This book is published open access under a CC BY 4.0 license.

The book covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS combines a 3D landscape visualization with the clustering of data. The 3D landscape enables 3D printing of high-dimensional data structures. The clustering, the number of clusters, or the absence of any cluster structure can be verified at a glance in the 3D landscape. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory. As a result, neither a global objective function nor the setting of parameters is required. With the downloadable R package, DBS can be applied to data drawn from diverse research fields, even by non-professionals in the field of data mining.

Table of Contents

Frontmatter

Chapter 1. Introduction
Abstract
We live in a time when information is cheaply available and saved as data nearly everywhere. The amount of generated data is growing exponentially. By the end of the year 2016 alone, 9000 exabytes of data will have been generated, equal to 9 trillion gigabytes or the capacity of 360 billion Blu-ray Discs [Schiele, 2016].
Michael Christoph Thrun

Chapter 2. Fundamentals
Abstract
The first section of this chapter familiarizes the reader with the definitions of the basic notation and terminology used in this thesis. Concepts of graph theory are introduced in the next section. They give rise to a new concept of neighborhoods, which is utilized in several chapters.
Michael Christoph Thrun

Chapter 3. Approaches to Cluster Analysis
Abstract
Many data mining methods rely on some concept of the similarity between pieces of information encoded in the data of interest. Various names have been applied to these clustering methods, depending largely on the field of application in data science. For example, in biology the term “numerical taxonomy” is used [Thorel et al., 1990], in psychology the term Q analysis is sometimes employed, market researchers often talk about “segmentation” [Arimond/Elfessi, 2001] and in the artificial intelligence literature, unsupervised pattern recognition is the favored label [Everitt et al., 2001, p. 4].
Michael Christoph Thrun

Chapter 4. Methods of Projection
Abstract
Dimensionality reduction techniques reduce the dimensions of the input space to facilitate the exploration of structures in high-dimensional data. Two general dimensionality reduction approaches exist: manifold learning and projection. Manifold-learning methods attempt to find a sub-space in which the high-dimensional distances can be preserved.
Michael Christoph Thrun

Chapter 5. Visualizing the Output Space
Abstract
Projection methods are a common approach to dimensionality reduction with the aim of transforming high-dimensional data into a low-dimensional space. For data visualization purposes, projections into two dimensions are considered here. However, when the output space is limited to two dimensions, the low-dimensional similarities cannot completely represent the high-dimensional distances, which can result in a misleading interpretation of the underlying structures.
Michael Christoph Thrun

Chapter 6. Quality Assessments of Visualizations
Abstract
Dimensionality reduction techniques reduce the dimensions of the input space to facilitate the exploration of structures in high-dimensional data. Two general dimensionality reduction approaches exist: manifold learning and projection. Manifold learning methods attempt to find sub-spaces in which the high-dimensional distances are preserved.
Michael Christoph Thrun

Chapter 7. Behavior-based Systems in Data Science
Abstract
Many technological advances have been achieved with the help of bionics, which is defined as the application of biological methods and systems found in nature. A related, rarely discussed subfield of information technology is called databionics. Databionics refers to the attempt to adopt information processing techniques from nature.
Michael Christoph Thrun

Chapter 8. Databionic Swarm (DBS)
Abstract
This chapter introduces a new concept for the use of swarm intelligence. It makes use of insights from the previous chapter and proposes a projection method based on a swarm of intelligent agents called DataBots [Ultsch, 2000c]. This new swarm is called a polar swarm (Pswarm) because its agents move in polar coordinates based on symmetry considerations (see [Feynman et al., 2007, pp. 147-153, 745]).
Michael Christoph Thrun

Chapter 9. Experimental Methodology
Abstract
This chapter describes all the data sets used in the results chapter and the parameter settings for the various methods. In the final section, brief overviews of the Gene Ontology (GO) database and overrepresentation analysis (ORA) are provided. For general distribution analyses, the CRAN R package AdaptGauss [Thrun/Ultsch, 2015; Ultsch et al., 2015] was used.
Michael Christoph Thrun

Chapter 10. Results on Pre-classified Data Sets
Abstract
This chapter has three sections. In the first section, the results of the Databionic swarm (DBS) clustering framework are compared with the given prior classifications for data sets from the Fundamental Clustering Problems Suite (FCPS) [Ultsch, 2005a]. The results for nine data sets analyzed using common clustering algorithms are compared in the first subsection.
Michael Christoph Thrun

Chapter 11. DBS on Natural Data Sets
Abstract
Several real-world data sets are used in this chapter to show that Databionic swarm (DBS) is able to find clusters in a variety of cases. The leukemia data set is based on luminance measurements of 7747 different active or non-active genes in 554 human subjects. The World GDP data set is a multivariate time series that consists of monetary values for 190 countries from 1970 to 2010.
Michael Christoph Thrun

Chapter 12. Knowledge Discovery with DBS
Abstract
In contrast to chapter 11, in which Databionic swarm (DBS) clustering was applied to recognize more or less obvious knowledge, this chapter shows that DBS is also able to discover new knowledge. A hydrological data set of multivariate time series [Aubert et al., 2016] and a data set consisting of pain genes [Ultsch et al., 2016b] are used for this purpose. In [Aubert et al., 2016], a high-frequency time series analysis was performed, but no prediction could be made.
Michael Christoph Thrun

Chapter 13. Discussion
Abstract
This work examined and analyzed patterns in high-dimensional data characterized by discontinuity. Such distance- or density-based patterns are either compact or connected structures. If the structures are compact, inter- versus intracluster distances are relevant.
Michael Christoph Thrun

Chapter 14. Conclusion
Abstract
A new, data-driven approach to cluster analysis and visualization is introduced in this work. Projection-based clustering combines structures preserved in two dimensions with the underlying high-dimensional structures (see also [Thrun et al., 2017; Thrun/Ultsch, 2017a]). It is a flexible and robust approach to cluster analysis that consists of three independent modules, which can optionally be combined into the Databionic swarm (DBS).
Michael Christoph Thrun
Backmatter
Metadata
Title
Projection-Based Clustering through Self-Organization and Swarm Intelligence
Author
Michael Christoph Thrun
Copyright year
2018
Electronic ISBN
978-3-658-20540-9
Print ISBN
978-3-658-20539-3
DOI
https://doi.org/10.1007/978-3-658-20540-9