
2007 | Book

Cluster Analysis for Data Mining and System Identification

Authors: János Abonyi, Balázs Feil

Publisher: Birkhäuser Basel


About this book

Data clustering is a common technique for statistical data analysis, used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. The aim of this book is to illustrate that advanced fuzzy clustering algorithms can be used not only for partitioning data but also for visualization, regression, classification and time-series analysis; hence fuzzy cluster analysis is a good approach to solving complex data mining and system identification problems.

Overview

In the last decade the amount of stored data has increased rapidly in almost all areas of life. The most recent survey of the amount of data was published by the University of California, Berkeley. According to it, the data produced in 2002 and stored on pressed media, films and electronic devices alone amount to about 5 exabytes. For comparison, if all 17 million volumes of the Library of Congress of the United States of America were digitized, they would occupy about 136 terabytes. Hence, 5 exabytes is about 37,000 Libraries of Congress. Spread over the 6.3 billion inhabitants of the Earth, this means that each person generates roughly 800 megabytes of data every year. It is interesting to compare this amount with Shakespeare's life work, which can be stored in just 5 megabytes.
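The two comparisons above can be checked with a few lines of arithmetic. This is only a sanity check of the quoted figures; decimal SI prefixes (1 exabyte = 10^18 bytes) are an assumption, since the text does not say which convention the survey used.

```python
# Figures quoted in the text; decimal SI units are assumed for this check.
EXABYTE = 10**18
TERABYTE = 10**12
MEGABYTE = 10**6

data_2002 = 5 * EXABYTE                 # data produced and stored in 2002
library_of_congress = 136 * TERABYTE    # 17 million digitized volumes
population = 6.3e9                      # inhabitants of the Earth

libraries = data_2002 / library_of_congress
per_person_mb = data_2002 / population / MEGABYTE

print(round(libraries))       # on the order of 37,000 Libraries of Congress
print(round(per_person_mb))   # on the order of 800 MB per person per year
```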

Table of contents

Frontmatter
Chapter 1. Classical Fuzzy Cluster Analysis
Abstract
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Data can reveal clusters of different geometrical shapes, sizes and densities as demonstrated in Figure 1.1. Clusters can be spherical (a), elongated or “linear” (b), and also hollow (c) and (d). Their prototypes can be points (a), lines (b), spheres (c) or ellipses (d) or their higher-dimensional analogs. Clusters (b) to (d) can be characterized as linear and nonlinear subspaces of the data space (ℝ² in this case). Algorithms that can detect subspaces of the data space are of particular interest for identification. The performance of most clustering algorithms is influenced not only by the geometrical shapes and densities of the individual clusters but also by the spatial relations and distances among the clusters. Clusters can be well separated, continuously connected to each other, or overlapping each other. The separation of clusters is influenced by the scaling and normalization of the data (see Examples 1.1, 1.2 and 1.3).
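As a rough illustration of clustering with point prototypes (case (a) above), the following is a minimal sketch of the standard fuzzy c-means iteration. It is not taken from the book: the toy data, the initial centers, the fuzzifier m = 2 and the fixed iteration count are all assumptions made for the example. Because memberships depend on Euclidean distances, rescaling one coordinate of the data would change the result, which is the sensitivity to scaling mentioned above.

```python
import math

def memberships_for(points, centers, m):
    """u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)) for every point k."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        # Floor the distance so a point sitting on a center does not divide by zero.
        dists = [max(math.dist(x, v), 1e-12) for v in centers]
        rows.append([1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                     for d_i in dists])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    dims = range(len(points[0]))
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        centers = []
        for i in range(len(u[0])):
            weights = [row[i] ** m for row in u]  # u_ik^m
            total = sum(weights)
            centers.append(tuple(
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in dims))
    return centers, memberships_for(points, centers, m)

# Hypothetical toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centers, u = fuzzy_c_means(points, centers=[(0.0, 0.0), (9.0, 9.0)])
print(centers)  # one center near each group
print(u[0])     # memberships of the first point; each row sums to 1
```

Each row of `u` holds the degrees to which one point belongs to every cluster, which is what distinguishes a fuzzy partition from a crisp one.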
Chapter 2. Visualization of the Clustering Results
Abstract
Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects which are difficult to analyze and interpret. Clustering always fits the clusters to the data, even if the cluster structure is not adequate for the problem. To analyze the adequacy of the cluster prototypes and the number of clusters, cluster validity measures are used (see Section 1.7). However, since validity measures reduce the overall evaluation to a single number, they cannot avoid a certain loss of information. A low-dimensional graphical representation of the clusters can be much more informative than a single validity value, because one can cluster by eye and qualitatively validate conclusions drawn from clustering algorithms. This chapter introduces the reader to the visualization of high-dimensional data in general, and presents two new methods for the visualization of fuzzy clustering results.
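To make the loss of information concrete, the sketch below computes one widely used validity measure, Bezdek's partition coefficient, for two hypothetical membership matrices. Whether Section 1.7 uses exactly this measure is an assumption; the matrices are invented for illustration. The two partitions differ in structure yet receive the same score.

```python
import math

def partition_coefficient(memberships):
    """Mean of squared memberships: 1.0 for a crisp partition, 1/c for a uniform one."""
    return sum(u * u for row in memberships for u in row) / len(memberships)

# Partition A: one point assigned crisply, one point completely undecided.
partition_a = [[1.0, 0.0], [0.5, 0.5]]

# Partition B: both points moderately fuzzy; a is chosen so that a^2 + (1-a)^2 = 0.75.
a = (2.0 + math.sqrt(2.0)) / 4.0
partition_b = [[a, 1.0 - a], [a, 1.0 - a]]

pc_a = partition_coefficient(partition_a)
pc_b = partition_coefficient(partition_b)
print(pc_a, pc_b)  # both are 0.75 up to rounding, although the partitions differ
```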
Chapter 3. Clustering for Fuzzy Model Identification — Regression
Abstract
For many real-world applications a great deal of information is provided by human experts, who do not reason in terms of mathematics but instead describe the system verbally through vague or imprecise statements like

If the Temperature is Big then the Pressure is High. (3.1)

Because so much human knowledge and expertise is given in terms of verbal rules, one of the sound engineering approaches is to try to integrate such linguistic information into the modelling process. A convenient and common way of doing this is to use fuzzy logic concepts to cast the verbal knowledge into a conventional mathematical representation (model structure), which subsequently can be fine-tuned using input-output data.
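A rule such as (3.1) becomes computable once the linguistic terms are given membership functions. The sketch below is one simple way to do this and is not the book's formulation: the ramp-shaped membership functions, their numeric ranges, and the use of the minimum as implication operator are all assumptions chosen for illustration.

```python
def ramp(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

# Hypothetical ranges for the linguistic terms of rule (3.1).
def temperature_is_big(t):
    return ramp(t, 60.0, 100.0)

def pressure_is_high(p):
    return ramp(p, 2.0, 6.0)

# Degree to which the rule applies for a measured temperature of 80.
firing = temperature_is_big(80.0)

# Minimum implication (an assumed choice): the conclusion "Pressure is High"
# is supported at most to the degree the premise holds.
support_for_p5 = min(firing, pressure_is_high(5.0))
print(firing, support_for_p5)
```

The parameters `low` and `high` are what would later be fine-tuned from input-output data.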
Chapter 4. Fuzzy Clustering for System Identification
Abstract
In this chapter we deal with fuzzy model identification, especially for dynamical systems. In practice, there is a need for model-based engineering tools, and they require the availability of suitable dynamical models. Consequently, the development of a suitable nonlinear model is of paramount importance. Fuzzy systems have been effectively used to identify complex nonlinear dynamical systems. In this chapter we show how effectively clustering algorithms can be used to identify a compact Takagi-Sugeno fuzzy model to represent single-input single-output and also multiple-input multiple-output dynamical systems.
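For orientation, a Takagi-Sugeno model combines local linear models, weighting each by how strongly its rule's premise is satisfied. The following is a minimal single-input sketch under stated assumptions: Gaussian premise membership functions and the two hypothetical rules below are invented for the example and are not identified from data. In a dynamical setting the input would typically hold lagged inputs and outputs of the system.

```python
import math

def gaussian(x, center, width):
    """Gaussian membership function (an assumed premise shape)."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

# Hypothetical rules: (center, width, slope, offset).
# Rule i reads: if x is near center_i then y = slope_i * x + offset_i.
rules = [
    (0.0, 1.0, 1.0, 0.0),
    (4.0, 1.0, -1.0, 8.0),
]

def ts_predict(x, rules):
    """Weighted average of the local linear models, weighted by rule activation."""
    weights = [gaussian(x, c, w) for c, w, _, _ in rules]
    outputs = [a * x + b for _, _, a, b in rules]
    return sum(w * y for w, y in zip(weights, outputs)) / sum(weights)

print(ts_predict(0.0, rules))  # dominated by the first local model
print(ts_predict(2.0, rules))  # both rules equally active: average of 2 and 6
```

Clustering enters when the rule centers, widths and local models are derived from clusters found in the data rather than set by hand.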
Chapter 5. Fuzzy Model based Classifiers
Abstract
Two forms of data-driven modelling are regression and classification. Based on some measured variables, both predict the value of one or more variables of interest. In regression the variables to be predicted are continuous or ordered; in classification they are discrete or nominal. Classification is also called supervised learning because the labels of the samples are known beforehand. This is the main difference between classification and clustering. The latter is unsupervised learning, since the clusters are to be determined and the labels of the data points are not known.
Chapter 6. Segmentation of Multivariate Time-series
Abstract
Partitioning a time-series into internally homogeneous segments is an important data mining problem. The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This chapter proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision-making algorithm based on a compatibility criterion of the clusters has been worked out to determine the required number of segments, while the required number of principal components is determined by the scree plots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.
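A scree plot is normally read by eye, looking for the point where the eigenvalues level off. As a stand-in for that visual judgment, the sketch below picks the smallest number of components whose eigenvalues explain a chosen share of the total variance. This cumulative-variance rule is a substitute technique, not the chapter's procedure, and the eigenvalues and the 85% threshold are hypothetical.

```python
def components_needed(eigenvalues, threshold=0.85):
    """Smallest q such that the q largest eigenvalues explain `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for q, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return q
    return len(ordered)

# Hypothetical eigenvalues of a covariance matrix for one segment.
eigenvalues = [4.2, 2.1, 0.4, 0.2, 0.1]
q = components_needed(eigenvalues)
print(q)  # number of principal components retained under the assumed threshold
```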
Backmatter
Metadata
Title
Cluster Analysis for Data Mining and System Identification
Authors
János Abonyi
Balázs Feil
Copyright year
2007
Publisher
Birkhäuser Basel
Electronic ISBN
978-3-7643-7988-9
Print ISBN
978-3-7643-7987-2
DOI
https://doi.org/10.1007/978-3-7643-7988-9
