Cluster Analysis for Data Mining and System Identification

Authors: János Abonyi, Balázs Feil

Publisher: Birkhäuser Basel

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. The aim of this book is to illustrate that advanced fuzzy clustering algorithms can be used not only for partitioning the data, but also for visualization, regression, classification and time-series analysis; hence fuzzy cluster analysis is a good approach to solving complex data mining and system identification problems.

Overview

In the last decade the amount of stored data has increased rapidly in almost all areas of life. The most recent survey of the amount of data was published by the University of California, Berkeley. According to that survey, the data produced in 2002 and stored in pressed media, films and electronic devices alone amount to about 5 exabytes. For comparison, if all 17 million volumes of the Library of Congress of the United States of America were digitized, they would occupy about 136 terabytes. Hence, 5 exabytes is about 37,000 Libraries of Congress. If this mass of data is divided among the 6.3 billion inhabitants of the Earth, it means that each contemporary generates roughly 800 megabytes of data every year. It is interesting to compare this amount with Shakespeare’s life-work, which can be stored in just 5 megabytes.
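The size comparisons in the overview can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the text does not state explicitly; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, i.e. 1 MB = 10**6, 1 TB = 10**12, 1 EB = 10**18 bytes.
MEGABYTE = 10**6
TERABYTE = 10**12
EXABYTE = 10**18

data_2002 = 5 * EXABYTE               # data produced in 2002, per the survey
library_of_congress = 136 * TERABYTE  # 17 million digitized volumes
world_population = 6.3e9              # inhabitants of the Earth

# How many Libraries of Congress fit into the 2002 data volume?
libraries = data_2002 / library_of_congress

# Yearly data volume per person, in megabytes.
per_person_mb = data_2002 / world_population / MEGABYTE

print(f"about {libraries:,.0f} Libraries of Congress")
print(f"about {per_person_mb:.0f} MB per person per year")
```

Both results round to the figures quoted in the text: about 37,000 Libraries of Congress and roughly 800 megabytes per person.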

Table of contents

Frontmatter

Chapter 1. Classical Fuzzy Cluster Analysis

Abstract

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Data can reveal clusters of different geometrical shapes, sizes and densities as demonstrated in Figure 1.1. Clusters can be spherical (a), elongated or “linear” (b), and also hollow (c) and (d). Their prototypes can be points (a), lines (b), spheres (c) or ellipses (d) or their higher-dimensional analogs. Clusters (b) to (d) can be characterized as linear and nonlinear subspaces of the data space (ℝ² in this case). Algorithms that can detect subspaces of the data space are of particular interest for identification. The performance of most clustering algorithms is influenced not only by the geometrical shapes and densities of the individual clusters but also by the spatial relations and distances among the clusters. Clusters can be well separated, continuously connected to each other, or overlapping each other. The separation of clusters is influenced by the scaling and normalization of the data (see Example 1.1, Example 1.2 and Example 1.3).
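To make the idea of fuzzy partitioning concrete, here is a minimal sketch of fuzzy c-means with point prototypes and Euclidean distance, which suits spherical clusters such as case (a). The abstract does not name a specific algorithm, so the choice of fuzzy c-means, the fuzziness exponent `m = 2`, the random initialization and the toy data are all assumptions made for illustration.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Cluster a list of equal-length tuples; return (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Prototype update: mean of the points weighted by u_ik ** m.
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[k] * points[k][d] for k in range(n)) / total
                for d in range(dim)))
        # Membership update from the distances to all prototypes.
        new_u = []
        for k in range(n):
            # Clamp distances away from zero to avoid division by zero.
            dist = [max(math.dist(points[k], ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((dist[i] / dist[j]) ** (2.0 / (m - 1.0))
                          for j in range(c))
                for i in range(c)])
        change = max(abs(new_u[k][i] - u[k][i])
                     for k in range(n) for i in range(c))
        u = new_u
        if change < tol:
            break
    return centers, u

# Toy data (hypothetical): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, u = fuzzy_c_means(points, c=2)
for p, row in zip(points, u):
    print(p, [round(v, 3) for v in row])
```

Each row of `u` sums to one, so a point lying between two clusters receives intermediate membership degrees instead of a hard label. Because the distances are Euclidean, rescaling one coordinate changes the result, which is one way the scaling and normalization mentioned above affect cluster separation.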

Chapter 2. Visualization of the Clustering Results

Abstract

Since high-dimensional data are clustered in practical data mining problems, the resulting clusters are high-dimensional geometrical objects which are difficult to analyze and interpret. Clustering always fits the clusters to the data, even if the cluster structure is not adequate for the problem. To analyze the adequacy of the cluster prototypes and the number of clusters, cluster validity measures are used (see Section 1.7). However, since validity measures reduce the overall evaluation to a single number, they cannot avoid a certain loss of information. A low-dimensional graphical representation of the clusters can be much more informative than such a single validity value, because one can cluster by eye and qualitatively validate conclusions drawn from clustering algorithms. This chapter introduces the reader to the visualization of high-dimensional data in general, and presents two new methods for the visualization of fuzzy clustering results.
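The loss of information in a single validity number can be seen in a small sketch. It uses the partition coefficient, the mean of the squared membership degrees, as an assumed example of a validity measure; the measures actually covered in Section 1.7 are not listed in this abstract, and the membership matrices are hypothetical.

```python
import math

def partition_coefficient(u):
    """Mean over data points of the sum of squared membership degrees."""
    return sum(v * v for row in u for v in row) / len(u)

# Hypothetical membership matrices (rows: data points, columns: clusters).
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
maximally_fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Two different partitions of the same two points:
split = [[0.9, 0.1], [0.1, 0.9]]   # points assigned to different clusters
merged = [[0.9, 0.1], [0.9, 0.1]]  # both points assigned to the same cluster

print(partition_coefficient(crisp))
print(partition_coefficient(maximally_fuzzy))
print(math.isclose(partition_coefficient(split),
                   partition_coefficient(merged)))
```

The partitions `split` and `merged` describe different groupings yet receive the same score, which is the kind of ambiguity a graphical representation can resolve.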

Chapter 3. Clustering for Fuzzy Model Identification — Regression

Abstract

For many real-world applications a great deal of information is provided by human experts, who do not reason in terms of mathematics but instead describe the system verbally through vague or imprecise statements like

If The Temperature is Big then The Pressure is High. (3.1)

Because so much human knowledge and expertise is given in terms of verbal rules, one of the sound engineering approaches is to try to integrate such linguistic information into the modelling process. A convenient and common way of doing this is to use fuzzy logic concepts to cast the verbal knowledge into a conventional mathematical representation (model structure), which can subsequently be fine-tuned using input-output data.

Chapter 4. Fuzzy Clustering for System Identification

Abstract

In this chapter we deal with fuzzy model identification, especially for dynamical systems. In practice, there is a need for model-based engineering tools, and they require the availability of suitable dynamical models. Consequently, the development of a suitable nonlinear model is of paramount importance. Fuzzy systems have been effectively used to identify complex nonlinear dynamical systems. In this chapter we show how effectively clustering algorithms can be used to identify a compact Takagi-Sugeno fuzzy model that represents single-input single-output and also multiple-input multiple-output dynamical systems.
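As a rough illustration of what a Takagi-Sugeno model computes, the sketch below blends two local linear models of a first-order single-input single-output system, y(k+1) = f(y(k), u(k)). The Gaussian membership functions, the model order and every rule parameter are hypothetical choices for this example and are not taken from the book; in the chapter's setting such parameters would be identified from data by clustering.

```python
import math

# Each rule: (center, width, a, b, c), all values hypothetical.
# Antecedent:  "y(k) is near center" (Gaussian membership function)
# Consequent:  y(k+1) = a * y(k) + b * u(k) + c
RULES = [
    (-1.0, 1.0, 0.5, 1.0, 0.0),
    (1.0, 1.0, 0.9, 0.2, 0.1),
]

def ts_predict(y, u, rules=RULES):
    """One-step-ahead prediction as a membership-weighted average."""
    weights = [math.exp(-0.5 * ((y - center) / width) ** 2)
               for center, width, _a, _b, _c in rules]
    local_outputs = [a * y + b * u + c
                     for _center, _width, a, b, c in rules]
    # Note: if y is very far from every center, all weights can underflow
    # to zero; this sketch does not guard against that case.
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, local_outputs)) / total

# Simulate a few steps with a constant input u = 1, starting from y = 0.
y = 0.0
for k in range(5):
    y = ts_predict(y, 1.0)
    print(k + 1, round(y, 4))
```

Because the weights are normalized, each prediction lies between the outputs of the local models, and the model switches smoothly from one local behavior to the other as the operating point y(k) moves.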

Chapter 5. Fuzzy Model based Classifiers

Abstract

Two forms of data-driven modelling are regression and classification. Based on some measured variables, both of them predict the value of one or more variables we are interested in. In the case of regression, the variables to be predicted are continuous or ordered; in the case of classification, they are discrete or nominal. Classification is also called supervised learning because the labels of the samples are known beforehand. This is the main difference between classification and clustering. The latter is unsupervised learning, since the clusters have to be determined and the labels of the data points are not known.

Chapter 6. Segmentation of Multivariate Time-series

Abstract

Partitioning a time-series into internally homogeneous segments is an important data mining problem. The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This chapter proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criterion of the clusters has been worked out to determine the required number of segments, while the required number of principal components is determined by the scree plots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

Backmatter

Title: Cluster Analysis for Data Mining and System Identification
Authors: János Abonyi, Balázs Feil
Publisher: Birkhäuser Basel
Electronic ISBN: 978-3-7643-7988-9
Print ISBN: 978-3-7643-7987-2
DOI: https://doi.org/10.1007/978-3-7643-7988-9
