
About this Book

This book offers an original and broad exploration of the fundamental methods in Clustering and Combinatorial Data Analysis, presenting new formulations and ideas within this very active field.

With extensive introductions, formal and mathematical developments and real case studies, this book provides readers with a deeper understanding of the mutual relationships between these methods, which are clearly expressed with respect to three facets: logical, combinatorial and statistical.

Using a relational mathematical representation, all types of data structures can be handled in precise and unified ways, which the author highlights in three stages:

Clustering a set of descriptive attributes
Clustering a set of objects or a set of object categories
Establishing correspondence between these two dual clusterings

Tools for interpreting the reasons of a given cluster or clustering are also included.

Foundations and Methods in Combinatorial and Statistical Data Analysis and Clustering will be a valuable resource for students and researchers who are interested in the areas of Data Analysis, Clustering, Data Mining and Knowledge Discovery.

Table of Contents

Frontmatter

Chapter 1. On Some Facets of the Partition Set of a Finite Set

Abstract
As indicated in the Preface, we shall start by describing mathematically the structure sought in Clustering. The latter is a partition or an ordered partition chain on a finite set E. In recent years a new structure has appeared, in which each partition class is a linearly ordered subset of E. An introduction to this latter structure is given in Sect. 1.4.5.
Israël César Lerman
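As an aside to this abstract, the two structures it names can be illustrated with a small sketch in Python; the ground set E, the example partitions and the helper functions below are purely illustrative assumptions and are not taken from the book.

```python
# Illustrative sketch (not from the book): a partition of a finite set E
# and an ordered (nested) chain of partitions, coarsening step by step.

E = {"a", "b", "c", "d", "e"}

# A partition of E: non-empty classes, pairwise disjoint, covering E.
partition = [{"a", "b"}, {"c"}, {"d", "e"}]

def is_partition(classes, ground_set):
    """Check that the classes are non-empty, pairwise disjoint and cover ground_set."""
    union = set()
    for cls in classes:
        if not cls or union & cls:
            return False
        union |= cls
    return union == ground_set

# An ordered partition chain: each partition refines the next, coarser one.
chain = [
    [{"a"}, {"b"}, {"c"}, {"d"}, {"e"}],   # finest: singletons
    [{"a", "b"}, {"c"}, {"d", "e"}],
    [{"a", "b", "c"}, {"d", "e"}],
    [E],                                   # coarsest: a single class
]

def refines(fine, coarse):
    """True if every class of `fine` is contained in some class of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

assert is_partition(partition, E)
assert all(refines(chain[i], chain[i + 1]) for i in range(len(chain) - 1))
```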

Chapter 2. Two Methods of Non-hierarchical Clustering

Abstract
As mentioned in the Preface, the development provided in this book is dominated by the potential of applying ascendant agglomerative hierarchical clustering to all types of data. Nonetheless, the specific methodology devoted to non-hierarchical clustering is also very important. Accordingly, we describe in this chapter two very different methods of non-hierarchical clustering. The first, called the “Central Partition” method, is due to S. Régnier (I.C.C. Bull. 4:175–191, 1965 [35], Revue Mathématiques et Sciences Humaines 22:13–29, 1983 [36], Revue Mathématiques et Sciences Humaines 22:31–44, 1983 [37]). The second, called the “méthode des Nuées Dynamiques” or “Dynamic cluster method”, is due to E. Diday and collaborators (Revue de Statistique Appliquée XIX(2):19–33, 1971 [10], J. Comput. Inf. Sci. 2(1):61–88, 1973 [11], Recherche Opérationnelle 10(6):75–1060, 1976 [16], R.A.I.R.O. Inf. Comput. Sci. 11(4):329–349, 1977 [13]). This approach corresponds to a vast generalization of the K-means method, initiated by Forgy (Biometrics, Biometric Society Meeting [17]) and Jancey (Aust. J. Bot. 14:127–130, 1966 [19]).
Israël César Lerman
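Since the dynamic cluster method is presented here as a vast generalization of the K-means method, a minimal K-means sketch may help fix the underlying iterative scheme: assign each object to its nearest kernel, then re-estimate each kernel from its class. The code below is an illustrative assumption in Python, not the book's algorithm; in the dynamic cluster method the centroid kernel would be replaced by a more general class representation.

```python
# Minimal K-means sketch: the iterative scheme that the dynamic cluster
# method generalizes (illustration only, not the book's procedure).
import random

def kmeans(points, k, n_iter=20, seed=0):
    """points: list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initial kernels: k drawn points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest kernel.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Representation step: replace each kernel by the centroid of its class.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers, labels = kmeans(data, k=2)
```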

Chapter 3. Structure and Mathematical Representation of Data

Abstract
In Chap. 2 two basic methods of non-hierarchical clustering were presented: the “transfer” method and the “dynamic adaptative method”. Frequently, a clustering method is related to a given type of data.
Israël César Lerman

Chapter 4. Ordinal and Metrical Analysis of the Resemblance Notion

Abstract
The concept of resemblance between data units (objects, categories or attributes) is the most important element in Data Analysis and Machine Learning. In Chap. 3, the description of a set of objects \(\mathcal {O}\) (resp., categories \(\mathcal {C}\)) by a set of descriptive attributes \(\mathcal {A}\) is formalized and a mathematical representation of this description is established.
Israël César Lerman

Chapter 5. Comparing Attributes by Probabilistic and Statistical Association I

Abstract
The data is defined by the observation of a set \(\mathcal {A}\) of descriptive attributes on a set \(\mathcal {O}\) of elementary objects or a set \(\Gamma \) of categories. In this and the following chapters, we develop the construction of an association coefficient on \(\mathcal {A}\). For this purpose, the Likelihood Linkage Analysis (LLA) approach is emphasized. It leads, in a unified process, to a very rich family of probabilistic association coefficients between descriptive attributes of any type. On the other hand, the principle of this method enables several association coefficients to be compared with one another.
Israël César Lerman
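The general idea of a probabilistic association coefficient can be sketched, under simplifying assumptions, for two Boolean attributes: the observed number of co-occurrences is centred and reduced under a model of independent random matching, then mapped through the standard normal distribution function. The normalisation used below is a simplified stand-in and may differ from the exact LLA construction developed in the chapter.

```python
# Hedged sketch of a probabilistic association coefficient between two
# Boolean attributes observed on the same n objects.
from math import erf, sqrt

def probabilistic_association(a, b):
    """a, b: equal-length lists of 0/1 values observed on the same objects."""
    n = len(a)
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(x * y for x, y in zip(a, b))      # observed co-occurrences
    mean = n_a * n_b / n                         # expectation under independence
    std = sqrt(n_a * n_b / n) if n_a and n_b else 1.0  # simplified reduction
    q = (n_ab - mean) / std                      # centred and reduced raw index
    return 0.5 * (1.0 + erf(q / sqrt(2.0)))      # normal CDF: probabilistic index

# Example: two attributes that tend to occur together score close to 1.
a = [1, 1, 1, 0, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 0, 1, 0]
print(probabilistic_association(a, b))
```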

Chapter 6. Comparing Attributes by a Probabilistic and Statistical Association II

Abstract
The data is defined by the observation of a set \(\mathcal {A}\) of descriptive attributes on a set \(\mathcal {O}\) of elementary objects. As indicated in the introduction of the preceding chapter (see Sect. 5.1 of Chap. 5), \(\mathcal {A}\) consists of attributes of the same type, belonging to the general type II (see Sect. 3.3 of Chap. 3). To fix ideas in this introduction, we may imagine \(\mathcal {A}\) as composed of nominal categorical attributes. The different comparison cases are listed at the beginning of the following section (see Sect. 6.2). For this comparison, as expressed in the introductory Sect. 5.1 of Chap. 5, the LLA approach will be emphasized. It leads, in a unified process, to a very rich family of probabilistic association coefficients between descriptive attributes of any type. On the other hand, the principle of this method enables several association coefficients to be compared with one another.
Israël César Lerman

Chapter 7. Comparing Objects or Categories Described by Attributes

Abstract
Let us recapitulate briefly the development of the previous chapters. As emphasized above (see Sect. 3.5.4 of Chap. 3), given a set \(\mathcal {A} = \{ a^{1}, a^{2}, \ldots , a^{j}, \ldots , a^{p} \}\) of descriptive attributes, the description may concern a set \(\mathcal {O} = \{ o_{1}, o_{2}, \ldots , o_{i}, \ldots , o_{n} \}\) of elementary objects or a set \(\Gamma = \{ C_{1}, C_{2}, \ldots , C_{i}, \ldots , C_{I} \}\) of categories or classes. In Sect. 4.2 of Chap. 4 we studied the problem of defining a similarity index between objects or attributes when \(\mathcal {A}\) consists of Boolean attributes. In this analysis several facets are considered: combinatorial, metrical, ordinal and statistical. We saw in Sect. 4.2.1.3 how to transpose a similarity index between objects, defined in the framework of Boolean description, to the case of description by numerical attributes. Conversely, we have transposed a similarity or distance function defined on \(\mathcal {O}\) in the case of numerical description to that where \(\mathcal {A}\) is a set of Boolean attributes.
Israël César Lerman
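For readers unfamiliar with Boolean comparison, the following sketch shows the four elementary counts on which most similarity indices between two Boolean-described objects are built, together with the classical Jaccard index as one example. The chapter studies a much wider family of indices and their statistical normalisation; nothing in this snippet is specific to the book.

```python
# Illustrative sketch: the four basic counts behind similarity indices for
# two objects x and y described by p Boolean attributes.

def boolean_counts(x, y):
    """x, y: equal-length 0/1 description vectors.  Returns (s, u, v, t)."""
    s = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # common presences
    u = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)   # present in x only
    v = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)   # present in y only
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)   # common absences
    return s, u, v, t

def jaccard(x, y):
    """Classical Jaccard index: common presences over non-jointly-absent attributes."""
    s, u, v, _ = boolean_counts(x, y)
    return s / (s + u + v) if s + u + v else 1.0

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
print(boolean_counts(x, y), jaccard(x, y))
```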

Chapter 8. The Notion of “Natural” Class, Tools for Its Interpretation. The Classifiability Concept

Abstract
Numerical, statistical and algorithmic approaches in traditional biological classification appeared around the 1950s. This introduction of new methods of investigation was due to an English and American school of biologist and naturalist taxonomists open to metrical, statistical and computer science approaches. The principal objective consists of discovering a system of “natural” classes organizing the vast set of living beings.
Israël César Lerman

Chapter 9. Quality Measures in Clustering

Abstract
The construction of a clustering criterion depends on the nature of the data and the mathematical structure retained for its representation. We saw in Chap. 2 two criteria associated with two different methods: the “Central partition” and the “Dynamic adaptative” methods, respectively. A formal definition of a criterion in Data Analysis may be expressed as follows: “The structure \(\sigma \) on the set E concerned, to which the data representation belongs (e.g. a similarity coefficient on E), is more general than the structure \(\tau \) sought (a partition or an ordered chain of partitions on E). The general strategy consists of injecting the family \(\Theta \) of the \(\tau \) structures on E into the family \(\Sigma \) of the \(\sigma \) structures on E. The objective is then to determine the element or elements of \(\Theta \) that are most comparable with \(\sigma (E)\) (\(\sigma \) calculated on E). For this purpose, a criterion is built. Quite generally, a criterion is a ranking (preorder) relation (possibly partial) on the set \(\Theta \). For the latter, \(\tau (E)\) is preferred to \(\tau '(E)\) if and only if, in a given sense, \(\tau (E)\) is nearer to \(\sigma (E)\) than \(\tau '(E)\) is.”
Israël César Lerman
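A minimal sketch of this idea, under deliberately simple assumptions: take \(\sigma\) to be a numerical similarity on pairs of E, take each candidate \(\tau\) to be a partition of E, and rank partitions by their total within-class similarity. This is only the crudest conceivable criterion, chosen to make the preference relation tangible; the chapter builds and compares far more refined ones.

```python
# Minimal sketch of a clustering criterion: sigma is a similarity on pairs of
# E, tau is a candidate partition, and partitions are ranked by total
# within-class similarity (illustration only).
from itertools import combinations

def within_class_similarity(partition, sim):
    """partition: list of lists of elements of E; sim: dict {(x, y): similarity}."""
    total = 0.0
    for cls in partition:
        for x, y in combinations(cls, 2):
            total += sim.get((x, y), sim.get((y, x), 0.0))
    return total

E = ["a", "b", "c", "d"]
sim = {("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.1,
       ("b", "c"): 0.3, ("b", "d"): 0.2, ("c", "d"): 0.8}

tau1 = [["a", "b"], ["c", "d"]]
tau2 = [["a", "c"], ["b", "d"]]
# tau1 is preferred to tau2: it is "nearer" the similarity sigma.
print(within_class_similarity(tau1, sim), within_class_similarity(tau2, sim))
```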

Chapter 10. Building a Classification Tree

Abstract
The basic data consists of a finite set E provided with a dissimilarity or a similarity function. The elements of E are of the same nature. As seen in Chap. 3, E can be a set of attributes, a set of objects or a set of categories. Tables 3.4 and 3.5 of Chap. 3 give a precise idea of the different possible versions of E. On the other hand, the set E may be weighted by a positive numerical measure \(\mu _{E}\), assigning to each of its elements x (\(x \in E\)) a weight \(\mu _{x}\). \(\mu _{x}\) defines the “importance” with which x has to be considered. The dissimilarity or similarity function on E is mostly numerical; however, it might be ordinal. In Chaps. 4–7, the facets of building a similarity index or association coefficient on E were examined in detail in relation to the descriptive nature of E.
Israël César Lerman
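To make the construction tangible, here is a hedged sketch of ascendant agglomerative hierarchical clustering on a finite set E equipped with a dissimilarity: the two closest clusters are repeatedly merged until a single class remains. Single linkage is used purely for illustration; the choice of aggregation criterion, which the book studies in depth, is the essential point.

```python
# Hedged sketch of ascendant agglomerative hierarchical clustering on a
# finite set E with a dissimilarity (single linkage, for illustration only).
from itertools import combinations

def agglomerate(elements, dissim):
    """elements: list; dissim: dict {(x, y): dissimilarity}.  Returns the merge history."""
    def d(x, y):
        return dissim.get((x, y), dissim.get((y, x), 0.0))

    clusters = [frozenset([e]) for e in elements]
    history = []
    while len(clusters) > 1:
        # Single-linkage distance between clusters: minimum pairwise dissimilarity.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: min(d(x, y) for x in pair[0] for y in pair[1]))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((set(c1), set(c2)))
    return history

E = ["a", "b", "c", "d"]
dissim = {("a", "b"): 0.1, ("a", "c"): 0.8, ("a", "d"): 0.9,
          ("b", "c"): 0.7, ("b", "d"): 0.85, ("c", "d"): 0.2}
for left, right in agglomerate(E, dissim):
    print("merge", left, right)
```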

Chapter 11. Applying the LLA Method to Real Data

Abstract
The theoretical and methodological aspects of the LLA approach were highlighted and studied in Chaps. 3, 5–7, 9 and 10. The construction of this methodology gives us an overall view of unsupervised methods of Data Analysis and Clustering.
Israël César Lerman

Chapter 12. Conclusion and Thoughts for Future Works

Abstract
The work presented in this book is at the heart of Data Science. In this general area, Data Analysis, and more particularly Data Clustering, plays a fundamental part.
Israël César Lerman