Skip to main content

Über dieses Buch

This book summarizes the state-of-the-art in unsupervised learning. The contributors discuss how with the proliferation of massive amounts of unlabeled data, unsupervised learning algorithms, which can automatically discover interesting and useful patterns in such data, have gained popularity among researchers and practitioners. The authors outline how these algorithms have found numerous applications including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. They present how the difficulty of developing theoretically sound approaches that are amenable to objective evaluation have resulted in the proposal of numerous unsupervised learning algorithms over the past half-century. The intended audience includes researchers and practitioners who are increasingly using unsupervised learning algorithms to analyze their data. Topics of interest include anomaly detection, clustering, feature extraction, and applications of unsupervised learning. Each chapter is contributed by a leading expert in the field.



Anomaly Detection for Data with Spatial Attributes

The problem of detecting spatially-coherent groups of data that exhibit anomalous behavior has started to attract attention due to applications across areas such as epidemic analysis and weather forecasting. Earlier efforts from the data mining community have largely focused on finding outliers, individual data objects that display deviant behavior. Such point-based methods are not easy to extend to find groups of data that exhibit anomalous behavior. Scan statistics are methods from the statistics community that have considered the problem of identifying regions where data objects exhibit a behavior that is atypical of the general dataset. The spatial scan statistic and methods that build upon it mostly adopt the framework of defining a character for regions (e.g., circular or elliptical) of objects and repeatedly sampling regions of such character followed by applying a statistical test for anomaly detection. In the past decade, there have been efforts from the statistics community to enhance efficiency of scan statistics as well as to enable discovery of arbitrarily shaped anomalous regions. On the other hand, the data mining community has started to look at determining anomalous regions that have behavior divergent from their neighborhood. In this chapter, we survey the space of techniques for detecting anomalous regions on spatial data from across the data mining and statistics communities while outlining connections to well-studied problems in clustering and image segmentation. We analyze the techniques systematically by categorizing them appropriately to provide a structured birds-eye view of the work on anomalous region detection; we hope that this would encourage better cross-pollination of ideas across communities to help advance the frontier in anomaly detection.
P. Deepak

Anomaly Ranking in a High Dimensional Space: The Unsupervised TreeRank Algorithm

Ranking unsupervised data in a multivariate feature space \(\mathcal{X} \subset \mathbb{R}^{d}\), d ≥ 1 by degree of abnormality is of crucial importance in many applications (e.g., fraud surveillance, monitoring of complex systems/infrastructures such as energy networks or aircraft engines, system management in data centers). However, the learning aspect of unsupervised ranking has only received attention in the machine-learning community in the past few years. The Mass-Volume (MV) curve has been recently introduced in order to evaluate the performance of any scoring function \(s: \mathcal{X} \rightarrow \mathbb{R}\) with regard to its ability to rank unlabeled data. It is expected that relevant scoring functions will induce a preorder similar to that induced by the density function f(x) of the (supposedly continuous) probability distribution of the statistical population under study. As far as we know, there is no efficient algorithm to build a scoring function from (unlabeled) training data with nearly optimal MV curve when the dimension d of the feature space is high. It is the major purpose of this chapter to introduce such an algorithm which we call the Unsupervised TreeRank algorithm. Beyond its description and the statistical analysis of its performance, numerical experiments are exhibited in order to provide empirical evidence of its accuracy.
S. Clémençon, N. Baskiotis, N. Vayatis

Genetic Algorithms for Subset Selection in Model-Based Clustering

Model-based clustering assumes that the data observed can be represented by a finite mixture model, where each cluster is represented by a parametric distribution. The Gaussian distribution is often employed in the multivariate continuous case. The identification of the subset of relevant clustering variables enables a parsimonious number of unknown parameters to be achieved, thus yielding a more efficient estimate, a clearer interpretation and often improved clustering partitions. This paper discusses variable or feature selection for model-based clustering. Following the approach of Raftery and Dean (J Am Stat Assoc 101(473):168–178, 2006), the problem of subset selection is recast as a model comparison problem, and BIC is used to approximate Bayes factors. The criterion proposed is based on the BIC difference between a candidate clustering model for the given subset and a model which assumes no clustering for the same subset. Thus, the problem amounts to finding the feature subset which maximises such a criterion. A search over the potentially vast solution space is performed using genetic algorithms, which are stochastic search algorithms that use techniques and concepts inspired by evolutionary biology and natural selection. Numerical experiments using real data applications are presented and discussed.
Luca Scrucca

Clustering Evaluation in High-Dimensional Data

Clustering evaluation plays an important role in unsupervised learning systems, as it is often necessary to automatically quantify the quality of generated cluster configurations. This is especially useful for comparing the performance of different clustering algorithms as well as determining the optimal number of clusters in clustering algorithms that do not estimate it internally. Many clustering quality indexes have been proposed over the years and different indexes are used in different contexts. There is no unifying protocol for clustering evaluation, so it is often unclear which quality index to use in which case. In this chapter, we review the existing clustering quality measures and evaluate them in the challenging context of high-dimensional data clustering. High-dimensional data is sparse and distances tend to concentrate, possibly affecting the applicability of various clustering quality indexes. We analyze the stability and discriminative power of a set of standard clustering quality measures with increasing data dimensionality. Our evaluation shows that the curse of dimensionality affects different clustering quality indexes in different ways and that some are to be preferred when determining clustering quality in many dimensions.
Nenad Tomašev, Miloš Radovanović

Combinatorial Optimization Approaches for Data Clustering

Target of cluster analysis is to group data represented as a vector of measurements or a point in a multidimensional space such that the most similar objects belong to the same group or cluster. The greater the similarity within a cluster and the greater the dissimilarity between clusters, the better the clustering task has been performed. Starting from the 1990s, cluster analysis has emerged as an important interdisciplinary field, applied to several heterogeneous domains with numerous applications, including among many others social sciences, information retrieval, natural language processing, galaxy formation, image segmentation, and biological data.Scope of this paper is to provide an overview of the main types of criteria adopted to classify and partition the data and to discuss properties and state-of-the-art solution approaches, with special emphasis to the combinatorial optimization and operational research perspective.
Paola Festa

Kernel Spectral Clustering and Applications

In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine-based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and L 0, \(L_{1},L_{0} + L_{1}\), Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large-scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering, and big data learning are considered.
Rocco Langone, Raghvendra Mall, Carlos Alzate, Johan A. K. Suykens

Uni- and Multi-Dimensional Clustering Via Bayesian Networks

This chapter discusses model based clustering via Bayesian networks. Both uni-dimensional and multi-dimensional clustering methods are discussed. The main idea for uni-dimensional clustering via Bayesian networks is to use the Bayesian structural clustering algorithm, which is a greedy algorithm that makes use of the EM algorithm. On the other hand, for multi-dimensional clustering we investigate latent tree models which according to our knowledge, are the only model based approach to multi-dimensional clustering. There are generally two approaches for learning latent tree models: Greedy search and feature selection. The former is able to cover a wider range of models, but the latter is more time efficient. However, latent tree models are unable to capture dependency between partitions through attributes. So we propose two approaches to overcome this shortcoming. Our first approach extends the idea of Bayesian structural clustering for uni-dimensional clustering, while the second one is a combination of feature selection methods and the main idea of multi-dimensional classification with Bayesian networks. We test our second approach on both real and synthetic data. The results show the goodness of our approach in finding meaningful and novel partitions.
Omid Keivani, Jose M. Peña

A Radial Basis Function Neural Network Training Mechanism for Pattern Classification Tasks

This chapter proposes a radial basis function network learning approach for classification problems that combines hierarchical fuzzy clustering and particle swarm optimization (PSO) with discriminant analysis to elaborate on an effective design of radial basis function neural network classifier. To eliminate the redundant information, the training data are pre-processed to create a partition of the feature space into a number of fuzzy subspaces. The center elements of the subspaces are considered as a new data set which is further clustered by means of a weighted clustering scheme. The obtained cluster centers coincide with the centers of the network’s basis functions. The method of PSO is used to estimate the neuron connecting weights involved in the learning process. The proposed classifier is applied to three machine learning data sets, and its results are compared to other relative approaches that exist in the literature.
Antonios D. Niros, George E. Tsekouras

A Survey of Constrained Clustering

Traditional data mining methods for clustering only use unlabeled data objects as input. The aim of such methods is to find a partition of these unlabeled data objects in order to discover the underlying structure of the data. In some cases, there may be some prior knowledge about the data in the form of (a few number of) labels or constraints. Performing traditional clustering methods by ignoring the prior knowledge may result in extracting irrelevant information for the user. Constrained clustering, i.e., clustering with side information or semi-supervised clustering, addresses this problem by incorporating prior knowledge into the clustering process to discover relevant information from the data. In this chapter, a survey of advances in the area of constrained clustering will be presented. Different types of prior knowledge considered in the literature, and clustering approaches that make use of this prior knowledge will be reviewed.
Derya Dinler, Mustafa Kemal Tural

An Overview of the Use of Clustering for Data Privacy

In this chapter we review some of our results related to the use of clustering in the area of data privacy. The paper gives a brief overview of data privacy and, more specifically, on data driven methods for data privacy and discusses where clustering can be applied in this setting. We discuss the role of clustering in the definition of masking methods, and on the calculation of information loss and data utility.
Vicenç Torra, Guillermo Navarro-Arribas, Klara Stokes

Nonlinear Clustering: Methods and Applications

As a fundamental classification method for pattern recognition, data clustering plays an important role in various fields such as computer science, medical science, social science, and economics. According to the data distribution of clusters, data clustering problem can be categorized into linearly separable clustering and nonlinearly separable clustering. Due to the complex manifold of the real-world data, nonlinearly separable clustering is one of most popular and widely studied clustering problems. This chapter reviews nonlinear clustering algorithms from four viewpoints, namely kernel-based clustering, multi-exemplar model, graph-based method, and support vector clustering (SVC). Accordingly, this chapter reviews four nonlinear clustering methods, namely conscience on-line learning (COLL) for kernel-based clustering, multi-exemplar affinity propagation (MEAP), graph-based multi-prototype competitive learning (GMPCL), and position regularized support vector clustering (PSVC), and demonstrates their applications in computer vision such as digital image clustering, video segmentation, and color image segmentation.
Chang-Dong Wang, Jian-Huang Lai

Swarm Intelligence-Based Clustering Algorithms: A Survey

Swarm intelligence (SI) is an artificial intelligence technique that depends on the collective properties emerging from multi-agents in a swarm. In this work, the SI-based algorithms for hard (crisp) clustering are reviewed. They are studied in five groups: particle swarm optimization, ant colony optimization, ant-based sorting, hybrid algorithms, and other SI-based algorithms. Agents are the key elements of the SI-based algorithms, as they determine how the solutions are generated and directly affect the exploration and exploitation capabilities of the search procedure. Hence, a new classification scheme is proposed for the SI-based clustering algorithms according to the agent representation. We elaborate on which representation schemes are used in different algorithm categories. We also examine how the SI-based algorithms, together with the representation schemes, address the challenging characteristics of the clustering problem such as multiple objectives, unknown number of clusters, arbitrary-shaped clusters, data types, constraints, and scalability. The pros and cons of each representation scheme are discussed. Finally, future research directions are suggested.
Tülin İnkaya, Sinan Kayalıgil, Nur Evin Özdemirel

Extending Kmeans-Type Algorithms by Integrating Intra-cluster Compactness and Inter-cluster Separation

Kmeans-type clustering aims at partitioning a data set into clusters such that the objects in a cluster are compact and the objects in different clusters are well-separated. However, most of kmeans-type clustering algorithms rely on only intra-cluster compactness while overlooking inter-cluster separation. In this chapter, a series of new clustering algorithms by extending the existing kmeans-type algorithms is proposed by integrating both intra-cluster compactness and inter-cluster separation. First, a set of new objective functions for clustering is developed. Based on these objective functions, the corresponding updating rules for the algorithms are then derived analytically. The properties and performances of these algorithms are investigated on several synthetic and real-life data sets. Experimental studies demonstrate that our proposed algorithms outperform the state-of-the-art kmeans-type clustering algorithms with respects to four metrics: Accuracy, Rand Index, Fscore, and normal mutual information (NMI).
Xiaohui Huang, Yunming Ye, Haijun Zhang

A Fuzzy-Soft Competitive Learning Approach for Grayscale Image Compression

In this chapter we develop a fuzzy-set-based vector quantization algorithm for the efficient compression of grayscale still images. In general, vector quantization can be carried out by using crisp-based and fuzzy-based methods. The motivation of the current work is to provide a systematic framework upon which the above two general methodologies can effectively cooperate. The proposed algorithm accomplishes this task through the utilization of two main steps. First, it introduces a specially designed fuzzy neighborhood function to quantify the lateral neuron interaction phenomenon and the degree of the neuron excitation of the standard self-organizing map. Second, it involves a codeword migration strategy, according to which codewords that correspond to small and underutilized clusters are moved to areas that appear high probability to contain large number of training vectors. The proposed methodology is rigorously compared to other relative approaches that exist in the literature. An interesting outcome of the simulation study is that although the proposed algorithm constitutes a fuzzy-based learning mechanism, it finally obtains computational costs that are comparable to crisp-based vector quantization schemes, an issue that can hardly be maintained by the standard fuzzy vector quantizers.
Dimitrios M. Tsolakis, George E. Tsekouras

Unsupervised Learning in Genome Informatics

With different genomes available, unsupervised learning algorithms are essential in learning genome-wide biological insights. Especially, the functional characterization of different genomes is essential for us to understand lives. In this chapter, we review the state-of-the-art unsupervised learning algorithms for genome informatics from DNA to MicroRNA.
DNA (DeoxyriboNucleic Acid) is the basic component of genomes. A significant fraction of DNA regions (transcription factor binding sites) are bound by proteins (transcription factors) to regulate gene expression at different development stages in different tissues. To fully understand genetics, it is necessary to apply unsupervised learning algorithms to learn and infer those DNA regions. Here we review several unsupervised learning methods for deciphering the genome-wide patterns of those DNA regions.
MicroRNA (miRNA), a class of small endogenous noncoding RNA (RiboNucleic acid) species, regulate gene expression post-transcriptionally by forming imperfect base-pair with the target sites primarily at the \(3^{{\prime}}\) untranslated regions of the messenger RNAs. Since the discovery of the first miRNA let-7 in worms, a vast amount of studies have been dedicated to functionally characterizing the functional impacts of miRNA in a network context to understand complex diseases such as cancer. Here we review several representative unsupervised learning frameworks on inferring miRNA regulatory network by exploiting the static sequence-based information pertinent to the prior knowledge of miRNA targeting and the dynamic information of miRNA activities implicated by the recently available large data compendia, which interrogate genome-wide expression profiles of miRNAs and/or mRNAs across various cell conditions.
Ka-Chun Wong, Yue Li, Zhaolei Zhang

The Application of LSA to the Evaluation of Questionnaire Responses

This chapter will discuss research surrounding the development of applications for automated evaluation and generation of instructional feedback based on the analysis of responses to open-ended essay questions. While an effective means of evaluation, examining responses to essay questions requires natural language processing that often demands human input and guidance which can be labor intensive, time consuming, and language dependent. In order to provide this means of evaluation on a larger scale, an automated unsupervised learning approach is necessary that overcomes all these limitations. Latent Semantic Analysis (LSA) is an unsupervised learning system used for deriving and representing the semantic relationships between items in a body of natural language content. It mimics the representation of meaning that is formed by a human who learns linguistic constructs through exposure to natural language over time. The applications described in this chapter leverage LSA as an unsupervised system to learn language and provide a semantic framework that can be used for mapping natural language responses, evaluating the quality of those responses, and identifying relevant instructional feedback based on their semantic content. We will discuss the learning algorithms used to construct the LSA framework of meaning as well as methods to apply that framework for the evaluation and generation of feedback.
Dian I. Martin, John C. Martin, Michael W. Berry

Mining Evolving Patterns in Dynamic Relational Networks

Dynamic networks have recently been recognized as a powerful abstraction to model and represent the temporal changes and dynamic aspects of the data underlying many complex systems. This recognition has resulted in a burst of research activity related to modeling, analyzing, and understanding the properties, characteristics, and evolution of such dynamic networks. The focus of this growing research has been on mainly defining important recurrent structural patterns and developing algorithms for their identification. Most of these tools are not designed to identify time-persistent relational patterns or do not focus on tracking the changes of these relational patterns over time. Analysis of temporal aspects of the entity relations in these networks can provide significant insight in determining the conserved relational patterns and the evolution of such patterns over time. In this chapter we present new data mining methods for analyzing the temporal evolution of relations between entities of relational networks. A qualitative analysis of the results shows that the discovered patterns are able to capture network characteristics that can be used as features for modeling the underlying dynamic network in the context of a classification task.
Rezwan Ahmed, George Karypis

Probabilistically Grounded Unsupervised Training of Neural Networks

The chapter is a survey of probabilistic interpretations of artificial neural networks (ANN) along with the corresponding unsupervised learning algorithms. ANNs for estimating probability density functions (pdf) are reviewed first, including parametric estimation via constrained radial basis functions and nonparametric estimation via multilayer perceptrons. The approaches overcome the limitations of traditional statistical estimation methods, possibly leading to improved pdf models. The focus is then moved from pdf estimation to online neural clustering, relying on maximum-likelihood training. Finally, extension of the techniques to the unsupervised training of generative probabilistic hybrid paradigms for sequences of random observations is discussed.
Edmondo Trentin, Marco Bongini
Weitere Informationen