Skip to main content



1. Information Theory, Machine Learning, and Reproducing Kernel Hilbert Spaces

The common problem faced by many data processing professionals is how to best extract the information contained in data. In our daily lives and in our professions, we are bombarded by huge amounts of data, but most often data are not our primary interest. Data hides, either in time structure or in spatial redundancy, important clues to answer the information-processing questions we pose. We are using the term information in the colloquial sense, and therefore it may mean different things to different people, which is OK for now. We all realize that the use of computers and the Web accelerated tremendously the accessibility and the amount of data being generated. Therefore the pressure to distill information from data will mount at an increasing pace in the future, and old ways of dealing with this problem will be forced to evolve and adapt to the new reality. To many (including the author) this represents nothing less than a paradigm shift, from hypothesis-based, to evidence-based science and it will affect the core design strategies in many disciplines including learning theory and adaptive systems.
José C. Principe

2. Renyi’s Entropy, Divergence and Their Nonparametric Estimators

It is evident from Chapter 1 that Shannon’s entropy occupies a central role in information-theoretic studies. Yet, the concept of information is so rich that perhaps there is no single definition that will be able to quantify information properly. Moreover, from an engineering perspective, one must estimate entropy from data which is a nontrivial matter. In this book we concentrate on Alfred Renyi’s seminal work on information theory to derive a set of estimators to apply entropy and divergence as cost functions in adaptation and learning. Therefore, we are mainly interested in computationally simple, nonparametric estimators that are continuous and differentiable in terms of the samples to yield well-behaved gradient algorithms that can optimize adaptive system parameters. There are many factors that affect the determination of the optimum of the performance surface, such as gradient noise, learning rates, and misadjustment, therefore in these types of applications the entropy estimator’s bias and variance are not as critical as, for instance, in coding or rate distortion theories. Moreover in adaptation one is only interested in the extremum (maximum or minimum) of the cost, with creates independence from its actual values, because only relative assessments are necessary. Following our nonparametric goals, what matters most in learning is to develop cost functions or divergence measures that can be derived directly from data without further assumptions to capture as much structure as possible within the data’s probability density function (PDF).
Dongxin Xu, Deniz Erdogmuns

3. Adaptive Information Filtering with Error Entropy and Error Correntropy Criteria

This chapter formulates a new cost function for adaptive filtering based on Renyi’s quadratic error entropy. The problem of estimating the linear system parameters \(\mathrm{\mathbf {w}} = {[{w}_{0},\ldots, {w}_{M-1}]}^{\mathrm{T}}\) in the setting of Figure 3.1 where x(n), and z(n) are random variables can be framed as model-based inference, because it relates measured data, uncertainty, and the functional description of the system and its parameters. The desired response z(n) can be thought of as being created by an unknown transformation of the input vector \(\mathrm{\mathbf {x}} = {[x(n),\ldots, x(n - M + 1)]}^{\mathrm{T}}\). Adaptive filtering theory [143, 284] addresses this problem using the MSE criterion applied to the error signal, \(e(n) = z(n) - f(\mathrm{\mathbf {w}},x(n))\)
$${J}_{w}(e(n)) = E[{(z(n) - f(\mathrm{\mathbf{w}},x(n)))}^{2}]$$
when the linear filter is a finite impulse response filter (FIR);
$$y(n) =\sum\limits_{k=0}^{M-1}{w}_{ k}x(n - k).$$
Deniz Erdogmus, Weifeng Liu

4. Algorithms for Entropy and Correntropy Adaptation with Applications to Linear Systems

This chapter develops several batch and online learning algorithms for the error entropy criterion (EEC) that are counterparts to the most widely used algorithms for the mean square error criterion (MSE). Because the chapter assumes knowledge of adaptive filter design, readers unfamiliar with this topic should seek a textbook such as [332] or [253] for a review of fundamentals. But the treatment does not require an in-depth knowledge of this field. The case studies in this chapter address only adaptation of linear systems, not because entropic costs are particularly useful for the linear model, but because the solutions for linear systems are well understood and performance comparisons can be easily drawn. This chapter also considers applications of fast evaluations of the IP using the fast Gauss transform and incomplete Cholesky decomposition, and ends with an application of the error correntropy criterion (ECC) to adaptive noise cancellation.
Deniz Erdogmus, Seungju Han, Abhishek Singh

5. Nonlinear Adaptive Filtering with MEE, MCC, and Applications

Our emphasis on the linear model in Chapter 4 was only motivated by simplicity and pedagogy. As we have demonstrated in the simple case studies, under the linearity and Gaussianity conditions, the final solution of the MEE algorithms was basically equivalent to the solution obtained with the LMS. Because the LMS algorithm is computationally simpler and better understood, there is really no advantage to use MEE in such cases.
Deniz Erdogmus, Rodney Morejon, Weifeng Liu

6. Classification with EEC, Divergence Measures, and Error Bounds

The previous chapters provided extensive coverage of the error entropy criterion (EEC) especially in regard to minimization of the error entropy (MEE) for linear and nonlinear filtering (or regression) applications. However, the spectrum of engineering applications of adaptive systems is much broader than filtering or regression. Even looking at the subclass of supervised applications we have yet to deal with classification, which is an important application area for learning technologies. All of the practical ingredients are here to extend EEC to classification inasmuch as Chapter 5 covered the integration of EEC with the backpropagation algorithm (MEE-BP). Hence we have all the tools needed to train classifiers with MEE. We show that indeed this is the case and that the classifiers trained with MEE have performances normally better than MSE-trained classifiers. However, there are still no mathematical foundations to ascertain under what conditions EEC is optimal for classification, and further work is necessary.
Deniz Erdogmus, Dongxin Xu, Kenneth Hild

7. Clustering with ITL Principles

Learning and adaptation deal with the quantification and exploitation of the input source “structure” as pointed out perhaps for the first time by Watanabe [330]. Although structure is a vague and difficult concept to quantify, structure fills the space with identifiable patterns that may be distinguishable macroscopically by the shape of the probability density function. Therefore, entropy and the concept of dissimilarity naturally form the foundations for unsupervised learning because they are descriptors of PDFs.
Robert Jenssen, Sudhir Rao

8. Self-Organizing ITL Principles for Unsupervised Learning

Chapter 1 presented a synopsis of information theory to understand its foundations and how it affected the field of communication systems. In a nutshell, mutual information characterizes the fundamental compromise of maximum rate for error-free information transmission (the channel capacity theorem) as well as the minimal information that needs to be sent for a given distortion (the rate distortion theorem). In essence given the statistical knowledge of the data and these theorems the optimal communication system emerges, or self-organizes from the data.
Sudhir Rao, Deniz Erdogmus, Dongxin Xu, Kenneth Hild

9. A Reproducing Kernel Hilbert Space Framework for ITL

During the last decade, research on Mercer kernel-based learning algorithms has flourished [294, 226, 289]. These algorithms include, for example, the support vector machine (SVM) [63], kernel principal component analysis (KPCA) [289], and kernel Fisher discriminant analysis (KFDA) [219]. The common property of these methods is that they operate linearly, as they are explicitly expressed in terms of inner products in a transformed data space that is a reproducing kernel Hilbert space (RKHS). Most often they correspond to nonlinear operators in the data space, and they are still relatively easy to compute using the so-called “kernel-trick”. The kernel trick is no trick at all; it refers to a property of the RKHS that enables the computation of inner products in a potentially infinite-dimensional feature space, by a simple kernel evaluation in the input space. As we may expect, this is a computational saving step that is one of the big appeals of RKHS. At first glance one may even think that it defeats the “no free lunch theorem” (get something for nothing), but the fact of the matter is that the price of RKHS is the need for regularization and in the memory requirements as they are memory-intensive methods. Kernel-based methods (sometimes also called Mercer kernel methods) have been applied successfully in several applications, such as pattern and object recognition [194], time series prediction [225], and DNA and protein analysis [350], to name just a few.
Jianwu Xu, Robert Jenssen, Antonio Paiva, Il Park

10. Correntropy for Random Variables: Properties and Applications in Statistical Inference

Similarity is a key concept to quantify temporal signals or static measurements. Similarity is difficult to define mathematically, however, one never really thinks too much about this difficulty and naturally translates similarity by correlation. This is one more example of how engrained second-order moment descriptors of the probability density function really are in scientific thinking. Successful engineering or pattern recognition solutions from these methodologies rely heavily on the Gaussianity and linearity assumptions, exactly for the same reasons discussed in Chapter 3.
Weifeng Liu, Puskal Pokharel, Jianwu Xu, Sohan Seth

11. Correntropy for Random Processes: Properties and Applications in Signal Processing

The previous chapter defined cross-correntropy for the case of a pair of scalar random variables, and presented applications in statistical inference. This chapter extends the definition of correntropy for the case of random (or stochastic) processes, which are index sets of random variables. In statistical signal processing the index set is time; we are interested in random variables that are a function of time and the goal is to quantify their statistical dependencies (although the index set can also be defined over inputs or channels of multivariate random variables). The autocorrelation function, which measures the statistical dependency between random variables at two different times, is conventionally utilized for this goal. Hence, we generalize the definition of autocorrelation to an autocorrentropy function. The name correntropywas coined to reflect the fact that the function “looks like” correlation but the sum over the lags (or over dimensions of the multivariate random variable) is the information potential (i.e., the argument of Renyi’s quadratic entropy). The definition of cross-correntropy for random variables carries over to time series with a minor but important change in the domain of the variables that now are an index set of lags. When it is clear from the context, we simplify the terminology and refer to the different functions (autocorrentropy, or crosscorrentropy) simply as correntropy function, but keep the word “function” to distinguish them from Chapter 10 quantities.
Puskal Pokharel, Ignacio Santamaria, Jianwu Xu, Kyu-hwa Jeong, Weifeng Liu


Weitere Informationen