2011 | Book

Kernel-based Data Fusion for Machine Learning

Methods and Applications in Bioinformatics and Text Mining

Authors: Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau

Publisher: Springer Berlin Heidelberg

Book series: Studies in Computational Intelligence

About this book

Data fusion problems arise frequently in many different fields. This book provides a specific introduction to data fusion problems using support vector machines. The first part begins with a brief survey of additive models and Rayleigh quotient objectives in machine learning, and then introduces kernel fusion as the additive expansion of support vector machines in the dual problem. The second part presents several novel kernel fusion algorithms and some real applications in supervised and unsupervised learning. The last part of the book substantiates the value of the proposed theories and algorithms in MerKator, an open software tool for identifying disease-relevant genes based on the integration of heterogeneous genomic data sources in multiple species.


The topics presented in this book are meant for researchers or students who use support vector machines. Several topics addressed in the book may also be interesting to computational biologists who want to tackle data fusion challenges in real applications. The background required of the reader is a good knowledge of data mining, machine learning and linear algebra.

Table of Contents

Frontmatter
Introduction
Abstract
The history of learning has been accompanied by the pace of evolution and the progress of civilization. Some modern ideas of learning (e.g., pattern analysis and machine intelligence) can be traced back thousands of years in the analects of oriental philosophers [16] and Greek mythologies (e.g., The Antikythera Mechanism [83]). Machine learning, a contemporary topic rooted in computer science and engineering, has always been inspired and enriched by the unremitting efforts of biologists and psychologists in their investigation and understanding of nature. The Baldwin effect [4], proposed by James Mark Baldwin 110 years ago, concerns the costs and benefits of learning in the context of evolution, and has greatly influenced the development of evolutionary computation.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
Rayleigh Quotient-Type Problems in Machine Learning
Abstract
For real matrices and vectors, given a positive definite matrix Q and a nonzero vector w, a Rayleigh quotient (also known as Rayleigh-Ritz ratio) is defined as
$$ \rho =\rho( w;Q) = \frac{ w^T Q w}{ w^T w}. $$
It was originally proposed in a theorem for approximating the eigenvalues of a Hermitian matrix.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
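The Rayleigh quotient defined in the abstract above can be evaluated numerically in a few lines. The following is a minimal sketch, not code from the book: the matrix `Q` and the helper name `rayleigh_quotient` are hypothetical, chosen only for illustration. It relies on the standard fact that, for a symmetric matrix, the quotient lies between the smallest and largest eigenvalues and attains them at the corresponding eigenvectors.

```python
import numpy as np

def rayleigh_quotient(w, Q):
    """Return w^T Q w / w^T w for a nonzero vector w."""
    w = np.asarray(w, dtype=float)
    return float(w @ Q @ w) / float(w @ w)

# Hypothetical 2x2 symmetric positive definite matrix (eigenvalues 1 and 3).
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(Q)

rho_min = rayleigh_quotient(eigenvectors[:, 0], Q)   # smallest eigenvalue
rho_max = rayleigh_quotient(eigenvectors[:, -1], Q)  # largest eigenvalue
rho_any = rayleigh_quotient([1.0, 0.0], Q)           # lies in between
```

Any other nonzero `w` gives a value between `rho_min` and `rho_max`, which is why maximizing or minimizing the quotient reduces to an eigenvalue problem.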
Ln-norm Multiple Kernel Learning and Least Squares Support Vector Machines
Abstract
In the era of information overflow, data mining and machine learning are indispensable tools to retrieve information and knowledge from data. Incorporating several data sources in an analysis may be beneficial by reducing the noise, as well as by improving statistical significance and leveraging the interactions and correlations between data sources to obtain more refined and higher-level information [50]; this is known as data fusion. In bioinformatics, considerable effort has been devoted to genomic data fusion, which is an emerging topic pertaining to many applications. At present, terabytes of data are generated by high-throughput techniques at an increasing rate. In data fusion, these terabytes are further multiplied by the number of data sources or the number of species. Building a statistical model that describes this data is therefore not an easy matter. To tackle this challenge, it is rather effective to consider the data as being generated by a complex and unknown black box, with the goal of finding a function or an algorithm that operates on an input to predict the output. About 15 years ago, Boser [8] and Vapnik [51] introduced the support vector method, which makes use of kernel functions. This method has offered plenty of opportunities to solve complicated problems but has also brought many interdisciplinary challenges in statistics, optimization theory, and the applications therein [40].
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
Optimized Data Fusion for Kernel k-means Clustering
Abstract
In this chapter, we will present a novel optimized kernel k-means clustering (OKKC) algorithm to combine multiple data sources. The objective of k-means clustering is formulated as a Rayleigh quotient function of the between-cluster scatter and the cluster membership matrix. To incorporate multiple data sources, the between-cluster matrix is calculated in the high dimensional Hilbert space where the heterogeneous data sources can be easily combined as kernel matrices. The objective to optimize the kernel combination and the cluster memberships on unlabeled data is non-convex. To solve it, we apply an alternating minimization [6] method to optimize the cluster memberships and the kernel coefficients iteratively to convergence. When the cluster membership is given, we optimize the kernel coefficients as a kernel Fisher Discriminant (KFD) problem and solve it as a least squares support vector machine (LSSVM). The objectives of KFD and k-means are combined in a unified model, so the two components optimize towards the same objective; therefore, the proposed alternating algorithm converges locally.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
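To make the membership half of such an alternating scheme concrete, here is a minimal sketch of kernel k-means on a weighted sum of kernel matrices. It is not the OKKC algorithm: the kernel weights are fixed by hand (an assumption), and the KFD/LSSVM step that would re-estimate them is not reproduced. The function names, the toy data and the starting labels are all hypothetical.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices (weights assumed non-negative)."""
    return sum(w * K for w, K in zip(weights, kernels))

def kernel_kmeans(K, labels, n_clusters, n_iter=20):
    """Lloyd-style kernel k-means starting from the given labels."""
    labels = np.asarray(labels).copy()
    n = K.shape[0]
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster stays unreachable
            # Squared feature-space distance to the centroid of cluster c.
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy data: two well-separated groups on a line, seen through two kernels.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
K_linear = X @ X.T
diff = X[:, None, :] - X[None, :, :]
K_rbf = np.exp(-(diff ** 2).sum(axis=2))

K = combine_kernels([K_linear, K_rbf], [0.5, 0.5])  # fixed, not optimized
labels = kernel_kmeans(K, labels=[0, 1, 0, 1, 0, 1], n_clusters=2)
```

In a full alternating method, the weights passed to `combine_kernels` would be updated after each membership update and the two steps repeated until neither changes.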
Multi-view Text Mining for Disease Gene Prioritization and Clustering
Abstract
Text mining helps biologists to collect disease-gene associations automatically from large volumes of biological literature. During the past ten years, there has been a surge of interest in automatic exploration of the biomedical literature, ranging from the modest approach of annotating and extracting keywords from text to more ambitious attempts such as Natural Language Processing and text-mining-based network construction and inference. In particular, these efforts effectively help biologists to identify the most likely disease candidates for further experimental validation. The most important resource for text mining applications now is the MEDLINE database developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). Because MEDLINE covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [45]. Therefore, a successful text mining approach relies heavily on an appropriate model. In creating a text mining model, the selection of the Controlled Vocabulary (CV) and the representation schemes of terms occupy a central role, and the efficiency of biomedical knowledge discovery varies greatly between different text mining models. To address these challenges, we propose a multi-view text mining approach to retrieve information from different biomedical domain levels and combine them to identify the disease-relevant genes through prioritization. A view represents a text mining result retrieved by a specific CV, so multi-view text mining is characterized by applying multiple controlled vocabularies to retrieve gene-centric perspectives from free-text publications. Since all the information is retrieved from the same MEDLINE database and varies only by the CV, the term view also indicates that the data consists of multiple domain-based perspectives of the same corpus. We expect that the correlated and complementary information contained in the multi-view textual data can facilitate the understanding of the roles of genes in genetic diseases.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
Optimized Data Fusion for k-means Laplacian Clustering
Abstract
Clustering is a fundamental problem in unsupervised learning, and a number of different algorithms and methods have emerged over the years. k-means and spectral clustering are two popular methods for clustering analysis. k-means (KM) was proposed to cluster attribute-based data into k clusters with the minimal distortion [4, 8]. Another well-known method, spectral clustering (SC) [18, 20], is also widely adopted in many applications. Unlike KM, SC is specifically developed for graphs, where the data samples are represented as vertices connected by non-negatively weighted undirected edges. The problem of clustering on graphs belongs to a different paradigm from the algorithms based on the distortion measure.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
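To illustrate the graph-based paradigm mentioned in the abstract above, the sketch below bipartitions a small weighted undirected graph using the unnormalized graph Laplacian. This is a textbook form of spectral clustering rather than the book's k-means Laplacian algorithm, and the adjacency matrix `W` is a hypothetical example (two triangles joined by one weak edge).

```python
import numpy as np

# Hypothetical weighted undirected graph: nodes 0-2 and 3-5 form two
# triangles, connected only by a weak edge between nodes 2 and 3.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])

degrees = W.sum(axis=1)
L = np.diag(degrees) - W  # unnormalized graph Laplacian

# eigh returns eigenvalues in ascending order; the eigenvector of the
# second smallest eigenvalue (the Fiedler vector) splits the graph.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]
labels = (fiedler > 0).astype(int)
```

The sign of an eigenvector is arbitrary, so which triangle receives label 0 or 1 may differ between runs or platforms; only the grouping is meaningful.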
Weighted Multiple Kernel Canonical Correlation
Abstract
In the preceding chapters we have presented several supervised and unsupervised algorithms using kernel fusion to combine multi-source and multi-representation data. In this chapter we will investigate a different unsupervised learning problem, Canonical Correlation Analysis (CCA), and its extension with kernel fusion techniques. The goal of CCA (taking two data sets for example) is to identify the canonical variables that minimize or maximize the linear correlations between the transformed variables [8]. The conventional CCA is employed on two data sets in the observation space (original space). An extension of CCA to multiple data sets was also proposed by Kettenring, and it leads to different criteria for selecting the canonical variables, which are summarized as five different models: sum of correlation model, sum of squared correlation model, maximum variance model, minimal variance model and generalized variance model [9].
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
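For the conventional two-set case described in the abstract above, canonical correlations can be computed with a QR decomposition followed by an SVD. The sketch below shows plain linear CCA only, not the weighted multiple kernel variant of the chapter; the function name and the synthetic data are hypothetical, and both centered data matrices are assumed to have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y.

    Assumes both centered matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations,
    # returned in descending order.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

# Synthetic data: the first column of Y is an exact linear combination
# of columns of X, the second column is independent noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = np.column_stack([X[:, 0] + X[:, 1], rng.normal(size=200)])

corrs = canonical_correlations(X, Y)
```

Because one direction of `Y` is perfectly predictable from `X`, the first canonical correlation is 1 up to rounding error, while the second reflects only chance correlation with the noise column.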
Cross-Species Candidate Gene Prioritization with MerKator
Abstract
In modern biology, the use of high-throughput technologies allows researchers and practitioners to quickly and efficiently screen the genome in order to identify the genetic factors of a given disorder. However, these techniques often generate large lists of candidate genes, among which only one or a few are really associated with the biological process of interest. Since the individual validation of all these candidate genes is often too costly and time-consuming, only the most promising genes are experimentally assayed.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
Conclusion
Abstract
The exquisite nature of combining various senses in human cognition motivated our approach to incorporate multiple sources in data mining. The research described in this book covers a number of topics which are relevant to supervised and unsupervised learning by kernel-based data fusion. The discussion of these topics is organized around four different aspects: theory, algorithm, application and software.
Shi Yu, Léon-Charles Tranchevent, Bart De Moor, Yves Moreau
Backmatter
Metadata
Title
Kernel-based Data Fusion for Machine Learning
Authors
Shi Yu
Léon-Charles Tranchevent
Bart De Moor
Yves Moreau
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-19406-1
Print ISBN
978-3-642-19405-4
DOI
https://doi.org/10.1007/978-3-642-19406-1