
2008 | Book

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part I

Edited by: Walter Daelemans, Bart Goethals, Katharina Morik

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the joint conference on Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2008, held in Antwerp, Belgium, in September 2008. The 100 papers presented in two volumes, together with 5 invited talks, were carefully reviewed and selected from 521 submissions. In addition to the regular papers, the volume contains 14 abstracts of papers appearing in full version in Springer's Machine Learning journal and Data Mining and Knowledge Discovery journal. The conference provides an international forum for the discussion of the latest high-quality research results in all areas related to machine learning and knowledge discovery in databases. The topics addressed include applications of machine learning and data mining methods to real-world problems, particularly exploratory research that describes novel learning and mining tasks and applications requiring non-standard techniques.

Table of Contents

Frontmatter

Invited Talks (Abstracts)

Industrializing Data Mining, Challenges and Perspectives

Business Intelligence is a very active sector in all industrial domains. Classical techniques (reporting and OLAP), mainly concerned with presenting data, are already widely deployed. Meanwhile, Data Mining has long been used in companies as a niche technique, reserved for experts only and for very specific problems (credit scoring and fraud detection, for example). But with the increasing availability of large data volumes (in particular, but not only, from the Web), companies are more and more turning to data mining to provide them with high added-value predictive analytics. However, producing models in large numbers and making use of large data volumes in an industrial context can only happen if solutions to challenges, both theoretical and operational, are found: we need algorithms which can be used to produce models when datasets have thousands of variables and millions of observations; we need to learn how to run and control the correct execution of hundreds of models; we need ways to automate the data mining process.

I will present these constraints in industrial contexts and show how KXEN has exploited theoretical results (coming from Vladimir Vapnik’s work) to provide answers to the above-mentioned challenges. I will give a few examples of real-life applications and will conclude with some remarks on the future of data mining in the industrial domain.

Françoise Fogelman-Soulié
From Microscopy Images to Models of Cellular Processes

The advance of fluorescent tagging and of confocal microscopy is allowing biologists to image biochemical processes at a level of detail that was unimaginable just a few years ago. However, as the analysis of these images is done mostly by hand, there is a severe bottleneck in transforming these images into useful quantitative data that can be used to evaluate mathematical models.

One of the inherent challenges involved in automating this transformation is that image data is highly variable. This requires a recalibration of the image processing algorithms for each experiment. We use machine learning methods to enable the experimentalist to calibrate the image processing methods without having any knowledge of how these methods work. This, we believe, will allow the rapid integration of computer vision methods with confocal microscopy and open the way to the development of quantitative spatial models of cellular processes.

For more information, see http://seed.ucsd.edu/yfreund/NewHomePage/Applications/Biomedical Imaging.html

Yoav Freund
Data Clustering: 50 Years Beyond K-means

The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks (domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories.
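
Against this backdrop, the k-means procedure named in the title can be stated in a few lines; the following is a minimal sketch of Lloyd's iteration (illustrative only, not material from the talk; the random initialization and all names are our own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```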

Anil K. Jain
Learning Language from Its Perceptual Context

Current systems that learn to process natural language require laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we present a system that learns to sportscast simulated robot soccer games by example. The training data consists of textual human commentaries on RoboCup simulation games. A set of possible alternative meanings for each comment is automatically constructed from game event traces. Our previously developed systems for learning to parse and generate natural language (KRISP and WASP) were augmented to learn from this data and then commentate novel games. The system is evaluated based on its ability to parse sentences into correct meanings and generate accurate descriptions of game events. Human evaluation was also conducted on the overall quality of the generated sportscasts and compared to human-generated commentaries.

Raymond J. Mooney
The Role of Hierarchies in Exploratory Data Mining

In a broad range of data mining tasks, the fundamental challenge is to efficiently explore a very large space of alternatives. The difficulty is two-fold: first, the size of the space raises computational challenges, and second, it can introduce data sparsity issues even in the presence of very large datasets. In this talk, we'll consider how the use of hierarchies (e.g., taxonomies, or the OLAP multidimensional model) can help mitigate the problem.

Raghu Ramakrishnan

Machine Learning Journal Abstracts

Rollout Sampling Approximate Policy Iteration

Several researchers [2,3] have recently investigated the connection between reinforcement learning and classification. Our work builds on [2], which suggests an approximate policy iteration algorithm for learning a good policy represented as a classifier, without explicit value function representation. At each iteration, a new policy is produced using training data obtained through rollouts of the previous policy on a simulator. These rollouts aim at identifying better action choices over a subset of states in order to form a set of data for training the classifier representing the improved policy. Even though [2,3] examine how to distribute training states over the state space, their major limitation remains the large amount of sampling employed at each training state.

We suggest methods to reduce the number of samples needed to obtain a high-quality training set. This is done by viewing the setting as akin to a bandit problem over the states from which rollouts are performed. Our contribution is two-fold: (a) we suitably adapt existing bandit techniques for rollout management, and (b) we suggest a more appropriate statistical test for identifying states with dominating actions early and with high confidence. Experiments on two classical domains (inverted pendulum, mountain car) demonstrate an improvement in sample complexity that substantially increases the applicability of rollout-based algorithms. In future work, we aim to obtain algorithms specifically tuned to this task with even lower sample complexity and to address the question of the choice of sampling distribution.
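
As a toy illustration of the rollout idea (our own sketch, not the authors' implementation; the simulator interface `reset_to`/`step` and the crude mean-margin stopping rule are assumptions standing in for a proper statistical test):

```python
import statistics

def rollout(simulator, state, first_action, policy, horizon):
    """Simulate one trajectory: take `first_action`, then follow `policy`."""
    total, s = 0.0, simulator.reset_to(state)
    a = first_action
    for _ in range(horizon):
        s, r, done = simulator.step(s, a)
        total += r
        if done:
            break
        a = policy(s)
    return total

def evaluate_actions_by_rollouts(simulator, state, actions, policy,
                                 horizon=50, max_rollouts=100, margin=0.1):
    """Toy rollout evaluation: sample returns per action, stop early when one
    action's mean return dominates all others by `margin`."""
    returns = {a: [] for a in actions}
    for i in range(max_rollouts):
        for a in actions:
            returns[a].append(rollout(simulator, state, a, policy, horizon))
        means = {a: statistics.mean(r) for a, r in returns.items()}
        best = max(means, key=means.get)
        if i >= 5 and all(means[best] - m >= margin
                          for a, m in means.items() if a != best):
            break  # early stopping: the best action already dominates
    return best
```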

Christos Dimitrakakis, Michail G. Lagoudakis
New Closed-Form Bounds on the Partition Function

Estimating the partition function is a key computation in graphical models that is required for important tasks like parameter estimation, model selection and structure learning. However, this computation is intractable in general. Thus, developing efficient and accurate approximations is of considerable interest. Variational methods express the computation of the partition function as an optimization problem (solving which is intractable in general) and then relax the optimization problem in various ways to obtain approximations to the partition function. Two popular algorithms belonging to this framework are loopy belief propagation (LBP) and tree-reweighted belief propagation (TRW-BP). Both these algorithms are so-called message passing algorithms that work by updating local beliefs about probabilities at graph nodes by passing messages between the nodes in the graph until convergence is achieved. TRW-BP is guaranteed to give an upper bound on the partition function. However, neither algorithm is guaranteed to converge in general. Although convergent alternatives to TRW-BP have been proposed, they have no guarantees on the number of iterations required for convergence. Thus, these algorithms could be prohibitively expensive for applications requiring repeated inference on large graphs (like training large CRFs). In order to overcome this problem, Sutton et al. propose the piecewise approximation (PW) to the partition function. PW tends to be loose, but can be computed very fast, because it has a closed-form expression. Sutton et al. apply it successfully to common NLP tasks. In this paper, for a special class of potentials (that contain associative potentials), we prove that both LBP and TRW-BP converge in a single iteration. Using this fact, we obtain closed-form expressions for the TRW and LBP approximations to the partition function. In recent work, Wainwright et al. prove that LBP gives a lower bound on the partition function for binary attractive MRFs. We thus also get closed-form lower bounds for attractive associative MRFs. This enables us to obtain bounds on the error between the true partition function and these approximations for attractive associative MRFs. We also construct examples which show that TRW and LBP can give arbitrarily bad approximations to the marginal probabilities. Using the closed-form bounds for these special cases, we also develop novel closed-form upper bounds for arbitrary MRFs and closed-form lower bounds for associative binary MRFs. We also present experimental results showing that the new upper bounds are almost always tighter than the piecewise bound. The experiments also show that the novel lower bounds beat popular existing bounds like the mean-field bound on densely connected graphs.
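
For reference (standard definitions, not results from the paper), the partition function of a pairwise MRF with node potentials $\psi_i$ and edge potentials $\psi_{ij}$ is

$Z = \sum_{\mathbf{x}} \prod_{i \in V} \psi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j),$

and the bounds discussed above sandwich it, for binary attractive MRFs, as $\log Z_{\mathrm{LBP}} \le \log Z \le \log Z_{\mathrm{TRW}}$.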

Krishnamurthy Dvijotham, Soumen Chakrabarti, Subhasis Chaudhuri
Large Margin vs. Large Volume in Transductive Learning

We focus on distribution-free transductive learning. In this setting the learning algorithm is given a ‘full sample’ of unlabeled points. Then, a training sample is selected uniformly at random from the full sample and the labels of the training points are revealed. The goal is to predict the labels of the remaining unlabeled points as accurately as possible. The full sample partitions the transductive hypothesis space into a finite number of equivalence classes. All hypotheses in the same equivalence class generate the same dichotomy of the full sample. We consider a large volume principle, whereby the priority of each equivalence class is proportional to its “volume” in the hypothesis space.

Ran El-Yaniv, Dmitry Pechyony, Vladimir Vapnik
Incremental Exemplar Learning Schemes for Classification on Embedded Devices

In this paper, we focus on the data classification problem when the classifier operates on an embedded device (e.g., fault detection in device condition-monitoring data streams). Memory-based classifiers are an excellent choice in such cases; however, an embedded device is unlikely to be able to hold a large training dataset in memory (which could potentially keep increasing in size as new training data with new concepts arrive). A viable option then is to employ exemplar learning (EL) techniques to find a training subset comprising a few carefully selected exemplars of high functional value that fit in memory and effectively delineate the class boundaries. We propose two novel incremental EL schemes that, unlike traditional EL approaches [3], are (1) incremental (they naturally incorporate new training data streams), (2) capable of ordered removal of instances (they can be customized to obtain exemplar sets of any user-defined size), and (3) robust (such that the exemplar sets generalize to other classifiers as well). Our proposed methods are as follows:

EBEL (Entropy Based EL) – This method removes instances from the training set based on their information content. Instead of using an ad hoc ranking scheme, it removes the training instance whose removal causes the least drop in the conditional entropy of the class indicator variable, ensuring minimal loss of information.

ABEL (AUC Based EL) – This method prunes data based on AUC (area under the ROC curve) performance. ABEL uses a validation set and prunes the instance whose removal causes the least drop in the AUC computed on this validation set.

We show that our schemes efficiently incorporate new training datasets while maintaining high-quality exemplar sets of any user-defined size. We present a comprehensive experimental analysis showing excellent classification-accuracy versus memory-usage tradeoffs of our proposed methods.
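
A minimal sketch of the greedy, entropy-guided selection loop (illustrative only: it uses the plain empirical class entropy of the retained set in place of the authors' conditional-entropy criterion, and all names are ours):

```python
import math
from collections import Counter

def class_entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def greedy_exemplar_selection(instances, labels, target_size):
    """Toy EBEL-style loop: repeatedly drop the instance whose removal causes
    the smallest change in the (here: plain empirical) class entropy."""
    keep = list(range(len(instances)))
    while len(keep) > target_size:
        base = class_entropy([labels[i] for i in keep])
        # Choose the index whose removal changes the entropy the least.
        best = min(keep, key=lambda i: abs(base - class_entropy(
            [labels[j] for j in keep if j != i])))
        keep.remove(best)
    return [instances[i] for i in keep], [labels[i] for i in keep]
```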

Ankur Jain, Daniel Nikovski
A Collaborative Filtering Framework Based on Both Local User Similarity and Global User Similarity

Collaborative filtering, as a classical method of information retrieval, is widely used in helping people to deal with information overload. In this paper, we introduce the concepts of local user similarity and global user similarity, based on surprisal-based vector similarity and the application of the concept of maximin distance in graph theory. Surprisal-based vector similarity expresses the relationship between any two users based on the quantities of information (called surprisal) contained in their ratings, building on the intuition that less common ratings for a specific item tend to provide more discriminative information than the most common ratings. Furthermore, traditional methods of computing user similarity cannot work if two users have not rated any identical item. To solve this problem, the global user similarity is introduced to define two users as similar if they can be connected through their locally similar neighbors. A weighted user graph is first constructed by using the local similarity of any two users as the weight of the edge connecting them. Then the global similarity can be calculated as the maximin distance of any two nodes in the graph. Based on both local and global user similarity, we develop a collaborative filtering framework called LS&GS. An empirical study using the MovieLens dataset shows that the proposed framework outperforms other state-of-the-art collaborative filtering algorithms.
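
The global similarity defined above is a maximin (widest-path) quantity, which can be computed with a Dijkstra-style search; the following is an illustrative sketch under an assumed adjacency-map representation, not the authors' code:

```python
import heapq

def global_similarity(local_sim, source):
    """Maximin (widest-path) similarity from `source` to every other user.
    `local_sim` maps user -> {neighbour: local similarity in [0, 1]}."""
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated bottleneck values
    while heap:
        neg_b, u = heapq.heappop(heap)
        b = -neg_b
        if b < best.get(u, 0.0):
            continue
        for v, w in local_sim.get(u, {}).items():
            cand = min(b, w)  # bottleneck (weakest link) along the path
            if cand > best.get(v, 0.0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return best
```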

Heng Luo, Changyong Niu, Ruimin Shen, Carsten Ullrich
A Critical Analysis of Variants of the AUC

The area under the ROC curve, or AUC, has been widely used to assess the ranking performance of binary scoring classifiers. Given a sample of labeled instances, the metric considers the number of correctly ordered pairs of instances with different class label. Thus, its value only depends on the ordering of the scores but not on the “margin” between them. Consequently, it can happen that a small change in scores leads to a considerable change in AUC value. Such an effect is especially apparent when the number of instances used to calculate the AUC is small. On the other hand, two classifiers can have the same AUC value, even though one of them is a “better separator” in the sense that it increases the difference between scores of positive and negative instances, respectively. It has been argued that this insensitivity toward score differences is disadvantageous for model evaluation and selection. For this reason, three variants of the AUC metric that take the score differences into account have recently been proposed, along with first experimental results.

We present a unifying framework in which the conventional AUC and its variants can be modeled as special cases of a generalized AUC metric. Within this framework, we provide a formal analysis showing that the AUC variants produce estimates of the true AUC with a non-constant, model-specific bias, while the variance can decrease as well as increase. All things considered, the net effect on the quality of the estimations is thus not clear and hence there is no solid theoretical foundation for the variants. Our analysis leads us to conjecture that actually none of the variants should be able to perform better in model selection than conventional AUC. This conjecture is corroborated by extensive experiments with synthetic data as well as real benchmark data, showing that the conventional AUC cannot be outperformed systematically by any variant, not in an ideal setting according to the theoretical analysis, and not in real model selection scenarios. Finally, our contribution also sheds light on recent research dealing with the construction of classifiers that (approximately) optimize the AUC directly, rather than accuracy or another performance metric.
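
For reference, the conventional AUC of a scoring classifier $s$ over $n_+$ positive and $n_-$ negative instances is the fraction of correctly ordered pairs (ties counted as one half):

$\mathrm{AUC}(s) = \frac{1}{n_+ n_-} \sum_{i: y_i=+1} \sum_{j: y_j=-1} \Big( \mathbf{1}[s(x_i) > s(x_j)] + \tfrac{1}{2}\,\mathbf{1}[s(x_i) = s(x_j)] \Big),$

which makes explicit why only the ordering of the scores, not their margins, enters the value.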

Stijn Vanderlooy, Eyke Hüllermeier
Improving Maximum Margin Matrix Factorization

Maximum Margin Matrix Factorization (MMMF) has been proposed as a learning approach to the task of collaborative filtering with promising results. In our recent paper [2], we proposed to extend the general MMMF framework to allow for structured (ranking) losses in addition to the squared error loss.

In this paper, we introduce a novel algorithm to compute the ordinal regression ranking loss which is significantly faster than the state of the art. In addition, we propose several extensions to the MMMF model: We introduce offset terms to cater for user and item biases. Users exhibit vastly different rating frequencies ranging from only one rating per user to thousands of them. Similarly, some items get thousands of ratings while others get rated only once. We introduce an adaptive regularizer to allow for more complex models for those items and users with many ratings. Finally, we show equivalence between a recently introduced extension and a graph kernel approach described in [3]. Both aim at providing meaningful predictions for users with very little training data by virtue of the recommender graph.
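
The offset terms can be pictured with a standard biased factorization (our notation, not necessarily the paper's): the predicted rating of user $u$ for item $i$ becomes

$\hat{R}_{ui} = U_u^{\top} M_i + b_u + b_i,$

where $U_u$ and $M_i$ are the latent user and item factors and $b_u$, $b_i$ are the user and item offsets that absorb systematic rating biases.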

We performed an evaluation of these extensions on two standard data sets: EachMovie and MovieLens. These experiments show that the introduced extensions do improve the predictive performance over the original MMMF formulation, even though we did not formally optimize the parameters.

Markus Weimer, Alexandros Karatzoglou, Alex Smola

Data Mining and Knowledge Discovery Journal Abstracts

Finding Reliable Subgraphs from Large Probabilistic Graphs

Consider information search or discovery in a large graph of concepts and their weighted relationships. The user initiates a query by specifying some concepts (“search terms”), and wishes to obtain other concepts and relationships that connect the search concepts. An application example is in the analysis of biological information, conveniently represented as a graph of biological concepts and their relations. A search engine we envision would allow a life scientist to query for connections between a gene and a phenotype, for instance, to find information supporting a hypothesis, or to help discover new hypotheses.

Petteri Hintsanen, Hannu Toivonen
A Space Efficient Solution to the Frequent String Mining Problem for Many Databases

In the frequent string mining problem, one is given $m$ databases ${\cal D}_1,\ldots,{\cal D}_m$ of strings and searches for strings that fulfill certain frequency constraints. The constraints consist of $m$ pairs of thresholds $(\mathit{minf}_1,\mathit{maxf}_1),\ldots,(\mathit{minf}_m,\mathit{maxf}_m)$ and one wants to find all strings $\phi$ that satisfy $\mathit{minf}_i \le \mathit{freq}(\phi, {\cal D}_i) \le \mathit{maxf}_i$ for all $i$ with $1 \le i \le m$, where $\mathit{freq}(\phi,\mathcal{D}_i) = |\{ \psi \in \mathcal{D}_i : \phi \mbox{ is a substring of } \psi \}|$.

Fischer et al. [2] presented an algorithm that solves the frequent string mining problem in linear time under the assumption that the number of databases is treated as a constant. The space consumption of this algorithm, however, is proportional to the total size of all databases. We improve this algorithm in such a way that its space consumption is proportional to the size of the largest database, and it takes linear time regardless of the number of databases. Also, our algorithm is more flexible in the sense that one of several databases can be replaced without having to recalculate everything, that is, intermediate data can be stored on file and be reused.
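
For intuition only, the frequency constraints themselves can be checked by brute force over a supplied set of candidate strings (the algorithms discussed above achieve linear time with suffix-array machinery, which this sketch does not attempt):

```python
def freq(phi, database):
    """Number of strings in `database` that contain `phi` as a substring."""
    return sum(phi in psi for psi in database)

def frequent_strings(databases, thresholds, candidates):
    """Return candidate strings phi with minf_i <= freq(phi, D_i) <= maxf_i
    for every database D_i. `thresholds` is a list of (minf_i, maxf_i) pairs."""
    result = []
    for phi in candidates:
        if all(lo <= freq(phi, db) <= hi
               for db, (lo, hi) in zip(databases, thresholds)):
            result.append(phi)
    return result
```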

Adrian Kügel, Enno Ohlebusch
The Boolean Column and Column-Row Matrix Decompositions

Matrix decompositions are used for many data mining purposes. One of these purposes is to find a concise but interpretable representation of a given data matrix. Different decomposition formulations have been proposed for this task, many of which assume a certain property of the input data (e.g., nonnegativity) and aim at preserving that property in the decomposition.

Pauli Miettinen
SkyGraph: An Algorithm for Important Subgraph Discovery in Relational Graphs

Graph mining is gaining importance due to the numerous applications that rely on graph-based data. Some example applications are: (i) analysis of microarray data in bioinformatics, (ii) pattern discovery in social networks, (iii) analysis of transportation networks, (iv) community discovery in Web data. Existing pattern discovery approaches operate by using simple constraints on the mined patterns. For example, given a database of graphs, a typical graph mining task is to report all subgraphs that appear in at least s graphs, where s is the frequency support threshold. In other cases, we are interested in the discovery of dense or highly-connected subgraphs. In such a case, a threshold is defined for the density or the connectivity of the returned patterns. Other constraints may be defined as well, towards restricting the number of mined patterns. There are three important limitations with this approach: (i) there is an on-off decision regarding the eligibility of patterns, i.e., a pattern either satisfies the constraints or not, (ii) in the case where the constraints are very strict we risk an empty answer or an answer with only a few patterns, and (iii) in the case where the constraints are too weak the number of patterns may be huge.

Apostolos N. Papadopoulos, Apostolos Lyritsis, Yannis Manolopoulos
Mining Conjunctive Sequential Patterns

In this paper we study the discovery of frequent sequences and we aim at extending the non-derivable condensed representation in frequent itemset mining to sequential pattern mining. We start by showing a negative example: in the context of frequent sequences, the notion of non-derivability is meaningless.

This negative result motivated us to look at a slightly different problem: the mining of conjunctions of sequential patterns. This extended class of patterns turns out to have much nicer mathematical properties. For example, for this class of patterns we are able to extend the notion of non-derivable itemsets in a non-trivial way, based on a new, previously unexploited theoretical definition of equivalence classes for sequential patterns. As a side-effect of considering conjunctions of sequences as the pattern type, we can easily form association rules between sequences. We believe that building a theoretical framework and an efficient approach for the sequence association rule extraction problem is the first step toward the generalization of association rules to all complex and ordered patterns.

Chedy Raïssi, Toon Calders, Pascal Poncelet
Adequate Condensed Representations of Patterns

Patterns are at the core of much of the knowledge discovered from data, but their use is limited by their huge number and their mining cost. During the last decade, many works addressed the concept of condensed representation w.r.t. frequency queries. Such representations are several orders of magnitude smaller than the whole collections of patterns, and also enable us to regenerate the frequency information of any pattern. Equivalence classes, based on the Galois closure, are at the core of pattern condensed representations. However, in real-world applications, the interestingness of patterns is evaluated by many other user-defined measures (e.g., confidence, lift, minimum). To the best of our knowledge, these measures have received very little attention. The Galois closure is appropriate for frequency-based measures but unfortunately not for other measures.

Arnaud Soulet, Bruno Crémilleux
Two Heads Better Than One: Pattern Discovery in Time-Evolving Multi-aspect Data

Data stream values are often associated with multiple aspects. For example, each value observed at a given time-stamp from environmental sensors may have an associated type (e.g., temperature, humidity, etc.) as well as location. Time-stamp, type and location are the three aspects, which can be modeled using a tensor (high-order array). However, the time aspect is special, with a natural ordering, and with successive time-ticks having usually correlated values. Standard multiway analysis ignores this structure. To capture it, we propose 2 Heads Tensor Analysis (2-heads), which provides a qualitatively different treatment of time. Unlike most existing approaches that use a PCA-like summarization scheme for all aspects, 2-heads treats the time aspect carefully. 2-heads combines the power of classic multilinear analysis (PARAFAC [1], Tucker [5], DTA/STA [3], WTA [2]) with wavelets, leading to a powerful mining tool. Furthermore, 2-heads has several other advantages as well: (a) it can be computed incrementally in a streaming fashion, (b) it has a provable error guarantee, and (c) it achieves significant compression ratios against competitors. Finally, we show experiments on real datasets, and we illustrate how 2-heads reveals interesting trends in the data.

This is an extended abstract of an article published in the Data Mining and Knowledge Discovery journal [4].

Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos, Tina Eliassi-Rad

Regular Papers

TOPTMH: Topology Predictor for Transmembrane α-Helices

Alpha-helical transmembrane proteins mediate many key biological processes and represent 20%–30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods to predict the location and orientation of transmembrane helix segments using sequence information are essential. We present TOPTMH, a new transmembrane helix topology prediction method that combines support vector machines, hidden Markov models, and a widely-used rule-based scheme. The contribution of this work is the development of a prediction approach that first uses a binary SVM classifier to predict the helix residues and then employs a pair of HMM models that incorporate the SVM predictions and hydropathy-based features to identify the entire transmembrane helix segments by capturing the structural characteristics of these proteins. TOPTMH outperforms state-of-the-art prediction methods and achieves the best performance on an independent static benchmark.

Rezwan Ahmed, Huzefa Rangwala, George Karypis
Learning to Predict One or More Ranks in Ordinal Regression Tasks

We present nondeterministic hypotheses learned from an ordinal regression task. They try to predict the true rank for an entry, but when the classification is uncertain the hypotheses predict a set of consecutive ranks (an interval). The aim is to keep the set of ranks as small as possible, while still containing the true rank. The justification for learning such a hypothesis is based on a real-world problem arising in breeding beef cattle. After defining a family of loss functions inspired by Information Retrieval, we derive an algorithm for minimizing them. The algorithm is based on posterior probabilities of ranks given an entry. A couple of implementations are compared: one based on a multiclass SVM and the other based on Gaussian processes designed to minimize the linear loss in ordinal regression tasks.

Jaime Alonso, Juan José del Coz, Jorge Díez, Oscar Luaces, Antonio Bahamonde
Cascade RSVM in Peer-to-Peer Networks

The goal of distributed learning in P2P networks is to achieve results as close as possible to those from centralized approaches. Learning models of classification in a P2P network faces several challenges like scalability, peer dynamism, asynchronism and data privacy preservation. In this paper, we study the feasibility of building SVM classifiers in a P2P network. We show how cascading SVM can be mapped to a P2P network of data propagation. Our proposed P2P SVM provides a method for constructing classifiers in P2P networks with classification accuracy comparable to centralized classifiers and better than other distributed classifiers. The proposed algorithm also satisfies the characteristics of P2P computing and has an upper bound on the communication overhead. Extensive experimental results confirm the feasibility and attractiveness of this approach.

Hock Hee Ang, Vivekanand Gopalkrishnan, Steven C. H. Hoi, Wee Keong Ng
An Algorithm for Transfer Learning in a Heterogeneous Environment

We consider the problem of learning in an environment of classification tasks. Tasks sampled from the environment are used to improve classification performance on future tasks. We consider situations in which the tasks can be divided into groups. Tasks within each group are related by sharing a low dimensional representation, which differs across the groups. We present an algorithm which divides the sampled tasks into groups and computes a common representation for each group. We report experiments on a synthetic and two image data sets, which show the advantage of the approach over single-task learning and a previous transfer learning method.

Andreas Argyriou, Andreas Maurer, Massimiliano Pontil
Minimum-Size Bases of Association Rules

We focus on confidence-bounded association rules; we model a rather practical situation in which the confidence threshold is fixed by the user, as usually happens in applications. Within this model, we study notions of redundancy among association rules from a fundamental perspective: we discuss several existing alternative definitions and provide new characterizations and relationships between them. We show that these alternatives correspond actually to just two variants, which differ in the special treatment of full-confidence implications. For each of these two notions of redundancy, we show how to construct complete bases of absolutely minimum size.
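
For reference, the support and confidence quantities underlying these redundancy notions are the standard ones: for a rule $X \Rightarrow Y$,

$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)},$

and a confidence-bounded rule is one with $\mathrm{conf}(X \Rightarrow Y) \ge \gamma$ for the user-fixed threshold $\gamma$ (notation ours).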

José L. Balcázar
Combining Classifiers through Triplet-Based Belief Functions

Classifier outputs in the form of continuous values have often been combined using linear sum or stacking, but little is generally known about evidential reasoning methods for combining truncated lists of ordered decisions. In this paper we introduce a novel class-indifferent method for combining such classifier decisions. Specifically we model each output given by classifiers on new instances as a list of ranked decisions that is divided into two subsets of decisions, which are represented by triplet-based belief functions and then are combined using Dempster’s rule of combination. We present a formalism for triplet-based belief functions and establish a range of general formulae for combining these beliefs in order to arrive at a consensus decision. In addition we carry out a comparative analysis with an alternative representation, dichotomous belief functions, on the UCI benchmark data. We also compare our combination method with the popular methods of stacking, boosting, linear sum and majority voting over the same benchmark data to demonstrate the advantage of our approach.
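
Dempster’s rule of combination, used above to fuse the belief functions, combines two basic belief assignments $m_1$ and $m_2$ over a frame of discernment as

$(m_1 \oplus m_2)(A) = \frac{1}{1-K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \qquad K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C),$

for every non-empty set $A$, where $K$ measures the conflict between the two sources.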

Yaxin Bi, Shengli Wu, Xuhui Shen, Pan Xiong
An Improved Multi-task Learning Approach with Applications in Medical Diagnosis

We propose a family of multi-task learning algorithms for collaborative computer aided diagnosis which aims to diagnose multiple clinically-related abnormal structures from medical images. Our formulations eliminate features irrelevant to all tasks, and identify discriminative features for each of the tasks. A probabilistic model is derived to justify the proposed learning formulations. By equivalence proof, some existing regularization-based methods can also be interpreted by our probabilistic model as imposing a Wishart hyperprior. Convergence analysis highlights the conditions under which the formulations achieve convexity and global convergence. Two real-world medical problems: lung cancer prognosis and heart wall motion analysis, are used to validate the proposed algorithms.

Jinbo Bi, Tao Xiong, Shipeng Yu, Murat Dundar, R. Bharat Rao
Semi-supervised Laplacian Regularization of Kernel Canonical Correlation Analysis

Kernel canonical correlation analysis (KCCA) is a fundamental technique for dimensionality reduction for paired data. By finding directions that maximize correlation in the space implied by the kernel, KCCA is able to learn representations that are more closely tied to the underlying semantics of the data rather than high variance directions, which are found by PCA but may be the result of noise. However, meaningful directions are not only those that have high correlation to another modality, but also those that capture the manifold structure of the data. We propose a method that is able to simultaneously find highly correlated directions that are also located on high variance directions along the data manifold. This is achieved by the use of semi-supervised Laplacian regularization in the formulation of KCCA, which has the additional benefit of being able to use additional data for which correspondence between the modalities is not known to more robustly estimate the structure of the data manifold. We show experimentally on datasets of images and text that Laplacian regularized training improves the class separation over KCCA with only Tikhonov regularization, while causing no degradation in the correlation between modalities. We propose a model selection criterion based on the Hilbert-Schmidt norm of the semi-supervised Laplacian regularized cross-covariance operator, which can be computed in closed form.

Matthew B. Blaschko, Christoph H. Lampert, Arthur Gretton
Sequence Labelling SVMs Trained in One Pass

This paper proposes an online solver of the dual formulation of support vector machines for structured output spaces. We apply it to sequence labelling using the exact and greedy inference schemes. In both cases, the per-sequence training time is the same as a perceptron based on the same inference procedure, up to a small multiplicative constant. Comparing the two inference schemes, the greedy version is much faster. It is also amenable to higher order Markov assumptions and performs similarly at test time. In comparison to existing algorithms, both versions match the accuracies of batch solvers that use exact inference after a single pass over the training examples.

Antoine Bordes, Nicolas Usunier, Léon Bottou
Semi-supervised Classification from Discriminative Random Walks

This paper describes a novel technique, called $\mathcal{D}$-walks, to tackle semi-supervised classification problems in large graphs. We introduce here a betweenness measure based on passage times during random walks of bounded lengths. Such walks are further constrained to start and end in nodes within the same class, defining a distinct betweenness for each class. Unlabeled nodes are classified according to the class showing the highest betweenness. Forward and backward recurrences are derived to efficiently compute the passage times. $\mathcal{D}$-walks can deal with directed or undirected graphs with a linear time complexity with respect to the number of edges, the maximum walk length considered and the number of classes. Experiments on various real-life databases show that $\mathcal{D}$-walks outperforms NetKit [5], the approach of Zhou and Schölkopf [15] and the regularized Laplacian kernel [2]. The benefit of $\mathcal{D}$-walks is particularly noticeable when few labeled nodes are available. The computation time of $\mathcal{D}$-walks is also substantially lower in all cases.

Jérôme Callut, Kevin Françoisse, Marco Saerens, Pierre Dupont
Learning Bidirectional Similarity for Collaborative Filtering

Memory-based collaborative filtering aims at predicting the utility of a certain item for a particular user based on the previous ratings from similar users and similar items. Previous studies in finding similar users and items are based on user-defined similarity metrics such as Pearson Correlation Coefficient or Vector Space Similarity which are not adaptive and optimized for different applications and datasets. Moreover, previous studies have treated the similarity function calculation between users and items separately. In this paper, we propose a novel adaptive bidirectional similarity metric for collaborative filtering. We automatically learn similarities between users and items simultaneously through matrix factorization. We show that our model naturally extends the memory based approaches. Theoretical analysis shows our model to be a novel generalization of the SVD model. We evaluate our method using three benchmark datasets, including MovieLens, EachMovie and Netflix, through which we show that our methods outperform many previous baselines.

Bin Cao, Jian-Tao Sun, Jianmin Wu, Qiang Yang, Zheng Chen
Bootstrapping Information Extraction from Semi-structured Web Pages

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2–5 pages for 4–6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.

Andrew Carlson, Charles Schafer
Online Multiagent Learning against Memory Bounded Adversaries

The traditional agenda in Multiagent Learning (MAL) has been to develop learners that guarantee convergence to an equilibrium in self-play or that converge to playing the best response against an opponent using one of a fixed set of known targeted strategies. This paper introduces an algorithm called Learn or Exploit for Adversary Induced Markov Decision Process (LoE-AIM) that targets optimality against any learning opponent that can be treated as a memory bounded adversary. LoE-AIM makes no prior assumptions about the opponent and is tailored to optimally exploit any adversary which induces a Markov decision process in the state space of joint histories. LoE-AIM either explores and gathers new information about the opponent or converges to the best response to the partially learned opponent strategy in repeated play. We further extend LoE-AIM to account for online repeated interactions against the same adversary with plays against other adversaries interleaved in between. LoE-AIM-repeated stores learned knowledge about an adversary, identifies the adversary in case of repeated interaction, and reuses the stored knowledge about the behavior of the adversary to enhance learning in the current epoch of play. LoE-AIM and LoE-AIM-repeated are fully implemented, with results demonstrating their superiority over other existing MAL algorithms.

Doran Chakraborty, Peter Stone
Scalable Feature Selection for Multi-class Problems

Scalable feature selection algorithms should remove irrelevant and redundant features and scale well to very large datasets. We observe that the current best state-of-the-art methods perform well on binary classification tasks but often underperform on multi-class tasks. We suggest that they suffer from a so-called accumulative effect which becomes more visible with the growing number of classes and results in removing relevant and non-redundant features. To remedy the problem, we propose two new feature filtering methods which are both scalable and well adapted to the multi-class case. We report evaluation results on 17 different datasets which include both binary and multi-class cases.

Boris Chidlovskii, Loïc Lecerf
Learning Decision Trees for Unbalanced Data

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria – information gain, Gini measure, and DKM – and identifies a new skew-insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance in class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis between each decision tree construction method. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate over a wide range of datasets and determine which methods operate best under class imbalance.
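
For reference, the Hellinger distance between two discrete distributions $P=(p_1,\dots,p_k)$ and $Q=(q_1,\dots,q_k)$ is

$d_H(P,Q) = \sqrt{\sum_{i=1}^{k} \big(\sqrt{p_i} - \sqrt{q_i}\big)^2},$

and as a splitting criterion it is applied to the class-conditional distributions of the instances over a candidate split's branches; because the class priors do not enter this expression, the criterion is largely insensitive to class skew (our summary of the idea, not the paper's exact formulation).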

David A. Cieslak, Nitesh V. Chawla
Credal Model Averaging: An Extension of Bayesian Model Averaging to Imprecise Probabilities

We deal with the arbitrariness in the choice of the prior over the models in Bayesian model averaging (BMA), by modelling prior knowledge by a set of priors (i.e., a prior credal set). We consider Dash and Cooper’s BMA applied to naive Bayesian networks, replacing the single prior over the naive models by a credal set; this models a condition close to prior ignorance about the models, which leads to credal model averaging (CMA). CMA returns an indeterminate classification, i.e., multiple classes, on the instances for which the learning set is not informative enough to smooth the effect of the choice of the prior. We give an algorithm to compute exact credal model averaging for naive networks. Extensive experiments show that indeterminate classifications preserve the reliability of CMA on the instances which are classified in a prior-dependent way by BMA.

Giorgio Corani, Marco Zaffalon
A Fast Method for Training Linear SVM in the Primal

We propose a new algorithm for training a linear Support Vector Machine in the primal. The algorithm mixes ideas from non-smooth optimization, subgradient methods, and cutting plane methods. This yields a fast algorithm that compares well to state-of-the-art algorithms. It is proved to require $O(1/\lambda\epsilon)$ iterations to converge to a solution with accuracy $\epsilon$. Additionally we provide an exact shrinking method in the primal that allows reducing the complexity of an iteration to much less than $O(N)$, where $N$ is the number of training samples.
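
As a point of comparison only (not the authors' method), a minimal stochastic sub-gradient baseline for the same primal objective $\frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_i \max(0, 1 - y_i\, w^{\top} x_i)$ can be sketched as follows; the names and the $1/(\lambda t)$ step size are standard Pegasos-style choices, not taken from the paper:

```python
import numpy as np

def primal_subgradient_svm(X, y, lam=0.01, n_iters=1000, seed=0):
    """Stochastic sub-gradient descent on the primal hinge-loss objective.
    X: (N, d) array, y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(N)
        eta = 1.0 / (lam * t)          # standard 1/(lambda t) step size
        margin = y[i] * X[i].dot(w)
        # Sub-gradient of the regularized hinge loss at the sampled example.
        grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
        w -= eta * grad
    return w
```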

Trinh-Minh-Tri Do, Thierry Artières
On the Equivalence of the SMO and MDM Algorithms for SVM Training

SVM training is usually discussed under two different algorithmic points of view. The first one is provided by decomposition methods such as SMO and SVMLight while the second one encompasses geometric methods that try to solve a Nearest Point Problem (NPP), the Gilbert–Schlesinger–Kozinec (GSK) and Mitchell–Demyanov–Malozemov (MDM) algorithms being the most representative ones. In this work we will show that, indeed, both approaches are essentially coincident. More precisely, we will show that a slight modification of SMO in which at each iteration both updating multipliers correspond to patterns in the same class solves NPP and, moreover, that this modification coincides with an extended MDM algorithm. Besides this, we also propose a new way to apply the MDM algorithm for NPP problems over reduced convex hulls.

Jorge López, Álvaro Barbero, José R. Dorronsoro
Nearest Neighbour Classification with Monotonicity Constraints

In many application areas of machine learning, prior knowledge concerning the monotonicity of relations between the response variable and predictor variables is readily available. Monotonicity may also be an important model requirement with a view toward explaining and justifying decisions, such as acceptance/rejection decisions. We propose a modified nearest neighbour algorithm for the construction of monotone classifiers from data. We start by making the training data monotone with as few label changes as possible. The relabeled data set can be viewed as a monotone classifier that has the lowest possible error-rate on the training data. The relabeled data is subsequently used as the training sample by a modified nearest neighbour algorithm. This modified nearest neighbour rule produces predictions that are guaranteed to satisfy the monotonicity constraints. Hence, it is much more likely to be accepted by the intended users. Our experiments show that monotone kNN often outperforms standard kNN in problems where the monotonicity constraints are applicable.

Wouter Duivesteijn, Ad Feelders
Modeling Transfer Relationships Between Learning Tasks for Improved Inductive Transfer

In this paper, we propose a novel graph-based method for knowledge transfer. We model the transfer relationships between source tasks by embedding the set of learned source models in a graph using transferability as the metric. Transfer to a new problem proceeds by mapping the problem into the graph, then learning a function on this graph that automatically determines the parameters to transfer to the new learning task. This method is analogous to inductive transfer along a manifold that captures the transfer relationships between the tasks. We demonstrate improved transfer performance using this method against existing approaches in several real-world domains.

Eric Eaton, Marie desJardins, Terran Lane
Mining Edge-Weighted Call Graphs to Localise Software Bugs

An important problem in software engineering is the automated discovery of non-crashing occasional bugs. In this work we address this problem and show that mining weighted call graphs of program executions is a promising technique. We mine weighted graphs with a combination of structural and numerical techniques. More specifically, we propose a novel reduction technique for call graphs which introduces edge weights. Then we present an analysis technique for such weighted call graphs based on graph mining and on traditional feature selection schemes. The technique generalises previous graph mining approaches as it allows for an analysis of weights. Our evaluation shows that our approach finds bugs which previous approaches could not detect. Our technique also doubles the precision of finding bugs which existing techniques can already localise in principle.

Frank Eichinger, Klemens Böhm, Matthias Huber
Hierarchical Distance-Based Conceptual Clustering

In this work we analyse the relation between hierarchical distance-based clustering and the concepts that can be obtained from the hierarchy by generalisation. Many inconsistencies may arise, because the distance and the conceptual generalisation operator are usually incompatible. To overcome this, we propose an algorithm which integrates distance-based and conceptual clustering. The new dendrograms can show when an element has been integrated to the cluster because it is near in the metric space or because it is covered by the concept. In this way, the new clustering can differ from the original one but the metric traceability is clear. We introduce three different levels of agreement between the clustering hierarchy obtained from the linkage distance and the new hierarchy, and we define properties these generalisation operators should satisfy in order to produce distance-consistent dendrograms.

Ana Maria Funes, Cesar Ferri, Jose Hernández-Orallo, Maria Jose Ramírez-Quintana
Mining Frequent Connected Subgraphs Reducing the Number of Candidates

In this paper, a new algorithm for mining frequent connected subgraphs called gRed (graph Candidate Reduction Miner) is presented. This algorithm is based on the gSpan algorithm proposed by Yan and Han. In this method, the mining process is optimized by introducing new heuristics to reduce the number of candidates. The performance of gRed is compared against two of the most popular and efficient algorithms available in the literature (gSpan and Gaston). The experimentation on real-world databases shows the performance of our proposal, outperforming gSpan and achieving better performance than Gaston for low minimal support when databases are large.

Andrés Gago Alonso, José Eladio Medina Pagola, Jesús Ariel Carrasco-Ochoa, José Fco. Martínez-Trinidad
Unsupervised Riemannian Clustering of Probability Density Functions

We present an algorithm for grouping families of probability density functions (pdfs). We exploit the fact that under the square-root re-parametrization, the space of pdfs forms a Riemannian manifold, namely the unit Hilbert sphere. An immediate consequence of this re-parametrization is that different families of pdfs form different submanifolds of the unit Hilbert sphere. Therefore, the problem of clustering pdfs reduces to the problem of clustering multiple submanifolds on the unit Hilbert sphere. We solve this problem by first learning a low-dimensional representation of the pdfs using generalizations of local nonlinear dimensionality reduction algorithms from Euclidean to Riemannian spaces. Then, by assuming that the pdfs from different groups are separated, we show that the null space of a matrix built from the local representation gives the segmentation of the pdfs. We also apply our approach to the texture segmentation problem in computer vision.

Alvina Goh, René Vidal
Online Manifold Regularization: A New Learning Setting and Empirical Study

We consider a novel “online semi-supervised learning” setting where (mostly unlabeled) data arrives sequentially in large volume, and it is impractical to store it all before learning. We propose an online manifold regularization algorithm. It differs from standard online learning in that it learns even when the input point is unlabeled. Our algorithm is based on convex programming in kernel space with stochastic gradient descent, and inherits the theoretical guarantees of standard online algorithms. However, naïve implementation of our algorithm does not scale well. This paper focuses on efficient, practical approximations; we discuss two sparse approximations using buffering and online random projection trees. Experiments show our algorithm achieves risk and generalization accuracy comparable to standard batch manifold regularization, while each step runs quickly. Our online semi-supervised learning setting is an interesting direction for further theoretical development, paving the way for semi-supervised learning to work on real-world life-long learning tasks.

Andrew B. Goldberg, Ming Li, Xiaojin Zhu
A Fast Algorithm to Find Overlapping Communities in Networks

Many networks possess a community structure, such that vertices form densely connected groups which are more sparsely linked to other groups. In some cases these groups overlap, with some vertices shared between two or more communities. Discovering communities in networks is a computationally challenging task, especially if they overlap. In previous work we proposed an algorithm, CONGA, that could detect overlapping communities using the new concept of split betweenness. Here we present an improved algorithm based on a local form of betweenness, which yields good results but is much faster. It is especially effective in discovering small-diameter communities in large networks, and has a time complexity of only $O(n \log n)$ for sparse networks.

Steve Gregory
A Case Study in Sequential Pattern Mining for IT-Operational Risk

IT-operational risk management consists of identifying, assessing, monitoring and mitigating the adverse risks of loss resulting from hardware and software system failures. We present a case study in IT-operational risk measurement in the context of a network of Private Branch eXchanges (PBXs). The approach relies on preprocessing and data mining tasks for the extraction of sequential patterns and their exploitation in the definition of a measure called expected risk.

Valerio Grossi, Andrea Romei, Salvatore Ruggieri
Tight Optimistic Estimates for Fast Subgroup Discovery

Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality. Due to the non-monotonicity of the corresponding evaluation functions, standard pruning techniques cannot be used for subgroup discovery, requiring the use of optimistic estimate techniques instead. So far, however, optimistic estimate pruning has only been considered for the extremely simple case of a binary target attribute, and up to now no attempt was made to move beyond suboptimal heuristic optimistic estimates. In this paper, we show that optimistic estimate pruning can be developed into a sound and highly effective pruning approach for subgroup discovery. Based on a precise definition of optimality we show that previous estimates have been tight only in special cases. Thereafter, we present tight optimistic estimates for the most popular binary and multi-class quality functions, and present a family of increasingly efficient approximations to these optimal functions. As we show in empirical experiments, the use of our newly proposed optimistic estimates can lead to a speed up of an order of magnitude compared to previous approaches.

Henrik Grosskreutz, Stefan Rüping, Stefan Wrobel
Watch, Listen & Learn: Co-training on Captioned Images and Videos

Recognizing visual scenes and activities is challenging: often visual cues alone are ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; and to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data our approach learns more accurate image and video classifiers than standard baseline algorithms.

Sonal Gupta, Joohyun Kim, Kristen Grauman, Raymond Mooney
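
As a rough illustration of the co-training idea used above, the following sketch trains one classifier per feature view (e.g. visual vs. textual) and lets each view pseudo-label the unlabelled examples it is most confident about. It assumes scikit-learn, two pre-computed feature matrices, and at least two classes among the initially labelled examples; it is a generic loop, not the paper's multimodal setup.

    # Minimal co-training sketch on two feature views Xa and Xb.
    # y holds the known labels; entries for unlabelled examples are dummies
    # that get overwritten with pseudo-labels as training proceeds.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(Xa, Xb, y, labelled, unlabelled, rounds=10, per_round=5):
        labelled, unlabelled = list(labelled), list(unlabelled)
        clf_a = LogisticRegression(max_iter=1000)
        clf_b = LogisticRegression(max_iter=1000)
        for _ in range(rounds):
            if not unlabelled:
                break
            clf_a.fit(Xa[labelled], y[labelled])
            clf_b.fit(Xb[labelled], y[labelled])
            # Each view labels the unlabelled examples it is most confident about.
            for clf, X in ((clf_a, Xa), (clf_b, Xb)):
                if not unlabelled:
                    break
                proba = clf.predict_proba(X[unlabelled])
                picks = np.argsort(-proba.max(axis=1))[:per_round]
                for i in sorted(picks, reverse=True):
                    idx = unlabelled[i]
                    y[idx] = clf.classes_[np.argmax(proba[i])]   # pseudo-label
                    labelled.append(idx)
                    del unlabelled[i]
        return clf_a, clf_b
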
Parameter Learning in Probabilistic Databases: A Least Squares Approach

We introduce the problem of learning the parameters of the probabilistic database ProbLog. Given the observed success probabilities of a set of queries, we compute the probabilities to attach to the facts such that the approximation error is low on the training examples as well as on unseen examples. Assuming Gaussian error terms on the observed success probabilities, this naturally leads to a least squares optimization problem. Our approach, called LeProbLog, is able to learn both from queries and from proofs, and even from both simultaneously. This makes it flexible and allows faster training in domains where proofs are available. Experiments on real-world data show the usefulness and effectiveness of this least squares calibration of probabilistic databases.

Bernd Gutmann, Angelika Kimmig, Kristian Kersting, Luc De Raedt
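
The least-squares calibration the abstract describes can be illustrated on a toy model in which each query's success probability is a simple closed-form function of the fact probabilities. The tiny "program" below is hypothetical, and the optimisation uses a generic SciPy routine rather than ProbLog's own machinery.

    # Toy least-squares calibration of fact probabilities: adjust the fact
    # probabilities so that predicted query success probabilities match the
    # observed ones in a least-squares sense.
    import numpy as np
    from scipy.optimize import minimize

    # Three probabilistic facts a, b, c and two queries whose success
    # probabilities, assuming independent facts, are:
    #   q1 = 1 - (1 - pa) * (1 - pb * pc)     (a OR (b AND c))
    #   q2 = pa * pc                          (a AND c)
    def predict(p):
        pa, pb, pc = p
        return np.array([1 - (1 - pa) * (1 - pb * pc), pa * pc])

    observed = np.array([0.7, 0.3])        # observed query success probabilities

    def squared_error(p):
        return np.sum((predict(p) - observed) ** 2)

    res = minimize(squared_error, x0=[0.5, 0.5, 0.5],
                   bounds=[(0.0, 1.0)] * 3, method="L-BFGS-B")
    print(res.x, predict(res.x))
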
Improving k-Nearest Neighbour Classification with Distance Functions Based on Receiver Operating Characteristics

The k-nearest neighbour (k-NN) technique, due to its interpretable nature, is a simple and very intuitively appealing method to address classification problems. However, choosing an appropriate distance function for k-NN can be challenging and an inferior choice can make the classifier highly vulnerable to noise in the data. In this paper, we propose a new method for determining a good distance function for k-NN. Our method is based on consideration of the area under the Receiver Operating Characteristics (ROC) curve, which is a well known method to measure the quality of binary classifiers. It computes weights for the distance function, based on ROC properties within an appropriate neighbourhood for the instances whose distance is being computed. We experimentally compare the effect of our scheme with a number of other well-known k-NN distance metrics, as well as with a range of different classifiers. Experiments show that our method can substantially boost the classification performance of the k-NN algorithm. Furthermore, in a number of cases our technique is even able to deliver better accuracy than state-of-the-art non-k-NN classifiers, such as support vector machines.

Md. Rafiul Hassan, M. Maruf Hossain, James Bailey, Kotagiri Ramamohanarao
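
A much-simplified illustration of ROC-derived distance weights: weight each feature globally by |2·AUC − 1| and use a weighted Euclidean metric in k-NN. The paper computes ROC-based weights within a neighbourhood of the instances being compared, so this global variant is only a rough approximation; it assumes scikit-learn and a binary class label.

    # Weight each feature by how well it separates the two classes on its own,
    # measured by the area under the ROC curve, and use the weights in k-NN.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.neighbors import KNeighborsClassifier

    def auc_feature_weights(X, y):
        # |2*AUC - 1| is 0 for an uninformative feature, 1 for a perfect ranker.
        return np.array([abs(2 * roc_auc_score(y, X[:, j]) - 1)
                         for j in range(X.shape[1])])

    def auc_weighted_knn(X_train, y_train, k=5):
        w = auc_feature_weights(X_train, y_train)
        # Scaling features by sqrt(w) turns Euclidean distance into the weighted metric.
        scale = np.sqrt(w)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train * scale, y_train)
        return knn, scale

    # Usage: knn, scale = auc_weighted_knn(X_train, y_train)
    #        y_pred = knn.predict(X_test * scale)
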
One-Class Classification by Combining Density and Class Probability Estimation

One-class classification has important applications such as outlier and novelty detection. It is commonly tackled using density estimation techniques or by adapting a standard classification algorithm to the problem of carving out a decision boundary that describes the location of the target data. In this paper we investigate a simple method for one-class classification that combines the application of a density estimator, used to form a reference distribution, with the induction of a standard model for class probability estimation. In this method, the reference distribution is used to generate artificial data that is employed to form a second, artificial class. In conjunction with the target class, this artificial class is the basis for a standard two-class learning problem. We explain how the density function of the reference distribution can be combined with the class probability estimates obtained in this way to form an adjusted estimate of the density function of the target class. Using UCI datasets, and data from a typist recognition problem, we show that the combined model, consisting of both a density estimator and a class probability estimator, can improve on using either component technique alone when used for one-class classification. We also compare the method to one-class classification using support vector machines.

Kathryn Hempstalk, Eibe Frank, Ian H. Witten
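
The scheme the abstract describes can be sketched as follows: fit a reference density to the target data, sample an artificial second class from it, train a two-class probability estimator, and rescale the estimates by Bayes' rule to recover a density for the target class. The component choices below (a single Gaussian reference and a random forest) are illustrative, not the paper's exact configuration.

    # One-class classification via an artificial reference class.
    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.ensemble import RandomForestClassifier

    def one_class_model(X_target, n_artificial=None):
        n, d = X_target.shape
        n_artificial = n_artificial or n
        # 1. Reference distribution: a single Gaussian fitted to the target data.
        ref = multivariate_normal(mean=X_target.mean(axis=0),
                                  cov=np.cov(X_target.T) + 1e-6 * np.eye(d))
        # 2. Artificial class sampled from the reference distribution.
        X_art = ref.rvs(size=n_artificial, random_state=0).reshape(n_artificial, d)
        X = np.vstack([X_target, X_art])
        y = np.r_[np.ones(n), np.zeros(n_artificial)]     # 1 = target, 0 = artificial
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
        prior_t = n / (n + n_artificial)

        def target_density(X_new):
            # Bayes' rule: P(x|T) = P(T|x)/(1-P(T|x)) * P(A)/P(T) * P(x|A)
            p_t = np.clip(clf.predict_proba(X_new)[:, 1], 1e-6, 1 - 1e-6)
            return (p_t / (1 - p_t)) * ((1 - prior_t) / prior_t) * ref.pdf(X_new)

        return target_density

    # Thresholding the returned density (e.g. at a training-set quantile) gives
    # a one-class decision rule.
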
Efficient Frequent Connected Subgraph Mining in Graphs of Bounded Treewidth

The frequent connected subgraph mining problem, i.e., the problem of listing all connected graphs that are subgraph isomorphic to at least a certain number of transaction graphs of a database, cannot be solved in output polynomial time in the general case. If, however, the transaction graphs are restricted to forests then the problem becomes tractable. In this paper we generalize the positive result on forests to graphs of bounded treewidth. In particular, we show that for this class of transaction graphs, frequent connected subgraphs can be listed in incremental polynomial time. Since subgraph isomorphism remains NP-complete for bounded treewidth graphs, the positive complexity result of this paper shows that efficient frequent pattern mining is possible even for computationally hard pattern matching operators.

Tamás Horváth, Jan Ramon
Proper Model Selection with Significance Test

Model selection is an important and ubiquitous task in machine learning. To select models with the best future classification performance measured by a goal metric, an evaluation metric is often used to select the best classification model among the competing ones. A common practice is to use the same goal and evaluation metric. However, in several recent studies, it is claimed that using an evaluation metric (such as AUC) other than the goal metric (such as accuracy) results in better selection of the correct models. In this paper, we point out a flaw in the experimental design of those studies, and propose an improved method to test the claim. Our extensive experiments show convincingly that only the goal metric itself can most reliably select the correct classification models.

Jin Huang, Charles X. Ling, Harry Zhang, Stan Matwin
A Projection-Based Framework for Classifier Performance Evaluation

In this paper, we propose approaching the problem of classifier evaluation in terms of a projection from a high-dimensional space to a visualizable two-dimensional one. Rather than collapsing confusion matrices into a single measure the way traditional evaluation methods do, we consider the vector composed of the entries of the confusion matrix (or the confusion matrices in case several domains are considered simultaneously) as the performance evaluation vector, and project it into a two-dimensional space using a recently proposed distance-preserving projection method. This approach is shown to be particularly useful in the case of comparison of several classifiers on many domains, as well as in the case of multiclass classification. Furthermore, by providing simultaneous multiple views of the same evaluation data, it allows for a quick and accurate assessment of classifier performance.

Nathalie Japkowicz, Pritika Sanghi, Peter Tischer
Distortion-Free Nonlinear Dimensionality Reduction

Nonlinear dimensionality reduction is an important issue in many machine learning areas where essentially low-dimensional data is nonlinearly embedded in some high-dimensional space. In this paper, we show that the existing Laplacian Eigenmaps method suffers from the distortion problem, and propose a new distortion-free dimensionality reduction method by adopting a local linear model to encode the local information. We introduce a new loss function that can be seen as a different way to construct the Laplacian matrix, and a new way to impose scaling constraints under the local linear model. Better low-dimensional embeddings are obtained via the constrained concave-convex procedure. Empirical studies and real-world applications have shown the effectiveness of our method.

Yangqing Jia, Zheng Wang, Changshui Zhang
Learning with Lq<1 vs L1-Norm Regularisation with Exponentially Many Irrelevant Features

We study the use of fractional norms for regularisation in supervised learning from high-dimensional data, in conditions of a large number of irrelevant features, focusing on logistic regression. We develop a variational method for parameter estimation, and show an equivalence between two approximations recently proposed in the statistics literature. Building on previous work by A. Ng, we show that fractional-norm regularised logistic regression enjoys a sample complexity that grows logarithmically with the data dimensions and polynomially with the number of relevant dimensions. In addition, extensive empirical testing indicates that fractional-norm regularisation is more suitable than L1 in cases when the number of relevant features is very small, and works very well despite a large number of irrelevant features.

Ata Kabán, Robert J. Durrant
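
The objective being studied can be written down directly: the logistic log-likelihood plus a λ·Σ|w_j|^q penalty with q < 1. The sketch below smooths the penalty so a standard optimiser can be applied; the paper itself develops a variational estimation method, and the penalty is non-convex, so this only finds a local optimum.

    # Logistic regression with a (smoothed) Lq penalty, q < 1.
    import numpy as np
    from scipy.optimize import minimize

    def fit_lq_logreg(X, y, q=0.5, lam=0.1, eps=1e-8):
        # y in {0, 1}; no intercept term, for brevity.
        def objective(w):
            z = X @ w
            loglik = np.sum(y * z - np.logaddexp(0.0, z))        # Bernoulli log-likelihood
            penalty = lam * np.sum((w ** 2 + eps) ** (q / 2.0))  # smoothed sum_j |w_j|^q
            return -loglik + penalty
        res = minimize(objective, x0=np.zeros(X.shape[1]), method="L-BFGS-B")
        return res.x

    # Usage sketch: w = fit_lq_logreg(X_train, y_train, q=0.5, lam=1.0)
    #               p = 1.0 / (1.0 + np.exp(-(X_test @ w)))
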
Catenary Support Vector Machines

Many problems require making sequential decisions. For these problems, the benefit of acquiring further information must be weighed against the costs. In this paper, we describe the catenary support vector machine (catSVM), a margin-based method to solve sequential stopping problems. We provide theoretical guarantees for catSVM on future testing examples. We evaluated the performance of catSVM on UCI benchmark data and also applied it to the task of face detection. The experimental results show that catSVM can achieve a better cost tradeoff than single-stage SVM and chained boosting.

Kin Fai Kan, Christian R. Shelton
Exact and Approximate Inference for Annotating Graphs with Structural SVMs

Training processes of structured prediction models such as structural SVMs involve frequent computations of the maximum-a-posteriori (MAP) prediction given a parameterized model. For specific output structures such as sequences or trees, MAP estimates can be computed efficiently by dynamic programming algorithms such as the Viterbi algorithm and the CKY parser. However, when the output structures can be arbitrary graphs, exact calculation of the MAP estimate is an NP-complete problem. In this paper, we compare exact inference and approximate inference for labeling graphs. We study the exact junction tree and the approximate loopy belief propagation and sampling algorithms in terms of performance and resource requirements.

Thoralf Klein, Ulf Brefeld, Tobias Scheffer
Extracting Semantic Networks from Text Via Relational Clustering

Extracting knowledge from text has long been a goal of AI. Initial approaches were purely logical and brittle. More recently, the availability of large quantities of text on the Web has led to the development of machine learning approaches. However, to date these have mainly extracted ground facts, as opposed to general knowledge. Other learning approaches can extract logical forms, but require supervision and do not scale. In this paper we present an unsupervised approach to extracting semantic networks from large volumes of text. We use the TextRunner system [1] to extract tuples from text, and then induce general concepts and relations from them by jointly clustering the objects and relational strings in the tuples. Our approach is defined in Markov logic using four simple rules. Experiments on a dataset of two million tuples show that it outperforms three other relational clustering approaches, and extracts meaningful semantic networks.

Stanley Kok, Pedro Domingos
Ranking the Uniformity of Interval Pairs

We study the problem of finding the most uniform partition of the class label distribution on an interval. This problem occurs, e.g., in supervised discretization of continuous features, where evaluation heuristics need to find the best location at which to split the current feature. The weighted average of the empirical entropies of the interval label distributions is often used in this task. We observe that this rule is suboptimal, because it prefers short intervals too much. Therefore, we proceed to study alternative approaches. A solution that is based on compression turns out to be the best in our empirical experiments. We also study how these alternative methods affect the performance of classification algorithms.

Jussi Kujala, Tapio Elomaa
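
The baseline heuristic the abstract refers to is the weighted average of the empirical entropies of the two intervals produced by a candidate cut point. A minimal sketch, with a toy example showing the bias towards short (pure) intervals that the paper argues against:

    # Score a candidate cut point by the weighted average of the empirical
    # entropies of the two resulting intervals' label distributions (lower is better).
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

    def weighted_split_entropy(left, right):
        n = len(left) + len(right)
        return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

    # A very short pure interval gets a low score even though the long remainder
    # is mixed, illustrating the preference for short intervals:
    print(weighted_split_entropy(['+'], ['+', '-', '-', '+', '-']))   # ~0.81
    print(weighted_split_entropy(['+', '+', '-'], ['+', '-', '-']))   # ~0.92
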
Multiagent Reinforcement Learning for Urban Traffic Control Using Coordination Graphs

Since traffic jams are ubiquitous in the modern world, optimizing the behavior of traffic lights for efficient traffic flow is a critically important goal. Though most current traffic lights use simple heuristic protocols, more efficient controllers can be discovered automatically via multiagent reinforcement learning, where each agent controls a single traffic light. However, in previous work on this approach, agents select only locally optimal actions without coordinating their behavior. This paper extends this approach to include explicit coordination between neighboring traffic lights. Coordination is achieved using the max-plus algorithm, which estimates the optimal joint action by sending locally optimized messages among connected agents. This paper presents the first application of max-plus to a large-scale problem and thus verifies its efficacy in realistic settings. It also provides empirical evidence that max-plus performs well on cyclic graphs, though it has been proven to converge only for tree-structured graphs. Furthermore, it provides a new understanding of the properties a traffic network must have for such coordination to be beneficial and shows that max-plus outperforms previous methods on networks that possess those properties.

Lior Kuyer, Shimon Whiteson, Bram Bakker, Nikos Vlassis
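
The max-plus algorithm mentioned above can be sketched generically: on a coordination graph, each agent repeatedly sends every neighbour a message giving, for each of the neighbour's actions, the best locally achievable payoff; after a number of iterations each agent picks its action greedily from its incoming messages. The sketch below works on small graphs with pairwise payoff matrices and is not the traffic-light controller itself.

    # Generic max-plus message passing on a coordination graph.
    import numpy as np

    def max_plus(n_agents, n_actions, edges, pairwise, iterations=30):
        # edges: undirected (i, j) pairs; pairwise[(i, j)]: payoff matrix f_ij[a_i, a_j].
        neighbours = {i: [] for i in range(n_agents)}
        for a, b in edges:
            neighbours[a].append(b)
            neighbours[b].append(a)
        msg = {(i, j): np.zeros(n_actions)
               for a, b in edges for (i, j) in ((a, b), (b, a))}

        def payoff(i, j, ai, aj):
            return pairwise[(i, j)][ai, aj] if (i, j) in pairwise else pairwise[(j, i)][aj, ai]

        for _ in range(iterations):
            for i in range(n_agents):
                for j in neighbours[i]:
                    incoming = sum((msg[(k, i)] for k in neighbours[i] if k != j),
                                   np.zeros(n_actions))
                    new = np.array([max(payoff(i, j, ai, aj) + incoming[ai]
                                        for ai in range(n_actions))
                                    for aj in range(n_actions)])
                    msg[(i, j)] = new - new.mean()   # normalise to keep messages bounded
        # Each agent chooses the action maximising the sum of its incoming messages.
        return [int(np.argmax(sum((msg[(k, i)] for k in neighbours[i]),
                                  np.zeros(n_actions))))
                for i in range(n_agents)]

    # Usage on a 3-agent chain where the payoff matrix favours joint action (1, 1):
    prefer = np.array([[1.0, 0.0], [0.0, 2.0]])
    print(max_plus(3, 2, [(0, 1), (1, 2)], {(0, 1): prefer, (1, 2): prefer}))
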
StreamKrimp: Detecting Change in Data Streams

Data streams are ubiquitous. Examples range from sensor networks to financial transactions and website logs. In fact, even market basket data can be seen as a stream of sales. Detecting changes in the distribution from which a stream is sampled is one of the most challenging problems in stream mining, as only limited storage can be used. In this paper we analyse this problem for streams of transaction data from an MDL perspective. Based on this analysis we introduce the StreamKrimp algorithm, which uses the Krimp algorithm to characterise probability distributions with code tables. With these code tables, StreamKrimp partitions the stream into a sequence of substreams. Each switch of code table indicates a change in the underlying distribution. Experiments on both real and artificial streams show that StreamKrimp detects the changes while using only a very limited amount of data storage.

Matthijs van Leeuwen, Arno Siebes
Backmatter
Metadata
Title
Machine Learning and Knowledge Discovery in Databases
Edited by
Walter Daelemans
Bart Goethals
Katharina Morik
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-87479-9
Print ISBN
978-3-540-87478-2
DOI
https://doi.org/10.1007/978-3-540-87479-9