main-content

## Über dieses Buch

The European Conference on Machine Learning (ECML) and the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) were jointly organized this year for the ?fth time in a row, after some years of mutual independence before. After Freiburg (2001), Helsinki (2002), Cavtat (2003) and Pisa (2004), Porto received the 16th edition of ECML and the 9th PKDD in October 3–7. Having the two conferences together seems to be working well: 585 di?erent paper submissions were received for both events, which maintains the high s- mission standard of last year. Of these, 335 were submitted to ECML only, 220 to PKDD only and 30 to both. Such a high volume of scienti?c work required a tremendous e?ort from Area Chairs, Program Committee members and some additional reviewers. On average, PC members had 10 papers to evaluate, and Area Chairs had 25 papers to decide upon. We managed to have 3 highly qua- ?edindependentreviewsperpaper(withveryfewexceptions)andoneadditional overall input from one of the Area Chairs. After the authors’ responses and the online discussions for many of the papers, we arrived at the ?nal selection of 40 regular papers for ECML and 35 for PKDD. Besides these, 32 others were accepted as short papers for ECML and 35 for PKDD. This represents a joint acceptance rate of around 13% for regular papers and 25% overall. We thank all involved for all the e?ort with reviewing and selection of papers. Besidesthecoretechnicalprogram,ECMLandPKDDhad6invitedspeakers, 10 workshops, 8 tutorials and a Knowledge Discovery Challenge.

## Inhaltsverzeichnis

### Data Analysis in the Life Sciences — Sparking Ideas —

Data from various areas of Life Sciences have increasingly caught the attention of data mining and machine learning researchers. Not only is the amount of data available mind-boggling but the diverse and heterogenous nature of the information is far beyond any other data analysis problem so far. In sharp contrast to classical data analysis scenarios, the life science area poses challenges of a rather different nature for mainly two reasons. Firstly, the available data stems from heterogenous information sources of varying degrees of reliability and quality and is, without the interactive, constant interpretation of a domain expert, not useful. Furthermore, predictive models are of only marginal interest to those users – instead they hope for new insights into a complex, biological system that is only partially represented within that data anyway. In this scenario, the data serves mainly to create new insights and generate new ideas that can be tested. Secondly, the notion of feature space and the accompanying measures of similarity cannot be taken for granted. Similarity measures become context dependent and it is often the case that within one analysis task several different ways of describing the objects of interest or measuring similarity between them matter.

Some more recently published work in the data analysis area has started to address some of these issues. For example, data analysis in parallel universes [1], that is, the detection of patterns of interest in various di.erent descriptor spaces at the same time, and mining of frequent, discriminative fragments in large, molecular data bases [2]. In both cases, sheer numerical performance is not the focus; it is rather the discovery of interpretable pieces of evidence that lights up new ideas in the users mind. Future work in data analysis in the life sciences needs to keep this in mind: the goal is to trigger new ideas and stimulate interesting associations.

Michael R. Berthold

### Machine Learning for Natural Language Processing (and Vice Versa?)

Over the past 10-15 years, the influence of methods from machine learning has transformed the way that research is done in the field of natural language processing. This talk will begin by covering the history of this transformation. In particular, learning methods have proved successful in producing stand-alone text-processing components to handle a number of linguistic tasks. Moreover, these components can be combined to produce systems that exhibit shallow text-understanding capabilities: they can, for example, extract key facts from unrestricted documents in limited domains or find answers to general-purpose questions from open-domain document collections. I will briefly describe the state of the art for these practical text-processing applications, focusing on the important role that machine learning methods have played in their development.

The second part of the talk will explore the role that natural language processing might play in machine learning research. Here, I will explain the kinds of text-based features that are relatively easy to incorporate into machine learning data sets. In addition, I’ll outline some problems from natural language processing that require, or could at least benefit from, new machine learning algorithms.

Claire Cardie

### Statistical Relational Learning: An Inductive Logic Programming Perspective

In the past few years there has been a lot of work lying at the intersection of probability theory, logic programming and machine learning [14,18,13,9,6,1,11]. This work is known under the names of statistical relational learning [7,5], probabilistic logic learning [4], or probabilistic inductive logic programming. Whereas most of the existing works have started from a probabilistic learning perspective and extended probabilistic formalisms with relational aspects, I shall take a di.erent perspective, in which I shall start from inductive logic programming and study how inductive logic programming formalisms, settings and techniques can be extended to deal with probabilistic issues. This tradition has already contributed a rich variety of valuable formalisms and techniques, including probabilistic Horn abduction by David Poole, PRISMs by Sato, stochastic logic programs by Muggleton [13] and Cussens [2], Bayesian logic programs [10,8] by Kersting and De Raedt, and Logical Hidden Markov Models [11].

The main contribution of this talk is the introduction of three probabilistic inductive logic programming settings which are derived from the learning from entailment, from interpretations and from proofs settings of the field of inductive logic programming [3]. Each of these settings contributes di.erent notions of probabilistic logic representations, examples and probability distributions. The first setting, probabilistic learning from entailment, is incorporated in the wellknown PRISM system [19] and Cussens’s Failure Adjusted Maximisation approach to parameter estimation in stochastic logic programs [2]. A novel system that was recently developed and that fits this paradigm is the nFOIL system [12]. It combines key principles of the well-known inductive logic programming system FOIL [15] with the naïve Bayes’ appraoch. In probabilistic learning from entailment, examples are ground facts that should be probabilistically entailed by the target logic program. The second setting, probabilistic learning from interpretations, is incorporated in Bayesian logic programs [10,8], which integrate Bayesian networks with logic programs. This setting is also adopted by [6]. Examples in this setting are Herbrand interpretations that should be a probabilistic model for the target theory. The third setting, learning from proofs [17], is novel. It is motivated by the learning of stochastic context free grammars from tree banks. In this setting, examples are proof trees that should be probabilistically provable from the unknown stochastic logic programs. The sketched settings (and their instances presented) are by no means the only possible settings for probabilistic inductive logic programming, but still – I hope – provide useful insights into the state-of-the-art of this exciting field.

For a full survey of statistical relational learning or probabilistic inductive logic programming, the author would like to refer to [4], and for more details on the probabilistic inductive logic programming settings to [16], where a longer and earlier version of this contribution can be found.

Luc De Raedt

### Recent Advances in Mining Time Series Data

Much of the world’s supply of data is in the form of time series. Furthermore, as we shall see, many types of data can be meaningfully converted into ”time series”, including text, DNA, video, images etc. The last decade has seen an explosion of interest in mining time series data from the academic community. There has been significant work on algorithms to classify, cluster, segment, index, discover rules, visualize, and detect anomalies/novelties in time series.

In this talk I will summarize the latest advances in mining time series data, including:

– New representations of time series data.

– New algorithms/definitions.

– The migration from static problems to online problems.

– New areas and applications of time series data mining.

I will end the talk with a discussion of “what’s left to do” in time series data mining.

Eamonn Keogh

### Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce

Electronic Commerce is now entering its second decade, with Amazon.com and eBay now in existence for ten years. With massive amounts of data, an actionable domain, and measurable ROI, multiple companies use data mining and knowledge discovery to understand their customers and improve interactions.We present important lessons and challenges using e-commerce examples across two dimensions: (i) business-level to technical, and (ii) the mining lifecycle from data collection, data warehouse construction, to discovery and deployment. Many of the lessons and challenges are applicable to domains outside e-commerce.

Ron Kohavi

### Data Streams and Data Synopses for Massive Data Sets (Invited Talk)

With the proliferation of data intensive applications, it has become necessary to develop new techniques to handle massive data sets. Traditional algorithmic techniques and data structures are not always suitable to handle the amount of data that is required and the fact that the data often streams by and cannot be accessed again. A field of research established over the past decade is that of handling massive data sets using data synopses, and developing algorithmic techniques for data stream models. We will discuss some of the research work that has been done in the field, and provide a decades’ perspective to data synopses and data streams.

Yossi Matias

### Clustering and Metaclustering with Nonnegative Matrix Decompositions

Although very widely used in unsupervised data mining, most clustering methods are affected by the instability of the resulting clusters w.r.t. the initialization of the algorithm (as e.g. in k-means). Here we show that this problem can be elegantly and efficiently tackled by

meta-clustering

the clusters produced in several different runs of the algorithm, especially if “

soft

” clustering algorithms (such as Nonnegative Matrix Factorization) are used both at the object- and the meta-level. The essential difference w.r.t. other meta-clustering approaches consists in the fact that our algorithm detects frequently occurring

sub

-clusters (rather than

complete

clusters) in the various runs, which allows it to outperform existing algorithms. Additionally, we show how to perform

two-way meta

-clustering, i.e. take both object and sample dimensions of clusters simultaneously into account, a feature which is essential e.g. for

bi

clustering gene expression data, but has not been considered before.

### A SAT-Based Version Space Algorithm for Acquiring Constraint Satisfaction Problems

Constraint programming is rapidly becoming the technology of choice for modelling and solving complex combinatorial problems. However, users of this technology need significant expertise in order to model their problems appropriately. The lack of availability of such expertise is a significant bottleneck to the broader uptake of constraint technology in the real world. We present a new SAT-based version space algorithm for acquiring constraint satisfaction problems from examples of solutions and non-solutions of a target problem. An important advantage is the ease with which domain-specific knowledge can be exploited using the new algorithm. Finally, we empirically demonstrate the algorithm and the effect of exploiting domain-specific knowledge on improving the quality of the acquired constraint network.

Christian Bessiere, Remi Coletta, Frédéric Koriche, Barry O’Sullivan

### Estimation of Mixture Models Using Co-EM

We study estimation of mixture models for problems in which multiple views of the instances are available. Examples of this setting include clustering web pages or research papers that have intrinsic (text) and extrinsic (references) attributes. Our optimization criterion quantifies the likelihood and the consensus among models in the individual views; maximizing this consensus minimizes a bound on the risk of assigning an instance to an incorrect mixture component. We derive an algorithm that maximizes this criterion. Empirically, we observe that the resulting clustering method incurs a lower cluster entropy than regular EM for web pages, research papers, and many text collections.

Steffen Bickel, Tobias Scheffer

### Nonrigid Embeddings for Dimensionality Reduction

Spectral methods for embedding graphs and immersing data manifolds in low-dimensional spaces are notoriously unstable due to insufficient and/or numerically ill-conditioned constraint sets. Why show why this is endemic to spectral methods, and develop low-complexity solutions for stiffening ill-conditioned problems and regularizing ill-posed problems, with proofs of correctness. The regularization exploits sparse but complementary constraints on affine rigidity and edge lengths to obtain isometric embeddings. An implemented algorithm is fast, accurate, and industrial-strength: Experiments with problem sizes spanning four orders of magnitude show

O

(

N

) scaling. We demonstrate with speech data.

Matthew Brand

### Multi-view Discriminative Sequential Learning

Discriminative learning techniques for sequential data have proven to be more effective than generative models for named entity recognition, information extraction, and other tasks of discrimination. However, semi-supervised learning mechanisms that utilize inexpensive unlabeled sequences in addition to few labeled sequences – such as the Baum-Welch algorithm – are available only for generative models. The multi-view approach is based on the principle of maximizing the consensus among multiple independent hypotheses; we develop this principle into a semi-supervised hidden Markov perceptron, and a semi-supervised hidden Markov support vector learning algorithm. Experiments reveal that the resulting procedures utilize unlabeled data effectively and discriminate more accurately than their purely supervised counterparts.

Ulf Brefeld, Christoph Büscher, Tobias Scheffer

### Robust Bayesian Linear Classifier Ensembles

Ensemble classifiers combine the classification results of several classifiers. Simple ensemble methods such as uniform averaging over a set of models usually provide an improvement over selecting the single best model. Usually probabilistic classifiers restrict the set of possible models that can be learnt in order to lower computational complexity costs. In these restricted spaces, where incorrect modeling assumptions are possibly made, uniform averaging sometimes performs even better than bayesian model averaging. Linear mixtures over sets of models provide an space that includes uniform averaging as a particular case. We develop two algorithms for learning maximum a posteriori weights for linear mixtures, based on expectation maximization and on constrained optimizition. We provide a nontrivial example of the utility of these two algorithms by applying them for one dependence estimators. We develop the conjugate distribution for one dependence estimators and empirically show that uniform averaging is clearly superior to Bayesian model averaging for this family of models. After that we empirically show that the maximum a posteriori linear mixture weights improve accuracy significantly over uniform aggregation.

Jesús Cerquides, Ramon López de Mántaras

### An Integrated Approach to Learning Bayesian Networks of Rules

Inductive Logic Programming (ILP) is a popular approach for learning rules for classification tasks. An important question is how to combine the individual rules to obtain a useful classifier. In some instances, converting each learned rule into a binary feature for a Bayes net learner improves the accuracy compared to the standard decision list approach [3,4,14]. This results in a two-step process, where rules are generated in the first phase, and the classifier is learned in the second phase. We propose an algorithm that interleaves the two steps, by incrementally building a Bayes net during rule learning. Each candidate rule is introduced into the network, and scored by whether it improves the performance of the classifier. We call the algorithm SAYU for Score As You Use. We evaluate two structure learning algorithms Naïve Bayes and Tree Augmented Naïve Bayes. We test SAYU on four different datasets and see a significant improvement in two out of the four applications. Furthermore, the theories that SAYU learns tend to consist of far fewer rules than the theories in the two-step approach.

Jesse Davis, Elizabeth Burnside, Inês de Castro Dutra, David Page, Vítor Santos Costa

### Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result.

– inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.

Isabel Drost, Tobias Scheffer

### Rotational Prior Knowledge for SVMs

Incorporation of prior knowledge into the learning process can significantly improve low-sample classification accuracy. We show how to introduce prior knowledge into linear support vector machines in form of constraints on the rotation of the normal to the separating hyperplane. Such knowledge frequently arises naturally, e.g., as inhibitory and excitatory influences of input variables. We demonstrate that the generalization ability of rotationally-constrained classifiers is improved by analyzing their VC and fat-shattering dimensions. Interestingly, the analysis shows that large-margin classification framework justifies the use of stronger prior knowledge than the traditional VC framework. Empirical experiments with text categorization and political party affiliation prediction confirm the usefulness of rotational prior knowledge.

### On the LearnAbility of Abstraction Theories from Observations for Relational Learning

The most common methodology in symbolic learning consists in inducing, given a set of observations, a general concept definition. It is widely known that the choice of the proper description language for a learning problem can affect the efficacy and effectiveness of the learning task. Furthermore, most real-world domains are affected by various kinds of imperfections in data, such as inappropriateness of the description language which does not contain/facilitate an exact representation of the target concept. To deal with such kind of situations, Machine Learning approaches moved from a framework exploiting a single inference mechanism, such as induction, towards one integrating multiple inference strategies such as abstraction. The literature so far assumed that the information needed to the learning systems to apply additional inference strategies is provided by a domain expert. The goal of this work is the automatic inference of such information.

The effectiveness of the proposed method was tested by providing the generated abstraction theories to the learning system INTHELEX as a background knowledge to exploit its abstraction capabilities. Various experiments were carried out on the real-world application domain of scientific paper documents, showing the validity of the approach.

Stefano Ferilli, Teresa M. A. Basile, Nicola Di Mauro, Floriana Esposito

### Beware the Null Hypothesis: Critical Value Tables for Evaluating Classifiers

Scientists regularly decide the statistical significance of their findings by determining whether they can, with sufficient confidence, rule out the possibility that their findings could be attributed to random variation—the ‘null hypothesis.’ For this, they rely on tables with

critical values

pre-computed for the normal distribution, the t-distribution, etc. This paper provides such tables (and methods for generating them) for the performance metrics of binary classification: accuracy, F-measure, area under the ROC curve (AUC), and true positives in the top ten. Given a test set of a certain size, the tables provide the critical value for accepting or rejecting the null hypothesis that the score of the best classifier would be consistent with taking the best of a set of random classifiers. The tables are appropriate to consult when a researcher, practitioner or contest manager selects the best of many classifiers measured against a common test set. The risk of the null hypothesis is especially high when there is a shortage of positives or negatives in the testing set (irrespective of the training set size), as is the case for many medical and industrial classification tasks with highly skewed class distributions.

George Forman, Ira Cohen

### Kernel Basis Pursuit

Estimating a non-uniformly sampled function from a set of learning points is a classical regression problem. Kernel methods have been widely used in this context, but every problem leads to two major tasks: optimizing the kernel and setting the fitness-regularization compromise.

This article presents a new method to estimate a function from noisy learning points in the context of RKHS (Reproducing Kernel Hilbert Space). We introduce the Kernel Basis Pursuit algorithm, which enables us to build a ℓ

1

-regularized-multiple-kernel estimator. The general idea is to decompose the function to learn on a sparse-optimal set of spanning functions. Our implementation relies on the Least Absolute Shrinkage and Selection Operator (LASSO) formulation and on the Least Angle Regression (LARS) solver. The computation of the full regularization path, through the LARS, will enable us to propose new adaptive criteria to find an optimal fitness-regularization compromise. Finally, we aim at proposing a fast parameter-free method to estimate non-uniform-sampled functions.

Vincent Guigue, Alain Rakotomamonjy, Stéphane Canu

### Hybrid Algorithms with Instance-Based Classification

In this paper we aim to show that instance-based classification can replace the classifier component of a rule learner and of maximum-entropy modeling, thereby improving the generalization accuracy of both algorithms. We describe hybrid algorithms that combine rule learning models and maximum-entropy modeling with instance-based classification. Experimental results show that both hybrids are able to outperform the parent algorithm. We analyze and compare the overlap in errors and the statistical bias and variance of the hybrids, their parent algorithms, and a plain instance-based learner. We observe that the successful hybrid algorithms have a lower statistical bias component in the error than their parent algorithms; the fewer errors they make are also less systematic.

Iris Hendrickx, Antal van den Bosch

### Learning and Classifying Under Hard Budgets

Since resources for data acquisition are seldom infinite, both learners and classifiers must act intelligently under hard budgets. In this paper, we consider problems in which feature values are unknown to both the learner and classifier, but can be acquired at a cost. Our goal is a learner that spends its fixed learning budget

b

L

acquiring training data, to produce the most accurate “active classifier” that spends at most

b

C

per instance. To produce this fixed-budget classifier, the fixed-budget learner must sequentially decide which feature values to collect to learn the relevant information about the distribution. We explore several approaches the learner can take, including the standard “round robin” policy (purchasing every feature of every instance until the

b

L

budget is exhausted). We demonstrate empirically that round robin is problematic (especially for small

b

L

), and provide alternate learning strategies that achieve superior performance on a variety of datasets.

Aloak Kapoor, Russell Greiner

### Training Support Vector Machines with Multiple Equality Constraints

In this paper we present a primal-dual decomposition algorithm for support vector machine training. As with existing methods that use very small working sets (such as

Sequential Minimal Optimization

(SMO),

Successive Over-Relaxation

(SOR) or the

(KA)), our method scales well, is straightforward to implement, and does not require an external QP solver. Unlike SMO, SOR and KA, the method is applicable to a large number of SVM formulations regardless of the number of equality constraints involved. The effectiveness of our algorithm is demonstrated on a more difficult SVM variant in this respect, namely semi-parametric support vector regression.

Wolf Kienzle, Bernhard Schölkopf

### A Model Based Method for Automatic Facial Expression Recognition

Automatic facial expression recognition is a research topic with interesting applications in the field of human-computer interaction, psychology and product marketing. The classification accuracy for an automatic system which uses static images as input is however largely limited by the image quality, lighting conditions and the orientation of the depicted face. These problems can be partially overcome by using a holistic model based approach called the Active Appearance Model. A system will be described that can classify expressions from one of the emotional categories joy, anger, sadness, surprise, fear and disgust with remarkable accuracy. It is also able to detect smaller, local facial features based on minimal muscular movements described by the Facial Action Coding System (FACS). Finally, we show how the system can be used for expression analysis and synthesis.

Hans van Kuilenburg, Marco Wiering, Marten den Uyl

### Margin-Sparsity Trade-Off for the Set Covering Machine

We propose a new learning algorithm for the set covering machine and a tight data-compression risk bound that the learner can use for choosing the appropriate tradeoff between the sparsity of a classifier and the magnitude of its separating margin.

François Laviolette, Mario Marchand, Mohak Shah

### Learning from Positive and Unlabeled Examples with Different Data Distributions

We study the problem of learning from positive and unlabeled examples. Although several techniques exist for dealing with this problem, they all assume that positive examples in the positive set

P

and the positive examples in the unlabeled set

U

are generated from the same distribution. This assumption may be violated in practice. For example, one wants to collect all printer pages from the Web. One can use the printer pages from one site as the set

P

of positive pages and use product pages from another site as

U

. One wants to classify the pages in

U

into printer pages and non-printer pages. Although printer pages from the two sites have many similarities, they can also be quite different because different sites often present similar products in different styles and have different focuses. In such cases, existing methods perform poorly. This paper proposes a novel technique A-EM to deal with the problem. Experiment results with product page classification demonstrate the effectiveness of the proposed technique.

Xiao-Li Li, Bing Liu

### Towards Finite-Sample Convergence of Direct Reinforcement Learning

While direct, model-free reinforcement learning often performs better than model-based approaches in practice, only the latter have yet supported theoretical guarantees for finite-sample convergence. A major difficulty in analyzing the direct approach in an online setting is the absence of a definitive exploration strategy. We extend the notion of admissibility to direct reinforcement learning and show that standard Q-learning with optimistic initial values and constant learning rate is admissible. The notion justifies the use of a greedy strategy that we believe performs very well in practice and holds theoretical significance in deriving finite-sample convergence for direct reinforcement learning. We present empirical evidence that supports our idea.

Shiau Hong Lim, Gerald DeJong

### Infinite Ensemble Learning with Support Vector Machines

Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of base hypotheses. However, existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. It is not clear whether we should construct an ensemble classifier with a larger or even infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on SVM. The framework can output an infinite and nonsparse ensemble, and can be used to construct new kernels for SVM as well as to interpret some existing ones. We demonstrate the framework with a concrete application, the stump kernel, which embodies infinitely many decision stumps. The stump kernel is simple, yet powerful. Experimental results show that SVM with the stump kernel is usually superior than boosting, even with noisy data.

Hsuan-Tien Lin, Ling Li

### A Kernel Between Unordered Sets of Data: The Gaussian Mixture Approach

In this paper, we present a new kernel for unordered sets of data of the same type. It works by first fitting a set with a Gaussian mixture, then evaluate an efficient kernel on the two fitted Gaussian mixtures. Furthermore, we show that this kernel can be extended to sets embedded in a feature space implicitly defined by another kernel, where Gaussian mixtures are fitted with the kernelized EM algorithm [6], and the kernel for Gaussian mixtures are modified to use the outputs from the kernelized EM. All computation depends on data only through their inner products as evaluations of the base kernel. The kernel is computable in closed form, and being able to work in a feature space improves its flexibility and applicability. Its performance is evaluated in experiments on both synthesized and real data.

Siwei Lyu

### Active Learning for Probability Estimation Using Jensen-Shannon Divergence

Active selection of good training examples is an important approach to reducing data-collection costs in machine learning; however, most existing methods focus on maximizing classification accuracy. In many applications, such as those with unequal misclassification costs, producing good class probability estimates (CPEs) is more important than optimizing classification accuracy. We introduce novel approaches to active learning based on the algorithms Bootstrap-LV and

ActiveDecorate

, by using Jensen-Shannon divergence (a similarity measure for probability distributions) to improve sample selection for optimizing CPEs. Comprehensive experimental results demonstrate the benefits of our approaches.

Prem Melville, Stewart M. Yang, Maytal Saar-Tsechansky, Raymond Mooney

### Natural Actor-Critic

This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.

Jan Peters, Sethu Vijayakumar, Stefan Schaal

### Inducing Head-Driven PCFGs with Latent Heads: Refining a Tree-Bank Grammar for Parsing

Although

state-of-the-art

parsers for natural language are lexicalized, it was recently shown that an

accurate

unlexicalized parser for the Penn tree-bank can be simply read off a manually refined tree-bank. While

lexicalized

parsers often suffer from sparse data,

manual mark-up

is costly and largely based on individual linguistic intuition. Thus, across domains, languages, and tree-bank annotations, a fundamental question arises: Is it possible to

automatically

induce an

accurate

parser from a tree-bank without resorting to full lexicalization? In this paper, we show how to induce a probabilistic parser with latent head information from simple linguistic principles. Our parser has a performance of 85.1% (LP/LR F

1

), which is as good as that of early

lexicalized

ones. This is remarkable since the induction of probabilistic grammars is in general a hard task.

Detlef Prescher

### Learning (k,l)-Contextual Tree Languages for Information Extraction

This paper introduces a novel method for learning a wrapper for extraction of text nodes from web pages based upon (

k

,

l

)-contextual tree languages. It also introduces a method to learn good values of

k

and

l

based on a few positive and negative examples. Finally, it describes how the algorithm can be integrated in a tool for information extraction.

Stefan Raeymaekers, Maurice Bruynooghe, Jan Van den Bussche

### Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method

This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically, that reasonably few interactions with the plant are needed to generate control policies of high quality.

Martin Riedmiller

### MCMC Learning of Bayesian Network Models by Markov Blanket Decomposition

We propose a Bayesian method for learning Bayesian network models using Markov chain Monte Carlo (MCMC). In contrast to most existing MCMC approaches that define components in term of single edges, our approach is to decompose a Bayesian network model in larger

dependence

components defined by Markov blankets. The idea is based on the fact that MCMC performs significantly better when choosing the right decomposition, and that edges in the Markov blanket of the vertices form a natural dependence relationship. Using the ALARM and Insurance networks, we show that this decomposition allows MCMC to mix more rapidly, and is less prone to getting stuck in local maxima compared to the single edge approach.

Carsten Riggelsen

### On Discriminative Joint Density Modeling

We study discriminative joint density models, that is, generative models for the joint density

p

(

c

,

x

) learned by maximizing a discriminative cost function, the conditional likelihood. We use the framework to derive generative models for generalized linear models, including logistic regression, linear discriminant analysis, and discriminative mixture of unigrams. The benefits of deriving the discriminative models from joint density models are that it is easy to extend the models and interpret the results, and missing data can be treated using justified standard methods.

Jarkko Salojärvi, Kai Puolamäki, Samuel Kaski

### Model-Based Online Learning of POMDPs

Learning to act in an unknown partially observable domain is a difficult variant of the reinforcement learning paradigm. Research in the area has focused on model-free methods — methods that learn a policy without learning a model of the world. When sensor noise increases, model-free methods provide less accurate policies. The model-based approach — learning a POMDP model of the world, and computing an optimal policy for the learned model — may generate superior results in the presence of sensor noise, but learning and solving a model of the environment is a difficult problem. We have previously shown how such a model can be obtained from the learned policy of model-free methods, but this approach implies a distinction between a learning phase and an acting phase that is undesirable. In this paper we present a novel method for learning a POMDP model online, based on McCallums’ Utile Suffix Memory (USM), in conjunction with an approximate policy obtained using an incremental POMDP solver. We show that the incrementally improving policy provides superior results to the original USM algorithm, especially in the presence of increasing sensor and action noise.

Guy Shani, Ronen I. Brafman, Solomon E. Shimony

### Simple Test Strategies for Cost-Sensitive Decision Trees

We study cost-sensitive learning of decision trees that incorporate both test costs and misclassification costs. In particular, we first propose a lazy decision tree learning that minimizes the total cost of tests and misclassifications. Then assuming test examples may contain unknown attributes whose values can be obtained at a cost (the test cost), we design several novel test strategies which attempt to minimize the total cost of tests and misclassifications for each test example. We empirically evaluate our tree-building and various test strategies, and show that they are very effective. Our results can be readily applied to real-world diagnosis tasks, such as medical diagnosis where doctors must try to determine what tests (e.g., blood tests) should be ordered for a patient to minimize the total cost of tests and misclassifications (misdiagnosis). A case study on heart disease is given throughout the paper.

Shengli Sheng, Charles X. Ling, Qiang Yang

### $\mathcal{U}$ -Likelihood and $\mathcal{U}$ -Updating Algorithms: Statistical Inference in Latent Variable Models

In this paper we consider latent variable models and introduce a new

$\mathcal{U}$

-likelihood

concept for estimating the distribution over hidden variables. One can derive an estimate of parameters from this distribution. Our approach differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an alternative to Bayesian inference when we don’t want to define a prior over parameters and gives an alternative to the ML method when we want a better estimate of the distribution over hidden variables. As a practical implementation, we present a

$\mathcal{U}$

-updating algorithm

based on the mean field theory to approximate the distribution over hidden variables from the

$\mathcal{U}$

-likelihood. This algorithm captures some of the correlations among hidden variables by estimating reaction terms. Those reaction terms are found to penalize the likelihood. We show that the

$\mathcal{U}$

-updating algorithm becomes the EM algorithm as a special case in the large sample limit. The useful behavior of our method is confirmed for the case of mixture of Gaussians by comparing to the EM algorithm.

Jaemo Sung, Sung-Yang Bang, Seungjin Choi, Zoubin Ghahramani

### An Optimal Best-First Search Algorithm for Solving Infinite Horizon DEC-POMDPs

In the domain of decentralized Markov decision processes, we develop the first complete and optimal algorithm that is able to extract deterministic policy vectors based on finite state controllers for a cooperative team of agents. Our algorithm applies to the discounted infinite horizon case and extends best-first search methods to the domain of decentralized control theory. We prove the optimality of our approach and give some first experimental results for two small test problems. We believe this to be an important step forward in learning and planning in stochastic multi-agent systems.

Daniel Szer, François Charpillet

### Ensemble Learning with Supervised Kernels

Kernel-based methods have outstanding performance on many machine learning and pattern recognition tasks. However, they are sensitive to kernel selection, they may have low tolerance to noise, and they can not deal with mixed-type or missing data. We propose to derive a novel kernel from an ensemble of decision trees. This leads to kernel methods that naturally handle noisy and heterogeneous data with potentially non-randomly missing values. We demonstrate excellent performance of regularized least square learners based on such kernels.

Kari Torkkola, Eugene Tuv

### Using Advice to Transfer Knowledge Acquired in One Reinforcement Learning Task to Another

Lisa Torrey, Trevor Walker, Jude Shavlik, Richard Maclin

### A Distance-Based Approach for Action Recommendation

Rule induction has attracted a great deal of attention in Machine Learning and Data Mining. However, generating rules is not an end in itself because their applicability is not so straightforward. Indeed, the user is often overwhelmed when faced with a large number of rules.

In this paper, we propose an approach to lighten this burden when the user wishes to exploit such rules to decide which actions to do given an unsatisfactory situation. The method consists in comparing a situation to a set of classification rules. This is achieved using a suitable distance thus allowing to suggest action recommendations with minimal changes to improve that situation. We propose the algorithm

Dakar

for learning action recommendations and we present an application to an environmental protection issue. Our experiment shows the usefulness of our contribution in decision-making but also raises concerns about the impact of the redundancy of a set of rules in learning action recommendations of quality.

Ronan Trepos, Ansaf Salleb, Marie-Odile Cordier, Véronique Masson, Chantal Gascuel

### Multi-armed Bandit Algorithms and Empirical Evaluation

The multi-armed bandit problem for a gambler is to decide which arm of a

K

-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. Several strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms.

This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm,

Poker

(Price Of Knowledge and Estimated Reward) whose performance compares favorably to that of other existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the

ε

-greedy strategy, proves to be often hard to beat.

Joannès Vermorel, Mehryar Mohri

### Annealed Discriminant Analysis

Motivated by the analogies to statistical physics, the deterministic annealing (DA) method has successfully been demonstrated in a variety of applications. In this paper, we explore a new methodology to devise the classifier under the DA method. The differential cost function is derived subject to a constraint on the randomness of the solution, which is governed by the temperature

T

. While gradually lowering the temperature, we can always find a good solution which can both solve the overfitting problem and avoid poor local optima. Our approach is called

annealed discriminant analysis

(ADA). It is a general approach, where we elaborate two classifiers, i.e., distance-based and inner product-based, in this paper. The distance-based classifier is an annealed version of linear discriminant analysis (LDA) while the inner product-based classifier is a generalization of penalized logistic regression (PLR). As such, ADA provides new insights into the workings of these two classification algorithms. The experimental results show substantial performance gains over standard learning methods.

Gang Wang, Zhihua Zhang, Frederick H. Lochovsky

### Network Game and Boosting

We propose an ensemble learning method called Network Boosting which combines weak learners together based on a random graph (network). A theoretic analysis based on the game theory shows that the algorithm can learn the target hypothesis asymptotically. The comparison results using several datasets of the UCI machine learning repository and synthetic data are promising and show that Network Boosting has much resistance to the noisy data than AdaBoost through the cooperation of classifiers in the classifier network.

Shijun Wang, Changshui Zhang

### Model Selection in Omnivariate Decision Trees

We propose an omnivariate decision tree architecture which contains univariate, multivariate linear or nonlinear nodes, matching the complexity of the node to the complexity of the data reaching that node. We compare the use of different model selection techniques including AIC, BIC, and CV to choose between the three types of nodes on standard datasets from the UCI repository and see that such omnivariate trees with a small percentage of multivariate nodes close to the root generalize better than pure trees with the same type of node everywhere. CV produces simpler trees than AIC and BIC without sacrificing from expected error. The only disadvantage of CV is its longer training time.

Olcay Taner Yıldız, Ethem Alpaydın

### Bayesian Network Learning with Abstraction Hierarchies and Context-Specific Independence

Context-specific independence representations, such as tree-structured conditional probability tables (TCPTs), reduce the number of parameters in Bayesian networks by capturing local independence relationships and improve the quality of learned Bayesian networks. We previously presented Abstraction-Based Search (ABS), a technique for using attribute value hierarchies during Bayesian network learning to remove unimportant distinctions within the CPTs. In this paper, we introduce TCPT ABS (TABS), which integrates ABS with TCPT learning. Since expert-provided hierarchies may not be available, we provide a clustering technique for deriving hierarchies from data. We present empirical results for three real-world domains, finding that (1) combining TCPTs and ABS provides a significant increase in the quality of learned Bayesian networks (2) combining TCPTs and ABS provides a dramatic reduction in the number of parameters in the learned networks, and (3) data-derived hierarchies perform as well or better than expert-provided hierarchies.

Marie desJardins, Priyang Rathod, Lise Getoor

### Learning to Complete Sentences

We consider the problem of predicting how a user will continue a given initial text fragment. Intuitively, our goal is to develop a “tab-complete” function for natural language, based on a model that is learned from text data. We consider two learning mechanisms that generate predictive models from collections of application-specific document collections: we develop an

N

-gram based completion method and discuss the application of instance-based learning. After developing evaluation metrics for this task, we empirically compare the model-based to the instance-based method and assess the predictability of call-center emails, personal emails, and weather reports.

Steffen Bickel, Peter Haider, Tobias Scheffer

### The Huller: A Simple and Efficient Online SVM

We propose a novel online kernel classifier algorithm that converges to the Hard Margin SVM solution. The same update rule is used to both add and remove support vectors from the current classifier. Experiments suggest that this algorithm matches the SVM accuracies after a single pass over the training examples. This algorithm is attractive when one seeks a competitive classifier with large datasets and limited computing resources.

Antoine Bordes, Léon Bottou

### Inducing Hidden Markov Models to Model Long-Term Dependencies

We propose in this paper a novel approach to the induction of the structure of Hidden Markov Models. The induced model is seen as a lumped process of a Markov chain. It is constructed to fit the dynamics of the target machine, that is to best approximate the stationary distribution and the mean first passage times observed in the sample. The induction relies on non-linear optimization and iterative state splitting from an initial order one Markov chain.

Jérôme Callut, Pierre Dupont

### A Similar Fragments Merging Approach to Learn Automata on Proteins

We propose here to learn automata for the characterization of proteins families to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach learning non-deterministic automata based on selection and ordering of significantly similar fragments to be merged and on physico-chemical properties identification. Quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation for a large range of models specificity.

François Coste, Goulven Kerbellec

### Nonnegative Lagrangian Relaxation of K-Means and Spectral Clustering

We show that K-means and spectral clustering objective functions can be written as a trace of quadratic forms. Instead of relaxation by eigenvectors, we propose a novel relaxation maintaining the nonnegativity of the cluster indicators and thus give the cluster posterior probabilities, therefore resolving cluster assignment difficulty in spectral relaxation. We derive a multiplicative updating algorithm to solve the nonnegative relaxation problem. The method is briefly extended to semi-supervised classification and semi-supervised clustering.

Chris Ding, Xiaofeng He, Horst D. Simon

### Severe Class Imbalance: Why Better Algorithms Aren’t the Answer

This paper argues that severe class imbalance is not just an interesting technical challenge that improved learning algorithms will address, it is much more serious. To be useful, a classifier must appreciably outperform a trivial solution, such as choosing the majority class. Any application that is inherently noisy limits the error rate, and cost, that is achievable. When data are normally distributed, even a Bayes optimal classifier has a vanishingly small reduction in the majority classifier’s error rate, and cost, as imbalance increases. For fat tailed distributions, and when practical classifiers are used, often no reduction is achieved.

Chris Drummond, Robert C. Holte

### Approximation Algorithms for Minimizing Empirical Error by Axis-Parallel Hyperplanes

Many learning situations involve separation of labeled training instances by hyperplanes. Consistent separation is of theoretical interest, but the real goal is rather to minimize the number of errors using a bounded number of hyperplanes. Exact minimization of empirical error in a high-dimensional grid induced into the feature space by axis-parallel hyperplanes is NP-hard. We develop two approximation schemes with performance guarantees, a greedy set covering scheme for producing a consistently labeled grid, and integer programming rounding scheme for finding the minimum error grid with bounded number of hyperplanes.

Tapio Elomaa, Jussi Kujala, Juho Rousu

### A Comparison of Approaches for Learning Probability Trees

Probability trees (or Probability Estimation Trees, PET’s) are decision trees with probability distributions in the leaves. Several alternative approaches for learning probability trees have been proposed but no thorough comparison of these approaches exists.

In this paper we experimentally compare the main approaches using the relational decision tree learner Tilde (both on non-relational and on relational datasets). Next to the main existing approaches, we also consider a novel variant of an existing approach based on the Bayesian Information Criterion (BIC). Our main conclusion is that overall trees built using the C4.5-approach or the C4.4-approach (C4.5 without post-pruning) have the best predictive performance. If the number of classes is low, however, BIC performs equally well. An additional advantage of BIC is that its trees are considerably smaller than trees for the C4.5- or C4.4-approach.

Daan Fierens, Jan Ramon, Hendrik Blockeel, Maurice Bruynooghe

### Counting Positives Accurately Despite Inaccurate Classification

Most supervised machine learning research assumes the training set is a random sample from the target population, thus the class distribution is invariant. In real world situations, however, the class distribution changes, and is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on the problem of accurately estimating the number of positives in the test set—

quantification

—as opposed to classifying individual cases accuratel y. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set—a common case in information retrieval.

George Forman

### Optimal Stopping and Constraints for Diffusion Models of Signals with Discontinuities

Gaussian process regression models can be utilized in recovery of discontinuous signals. Their computational complexity is linear in the number of observations if applied with the covariance functions of nonlinear diffusion. However, such processes often result in hard-to-control jumps of the signal value. Synthetic examples presented in this work indicate that Bayesian evidence-maximizing stopping and knowledge whether signal values are discrete help to outperform the steady state solutions of nonlinear diffusion filtering.

Ramūnas Girdziušas, Jorma Laaksonen

### An Evolutionary Function Approximation Approach to Compute Prediction in XCSF

XCSF is a new extension to XCS that is developed to extend XCS’s reward calculation capability via computing. This new feature is called

computable prediction

. The first version of XCSF tries to find the most appropriate equation to compute each

classifier’s

reward using a weight update mechanism. In this paper, we try to propose a new evolutionary mechanism to compute these equations using genetic algorithms.

### Using Rewards for Belief State Updates in Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes (POMDP) provide a standard framework for sequential decision making in stochastic environments. In this setting, an agent takes actions and receives observations and rewards from the environment. Many POMDP solution methods are based on computing a

belief state

, which is a probability distribution over possible states in which the agent could be. The action choice of the agent is then based on the belief state. The belief state is computed based on a model of the environment, and the history of actions and observations seen by the agent. However, reward information is not taken into account in updating the belief state. In this paper, we argue that rewards can carry useful information that can help disambiguate the hidden state. We present a method for updating the belief state which takes rewards into account. We present experiments with exact and approximate planning methods on several standard POMDP domains, using this belief update method, and show that it can provide advantages, both in terms of speed and in terms of the quality of the solution obtained.

### Active Learning in Partially Observable Markov Decision Processes

This paper examines the problem of finding an optimal policy for a Partially Observable Markov Decision Process (POMDP) when the model is not known or is only poorly specified. We propose two approaches to this problem. The first relies on a model of the uncertainty that is added directly into the POMDP planning problem. This has theoretical guarantees, but is impractical when many of the parameters are uncertain. The second, called MEDUSA, incrementally improves the POMDP model using selected queries, while still optimizing reward. Results show good performance of the algorithm even in large problems: the most useful parameters of the model are learned quickly and the agent still accumulates high reward throughout the process.

Robin Jaulmes, Joelle Pineau, Doina Precup

### Machine Learning of Plan Robustness Knowledge About Instances

Classical planning domain representations assume all the objects from one type are exactly the same. But when solving problems in the real world systems, the execution of a plan that theoretically solves a problem, can fail because of not properly capturing the special features of an object in the initial representation. We propose to capture this uncertainty about the world with an architecture that integrates planning, execution and learning. In this paper, we describe the

PELA

system (Planning-Execution-Learning Architecture). This system generates plans, executes those plans in the real world, and automatically acquires knowledge about the behaviour of the objects to strengthen the execution processes in the future.

Sergio Jiménez, Fernando Fernández, Daniel Borrajo

### Two Contributions of Constraint Programming to Machine Learning

A constraint is a relation with an active behavior. For a given relation, we propose to learn a representation adapted to this active behavior. It yields two contributions. The first is a generic meta-technique for classifier improvement showing performances comparable to boosting. The second lies in the ability of using the learned concept in constraint-based decision or optimization problems. It opens a new way of integrating Machine Learning in Decision Support Systems.

Arnaud Lallouet, Andreï Legtchenko

### A Clustering Model Based on Matrix Approximation with Applications to Cluster System Log Files

In system management applications, to perform automated analysis of the historical data across multiple components when problems occur, we need to cluster the log messages with disparate formats to automatically infer the common set of semantic situations and obtain a brief description for each situation. In this paper, we propose a clustering model where the problem of clustering is formulated as matrix approximations and the clustering objective is minimizing the approximation error between the original data matrix and the reconstructed matrix based on the cluster structures. The model explicitly characterizes the data and feature memberships and thus enables the descriptions of each cluster. We present a two-side spectral relaxation optimization procedure for the clustering model. We also establish the connections between our clustering model with existing approaches. Experimental results show the effectiveness of the proposed approach.

Tao Li, Wei Peng

### Detecting Fraud in Health Insurance Data: Learning to Model Incomplete Benford’s Law Distributions

Benford’s Law [1] specifies the probabilistic distribution of digits for many commonly occurring phenomena, ideally when we have complete data of the phenomena. We enhance this digital analysis technique with an unsupervised learning method to handle situations where data is incomplete. We apply this method to the detection of fraud and abuse in health insurance claims using real health insurance data. We demonstrate improved precision over the traditional Benford approach in detecting anomalous data indicative of fraud and illustrate some of the challenges to the analysis of healthcare claims fraud.

Fletcher Lu, J. Efrim Boritz

### Efficient Case Based Feature Construction

Feature construction is essential for solving many complex learning problems. Unfortunately, the construction of features usually implies searching a very large space of possibilities and is often computationally demanding. In this work, we propose a case based approach to feature construction. Learning tasks are stored together with a corresponding set of constructed features in a case base and can be retrieved to speed up feature construction for new tasks. The essential part of our method is a new representation model for learning tasks and a corresponding distance measure. Learning tasks are compared using relevance weights on a common set of base features only. Therefore, the case base can be built and queried very efficiently. In this respect, our approach is unique and enables us to apply case based feature construction not only on a large scale, but also in distributed learning scenarios in which communication costs play an important role. We derive a distance measure for heterogeneous learning tasks by stating a set of necessary conditions. Although the conditions are quite basic, they constraint the set of applicable methods to a surprisingly small number.

Ingo Mierswa, Michael Wurst

### Fitting the Smallest Enclosing Bregman Ball

Finding a point which minimizes the maximal distortion with respect to a dataset is an important estimation problem that has recently received growing attentions in machine learning, with the advent of one class classification. We propose two theoretically founded generalizations to arbitrary Bregman divergences, of a recent popular smallest enclosing ball approximation algorithm for Euclidean spaces coined by Bădoiu and Clarkson in 2002.

Richard Nock, Frank Nielsen

### Similarity-Based Alignment and Generalization

We present a novel approach to learning predictive sequential models, called

similarity-based alignment and generalization

, which incorporates in the induction process a specific form of domain knowledge derived from a similarity function between the points in the input space. When applied to Hidden Markov Models, our framework yields a new class of learning algorithms called

SimAlignGen

. We discuss the application of our approach to the problem of programming by demonstration–the problem of learning a procedural model of user behavior by observing the interaction an application Graphical User Interface (GUI). We describe in detail the SimIOHMM, a specific instance of

SimAlignGen

that extends the known Input-Output Hidden Markov Model (IOHMM). Empirical evaluations of the SimIOHMM show the dependence of the prediction accuracy on the introduced similarity bias, and the computational gains over the IOHMM.

Daniel Oblinger, Vittorio Castelli, Tessa Lau, Lawrence D. Bergman

### Fast Non-negative Dimensionality Reduction for Protein Fold Recognition

In this paper, dimensionality reduction via matrix factorization with nonnegativity constraints is studied. Because of these constraints, it stands apart from other linear dimensionality reduction methods. Here we explore nonnegative matrix factorization in combination with a classifier for protein fold recognition. Since typically matrix factorization is iteratively done, convergence can be slow. To alleviate this problem, a significantly faster (more than 11 times) algorithm is proposed.

Oleg Okun, Helen Priisalu, Alexessander Alves

### Mode Directed Path Finding

Learning from multi-relational domains has gained increasing attention over the past few years. Inductive logic programming (ILP) systems, which often rely on hill-climbing heuristics in learning first-order concepts, have been a dominating force in the area of multi-relational concept learning. However, hill-climbing heuristics are susceptible to local maxima and plateaus. In this paper, we show how we can exploit the links between objects in multi-relational data to help a first-order rule learning system direct the search by explicitly traversing these links to find paths between variables of interest. Our contributions are twofold: (i) we extend the

pathfinding

algorithm by Richards and Mooney [12] to make use of

mode declarations

, which specify the mode of call (input or output) for predicate variables, and (ii) we apply our extended path finding algorithm to

saturated bottom clauses

, which anchor one end of the search space, allowing us to make use of background knowledge used to build the saturated clause to further direct search. Experimental results on a medium-sized dataset show that path finding allows one to consider interesting clauses that would not easily be found by Aleph.

Irene M. Ong, Inês de Castro Dutra, David Page, Vítor Santos Costa

### Classification with Maximum Entropy Modeling of Predictive Association Rules

This paper presents a new classification model in which a classifier is built upon predictive association rules (PARs) and the maximum entropy principle (maxent). In this model, PARs can be seen as confident statistical patterns discovered from training data with strong dependencies and correlations among data items. Maxent, on the other hand, is an approach to build an estimated distribution having maximum entropy while obeying a potentially large number of useful features observed in empirical data. The underlying idea of our model is that PARs have suitable characteristics to serve as features for maxent. As a result, our classifier can take advantage of both the useful correlation and confidence of PARs as well as the strong statistical modeling capability of maxent. The experimental results show that our model can achieve significantly higher accuracy in comparison with the previous methods.

Hieu X. Phan, Minh L. Nguyen, S. Horiguchi, Bao T. Ho, Y. Inoguchi

### Classification of Ordinal Data Using Neural Networks

Many real life problems require the classification of items in naturally ordered classes. These problems are traditionally handled by conventional methods for nominal classes, ignoring the order. This paper introduces a new training model for feedforward neural networks, for multiclass classification problems, where the classes are ordered. The proposed model has just one output unit which takes values in the interval [0,1]; this interval is then subdivided into K subintervals (one for each class), according to a specific probabilistic model. A comparison is made with conventional approaches, as well as with other architectures specific for ordinal data proposed in the literature. The new model compares favourably with the other methods under study, in the synthetic dataset used for evaluation.

Joaquim Pinto da Costa, Jaime S. Cardoso

### Independent Subspace Analysis on Innovations

Independent subspace analysis (ISA) that deals with multi-dimensional independent sources, is a generalization of independent component analysis (ICA). However, all known ISA algorithms may become ineffective when the sources possess temporal structure. The innovation process instead of the original mixtures has been proposed to solve ICA problems with temporal dependencies. Here we show that this strategy can be applied to ISA as well. We demonstrate the idea on a mixture of 3D processes and also on a mixture of facial pictures used as two-dimensional deterministic sources. ISA on innovations was able to find the original subspaces, while plain ISA was not.

Barnabás Póczos, Bálint Takács, András Lőrincz

### On Applying Tabling to Inductive Logic Programming

Inductive Logic Programming (ILP) is an established sub-field of Machine Learning. Nevertheless, it is recognized that efficiency and scalability is a major obstacle to an increased usage of ILP systems in complex applications with large hypotheses spaces. In this work, we focus on improving the efficiency and scalability of ILP systems by exploring tabling mechanisms available in the underlying Logic Programming systems. Tabling is an implementation technique that improves the declarativeness and performance of Prolog systems by reusing answers to subgoals. To validate our approach, we ran the April ILP system in the YapTab Prolog tabling system using two well-known datasets. The results obtained show quite impressive gains without changing the accuracy and quality of the theories generated.

Ricardo Rocha, Nuno Fonseca, Vítor Santos Costa

### Learning Models of Relational Stochastic Processes

Processes involving change over time, uncertainty, and rich relational structure are common in the real world, but no general algorithms exist for learning models of them. In this paper we show how Markov logic networks (MLNs), a recently developed approach to combining logic and probability, can be applied to time-changing domains. We then show how existing algorithms for parameter and structure learning in MLNs can be extended to this setting. We apply this approach in two domains: modeling the spread of research topics in scientific communities, and modeling faults in factory assembly processes. Our experiments show that it greatly outperforms purely logical (ILP) and purely probabilistic (DBN) learners.

Sumit Sanghai, Pedro Domingos, Daniel Weld

### Error-Sensitive Grading for Model Combination

Ensemble learning is a powerful learning approach that combines multiple classifiers to improve prediction accuracy. An important decision while using an ensemble of classifiers is to decide upon a way of combining the prediction of its base classifiers. In this paper, we introduce a novel grading-based algorithm for model combination, which uses cost-sensitive learning in building a meta-learner. This method distinguishes between the grading error of classifying an incorrect prediction as correct, and the other-way-round, and tries to assign appropriate costs to the two types of error in order to improve performance. We study issues in error-sensitive grading, and then with extensive experiments show the empirically effectiveness of this new method in comparison with representative meta-classification techniques.

Surendra K. Singhi, Huan Liu

### Strategy Learning for Reasoning Agents

We present a method for knowledge-based agents to learn strategies. Using techniques of inductive logic programming, strategies are learned in two steps: A given example set is first generalized into an overly general theory, which then gets refined. We show how a learning agent can exploit background knowledge of its actions and environment in order to restrict the hypothesis space, which enables the learning of complex logic program clauses. This is a first step toward the long term goal of adaptive, reasoning agents capable of changing their behavior when appropriate.

Hendrik Skubch, Michael Thielscher

### Combining Bias and Variance Reduction Techniques for Regression Trees

Gradient Boosting and bagging applied to regressors can reduce the error due to bias and variance respectively. Alternatively, Stochastic Gradient Boosting (SGB) and Iterated Bagging (IB) attempt to simultaneously reduce the contribution of both bias and variance to error. We provide an extensive empirical analysis of these methods, along with two alternate bias-variance reduction approaches — bagging Gradient Boosting (BagGB) and bagging Stochastic Gradient Boosting (BagSGB). Experimental results demonstrate that SGB does not perform as well as IB or the alternate approaches. Furthermore, results show that, while BagGB and BagSGB perform competitively for low-bias learners, in general, Iterated Bagging is the most effective of these methods.

Yuk Lai Suen, Prem Melville, Raymond J. Mooney

### Analysis of Generic Perceptron-Like Large Margin Classifiers

We analyse perceptron-like algorithms with margin considering both the standard classification condition and a modified one which demands a specific value of the margin in the augmented space. The new algorithms are shown to converge in a finite number of steps and used to approximately locate the optimal weight vector in the augmented space. As the data are embedded in the augmented space at a larger distance from the origin the maximum margin in that space approaches the maximum geometric one in the original space. Thus, our procedures exploiting the new algorithms can be regarded as approximate maximal margin classifiers.

Petroula Tsampouka, John Shawe-Taylor

### Multimodal Function Optimizing by a New Hybrid Nonlinear Simplex Search and Particle Swarm Algorithm

A new hybrid Particle Swarm Optimization (PSO) algorithm is proposed in this paper based on the Nonlinear Simplex Search (NSS) method for multimodal function optimizing tasks. At late stage of PSO process, when the most promising regions of solutions are fixed, the algorithm isolates particles that fly very close to the extrema and applies the NSS method to them to enhance local exploitation searching. Explicit experimental results on famous benchmark functions indicate that this approach is reliable and efficient, especially on multimodal function optimizations. It yields better solution qualities and success rates compared to other three published methods.

Fang Wang, Yuhui Qiu

### Backmatter

Weitere Informationen