
2010 | Book

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part III

Editors: José Luis Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Regular Papers

Efficient Planning in Large POMDPs through Policy Graph Based Factorized Approximations

Partially observable Markov decision processes (POMDPs) are widely used for planning under uncertainty. In many applications, the huge size of the POMDP state space makes straightforward optimization of plans (policies) computationally intractable. To solve this, we introduce an efficient POMDP planning algorithm. Many current methods store the policy partly through a set of “value vectors” which is updated at each iteration by planning one step further; the size of such vectors follows the size of the state space, making computation intractable for large POMDPs. We store the policy as a graph only, which allows tractable approximations in each policy update step: for a state space described by several variables, we approximate beliefs over future states with factorized forms, minimizing Kullback-Leibler divergence to the non-factorized distributions. Our other speedup approximations include bounding potential rewards. We demonstrate the advantage of our method in several reinforcement learning problems, compared to four previous methods.

Joni Pajarinen, Jaakko Peltonen, Ari Hottinen, Mikko A. Uusitalo
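
The key approximation in the abstract above, factorizing beliefs while minimizing Kullback-Leibler divergence, has a convenient closed form: over fully factorized distributions, the KL-closest approximation to a joint belief is the product of its marginals. A minimal NumPy illustration of this background fact (not the authors' code):

```python
# Sketch (not the paper's implementation): for a belief over two state
# variables, the fully factorized distribution q minimizing KL(p || q)
# is the product of p's marginals -- the approximation the abstract uses.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((4, 5))          # joint belief over X (4 values) and Y (5 values)
p /= p.sum()                    # normalize to a proper distribution

px = p.sum(axis=1)              # marginal over X
py = p.sum(axis=0)              # marginal over Y
q = np.outer(px, py)            # factorized approximation q(x, y) = p(x) p(y)

kl = np.sum(p * np.log(p / q))  # KL(p || q); no product form achieves a smaller value
print(f"KL(p || q) = {kl:.4f}")
```
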
Unsupervised Trajectory Sampling

A novel methodology for efficiently sampling Trajectory Databases (TD) for mobility data mining purposes is presented. In particular, a three-step unsupervised trajectory sampling methodology is proposed that initially adopts a symbolic vector representation of a trajectory, which, using a similarity-based voting technique, is transformed into a continuous function that describes the representativeness of the trajectory in the TD. This vector representation is then relaxed by a merging algorithm, which identifies the maximal representative portions of each trajectory while preserving the space-time mobility pattern of the trajectory. Finally, a novel sampling algorithm operating on the previous representation is proposed, allowing us to select, in an unsupervised way, a subset of a TD that encapsulates the behavior (in terms of mobility patterns) of the original TD. An experimental evaluation over synthetic and real TD demonstrates the efficiency and effectiveness of our approach.

Nikos Pelekis, Ioannis Kopanakis, Costas Panagiotakis, Yannis Theodoridis
Fast Extraction of Locally Optimal Patterns Based on Consistent Pattern Function Variations

This article introduces the problem of searching for locally optimal patterns within a set of patterns constrained by some anti-monotonic predicate: given some pattern scoring function, a locally optimal pattern has a maximal (or minimal) score locally among neighboring patterns. Some instances of this problem have produced patterns of interest in the framework of knowledge discovery, since locally optimal patterns extracted from datasets are few, informative and non-redundant compared to other pattern families derived from frequent patterns. The article then introduces the concept of variation consistency to characterize pattern functions and uses this notion to propose GALLOP, an algorithm that outperforms existing algorithms for extracting locally optimal itemsets. Finally, it shows how GALLOP can generically be applied to two classes of scoring functions useful in binary classification or clustering pattern mining problems.

Frédéric Pennerath
Large Margin Learning of Bayesian Classifiers Based on Gaussian Mixture Models

We present a discriminative learning framework for Gaussian mixture models (GMMs) used for classification, based on the extended Baum-Welch (EBW) algorithm [1]. We suggest two criteria for discriminative optimization, namely the class conditional likelihood (CL) and the maximization of the margin (MM). In the experiments, we present results for synthetic data, broad phonetic classification, and a remote sensing application. The experiments show that CL-optimized GMMs (CL-GMMs) achieve lower performance than MM-optimized GMMs (MM-GMMs), whereas both discriminative GMMs (DGMMs) perform significantly better than generatively learned GMMs. We also show that these discriminatively parameterized GMM classifiers, being generative models, still allow marginalizing over missing features, a case where generative classifiers have an advantage over purely discriminative classifiers such as support vector machines or neural networks.

Franz Pernkopf, Michael Wohlmayr
Learning with Ensembles of Randomized Trees: New Insights

Ensembles of randomized trees such as Random Forests are among the most popular tools used in machine learning and data mining. Such algorithms work by introducing randomness into the induction of several decision trees before employing a voting scheme to give a prediction for unseen instances. In this paper, randomized tree ensembles are studied from the point of view of the basis functions they induce. We point out a connection with kernel target alignment, a measure of kernel quality, which suggests that randomization is a way to obtain a high alignment, leading to possibly low generalization error. The connection also suggests post-processing ensembles with sophisticated linear separators such as Support Vector Machines (SVMs). Interestingly, post-processing experimentally gives better performance than classical majority voting. We finish by comparing those results to an approximate infinite ensemble classifier very similar to the one introduced by Lin and Li. This methodology also shows strong learning abilities, comparable to ensemble post-processing.

Vincent Pisetta, Pierre-Emmanuel Jouve, Djamel A. Zighed
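
For readers unfamiliar with the kernel target alignment measure referenced above, the standard definition (due to Cristianini et al., not specific to this paper) is easy to compute; a minimal sketch:

```python
# Standard kernel target alignment, the kernel-quality measure the abstract
# refers to; a minimal background sketch, not code from the paper.
import numpy as np

def kernel_target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """Alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F) for labels y in {-1, +1}."""
    yyT = np.outer(y, y)
    return float(np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT)))

# Toy check: a kernel built from the labels themselves aligns perfectly.
y = np.array([1, 1, -1, -1])
print(kernel_target_alignment(np.outer(y, y).astype(float), y))  # -> 1.0
```
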
Entropy and Margin Maximization for Structured Output Learning

We consider the problem of training discriminative structured output predictors, such as conditional random fields (CRFs) and structured support vector machines (SSVMs). A generalized loss function is introduced, which jointly maximizes the entropy and the margin of the solution. The CRF and SSVM emerge as special cases of our framework. The probabilistic interpretation of large margin methods reveals insights about margin and slack rescaling. Furthermore, we derive the corresponding extensions for latent variable models, in which training operates on partially observed outputs. Experimental results for multiclass classification, linear-chain models and multiple instance learning demonstrate that the generalized loss can improve the accuracy of the resulting classifiers.

Patrick Pletscher, Cheng Soon Ong, Joachim M. Buhmann
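
As background for the generalized loss above: one common way to write a family that interpolates between the CRF log-loss and the SSVM hinge loss uses an inverse temperature β (a sketch of the general idea, not necessarily the paper's exact parameterization):

\[ \ell_\beta(w; x, y^*) \;=\; \frac{1}{\beta}\,\log \sum_{y} \exp\!\big(\beta\,[\,w^\top\phi(x,y) + \Delta(y,y^*)\,]\big) \;-\; w^\top\phi(x,y^*) \]

As β → ∞ the log-sum-exp tends to a maximum, recovering the margin-rescaled structured hinge loss; at β = 1 with Δ ≡ 0 it reduces to the CRF negative log-likelihood.
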
Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms

Given a contact network that changes over time (say, day vs. night connectivity), and the SIS (susceptible/infected/susceptible, flu-like) virus propagation model, what can we say about its epidemic threshold? That is, can we determine when a small infection will “take off” and create an epidemic? Consequently, which nodes should we immunize to prevent an epidemic? This is a very real problem, since, e.g., people have different connections during the day at work and during the night at home. Static graphs have been studied for a long time, with numerous analytical results. Time-evolving networks are so hard to analyze that most existing works are simulation studies [5].

Specifically, our contributions in this paper are: (a) we formulate the problem by approximating it by a Non-linear Dynamical System (NLDS), (b) we derive the first closed formula for the epidemic threshold of time-varying graphs under the SIS model, and finally (c) we show the usefulness of our threshold by presenting efficient heuristics and evaluating the effectiveness of our methods on synthetic and real data like the MIT Reality Mining graphs.

B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos, Christos Faloutsos
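
The flavor of the threshold alluded to above can be sketched as follows: for alternating contact graphs, one checks the leading eigenvalue of the product of per-graph "system matrices" of the form (1 − δ)I + βA. This is a hedged illustration of the general idea; the paper's exact formula and conditions may differ.

```python
# Hedged sketch of an SIS epidemic-threshold check on a time-varying
# (day/night) network: beta = infection rate, delta = recovery rate.
# If the top eigenvalue of the product of system matrices is below 1,
# a small infection is expected to die out.
import numpy as np

def sis_system_eigenvalue(adjacencies, beta, delta):
    n = adjacencies[0].shape[0]
    S = np.eye(n)
    for A in adjacencies:                       # one system matrix per time slice
        S = S @ ((1 - delta) * np.eye(n) + beta * A)
    return np.max(np.abs(np.linalg.eigvals(S)))

# Toy day/night contact graphs on 4 nodes (symmetric 0/1 adjacency matrices).
day = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
night = np.array([[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]], dtype=float)
lam = sis_system_eigenvalue([day, night], beta=0.1, delta=0.4)
print("epidemic dies out" if lam < 1 else "epidemic may take off", lam)
```
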
Adapting Decision DAGs for Multipartite Ranking

Multipartite ranking is a special kind of ranking for problems in which classes exhibit an order. Many applications require its use, for instance, granting loans in a bank, reviewing papers in a conference or just grading exercises in an education environment. Several methods have been proposed for this purpose. The simplest ones resort to regression schemes with a pre- and post-processing of the classes, which makes them barely useful. Other alternatives make use of class order information, or they perform a pairwise classification together with an aggregation function. In this paper we present and discuss two methods based on building a Decision Directed Acyclic Graph (DDAG). Their performance is evaluated over a set of ordinal benchmark data sets according to the C-Index measure. Both yield competitive results with regard to state-of-the-art methods, especially the one based on a probabilistic approach, called PR-DDAG.

José Ramón Quevedo, Elena Montañés, Oscar Luaces, Juan José del Coz
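
The C-Index used for evaluation above is, in one common variant, the fraction of concordant pairs among all pairs drawn from differently ordered classes, with ties counted as one half. A hedged sketch, assuming that variant:

```python
# One common definition of the C-Index for multipartite ranking: over all
# pairs whose true classes differ, count score pairs ordered consistently
# with the class order (ties count 1/2). Illustrative, not the paper's code.
def c_index(scores, labels):
    concordant, total = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if labels[i] < labels[j]:          # class order should be reflected in scores
                total += 1
                if scores[i] < scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / total if total else 0.0

print(c_index([0.1, 0.4, 0.3, 0.2], [0, 1, 1, 2]))  # -> 0.6
```
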
Fast and Scalable Algorithms for Semi-supervised Link Prediction on Static and Dynamic Graphs

Recent years have witnessed widespread interest in methods that use both link structure and node information for link prediction on graphs. One of the state-of-the-art methods is Link Propagation, a semi-supervised learning algorithm for link prediction on graphs that builds on the popularly studied label propagation and exploits information on similarities of links and nodes. Despite its efficiency and effectiveness compared to other methods, its applications have been limited by computational time and space constraints. In this paper, we propose fast and scalable algorithms for Link Propagation by introducing efficient procedures to solve the large linear equations that appear in the method. In particular, we show how to obtain a compact representation of the solution to the linear equations by using a non-trivial combination of techniques in linear algebra, and construct algorithms that are also effective for link prediction on dynamic graphs. These enable us to apply Link Propagation to large networks with more than 400,000 nodes. Experiments demonstrate that our approximation methods are scalable, fast, and competitive in prediction quality.

Rudy Raymond, Hisashi Kashima
Modeling Relations and Their Mentions without Labeled Text

Several recent works on relation extraction have been applying the distant supervision paradigm: instead of relying on annotated text to learn how to predict relations, they employ existing knowledge bases (KBs) as a source of supervision. Crucially, these approaches are trained based on the assumption that each sentence which mentions the two related entities is an expression of the given relation. Here we argue that this leads to noisy patterns that hurt precision, in particular if the knowledge base is not directly related to the text we are working with. We present a novel approach to distant supervision that can alleviate this problem based on the following two ideas: first, we use a factor graph to explicitly model the decision whether two entities are related, and the decision whether this relation is mentioned in a given sentence; second, we apply constraint-driven semi-supervision to train this model without any knowledge about which sentences express the relations in our training KB. We apply our approach to extract relations from the New York Times corpus and use Freebase as the knowledge base. When compared to a state-of-the-art approach for relation extraction under distant supervision, we achieve a 31% error reduction.

Sebastian Riedel, Limin Yao, Andrew McCallum
An Efficient and Scalable Algorithm for Local Bayesian Network Structure Discovery

We present an efficient and scalable constraint-based algorithm, called Hybrid Parents and Children (HPC), to learn the parents and children of a target variable in a Bayesian network. Finding those variables is an important first step in many applications, including Bayesian network structure learning, dimensionality reduction and feature selection. The algorithm combines ideas from incremental and divide-and-conquer methods in a principled and effective way, while still being sound in the sample limit. Extensive empirical experiments are provided on public synthetic and real-world data sets of various sample sizes. The most noteworthy feature of HPC is its ability to handle large neighborhoods, contrary to current CB algorithm proposals. The number of calls to the statistical test, and hence the run-time, is empirically on the order of O(n^1.09), where n is the number of variables, on the five benchmarks that we considered, and O(n^1.21) on a real drug design dataset characterized by 138,351 features.

Sérgio Rodrigues de Morais, Alex Aussem
Selecting Information Diffusion Models over Social Networks for Behavioral Analysis

We investigate how well different information diffusion models can explain observation data by learning their parameters, and discuss which model is better suited to which topic. We use two models (AsIC, AsLT), extensions of the well-known Independent Cascade (IC) and Linear Threshold (LT) models, respectively, that incorporate asynchronous time delay. The model parameters are learned by maximizing the likelihood of the observations, and model selection is performed by choosing the model with better predictive accuracy. We first show, using four real networks, that the proposed learning algorithm learns the model parameters accurately and stably, and that the proposed selection method identifies the correct diffusion model from which the data are generated. We next apply these methods to a behavioral analysis of topic propagation using real blog propagation data, and show that although the relative propagation speed of topics derived from the learned parameter values is rather insensitive to the model selected, there is a clear indication as to which topic better follows which model. The correspondence between the topic and the selected model is well interpretable.

Kazumi Saito, Masahiro Kimura, Kouzou Ohara, Hiroshi Motoda
Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach

In this paper, we introduce a simple but efficient greedy algorithm, called SINCO, for the Sparse INverse COvariance selection problem, which is equivalent to learning a sparse Gaussian Markov Network, and empirically investigate the structure-recovery properties of the algorithm. Our approach is based on a coordinate ascent method which naturally preserves the sparsity of the network structure. We show that SINCO is often comparable to, and, in various cases, outperforms commonly used approaches such as glasso [7] and COVSEL [1], in terms of both structure-reconstruction error (particularly, false positive error) and computational time. Moreover, our method has the advantage of being easily parallelizable. Finally, we show that SINCO’s greedy nature allows reproduction of the regularization path behavior by applying the method to one (sufficiently small) instance of the regularization parameter λ only; thus, SINCO can obtain a desired number of network links directly, without having to tune the λ parameter. We evaluate our method empirically on various simulated networks and real-life data from biological and neuroimaging applications.

Katya Scheinberg, Irina Rish
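
To make the coordinate ascent idea above concrete, here is a deliberately naive sketch of greedy coordinate ascent on the ℓ1-penalized Gaussian log-likelihood. SINCO itself uses efficient closed-form coordinate updates; this toy scores each candidate step by brute force, purely to illustrate how greedy single-coordinate moves preserve sparsity.

```python
# Naive illustration (not the SINCO code) of greedy coordinate ascent on
# f(X) = log det X - tr(S X) - lam * sum|X_ij| over symmetric X.
import numpy as np

def objective(X, S, lam):
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return -np.inf                       # keep iterates positive definite
    return logdet - np.trace(S @ X) - lam * np.abs(X).sum()

def greedy_step(X, S, lam, step=0.05):
    n = X.shape[0]
    best, best_X = objective(X, S, lam), None
    for i in range(n):
        for j in range(i, n):
            for delta in (step, -step):      # try a small symmetric move on one entry
                Xc = X.copy()
                Xc[i, j] += delta
                Xc[j, i] = Xc[i, j]
                val = objective(Xc, S, lam)
                if val > best:
                    best, best_X = val, Xc
    return best_X                            # None when no single coordinate improves f

S = np.array([[1.0, 0.4], [0.4, 1.0]])       # empirical covariance
X = np.eye(2)                                # sparse diagonal starting point
while (nxt := greedy_step(X, S, lam=0.1)) is not None:
    X = nxt
print(np.round(X, 2))
```
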
Online Structural Graph Clustering Using Frequent Subgraph Mining

The goal of graph clustering is to partition objects in a graph database into different clusters based on various criteria such as vertex connectivity, neighborhood similarity or the size of the maximum common subgraph. This can serve to structure the graph space and to improve the understanding of the data. In this paper, we present a novel method for structural graph clustering, i.e. graph clustering without generating features or decomposing graphs into parts. In contrast to many related approaches, the method does not rely on computationally expensive maximum common subgraph (MCS) operations or variants thereof, but on frequent subgraph mining. More specifically, our problem formulation takes advantage of the frequent subgraph miner gSpan (which performs well on many practical problems) without effectively generating thousands of subgraphs in the process. In the proposed clustering approach, clusters encompass all graphs that share a sufficiently large common subgraph, where the common subgraph has to account for at least a user-specified fraction of each member graph’s overall size. The new algorithm works in an online mode (processing one structure after the other) and produces overlapping (non-disjoint) and non-exhaustive clusters. In a series of experiments, we evaluated the effectiveness and efficiency of the structural clustering algorithm on various real-world data sets of molecular graphs.

Madeleine Seeland, Tobias Girschick, Fabian Buchwald, Stefan Kramer
Large-Scale Support Vector Learning with Structural Kernels

In this paper, we present an extensive study of the cutting-plane algorithm (CPA) applied to structural kernels for advanced text classification on large datasets. In particular, we carry out comprehensive experimentation on two interesting natural language tasks, namely predicate argument extraction and question answering. Our results show that (i) CPA applied to train a non-linear model with different tree kernels fully matches the accuracy of the conventional SVM algorithm while being ten times faster, and (ii) by using smaller sampling sizes to approximate subgradients in CPA we can trade off accuracy for speed, yet the optimal parameters and kernels found remain optimal for the exact SVM. These results open numerous research perspectives, e.g. in natural language processing, as they show that complex structural kernels can be efficiently used in real-world applications. For example, for the first time, we could carry out extensive tests of several tree kernels on millions of training instances. As a direct benefit, we could experiment with a variant of the partial tree kernel, which we also propose in this paper.

Aliaksei Severyn, Alessandro Moschitti
Synchronization Based Outlier Detection

The study of extraordinary observations is of great interest in a large variety of applications, such as detecting criminal activities, analyzing athlete performance, and identifying rare events or exceptions. The question is: how can we naturally flag these outliers in a real, complex data set? In this paper, we study outlier detection based on a novel, powerful concept: synchronization. The basic idea is to regard each data object as a phase oscillator and simulate its dynamical behavior over time according to an extensive Kuramoto model. During the process towards synchronization, regular objects and outliers exhibit different interaction patterns. Outlier objects are naturally detected by a local synchronization factor (LSF). An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our method.

Junming Shao, Christian Böhm, Qinli Yang, Claudia Plant
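
A hedged sketch of the synchronization idea above: treat each point as a Kuramoto phase oscillator coupled to its near neighbors, let the dynamics run, and score each point by a local phase-coherence measure. The paper's LSF and interaction model may differ in detail; the radius, coupling strength and scoring below are illustrative.

```python
# Illustrative Kuramoto-style outlier detection (not the paper's exact model):
# points within radius eps pull each other toward a common phase; an isolated
# outlier fails to synchronize and gets a low local coherence score.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), [[3.0, 3.0]]])   # dense cluster + 1 outlier
n = len(X)
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
eps = 0.5                                                    # interaction radius
theta = rng.uniform(0, np.pi, n)                             # initial phases

for _ in range(200):                                         # Kuramoto-style updates
    new = theta.copy()
    for i in range(n):
        nb = np.where((D[i] < eps) & (np.arange(n) != i))[0]
        if len(nb):
            new[i] += 0.3 * np.mean(np.sin(theta[nb] - theta[i]))
    theta = new

# Local coherence over each point's k nearest neighbors plus itself;
# the planted outlier (index 50) typically gets the lowest score.
k = 8
knn = np.argsort(D, axis=1)[:, :k + 1]                       # includes the point itself
coherence = np.abs(np.mean(np.exp(1j * theta[knn]), axis=1))
print("most outlying point:", int(np.argmin(coherence)))
```
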
Laplacian Spectrum Learning

The eigenspectrum of a graph Laplacian encodes smoothness information over the graph. A natural approach to learning involves transforming the spectrum of a graph Laplacian to obtain a kernel. While manual exploration of the spectrum is conceivable, non-parametric learning methods that adjust the Laplacian’s spectrum promise better performance. For instance, adjusting the graph Laplacian using kernel target alignment (KTA) yields better performance when an SVM is trained on the resulting kernel. KTA relies on a simple surrogate criterion to choose the kernel; the obtained kernel is then fed to a large margin classification algorithm. In this paper, we propose novel formulations that jointly optimize relative margin and the spectrum of a kernel defined via Laplacian eigenmaps. The large relative margin case is in fact a strict generalization of the large margin case. The proposed methods show significant empirical advantage over numerous other competing methods.

Pannagadatta K. Shivaswamy, Tony Jebara
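
As background for the abstract above, the following sketch builds a kernel from a transformed Laplacian spectrum using a fixed decreasing transform; the paper's contribution is to learn the spectral transformation jointly with a (relative) margin objective rather than fixing it as done here.

```python
# Background sketch: a kernel K = sum_i r(lambda_i) u_i u_i^T from the
# eigenpairs of a graph Laplacian, with a fixed decreasing transform r
# that favors smooth (low-frequency) eigenvectors.
import numpy as np

A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)  # toy graph
L = np.diag(A.sum(1)) - A                                             # combinatorial Laplacian
lam, U = np.linalg.eigh(L)                                            # eigenvalues ascending
r = 1.0 / (1.0 + lam)                                                 # fixed decreasing transform
K = (U * r) @ U.T                                                     # U diag(r) U^T
print(np.round(K, 3))
```
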
k-Version-Space Multi-class Classification Based on k-Consistency Tests

k-Version spaces were introduced in [6] to handle noisy data. They were defined as sets of k-consistent hypotheses, i.e., hypotheses consistent with all but k instances. Although k-version spaces were applied, their implementation was intractable due to the boundary-set representation.

This paper argues that to classify with k-version spaces we do not need an explicit representation. Instead we need to solve a general k-consistency problem and a general k0-consistency problem. The general k-consistency problem is to test the hypothesis space for a classifier that is k-consistent with the data. The general k0-consistency problem is to test the hypothesis space for a classifier that is k-consistent with the data and 0-consistent with a labeled test instance. Hence, our main result is that k-version-space classification can be (tractably) implemented if we have (tractable) k-consistency-test algorithms and (tractable) k0-consistency-test algorithms. We show how to design these algorithms for any learning algorithm in a multi-class classification setting.

Evgueni Smirnov, Georgi Nalbantov, Nikolay Nikolaev
Complexity Bounds for Batch Active Learning in Classification

Active learning [1] is a branch of Machine Learning in which the learning algorithm, instead of being directly provided with pairs of problem instances and their solutions (their labels), is allowed to choose, from a set of unlabeled data, which instances to query. It is suited to settings where labeling instances is costly. This paper analyzes the speed-up of batch (parallel) active learning compared to sequential active learning (where instances are chosen one by one): how much faster can an algorithm become if it can query λ instances at once?

There are two main contributions: proving lower and upper bounds on the possible gain, and illustrating them by experimenting with usual active learning algorithms. Roughly speaking, the speed-up is asymptotically logarithmic in the batch size λ (i.e., when λ → ∞). However, for some classes of functions with finite VC-dimension V, a linear speed-up can be achieved up to a batch size of V. Practically speaking, this means that parallelizing computations on an expensive-to-label problem which is suited to active learning is very beneficial up to V simultaneous queries, and less interesting (yet still bringing improvement) afterwards.

Philippe Rolet, Olivier Teytaud
Semi-supervised Projection Clustering with Transferred Centroid Regularization

We propose a novel method, called Semi-supervised Projection Clustering in Transfer Learning (SPCTL), where multiple source domains and one target domain are assumed. Traditional semi-supervised projection clustering methods hold the assumption that the data and pairwise constraints are all drawn from the same domain. However, many related data sets with different distributions are available in real applications, so the traditional methods cannot be directly extended to such a scenario. One major challenging issue is how to exploit constraint knowledge from multiple source domains and transfer it to the target domain, where all the data are unlabeled. To handle this difficulty, we are motivated to construct a common subspace where the difference in distributions among domains can be reduced. We also invent a transferred centroid regularization, which acts as a bridge to transfer the constraint knowledge to the target domain, to formulate the geometric structure formed by the centroids from different domains. Extensive experiments on both synthetic and practical data sets show the effectiveness of our method.

Bin Tong, Hao Shao, Bin-Hui Chou, Einoshin Suzuki
Permutation Testing Improves Bayesian Network Learning

We take a peek “under the hood” of constraint-based learning of graphical models such as Bayesian Networks. This mainstream approach to learning is founded on performing statistical tests of conditional independence. In all prior work, however, the tests employed for categorical data are only asymptotically correct, i.e., they converge to the exact p-value in the sample limit. In this paper we present, evaluate, and compare exact tests, based on standard, adjustable, and semi-parametric Monte-Carlo permutation testing procedures appropriate for small sample sizes. It is demonstrated that (a) permutation testing is calibrated, i.e., the actual Type I error matches the significance level α set by the user, which is not the case with asymptotic tests; (b) permutation testing leads to more robust structural learning; and (c) permutation testing allows learning networks from multiple datasets sharing a common underlying structure but different distribution functions (e.g., continuous vs. discrete); we name this problem the Bayesian Network Meta-Analysis problem. In contrast, asymptotic tests may lead to erratic learning behavior in this task (error increasing with total sample size). The semi-parametric permutation procedure we propose is a reasonable approximation of the basic procedure using 5000 permutations, while being only 10-20 times slower than the asymptotic tests for small sample sizes. Thus, this test should be practical in most graphical learning problems and could substitute for asymptotic tests. The conclusions of our studies have ramifications for learning not only Bayesian Networks but other graphical models too, and for related causal-based variable selection algorithms, such as HITON. The code is available at mensxmachina.org.

Ioannis Tsamardinos, Giorgos Borboudakis
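
A minimal sketch of the basic Monte-Carlo permutation test of conditional independence for categorical data, in the spirit of the procedures evaluated above (the paper's adjustable and semi-parametric variants refine this; function names and defaults below are illustrative). Permuting X within each stratum of Z preserves P(X|Z) and P(Y|Z) under the null X ⟂ Y | Z.

```python
# Basic permutation test of X independent of Y given Z for categorical data,
# using the G^2 statistic; an illustrative sketch, not the paper's code.
import numpy as np

def g2_statistic(x, y, z, kx, ky, kz):
    g2 = 0.0
    for c in range(kz):                       # one contingency table per stratum of Z
        m = z == c
        if m.sum() == 0:
            continue
        obs = np.zeros((kx, ky))
        np.add.at(obs, (x[m], y[m]), 1)
        exp = np.outer(obs.sum(1), obs.sum(0)) / m.sum()
        nz = obs > 0
        g2 += 2 * np.sum(obs[nz] * np.log(obs[nz] / exp[nz]))
    return g2

def permutation_p_value(x, y, z, kx=2, ky=2, kz=2, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    observed = g2_statistic(x, y, z, kx, ky, kz)
    hits = 0
    for _ in range(n_perm):
        xp = x.copy()
        for c in range(kz):                   # permute X only within each Z stratum
            idx = np.where(z == c)[0]
            xp[idx] = xp[rng.permutation(idx)]
        hits += g2_statistic(xp, y, z, kx, ky, kz) >= observed
    return (hits + 1) / (n_perm + 1)          # add-one correction

# Toy data where X and Y are conditionally independent given Z:
rng = np.random.default_rng(42)
z = rng.integers(0, 2, 200)
x = (z + rng.integers(0, 2, 200)) % 2
y = (z + rng.integers(0, 2, 200)) % 2
print(permutation_p_value(x, y, z))           # expected to be large (no evidence)
```
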
Example-dependent Basis Vector Selection for Kernel-Based Classifiers

We study methods for speeding up classification time of kernel-based classifiers. Existing solutions are based on explicitly seeking sparse classifiers during training, or by using budgeted versions of the classifier where one directly limits the number of basis vectors allowed. Here, we propose a more flexible alternative: instead of using the same basis vectors over the whole feature space, our solution uses different basis vectors in different parts of the feature space. At the core of our solution lies an optimization procedure that, given a set of basis vectors, finds a good partition of the feature space and good subsets of the existing basis vectors. Using this procedure repeatedly, we build trees whose internal nodes specify feature space partitions and whose leaves implement simple kernel classifiers. Experiments suggest that our method reduces classification time significantly while maintaining performance. In addition, we propose several heuristics that also perform well.

Antti Ukkonen, Marta Arias
Surprising Patterns for the Call Duration Distribution of Mobile Phone Users

How long are the phone calls of mobile users? What are the chances of a call ending, given its current duration? Here we answer these questions by studying the call duration distributions (CDDs) of individual users in large mobile networks. We analyzed a large, real network of 3.1 million users and more than one billion phone call records from a private mobile phone company of a large city, spanning 0.1 TB. Our first contribution is the TLAC distribution to fit the CDD of each user; TLAC is the truncated version of the so-called log-logistic distribution, a skewed, power-law-like distribution. We show that the TLAC is an excellent fit for the overwhelming majority of our users (more than 96% of them), much better than exponential or lognormal. Our second contribution is the MetaDist to model the collective behavior of the users given their CDDs. We show that the MetaDist accurately and succinctly describes the call duration behavior of users in large mobile networks. All of our methods are fast, and scale linearly with the number of customers.

Pedro O. S. Vaz de Melo, Leman Akoglu, Christos Faloutsos, Antonio A. F. Loureiro
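
As a hedged illustration of the distribution fitting described above: SciPy ships the (untruncated) log-logistic as the Fisk distribution, so a TLAC-style per-user fit can be approximated as follows; the paper fits the truncated form.

```python
# Approximate a TLAC-style fit with the plain log-logistic (SciPy's "fisk");
# the toy durations below stand in for one user's call records.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
durations = stats.fisk.rvs(c=1.5, scale=60.0, size=5000, random_state=rng)  # seconds

c, loc, scale = stats.fisk.fit(durations, floc=0)    # fix location at zero
print(f"shape c = {c:.2f}, median duration = {scale:.1f} s")  # scale is the median

# Compare against a lognormal alternative via log-likelihoods.
ll_fisk = np.sum(stats.fisk.logpdf(durations, c, 0, scale))
s, loc2, scale2 = stats.lognorm.fit(durations, floc=0)
print(ll_fisk - np.sum(stats.lognorm.logpdf(durations, s, 0, scale2)))  # > 0 favors log-logistic
```
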
Variational Bayesian Mixture of Robust CCA Models

We study the problem of extracting statistical dependencies between multivariate signals, to be used for exploratory analysis of complicated natural phenomena. In particular, we develop generative models for extracting the dependencies, made possible by the probabilistic interpretation of canonical correlation analysis (CCA). We introduce a mixture of robust canonical correlation analyzers, using the t-distribution to make the model robust to outliers, and variational Bayesian inference for learning from noisy data. We demonstrate the improvements of the new model on artificial data, and further apply it to analyzing dependencies between MEG signals and measurements of the autonomic nervous system to illustrate potential use scenarios.

Jaakko Viinikanoja, Arto Klami, Samuel Kaski
Adverse Drug Reaction Mining in Pharmacovigilance Data Using Formal Concept Analysis

In this paper we discuss the problem of extracting and evaluating associations between drugs and adverse effects in pharmacovigilance data. Approaches proposed by the medical informatics community for mining one-drug/one-effect pairs perform an exhaustive search strategy that precludes mining higher-order associations. Some specificities of pharmacovigilance data prevent the application of pattern mining approaches proposed by the data mining community for similar problems dealing with epidemiological studies. We argue that Formal Concept Analysis (FCA) and concept lattices constitute a suitable framework both for identifying relevant associations and for assisting experts in their evaluation task. Demographic attributes are handled so that the disproportionality of an association is computed w.r.t. the relevant population stratum to prevent confounding. We put the focus on the understandability of the results and provide evaluation facilities for experts. A real case study on a subset of the French spontaneous reporting system shows that the method identifies known adverse drug reactions and some unknown associations that have to be further investigated.

Jean Villerd, Yannick Toussaint, Agnès Lillo-Le Louët
Topic Models Conditioned on Relations

Latent Dirichlet allocation is a fully generative statistical language model that has been proven to be successful in capturing both the content and the topics of a corpus of documents. Recently, it was even shown that relations among documents such as hyper-links or citations allow one to share information between documents and in turn to improve topic generation. Although fully generative, in many situations we are actually not interested in predicting relations among documents. In this paper, we therefore present a Dirichlet-multinomial nonparametric regression topic model that includes a Gaussian process prior on joint document and topic distributions that is a function of document relations. On networks of scientific abstracts and of Wikipedia documents we show that this approach meets or exceeds the performance of several baseline topic models.

Mirwaes Wahabzada, Zhao Xu, Kristian Kersting
Shift-Invariant Grouped Multi-task Learning for Gaussian Processes

Multi-task learning leverages shared information among data sets to improve the learning performance of individual tasks. The paper applies this framework to data where each task is a phase-shifted periodic time series. In particular, we develop a novel Bayesian nonparametric model capturing a mixture of Gaussian processes where each task is a sum of a group-specific function and a component capturing individual variation, in addition to each task being phase shifted. We develop an efficient EM algorithm to learn the parameters of the model. As a special case we obtain the Gaussian mixture model and EM algorithm for phase-shifted periodic time series. Experiments in regression, classification and class discovery demonstrate the performance of the proposed model using both synthetic data and real-world time series data from astrophysics. Our methods are particularly useful when the time series are sparsely and non-synchronously sampled.

Yuyang Wang, Roni Khardon, Pavlos Protopapas
Nonparametric Bayesian Clustering Ensembles

Forming consensus clusters from multiple input clusterings can improve accuracy and robustness. Current clustering ensemble methods require specifying the number of consensus clusters; a poor choice can lead to under- or over-fitting. This paper proposes a nonparametric Bayesian clustering ensemble (NBCE) method, which can discover the number of clusters in the consensus clustering. Three inference methods are considered: collapsed Gibbs sampling, variational Bayesian inference, and collapsed variational Bayesian inference. Comparison of NBCE with several other algorithms demonstrates its versatility and superior stability.

Pu Wang, Carlotta Domeniconi, Kathryn Blackmond Laskey
Directed Graph Learning via High-Order Co-linkage Analysis

Many real-world applications can be naturally formulated as a directed graph learning problem. How to extract the directed link structures of a graph and how to use labeled vertices are the key issues in inferring the labels of the remaining unlabeled vertices. However, directed graph learning is not well studied in the data mining and machine learning areas. In this paper, we propose a novel Co-linkage Analysis (CA) method to process directed graphs in an undirected way while preserving the directional information. On the induced undirected graph, we use a Green’s function approach to solve the semi-supervised learning problem. We present a new zero-mode-free Laplacian which is invertible. This leads to an Improved Green’s Function (IGF) method to solve the classification problem, which is also extended to deal with multi-label classification problems. Promising results in extensive experimental evaluations on real data sets demonstrate the effectiveness of our approach.

Hua Wang, Chris Ding, Heng Huang
Incorporating Domain Models into Bayesian Optimization for RL

In many Reinforcement Learning (RL) domains there is a high cost for generating experience in order to evaluate an agent’s performance. An appealing approach to reducing the number of expensive evaluations is Bayesian Optimization (BO), a framework for global optimization of noisy and costly-to-evaluate functions. Prior work in a number of RL domains has demonstrated the effectiveness of BO for optimizing parametric policies. However, those approaches completely ignore the state-transition sequence of policy executions and only consider the total reward achieved. In this paper, we study how to more effectively incorporate all of the information observed during policy executions into the BO framework. In particular, our approach uses the observed data to learn approximate transition models that allow for Monte-Carlo predictions of policy returns. The models are then incorporated into the BO framework as a type of prior on policy returns, which can better inform the BO process. The resulting algorithm provides a new approach for leveraging learned models in RL even when there is no planner available for exploiting those models. We demonstrate the effectiveness of our algorithm in four benchmark domains, which have dynamics of variable complexity. Results indicate that our algorithm effectively combines model-based predictions to improve the data efficiency of model-free BO methods, and is robust to modeling errors when parts of the domain cannot be modeled successfully.

Aaron Wilson, Alan Fern, Prasad Tadepalli
Efficient and Numerically Stable Sparse Learning

We consider the problem of numerical stability and model density growth when training a sparse linear model from massive data. We focus on scalable algorithms that optimize a certain loss function using gradient descent, with either ℓ0 or ℓ1 regularization. We observed numerical stability problems in several existing methods, leading to divergence and low accuracy. In addition, these methods typically have weak controls over sparsity, such that model density grows faster than necessary. We propose a framework to address the above problems. First, the update rule is numerically stable with a convergence guarantee and results in more reasonable models. Second, besides ℓ1 regularization, it exploits the sparsity of the data distribution and achieves a higher degree of sparsity with a PAC generalization error bound. Lastly, it is parallelizable and suitable for training large margin classifiers on huge datasets. Experiments show that the proposed method converges consistently and outperforms other baselines using 10% of features by as much as 6% reduction in error rate on average. Datasets and software are available from the authors.

Sihong Xie, Wei Fan, Olivier Verscheure, Jiangtao Ren
Fast Active Exploration for Link-Based Preference Learning Using Gaussian Processes

In preference learning, the algorithm observes pairwise relative judgments (preference) between items as training data for learning an ordering of all items. This is an important learning problem for applications where absolute feedback is difficult to elicit, but pairwise judgments are readily available (e.g., via implicit feedback [13]). While it was already shown that active learning can effectively reduce the number of training pairs needed, the most successful existing algorithms cannot generalize over items or queries. Considering web search as an example, they would need to learn a separate relevance score for each document-query pair from scratch. To overcome this inefficiency, we propose a link-based active preference learning method based on Gaussian Processes (GPs) that incorporates dependency information from both feature-vector representations as well as relations. Specifically, to meet the requirement on computational efficiency of active exploration, we introduce a novel incremental update method that scales as well as the non-generalizing models. The proposed algorithm is evaluated on datasets for information retrieval, showing that it learns substantially faster than algorithms that cannot model dependencies.

Zhao Xu, Kristian Kersting, Thorsten Joachims
Many-to-Many Graph Matching: A Continuous Relaxation Approach

Graphs provide an efficient tool for object representation in various machine learning applications. Once graph-based representations are constructed, an important question is how to compare graphs. This problem is often formulated as a graph matching problem where one seeks a mapping between vertices of two graphs which optimally aligns their structure. In the classical formulation of graph matching, only one-to-one correspondences between vertices are considered. However, in many applications, graphs cannot be matched perfectly and it is more interesting to consider many-to-many correspondences where clusters of vertices in one graph are matched to clusters of vertices in the other graph. In this paper, we formulate the many-to-many graph matching problem as a discrete optimization problem and propose two approximate algorithms based on alternative continuous relaxations of the combinatorial problem. We compare new methods with other existing methods on several benchmark datasets.

Mikhail Zaslavskiy, Francis Bach, Jean-Philippe Vert
Competitive Online Generalized Linear Regression under Square Loss

We apply the Aggregating Algorithm to the problem of online regression under the square loss function. We develop an algorithm competitive with the benchmark class of generalized linear models (our “experts”), which are used in a wide range of practical tasks. This problem does not appear to be analytically tractable. Therefore, we develop a prediction algorithm using the Markov chain Monte Carlo method, which is shown to be fast and reliable in many cases. We prove upper bounds on the cumulative square loss of the algorithm. We also perform experiments with our algorithm on a toy data set and two real world ozone level data sets and give suggestions about choosing its parameters.

Fedor Zhdanov, Vladimir Vovk
Cross Validation Framework to Choose amongst Models and Datasets for Transfer Learning

One solution to the lack-of-labels problem is to exploit transfer learning, whereby one acquires knowledge from source domains to improve the learning performance in the target domain. The main challenge is that the source and target domains may have different distributions. An open problem is how to select the available models (including algorithms and parameters) and, importantly, the abundance of source-domain data, through statistically reliable methods, thus making transfer learning practical and easy to use for real-world applications. To address this challenge, one needs to take into account the difference in both marginal and conditional distributions at the same time, not just one of them. In this paper, we formulate a new criterion to overcome this “double” distribution shift and present a practical approach, “Transfer Cross Validation” (TrCV), to select both models and data in a cross validation framework optimized for transfer learning. The idea is to use density ratio weighting to overcome the difference in marginal distributions, and to propose a “reverse validation” procedure to quantify how well a model approximates the true conditional distribution of the target domain. The usefulness of TrCV is demonstrated on different cross-domain tasks, including wine quality evaluation, web-user ranking and text categorization. The experimental results show that the proposed method outperforms both traditional cross-validation and a state-of-the-art method which only considers marginal distribution shift. The software and datasets are available from the authors.

Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, Jiangtao Ren
Fast, Effective Molecular Feature Mining by Local Optimization

In structure-activity relationships (SAR) one aims at finding classifiers that predict the biological or chemical activity of a compound from its molecular graph. Many approaches to SAR use sets of binary substructure features, which test for the occurrence of certain substructures in the molecular graph. As an alternative to enumerating very large sets of frequent patterns, numerous pattern set mining and pattern set selection techniques have been proposed. Existing approaches can be broadly classified into those that focus on minimizing correspondences, that is, the number of pairs of training instances from different classes with identical encodings, and those that focus on maximizing the number of equivalence classes, that is, unique encodings in the training data. In this paper we evaluate a number of techniques to investigate which criterion is a better indicator of predictive accuracy. We find that minimizing correspondences is a necessary but not sufficient condition for good predictive accuracy, that equivalence classes are a better indicator of success, and that it is important to have a good match between training set and pattern set size. Based on these results we propose a new, improved algorithm which performs local minimization of correspondences, yet evaluates the effect of patterns on equivalence classes globally. Empirical experiments demonstrate its efficacy and its superior run time behavior.

Albrecht Zimmermann, Björn Bringmann, Ulrich Rückert
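
The two criteria compared above are straightforward to compute from binary substructure encodings; a small illustrative sketch (toy encodings, not from the paper):

```python
# With binary substructure features, each molecule gets a bit-vector encoding.
# "Correspondences" counts pairs from different classes with identical
# encodings; "equivalence classes" counts distinct encodings.
from itertools import combinations

def correspondences(encodings, labels):
    return sum(1 for (e1, l1), (e2, l2) in
               combinations(zip(encodings, labels), 2)
               if e1 == e2 and l1 != l2)

def equivalence_classes(encodings):
    return len(set(encodings))

enc = [(1, 0), (1, 0), (0, 1), (1, 1)]    # encodings under two substructure tests
lab = ['active', 'inactive', 'active', 'active']
print(correspondences(enc, lab))           # -> 1 (the first two molecules collide)
print(equivalence_classes(enc))            # -> 3
```
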

Demo Papers

AnswerArt - Contextualized Question Answering

The focus of this paper is a question answering system, where the answers are retrieved from a collection of textual documents. The system also includes automatic document summarization and document visualization by means of a semantic graph. The information extracted from the documents is stored as subject-predicate-object triplets, and the indexed terms are expanded using Cyc, a large common sense ontology.

Lorand Dali, Delia Rusu, Blaž Fortuna, Dunja Mladenić, Marko Grobelnik
Real-Time News Recommender System

In this demo we present a robust system for delivering real-time news recommendations to the user based on the user’s history of past visits to the site, the current user context, and the popularity of stories. Our system is running live, providing real-time recommendations of news articles. The system avoids overspecialization because we recommend categories as opposed to items; it implicitly uses collaboration by taking into account user context and popular items; and it can handle new users by using context information. A unique characteristic of our system is that it prefers freshness over relevance, which is important for recommending news articles in the real-world setting addressed here. We experimentally compare the proposed approach as implemented in our system against several state-of-the-art alternatives and show that it significantly outperforms them.

Blaž Fortuna, Carolina Fortuna, Dunja Mladenić
CET: A Tool for Creative Exploration of Graphs

We present a tool for interactive exploration of graphs that integrates advanced graph mining methods into an interactive visualization framework. The tool enables efficient exploration and analysis of complex graph structures. For flexible integration of state-of-the-art graph mining methods, the viewer makes use of the open source data mining platform KNIME. In contrast to existing graph visualization interfaces, all parts of the interface can be dynamically adapted to specific visualization requirements, including the use of node-type-dependent icons, methods for marking and highlighting nodes or edges, and a fluent graph view that allows for iterative growing, shrinking and abstraction of (sub)graphs.

Stefan Haun, Andreas Nürnberger, Tobias Kötter, Kilian Thiel, Michael R. Berthold
NewsGist: A Multilingual Statistical News Summarizer

In this paper we present NewsGist, a multilingual, multi-document news summarization system underpinned by the Singular Value Decomposition (SVD) paradigm for document summarization and purpose-built for the Europe Media Monitor (EMM). The summarization method employed yielded state-of-the-art performance for English at the Update Summarization task of the last Text Analysis Conference (TAC) 2009; integrated with EMM, it represents the first online summarization system able to produce summaries for so many languages. We discuss the context and motivation for developing the system and provide an overview of its architecture. The paper is intended to serve as an accompaniment to a live demo of the system, which can be of interest to researchers and engineers working on multilingual open-source news analysis and mining.

Mijail Kabadjov, Martin Atkinson, Josef Steinberger, Ralf Steinberger, Erik van der Goot
QUEST: Query Expansion Using Synonyms over Time

A particular problem of searching news archives with named entities is that they are very dynamic in appearance compared to other vocabulary terms, and synonym relationships between terms change with time. In previous work, we proposed an approach to extracting time-based synonyms of named entities from the whole history of Wikipedia. In this paper, we present QUEST (Query Expansion using Synonyms over Time), a system that exploits time-based synonyms in searching news archives. The system takes as input a named entity query, and automatically determines time-based synonyms for a given query w.r.t. time criteria. Query expansion using the determined synonyms can be employed in order to improve the retrieval effectiveness.

Nattiya Kanhabua, Kjetil Nørvåg
Flu Detector - Tracking Epidemics on Twitter

We present an automated tool with a web interface for tracking the prevalence of Influenza-like Illness (ILI) in several regions of the United Kingdom using the contents of Twitter’s microblogging service. Our data comprise a daily average of approximately 200,000 geolocated tweets collected by targeting 49 urban centres in the UK over a time period of 40 weeks. Official ILI rates from the Health Protection Agency (HPA) form our ground truth. Bolasso, the bootstrapped version of LASSO, is applied in order to extract a consistent set of features, which are then used for learning a regression model.

Vasileios Lampos, Tijl De Bie, Nello Cristianini
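
A hedged sketch of the Bolasso-style selection step described above: run the LASSO on bootstrap resamples and keep only the features selected in (almost) every run. Thresholds and parameters here are illustrative, not the paper's.

```python
# Bootstrapped LASSO feature selection (Bolasso-flavored, illustrative only):
# features whose coefficients are non-zero in most bootstrap fits are kept.
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_support(X, y, alpha=0.05, n_boot=100, keep_frac=0.9, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    votes = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                   # bootstrap resample
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        votes += model.coef_ != 0
    return np.where(votes >= keep_frac * n_boot)[0]   # consistently selected features

# Toy data: only the first 3 of 20 features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)
print(bolasso_support(X, y))                          # typically -> [0 1 2]
```
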
X-SDR: An Extensible Experimentation Suite for Dimensionality Reduction

Due to the vast amount and pace of high-dimensional data production, dimensionality reduction emerges as an important requirement in many application areas. In this paper, we introduce X-SDR, a prototype designed specifically for the deployment and assessment of dimensionality reduction techniques. X-SDR is an integrated environment for dimensionality reduction and knowledge discovery that can be effectively used in the data mining process. In the current version, it supports communication with different database management systems and integrates a wealth of dimensionality reduction algorithms, both distributed and centralized. Additionally, it interacts with Weka, thus enabling the exploitation of the data mining algorithms therein. Finally, X-SDR provides an API that enables the integration and evaluation of any dimensionality reduction algorithm.

Panagis Magdalinos, Anastasios Kapernekas, Alexandros Mpiratsis, Michalis Vazirgiannis
SOREX: Subspace Outlier Ranking Exploration Toolkit

Outlier mining is an important data analysis task for distinguishing exceptional outliers from regular objects. In recent research, novel outlier ranking methods propose to focus on outliers hidden in subspace projections of the data. However, by focusing only on the detection of outliers, these approaches fail to provide reasons why an object should be considered an outlier.

In this work, we propose a novel toolkit for the exploration of subspace outlier rankings. To enable exploration of subspace outliers and to complete knowledge extraction, we provide further descriptive information in addition to the pure detection of outliers. As witnesses for the outlierness of an object, we provide information about the relevant projections describing the reasons for its outlier properties. We provide SOREX as an open source framework on our website; it is easily extensible and suitable for research and educational purposes in this emerging research area.

Emmanuel Müller, Matthias Schiffer, Patrick Gerwert, Matthias Hannen, Timm Jansen, Thomas Seidl
KDTA: Automated Knowledge-Driven Text Annotation

In this paper we demonstrate a system that automatically annotates text documents with the concepts of a given domain ontology. The annotation process utilizes lexical and Web resources to analyze the semantic similarity of text components with any of the ontology concepts, and outputs a list of the proposed annotations, accompanied by appropriate confidence values. The demonstrated system is available online and free to use, and it constitutes one of the main components of the KDTA (Knowledge-Driven Text Analysis) module of the CASAM European research project.

Katerina Papantoniou, George Tsatsaronis, Georgios Paliouras
Detecting Events in a Million New York Times Articles

We present a demonstration of a newly developed text stream event detection method on over a million articles from the New York Times corpus. The event detection is designed to operate in a predominantly on-line fashion, reporting new events within a specified timeframe. The event detection is achieved by detecting significant changes in the statistical properties of the text where those properties are efficiently stored and updated in a suffix tree.

This particular demonstration shows how our method is effective at discovering both short- and long-term events (often referred to as topics), and how it automatically copes with topic drift on a corpus of 1 035 263 articles.

Tristan Snowsill, Ilias Flaounas, Tijl De Bie, Nello Cristianini
Experience STORIES: A Visual News Search and Summarization System

Data collections available on the Internet have for many people become the main medium for staying informed about the world. Many of these collections are dynamic by nature, evolving as the subjects they describe change. We present the STORIES system for (a) learning an abstracted story representation from a collection of time-indexed documents; (b) visualizing it in a way that encourages users to interact and explore in order to discover temporal “story stages” depending on their interests; and (c) supporting the search for documents and facts that pertain to the user-constructed story stages.

Ilija Subašić, Bettina Berendt
Exploring Real Mobility Data with M-Atlas

Research on moving-object data analysis has recently been fostered by the widespread diffusion of new techniques and systems for monitoring, collecting and storing location-aware data, generated by a wealth of technological infrastructures such as GPS positioning and wireless networks. These have made available massive repositories of spatio-temporal data recording human mobile activities, which call for suitable analytical methods capable of enabling the development of innovative, location-aware applications [3].

R. Trasarti, S. Rinzivillo, F. Pinelli, M. Nanni, A. Monreale, C. Renso, D. Pedreschi, F. Giannotti
Backmatter
Metadata
Title
Machine Learning and Knowledge Discovery in Databases
Editors
José Luis Balcázar
Francesco Bonchi
Aristides Gionis
Michèle Sebag
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-15939-8
Print ISBN
978-3-642-15938-1
DOI
https://doi.org/10.1007/978-3-642-15939-8
