Ensembles

AR-Boost: Reducing Overfitting by a Robust Data-Driven Regularization Strategy

We introduce a novel, robust data-driven regularization strategy called Adaptive Regularized Boosting (AR-Boost), motivated by a desire to reduce overfitting. We replace AdaBoost’s hard margin with a regularized soft margin that trades-off between a larger margin, at the expense of misclassification errors. Minimizing this regularized exponential loss results in a boosting algorithm that

relaxes the weak learning assumption further

: it can use classifiers with error greater than

$\frac{1}{2}$

. This enables a natural extension to multiclass boosting, and further reduces overfitting in both the binary and multiclass cases. We derive bounds for training and generalization errors, and relate them to AdaBoost. Finally, we show empirical results on benchmark data that establish the robustness of our approach and improved performance overall.

Baidya Nath Saha, Gautam Kunapuli, Nilanjan Ray, Joseph A. Maldjian, Sriraam Natarajan

Parallel Boosting with Momentum

We describe a new, simplified, and general analysis of a fusion of Nesterov’s accelerated gradient with parallel coordinate descent. The resulting algorithm, which we call BOOM, for

boo

sting with

m

omentum, enjoys the merits of both techniques. Namely, BOOM retains the momentum and convergence properties of the accelerated gradient method while taking into account the curvature of the objective function. We describe a

distributed

implementation of BOOM which is suitable for massive high dimensional datasets. We show experimentally that BOOM is especially effective in large scale learning problems with rare yet informative features.

Indraneel Mukherjee, Kevin Canini, Rafael Frongillo, Yoram Singer

Inner Ensembles: Using Ensemble Methods Inside the Learning Algorithm

Ensemble Methods represent an important research area within machine learning. Here, we argue that the use of such methods can be generalized and applied in many more situations than they have been previously. Instead of using them only to combine the output of an algorithm, we can apply them to the decisions made inside the learning algorithm, itself. We call this approach Inner Ensembles. The main contribution of this work is to demonstrate how broadly this idea can applied. Specifically, we show that the idea can be applied to different classes of learner such as Bayesian networks and K-means clustering.

Houman Abbasian, Chris Drummond, Nathalie Japkowicz, Stan Matwin

Statistical Learning

Learning Discriminative Sufficient Statistics Score Space for Classification

Generative score spaces provide a principled method to exploit generative information, e.g., data distribution and hidden variables, in discriminative classifiers. The underlying methodology is to derive measures or score functions from generative models. The derived score functions, spanning the so-called score space, provide features of a fixed dimension for discriminative classification. In this paper, we propose a simple yet effective score space which is essentially the sufficient statistics of the adopted generative models and does not involve the parameters of generative models. We further propose a discriminative learning method for the score space that seeks to utilize label information by constraining the classification margin over the score space. The form of score function allows the formulation of simple learning rules, which are essentially the same learning rules for a generative model with an extra posterior imposed over its hidden variables. Experimental evaluation of this approach over two generative models shows that performance of the score space approach coupled with the proposed discriminative learning method is competitive with state-of-the-art classification methods.

Xiong Li, Bin Wang, Yuncai Liu, Tai Sing Lee

The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited

We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm, the update rule may be transformed into the one of the classical perceptron with margin in which the margin threshold increases linearly with the number of steps. Moreover, if we cycle repeatedly through the possibly randomly permuted training set the dual variables defined naturally via the expansion of the weight vector as a linear combination of the patterns on which margin errors were made are shown to obey at the end of each complete cycle automatically the box constraints arising in dual optimization. This renders the dual Lagrangian a running lower bound on the primal objective tending to it at the optimum and makes available an upper bound on the relative accuracy achieved which provides a meaningful stopping criterion. In addition, we propose a mechanism of presenting the same pattern repeatedly to the algorithm which maintains the above properties. Finally, we give experimental evidence that algorithms constructed along these lines exhibit a considerably improved performance.

Constantinos Panagiotakopoulos, Petroula Tsampouka

Bundle CDN: A Highly Parallelized Approach for Large-Scale ℓ1-Regularized Logistic Regression

Parallel coordinate descent algorithms emerge with the growing demand of large-scale optimization. In general, previous algorithms are usually limited by their divergence under high degree of parallelism (DOP), or need data pre-process to avoid divergence. To better exploit parallelism, we propose a coordinate descent based parallel algorithm without needing of data pre-process, termed as Bundle Coordinate Descent Newton (BCDN), and apply it to large-scale ℓ

1

-regularized logistic regression. BCDN first randomly partitions the feature set into

Q

non-overlapping subsets/bundles in a Gauss-Seidel manner, where each bundle contains

P

features. For each bundle, it finds the descent directions for the

P

features in parallel, and performs

P

-dimensional Armijo line search to obtain the stepsize. By theoretical analysis on global convergence, we show that BCDN is guaranteed to converge with a high DOP. Experimental evaluations over five public datasets show that BCDN can better exploit parallelism and outperforms state-of-the-art algorithms in speed, without losing testing accuracy.

Yatao Bian, Xiong Li, Mingqi Cao, Yuncai Liu

MORD: Multi-class Classifier for Ordinal Regression

We show that classification rules used in ordinal regression are equivalent to a certain class of linear multi-class classifiers. This observation not only allows to design new learning algorithms for ordinal regression using existing methods for multi-class classification but it also allows to derive new models for ordinal regression. For example, one can convert learning of ordinal classifier with (almost) arbitrary loss function to a convex unconstrained risk minimization problem for which many efficient solvers exist. The established equivalence also allows to increase discriminative power of the ordinal classifier without need to use kernels by introducing a piece-wise ordinal classifier. We demonstrate advantages of the proposed models on standard benchmarks as well as in solving a real-life problem. In particular, we show that the proposed piece-wise ordinal classifier applied to visual age estimation outperforms other standard prediction models.

Kostiantyn Antoniuk, Vojtěch Franc, Václav Hlaváč

Identifiability of Model Properties in Over-Parameterized Model Classes

Classical learning theory is based on a tight linkage between hypothesis space (a class of function on a domain

X

), data space (function-value examples (

x

,

f

(

x

))), and the space of queries for the learned model (predicting function values for new examples

x

). However, in many learning scenarios the 3-way association between hypotheses, data, and queries can really be much looser. Model classes can be over-parameterized, i.e., different hypotheses may be equivalent with respect to the data observations. Queries may relate to model properties that do not directly correspond to the observations in the data. In this paper we make some initial steps to extend and adapt basic concepts of computational learnability and statistical identifiability to provide a foundation for investigating learnability in such broader contexts. We exemplify the use of the framework in three different applications: the identification of temporal logic properties of probabilistic automata learned from sequence data, the identification of causal dependencies in probabilistic graphical models, and the transfer of probabilistic relational models to new domains.

Manfred Jaeger

Semi-supervised Learning

Exploratory Learning

In multiclass semi-supervised learning (SSL), it is sometimes the case that the number of classes present in the data is not known, and hence no labeled examples are provided for some classes. In this paper we present variants of well-known semi-supervised multiclass learning methods that are robust when the data contains an unknown number of classes. In particular, we present an “exploratory” extension of expectation-maximization (EM) that explores different numbers of classes while learning. “Exploratory” SSL greatly improves performance on three datasets in terms of F1 on the classes

with

seed examples—i.e., the classes which are expected to be in the data. Our Exploratory EM algorithm also outperforms a SSL method based non-parametric Bayesian clustering.

Bhavana Dalvi, William W. Cohen, Jamie Callan

Semi-supervised Gaussian Process Ordinal Regression

Ordinal regression problem arises in situations where examples are rated in an ordinal scale. In practice, labeled ordinal data are difficult to obtain while unlabeled ordinal data are available in abundance. Designing a probabilistic semi-supervised classifier to perform ordinal regression is challenging. In this work, we propose a novel approach for semi-supervised ordinal regression using Gaussian Processes (GP). It uses the expectation-propagation approximation idea, widely used for GP ordinal regression problem. The proposed approach makes use of unlabeled data in addition to the labeled data to learn a model by matching ordinal label distributions approximately between labeled and unlabeled data. The resulting mixed integer programming problem, involving model parameters (real-valued) and ordinal labels (integers) as variables, is solved efficiently using a sequence of alternating optimization steps. Experimental results on synthetic, bench-mark and real-world data sets demonstrate that the proposed GP based approach makes effective use of the unlabeled data to give better generalization performance (on the absolute error metric, in particular) than the supervised approach. Thus, it is a useful approach for probabilistic semi-supervised ordinal regression problem.

P. K. Srijith, Shirish Shevade, S. Sundararajan

Influence of Graph Construction on Semi-supervised Learning

A variety of graph-based semi-supervised learning (SSL) algorithms and graph construction methods have been proposed in the last few years. Despite their apparent empirical success, the field of SSL lacks a detailed study that empirically evaluates the influence of graph construction on SSL. In this paper we provide such an experimental study. We combine a variety of graph construction methods as well as a variety of graph-based SSL algorithms and empirically compare them on a number of benchmark data sets widely used in the SSL literature. The empirical evaluation proposed in this paper is subdivided into four parts: (1) best case analysis; (2) classifiers’ stability evaluation; (3) influence of graph construction; and (4) influence of regularization parameters. The purpose of our experiments is to evaluate the trade-off between classification performance and stability of the SSL algorithms on a variety of graph construction methods and parameter values. The obtained results show that the mutual

k

-nearest neighbors (mutKNN) graph may be the best choice for adjacency graph construction while the RBF kernel may be the best choice for weighted matrix generation. In addition, mutKNN tends to generate smoother error surfaces than other adjacency graph construction methods. However, mutKNN is unstable for a relatively small value of

k

. Our results indicate that the classification performance of the graph-based SSL algorithms are heavily influenced by the parameters setting and we found no evident explorable pattern to relay to future practitioners. We discuss the consequences of such instability in research and practice.

Celso André R. de Sousa, Solange O. Rezende, Gustavo E. A. P. A. Batista

Tractable Semi-supervised Learning of Complex Structured Prediction Models

Semi-supervised learning has been widely studied in the literature. However, most previous works assume that the output structure is simple enough to allow the direct use of tractable inference/learning algorithms (e.g., binary label or linear chain). Therefore, these methods cannot be applied to problems with complex structure. In this paper, we propose an approximate semi-supervised learning method that uses piecewise training for estimating the model weights and a dual decomposition approach for solving the inference problem of finding the labels of unlabeled data subject to domain specific constraints. This allows us to extend semi-supervised learning to general structured prediction problems. As an example, we apply this approach to the problem of multi-label classification (a fully connected pairwise Markov random field). Experimental results on benchmark data show that, in spite of using approximations, the approach is effective and yields good improvements in generalization performance over the plain supervised method. In addition, we demonstrate that our inference engine can be applied to other semi-supervised learning frameworks, and extends them to solve problems with complex structure.

Kai-Wei Chang, S. Sundararajan, S. Sathiya Keerthi

PSSDL: Probabilistic Semi-supervised Dictionary Learning

While recent supervised dictionary learning methods have attained promising results on the classification tasks, their performance depends on the availability of the large labeled datasets. However, in many real world applications, accessing to sufficient labeled data may be expensive and/or time consuming, but its relatively easy to acquire a large amount of unlabeled data. In this paper, we propose a probabilistic framework for discriminative dictionary learning which uses both the labeled and unlabeled data. Experimental results demonstrate that the performance of the proposed method is significantly better than the state of the art dictionary based classification methods.

Behnam Babagholami-Mohamadabadi, Ali Zarghami, Mohammadreza Zolfaghari, Mahdieh Soleymani Baghshah

Unsupervised Learning

Embedding with Autoencoder Regularization

The problem of embedding arises in many machine learning applications with the assumption that there may exist a small number of variabilities which can guarantee the “semantics” of the original high-dimensional data. Most of the existing embedding algorithms perform to maintain the

locality-preserving

property. In this study, inspired by the remarkable success of representation learning and deep learning, we propose a framework of embedding with autoencoder regularization (EAER for short), which incorporates embedding and autoencoding techniques naturally. In this framework, the original data are embedded into the lower dimension, represented by the output of the hidden layer of the autoencoder, thus the resulting data can not only maintain the locality-preserving property but also easily revert to their original forms. This is guaranteed by the joint minimization of the embedding loss and the autoencoder reconstruction error. It is worth mentioning that instead of operating in a batch mode as most of the previous embedding algorithms conduct, the proposed framework actually generates an

inductive

embedding model and thus supports incremental embedding efficiently. To show the effectiveness of EAER, we adapt this joint learning framework to three canonical embedding algorithms, and apply them to both synthetic and real-world data sets. The experimental results show that the adaption of EAER outperforms its original counterpart. Besides, compared with the existing incremental embedding algorithms, the results demonstrate that EAER performs incremental embedding with more competitive efficiency and effectiveness.

Wenchao Yu, Guangxiang Zeng, Ping Luo, Fuzhen Zhuang, Qing He, Zhongzhi Shi

Reduced-Rank Local Distance Metric Learning

We propose a new method for local metric learning based on a conical combination of Mahalanobis metrics and pair-wise similarities between the data. Its formulation allows for controlling the rank of the metrics’ weight matrices. We also offer a convergent algorithm for training the associated model. Experimental results on a collection of classification problems imply that the new method may offer notable performance advantages over alternative metric learning approaches that have recently appeared in the literature.

Yinjie Huang, Cong Li, Michael Georgiopoulos, Georgios C. Anagnostopoulos

Learning Exemplar-Represented Manifolds in Latent Space for Classification

Intrinsic manifold structure of a data collection is valuable information for classification task. By considering the manifold structure in the data set for classification and with the sparse coding framework, we propose an algorithm to: (1) find exemplars from each class to represent the class-specific manifold structure, in which way the object-space dimensionality is reduced; (2) simultaneously learn a latent feature space to make the mapped data more discriminative according to the class-specific manifold measurement. We call the proposed algorithm Exemplar-represented Manifold in Latent Space for Classification (EMLSC). We also present the nonlinear extension of EMLSC based on kernel tricks to deal with highly nonlinear situations. Experiments on synthetic and real-world datasets demonstrate the merit of the proposed method.

Shu Kong, Donghui Wang

Locally Linear Landmarks for Large-Scale Manifold Learning

Spectral methods for manifold learning and clustering typically construct a graph weighted with affinities from a dataset and compute eigenvectors of a graph Laplacian. With large datasets, the eigendecomposition is too expensive, and is usually approximated by solving for a smaller graph defined on a subset of the points (landmarks) and then applying the Nyström formula to estimate the eigenvectors over all points. This has the problem that the affinities between landmarks do not benefit from the remaining points and may poorly represent the data if using few landmarks. We introduce a modified spectral problem that uses all data points by constraining the latent projection of each point to be a local linear function of the landmarks’ latent projections. This constructs a new affinity matrix between landmarks that preserves manifold structure even with few landmarks, allows one to reduce the eigenproblem size, and defines a fast, nonlinear out-of-sample mapping.

Max Vladymyrov, Miguel Á. Carreira-Perpiñán

Subgroup Discovery, Outlier Detection and Anomaly Detection

Discovering Skylines of Subgroup Sets

Many tasks in exploratory data mining aim to discover the top-

k

results with respect to a certain interestingness measure. Unfortunately, in practice top-

k

solution sets are hardly satisfactory, if only because redundancy in such results is a severe problem. To address this, a recent trend is to find

diverse sets of high-quality patterns

. However, a ‘perfect’ diverse top-

k

cannot possibly exist, since there is an inherent trade-off between quality and diversity.

We argue that the best way to deal with the quality-diversity trade-off is to

explicitly consider the Pareto front, or skyline, of non-dominated solutions

, i.e. those solutions for which neither quality nor diversity can be improved without degrading the other quantity. In particular, we focus on

k

-pattern set mining in the context of Subgroup Discovery [6]. For this setting, we present two algorithms for the discovery of skylines; an exact algorithm and a levelwise heuristic.

We evaluate the performance of the two proposed skyline algorithms, and the accuracy of the levelwise method. Furthermore, we show that the skylines can be used for the objective evaluation of subgroup set heuristics. Finally, we show characteristics of the obtained skylines, which reveal that different quality-diversity trade-offs result in clearly different subgroup sets. Hence, the discovery of skylines is an important step towards a better understanding of ‘diverse top-

k

’s’.

Matthijs van Leeuwen, Antti Ukkonen

Difference-Based Estimates for Generalization-Aware Subgroup Discovery

For the task of subgroup discovery, generalization-aware interesting measures that are based not only on the statistics of the patterns itself, but also on the statistics of their generalizations have recently been shown to be essential. A key technique to increase runtime performance of subgroup discovery algorithms is the application of optimistic estimates to limit the search space size. These are upper bounds for the interestingness that any specialization of the currently evaluated pattern may have. Until now these estimates are based on the anti-monotonicity of instances, which are covered by the current pattern. This neglects important properties of generalizations. Therefore, we present in this paper a new scheme of deriving optimistic estimates for generalization aware subgroup discovery, which is based on the instances by which patterns differ in comparison to their generalizations. We show, how this technique can be applied for the most popular interestingness measures for binary as well as for numeric target concepts. The novel bounds are incorporated in an efficient algorithm, which outperforms previous methods by up to an order of magnitude.

Florian Lemmerich, Martin Becker, Frank Puppe

Local Outlier Detection with Interpretation

Outlier detection aims at searching for a small set of objects that are inconsistent or considerably deviating from other objects in a dataset. Existing research focuses on outlier identification while omitting the equally important problem of outlier interpretation. This paper presents a novel method named LODI to address both problems at the same time. In LODI, we develop an approach that explores the quadratic entropy to adaptively select a set of neighboring instances, and a learning method to seek an optimal subspace in which an outlier is maximally separated from its neighbors. We show that this learning task can be solved via the matrix eigen-decomposition and its solution contains essential information to reveal features that are most important to interpret the exceptional properties of outliers. We demonstrate the appealing performance of LODI via a number of synthetic and real world datasets and compare its outlier detection rates against state-of-the-art algorithms.

Xuan Hong Dang, Barbora Micenková, Ira Assent, Raymond T. Ng

Anomaly Detection in Vertically Partitioned Data by Distributed Core Vector Machines

Observations of physical processes suffer from instrument malfunction and noise and demand data cleansing. However, rare events are not to be excluded from modeling, since they can be the most interesting findings. Often, sensors collect features at different sites, so that only a subset is present (vertically distributed data). Transferring all data or a sample to a single location is impossible in many real-world applications due to restricted bandwidth of communication. Finding interesting abnormalities thus requires efficient methods of distributed anomaly detection.

We propose a new algorithm for anomaly detection on vertically distributed data. It aggregates the data directly at the local storage nodes using RBF kernels. Only a fraction of the data is communicated to a central node. Through extensive empirical evaluation on controlled datasets, we demonstrate that our method is an order of magnitude more communication efficient than state of the art methods, achieving a comparable accuracy.

Marco Stolpe, Kanishka Bhaduri, Kamalika Das, Katharina Morik

Mining Outlier Participants: Insights Using Directional Distributions in Latent Models

In this paper we will propose a new probabilistic topic model to score the expertise of participants on the projects that they contribute to based on their previous experience. Based on each participant’s score, we rank participants and define those who have the lowest scores as

outlier participants

. Since the focus of our study is on outliers, we name the model as

M

ining

O

utlier

P

articipants from

P

rojects (

MOPP

) model.

MOPP

is a topic model that is based on directional distributions which are particularly suitable for outlier detection in high-dimensional spaces. Extensive experiments on both synthetic and real data sets have shown that

MOPP

gives better results on both topic modeling and outlier detection tasks.

Didi Surian, Sanjay Chawla

Privacy and Security

Anonymizing Data with Relational and Transaction Attributes

Publishing datasets about individuals that contain both relational and transaction (i.e., set-valued) attributes is essential to support many applications, ranging from healthcare to marketing. However, preserving the privacy and utility of these datasets is challenging, as it requires

(i)

guarding against attackers, whose knowledge spans both attribute types, and

(ii)

minimizing the overall information loss. Existing anonymization techniques are not applicable to such datasets, and the problem cannot be tackled based on popular, multi-objective optimization strategies. This work proposes the first approach to address this problem. Based on this approach, we develop two frameworks to offer privacy, with bounded information loss in one attribute type and minimal information loss in the other. To realize each framework, we propose privacy algorithms that effectively preserve data utility, as verified by extensive experiments.

Giorgos Poulis, Grigorios Loukides, Aris Gkoulalas-Divanis, Spiros Skiadopoulos

Privacy-Preserving Mobility Monitoring Using Sketches of Stationary Sensor Readings

Two fundamental tasks of mobility modeling are (1) to track the number of distinct persons that are present at a location of interest and (2) to reconstruct flows of persons between two or more different locations. Stationary sensors, such as Bluetooth scanners, have been applied to both tasks with remarkable success. However, this approach has privacy problems. For instance, Bluetooth scanners store the MAC address of a device that can in principle be linked to a single person. Unique hashing of the address only partially solves the problem because such a pseudonym is still vulnerable to various linking attacks. In this paper we propose a solution to both tasks using an extension of linear counting sketches. The idea is to map several individuals to the same position in a sketch, while at the same time the inaccuracies introduced by this overloading are compensated by using several independent sketches. This idea provides, for the first time, a general set of primitives for privacy preserving mobility modeling from Bluetooth and similar address-based devices.

Michael Kamp, Christine Kopp, Michael Mock, Mario Boley, Michael May

Evasion Attacks against Machine Learning at Test Time

In security-sensitive applications, the success of machine learning depends on a thorough vetting of their resistance to adversarial data. In one pertinent, well-motivated attack scenario, an adversary may attempt to evade a deployed system at test time by carefully manipulating attack samples. In this work, we present a simple but effective gradient-based approach that can be exploited to systematically assess the security of several, widely-used classification algorithms against evasion attacks. Following a recently proposed framework for security evaluation, we simulate attack scenarios that exhibit different risk levels for the classifier by increasing the attacker’s knowledge of the system and her ability to manipulate attack samples. This gives the classifier designer a better picture of the classifier performance under evasion attacks, and allows him to perform a more informed model selection (or parameter setting). We evaluate our approach on the relevant security task of malware detection in PDF files, and show that such systems can be easily evaded. We also sketch some countermeasures suggested by our analysis.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, Fabio Roli

Data Mining and Constraint Solving

The Top-k Frequent Closed Itemset Mining Using Top-k SAT Problem

In this paper, we introduce a new problem, called Top-

k

SAT, that consists in enumerating the Top-

k

models of a propositional formula. A Top-

k

model is defined as a model with less than

k

models preferred to it with respect to a preference relation. We show that Top-

k

SAT generalizes two well-known problems: the partial Max-SAT problem and the problem of computing minimal models. Moreover, we propose a general algorithm for Top-

k

SAT. Then, we give the first application of our declarative framework in data mining, namely, the problem of enumerating the Top-

k

frequent closed itemsets of length at least

min

(

${\cal FCIM}_{min}^k$

). Finally, to show the nice declarative aspects of our framework, we encode several other variants of

${\cal FCIM}_{min}^k$

into the Top-

k

SAT problem.

Said Jabbour, Lakhdar Sais, Yakoub Salhi

A Declarative Framework for Constrained Clustering

In recent years, clustering has been extended to constrained clustering, so as to integrate knowledge on objects or on clusters, but adding such constraints generally requires to develop new algorithms. We propose a declarative and generic framework, based on Constraint Programming, which enables to design clustering tasks by specifying an optimization criterion and some constraints either on the clusters or on pairs of objects. In our framework, several classical optimization criteria are considered and they can be coupled with different kinds of constraints. Relying on Constraint Programming has two main advantages: the declarativity, which enables to easily add new constraints and the ability to find an optimal solution satisfying all the constraints (when there exists one). On the other hand, computation time depends on the constraints and on their ability to reduce the domain of variables, thus avoiding an exhaustive search.

Thi-Bich-Hanh Dao, Khanh-Chuong Duong, Christel Vrain

SNNAP: Solver-Based Nearest Neighbor for Algorithm Portfolios

The success of portfolio algorithms in competitions in the area of combinatorial problem solving, as well as in practice, has motivated interest in the development of new approaches to determine the best solver for the problem at hand. Yet, although there are a number of ways in which this decision can be made, it always relies on a rich set of features to identify and distinguish the structure of the problem instances. In this paper, we show how one of the more successful portfolio approaches, ISAC, can be augmented by taking into account the past performance of solvers as part of the feature vector. Testing on a variety of SAT datasets, we show how our new formulation continuously outperforms an unmodified/standard version of ISAC.

Marco Collautti, Yuri Malitsky, Deepak Mehta, Barry O’Sullivan

Evaluation

Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals

The area under the precision-recall curve (AUCPR) is a single number summary of the information in the precision-recall (PR) curve. Similar to the receiver operating characteristic curve, the PR curve has its own unique properties that make estimating its enclosed area challenging. Besides a point estimate of the area, an interval estimate is often required to express magnitude and uncertainty. In this paper we perform a computational analysis of common AUCPR estimators and their confidence intervals. We find both satisfactory estimates and invalid procedures and we recommend two simple intervals that are robust to a variety of assumptions.

Kendrick Boyd, Kevin H. Eng, C. David Page

Applications

Incremental Sensor Placement Optimization on Water Network

Sensor placement on water networks is critical for the detection of accidental or intentional contamination event. With the development and expansion of cities, the public water distribution systems in cities are continuously growing. As a result, the current sensor placement will lose its effectiveness in detecting contamination event. Hence, in many real applications, we need to solve the

incremental sensor placement

(ISP) problem. We expect to find a sensor placement solution that reuses existing sensor deployments as much as possible to reduce cost, while ensuring the effectiveness of contamination detection. In this paper, we propose scenario-cover model to formalize ISP and prove that ISP is NP-hard and propose our greedy approaches with provable quality bound. Extensive experiments show the effectiveness, robustness and efficiency of the proposed solutions.

Xiaomin Xu, Yiqi Lu, Sheng Huang, Yanghua Xiao, Wei Wang

Detecting Marionette Microblog Users for Improved Information Credibility

In this paper, we mine a special group of microblog users: the “marionette” users, who are created or employed by backstage “puppeteers”, either through programs or manually. Unlike normal users that access microblogs for information sharing or social communication, the marionette users perform specific tasks to earn financial profits. For example, they follow certain users to increase their “statistical popularity”, or retweet some tweets to amplify their “statistical impact”. The fabricated follower or retweet counts not only mislead normal users to wrong information, but also seriously impair microblog-based applications, such as popular tweets selection and expert finding. In this paper, we study the important problem of detecting marionette users on microblog platforms. This problem is challenging because puppeteers are employing complicated strategies to generate marionette users that present similar behaviors as normal ones. To tackle this challenge, we propose to take into account two types of discriminative information: (1) individual user tweeting behaviors and (2) the social interactions among users. By integrating both information into a semi-supervised probabilistic model, we can effectively distinguish marionette users from normal ones. By applying the proposed model to one of the most popular microblog platform (Sina Weibo) in China, we find that the model can detect marionette users with f-measure close to 0.9. In addition, we propose an application to measure the credibility of retweet counts.

Xian Wu, Ziming Feng, Wei Fan, Jing Gao, Yong Yu

Will My Question Be Answered? Predicting “Question Answerability” in Community Question-Answering Sites

All askers who post questions in Community-based Question Answering (CQA) sites such as Yahoo! Answers, Quora or Baidu’s Zhidao, expect to receive an answer, and are frustrated when their questions remain unanswered. We propose to provide a type of “heads up” to askers by predicting how many answers, if at all, they will get. Giving a preemptive warning to the asker at posting time should reduce the frustration effect and hopefully allow askers to rephrase their questions if needed. To the best of our knowledge, this is the first attempt to predict the actual number of answers, in addition to predicting whether the question will be answered or not. To this effect, we introduce a new prediction model, specifically tailored to hierarchically structured CQA sites.We conducted extensive experiments on a large corpus comprising 1 year of answering activity on Yahoo! Answers, as opposed to a single day in previous studies. These experiments show that the

F

1 we achieved is 24% better than in previous work, mostly due the structure built into the novel model.

Gideon Dror, Yoelle Maarek, Idan Szpektor

Learning to Detect Patterns of Crime

Our goal is to automatically detect patterns of crime. Among a large set of crimes that happen every year in a major city, it is challenging, time-consuming, and labor-intensive for crime analysts to determine which ones may have been committed by the same individual(s). If automated, data-driven tools for crime pattern detection are made available to assist analysts, these tools could help police to better understand patterns of crime, leading to more precise attribution of past crimes, and the apprehension of suspects. To do this, we propose a pattern detection algorithm called

Series Finder

, that grows a pattern of discovered crimes from within a database, starting from a “seed” of a few crimes. Series Finder incorporates both the common characteristics of all patterns and the unique aspects of each specific pattern, and has had promising results on a decade’s worth of crime pattern data collected by the Crime Analysis Unit of the Cambridge Police Department.

Tong Wang, Cynthia Rudin, Daniel Wagner, Rich Sevieri

Space Allocation in the Retail Industry: A Decision Support System Integrating Evolutionary Algorithms and Regression Models

One of the hardest resources to manage in retail is space. Retailers need to assign limited store space to a growing number of product categories such that sales and other performance metrics are maximized. Although this seems to be an ideal task for a data mining approach, there is one important barrier: the representativeness of the available data. In fact, changes to the layout of retail stores are infrequent. This means that very few values of the space variable are represented in the data, which makes it hard to generalize. In this paper, we describe a Decision Support System to assist retailers in this task. The system uses an Evolutionary Algorithm to optimize space allocation based on the estimated impact on sales caused by changes in the space assigned to product categories. We assess the quality of the system on a real case study, using different regression algorithms to generate the estimates. The system obtained very good results when compared with the recommendations made by the business experts. We also investigated the effect of the representativeness of the sample on the accuracy of the regression models. We selected a few product categories based on a heuristic assessment of their representativeness. The results indicate that the best regression models were obtained on products for which the sample was not the best. The reason for this unexpected results remains to be explained.

Fábio Pinto, Carlos Soares

Medical Applications

Forest-Based Point Process for Event Prediction from Electronic Health Records

Accurate prediction of future onset of disease from Electronic Health Records (EHRs) has important clinical and economic implications. In this domain the arrival of data comes at semi-irregular intervals and makes the prediction task challenging. We propose a method called multiplicative-forest point processes (MFPPs) that learns the rate of future events based on an event history. MFPPs join previous theory in multiplicative forest continuous-time Bayesian networks and piecewise-continuous conditional intensity models. We analyze the advantages of using MFPPs over previous methods and show that on synthetic and real EHR forecasting of heart attacks, MFPPs outperform earlier methods and augment off-the-shelf machine learning algorithms.

Jeremy C. Weiss, David Page

On Discovering the Correlated Relationship between Static and Dynamic Data in Clinical Gait Analysis

‘Gait’ is a person’s manner of walking. Patients may have an abnormal gait due to a range of physical impairment or brain damage. Clinical gait analysis (CGA) is a technique for identifying the underlying impairments that affect a patient’s gait pattern. The CGA is critical for treatment planning. Essentially, CGA tries to use patients’ physical examination results, known as

static

data, to interpret the dynamic characteristics in an abnormal gait, known as

dynamic

data. This process is carried out by gait analysis experts, mainly based on their experience which may lead to subjective diagnoses. To facilitate the automation of this process and form a relatively objective diagnosis, this paper proposes a new probabilistic correlated static-dynamic model (CSDM) to discover correlated relationships between the dynamic characteristics of gait and their root cause in the static data space. We propose an EM-based algorithm to learn the parameters of the CSDM. One of the main advantages of the CSDM is its ability to provide intuitive knowledge. For example, the CSDM can describe what kinds of static data will lead to what kinds of hidden gait patterns in the form of a decision tree, which helps us to infer dynamic characteristics based on static data. Our initial experiments indicate that the CSDM is promising for discovering the correlated relationship between physical examination (static) and gait (dynamic) data.

Yin Song, Jian Zhang, Longbing Cao, Morgan Sangeux

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Drug repositioning helps identify new indications for marketed drugs and clinical candidates. In this study, we proposed an integrative computational framework to predict novel drug indications for both approved drugs and clinical molecules by integrating chemical, biological and phenotypic data sources. We defined different similarity measures for each of these data sources and utilized a weighted

k

-nearest neighbor algorithm to transfer similarities of nearest neighbors to prediction scores for a given compound. A large margin method was used to combine individual metrics from multiple sources into a global metric. A large-scale study was conducted to repurpose 1007 drugs against 719 diseases. Experimental results showed that the proposed algorithm outperformed similar previously developed computational drug repositioning approaches. Moreover, the new algorithm also ranked drug information sources based on their contributions to the prediction, thus paving the way for prioritizing multiple data sources and building more reliable drug repositioning models.

Ping Zhang, Pankaj Agarwal, Zoran Obradovic

Score As You Lift (SAYL): A Statistical Relational Learning Approach to Uplift Modeling

We introduce Score As You Lift (SAYL), a novel Statistical Relational Learning (SRL) algorithm, and apply it to an important task in the diagnosis of breast cancer. SAYL combines SRL with the marketing concept of uplift modeling, uses the area under the uplift curve to direct clause construction and final theory evaluation, integrates rule learning and probability assignment, and conditions the addition of each new theory rule to existing ones.

Breast cancer, the most common type of cancer among women, is categorized into two subtypes: an earlier in situ stage where cancer cells are still confined, and a subsequent invasive stage. Currently older women with in situ cancer are treated to prevent cancer progression, regardless of the fact that treatment may generate undesirable side-effects, and the woman may die of other causes. Younger women tend to have more aggressive cancers, while older women tend to have more indolent tumors. Therefore older women whose in situ tumors show significant dissimilarity with in situ cancer in younger women are less likely to progress, and can thus be considered for watchful waiting.

Motivated by this important problem, this work makes two main contributions. First, we present the first multi-relational uplift modeling system, and introduce, implement and evaluate a novel method to guide search in an SRL framework. Second, we compare our algorithm to previous approaches, and demonstrate that the system can indeed obtain differential rules of interest to an expert on real data, while significantly improving the data uplift.

Houssam Nassif, Finn Kuusisto, Elizabeth S. Burnside, David Page, Jude Shavlik, Vítor Santos Costa

Nectar Track

A Theoretical Framework for Exploratory Data Mining: Recent Insights and Challenges Ahead

Exploratory Data Mining (EDM), the contemporary heir of Exploratory Data Analysis (EDA) pioneered by Tukey in the seventies, is the task of facilitating the extraction of interesting nuggets of information from possibly large and complexly structured data. Major conceptual challenges in EDM research are the understanding of how one can formalise a nugget of information (given the diversity of types of data of interest), and how one can formalise how interesting such a nugget of information is to a particular user (given the diversity of types of users and intended purposes). In this Nectar paper we briefly survey a number of recent contributions made by us and collaborators towards a theoretically motivated and practically usable resolution of these challenges.

Tijl De Bie, Eirini Spyropoulou

Tensor Factorization for Multi-relational Learning

Tensor factorization has emerged as a promising approach for solving relational learning tasks. Here we review recent results on a particular tensor factorization approach, i.e.

Rescal

, which has demonstrated state-of-the-art relational learning results, while scaling to knowledge bases with millions of entities and billions of known facts.

Maximilian Nickel, Volker Tresp

MONIC and Followups on Modeling and Monitoring Cluster Transitions

There is much recent discussion on data streams and big data, which except of their volume and velocity are also characterized by volatility. Next to detecting change, it is also important to interpret it. Consider customer profiling as an example: Is a cluster corresponding to a group of customers simply disappearing or are its members migrating to other clusters? Does a new cluster reflect a new type of customers or does it rather consist of old customers whose preferences shift? To answer such questions, we have proposed the framework MONIC [20] for modeling and tracking cluster transitions. MONIC has been re-discovered some years after publication and is enjoying a large citation record from papers on community evolution, cluster evolution, change prediction and topic evolution.

Myra Spiliopoulou, Eirini Ntoutsi, Yannis Theodoridis, Rene Schult

Towards Robot Skill Learning: From Simple Skills to Table Tennis

Learning robots that can acquire new motor skills and refine existing ones have been a long-standing vision of both robotics, and machine learning. However, off-the-shelf machine learning appears not to be adequate for robot skill learning, as it neither scales to anthropomorphic robotics nor do fulfills the crucial real-time requirements. As an alternative, we propose to divide the generic skill learning problem into parts that can be well-understood from a robotics point of view. In this context, we have developed machine learning methods applicable to robot skill learning. This paper discusses recent progress ranging from simple skill learning problems to a game of robot table tennis.

Jan Peters, Jens Kober, Katharina Mülling, Oliver Krämer, Gerhard Neumann

Functional MRI Analysis with Sparse Models

Sparse models embed variable selection into model learning (e.g., by using

l

1

-norm regularizer). In small-sample high-dimensional problems, this leads to improved generalization accuracy combined with interpretability, which is important in scientific applications such as biology. In this paper, we summarize our recent work on sparse models, including both sparse regression and sparse Gaussian Markov Random Fields (GMRF), in neuroimaging applications, such as functional MRI data analysis, where the central objective is to gain a better insight into brain functioning, besides just learning predictive models of mental states from imaging data.

Irina Rish

Demo Track

Image Hub Explorer: Evaluating Representations and Metrics for Content-Based Image Retrieval and Object Recognition

Large quantities of image data are generated daily and visualizing large image datasets is an important task. We present a novel tool for image data visualization and analysis, Image Hub Explorer. The integrated analytic functionality is centered around dealing with the recently described phenomenon of

hubness

and evaluating its impact on the image retrieval, recognition and recommendation process. Hubness is reflected in that some images (

hubs

) end up being very frequently retrieved in ’top

k

’ result sets, regardless of their labels and target semantics. Image Hub Explorer offers many methods that help in visualizing the influence of major image hubs, as well as state-of-the-art metric learning and hubness-aware classification methods that help in reducing the overall impact of extremely frequent neighbor points. The system also helps in visualizing both beneficial and detrimental visual words in individual images. Search functionality is supported, along with the recently developed hubness-aware result set re-ranking procedure.

Nenad Tomašev, Dunja Mladenić

Ipseity – A Laboratory for Synthesizing and Validating Artificial Cognitive Systems in Multi-agent Systems

This article presents an overview on

Ipseity

[1], an opensource rich-client platform developed in C++ with the

Qt

[2] framework.

Ipseity

facilitates the synthesis of artificial cognitive systems in multi-agent systems. The current version of the platform includes a set of plugins based on the classical reinforcement learning techniques like Q-Learning and Sarsa.

Ipseity

is targeted at a broad range of users interested in artificial intelligence in general, including industrial practitioners, as well as machine learning researchers, students and teachers. It is daily used as a course support in Artificial Intelligence and Reinforcement Learning and it has been used successfully to manage power flows in simulated microgrids using multi-agent reinforcement learning [4].

Fabrice Lauri, Nicolas Gaud, Stéphane Galland, Vincent Hilaire

OpenML: A Collaborative Science Platform

We present OpenML, a novel open science platform that provides easy access to machine learning data, software and results to encourage further study and application. It organizes all submitted results online so they can be easily found and reused, and features a web API which is being integrated in popular machine learning tools such as Weka, KNIME, RapidMiner and R packages, so that experiments can be shared easily.

Jan N. van Rijn, Bernd Bischl, Luis Torgo, Bo Gao, Venkatesh Umaashankar, Simon Fischer, Patrick Winter, Bernd Wiswedel, Michael R. Berthold, Joaquin Vanschoren

ViperCharts: Visual Performance Evaluation Platform

The paper presents the ViperCharts web-based platform for visual performance evaluation of classification, prediction, and information retrieval algorithms. The platform enables to create interactive charts for easy and intuitive evaluation of performance results. It includes standard visualizations and extends them by offering alternative evaluation methods like

F

-isolines, and by establishing relations between corresponding presentations like Precision-Recall and ROC curves. Additionally, the interactive performance charts can be saved, exported to several formats, and shared via unique web addresses. A web API to the service is also available.

Borut Sluban, Nada Lavrač

Entityclassifier.eu: Real-Time Classification of Entities in Text with Wikipedia

Targeted Hypernym Discovery (THD) performs unsupervised classification of entities appearing in text. A hypernym mined from the free-text of the Wikipedia article describing the entity is used as a class. The type as well as the entity are cross-linked with their representation in DBpedia, and enriched with additional types from DBpedia and YAGO knowledge bases providing a semantic web interoperability. The system, available as a web application and web service at

entityclassifier.eu

, currently supports English, German and Dutch.

Milan Dojchinovski, Tomáš Kliegr

Hermoupolis: A Trajectory Generator for Simulating Generalized Mobility Patterns

During the last decade, the domain of mobility data mining has emerged providing many effective methods for the discovery of intuitive patterns representing collective behavior of trajectories of moving objects. Although a few real-world trajectory datasets have been made available recently, these are not sufficient for experimentally evaluating the various proposals, therefore, researchers look to synthetic trajectory generators. This case is problematic because, on the one hand, real datasets are usually small, which compromises scalability experiments, and, on the other hand, synthetic dataset generators have not been designed to produce mobility pattern driven trajectories. Motivated by this observation, we present

Hermoupolis

, an effective generator of synthetic trajectories of moving objects that has the main objective that the resulting datasets support various types of mobility patterns (clusters, flocks, convoys, etc.), as such producing datasets with available ground truth information.

Nikos Pelekis, Christos Ntrigkogias, Panagiotis Tampakis, Stylianos Sideridis, Yannis Theodoridis

AllAboard: A System for Exploring Urban Mobility and Optimizing Public Transport Using Cellphone Data

The deep penetration of mobile phones offers cities the ability to opportunistically monitor citizens’ interactions and use data-driven insights to better plan and manage services. In this context, transit operators can leverage pervasive mobile sensing to better match observed demand for travel with their service offerings. With large scale data on mobility patterns, operators can move away from the costly and resource intensive transportation planning processes prevalent in the West, to a more data-centric view, that places the instrumented user at the center of development. In this framework, using mobile phone data to perform transit analysis and optimization represents a new frontier with significant societal impact, especially in developing countries.

Michele Berlingerio, Francesco Calabrese, Giusy Di Lorenzo, Rahul Nair, Fabio Pinelli, Marco Luca Sbodio

ScienScan – An Efficient Visualization and Browsing Tool for Academic Search

In this paper we present ScienScan – a browsing and visualization tool for academic search. The tool operates in real time by post-processing the query results returned by an academic search engine. ScienScan discovers topics in the search results and summarizes them in the form of a concise hierarchical topic map. The produced topical summary informatively represents the results in a visual way and provides an additional filtering control. We demonstrate the operation of ScienScan deploying it on top of the search API of Microsoft Academic Search.

Daniil Mirylenka, Andrea Passerini

InVis: A Tool for Interactive Visual Data Analysis

We present

InVis

, a tool to visually analyse data by interactively shaping a two dimensional embedding of it. Traditionally, embedding techniques focus on finding one fixed embedding, which emphasizes a single aspects of the data. In contrast, our application enables the user to explore the structures of a dataset by observing and controlling a projection of it. Ultimately it provides a way to search and find an embedding, emphasizing aspects that the user desires to highlight.

Daniel Paurat, Thomas Gärtner

Kanopy: Analysing the Semantic Network around Document Topics

External knowledge bases, both generic and domain specific, available on the Web of Data have the potential of enriching the content of text documents with structured information. We present the Kanopy system that makes explicit use of this potential. Besides the common task of semantic annotation of documents, Kanopy analyses the semantic network that resides in DBpedia around extracted concepts. The system’s main novelty lies in the translation of social network analysis measures to semantic networks in order to find suitable topic labels. Moreover, Kanopy extracts advanced knolwedge in the form of subgraphs that capture the relationships between the concepts.

Ioana Hulpuş, Conor Hayes, Marcel Karnstedt, Derek Greene, Marek Jozwowicz

SCCQL : A Constraint-Based Clustering System

This paper presents the first version of a new inductive data-base system called SCCQL. The system performs constraint-based clustering on a relational database. Clustering problems are formulated with a query language, an extension of SQL for clustering that includes must-link and cannot-link constraints. The functioning of the system is explained. As an example of use of this system, an application in the context of microbiology has been developed that is presented here.

Antoine Adam, Hendrik Blockeel, Sander Govers, Abram Aertsen

Erratum: Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals

The right hand side of the equation for interpolated area under the PR curve at the bottom of page 455 inadvertently transposes

a

and

b

. This typo does not affect the validity of the proof.

The corrected formula should read:

Kendrick Boyd, Kevin H. Eng, C. David Page

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Ensembles

AR-Boost: Reducing Overfitting by a Robust Data-Driven Regularization Strategy

Parallel Boosting with Momentum

Inner Ensembles: Using Ensemble Methods Inside the Learning Algorithm

Statistical Learning

Learning Discriminative Sufficient Statistics Score Space for Classification

The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited

Bundle CDN: A Highly Parallelized Approach for Large-Scale ℓ1-Regularized Logistic Regression

MORD: Multi-class Classifier for Ordinal Regression

Identifiability of Model Properties in Over-Parameterized Model Classes

Semi-supervised Learning

Exploratory Learning

Semi-supervised Gaussian Process Ordinal Regression

Influence of Graph Construction on Semi-supervised Learning

Tractable Semi-supervised Learning of Complex Structured Prediction Models

PSSDL: Probabilistic Semi-supervised Dictionary Learning

Unsupervised Learning

Embedding with Autoencoder Regularization

Reduced-Rank Local Distance Metric Learning

Learning Exemplar-Represented Manifolds in Latent Space for Classification

Locally Linear Landmarks for Large-Scale Manifold Learning

Subgroup Discovery, Outlier Detection and Anomaly Detection

Discovering Skylines of Subgroup Sets

Difference-Based Estimates for Generalization-Aware Subgroup Discovery

Local Outlier Detection with Interpretation

Anomaly Detection in Vertically Partitioned Data by Distributed Core Vector Machines

Mining Outlier Participants: Insights Using Directional Distributions in Latent Models

Privacy and Security

Anonymizing Data with Relational and Transaction Attributes

Privacy-Preserving Mobility Monitoring Using Sketches of Stationary Sensor Readings

Evasion Attacks against Machine Learning at Test Time

Data Mining and Constraint Solving

The Top-k Frequent Closed Itemset Mining Using Top-k SAT Problem

A Declarative Framework for Constrained Clustering

SNNAP: Solver-Based Nearest Neighbor for Algorithm Portfolios

Evaluation

Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals

Applications

Incremental Sensor Placement Optimization on Water Network

Detecting Marionette Microblog Users for Improved Information Credibility

Will My Question Be Answered? Predicting “Question Answerability” in Community Question-Answering Sites

Learning to Detect Patterns of Crime

Space Allocation in the Retail Industry: A Decision Support System Integrating Evolutionary Algorithms and Regression Models

Medical Applications

Forest-Based Point Process for Event Prediction from Electronic Health Records

On Discovering the Correlated Relationship between Static and Dynamic Data in Clinical Gait Analysis

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Score As You Lift (SAYL): A Statistical Relational Learning Approach to Uplift Modeling

Nectar Track

A Theoretical Framework for Exploratory Data Mining: Recent Insights and Challenges Ahead

Tensor Factorization for Multi-relational Learning

MONIC and Followups on Modeling and Monitoring Cluster Transitions

Towards Robot Skill Learning: From Simple Skills to Table Tennis

Functional MRI Analysis with Sparse Models

Demo Track

Image Hub Explorer: Evaluating Representations and Metrics for Content-Based Image Retrieval and Object Recognition

Ipseity – A Laboratory for Synthesizing and Validating Artificial Cognitive Systems in Multi-agent Systems

OpenML: A Collaborative Science Platform

ViperCharts: Visual Performance Evaluation Platform

Entityclassifier.eu: Real-Time Classification of Entities in Text with Wikipedia

Hermoupolis: A Trajectory Generator for Simulating Generalized Mobility Patterns

AllAboard: A System for Exploring Urban Mobility and Optimizing Public Transport Using Cellphone Data

ScienScan – An Efficient Visualization and Browsing Tool for Academic Search

InVis: A Tool for Interactive Visual Data Analysis

Kanopy: Analysing the Semantic Network around Document Topics

SCCQL : A Constraint-Based Clustering System

Erratum: Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals

Backmatter

Premium Partner