## About this book

The three-volume proceedings LNAI 10534–10536 constitute the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2017, held in Skopje, Macedonia, in September 2017.

The total of 104 papers presented in these books was carefully reviewed and selected from 364 submissions. The papers were organized in topical sections named as follows:
Part I: anomaly detection; computer vision; ensembles and meta learning; feature selection and extraction; kernel methods; learning and optimization; matrix and tensor factorization; networks and graphs; neural networks and deep learning.
Part II: pattern and sequence mining; privacy and security; probabilistic models and methods; recommendation; regression; reinforcement learning; subgroup discovery; time series and streams; transfer and multi-task learning; unsupervised and semisupervised learning.
Part III: applied data science track; nectar track; and demo track.

## Table of contents

### BeatLex: Summarizing and Forecasting Time Series with Patterns

Given time-series data such as electrocardiogram (ECG) readings, or motion capture data, how can we succinctly summarize the data in a way that robustly identifies patterns that appear repeatedly? How can we then use such a summary to identify anomalies such as abnormal heartbeats, and also forecast future values of the time series? Our main idea is a vocabulary-based approach, which automatically learns a set of common patterns, or ‘beat patterns,’ which are used as building blocks to describe the time series in an intuitive and interpretable way. Our summarization algorithm, BeatLex (BeatLexicons for Summarization) is: (1) fast and online, requiring linear time in the data size and bounded memory; (2) effective, outperforming competing algorithms in labelling accuracy by 5.3 times, and forecasting accuracy by 1.8 times; (3) principled and parameter-free, as it is based on the Minimum Description Length principle of summarizing the data by compressing it using as few bits as possible, and automatically tunes all its parameters; (4) general: it applies to any domain of time series data, and can make use of multidimensional (i.e. coevolving) time series.

Bryan Hooi, Shenghua Liu, Asim Smailagic, Christos Faloutsos

### Behavioral Constraint Template-Based Sequence Classification

In this paper we present the interesting Behavioral Constraint Miner (iBCM), a new approach towards classifying sequences. The prevalence of sequential data, i.e., a collection of ordered items such as text, website navigation patterns, traffic management, and so on, has incited a surge in research interest towards sequence classification. Existing approaches mainly focus on retrieving sequences of itemsets and checking their presence in labeled data streams to obtain a classifier. The proposed iBCM approach, rather than focusing on plain sequences, is template-based and draws its inspiration from behavioral patterns used for software verification. These patterns have a broad range of characteristics and go beyond the typical sequence mining representation, allowing for a more precise and concise way of capturing sequential information in a database. Furthermore, it is possible to also mine for negative information, i.e., sequences that do not occur. The technique is benchmarked against other state-of-the-art approaches and exhibits a strong potential towards sequence classification. Code related to this chapter is available at: http://feb.kuleuven.be/public/u0092789/.

Johannes De Smedt, Galina Deeva, Jochen De Weerdt

### Efficient Sequence Regression by Learning Linear Models in All-Subsequence Space

We present a new approach for learning a sequence regression function, i.e., a mapping from sequential observations to a numeric score. Our learning algorithm employs coordinate gradient descent with Gauss-Southwell optimization in the feature space of all subsequences. We give a tight upper bound for the coordinate-wise gradients of squared error loss, which enables efficient Gauss-Southwell selection. The proposed bound is built by separating the positive and the negative gradients of the loss function and exploits the structure of the feature space. Extensive experiments on simulated as well as real-world sequence regression benchmarks show that the bound is effective and our proposed learning algorithm is efficient and accurate. The resulting linear regression model provides the user with a list of the most predictive features selected during the learning stage, adding to the interpretability of the method. Code and data related to this chapter are available at: https://github.com/svgsponer/SqLoss.

Severin Gsponer, Barry Smyth, Georgiana Ifrim
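
The Gauss-Southwell selection rule mentioned in this abstract can be illustrated on an ordinary least-squares problem. The sketch below is not the chapter's algorithm: it assumes a small, explicit feature matrix rather than the all-subsequence feature space, and it computes every coordinate gradient directly instead of using the proposed bound. The function name `gauss_southwell_least_squares` and the toy data are hypothetical.

```python
def gauss_southwell_least_squares(X, y, n_iters=200):
    """Greedy coordinate descent for min_w 0.5 * ||Xw - y||^2.

    X is a list of rows (lists of floats), y a list of targets.
    Illustrative only: features are explicit, not subsequences.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    residual = [-t for t in y]  # residual = Xw - y, starting from w = 0
    # Squared column norms give the exact step size for each coordinate.
    col_sq = [sum(X[i][j] ** 2 for i in range(n_samples))
              for j in range(n_features)]
    for _ in range(n_iters):
        # Gradient of the squared error loss for every coordinate.
        grad = [sum(X[i][j] * residual[i] for i in range(n_samples))
                for j in range(n_features)]
        # Gauss-Southwell rule: update the coordinate with the largest |gradient|.
        j = max(range(n_features), key=lambda k: abs(grad[k]))
        if abs(grad[j]) < 1e-12 or col_sq[j] == 0.0:
            break
        step = grad[j] / col_sq[j]  # exact minimiser along coordinate j
        w[j] -= step
        for i in range(n_samples):
            residual[i] -= step * X[i][j]
    return w


# Toy usage with hypothetical data.
weights = gauss_southwell_least_squares(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0])
print(weights)
```

Because only one coordinate changes per iteration, the learned model stays sparse in the early iterations, which is one reason such greedy rules are attractive in very large feature spaces.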

### Subjectively Interesting Connecting Trees

Consider a large network, and a user-provided set of query nodes between which the user wishes to explore relations. For example, a researcher may want to connect research papers in a citation network, an analyst may wish to connect organized crime suspects in a communication network, or an internet user may want to organize their bookmarks given their location in the world wide web. A natural way to show how query nodes are related is in the form of a tree in the network that connects them. However, in sufficiently dense networks, most such trees will be large or somehow trivial (e.g. involving high degree nodes) and thus not insightful. In this paper, we define and investigate the new problem of mining subjectively interesting trees connecting a set of query nodes in a network, i.e., trees that are highly surprising to the specific user at hand. Using information theoretic principles, we formalize the notion of interestingness of such trees mathematically, taking into account any prior beliefs the user has specified about the network. We then propose heuristic algorithms to find the best trees efficiently, given a specified prior belief model. Modeling the user’s prior belief state is, however, not necessarily computationally tractable. Yet, we show how a highly generic class of prior beliefs, namely about individual node degrees in combination with the density of particular sub-networks, can be dealt with in a tractable manner. Such types of beliefs can be used to model knowledge of a partial or total order of the network nodes, e.g. where the nodes represent events in time (such as papers in a citation network). An empirical validation of our methods on a large real network evaluates the different heuristics and validates the interestingness of the given trees.

Florian Adriaens, Jefrey Lijffijt, Tijl De Bie

### Malware Detection by Analysing Encrypted Network Traffic with Neural Networks

We study the problem of detecting malware on client computers based on the analysis of HTTPS traffic. Here, malware has to be detected based on the host address, timestamps, and data volume information of the computer’s network traffic. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data and derive a malware-detection method based on a neural embedding of domain names and a long short-term memory network that processes network flows. We study the method’s ability to detect new malware in a large-scale empirical study.

Paul Prasse, Lukáš Machlica, Tomáš Pevný, Jiří Havelka, Tobias Scheffer

### PEM: A Practical Differentially Private System for Large-Scale Cross-Institutional Data Mining

Privacy has become a serious concern in data mining. Achieving adequate privacy is especially challenging when the scale of the problem is large. Fundamentally, designing a practical privacy-preserving data mining system involves tradeoffs among several factors such as the privacy guarantee, the accuracy or utility of the mining result, the computation efficiency and the generality of the approach. In this paper, we present PEM, a practical system that tries to strike the right balance among these factors. We use a combination of noise-based and noise-free techniques to achieve provable differential privacy at a low computational overhead while obtaining more accurate results than previous approaches. PEM provides an efficient private gradient descent that can be the basis for many practical data mining and machine learning algorithms, like logistic regression, k-means, and Apriori. We evaluate these algorithms on three real-world open datasets in a cloud computing environment. The results show that PEM achieves good accuracy, high scalability, and low computation cost while maintaining differential privacy.

Yi Li, Yitao Duan, Wei Xu

### Bayesian Heatmaps: Probabilistic Classification with Multiple Unreliable Information Sources

Unstructured data from diverse sources, such as social media and aerial imagery, can provide valuable up-to-date information for intelligent situation assessment. Mining these different information sources could bring major benefits to applications such as situation awareness in disaster zones and mapping the spread of diseases. Such applications depend on classifying the situation across a region of interest, which can be depicted as a spatial “heatmap”. Annotating unstructured data using crowdsourcing or automated classifiers produces individual classifications at sparse locations that typically contain many errors. We propose a novel Bayesian approach that models the relevance, error rates and bias of each information source, enabling us to learn a spatial Gaussian Process classifier by aggregating data from multiple sources with varying reliability and relevance. Our method does not require gold-labelled data and can make predictions at any location in an area of interest given only sparse observations. We show empirically that our approach can handle noisy and biased data sources, and that simultaneously inferring reliability and transferring information between neighbouring reports leads to more accurate predictions. We demonstrate our method on two real-world problems from disaster response, showing how our approach reduces the amount of crowdsourced data required and can be used to generate valuable heatmap visualisations from SMS messages and satellite images.

Edwin Simpson, Steven Reece, Stephen J. Roberts

### Bayesian Inference for Least Squares Temporal Difference Regularization

This paper proposes a fully Bayesian approach for Least-Squares Temporal Differences (LSTD), resulting in fully probabilistic inference of value functions that avoids the overfitting commonly experienced with classical LSTD when the number of features is larger than the number of samples. Sparse Bayesian learning provides an elegant solution through the introduction of a prior over value function parameters. This gives us the advantages of probabilistic predictions, a sparse model, and good generalisation capabilities, as irrelevant parameters are marginalised out. The algorithm efficiently approximates the posterior distribution through variational inference. We demonstrate experimentally the ability of the algorithm to avoid overfitting.

Nikolaos Tziortziotis, Christos Dimitrakakis

### Discovery of Causal Models that Contain Latent Variables Through Bayesian Scoring of Independence Constraints

Discovering causal structure from observational data in the presence of latent variables remains an active research area. Constraint-based causal discovery algorithms are relatively efficient at discovering such causal models from data using independence tests. Typically, however, they derive and output only one such model. In contrast, Bayesian methods can generate and probabilistically score multiple models, outputting the most probable one; however, they are often computationally infeasible to apply when modeling latent variables. We introduce a hybrid method that derives a Bayesian probability that the set of independence tests associated with a given causal model are jointly correct. Using this constraint-based scoring method, we are able to score multiple causal models, which possibly contain latent variables, and output the most probable one. The structure-discovery performance of the proposed method is compared to an existing constraint-based method (RFCI) using data generated from several previously published Bayesian networks. The structural Hamming distances of the output models improved when using the proposed method compared to RFCI, especially for small sample sizes.

Fattaneh Jabbari, Joseph Ramsey, Peter Spirtes, Gregory Cooper

### Labeled DBN Learning with Community Structure Knowledge

Learning interactions between dynamical processes is a widespread but difficult problem in ecological or human sciences. Unlike in other domains (bioinformatics, for example), data is often scarce, but expert knowledge is available. We consider the case where knowledge is about a limited number of interactions that drive the processes’ dynamics, and about a community structure in the interaction network. We propose an original framework, based on Dynamic Bayesian Networks with labeled-edge structure and parsimonious parameterization, and a Stochastic Block Model prior, to integrate this knowledge. Then we propose a restoration-estimation algorithm, based on 0-1 Linear Programming, that improves network learning when these two types of expert knowledge are available. The approach is illustrated on a problem of ecological interaction network learning.

E. Auclair, N. Peyrard, R. Sabbadin

### Multi-view Generative Adversarial Networks

Learning over multi-view data is a challenging problem with strong practical applications. Most related studies focus on the classification point of view and assume that all the views are available at any time. We consider an extension of this framework in two directions. First, based on the BiGAN model, the Multi-view BiGAN (MV-BiGAN) is able to perform density estimation from multi-view inputs. Second, it can deal with missing views and is able to update its prediction when additional views are provided. We illustrate these properties on a set of experiments over different datasets.

Mickaël Chen, Ludovic Denoyer

### Online Sparse Collapsed Hybrid Variational-Gibbs Algorithm for Hierarchical Dirichlet Process Topic Models

Topic models for text analysis are most commonly trained using either Gibbs sampling or variational Bayes. Recently, hybrid variational-Gibbs algorithms have been found to combine the best of both worlds. Variational algorithms are fast to converge and more efficient for inference on new documents. Gibbs sampling enables sparse updates since each token is only associated with one topic instead of a distribution over all topics. Additionally, Gibbs sampling is unbiased. Although Gibbs sampling takes longer to converge, it is guaranteed to arrive at the true posterior after infinitely many iterations. By combining the two methods it is possible to reduce the bias of variational methods while simultaneously speeding up variational updates. This idea has previously been applied to standard latent Dirichlet allocation (LDA). We propose a new sampling method that enables the application of the idea to the nonparametric version of LDA, hierarchical Dirichlet process topic models. Our fast sampling method leads to a significant speedup of variational updates as compared to other sampling methods. Experiments show that training of our topic model converges to a better log-likelihood than previously existing variational methods and converges faster than Gibbs sampling in the batch setting.

Sophie Burkhardt, Stefan Kramer

### PAC-Bayesian Analysis for a Two-Step Hierarchical Multiview Learning Approach

We study two-level multiview learning with more than two views under the PAC-Bayesian framework. This approach, sometimes referred to as late fusion, consists in sequentially learning multiple view-specific classifiers at the first level, and then combining these view-specific classifiers at the second level. Our main theoretical result is a generalization bound on the risk of the majority vote which exhibits a term of diversity in the predictions of the view-specific classifiers. This result shows that controlling the trade-off between diversity and accuracy is a key element for multiview learning, which complements other results in multiview learning. Finally, we evaluate our principle on multiview datasets extracted from the Reuters RCV1/RCV2 collection.

Anil Goyal, Emilie Morvant, Pascal Germain, Massih-Reza Amini

### Partial Device Fingerprints

In computing, remote devices may be identified by means of device fingerprinting, which works by collecting a myriad of client-side attributes such as the device’s browser and operating system version, installed plugins, screen resolution, hardware artifacts, Wi-Fi settings, and anything else available to the server, and then merging these attributes into uniquely identifying fingerprints. This technique is used in practice to present personalized content to repeat website visitors, detect fraudulent users, and stop masquerading attacks on local networks. However, device fingerprints are seldom uniquely identifying. They are better viewed as partial device fingerprints, which do have some discriminatory power but not enough to uniquely identify users. How can we infer from partial fingerprints whether different observations belong to the same device? We present a mathematical formulation of this problem that enables probabilistic inference of the correspondence of observations. We set out to estimate a correspondence probability for every pair of observations that reflects the plausibility that they are made by the same user. By extending probabilistic data association techniques previously used in object tracking, traffic surveillance and citation matching, we develop a general-purpose probabilistic method for estimating correspondence probabilities with partial fingerprints. Our approach exploits the natural variation in fingerprints and allows for use of situation-specific knowledge through the specification of a generative probability model. Experiments with a real-world dataset show that our approach gives calibrated correspondence probabilities. Moreover, we demonstrate that improved results can be obtained by combining device fingerprints with behavioral models.

Michael Ciere, Carlos Gañán, Michel van Eeten

### Robust Multi-view Topic Modeling by Incorporating Detecting Anomalies

Multi-view text data consist of texts from different sources. For instance, multilingual Wikipedia corpora contain articles in different languages which are created by different groups of users. Because multi-view text data are often created in a distributed fashion, information from different sources may not be consistent. Such inconsistency introduces noise into the analysis of this kind of data. In this paper, we propose a probabilistic topic model for multi-view data, which is robust against noise. The proposed model can also be used for detecting anomalies. In our experiments on Wikipedia data sets, the proposed model is more robust than existing multi-view topic models in terms of held-out perplexity.

Guoxi Zhang, Tomoharu Iwata, Hisashi Kashima

### A Regularization Method with Inference of Trust and Distrust in Recommender Systems

In this study we investigate the recommendation problem with trust and distrust relationships to overcome the sparsity of users’ preferences, accounting for the fact that users trust the recommendations of their friends, and they do not accept the recommendations of their foes. In addition, not only users’ preferences are sparse, but also users’ social relationships. So, we first propose an inference step with multiple random walks to predict the implicit-missing trust relationships that users might have in recommender systems, while considering users’ explicit trust and distrust relationships during the inference. We introduce a regularization method and design an objective function with a social regularization term to weigh the influence of friends’ trust and foes’ distrust degrees on users’ preferences. We formulate the objective function of our regularization method as a minimization problem with respect to the users’ and items’ latent features and then we solve our recommendation problem via gradient descent. Our experiments confirm that our approach preserves relatively high recommendation accuracy in the presence of sparsity in both the users’ preferences and social relationships, significantly outperforming several state-of-the-art methods.

Dimitrios Rafailidis, Fabio Crestani

### A Unified Contextual Bandit Framework for Long- and Short-Term Recommendations

We present a unified contextual bandit framework for recommendation problems that is able to capture long- and short-term interests of users. The model is devised in dual space, and the derivation is consequently carried out using Fenchel-Legendre conjugates and thus generalizes to a wide range of tasks and settings. We detail two instantiations for regression and classification scenarios and obtain well-known algorithms for these special cases. The resulting general and unified framework allows for quickly adapting contextual bandits to different applications at hand. The empirical study demonstrates that the proposed long- and short-term framework outperforms both short-term and long-term models on data. Moreover, a tweak of the combined model proves beneficial in cold start problems.

M. Tavakol, U. Brefeld

### Perceiving the Next Choice with Comprehensive Transaction Embeddings for Online Recommendation

Predicting a customer’s next choice in the context of what he or she has bought in a session is interesting and critical in the transaction domain, especially for online shopping. Precise prediction leads to high-quality recommendations and thus high benefit. This kind of recommendation is usually formalized as a transaction-based recommender system (TBRS). Existing TBRS either tend to recommend popular items while ignoring infrequent and newly released ones (e.g., pattern-based RS) or assume a rigid order between items within a transaction (e.g., Markov Chain-based RS), which does not hold in most real-world cases. In this paper, we propose a neural network-based comprehensive transaction embedding model (NTEM) which can effectively perceive the next choice in a transaction context. Specifically, we learn these comprehensive embeddings of both items and their features from relaxed ordered transactions. The relevance between items revealed by the transactions is encoded into such embeddings. With rich information embedded, such embeddings are powerful for predicting the next choices given the items already bought. NTEM is a shallow wide-in-wide-out network, which is more efficient than deep networks considering large numbers of items and transactions. Experimental results on real-world datasets show that NTEM outperforms three typical TBRS models, FPMC, PRME and GRU4Rec, in terms of recommendation accuracy and novelty. Our implementation is available at https://github.com/shoujin88/NTEM-model.

Shoujin Wang, Liang Hu, Longbing Cao

### Adaptive Skip-Train Structured Regression for Temporal Networks

A broad range of high impact applications involve learning a predictive model in a temporal network environment. In weather forecasting, predicting effectiveness of treatments, outcomes in healthcare and in many other domains, networks are often large, while intervals between consecutive time moments are brief. Therefore, models are required to forecast in a more scalable and efficient way, without compromising accuracy. The Gaussian Conditional Random Field (GCRF) is a widely used graphical model for performing structured regression on networks. However, GCRF is not applicable to large networks and it cannot capture different network substructures (communities) since it considers the entire network while learning. In this study, we present a novel model, Adaptive Skip-Train Structured Ensemble (AST-SE), which is a sampling-based structured regression ensemble for prediction on top of temporal networks. AST-SE takes advantage of the scheme of ensemble methods to allow multiple GCRFs to learn from several subnetworks. The proposed model is able to automatically skip the entire training or some phases of the training process. The prediction accuracy and efficiency of AST-SE were assessed and compared against alternatives on synthetic temporal networks and the H3N2 Virus Influenza network. The obtained results provide evidence that (1) AST-SE is ∼140 times faster than GCRF as it skips retraining quite frequently; (2) it still captures the original network structure more accurately than GCRF while operating solely on partial views of the network; (3) it outperforms both unweighted and weighted GCRF ensembles which also operate on subnetworks but require retraining at each timestep. Code and data related to this chapter are available at: https://doi.org/10.6084/m9.figshare.5444500.

Martin Pavlovski, Fang Zhou, Ivan Stojkovic, Ljupco Kocarev, Zoran Obradovic

### ALADIN: A New Approach for Drug–Target Interaction Prediction

Due to its pharmaceutical applications, one of the most prominent machine learning challenges in bioinformatics is the prediction of drug–target interactions. State-of-the-art approaches are based on various techniques, such as matrix factorization, restricted Boltzmann machines, network-based inference and bipartite local models (BLM). In this paper, we extend BLM by the incorporation of a hubness-aware regression technique coupled with an enhanced representation of drugs and targets in a multi-modal similarity space. Additionally, we propose to build a projection-based ensemble. Our technique (ALADIN) is evaluated on publicly available real-world drug–target interaction datasets. The results show that our approach statistically significantly outperforms BLM-NII, a recent version of BLM, as well as NetLapRLS and WNN-GIP. Code related to this chapter is available at: https://github.com/lpeska/ALADIN. Data related to this chapter are available at: https://zenodo.org/record/556337#.WPiAzIVOIdV. Supplementary material is available at: http://www.biointelligence.hu/dti/

Krisztian Buza, Ladislav Peska

### Co-Regularised Support Vector Regression

We consider a semi-supervised learning scenario for regression, where only few labelled examples, many unlabelled instances and different data representations (multiple views) are available. For this setting, we extend support vector regression with a co-regularisation term and obtain co-regularised support vector regression (CoSVR). In addition to labelled data, co-regularisation includes information from unlabelled examples by ensuring that models trained on different views make similar predictions. Ligand affinity prediction is an important real-world problem that fits into this scenario. The characterisation of the strength of protein-ligand bonds is a crucial step in the process of drug discovery and design. We introduce variants of the base CoSVR algorithm and discuss their theoretical and computational properties. For the CoSVR function class we provide a theoretical bound on the Rademacher complexity. Finally, we demonstrate the usefulness of CoSVR for the affinity prediction task and evaluate its performance empirically on different protein-ligand datasets. We show that CoSVR outperforms co-regularised least squares regression as well as existing state-of-the-art approaches for affinity prediction. Code and data related to this chapter are available at: https://doi.org/10.6084/m9.figshare.5427241.

Katrin Ullrich, Michael Kamp, Thomas Gärtner, Martin Vogt, Stefan Wrobel

### Online Regression with Controlled Label Noise Rate

Many online regression (and adaptive filtering) algorithms are linear, use additive updates, and are designed for the noise-free setting. We consider the practical setting where the algorithm’s feedback is noisy, rather than a clean label. We propose a new family of algorithms which modifies the learning rate based on the noise-variance of the feedback (labels), by shrinking both inputs and feedbacks, based on the amount of noise per input instance. We consider both settings, where the noise is either given or estimated. An empirical study with both synthetic and real-world speech data shows that our algorithms improve the overall performance of the regressor, even when there is no additional explicit information (i.e. amount of noise). We also consider a more general setting where an algorithm can sample more than a single (noisy) label, yet there is a total (or average) budget for the feedback. We propose a few strategies for effectively spending the given budget, which are based on noise-variance estimation and our shrinkage rule. We show empirically that our approach outperforms other naive approaches.

Edward Moroshko, Koby Crammer

### Generalized Inverse Reinforcement Learning with Linearly Solvable MDP

In this paper, we consider a generalized variant of inverse reinforcement learning (IRL) that estimates both a cost (negative reward) function and a transition probability from observed optimal behavior. In theoretical studies of standard IRL, which estimates only the cost function, it is well known that IRL involves a non-identifiable problem, i.e., the cost function cannot be determined uniquely. This problem has been solved by using a new class of Markov decision process (MDP) called a linearly solvable MDP (LMDP). In this paper, we investigate whether a non-identifiable problem occurs in the generalized variant of IRL (gIRL) using the framework of LMDP and construct a new gIRL method. The contributions of this study are summarized as follows: (i) We point out that gIRL with LMDP suffers from a non-identifiable problem. (ii) We propose a Bayesian method to escape the non-identifiable problem. (iii) We validate the proposed method by performing an experiment on synthetic data and real car probe data.

Masahiro Kohjima, Tatsushi Matsubayashi, Hiroshi Sawada

### Max K-Armed Bandit: On the ExtremeHunter Algorithm and Beyond

This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduced to a classical version of the bandit problem to a certain extent. Beyond the formal analysis, these two approaches are compared through numerical experiments.

Mastane Achab, Stephan Clémençon, Aurélien Garivier, Anne Sabourin, Claire Vernade

### Variational Thompson Sampling for Relational Recurrent Bandits

In this paper, we introduce a novel non-stationary bandit setting, called relational recurrent bandit, where rewards of arms at successive time steps are interdependent. The aim is to discover temporal and structural dependencies between arms in order to maximize the cumulative collected reward. Two algorithms are proposed: the first one directly models temporal dependencies between arms, while the second one assumes the existence of hidden states of the system behind the observed rewards. For both approaches, we develop a Variational Thompson Sampling method, which approximates distributions via variational inference, and uses the estimated distributions to sample reward expectations at each iteration of the process. Experiments conducted on both synthetic and real data demonstrate the effectiveness of our approaches.

Sylvain Lamprier, Thibault Gisselbrecht, Patrick Gallinari
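
For readers unfamiliar with the sampling step this abstract refers to, the sketch below shows plain Thompson Sampling on a stationary Bernoulli bandit with exact Beta posteriors. It is a deliberately simpler stand-in: the chapter's setting is non-stationary, with dependencies between arms and posteriors approximated by variational inference, none of which is modelled here. The function name `thompson_sampling` and the arm means are hypothetical.

```python
import random


def thompson_sampling(true_means, n_rounds=3000, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns pull counts per arm.

    Simplified stand-in: stationary, independent arms, exact posteriors.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior for every arm
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Sample one plausible reward expectation per arm from its posterior.
        samples = [rng.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        # Play the arm whose sampled expectation is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the played arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Toy usage with hypothetical arm means.
print(thompson_sampling([0.1, 0.5, 0.9]))
```

Sampling from the posterior, rather than acting on its mean, is what drives exploration: arms with uncertain estimates occasionally produce large samples and get tried again.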

### Explaining Deviating Subsets Through Explanation Networks

We propose a novel approach to finding explanations of deviating subsets, often called subgroups. Existing approaches for subgroup discovery rely on various quality measures that nonetheless often fail to find subgroup sets that are diverse, of high quality, and most importantly, provide good explanations of the deviations that occur in the data. To tackle this issue we introduce explanation networks, which provide a holistic view on all candidate subgroups and how they relate to each other, offering elegant ways to select high-quality yet diverse subgroup sets. Explanation networks are constructed by representing subgroups by nodes and having weighted edges represent the extent to which one subgroup explains another. Explanatory strength is defined by extending ideas from database causality, in which interventions are used to quantify the effect of one query on another. Given an explanatory network, existing network analysis techniques can be used for subgroup discovery. In particular, we study the use of PageRank for pattern ranking and seed selection (from influence maximization) for pattern set selection. Experiments on synthetic and real data show that the proposed approach finds subgroup sets that are more likely to capture the generative processes of the data than other methods.

Antti Ukkonen, Vladimir Dzyuba, Matthijs van Leeuwen
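
To illustrate how PageRank can rank nodes of a weighted network such as the one described above, here is a minimal power-iteration sketch. The toy graph, the edge direction (an edge from `u` to `v` is assumed to mean that `v` explains `u`, so rank flows toward explainers), and the damping value are assumptions for illustration, not details taken from the chapter.

```python
def pagerank(edges, damping=0.85, n_iter=100):
    """Weighted PageRank by power iteration.

    edges maps each node to a dict {successor: positive weight}.
    """
    nodes = sorted(set(edges) | {v for out in edges.values() for v in out})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(n_iter):
        # Every node receives the same teleportation share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = edges.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                # Distribute rank in proportion to the outgoing edge weights.
                for v, w in out.items():
                    new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank


# Hypothetical explanation network over three subgroups A, B and C.
graph = {"A": {"C": 1.0}, "B": {"C": 2.0}, "C": {"A": 0.5}}
ranks = pagerank(graph)
```

In this toy graph both `A` and `B` point to `C`, so `C` ends up with the highest rank.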

### Flash Points: Discovering Exceptional Pairwise Behaviors in Vote or Rating Data

We address the problem of discovering contexts that lead well-distinguished collections of individuals to change their pairwise agreement w.r.t. their usual one. For instance, in the European parliament, while overall a strong disagreement is witnessed between deputies of the far-right French party Front National and deputies of the left party Front de Gauche, a strong agreement is observed between these deputies in votes related to the theme "External relations with the union". We devise the method DSC (Discovering Similarities Changes), which relies on exceptional model mining to uncover three-set patterns that identify contexts and two collections of individuals where an unexpected strengthening or weakening of pairwise agreement is observed. To efficiently explore the search space, we define some closure operators and pruning techniques using upper bounds on the quality measure. In addition to handling usual attributes (e.g. numerical, nominal), we propose a novel pattern domain which involves hierarchical multi-tag attributes that are present in many datasets. A thorough empirical study on two real-world datasets (i.e., European parliament votes and collaborative movie reviews) demonstrates the efficiency and the effectiveness of our approach as well as the interest and the actionability of the patterns.

Adnene Belfodil, Sylvie Cazalens, Philippe Lamarre, Marc Plantevit

### A Multiscale Bezier-Representation for Time Series that Supports Elastic Matching

Common time series similarity measures that operate on the full series (like Euclidean distance or Dynamic Time Warping, DTW) do not correspond well to the visual similarity as perceived by a human. Based on the interval tree of scale, we propose a multiscale Bezier representation of time series that supports the definition of elastic similarity measures that overcome this problem. With this representation the matching can be performed efficiently, as similarity is measured segment-wise rather than element-wise (as with DTW). We effectively restrict the set of warping paths considered by DTW, and the results not only correspond better to the analyst's intuition but also improve the accuracy in the standard 1NN time series classification.

F. Höppner, T. Sobek

### Arbitrated Ensemble for Time Series Forecasting

This paper proposes an ensemble method for time series forecasting tasks. Combining different forecasting models is a common approach to tackle these problems. State-of-the-art methods track the loss of the available models and adapt their weights accordingly. Metalearning strategies such as stacking are also used in these tasks. We propose a metalearning approach for adaptively combining forecasting models that specializes them across the time series. Our assumption is that different forecasting models have different areas of expertise and a varying relative performance. Moreover, many time series show recurring structures due to factors such as seasonality. Therefore, the ability of a method to deal with changes in relative performance of models as well as recurrent changes in the data distribution can be very useful in dynamic environments. Our approach is based on an ensemble of heterogeneous forecasters, arbitrated by a metalearning model. This strategy is designed to cope with the different dynamics of time series and quickly adapt the ensemble to regime changes. We validate our proposal using time series from several real world domains. Empirical results show the competitiveness of the method in comparison to state-of-the-art approaches for combining forecasters.

Vítor Cerqueira, Luís Torgo, Fábio Pinto, Carlos Soares

### Cost Sensitive Time-Series Classification

This paper investigates the problem of highly imbalanced time-series classification using shapelets, short patterns that best characterize the target time-series, which are highly discriminative. The current state-of-the-art approach learns generalized shapelets along with weights of the classification hyperplane via a classical cost-insensitive loss function. Cost-insensitive loss functions tend to treat different misclassification errors equally and thus, models are usually biased towards examples of the majority class. The rare class (which will be referred to as the positive class) is usually the important class and a false negative is always costlier than a false positive. Traditional 0–1 loss functions fail to differentiate between these two types of misclassification errors. In this paper, the generalized shapelets learning framework is extended and a cost-sensitive learning model is proposed. Instead of incorporating the misclassification cost as prior knowledge, as was done by other published methods, we formulate a constrained optimization problem to learn the unknown misclassification costs along with the shapelets and their weights. First, we demonstrate the effectiveness of the proposed method on two case studies, with the objective of detecting true alarms in a life-threatening cardiac arrhythmia dataset from PhysioNet's MIMIC II repository. The results show improved true alarm detection rates over the current state-of-the-art method. Next, we compare with the state-of-the-art shapelet learning method on 16 balanced datasets from the UCR time-series repository. The results show evidence that the proposed method outperforms the state-of-the-art method. Finally, we perform extensive experiments across 18 additional imbalanced time-series datasets. The results provide evidence that the proposed method achieves comparable results with the state-of-the-art sampling/non-sampling based approaches for highly imbalanced time-series datasets. However, our method is highly interpretable, which is an advantage over many other methods.

Shoumik Roychoudhury, Mohamed Ghalwash, Zoran Obradovic

### Cost-Sensitive Perceptron Decision Trees for Imbalanced Drifting Data Streams

Mining streaming and drifting data is among the most popular contemporary applications of machine learning methods. Due to the potentially unbounded number of instances arriving rapidly, evolving concepts and limitations imposed on utilized computational resources, there is a need to develop efficient and adaptive algorithms that can handle such problems. These learning difficulties can be further augmented by the appearance of skewed distributions during the stream progress. Class imbalance in non-stationary scenarios is highly challenging, as not only the imbalance ratio may change over time, but also the relationships among classes. In this paper we propose an efficient and fast cost-sensitive decision tree learning scheme for handling online class imbalance. In each leaf of the tree we train a perceptron with output adaptation to compensate for skewed class distributions, while McDiarmid’s bound is used for controlling the splitting attribute selection. The cost matrix automatically adapts itself to the current imbalance ratio in the stream, allowing for a smooth compensation of evolving class relationships. Furthermore, we analyze characteristics of minority class instances and incorporate this information during the model update process. This allows our classifier to focus on the most difficult instances, while a sliding window keeps track of changes in class structures. Experimental analysis carried out on a number of binary and multi-class imbalanced data streams indicates the usefulness of the proposed approach.

Bartosz Krawczyk, Przemysław Skryjomski

### Efficient Temporal Kernels Between Feature Sets for Time Series Classification

In the time-series classification context, the majority of the most accurate core methods are based on the Bag-of-Words framework, in which sets of local features are first extracted from time series. A dictionary of words is then learned and each time series is finally represented by a histogram of word occurrences. This representation induces a loss of information due to the quantization of features into words as all the time series are represented using the same fixed dictionary. In order to overcome this issue, we introduce in this paper a kernel operating directly on sets of features. Then, we extend it to a time-compliant kernel that allows one to take into account the temporal information. We apply this kernel in the time series classification context. The proposed kernel has a quadratic complexity with the size of input feature sets, which is problematic when dealing with long time series. However, we show that kernel approximation techniques can be used to define a good trade-off between accuracy and complexity. We experimentally demonstrate that the proposed kernel can significantly improve the performance of time series classification algorithms based on Bag-of-Words. Code related to this chapter is available at: https://github.com/rtavenar/SQFD-TimeSeries. Data related to this chapter are available at: http://www.timeseriesclassification.com

Romain Tavenard, Simon Malinowski, Laetitia Chapel, Adeline Bailly, Heider Sanchez, Benjamin Bustos

### Forecasting and Granger Modelling with Non-linear Dynamical Dependencies

Traditional linear methods for forecasting multivariate time series are not able to satisfactorily model the non-linear dependencies that may exist in non-Gaussian series. We build on the theory of learning vector-valued functions in the reproducing kernel Hilbert space and develop a method for learning prediction functions that accommodate such non-linearities. The method not only learns the predictive function but also the matrix-valued kernel underlying the function search space directly from the data. Our approach is based on learning multiple matrix-valued kernels, each of those composed of a set of input kernels and a set of output kernels learned in the cone of positive semi-definite matrices. In addition to superior predictive performance in the presence of strong non-linearities, our method also recovers the hidden dynamic relationships between the series and thus is a new alternative to existing graphical Granger techniques.

Magda Gregorová, Alexandros Kalousis, Stéphane Marchand-Maillet

### Learning TSK Fuzzy Rules from Data Streams

Learning from data streams has received increasing attention in recent years, not only in the machine learning community but also in other research fields, such as computational intelligence and fuzzy systems. In particular, several rule-based methods for the incremental induction of regression models have been proposed. In this paper, we develop a method that combines the strengths of two existing approaches rooted in different learning paradigms. Our method induces a set of fuzzy rules, which, compared to conventional rules with Boolean antecedents, has the advantage of producing smooth regression functions. To do so, it makes use of an induction technique inspired by AMRules, a very efficient and effective learning algorithm that can be seen as the state of the art in machine learning. We conduct a comprehensive experimental study showing that a combination of the expressiveness of fuzzy rules with the algorithmic concepts of AMRules yields a learning system with superb performance.

Ammar Shaker, Waleri Heldt, Eyke Hüllermeier

### Non-parametric Online AUC Maximization

We consider the problems of online and one-pass maximization of the area under the ROC curve (AUC). AUC maximization is hard even in the offline setting and thus solutions often make some compromises. Existing results for the online problem typically optimize for some proxy defined via surrogate losses instead of maximizing the real AUC. This approach is confirmed by results showing that the optimum of these proxies, over the set of all (measurable) functions, maximizes the AUC. The problem is that, in order to meet the strong requirements for per-round runtime complexity, online methods typically work with restricted hypothesis classes and this, as we show, corrupts the above compatibility and causes the methods to converge to suboptimal solutions even in some simple stochastic cases. To remedy this, we propose a different approach and show that it leads to asymptotic optimality. Our theoretical claims and considerations are tested by experiments on real datasets, which provide empirical justification for them.

Balázs Szörényi, Snir Cohen, Shie Mannor
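
The quantity being maximized in the abstract above can be made concrete with a short sketch: the empirical AUC is the fraction of positive-negative pairs that a scoring function ranks correctly, with ties counted as one half. The function name and the example scores are illustrative assumptions; this is the standard definition, not the online method proposed in the chapter.

```python
def pairwise_auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0   # positive scored above negative
            elif p == q:
                correct += 0.5   # ties count as half a correct pair
    return correct / (len(pos) * len(neg))


auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 3 of 4 pairs correct
```

Because the sum runs over pairs of examples rather than single examples, each new example interacts with all earlier ones of the opposite class, which is one reason one-pass methods commonly resort to surrogate losses.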

### On-Line Dynamic Time Warping for Streaming Time Series

Dynamic Time Warping is a well-known measure of dissimilarity between time series. Due to its flexibility to deal with non-linear distortions along the time axis, this measure has been widely utilized in machine learning models for this particular kind of data. Nowadays, the proliferation of streaming data sources has ignited the interest and attention of the scientific community around on-line learning models. In this work, we naturally adapt Dynamic Time Warping to the on-line learning setting. Specifically, we propose a novel on-line measure of dissimilarity for streaming time series which combines a warp constraint and a weighted memory mechanism to simplify the time series alignment and adapt to non-stationary data intervals along time. Computer simulations are analyzed and discussed so as to shed light on the performance and complexity of the proposed measure.

Izaskun Oregi, Aritz Pérez, Javier Del Ser, José A. Lozano
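
For readers unfamiliar with the offline measure that the abstract above builds on, the following sketch computes classic Dynamic Time Warping with an optional warp constraint (a Sakoe-Chiba band). It is the textbook batch algorithm, not the on-line measure proposed in the chapter; the function name and the `window` parameter are assumptions for illustration.

```python
import math


def dtw_distance(x, y, window=None):
    """Classic DTW between two numeric sequences with an optional band constraint."""
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)            # no constraint: every alignment is allowed
    window = max(window, abs(n - m))  # the band must at least reach the end cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells within `window` of the diagonal are filled in.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x[i-1] also matches y[j-1]
                                 cost[i][j - 1],      # y[j-1] also matches x[i-1]
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]


unconstrained = dtw_distance([1, 2, 3, 3], [1, 2, 2, 3])            # warping absorbs the shift
diagonal_only = dtw_distance([1, 2, 3, 3], [1, 2, 2, 3], window=0)  # point-by-point comparison
```

With `window=0` the measure reduces to a point-by-point comparison, so the narrower the band, the cheaper the computation and the less warping is tolerated.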

### PowerCast: Mining and Forecasting Power Grid Sequences

What will be the power consumption of our institution at 8am for the upcoming days? What will happen to the power consumption of a small factory, if it wants to double (or halve) its production? Technologies associated with the smart electrical grid are needed. Central to this process are algorithms that accurately model electrical load behavior, and forecast future electric power demand. However, existing power load models fail to accurately represent electrical load behavior in the grid. In this paper, we propose PowerCast, a novel domain-aware approach for forecasting the electrical power demand, by carefully incorporating domain knowledge. Our contributions are as follows: 1. Infusion of domain expert knowledge: We represent the time sequences using an equivalent circuit model, the “BIG” model, which allows for an intuitive interpretation of the power load, as the BIG model is derived from physics-based first principles. 2. Forecasting of the power load: PowerCast uses the BIG model, and provides (a) accurate prediction in multi-step-ahead forecasting, and (b) extrapolations, under what-if scenarios, such as variation in the demand (say, due to an increase in the number of people on campus, or a decision to halve the production in our factory). 3. Anomaly detection: PowerCast can spot, and even explain, anomalies in the given time sequences. The experimental results, based on two real-world datasets of up to three weeks' duration, demonstrate that PowerCast is able to forecast several steps ahead, with 59% error reduction compared to the competitors. Moreover, it is fast, and scales linearly with the duration of the sequences.

Hyun Ah Song, Bryan Hooi, Marko Jereminov, Amritanshu Pandey, Larry Pileggi, Christos Faloutsos

### UAPD: Predicting Urban Anomalies from Spatial-Temporal Data

Urban city environments face the challenge of disturbances, which can create inconveniences for their citizens. These require timely detection and resolution, and more importantly timely preparedness on the part of city officials. We term these disturbances anomalies, and ask whether it is possible to predict these anomalous events (proactive), and not just detect them (reactive). While significant effort has been made in detecting anomalies in existing urban data, the prediction of future urban anomalies is much less well studied and understood. In this work, we formalize the future anomaly prediction problem in urban environments, such that those can be addressed in a more efficient and effective manner. We develop the Urban Anomaly PreDiction (UAPD) framework, which addresses a number of challenges, including the dynamic, spatial varieties of different categories of anomalies. Given the urban anomaly data to date, UAPD first detects the change point of each type of anomaly in the temporal dimension and then uses a tensor decomposition model to decouple the interrelations between the spatial and categorical dimensions. Finally, UAPD applies an autoregression method to predict which categories of anomalies will happen at each region in the future. We conduct extensive experiments in two urban environments, namely New York City and Pittsburgh. Experimental results demonstrate that UAPD outperforms alternative baselines across various settings, including different region and time-frame scales, as well as diverse categories of anomalies. Code related to this chapter is available at: https://bitbucket.org/xianwu9/uapd.

Xian Wu, Yuxiao Dong, Chao Huang, Jian Xu, Dong Wang, Nitesh V. Chawla

### LKT-FM: A Novel Rating Pattern Transfer Model for Improving Non-overlapping Cross-Domain Collaborative Filtering

Cross-Domain Collaborative Filtering (CDCF) has attracted considerable research interest in recent years. However, an important problem setting, i.e., “users and items in source and target domains are totally different”, has not received much attention yet. We term this problem Non-Overlapping Cross-Domain Collaborative Filtering (NOCDCF). In order to solve this challenging CDCF task, we propose a novel 3-step rating pattern transfer model, i.e. low-rank knowledge transfer via factorization machines (LKT-FM). Our solution is able to mine high-quality knowledge from large and sparse source matrices, and to integrate the knowledge without losing much information contained in the target matrix by exploiting Factorization Machines (FM). Extensive experiments on real-world datasets show that the proposed LKT-FM model outperforms the state-of-the-art CDCF solutions.

Yizhou Zang, Xiaohua Hu

### Distributed Multi-task Learning for Sensor Network

A sensor in a sensor network is expected to be able to make predictions or decisions using the models learned from the data observed on this sensor. However, in the early stage of using a sensor, there may not be much data available to train the model for this sensor. A solution is to leverage the observation data from other sensors which have conditions and models similar to those of the given sensor. We thus propose a novel distributed multi-task learning approach which incorporates neighborhood relations among sensors to learn multiple models simultaneously, in which each sensor corresponds to one task. Three constraints apply: it may not be cheap for each sensor to transfer the observation data from other sensors; broadcasting the observation data of a sensor to the entire network is undesirable for reasons of privacy protection; and each sensor is expected to make real-time predictions independently of neighbor sensors. Therefore, this approach shares the model parameters as regularization terms in the objective function by assuming that neighbor sensors have similar model parameters. We conduct experiments on two real datasets by predicting the temperature with regression. They verify that our approach is effective, especially when the bias of an independent model which does not utilize the data from other sensors is high, such as when little training data is available.

Jiyi Li, Tomohiro Arai, Yukino Baba, Hisashi Kashima, Shotaro Miwa
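
One generic way to write down the idea of sharing parameters through a regularization term is sketched below. This is an assumed, textbook-style objective and not necessarily the chapter's formulation: $$w_i$$ denotes the model parameters of sensor $$i$$, $$(x_{i,t}, y_{i,t})$$ its $$n_i$$ local observations, $$N(i)$$ its set of neighbor sensors, and $$\lambda \ge 0$$ a hypothetical trade-off weight.

$$\min_{w_1,\dots,w_m} \; \sum_{i=1}^{m} \frac{1}{n_i} \sum_{t=1}^{n_i} \left( y_{i,t} - w_i^{\top} x_{i,t} \right)^2 \; + \; \lambda \sum_{i=1}^{m} \sum_{j \in N(i)} \left\lVert w_i - w_j \right\rVert_2^2$$

The first term fits each sensor to its own data only, while the second term pulls the parameters of neighboring sensors toward each other, so in a scheme of this kind only parameter vectors, not raw observations, would need to be exchanged.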

### Learning Task Clusters via Sparsity Grouped Multitask Learning

Sparse mapping has been a key methodology in many high-dimensional scientific problems. When multiple tasks share the set of relevant features, learning them jointly in a group drastically improves the quality of relevant feature selection. However, in practice this technique is of limited use since such grouping information is usually hidden. In this paper, our goal is to recover the group structure on the sparsity patterns and leverage that information in the sparse learning. To this end, we formulate a joint optimization problem in the task parameters and the group membership, by constructing an appropriate regularizer to encourage sparse learning as well as correct recovery of task groups. We further demonstrate through extensive experiments that our proposed method accurately recovers the groups and the sparsity patterns in the task parameters.

Meghana Kshirsagar, Eunho Yang, Aurélie C. Lozano

### Lifelong Learning with Gaussian Processes

Recent developments in lifelong machine learning have demonstrated that it is possible to learn multiple tasks consecutively, transferring knowledge between those tasks to accelerate learning and improve performance. However, these methods are limited to using linear parametric base learners, substantially restricting the predictive power of the resulting models. We present a lifelong learning algorithm that can support non-parametric models, focusing on Gaussian processes. To enable efficient online transfer between Gaussian process models, our approach assumes a factorized formulation of the covariance functions, and incrementally learns a shared sparse basis for the models’ parameterizations. We show that this lifelong learning approach is highly computationally efficient, and outperforms existing methods on a variety of data sets.

Christopher Clingerman, Eric Eaton

### Personalized Tag Recommendation for Images Using Deep Transfer Learning

Image tag recommendation in social media systems provides the users with personalized tag suggestions which facilitate the users’ tagging task and enable automatic organization and many image retrieval tasks. Factorization models are a widely used approach for personalized tag recommendation and achieve good results. These methods rely on the user’s tagging preferences only and ignore the contents of the image. However, the contents of the image, such as the objects appearing in the image, colors, shapes or other visual aspects, strongly influence the user’s tagging decisions. We present a personalized content-aware image tag recommendation approach that combines both historical tagging information and image-based features in a factorization model. Employing transfer learning, we apply state-of-the-art deep learning image classification and object detection techniques to extract powerful features from the images. Both image information and tagging history are fed to an adaptive factorization model to recommend tags. Empirically, we demonstrate that the visual and object-based features can improve the performance by up to 1.5% over the state of the art.

Hanh T. H. Nguyen, Martin Wistuba, Lars Schmidt-Thieme

### Ranking Based Multitask Learning of Scoring Functions

Scoring functions are an important tool for quantifying properties of interest in many domains; for example, in healthcare, disease severity scores are used to diagnose a patient’s condition and to decide on further treatment. Scoring functions might be obtained based on domain knowledge or learned from data by using classification, regression or ranking techniques, depending on the type of supervised information. Although learning scoring functions from collected data is beneficial, it can be challenging when limited data are available. Therefore, learning multiple distinct, but related, scoring functions together can increase their quality as shared regularities may be easier to identify. We propose a multitask formulation for ranking-based learning of scoring functions, where the model is trained from pairwise comparisons. The approach uses mixed-norm regularization to impose structural regularities among the tasks. The proposed regularized objective function is convex; therefore, we developed an optimization approach based on alternating minimization and proximal gradient algorithms to solve the problem. The increased predictive accuracy of the presented approach, in comparison to several baselines, is demonstrated on synthetic data and two different real-world applications: predicting exam scores and predicting tolerance to infections score.

Ivan Stojkovic, Mohamed Ghalwash, Zoran Obradovic

### Theoretical Analysis of Domain Adaptation with Optimal Transport

Domain adaptation (DA) is an important and emerging field of machine learning that tackles the problem occurring when the distributions of training (source domain) and test (target domain) data are similar but different. This kind of learning paradigm is of vital importance for future advances as it allows a learner to generalize the knowledge across different tasks. Current theoretical results show that the efficiency of DA algorithms depends on their capacity to minimize the divergence between source and target probability distributions. In this paper, we provide a theoretical study on the advantages that concepts borrowed from optimal transportation theory [17] can bring to DA. In particular, we show that the Wasserstein metric can be used as a divergence measure between distributions to obtain generalization guarantees for three different learning settings: (i) classic DA with unsupervised target data, (ii) DA combining source and target labeled data, (iii) multiple source DA. Based on the obtained results, we motivate the use of the regularized optimal transport and provide some algorithmic insights for multi-source domain adaptation. We also show when this theoretical analysis can lead to tighter inequalities than those of other existing frameworks. We believe that these results open the door to novel ideas and directions for DA.

Ievgen Redko, Amaury Habrard, Marc Sebban
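
To give a feel for the Wasserstein metric mentioned in the abstract above, the sketch below computes the 1-Wasserstein distance between two one-dimensional empirical distributions with the same number of samples. In that special case the optimal transport plan simply matches sorted samples. The function name and the sample values are illustrative assumptions; the chapter's analysis concerns general distributions and is not reproduced here.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples with uniform weights."""
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch needs two non-empty samples of equal size")
    # In one dimension, matching the i-th smallest source point to the
    # i-th smallest target point is an optimal transport plan.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


source = [0.0, 1.0, 2.0]
target = [1.0, 2.0, 3.0]
shift_cost = wasserstein_1d(source, target)  # every unit of mass moves by 1.0
```

Unlike divergences that compare densities pointwise, this distance stays finite and informative even when the two samples do not overlap, since it measures how far mass has to be moved.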

### TSP: Learning Task-Specific Pivots for Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) considers the problem of adapting a classifier trained using labelled training instances from a source domain to a different target domain, without having access to any labelled training instances from the target domain. Projection-based methods, where the source and target domain instances are first projected onto a common feature space on which a classifier can be trained and applied, have produced state-of-the-art results for UDA. However, a critical pre-processing step required by these methods is the selection of a set of common features (a.k.a. pivots); this is typically done using heuristic approaches, applied prior to performing domain adaptation. In contrast to the use of heuristics, we propose a method for learning Task-Specific Pivots (TSPs) in a systematic manner by considering both the labelled and unlabelled data available from both domains. We evaluate TSPs against pivots selected using alternatives in two cross-domain sentiment classification applications. Our experimental results show that the proposed TSPs significantly outperform previously proposed selection strategies in both tasks. Moreover, when applied in a cross-domain sentiment classification task, TSP captures many sentiment-bearing pivots.

Xia Cui, Frans Coenen, Danushka Bollegala

### $$k^2$$-means for Fast and Accurate Large Scale Clustering

We propose $$k^2$$-means, a new clustering method which efficiently copes with large numbers of clusters and achieves low energy solutions. $$k^2$$-means builds upon the standard $$k$$-means (Lloyd’s algorithm) and combines a new strategy to accelerate the convergence with a new low time complexity divisive initialization. The accelerated convergence is achieved through only looking at $$k_n$$ nearest clusters and using triangle inequality bounds in the assignment step while the divisive initialization employs an optimal 2-clustering along a direction. The worst-case time complexity per iteration of our $$k^2$$-means is $$O(nk_nd + k^2d)$$, where $$d$$ is the dimension of the $$n$$ data points and $$k$$ is the number of clusters and usually $$n \gg k \gg k_n$$. Compared to $$k$$-means’ $$O(nkd)$$ complexity, our $$k^2$$-means complexity is significantly lower, at the expense of slightly increasing the memory complexity by $$O(nk_n + k^2)$$. In our extensive experiments $$k^2$$-means is order(s) of magnitude faster than standard methods in computing accurate clusterings on several standard datasets and settings with hundreds of clusters and high dimensional data. Moreover, the proposed divisive initialization generally leads to clustering energies comparable to those achieved with the standard $$k$$-means++ initialization, while being significantly faster.

Eirikur Agustsson, Radu Timofte, Luc Van Gool
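
The core idea of restricting the assignment step to a few nearby clusters can be sketched as follows. This is a simplified illustration under stated assumptions: each point only compares against the $$k_n$$ centers closest to its currently assigned center, while the triangle inequality bounds and the divisive initialization from the abstract are omitted. All function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def neighbor_lists(centers, k_n):
    """For each center, its own index followed by its k_n - 1 nearest other centers."""
    k = len(centers)
    lists = []
    for c in range(k):
        others = sorted((j for j in range(k) if j != c),
                        key=lambda j: sq_dist(centers[c], centers[j]))
        lists.append([c] + others[:k_n - 1])
    return lists


def restricted_assign(points, centers, assignment, k_n):
    """Reassign each point, searching only among neighbors of its current center."""
    neighbors = neighbor_lists(centers, k_n)  # center-to-center work: the k^2 d term
    new_assignment = []
    for p, c in zip(points, assignment):
        # Only k_n candidate centers are checked per point: the n k_n d term.
        best = min(neighbors[c], key=lambda j: sq_dist(p, centers[j]))
        new_assignment.append(best)
    return new_assignment


centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
points = [(0.9, 0.0), (9.0, 0.0)]
narrow = restricted_assign(points, centers, [0, 0], k_n=2)
full = restricted_assign(points, centers, [0, 0], k_n=3)
```

In this toy example the narrow search (`k_n=2`) leaves the second point at center 1 because center 2 is not among the candidates, whereas the full search (`k_n=3`) finds it. This shows the trade-off of the simplified sketch: fewer distance computations in exchange for a dependence on the starting assignment.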

### A Simple Exponential Family Framework for Zero-Shot Learning

We present a simple generative framework for learning to predict previously unseen classes, based on estimating class-attribute-gated class-conditional distributions. We model each class-conditional distribution as an exponential family distribution and the parameters of the distribution of each seen/unseen class are defined as functions of the respective observed class attributes. These functions can be learned using only the seen class data and can be used to predict the parameters of the class-conditional distribution of each unseen class. Unlike most existing methods for zero-shot learning that represent classes as fixed embeddings in some vector space, our generative model naturally represents each class as a probability distribution. It is simple to implement and also allows leveraging additional unlabeled data from unseen classes to improve the estimates of their class-conditional distributions using transductive/semi-supervised learning. Moreover, it extends seamlessly to few-shot learning by easily updating these distributions when provided with a small number of additional labelled examples from unseen classes. Through a comprehensive set of experiments on several benchmark data sets, we demonstrate the efficacy of our framework.

Vinay Kumar Verma, Piyush Rai

### DeepCluster: A General Clustering Framework Based on Deep Learning

In this paper, we propose a general framework, DeepCluster, to integrate traditional clustering methods into deep learning (DL) models, and adopt the Alternating Direction Method of Multipliers (ADMM) to optimize it. While most existing DL-based clustering techniques have separate feature learning (via DL) and clustering (with traditional clustering methods), DeepCluster simultaneously learns feature representation and does cluster assignment under the same framework. Furthermore, it is a general and flexible framework that can employ different networks and clustering methods. We demonstrate the effectiveness of DeepCluster by integrating two popular clustering methods, K-means and Gaussian Mixture Model (GMM), into deep networks. The experimental results show that our method can achieve state-of-the-art performance on learning representation for clustering analysis. Code and data related to this chapter are available at: https://github.com/JennyQQL/DeepClusterADMM-Release.

Kai Tian, Shuigeng Zhou, Jihong Guan

### Multi-view Spectral Clustering on Conflicting Views

In a growing number of application domains, multiple feature representations or views are available to describe objects. Multi-view clustering tries to find similar groups of objects across these views. This task is complicated when the corresponding clusterings in each view show poor agreement (conflicting views). In such cases, traditional multi-view clustering methods will not benefit from using multi-view data. Here, we propose to overcome this problem by combining the ideas of multi-view spectral clustering with alternative clustering through kernel-based dimensionality reduction. Our method automatically determines feature transformations in each view that lead to an optimal clustering w.r.t. a newly proposed objective function for conflicting views. In our experiments, our approach outperforms state-of-the-art multi-view clustering methods by more accurately detecting the ground truth clustering supported by all views.

Xiao He, Limin Li, Damian Roqueiro, Karsten Borgwardt

### Pivot-Based Distributed K-Nearest Neighbor Mining

k-nearest neighbor (kNN) search is a fundamental data mining task critical to many data analytics methods. Yet no effective techniques to date scale kNN search to large datasets. In this work we present PkNN, an exact distributed method that by leveraging modern distributed architectures for the first time scales kNN search to billion point datasets. The key to the PkNN strategy is a multi-round kNN search that exploits pivot-based data partitioning at each stage. This includes an outlier-driven partition adjustment mechanism that effectively minimizes data duplication and achieves a balanced workload across the compute cluster. Aggressive data-driven bounds along with a tiered support assignment strategy ensure correctness while limiting computation costs. Our experimental study on multi-dimensional real-world data demonstrates that PkNN achieves significant speedup over the state-of-the-art and scales effectively in data cardinality. Code and data related to this chapter are available at: http://solar-10.wpi.edu/cakuhlman/PkNN.

Caitlin Kuhlman, Yizhou Yan, Lei Cao, Elke Rundensteiner

### Backmatter
