Skip to main content

2019 | Buch

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part I

herausgegeben von: Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, Georgiana Ifrim

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Über dieses Buch

The three volume proceedings LNAI 11051 – 11053 constitutes the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, held in Dublin, Ireland, in September 2018.

The total of 131 regular papers presented in part I and part II was carefully reviewed and selected from 535 submissions; there are 52 papers in the applied data science, nectar and demo track.

The contributions were organized in topical sections named as follows:
Part I: adversarial learning; anomaly and outlier detection; applications; classification; clustering and unsupervised learning; deep learningensemble methods; and evaluation.
Part II: graphs; kernel methods; learning paradigms; matrix and tensor analysis; online and active learning; pattern and sequence mining; probabilistic models and statistical methods; recommender systems; and transfer learning.
Part III: ADS data science applications; ADS e-commerce; ADS engineering and design; ADS financial and security; ADS health; ADS sensing and positioning; nectar track; and demo track.

Inhaltsverzeichnis

Frontmatter

Adversarial Learning

Frontmatter
Image Anomaly Detection with Generative Adversarial Networks

Many anomaly detection methods exist that perform well on low-dimensional problems however there is a notable lack of effective methods for high-dimensional spaces, such as images. Inspired by recent successes in deep learning we propose a novel approach to anomaly detection using generative adversarial networks. Given a sample under consideration, our method is based on searching for a good representation of that sample in the latent space of the generator; if such a representation is not found, the sample is deemed anomalous. We achieve state-of-the-art performance on standard image benchmark datasets and visual inspection of the most anomalous samples reveals that our method does indeed return anomalies.

Lucas Deecke, Robert Vandermeulen, Lukas Ruff, Stephan Mandt, Marius Kloft
Image-to-Markup Generation via Paired Adversarial Learning

Motivated by the fact that humans can grasp semantic-invariant features shared by the same category while attention-based models focus mainly on discriminative features of each object, we propose a scalable paired adversarial learning (PAL) method for image-to-markup generation. PAL can incorporate the prior knowledge of standard templates to guide the attention-based model for discovering semantic-invariant features when the model pays attention to regions of interest. Furthermore, we also extend the convolutional attention mechanism to speed up the image-to-markup parsing process while achieving competitive performance compared with recurrent attention models. We evaluate the proposed method in the scenario of handwritten-image-to-LaTeX generation, i.e., converting handwritten mathematical expressions to LaTeX. Experimental results show that our method can significantly improve the generalization performance over standard attention-based encoder-decoder models.

Jin-Wen Wu, Fei Yin, Yan-Ming Zhang, Xu-Yao Zhang, Cheng-Lin Liu
Toward an Understanding of Adversarial Examples in Clinical Trials

Deep learning systems can be fooled by small, worst-case perturbations of their inputs, known as adversarial examples. This has been almost exclusively studied in supervised learning, on vision tasks. However, adversarial examples in counterfactual modelling, which sits outside the traditional supervised scenario, is an overlooked challenge. We introduce the concept of adversarial patients, in the context of counterfactual models for clinical trials—this turns out to introduce several new dimensions to the literature. We describe how there exist multiple types of adversarial example—and demonstrate different consequences, e.g. ethical, when they arise. The study of adversarial examples in this area is rich in challenges for accountability and trustworthiness in ML–we highlight future directions that may be of interest to the community.

Konstantinos Papangelou, Konstantinos Sechidis, James Weatherall, Gavin Brown
ShapeShifter: Robust Physical Adversarial Attack on Faster R-CNN Object Detector

Given the ability to directly manipulate image pixels in the digital input space, an adversary can easily generate imperceptible perturbations to fool a Deep Neural Network (DNN) image classifier, as demonstrated in prior work. In this work, we propose ShapeShifter, an attack that tackles the more challenging problem of crafting physical adversarial perturbations to fool image-based object detectors like Faster R-CNN. Attacking an object detector is more difficult than attacking an image classifier, as it needs to mislead the classification results in multiple bounding boxes with different scales. Extending the digital attack to the physical world adds another layer of difficulty, because it requires the perturbation to be robust enough to survive real-world distortions due to different viewing distances and angles, lighting conditions, and camera limitations. We show that the Expectation over Transformation technique, which was originally proposed to enhance the robustness of adversarial perturbations in image classification, can be successfully adapted to the object detection setting. ShapeShifter can generate adversarially perturbed stop signs that are consistently mis-detected by Faster R-CNN as other objects, posing a potential threat to autonomous vehicles and other safety-critical computer vision systems. Code related to this paper is available at: https://github.com/shangtse/robust-physical-attack .

Shang-Tse Chen, Cory Cornelius, Jason Martin, Duen Horng (Polo) Chau

Anomaly and Outlier Detection

Frontmatter
GridWatch: Sensor Placement and Anomaly Detection in the Electrical Grid

Given sensor readings over time from a power grid consisting of nodes (e.g. generators) and edges (e.g. power lines), how can we most accurately detect when an electrical component has failed? More challengingly, given a limited budget of sensors to place, how can we determine where to place them to have the highest chance of detecting such a failure? Maintaining the reliability of the electrical grid is a major challenge. An important part of achieving this is to place sensors in the grid, and use them to detect anomalies, in order to quickly respond to a problem. Our contributions are: (1) Online anomaly detection: we propose a novel, online anomaly detection algorithm that outperforms existing approaches. (2) Sensor placement: we construct an optimization objective for sensor placement, with the goal of maximizing the probability of detecting an anomaly. We show that this objective has the property of submodularity, which we exploit in our sensor placement algorithm. (3) Effectiveness: Our sensor placement algorithm is provably near-optimal, and both our algorithms outperform existing approaches in accuracy by $$59\%$$ or more (F-measure) in experiments. (4) Scalability: our algorithms scale linearly, and our detection algorithm is online, requiring bounded space and constant time per update. Code related to this paper is available at: https://github.com/bhooi/gridwatch .

Bryan Hooi, Dhivya Eswaran, Hyun Ah Song, Amritanshu Pandey, Marko Jereminov, Larry Pileggi, Christos Faloutsos
Incorporating Privileged Information to Unsupervised Anomaly Detection

We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information—data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [17, 19], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through “imitation” functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain-knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using “historical future” data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting privileged information to anomaly detection significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i–iii); with PI capturing expert knowledge, computationally-expensive features, and future data on three real world detection tasks. Code related to this paper is available at: http://www.andrew.cmu.edu/user/shubhras/SPI .

Shubhranshu Shekhar, Leman Akoglu
L1-Depth Revisited: A Robust Angle-Based Outlier Factor in High-Dimensional Space

Angle-based outlier detection (ABOD) has been recently emerged as an effective method to detect outliers in high dimensions. Instead of examining neighborhoods as proximity-based concepts, ABOD assesses the broadness of angle spectrum of a point as an outlier factor. Despite being a parameter-free and robust measure in high-dimensional space, the exact solution of ABOD suffers from the cubic cost $$O(n^3)$$ regarding the data size n, hence cannot be used on large-scale data sets.In this work we present a conceptual relationship between the ABOD intuition and the L1-depth concept in statistics, one of the earliest methods used for detecting outliers. Deriving from this relationship, we propose to use L1-depth as a variant of angle-based outlier factors, since it only requires a quadratic computational time as proximity-based outlier factors. Empirically, L1-depth is competitive (often superior) to proximity-based and other proposed angle-based outlier factors on detecting high-dimensional outliers regarding both efficiency and accuracy.In order to avoid the quadratic computational time, we introduce a simple but efficient sampling method named SamDepth for estimating L1-depth measure. We also present theoretical analysis to guarantee the reliability of SamDepth. The empirical experiments on many real-world high-dimensional data sets demonstrate that SamDepth with $$\sqrt{n}$$ samples often achieves very competitive accuracy and runs several orders of magnitude faster than other proximity-based and ABOD competitors. Data related to this paper are available at: https://www.dropbox.com/s/nk7nqmwmdsatizs/Datasets.zip . Code related to this paper is available at: https://github.com/NinhPham/Outlier .

Ninh Pham
Beyond Outlier Detection: LookOut for Pictorial Explanation

Why is a given point in a dataset marked as an outlier by an off-the-shelf detection algorithm? Which feature(s) explain it the best? What is the best way to convince a human analyst that the point is indeed an outlier? We provide succinct, interpretable, and simple pictorial explanations of outlying behavior in multi-dimensional real-valued datasets while respecting the limited attention of human analysts. Specifically, we propose to output a few focus-plots, i.e., pairwise feature plots, from a few, carefully chosen feature sub-spaces. The proposed LookOut makes four contributions: (a) problem formulation: we introduce an “analyst-centered” problem formulation for explaining outliers via focus-plots, (b) explanation algorithm: we propose a plot-selection objective and the LookOut algorithm to approximate it with optimality guarantees, (c) generality: our explanation algorithm is both domain- and detector-agnostic, and (d) scalability: LookOut scales linearly with the size of input outliers to explain and the explanation budget. Our experiments show that LookOut performs near-ideally in terms of maximizing explanation objective on several real datasets, while producing visually interpretable and intuitive results in explaining groundtruth outliers. Code related to this paper is available at: https://github.com/NikhilGupta1997/Lookout .

Nikhil Gupta, Dhivya Eswaran, Neil Shah, Leman Akoglu, Christos Faloutsos
ConOut: Contextual Outlier Detection with Multiple Contexts: Application to Ad Fraud

Outlier detection has numerous applications in different domains. A family of techniques, called contextual outlier detectors, are based on a single, user-specified demarcation of data attributes into indicators and contexts. In this work, we propose ConOut, a new contextual outlier detection technique that leverages multiple contexts that are automatically identified. Importantly, ConOut is a one-click algorithm—it does not require any user-specified (hyper)parameters. Through experiments on various real-world data sets, we show that ConOut outperforms existing baselines in detection accuracy. Further, we motivate and apply ConOut to the advertisement domain to identify fraudulent publishers, where ConOut not only improves detection but also provides statistically significant revenue gains to advertisers: a minimum of 57% compared to a naïve fraud detector; and $$\sim $$ 20% in revenue gains as well as $$\sim $$ 34% in mean average precision compared to its nearest competitor. Code related to this paper is available at: https://github.com/meghanathmacha/ConOut , https://cmuconout.github.io/ .

M. Y. Meghanath, Deepak Pai, Leman Akoglu
Scalable and Interpretable One-Class SVMs with Deep Learning and Random Fourier Features

One-class support vector machine (OC-SVM) for a long time has been one of the most effective anomaly detection methods and extensively adopted in both research as well as industrial applications. The biggest issue for OC-SVM is yet the capability to operate with large and high-dimensional datasets due to optimization complexity. Those problems might be mitigated via dimensionality reduction techniques such as manifold learning or autoencoder. However, previous work often treats representation learning and anomaly prediction separately. In this paper, we propose autoencoder based one-class support vector machine (AE-1SVM) that brings OC-SVM, with the aid of random Fourier features to approximate the radial basis kernel, into deep learning context by combining it with a representation learning architecture and jointly exploit stochastic gradient descent to obtain end-to-end training. Interestingly, this also opens up the possible use of gradient-based attribution methods to explain the decision making for anomaly detection, which has ever been challenging as a result of the implicit mappings between the input space and the kernel space. To the best of our knowledge, this is the first work to study the interpretability of deep learning in anomaly detection. We evaluate our method on a wide range of unsupervised anomaly detection tasks in which our end-to-end training architecture achieves a performance significantly better than the previous work using separate training. Code related to this paper is available at: https://github.com/minh-nghia/AE-1SVM .

Minh-Nghia Nguyen, Ngo Anh Vien
Group Anomaly Detection Using Deep Generative Models

Unlike conventional anomaly detection research that focuses on point anomalies, our goal is to detect anomalous collections of individual data points. In particular, we perform group anomaly detection (GAD) with an emphasis on irregular group distributions (e.g. irregular mixtures of image pixels). GAD is an important task in detecting unusual and anomalous phenomena in real-world applications such as high energy particle physics, social media and medical imaging. In this paper, we take a generative approach by proposing deep generative models: Adversarial autoencoder (AAE) and variational autoencoder (VAE) for group anomaly detection. Both AAE and VAE detect group anomalies using point-wise input data where group memberships are known a priori. We conduct extensive experiments to evaluate our models on real world datasets. The empirical results demonstrate that our approach is effective and robust in detecting group anomalies. Code related to this paper is available at: https://github.com/raghavchalapathy/gad , https://www.cs.cmu.edu/~lxiong/gad/gad.html , https://github.com/jorjasso/SMDD-group-anomaly-detection , https://github.com/cjlin1/libsvm .

Raghavendra Chalapathy, Edward Toth, Sanjay Chawla

Applications

Frontmatter
Detecting Autism by Analyzing a Simulated Social Interaction

Diagnosing autism spectrum conditions takes several hours by well-trained practitioners; therefore, standardized questionnaires are widely used for first-level screening. Questionnaires as a diagnostic tool, however, rely on self-reflection—which is typically impaired in individuals with autism spectrum condition. We develop an alternative screening mechanism in which subjects engage in a simulated social interaction. During this interaction, the subjects’ voice, eye gaze, and facial expression are tracked, and features are extracted that serve as input to a predictive model. We find that a random-forest classifier on these features can detect autism spectrum condition accurately and functionally independently of diagnostic questionnaires. We also find that a regression model estimates the severity of the condition more accurately than the reference screening method.

Hanna Drimalla, Niels Landwehr, Irina Baskow, Behnoush Behnia, Stefan Roepke, Isabel Dziobek, Tobias Scheffer
A Discriminative Model for Identifying Readers and Assessing Text Comprehension from Eye Movements

We study the problem of inferring readers’ identities and estimating their level of text comprehension from observations of their eye movements during reading. We develop a generative model of individual gaze patterns (scanpaths) that makes use of lexical features of the fixated words. Using this generative model, we derive a Fisher-score representation of eye-movement sequences. We study whether a Fisher-SVM with this Fisher kernel and several reference methods are able to identify readers and estimate their level of text comprehension based on eye-tracking data. While none of the methods are able to estimate text comprehension accurately, we find that the SVM with Fisher kernel excels at identifying readers.

Silvia Makowski, Lena A. Jäger, Ahmed Abdelwahab, Niels Landwehr, Tobias Scheffer
Face-Cap: Image Captioning Using Facial Expression Analysis

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these, we use facial expression features extracted from images including human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. Using all standard evaluation metrics, our Face-Cap models outperform a state-of-the-art baseline model for generating image captions when applied to an image caption dataset extracted from the standard Flickr 30 K dataset, consisting of around 11 K images containing faces. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions. Code related to this paper is available at: https://github.com/omidmn/Face-Cap .

Omid Mohamad Nezami, Mark Dras, Peter Anderson, Len Hamey
Pedestrian Trajectory Prediction with Structured Memory Hierarchies

This paper presents a novel framework for human trajectory prediction based on multimodal data (video and radar). Motivated by recent neuroscience discoveries, we propose incorporating a structured memory component in the human trajectory prediction pipeline to capture historical information to improve performance. We introduce structured LSTM cells for modelling the memory content hierarchically, preserving the spatiotemporal structure of the information and enabling us to capture both short-term and long-term context. We demonstrate how this architecture can be extended to integrate salient information from multiple modalities to automatically store and retrieve important information for decision making without any supervision. We evaluate the effectiveness of the proposed models on a novel multimodal dataset that we introduce, consisting of 40,000 pedestrian trajectories, acquired jointly from a radar system and a CCTV camera system installed in a public place. The performance is also evaluated on the publicly available New York Grand Central pedestrian database. In both settings, the proposed models demonstrate their capability to better anticipate future pedestrian motion compared to existing state of the art. Data related to this paper are available at: https://github.com/qutsaivt/SAIVTMultiSpectralTrajectoryDataset .

Tharindu Fernando, Simon Denman, Sridha Sridharan, Clinton Fookes

Classification

Frontmatter
Multiple Instance Learning with Bag-Level Randomized Trees

Knowledge discovery in databases with a flexible structure poses a great challenge to machine learning community. Multiple Instance Learning (MIL) aims at learning from samples (called bags) represented by multiple feature vectors (called instances) as opposed to single feature vectors characteristic for the traditional data representation. This relaxation turns out to be useful in formulating many machine learning problems including classification of molecules, cancer detection from tissue images or identification of malicious network communications. However, despite the recent progress in this area, the current set of MIL tools still seems to be very application specific and/or burdened with many tuning parameters or processing steps. In this paper, we propose a simple, yet effective tree-based algorithm for solving MIL classification problems. Empirical evaluation against 28 classifiers on 29 publicly available benchmark datasets shows a high level performance of the proposed solution even with its default parameter settings. Data related to this paper are available at: https://github.com/komartom/MIDatasets.jl . Code related to this paper is available at: https://github.com/komartom/BLRT.jl .

Tomáš Komárek, Petr Somol
One-Class Quantification

This paper proposes one-class quantification, a new Machine Learning task. Quantification estimates the class distribution of an unlabeled sample of instances. Similarly to one-class classification, we assume that only a sample of examples of a single class is available for learning, and we are interested in counting the cases of such class in a test set. We formulate, for the first time, one-class quantification methods and assess them in a comprehensible open-set evaluation. In an open-set problem, several “subclasses” represent the negative class, and we cannot assume to have enough observations for all of them at training time. Therefore, new classes may appear after deployment, making this a challenging setup for existing quantification methods. We show that our proposals are simple and more accurate than the state-of-the-art in quantification. Finally, the approaches are very efficient, fitting batch and stream applications. Code related to this paper is available at: https://github.com/denismr/One-class-Quantification .

Denis dos Reis, André Maletzke, Everton Cherman, Gustavo Batista
Deep F-Measure Maximization in Multi-label Classification: A Comparative Study

In recent years several novel algorithms have been developed for maximizing the instance-wise F $$_\beta $$ -measure in multi-label classification problems. However, so far, such algorithms have only been tested in tandem with shallow base learners. In the deep learning landscape, usually simple thresholding approaches are implemented, even though it is expected that such approaches are suboptimal. In this article we introduce extensions of utility maximization and decision-theoretic methods that can optimize the F $$_\beta $$ -measure with (convolutional) neural networks. We discuss pros and cons of the different methods and we present experimental results on several image classification datasets. The results illustrate that decision-theoretic inference algorithms are worth the investment. While being more difficult to implement compared to thresholding strategies, they lead to a better predictive performance. Overall, a decision-theoretic inference algorithm based on proportional odds models outperforms the other methods. Code related to this paper is available at: https://github.com/sdcubber/f-measure .

Stijn Decubber, Thomas Mortier, Krzysztof Dembczyński, Willem Waegeman
Ordinal Label Proportions

In Machine Learning, it is common to distinguish different degrees of supervision, ranging from fully supervised to completely unsupervised scenarios. However, lying in between those, the Learning from Label Proportions (LLP) setting [19] assumes the training data is provided in the form of bags, and the only supervision comes through the proportion of each class in each bag. In this paper, we present a novel version of the LLP paradigm where the relationship among the classes is ordinal. While this is a highly relevant scenario (e.g. customer surveys where the results can be divided into various degrees of satisfaction), it is as yet unexplored in the literature. We refer to this setting as Ordinal Label Proportions (OLP). We formally define the scenario and introduce an efficient algorithm to tackle it. We test our algorithm on synthetic and benchmark datasets. Additionally, we present a case study examining a dataset gathered from the Research Excellence Framework that assesses the quality of research in the United Kingdom.

Rafael Poyiadzi, Raúl Santos-Rodríguez, Tijl De Bie
AWX: An Integrated Approach to Hierarchical-Multilabel Classification

The recent outbreak of works on artificial neural networks (ANNs) has reshaped the machine learning scenario. Despite the vast literature, there is still a lack of methods able to tackle the hierarchical multilabel classification (HMC) task exploiting entirely ANNs. Here we propose AWX, a novel approach that aims to fill this gap. AWX is a versatile component that can be used as output layer of any ANN, whenever a fixed structured output is required, as in the case of HMC. AWX exploits the prior knowledge on the output domain embedding the hierarchical structure directly in the network topology. The information flows from the leaf terms to the inner ones allowing a jointly optimization of the predictions. Different options to combine the signals received from the leaves are proposed and discussed. Moreover, we propose a generalization of the true path rule to the continuous domain and we demonstrate that AWX’s predictions are guaranteed to be consistent with respect to it. Finally, the proposed method is evaluated on 10 benchmark datasets and shows a significant increase in the performance over plain ANN, HMC-LMLP, and the state-of-the-art method CLUS-HMC. Code related to this paper is available at: https://github.com/lucamasera/AWX .

Luca Masera, Enrico Blanzieri

Clustering and Unsupervised Learning

Frontmatter
Clustering in the Presence of Concept Drift

Clustering naturally addresses many of the challenges of data streams and many data stream clustering algorithms (DSCAs) have been proposed. The literature does not, however, provide quantitative descriptions of how these algorithms behave in different circumstances. In this paper we study how the clusterings produced by different DSCAs change, relative to the ground truth, as quantitatively different types of concept drift are encountered. This paper makes two contributions to the literature. First, we propose a method for generating real-valued data streams with precise quantitative concept drift. Second, we conduct an experimental study to provide quantitative analyses of DSCA performance with synthetic real-valued data streams and show how to apply this knowledge to real world data streams. We find that large magnitude and short duration concept drifts are most challenging and that DSCAs with partitioning-based offline clustering methods are generally more robust than those with density-based offline clustering methods. Our results further indicate that increasing the number of classes present in a stream is a more challenging environment than decreasing the number of classes. Code related to this paper is available at: https://doi.org/10.5281/zenodo.1168699 , https://doi.org/10.5281/zenodo.1216189 , https://doi.org/10.5281/zenodo.1213802 , https://doi.org/10.5281/zenodo.1304380 .

Richard Hugh Moulton, Herna L. Viktor, Nathalie Japkowicz, João Gama
Time Warp Invariant Dictionary Learning for Time Series Clustering: Application to Music Data Stream Analysis

This work proposes a time warp invariant sparse coding and dictionary learning framework for time series clustering, where both input samples and atoms define time series of different lengths that involve variable delays. For that, first an $$l_0$$ sparse coding problem is formalised and a time warp invariant orthogonal matching pursuit based on a new cosine maximisation time warp operator is proposed. A dictionary learning under time warp is then formalised and a gradient descent solution is developed. Lastly, a time series clustering based on the time warp sparse coding and dictionary learning is presented. The proposed approach is evaluated and compared to major alternative methods on several public datasets, with an application to deezer music data stream clustering. Data related to this paper are available at: The link to the data and the evaluating algorithms are provided in the paper. Code related to this paper is available at: The link will be provided at the first author personal website ( http://ama.liglab.fr/~varasteh/ ).

Saeed Varasteh Yazdi, Ahlame Douzal-Chouakria, Patrick Gallinari, Manuel Moussallam
How Your Supporters and Opponents Define Your Interestingness

How can one determine whether a data mining method extracts interesting patterns? The paper deals with this core question in the context of unsupervised problems with binary data. We formalize the quality of a data mining method by identifying patterns – the supporters and opponents – which are related to a pattern extracted by a method. We define a typology offering a global picture of the methods based on two complementary criteria to evaluate and interpret their interests. The quality of a data mining method is quantified via an evaluation complexity analysis based on the number of supporters and opponents of a pattern extracted by the method. We provide an experimental study on the evaluation of the quality of the methods.

Bruno Crémilleux, Arnaud Giacometti, Arnaud Soulet

Deep Learning

Frontmatter
Efficient Decentralized Deep Learning by Dynamic Model Averaging

We propose an efficient protocol for decentralized training of deep neural networks from distributed data sources. The proposed protocol allows to handle different phases of model training equally well and to quickly adapt to concept drifts. This leads to a reduction of communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. Moreover, we derive a communication bound that scales well with the hardness of the serialized learning problem. The reduction in communication comes at almost no cost, as the predictive performance remains virtually unchanged. Indeed, the proposed protocol retains loss bounds of periodically averaging schemes. An extensive empirical evaluation validates major improvement of the trade-off between model performance and communication which could be beneficial for numerous decentralized learning applications, such as autonomous driving, or voice recognition and image classification on mobile phones. Code related to this paper is available at: https://bitbucket.org/Michael_Kamp/decentralized-machine-learning .

Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, Stefan Wrobel
Using Supervised Pretraining to Improve Generalization of Neural Networks on Binary Classification Problems

Neural networks are known to be very sensitive to the initial weights. There has been a lot of research on initialization that aims to stabilize the training process. However, very little research has studied the relationship between initialization and generalization. We demonstrate that poorly initialized model will lead to lower test accuracy. We propose a supervised pretraining technique that helps improve generalization on binary classification problems. The experimental results on four UCI datasets show that the proposed pretraining leads to higher test accuracy compared to the he_normal initialization when the training set is small. In further experiments on synthetic data, the improvement on test accuracy using the proposed pretraining reaches more than 30% when the data has high dimensionality and noisy features. Code related to this paper is available at: https://github.com/superRookie007/supervised_pretraining .

Alex Yuxuan Peng, Yun Sing Koh, Patricia Riddle, Bernhard Pfahringer
Towards Efficient Forward Propagation on Resource-Constrained Systems

In this work we present key elements of DeepChip, a framework that bridges recent trends in machine learning with applicable forward propagation on resource-constrained devices. Main objective of this work is to reduce compute and memory requirements by removing redundancy from neural networks. DeepChip features a flexible quantizer to reduce the bit width of activations to 8-bit fixed-point and weights to an asymmetric ternary representation. In combination with novel algorithms and data compression we leverage reduced precision and sparsity for efficient forward propagation on a wide range of processor architectures. We validate our approach on a set of different convolutional neural networks and datasets: ConvNet on SVHN, ResNet-44 on CIFAR10 and AlexNet on ImageNet. Compared to single-precision floating point, memory requirements can be compressed by a factor of 43, 22 and 10 and computations accelerated by a factor of 5.2, 2.8 and 2.0 on a mobile processor without a loss in classification accuracy. DeepChip allows trading accuracy for efficiency, and for instance tolerating about 2% loss in classification accuracy further reduces memory requirements by a factor of 88, 29 and 13, and speeds up computations by a factor of 6.0, 4.3 and 5.0. Code related to this paper is available at: https://github.com/UniHD-CEG/ECML2018 .

Günther Schindler, Matthias Zöhrer, Franz Pernkopf, Holger Fröning
Auxiliary Guided Autoregressive Variational Autoencoders

Generative modeling of high-dimensional data is a key problem in machine learning. Successful approaches include latent variable models and autoregressive models. The complementary strengths of these approaches, to model global and local image statistics respectively, suggest hybrid models that encode global image structure into latent variables while autoregressively modeling low level detail. Previous approaches to such hybrid models restrict the capacity of the autoregressive decoder to prevent degenerate models that ignore the latent variables and only rely on autoregressive modeling. Our contribution is a training procedure relying on an auxiliary loss function that controls which information is captured by the latent variables and what is left to the autoregressive decoder. Our approach can leverage arbitrarily powerful autoregressive decoders, achieves state-of-the art quantitative performance among models with latent variables, and generates qualitatively convincing samples.

Thomas Lucas, Jakob Verbeek
Cooperative Multi-agent Policy Gradient

Reinforcement Learning (RL) for decentralized partially observable Markov decision processes (Dec-POMDPs) is lagging behind the spectacular breakthroughs of single-agent RL. That is because assumptions that hold in single-agent settings are often obsolete in decentralized multi-agent systems. To tackle this issue, we investigate the foundations of policy gradient methods within the centralized training for decentralized control (CTDC) paradigm. In this paradigm, learning can be accomplished in a centralized manner while execution can still be independent. Using this insight, we establish policy gradient theorem and compatible function approximations for decentralized multi-agent systems. Resulting actor-critic methods preserve the decentralized control at the execution phase, but can also estimate the policy gradient from collective experiences guided by a centralized critic at the training phase. Experiments demonstrate our policy gradient methods compare favorably against standard RL techniques in benchmarks from the literature. Code related to this paper is available at: https://gitlab.inria.fr/gbono/coop-ma-pg .

Guillaume Bono, Jilles Steeve Dibangoye, Laëtitia Matignon, Florian Pereyron, Olivier Simonin
Parametric t-Distributed Stochastic Exemplar-Centered Embedding

Parametric embedding methods such as parametric t-distributed Stochastic Neighbor Embedding (pt-SNE) enables out-of-sample data visualization without further computationally expensive optimization or approximation. However, pt-SNE favors small mini-batches to train a deep neural network but large mini-batches to approximate its cost function involving all pairwise data point comparisons, and thus has difficulty in finding a balance. To resolve the conflicts, we present parametric t-distributed stochastic exemplar-centered embedding. Our strategy learns embedding parameters by comparing training data only with precomputed exemplars to indirectly preserve local neighborhoods, resulting in a cost function with significantly reduced computational and memory complexity. Moreover, we propose a shallow embedding network with high-order feature interactions for data visualization, which is much easier to tune but produces comparable performance in contrast to a deep feedforward neural network employed by pt-SNE. We empirically demonstrate, using several benchmark datasets, that our proposed method significantly outperforms pt-SNE in terms of robustness, visual effects, and quantitative evaluations.

Martin Renqiang Min, Hongyu Guo, Dinghan Shen
Joint Autoencoders: A Flexible Meta-learning Framework

We develop a framework for learning multiple tasks simultaneously, based on sharing features that are common to all tasks, achieved through the use of a modular deep feedforward neural network consisting of shared branches, dealing with the common features of all tasks, and private branches, learning the specific unique aspects of each task. Once an appropriate weight sharing architecture has been established, learning takes place through standard algorithms for feedforward networks, e.g., stochastic gradient descent and its variations. The method deals with meta-learning (such as domain adaptation, transfer and multi-task learning) in a unified fashion, and can deal with data arising from different modalities. Numerical experiments demonstrate the effectiveness of learning in domain adaptation and transfer learning setups, and provide evidence for the flexible and task-oriented representations arising in the network. In particular, we handle transfer learning between multiple tasks in a straightforward manner, as opposed to many competing state-of-the-art methods, that are unable to handle more than two tasks. We also illustrate the network’s ability to distill task-specific and shared features.

Baruch Epstein, Ron Meir, Tomer Michaeli
Privacy Preserving Synthetic Data Release Using Deep Learning

For many critical applications ranging from health care to social sciences, releasing personal data while protecting individual privacy is paramount. Over the years, data anonymization and synthetic data generation techniques have been proposed to address this challenge. Unfortunately, data anonymization approaches do not provide rigorous privacy guarantees. Although, there are existing synthetic data generation techniques that use rigorous definitions of differential privacy, to our knowledge, these techniques have not been compared extensively using different utility metrics.In this work, we provide two novel contributions. First, we compare existing techniques on different datasets using different utility metrics. Second, we present a novel approach that utilizes deep learning techniques coupled with an efficient analysis of privacy costs to generate differentially private synthetic datasets with higher data utility. We show that we can learn deep learning models that can capture relationship among multiple features, and then use these models to generate differentially private synthetic datasets. Our extensive experimental evaluation conducted on multiple datasets indicates that our proposed approach is more robust (i.e., one of the top performing technique in almost all type of data we have experimented) compared to the state-of-the art methods in terms of various data utility measures. Code related to this paper is available at: https://github.com/ncabay/synthetic_generation .

Nazmiye Ceren Abay, Yan Zhou, Murat Kantarcioglu, Bhavani Thuraisingham, Latanya Sweeney
On Finer Control of Information Flow in LSTMs

Since its inception in 1995, the Long Short-Term Memory (LSTM) architecture for recurrent neural networks has shown promising performance, sometimes state-of-art, for various tasks. Aiming at achieving constant error flow through hidden units, LSTM introduces a complex unit called a memory cell, in which gates are adopted to control the exposure/isolation of information flowing in, out and back to itself. Despite its widely acknowledged success, in this paper, we propose a hypothesis that LSTMs may suffer from an implicit functional binding of information exposure/isolation for the output and candidate computation, i.e., the output gate at time $$t-1$$ is not only in charge of the information flowing out of a cell as the response to the external environment, but also controls the information flowing back to the cell for the candidate computation, which is often the only source of nonlinear combination of input at time t and previous cell state at time $$t-1$$ for cell memory updates. We propose Untied Long Short Term Memory (ULSTM) as a solution to the above problem. We test our model on various tasks, including semantic relatedness prediction, language modeling and sentiment classification. Experimental results indicate that our proposed model is capable to at least partially solve the problem and outperform LSTM for all these tasks. Code related to this paper is available at: https://github.com/HangGao/ULSTM.git .

Hang Gao, Tim Oates
MaxGain: Regularisation of Neural Networks by Constraining Activation Magnitudes

Effective regularisation of neural networks is essential to combat overfitting due to the large number of parameters involved. We present an empirical analogue to the Lipschitz constant of a feed-forward neural network, which we refer to as the maximum gain. We hypothesise that constraining the gain of a network will have a regularising effect, similar to how constraining the Lipschitz constant of a network has been shown to improve generalisation. A simple algorithm is provided that involves rescaling the weight matrix of each layer after each parameter update. We conduct a series of studies on common benchmark datasets, and also a novel dataset that we introduce to enable easier significance testing for experiments using convolutional networks. Performance on these datasets compares favourably with other common regularisation techniques. Data related to this paper is available at: https://www.cs.waikato.ac.nz/~ml/sins10/ .

Henry Gouk, Bernhard Pfahringer, Eibe Frank, Michael J. Cree
Ontology Alignment Based on Word Embedding and Random Forest Classification

Ontology alignment is crucial for integrating heterogeneous data sources and forms an important component of the semantic web. Accordingly, several ontology alignment techniques have been proposed and used for discovering correspondences between the concepts (or entities) of different ontologies. Most alignment techniques depend on string-based similarities which are unable to handle the vocabulary mismatch problem. Also, determining which similarity measures to use and how to effectively combine them in alignment systems are challenges that have persisted in this area. In this work, we introduce a random forest classifier approach for ontology alignment which relies on word embedding for determining a variety of semantic similarity features between concepts. Specifically, we combine string-based and semantic similarity measures to form feature vectors that are used by the classifier model to determine when concepts align. By harnessing background knowledge and relying on minimal information from the ontologies, our approach can handle knowledge-light ontological resources. It also eliminates the need for learning the aggregation weights of a composition of similarity measures. Experiments using Ontology Alignment Evaluation Initiative (OAEI) dataset and real-world ontologies highlight the utility of our approach and show that it can outperform state-of-the-art alignment systems. Code related to this paper is available at: https://bitbucket.org/paravariar/rafcom .

Ikechukwu Nkisi-Orji, Nirmalie Wiratunga, Stewart Massie, Kit-Ying Hui, Rachel Heaven
Domain Adaption in One-Shot Learning

Recent advances in deep learning lead to breakthroughs in many machine learning tasks. Due to the data-dri ven nature of deep learning, the training procedure often requires large amounts of manually annotated data, which is often unavailable. One-shot learning aims to categorize the new classes unseen in the training set, given only one example of each new class. Can we transfer knowledge learned by one-shot learning from one domain to another? In this paper, we formulate the problem of domain adaption in one-shot image classification, where the training data and test data come from similar but different distributions. We propose a domain adaption framework based on adversarial networks. This framework is generalized for situations where the source and target domain have different labels. We use a policy network, inspired by human learning behaviors, to effectively select samples from the source domain in the training process. This sampling strategy can further improve the domain adaption performance. We investigate our approach in one-shot image classification tasks on different settings and achieve better results than previous methods. Code related to this paper is available at: https://github.com/NanqingD/DAOSL .

Nanqing Dong, Eric P. Xing

Ensemble Methods

Frontmatter
Axiomatic Characterization of AdaBoost and the Multiplicative Weight Update Procedure

AdaBoost was introduced for binary classification tasks by Freund and Schapire in 1995. Ever since its publication, numerous results have been produced, which revealed surprising links between AdaBoost and related fields, such as information geometry, game theory, and convex optimization. This remarkably comprehensive set of connections suggests that adaBoost is a unique approach that may, in fact, arise out of axiomatic principles. In this paper, we prove that this is indeed the case. We show that three natural axioms on adaptive re-weighting and combining algorithms, also called arcing, suffice to construct adaBoost and, more generally, the multiplicative weight update procedure as the unique family of algorithms that meet those axioms. Informally speaking, our three axioms only require that the arcing algorithm satisfies some elementary notions of additivity, objectivity, and utility. We prove that any method that satisfies these axioms must be minimizing the composition of an exponential loss with an additive function, and that the weights must be updated according to the multiplicative weight update procedure. This conclusion holds in the general setting of learning, which encompasses regression, classification, ranking, and clustering.

Ibrahim Alabdulmohsin
Modular Dimensionality Reduction

We introduce an approach to modular dimensionality reduction, allowing efficient learning of multiple complementary representations of the same object. Modules are trained by optimising an unsupervised cost function which balances two competing goals: Maintaining the inner product structure within the original space, and encouraging structural diversity between complementary representations. We derive an efficient learning algorithm which outperforms gradient based approaches without the need to choose a learning rate. We also demonstrate an intriguing connection with Dropout. Empirical results demonstrate the efficacy of the method for image retrieval and classification.

Henry W. J. Reeve, Tingting Mu, Gavin Brown
Constructive Aggregation and Its Application to Forecasting with Dynamic Ensembles

While the predictive advantage of ensemble methods is nowadays widely accepted, the most appropriate way of estimating the weights of each individual model remains an open research question. Meanwhile, several studies report that combining different ensemble approaches leads to improvements in performance, due to a better trade-off between the diversity and the error of the individual models in the ensemble. We contribute to this research line by proposing an aggregation framework for a set of independently created forecasting models, i.e. heterogeneous ensembles. The general idea is to, instead of directly aggregating these models, first rearrange them into different subsets, creating a new set of combined models which is then aggregated into a final decision. We present this idea as constructive aggregation, and apply it to time series forecasting problems. Results from empirical experiments show that applying constructive aggregation to state of the art dynamic aggregation methods provides a consistent advantage. Constructive aggregation is publicly available in a software package. Data related to this paper are available at: https://github.com/vcerqueira/timeseriesdata . Code related to this paper is available at: https://github.com/vcerqueira/tsensembler .

Vitor Cerqueira, Fabio Pinto, Luis Torgo, Carlos Soares, Nuno Moniz
MetaBags: Bagged Meta-Decision Trees for Regression

Methods for learning heterogeneous regression ensembles have not yet been proposed on a large scale. Hitherto, in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel stacking framework for regression. MetaBags learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. Finally, these predictions are aggregated into a single prediction through a bagging procedure at meta-level. MetaBags is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on open, synthetic and real-world application datasets. The obtained results show that our method outperforms existing state-of-the-art approaches.

Jihed Khiari, Luis Moreira-Matias, Ammar Shaker, Bernard Ženko, Sašo Džeroski

Evaluation

Frontmatter
Visualizing the Feature Importance for Black Box Models

In recent years, a large amount of model-agnostic methods to improve the transparency, trustability, and interpretability of machine learning models have been developed. Based on a recent method for model-agnostic global feature importance, we introduce a local feature importance measure for individual observations and propose two visual tools: partial importance (PI) and individual conditional importance (ICI) plots which visualize how changes in a feature affect the model performance on average, as well as for individual observations. Our proposed methods are related to partial dependence (PD) and individual conditional expectation (ICE) plots, but visualize the expected (conditional) feature importance instead of the expected (conditional) prediction. Furthermore, we show that averaging ICI curves across observations yields a PI curve, and integrating the PI curve with respect to the distribution of the considered feature results in the global feature importance. Another contribution of our paper is the Shapley feature importance, which fairly distributes the overall performance of a model among the features according to the marginal contributions and which can be used to compare the feature importance across different models. Code related to this paper is available at: https://github.com/giuseppec/featureImportance .

Giuseppe Casalicchio, Christoph Molnar, Bernd Bischl
Efficient Estimation of AUC in a Sliding Window

In many applications, monitoring area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant.In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length k. More specifically, we propose an algorithm that, given $$\epsilon $$ , estimates AUC within $$\epsilon / 2$$ , and can maintain this estimate in $$ \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) $$ time, per update, as the window slides. This provides a speed-up over the exact computation of AUC, which requires $$ \mathcal {O} \mathopen {}\left( k\right) $$ time, per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough, so that the error stays small, (ii) the number of groups is small, so that enumerating them is not expensive, and (iii) the definition is flexible enough so that we can maintain the groups efficiently.Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee $$\epsilon / 2$$ , and that we can achieve significant speed-ups with only a modest sacrifice in accuracy. Code related to this paper is available at: https://bitbucket.org/orlyanalytics/streamauc .

Nikolaj Tatti
Controlling and Visualizing the Precision-Recall Tradeoff for External Performance Indices

In many machine learning problems, the performance of the results is measured by indices that often combine precision and recall. In this paper, we study the behavior of such indices in function of the tradeoff precision-recall. We present a new tool of performance visualization and analysis referred to the tradeoff space, which plots the performance index in function of the precision-recall tradeoff. We analyse the properties of this new space and show its advantages over the precision-recall space. Code related to this paper is available at: https://sites.google.com/site/bhanczarhomepage/prerec .

Blaise Hanczar, Mohamed Nadif
Evaluation Procedures for Forecasting with Spatio-Temporal Data

The amount of available spatio-temporal data has been increasing as large-scale data collection (e.g., from geosensor networks) becomes more prevalent. This has led to an increase in spatio-temporal forecasting applications using geo-referenced time series data motivated by important domains such as environmental monitoring (e.g., air pollution index, forest fire risk prediction). Being able to properly assess the performance of new forecasting approaches is fundamental to achieve progress. However, the dependence between observations that the spatio-temporal context implies, besides being challenging in the modelling step, also raises issues for performance estimation as indicated by previous work. In this paper, we empirically compare several variants of cross-validation (CV) and out-of-sample (OOS) performance estimation procedures that respect data ordering, using both artificially generated and real-world spatio-temporal data sets. Our results show both CV and OOS reporting useful estimates. Further, they suggest that blocking may be useful in addressing CV’s bias to underestimate error. OOS can be very sensitive to test size, as expected, but estimates can be improved by careful management of the temporal dimension in training. Code related to this paper is available at: https://github.com/mrfoliveira/Evaluation-procedures-for-forecasting-with-spatio-temporal-data .

Mariana Oliveira, Luís Torgo, Vítor Santos Costa
A Blended Metric for Multi-label Optimisation and Evaluation

In multi-label classification, a large number of evaluation metrics exist, for example Hamming loss, exact match, and Jaccard similarity – but there are many more. In fact, there remains an apparent uncertainty in the multi-label literature about which metrics should be considered and when and how to optimise them. This has given rise to a proliferation of metrics, with some papers carrying out empirical evaluations under 10 or more different metrics in order to analyse method performance. We argue that further understanding of underlying mechanisms is necessary. In this paper we tackle the challenge of having a clearer view of evaluation strategies. We present a blended loss function. This function allows us to evaluate under the properties of several major loss functions with a single parameterisation. Furthermore we demonstrate the successful use of this metric as a surrogate loss for other metrics. We offer experimental investigation and theoretical backing to demonstrate that optimising this surrogate loss offers best results for several different metrics than optimising the metrics directly. It simplifies and provides insight to the task of evaluating multi-label prediction methodologies. Data related to this paper are available at: http://mulan.sourceforge.net/datasets-mlc.html , https://sourceforge.net/projects/meka/files/Datasets/ , http://www.ces.clemson.edu/~ahoover/stare/ .

Laurence A. F. Park, Jesse Read
Backmatter
Metadaten
Titel
Machine Learning and Knowledge Discovery in Databases
herausgegeben von
Michele Berlingerio
Francesco Bonchi
Thomas Gärtner
Neil Hurley
Georgiana Ifrim
Copyright-Jahr
2019
Electronic ISBN
978-3-030-10925-7
Print ISBN
978-3-030-10924-0
DOI
https://doi.org/10.1007/978-3-030-10925-7