Skip to main content
Top

2021 | Book

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part III

Editors: Frank Hutter, Prof. Dr. Kristian Kersting, Dr. Jefrey Lijffijt, Isabel Valera

Publisher: Springer International Publishing

Book Series : Lecture Notes in Computer Science

insite
SEARCH

About this book

The 5-volume proceedings, LNAI 12457 until 12461 constitutes the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2020, which was held during September 14-18, 2020. The conference was planned to take place in Ghent, Belgium, but had to change to an online format due to the COVID-19 pandemic.

The 232 full papers and 10 demo papers presented in this volume were carefully reviewed and selected for inclusion in the proceedings.

The volumes are organized in topical sections as follows:

Part I: Pattern Mining; clustering; privacy and fairness; (social) network analysis and computational social science; dimensionality reduction and autoencoders; domain adaptation; sketching, sampling, and binary projections; graphical models and causality; (spatio-) temporal data and recurrent neural networks; collaborative filtering and matrix completion.

Part II: deep learning optimization and theory; active learning; adversarial learning; federated learning; Kernel methods and online learning; partial label learning; reinforcement learning; transfer and multi-task learning; Bayesian optimization and few-shot learning.

Part III: Combinatorial optimization; large-scale optimization and differential privacy; boosting and ensemble methods; Bayesian methods; architecture of neural networks; graph neural networks; Gaussian processes; computer vision and image processing; natural language processing; bioinformatics.

Part IV: applied data science: recommendation; applied data science: anomaly detection; applied data science: Web mining; applied data science: transportation; applied data science: activity recognition; applied data science: hardware and manufacturing; applied data science: spatiotemporal data.

Part V: applied data science: social good; applied data science: healthcare; applied data science: e-commerce and finance; applied data science: computational social science; applied data science: sports; demo track.

Table of Contents

Frontmatter

Combinatorial Optimization

Frontmatter
Algorithms for Optimizing the Ratio of Monotone k-Submodular Functions

We study a new optimization problem that minimizes the ratio of two monotone k-submodular functions. The problem has applications in sensor placement, influence maximization, and feature selection among many others where one wishes to make a tradeoff between two objectives, measured as a ratio of two functions (e.g., solution cost vs. quality). We develop three greedy based algorithms for the problem, with approximation ratios that depend on the curvatures and/or the values of the functions. We apply our algorithms to a sensor placement problem where one aims to install k types of sensors, while minimizing the ratio between cost and uncertainty of sensor measurements, as well as to an influence maximization problem where one seeks to advertise k products to minimize the ratio between advertisement cost and expected number of influenced users. Our experimental results demonstrate the effectiveness of minimizing the respective ratios and the runtime efficiency of our algorithms. Finally, we discuss various extensions of our problems.

Hau Chan, Grigorios Loukides, Zhenghui Su
Mining Dense Subgraphs with Similar Edges

When searching for interesting structures in graphs, it is often important to take into account not only the graph connectivity, but also the metadata available, such as node and edge labels, or temporal information. In this paper we are interested in settings where such metadata is used to define a similarity between edges. We consider the problem of finding subgraphs that are dense and whose edges are similar to each other with respect to a given similarity function. Depending on the application, this function can be, for example, the Jaccard similarity between the edge label sets, or the temporal correlation of the edge occurrences in a temporal graph.We formulate a Lagrangian relaxation-based optimization problem to search for dense subgraphs with high pairwise edge similarity. We design a novel algorithm to solve the problem through parametric min-cut [15, 17], and provide an efficient search scheme to iterate through the values of the Lagrangian multipliers. Our study is complemented by an evaluation on real-world datasets, which demonstrates the usefulness and efficiency of the proposed approach.

Polina Rozenshtein, Giulia Preti, Aristides Gionis, Yannis Velegrakis
Towards Description of Block Model on Graph

Existing block modeling methods can detect communities as blocks. However it remains a challenge to easily explain to a human why nodes belong to the same block. Such a description is very useful for answering why people in the same community tend to interact cohesively. In this paper we explore a novel problem: Given a block model already found, describe the blocks using an auxiliary set of information. We formulate a combinatorial optimization problem which finds a unique disjunction of the auxiliary information shared by the nodes either in the same block or between a pair of different blocks. The former terms intra-block description, the latter inter-block description. Given an undirected graph and its $$k-$$ k - block model, our method generates $$k + \frac{k(k-1)}{2}$$ k + k ( k - 1 ) 2 different descriptions. If the tags are descriptors of events occurring at the vertices, our descriptions can be interpreted as common events occurring within blocks and between blocks. We show that this problem is intractable even for simple cases, e.g., when the underlying graph is a tree with just two blocks. However, simple and efficient ILP formulations and algorithms exist for its relaxation and yield insights different from a state-of-the-art related work in unsupervised description. We empirically show the power of our work on multiple real-world large datasets.

Zilong Bai, S. S. Ravi, Ian Davidson

Large-Scale Optimization and Differential Privacy

Frontmatter
Orthant Based Proximal Stochastic Gradient Method for -Regularized Optimization

Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method – Orthant Based Proximal Stochastic Gradient Method (OBProx-SG) – to solve perhaps the most popular instance, i.e., the $$\ell _1$$ ℓ 1 -regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. Compared to the state-of-the-art methods, e.g., Prox-SG, RDA and Prox-SVRG, the OBProx-SG not only converges comparably in both convex and non-convex scenarios, but also promotes the sparsity of the solutions substantially. Particularly, on a large number of convex problems, OBProx-SG outperforms the existing methods comprehensively in the aspect of sparsity exploration and objective values. Moreover, the experiments on non-convex deep neural networks, e.g., MobileNetV1 and ResNet18, further demonstrate its superiority by generating the solutions of much higher sparsity without sacrificing generalization accuracy, which further implies that OBProx-SG may achieve significant memory and energy savings. The source code is available at https://github.com/tianyic/obproxsg .

Tianyi Chen, Tianyu Ding, Bo Ji, Guanyi Wang, Yixin Shi, Jing Tian, Sheng Yi, Xiao Tu, Zhihui Zhu
Efficiency of Coordinate Descent Methods for Structured Nonconvex Optimization

We propose novel coordinate descent (CD) methods for minimizing nonconvex functions comprising three terms: (i) a continuously differentiable term, (ii) a simple convex term, and (iii) a concave and continuous term. First, by extending randomized CD to nonsmooth nonconvex settings, we develop a coordinate subgradient method that randomly updates block-coordinate variables by using block composite subgradient mapping. This method converges asymptotically to critical points with proven sublinear convergence rate for certain optimality measure. Second, we develop a randomly permuted CD method with two alternating steps: linearizing the concave part and cycling through variables. We prove asymptotic convergence to critical points and establish sublinear complexity rate for objectives with both smooth and concave components. Third, we develop a randomized proximal difference-of-convex (DC) algorithm whereby we solve the subproblem inexactly by accelerated coordinate descent (ACD). Convergence is guaranteed with at most a few number of ACD iterations for each DC subproblem, and convergence complexity is established for identifying certain approximate critical points. Fourth, we further develop the third method to minimize smooth and composite weakly convex functions, and show advantages of the proposed method over gradient methods for ill-conditioned nonconvex functions, namely the weakly convex functions with high Lipschitz constant to negative curvature ratios. Finally, an empirical study on sparsity-inducing learning models demonstrates that CD methods are superior to gradient-based methods for certain large-scale problems.

Qi Deng, Chenghao Lan
Escaping Saddle Points of Empirical Risk Privately and Scalably via DP-Trust Region Method

It has been shown recently that many non-convex objective/loss functions in machine learning are known to be strict saddle. This means that finding a second-order stationary point (i.e., approximate local minimum) and thus escaping saddle points are sufficient for such functions to obtain a classifier with good generalization performance. Existing algorithms for escaping saddle points, however, all fail to take into consideration a critical issue in their designs, that is, the protection of sensitive information in the training set. Models learned by such algorithms can often implicitly memorize the details of sensitive information, and thus offer opportunities for malicious parties to infer it from the learned models. In this paper, we investigate the problem of privately escaping saddle points and finding a second-order stationary point of the empirical risk of non-convex loss function. Previous result on this problem is mainly of theoretical importance and has several issues (e.g., high sample complexity and non-scalable) which hinder its applicability, especially, in big data. To deal with these issues, we propose in this paper a new method called Differentially Private Trust Region, and show that it outputs a second-order stationary point with high probability and less sample complexity, compared to the existing one. Moreover, we also provide a stochastic version of our method (along with some theoretical guarantees) to make it faster and more scalable. Experiments on benchmark datasets suggest that our methods are indeed more efficient and practical than the previous one.

Di Wang, Jinhui Xu

Boosting and Ensemble Methods

Frontmatter
To Ensemble or Not Ensemble: When Does End-to-End Training Fail?

End-to-End training (E2E) is becoming more and more popular to train complex Deep Network architectures. An interesting question is whether this trend will continue—are there any clear failure cases for E2E training? We study this question in depth, for the specific case of E2E training an ensemble of networks. Our strategy is to blend the gradient smoothly in between two extremes: from independent training of the networks, up to to full E2E training. We find clear failure cases, where overparameterized models cannot be trained E2E. A surprising result is that the optimum can sometimes lie in between the two, neither an ensemble or an E2E system. The work also uncovers links to Dropout, and raises questions around the nature of ensemble diversity and multi-branch networks.

Andrew Webb, Charles Reynolds, Wenlin Chen, Henry Reeve, Dan Iliescu, Mikel Luján, Gavin Brown
Learning Gradient Boosted Multi-label Classification Rules

In multi-label classification, where the evaluation of predictions is less straightforward than in single-label classification, various meaningful, though different, loss functions have been proposed. Ideally, the learning algorithm should be customizable towards a specific choice of the performance measure. Modern implementations of boosting, most prominently gradient boosted decision trees, appear to be appealing from this point of view. However, they are mostly limited to single-label classification, and hence not amenable to multi-label losses unless these are label-wise decomposable. In this work, we develop a generalization of the gradient boosting framework to multi-output problems and propose an algorithm for learning multi-label classification rules that is able to minimize decomposable as well as non-decomposable loss functions. Using the well-known Hamming loss and subset 0/1 loss as representatives, we analyze the abilities and limitations of our approach on synthetic data and evaluate its predictive performance on multi-label benchmarks.

Michael Rapp, Eneldo Loza Mencía, Johannes Fürnkranz, Vu-Linh Nguyen, Eyke Hüllermeier
Landmark-Based Ensemble Learning with Random Fourier Features and Gradient Boosting

This paper jointly leverages two state-of-the-art learning stra-tegies—gradient boosting (GB) and kernel Random Fourier Features (RFF)—to address the problem of kernel learning. Our study builds on a recent result showing that one can learn a distribution over the RFF to produce a new kernel suited for the task at hand. For learning this distribution, we exploit a GB scheme expressed as ensembles of RFF weak learners, each of them being a kernel function designed to fit the residual. Unlike Multiple Kernel Learning techniques that make use of a pre-computed dictionary of kernel functions to select from, at each iteration we fit a kernel by approximating it from the training data as a weighted sum of RFF. This strategy allows one to build a classifier based on a small ensemble of learned kernel “landmarks” better suited for the underlying application. We conduct a thorough experimental analysis to highlight the advantages of our method compared to both boosting-based and kernel-learning state-of-the-art methods.

Léo Gautheron, Pascal Germain, Amaury Habrard, Guillaume Metzler, Emilie Morvant, Marc Sebban, Valentina Zantedeschi
A General Machine Learning Framework for Survival Analysis

The modeling of time-to-event data, also known as survival analysis, requires specialized methods that can deal with censoring and truncation, time-varying features and effects, and that extend to settings with multiple competing events. However, many machine learning methods for survival analysis only consider the standard setting with right-censored data and proportional hazards assumption. The methods that do provide extensions usually address at most a subset of these challenges and often require specialized software that can not be integrated into standard machine learning workflows directly. In this work, we present a very general machine learning framework for time-to-event analysis that uses a data augmentation strategy to reduce complex survival tasks to standard Poisson regression tasks. This reformulation is based on well developed statistical theory. With the proposed approach, any algorithm that can optimize a Poisson (log-)likelihood, such as gradient boosted trees, deep neural networks, model-based boosting and many more can be used in the context of time-to-event analysis. The proposed technique does not require any assumptions with respect to the distribution of event times or the functional shapes of feature and interaction effects. Based on the proposed framework we develop new methods that are competitive with specialized state of the art approaches in terms of accuracy, and versatility, but with comparatively small investments of programming effort or requirements for specialized methodological know-how.

Andreas Bender, David Rügamer, Fabian Scheipl, Bernd Bischl
Fairness by Explicability and Adversarial SHAP Learning

The ability to understand and trust the fairness of model predictions, particularly when considering the outcomes of unprivileged groups, is critical to the deployment and adoption of machine learning systems. SHAP values provide a unified framework for interpreting model predictions and feature attribution but do not address the problem of fairness directly. In this work, we propose a new definition of fairness that emphasises the role of an external auditor and model explicability. To satisfy this definition, we develop a framework for mitigating model bias using regularizations constructed from the SHAP values of an adversarial surrogate model. We focus on the binary classification task with a single unprivileged group and link our fairness explicability constraints to classical statistical fairness metrics. We demonstrate our approaches using gradient and adaptive boosting on: a synthetic dataset, the UCI Adult (Census) dataset and a real-world credit scoring dataset. The models produced were fairer and performant.

James M. Hickey, Pietro G. Di Stefano, Vlasios Vasileiou
End-to-End Learning for Prediction and Optimization with Gradient Boosting

Mathematical optimization is a fundamental tool in decision making. However, it is often difficult to obtain an accurate formulation of an optimization problem due to uncertain parameters. Machine learning frameworks are attractive to address this issue: we predict the uncertain parameters and then optimize the problem based on the prediction. Recently, end-to-end learning approaches to predict and optimize the successive problems have received attention in the field of both optimization and machine learning. In this paper, we focus on gradient boosting which is known as a powerful ensemble method, and develop the end-to-end learning algorithm with maximizing the performance on the optimization problems directly. Our algorithm extends the existing gradient-based optimization through implicit differentiation to the second-order optimization for efficiently learning gradient boosting. We also conduct computational experiments to analyze how the end-to-end approaches work well and show the effectiveness of our end-to-end approach.

Takuya Konishi, Takuro Fukunaga

Bayesian Methods

Frontmatter
Probabilistic Reconciliation of Hierarchical Forecast via Bayes’ Rule

We present a novel approach for reconciling hierarchical forecasts, based on Bayes’ rule. We define a prior distribution for the bottom time series of the hierarchy, based on the bottom base forecasts. Then we update their distribution via Bayes’ rule, based on the base forecasts for the upper time series. Under the Gaussian assumption, we derive the updating in closed-form. We derive two algorithms, which differ as for the assumed independencies. We discuss their relation with the MinT reconciliation algorithm and with the Kalman filter, and we compare them experimentally.

Giorgio Corani, Dario Azzimonti, João P. S. C. Augusto, Marco Zaffalon
Quantifying the Confidence of Anomaly Detectors in Their Example-Wise Predictions

Anomaly detection focuses on identifying examples in the data that somehow deviate from what is expected or typical. Algorithms for this task usually assign a score to each example that represents how anomalous the example is. Then, a threshold on the scores turns them into concrete predictions. However, each algorithm uses a different approach to assign the scores, which makes them difficult to interpret and can quickly erode a user’s trust in the predictions. This paper introduces an approach for assessing the reliability of any anomaly detector’s example-wise predictions. To do so, we propose a Bayesian approach for converting anomaly scores to probability estimates. This enables the anomaly detector to assign a confidence score to each prediction which captures its uncertainty in that prediction. We theoretically analyze the convergence behaviour of our confidence estimate. Empirically, we demonstrate the effectiveness of the framework in quantifying a detector’s confidence in its predictions on a large benchmark of datasets.

Lorenzo Perini, Vincent Vercruyssen, Jesse Davis

Architecture of Neural Networks

Frontmatter
XferNAS: Transfer Neural Architecture Search

The term Neural Architecture Search (NAS) refers to the automatic optimization of network architectures for a new, previously unknown task. Since testing an architecture is computationally very expensive, many optimizers need days or even weeks to find suitable architectures. However, this search time can be significantly reduced if knowledge from previous searches on different tasks is reused. In this work, we propose a generally applicable framework that introduces only minor changes to existing optimizers to leverage this feature. As an example, we select an existing optimizer and demonstrate the complexity of the integration of the framework as well as its impact. In experiments on CIFAR-10 and CIFAR-100, we observe a reduction in the search time from 200 to only 6 GPU days, a speed up by a factor of 33. In addition, we observe new records of 1.99 and 14.06 for NAS optimizers on the CIFAR benchmarks, respectively. In a separate study, we analyze the impact of the amount of source and target data. Empirically, we demonstrate that the proposed framework generally gives better results and, in the worst case, is just as good as the unmodified optimizer.

Martin Wistuba
Finding the Optimal Network Depth in Classification Tasks

We develop a fast end-to-end method for training lightweight neural networks using multiple classifier heads. By allowing the model to determine the importance of each head and rewarding the choice of a single shallow classifier, we are able to detect and remove unneeded components of the network. This operation, which can be seen as finding the optimal depth of the model, significantly reduces the number of parameters and accelerates inference across different processing units, which is not the case for many standard pruning methods. We show the performance of our method on multiple network architectures and datasets, analyze its optimization properties, and conduct ablation studies.

Bartosz Wójcik, Maciej Wołczyk, Klaudia Bałazy, Jacek Tabor
Topological Insights into Sparse Neural Networks

Sparse neural networks are effective approaches to reduce the resource requirements for the deployment of deep neural networks. Recently, the concept of adaptive sparse connectivity, has emerged to allow training sparse neural networks from scratch by optimizing the sparse structure during training. However, comparing different sparse topologies and determining how sparse topologies evolve during training, especially for the situation in which the sparse structure optimization is involved, remain as challenging open questions. This comparison becomes increasingly complex as the number of possible topological comparisons increases exponentially with the size of networks. In this work, we introduce an approach to understand and compare sparse neural network topologies from the perspective of graph theory. We first propose Neural Network Sparse Topology Distance (NNSTD) to measure the distance between different sparse neural networks. Further, we demonstrate that sparse neural networks can outperform over-parameterized models in terms of performance, even without any further structure optimization. To the end, we also show that adaptive sparse connectivity can always unveil a plenitude of sparse sub-networks with very different topologies which outperform the dense model, by quantifying and comparing their topological evolutionary processes. The latter findings complement the Lottery Ticket Hypothesis by showing that there is a much more efficient and robust way to find “winning tickets”. Altogether, our results start enabling a better theoretical understanding of sparse neural networks, and demonstrate the utility of using graph theory to analyze them.

Shiwei Liu, Tim Van der Lee, Anil Yaman, Zahra Atashgahi, Davide Ferraro, Ghada Sokar, Mykola Pechenizkiy, Decebal Constantin Mocanu

Graph Neural Networks

Frontmatter
GRAM-SMOT: Top-N Personalized Bundle Recommendation via Graph Attention Mechanism and Submodular Optimization

Bundle recommendation – recommending a group of products in place of individual products to customers is gaining attention day by day. It presents two interesting challenges – (1) how to personalize and recommend existing bundles to users, and (2) how to generate personalized novel bundles targeting specific users. Recently, few models have been proposed for modeling the bundle recommendation problem. However, they have the following shortcomings. First, they do not consider the higher-order relationships amongst the entities (users, items and bundles). Second, they do not model the relative influence of items present in the bundles, which is crucial in defining such bundles. In this work, we propose GRAM-SMOT – a graph attention-based framework to address the above challenges. Further, we define a loss function based on the metric-learning approach to learn the embeddings of entities efficiently. To generate novel bundles, we propose a strategy that leverages submodular function maximization. To analyze the performance of the proposed model, we conduct comprehensive experiments on two real-world datasets. The experimental results demonstrate the superior performance of the proposed model over the existing state-of-the-art models.

M. Vijaikumar, Shirish Shevade, M. N. Murty
Temporal Heterogeneous Interaction Graph Embedding for Next-Item Recommendation

In the scenario of next-item recommendation, previous methods attempt to model user preferences by capturing the evolution of sequential interactions. However, their sequential expression is often limited, without modeling complex dynamics that short-term demands can often be influenced by long-term habits. Moreover, few of them take into account the heterogeneous types of interaction between users and items. In this paper, we model such complex data as a Temporal Heterogeneous Interaction Graph (THIG) and learn both user and item embeddings on THIGs to address next-item recommendation. The main challenges involve two aspects: the complex dynamics and rich heterogeneity of interactions. We propose THIG Embedding (THIGE) which models the complex dynamics so that evolving short-term demands are guided by long-term historical habits, and leverages the rich heterogeneity to express the latent relevance of different-typed preferences. Extensive experiments on real-world datasets demonstrate that THIGE consistently outperforms the state-of-the-art methods.

Yugang Ji, MingYang Yin, Yuan Fang, Hongxia Yang, Xiangwei Wang, Tianrui Jia, Chuan Shi
Node Classification in Temporal Graphs Through Stochastic Sparsification and Temporal Structural Convolution

Node classification in temporal graphs aims to predict node labels based on historical observations. In real-world applications, temporal graphs are complex with both graph topology and node attributes evolving rapidly, which poses a high overfitting risk to existing graph learning approaches. In this paper, we propose a novel Temporal Structural Network (TSNet) model, which jointly learns temporal and structural features for node classification from the sparsified temporal graphs. We show that the proposed TSNet learns how to sparsify temporal graphs to favor the subsequent classification tasks and prevent overfitting from complex neighborhood structures. The effective local features are then extracted by simultaneous convolutions in temporal and spatial domains. Using the standard stochastic gradient descent and backpropagation techniques, TSNet iteratively optimizes sparsification and node representations for subsequent classification tasks. Experimental study on public benchmark datasets demonstrates the competitive performance of the proposed model in node classification. Besides, TSNet has the potential to help domain experts to interpret and visualize the learned models.

Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, Wei Wang
DyHGCN: A Dynamic Heterogeneous Graph Convolutional Network to Learn Users’ Dynamic Preferences for Information Diffusion Prediction

Information diffusion prediction is a fundamental task for understanding the information propagation process. It has wide applications in such as misinformation spreading prediction and malicious account detection. Previous works either concentrate on utilizing the context of a single diffusion sequence or using the social network among users for information diffusion prediction. However, the diffusion paths of different messages naturally constitute a dynamic diffusion graph. For one thing, previous works cannot jointly utilize both the social network and diffusion graph for prediction, which is insufficient to model the complexity of the diffusion process and results in unsatisfactory prediction performance. For another, they cannot learn users’ dynamic preferences. Intuitively, users’ preferences are changing as time goes on and users’ personal preference determines whether the user will repost the information. Thus, it is beneficial to consider users’ dynamic preferences in information diffusion prediction.In this paper, we propose a novel dynamic heterogeneous graph convolutional network (DyHGCN) to jointly learn the structural characteristics of the social graph and dynamic diffusion graph. Then, we encode the temporal information into the heterogeneous graph to learn the users’ dynamic preferences. Finally, we apply multi-head attention to capture the context-dependency of the current diffusion path to facilitate the information diffusion prediction task. Experimental results show that DyHGCN significantly outperforms the state-of-the-art models on three public datasets, which shows the effectiveness of the proposed model.

Chunyuan Yuan, Jiacheng Li, Wei Zhou, Yijun Lu, Xiaodan Zhang, Songlin Hu
A Self-attention Network Based Node Embedding Model

Despite several signs of progress have been made recently, limited research has been conducted for an inductive setting where embeddings are required for newly unseen nodes – a setting encountered commonly in practical applications of deep learning for graph networks. This significantly affects the performances of downstream tasks such as node classification, link prediction or community extraction. To this end, we propose SANNE – a novel unsupervised embedding model – whose central idea is to employ a transformer self-attention network to iteratively aggregate vector representations of nodes in random walks. Our SANNE aims to produce plausible embeddings not only for present nodes, but also for newly unseen nodes. Experimental results show that the proposed SANNE obtains state-of-the-art results for the node classification task on well-known benchmark datasets.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dinh Phung
Graph-Revised Convolutional Network

Graph Convolutional Networks (GCNs) have received increasing attention in the machine learning community for effectively leveraging both the content features of nodes and the linkage patterns across graphs in various applications. As real-world graphs are often incomplete and noisy, treating them as ground-truth information, which is a common practice in most GCNs, unavoidably leads to sub-optimal solutions. Existing efforts for addressing this problem either involve an over-parameterized model which is difficult to scale, or simply re-weight observed edges without dealing with the missing-edge issue. This paper proposes a novel framework called Graph-Revised Convolutional Network (GRCN), which avoids both extremes. Specifically, a GCN-based graph revision module is introduced for predicting missing edges and revising edge weights w.r.t. downstream tasks via joint optimization. A theoretical analysis reveals the connection between GRCN and previous work on multigraph belief propagation. Experiments on six benchmark datasets show that GRCN consistently outperforms strong baseline methods, especially when the original graphs are severely incomplete or the labeled instances for model training are highly sparse. (Our code is available at https://github.com/Maysir/GRCN ).

Donghan Yu, Ruohong Zhang, Zhengbao Jiang, Yuexin Wu, Yiming Yang
Robust Training of Graph Convolutional Networks via Latent Perturbation

Despite the recent success of graph convolutional networks (GCNs) in modeling graph structured data, its vulnerability to adversarial attacks has been revealed and attacks on both node feature and graph structure have been designed. Direct extension of defense algorithms based on adversarial samples meets with immediate challenge because computing the adversarial network costs substantially. We propose addressing this issue by perturbing the latent representations in GCNs, which not only dispenses with generating adversarial networks, but also attains improved robustness and accuracy by respecting the latent manifold of the data. This new framework of latent adversarial training on graphs is applied to node classification, link prediction, and recommender systems. Our empirical experimental results confirm the superior robustness performance over strong baselines.

Hongwei Jin, Xinhua Zhang
Enhancing Robustness of Graph Convolutional Networks via Dropping Graph Connections

Graph convolutional networks (GCNs) have emerged as one of the most popular neural networks for a variety of tasks over graphs. Despite their remarkable learning and inference ability, GCNs are still vulnerable to adversarial attacks that imperceptibly perturb graph structures and node features to degrade the performance of GCNs, which poses serious threats to the real-world applications. Inspired by the observations from recent studies suggesting that edge manipulations play a key role in graph adversarial attacks, in this paper, we take those attack behaviors into consideration and design a biased graph-sampling scheme to drop graph connections such that random, sparse and deformed subgraphs are constructed for training and inference. This method yields a significant regularization on graph learning, alleviates the sensitivity to edge manipulations, and thus enhances the robustness of GCNs. We evaluate the performance of our proposed method, while the experimental results validate its effectiveness against adversarial attacks.

Lingwei Chen, Xiaoting Li, Dinghao Wu

Gaussian Processes

Frontmatter
Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation

Many machine learning models require a training procedure based on running stochastic gradient descent. A key element for the efficiency of those algorithms is the choice of the learning rate schedule. While finding good learning rates schedules using Bayesian optimisation has been tackled by several authors, adapting it dynamically in a data-driven way is an open question. This is of high practical importance to users that need to train a single, expensive model. To tackle this problem, we introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-/regressive formulation, that flexibly adjusts to abrupt changes of behaviours induced by new learning rate values. As illustrated, this model is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks (in a classical BO setup), as well as warm-starting it for a new task.

Victor Picheny, Vincent Dutordoir, Artem Artemev, Nicolas Durrande
MUMBO: MUlti-task Max-Value Bayesian Optimization

We propose MUMBO, the first high-performing yet computationally efficient acquisition function for multi-task Bayesian optimization. Here, the challenge is to perform efficient optimization by evaluating low-cost functions somehow related to our true target function. This is a broad class of problems including the popular task of multi-fidelity optimization. However, while information-theoretic acquisition functions are known to provide state-of-the-art Bayesian optimization, existing implementations for multi-task scenarios have prohibitive computational requirements. Previous acquisition functions have therefore been suitable only for problems with both low-dimensional parameter spaces and function query costs sufficiently large to overshadow very significant optimization overheads. In this work, we derive a novel multi-task version of entropy search, delivering robust performance with low computational overheads across classic optimization challenges and multi-task hyper-parameter tuning. MUMBO is scalable and efficient, allowing multi-task Bayesian optimization to be deployed in problems with rich parameter and fidelity spaces.

Henry B. Moss, David S. Leslie, Paul Rayson
Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models

In interactive multi-objective reinforcement learning (MORL), an agent has to simultaneously learn about the environment and the preferences of the user, in order to quickly zoom in on those decisions that are likely to be preferred by the user. In this paper we study interactive MORL in the context of multi-objective multi-armed bandits. Contrary to earlier approaches to interactive MORL that force the utility of the user to be expressed as a weighted sum of the values for each objective, we do not make such stringent a priori assumptions. Specifically, we not only allow non-linear preferences, but also obviate the need to specify the exact model class in the utility function must fall. To achieve this, we propose a new approach called Gaussian-process Utility Thompson Sampling (GUTS). GUTS employs parameterless Bayesian learning to allow any type of utility function, exploits monotonicity information, and limits the number of queries posed to the user by ensuring that questions are statistically significant. We show empirically that GUTS can learn non-linear preferences, and that the regret and number of queries posed to the user are highly sub-linear in the number of arm pulls. (A preliminary version of this work was presented at the ALA workshop in 2018 [20]).

Diederik M. Roijers, Luisa M. Zintgraf, Pieter Libin, Mathieu Reymond, Eugenio Bargiacchi, Ann Nowé
Deep Gaussian Processes Using Expectation Propagation and Monte Carlo Methods

Deep Gaussian processes (DGPs) are the natural extension of Gaussian processes (GPs) to a multi-layer architecture. DGPs are powerful probabilistic models that have shown better results than standard GPs in terms of generalization performance and prediction uncertainty estimation. Nevertheless, exact inference in DGPs is intractable, making these models hard to train. For this task, current approaches in the literature use approximate inference techniques such as variational inference or approximate expectation propagation. In this work, we present a new method for inference in DGPs using an approximate inference technique based on Monte Carlo methods and the expectation propagation algorithm. Our experiments show that our method scales well to large datasets and that its performance is comparable or better than other state of the art methods. Furthermore, our training method leads to interesting properties in the predictive distribution of the DGP. More precisely, it is able to capture output noise that is dependent on the input and it can generate multimodal predictive distributions. These two properties, which are not shared by other state-of-the-art approximate inference methods for DGPs, are analyzed in detail in our experiments.

Gonzalo Hernández-Muñoz, Carlos Villacampa-Calvo, Daniel Hernández-Lobato

Computer Vision and Image Processing

Frontmatter
Companion Guided Soft Margin for Face Recognition

Face recognition has achieved remarkable improvements with the help of the angular margin based softmax losses. However, the margin is usually manually set and kept constant during the training process, which neglects both the optimization difficulty and the informative similarity structures among different instances. Although some works have been proposed to tackle this issue, they adopt similar methods by simply changing the margin for different classes, leading to limited performance improvements. In this paper, we propose a novel sample-wise adaptive margin loss function from the perspective of the hypersphere manifold structure, which we call companion guided soft margin (CGSM). CGSM introduces the information of distribution in the feature space, and conducts teacher-student optimization within each mini-batch. Samples of better convergence are considered as teachers, while students are optimized with extra soft penalties, so that the intra-class distances of inferior samples can be further compacted. Moreover, CGSM does not require sophisticated mining techniques, which makes it easy to implement. Extensive experiments and analysis on MegaFace, LFW, CALFW, IJB-B and IJB-C demonstrate that our approach outperforms state-of-the-art methods using the same network architecture and training dataset.

Yingcheng Su, Yichao Wu, Zhenmao Li, Qiushan Guo, Ken Chen, Junjie Yan, Ding Liang, Xiaolin Hu
Soft Labels Transfer with Discriminative Representations Learning for Unsupervised Domain Adaptation

Domain adaptation aims to address the challenge of transferring the knowledge obtained from the source domain with rich label information to the target domain with less or even no label information. Recent methods start to tackle this problem by incorporating the hard-pseudo labels for the target samples to better reduce the cross-domain distribution shifts. However, these approaches are vulnerable to the error accumulation and hence unable to preserve cross-domain category consistency. Because the accuracy of pseudo labels cannot be guaranteed explicitly. To address this issue, we propose a Soft Labels transfer with Discriminative Representations learning (SLDR) framework to jointly optimize the class-wise adaptation with soft target labels and learn the discriminative domain-invariant features in a unified model. Specifically, to benefit each other in an effective manner, we simultaneously explore soft target labels by label propagation for better condition adaptation and preserve the important properties of inter-class separability and intra-class compactness for reducing more domain shifts. Extensive experiments are conducted on several unsupervised domain adaptation datasets, and the results show that SLDR outperforms the state-of-the-art methods.

Manliang Cao, Xiangdong Zhou, Lan Lin
Information-Bottleneck Approach to Salient Region Discovery

We propose a new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle. Provided with a set of labeled images, the mask generation model is minimizing mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a Boolean rather than a continuous mask, entirely concealing the information in masked-out pixels. Using a set of synthetic datasets based on MNIST and CIFAR10 and the SVHN datasets, we demonstrate that our method can successfully attend to features known to define the image class.

Andrey Zhmoginov, Ian Fischer, Mark Sandler
FAWA: Fast Adversarial Watermark Attack on Optical Character Recognition (OCR) Systems

Deep neural networks (DNNs) significantly improved the accuracy of optical character recognition (OCR) and inspired many important applications. Unfortunately, OCRs also inherit the vulnerability of DNNs under adversarial examples. Different from colorful vanilla images, text images usually have clear backgrounds. Adversarial examples generated by most existing adversarial attacks are unnatural and pollute the background severely. To address this issue, we propose the F ast Adversarial Watermark Attack (FAWA) against sequence-based OCR models in the white-box manner. By disguising the perturbations as watermarks, we can make the resulting adversarial images appear natural to human eyes and achieve a perfect attack success rate. FAWA works with either gradient-based or optimization-based perturbation generation. In both letter-level and word-level attacks, our experiments show that in addition to natural appearance, FAWA achieves a 100% attack success rate with 60% less perturbations and 78% fewer iterations on average. In addition, we further extend FAWA to support full-color watermarks, other languages, and even the OCR accuracy-enhancing mechanism.

Lu Chen, Jiao Sun, Wei Xu

Natural Language Processing

Frontmatter
Less Is More: Rejecting Unreliable Reviews for Product Question Answering

Promptly and accurately answering questions on products is important for e-commerce applications. Manually answering product questions (e.g. on community question answering platforms) results in slow response and does not scale. Recent studies show that product reviews are a good source for real-time, automatic product question answering (PQA). In the literature, PQA is formulated as a retrieval problem with the goal to search for the most relevant reviews to answer a given product question. In this paper, we focus on the issue of answerability and answer reliability for PQA using reviews. Our investigation is based on the intuition that many questions may not be answerable with a finite set of reviews. When a question is not answerable, a system should return nil answers rather than providing a list of irrelevant reviews, which can have significant negative impact on user experience. Moreover, for answerable questions, only the most relevant reviews that answer the question should be included in the result. We propose a conformal prediction based framework to improve the reliability of PQA systems, where we reject unreliable answers so that the returned results are more concise and accurate at answering the product question, including returning nil answers for unanswerable questions. Experiments on a widely used Amazon dataset show encouraging results of our proposed framework. More broadly, our results demonstrate a novel and effective application of conformal methods to a retrieval task.

Shiwei Zhang, Xiuzhen Zhang, Jey Han Lau, Jeffrey Chan, Cecile Paris
AMQAN: Adaptive Multi-Attention Question-Answer Networks for Answer Selection

Community Question Answering (CQA) provides platforms for users with various background to obtain information and share knowledge. In the recent years, with the rapid development of such online platforms, an enormous amount of archive data has accumulated which makes it more and more difficult for users to identify desirable answers. Therefore, answer selection becomes a very important subtask in Community Question Answering. A posted question often consists of two parts: a question subject with summarization of users’ intention, and a question body clarifying the subject with more details. Most of the existing answer selection techniques often roughly concatenate these two parts, so that they cause excessive noises besides useful information to questions, inevitably reducing the performance of answer selection approaches. In this paper, we propose AMQAN, an adaptive multi-attention question-answer network with embeddings at different levels, which makes comprehensive use of semantic information in questions and answers, and alleviates the noise issue at the same time. To evaluate our proposed approach, we implement experiments on two datasets, SemEval 2015 and SemEval 2017. Experiment results show that AMQAN outperforms all existing models on two standard CQA datasets.

Haitian Yang, Weiqing Huang, Xuan Zhao, Yan Wang, Yuyan Chen, Bin Lv, Rui Mao, Ning Li
Inductive Document Representation Learning for Short Text Clustering

Short text clustering (STC) is an important task that can discover topics or groups in the fast-growing social networks, e.g., Tweets and Google News. Different from the long texts, STC is more challenging since the word co-occurrence patterns presented in short texts usually make the traditional methods (e.g., TF-IDF) suffer from a sparsity problem of inevitably generating sparse representations. Moreover, these learned representations may lead to the inferior performance of clustering which essentially relies on calculating the distances between the presentations. For alleviating this problem, recent studies are mostly committed to developing representation learning approaches to learn compact low-dimensional embeddings, while most of them, including probabilistic graph models and word embedding models, require all documents in the corpus to be present during the training process. Thus, these methods inherently perform transductive learning which naturally cannot handle well the representations of unseen documents where few words have been learned before. Recently, Graph Neural Networks (GNNs) has drawn a lot of attention in various applications. Inspired by the mechanism of vertex information propagation guided by the graph structure in GNNs, we propose an inductive document representation learning model, called IDRL, that can map the short text structures into a graph network and recursively aggregate the neighbor information of the words in the unseen documents. Then, we can reconstruct the representations of the previously unseen short texts with the limited numbers of word embeddings learned before. Experimental results show that our proposed method can learn more discriminative representations in terms of inductive classification tasks and achieve better clustering performance than state-of-the-art models on four real-world datasets.

Junyang Chen, Zhiguo Gong, Wei Wang, Xiao Dong, Wei Wang, Weiwen Liu, Cong Wang, Xian Chen
FUSE: Multi-faceted Set Expansion by Coherent Clustering of Skip-Grams

Set expansion aims to expand a small set of seed entities into a complete set of relevant entities. Most existing approaches assume the input seed set is unambiguous and completely ignore the multi-faceted semantics of seed entities. As a result, given the seed set {“Canon”, “Sony”, “Nikon”}, previous models return one mixed set of entities that are either Camera Brands or Japanese Companies. In this paper, we study the task of multi-faceted set expansion, which aims to capture all semantic facets in the seed set and return multiple sets of entities, one for each semantic facet. We propose an unsupervised framework, FUSE, which consists of three major components: (1) facet discovery module: identifies all semantic facets of each seed entity by extracting and clustering its skip-grams, and (2) facet fusion module: discovers shared semantic facets of the entire seed set by an optimization formulation, and (3) entity expansion module: expands each semantic facet by utilizing a masked language model with pre-trained BERT models. Extensive experiments demonstrate that FUSE can accurately identify multiple semantic facets of the seed set and generate quality entities for each facet.

Wanzheng Zhu, Hongyu Gong, Jiaming Shen, Chao Zhang, Jingbo Shang, Suma Bhat, Jiawei Han
Hierarchical Interaction Networks with Rethinking Mechanism for Document-Level Sentiment Analysis

Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicate sentiment information. Recent works have been devoted to leveraging text summarization and have achieved promising results. However, these summarization-based methods did not take full advantage of the summary including ignoring the inherent interactions between the summary and document. As a result, they limited the representation to express major points in the document, which is highly indicative of the key sentiment. In this paper, we study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA. A Hierarchical Interaction Networks (HIN) is proposed to explore bidirectional interactions between the summary and document at multiple granularities and learn subject-oriented document representations for sentiment classification. Furthermore, we design a Sentiment-based Rethinking mechanism (SR) by refining the HIN with sentiment label information to learn a more sentiment-aware document representation. We extensively evaluate our proposed models on three public datasets. The experimental results consistently demonstrate the effectiveness of our proposed models and show that HIN-SR outperforms various state-of-the-art methods.

Lingwei Wei, Dou Hu, Wei Zhou, Xuehai Tang, Xiaodan Zhang, Xin Wang, Jizhong Han, Songlin Hu
Early Detection of Fake News with Multi-source Weak Social Supervision

Social media has greatly enabled people to participate in online activities at an unprecedented rate. However, this unrestricted access also exacerbates the spread of misinformation and fake news which cause confusion and chaos if not detected in a timely manner. Given the rapidly evolving nature of news events and the limited amount of annotated data, state-of-the-art systems on fake news detection face challenges for early detection. In this work, we exploit multiple weak signals from different sources from user engagements with contents (referred to as weak social supervision), and their complementary utilities to detect fake news. We jointly leverage limited amount of clean data along with weak signals from social engagements to train a fake news detector in a meta-learning framework which estimates the quality of different weak instances. Experiments on real-world datasets demonstrate that the proposed framework outperforms state-of-the-art baselines for early detection of fake news without using any user engagements at prediction time.

Kai Shu, Guoqing Zheng, Yichuan Li, Subhabrata Mukherjee, Ahmed Hassan Awadallah, Scott Ruston, Huan Liu
Generating Financial Reports from Macro News via Multiple Edits Neural Networks

Automatically generating financial reports given a piece of breaking macro news is quite challenging task. Essentially, this task is a text-to-text generation problem but is to learn long text, i.e., greater than 40 words, from a piece of short macro news. Moreover, the core component for human beings to generate financial reports is the logic inference given a piece of succinct macro news. To address this issue, we propose the novel multiple edits neural networks which first learns the outline for given news and then generates financial reports from the learnt outline. Particularly, the input news is first embedded via skip-gram model and is then fed into Bi-LSTM component to train the contextual representation vector. This vector is used to learn the latent word probability distribution for the generation of financial reports. To train this end to end neural network model, we have collected one hundred thousand pairs of news-report data. Extensive experiments are performed on this collected dataset. The proposed model achieves the SOTA performance against baseline models w.r.t. the evaluation criteria BLEU, ROUGE and human scores. Although the readability of the generated reports by our approach is better than that of the rest models, it remains an open problem which needs further efforts in the future.

Wenxin Hu, Xiaofeng Zhang, Yunpeng Ren
Continual Learning with Knowledge Transfer for Sentiment Classification

This paper studies continual learning (CL) for sentiment classification (SC). In this setting, the CL system learns a sequence of SC tasks incrementally in a neural network, where each task builds a classifier to classify the sentiment of reviews of a particular product category or domain. Two natural questions are: Can the system transfer the knowledge learned in the past from the previous tasks to the new task to help it learn a better model for the new task? And, can old models for previous tasks be improved in the process as well? This paper proposes a novel technique called KAN to achieve these objectives. KAN can markedly improve the SC accuracy of both the new task and the old tasks via forward and backward knowledge transfer. The effectiveness of KAN is demonstrated through extensive experiments (Code and data are available at: https://github.com/ZixuanKe/LifelongSentClass ).

Zixuan Ke, Bing Liu, Hao Wang, Lei Shu

Bioinformatics

Frontmatter
Predictive Bi-clustering Trees for Hierarchical Multi-label Classification

In the recent literature on multi-label classification, a lot of attention is given to methods that exploit label dependencies. Most of these methods assume that the dependencies are static over the entire instance space. In contrast, here we present an approach that dynamically adapts the label partitions in a multi-label decision tree learning context. In particular, we adapt the recently introduced predictive bi-clustering tree (PBCT) method towards multi-label classification tasks. This way, tree nodes can split the instance-label matrix both in a horizontal and a vertical way. We focus on hierarchical multi-label classification (HMC) tasks, and map the label hierarchy to a feature set over the label space. This feature set is exploited to infer vertical splits, which are regulated by a lookahead strategy in the tree building procedure. We evaluate our proposed method using benchmark datasets. Experiments demonstrate that our proposal (PBCT-HMC) obtained better or competitive results in comparison to its direct competitors, both in terms of predictive performance and model size. Compared to an HMC method that does not produce label partitions though, our method results in larger models on average, while still producing equally large or smaller models in one third of the datasets by creating suitable label partitions.

Bruna Z. Santos, Felipe K. Nakano, Ricardo Cerri, Celine Vens
Self-attention Enhanced Patient Journey Understanding in Healthcare System

Understanding patients’ journeys in healthcare system is a fundamental prepositive task for a broad range of AI-based healthcare applications. This task aims to learn an informative representation that can comprehensively encode hidden dependencies among medical events and its inner entities, and then the use of encoding outputs can greatly benefit the downstream application-driven tasks. A patient journey is a sequence of electronic health records (EHRs) over time that is organized at multiple levels: patient, visits and medical codes. The key challenge of patient journey understanding is to design an effective encoding mechanism which can properly tackle the aforementioned multi-level structured patient journey data with temporal sequential visits and a set of medical codes. This paper proposes a novel self-attention mechanism that can simultaneously capture the contextual and temporal relationships hidden in patient journeys. A multi-level self-attention network (MusaNet) is specifically designed to learn the representations of patient journeys that is used to be a long sequence of activities. We evaluated the efficacy of our method on two medical application tasks with real-world benchmark datasets. The results have demonstrated the proposed MusaNet produces higher-quality representations than state-of-the-art baseline methods. The source code is available in https://github.com/xueping/MusaNet .

Xueping Peng, Guodong Long, Tao Shen, Sen Wang, Jing Jiang
MMCNN: A Multi-branch Multi-scale Convolutional Neural Network for Motor Imagery Classification

Electroencephalography (EEG) based motor imagery (MI) is one of the promising Brain–computer interface (BCI) paradigms enable humans to communicate with the outside world based solely on brain intentions. Although convolutional neural networks have been gradually applied to MI classification task and gained high performance, the following problems still exist to make effective decoding of raw EEG signals challenging: 1) EEG signals are non-linear, non-stationary, and low signal-noise ratio. 2) Most existing end-to-end MI models utilize single scale convolution which limits the result of classification because the best convolution scale varies with different subject (called subject difference). In addition, even for the same subject, the best convolution scale also differs from time to time (called time difference). In this paper, we propose a novel end-to-end model, named Multi-branch Multi-scale Convolutional Neural Network (MMCNN), for motor imagery classification. The MMCNN model effectively decodes raw EEG signals without any pre-processing including filtering. Meanwhile, the multi-branch multi-scale convolution structure successfully addresses the problems of subject difference and time difference based on parallel processing. In addition, multi-scale convolution can realize the characterization of information in different frequency bands, thereby effectively solving the problem that the best convolution scale is difficult to determine. Experiments on two public BCI competition datasets demonstrate that the proposed MMCNN model outperforms the state-of-the-art models. The implementation code is made publicly available https://github.com/jingwang2020/ECML-PKDD_MMCNN .

Ziyu Jia, Youfang Lin, Jing Wang, Kaixin Yang, Tianhang Liu, Xinwang Zhang
Backmatter
Metadata
Title
Machine Learning and Knowledge Discovery in Databases
Editors
Frank Hutter
Prof. Dr. Kristian Kersting
Dr. Jefrey Lijffijt
Isabel Valera
Copyright Year
2021
Electronic ISBN
978-3-030-67664-3
Print ISBN
978-3-030-67663-6
DOI
https://doi.org/10.1007/978-3-030-67664-3

Premium Partner