Skip to main content

2019 | Buch

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part II

herausgegeben von: Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, Georgiana Ifrim

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Über dieses Buch

The three volume proceedings LNAI 11051 – 11053 constitutes the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, held in Dublin, Ireland, in September 2018.

The total of 131 regular papers presented in part I and part II was carefully reviewed and selected from 535 submissions; there are 52 papers in the applied data science, nectar and demo track.

The contributions were organized in topical sections named as follows:
Part I: adversarial learning; anomaly and outlier detection; applications; classification; clustering and unsupervised learning; deep learningensemble methods; and evaluation.
Part II: graphs; kernel methods; learning paradigms; matrix and tensor analysis; online and active learning; pattern and sequence mining; probabilistic models and statistical methods; recommender systems; and transfer learning.
Part III: ADS data science applications; ADS e-commerce; ADS engineering and design; ADS financial and security; ADS health; ADS sensing and positioning; nectar track; and demo track.

Inhaltsverzeichnis

Frontmatter

Graphs

Frontmatter
Temporally Evolving Community Detection and Prediction in Content-Centric Networks

In this work, we consider the problem of combining link, content and temporal analysis for community detection and prediction in evolving networks. Such temporal and content-rich networks occur in many real-life settings, such as bibliographic networks and question answering forums. Most of the work in the literature (that uses both content and structure) deals with static snapshots of networks, and they do not reflect the dynamic changes occurring over multiple snapshots. Incorporating dynamic changes in the communities into the analysis can also provide useful insights about the changes in the network such as the migration of authors across communities. In this work, we propose Chimera ( https://github.com/renatolfc/chimera-stf ), a shared factorization model that can simultaneously account for graph links, content, and temporal analysis. This approach works by extracting the latent semantic structure of the network in multidimensional form, but in a way that takes into account the temporal continuity of these embeddings. Such an approach simplifies temporal analysis of the underlying network by using the embedding as a surrogate. A consequence of this simplification is that it is also possible to use this temporal sequence of embeddings to predict future communities. We present experimental results illustrating the effectiveness of the approach. Code related to this paper is available at: https://github.com/renatolfc/chimera-stf .

Ana Paula Appel, Renato L. F. Cunha, Charu C. Aggarwal, Marcela Megumi Terakado
Local Topological Data Analysis to Uncover the Global Structure of Data Approaching Graph-Structured Topologies

Gene expression data of differentiating cells, galaxies distributed in space, and earthquake locations, all share a common property: they lie close to a graph-structured topology in their respective spaces [1, 4, 9, 10, 20], referred to as one-dimensional stratified spaces in mathematics. Often, the uncovering of such topologies offers great insight into these data sets. However, methods for dimensionality reduction are clearly inappropriate for this purpose, and also methods from the relatively new field of Topological Data Analysis (TDA) are inappropriate, due to noise sensitivity, computational complexity, or other limitations. In this paper we introduce a new method, termed Local TDA (LTDA), which resolves the issues of pre-existing methods by unveiling (global) graph-structured topologies in data by means of robust and computationally cheap local analyses. Our method rests on a simple graph-theoretic result that enables one to identify isolated, end-, edge- and multifurcation points in the topology underlying the data. It then uses this information to piece together a graph that is homeomorphic to the unknown one-dimensional stratified space underlying the point cloud data. We evaluate our method on a number of artificial and real-life data sets, demonstrating its superior effectiveness, robustness against noise, and scalability. Code related to this paper is available at: https://bitbucket.org/ghentdatascience/gltda-public .

Robin Vandaele, Tijl De Bie, Yvan Saeys
Similarity Modeling on Heterogeneous Networks via Automatic Path Discovery

Heterogeneous networks are widely used to model real-world semi-structured data. The key challenge of learning over such networks is the modeling of node similarity under both network structures and contents. To deal with network structures, most existing works assume a given or enumerable set of meta-paths and then leverage them for the computation of meta-path-based proximities or network embeddings. However, expert knowledge for given meta-paths is not always available, and as the length of considered meta-paths increases, the number of possible paths grows exponentially, which makes the path searching process very costly. On the other hand, while there are often rich contents around network nodes, they have hardly been leveraged to further improve similarity modeling. In this work, to properly model node similarity in content-rich heterogeneous networks, we propose to automatically discover useful paths for pairs of nodes under both structural and content information. To this end, we combine continuous reinforcement learning and deep content embedding into a novel semi-supervised joint learning framework. Specifically, the supervised reinforcement learning component explores useful paths between a small set of example similar pairs of nodes, while the unsupervised deep embedding component captures node contents and enables inductive learning on the whole network. The two components are jointly trained in a closed loop to mutually enhance each other. Extensive experiments on three real-world heterogeneous networks demonstrate the supreme advantages of our algorithm. Code related to this paper is available at: https://github.com/yangji9181/AutoPath .

Carl Yang, Mengxiong Liu, Frank He, Xikun Zhang, Jian Peng, Jiawei Han
Dynamic Hierarchies in Temporal Directed Networks

The outcome of interactions in many real-world systems can be often explained by a hierarchy between the participants. Discovering hierarchy from a given directed network can be formulated as follows: partition vertices into levels such that, ideally, there are only forward edges, that is, edges from upper levels to lower levels. In practice, the ideal case is impossible, so instead we minimize some penalty function on the backward edges. One practical option for such a penalty is agony, where the penalty depends on the severity of the violation. In this paper we extend the definition of agony to temporal networks. In this setup we are given a directed network with time stamped edges, and we allow the rank assignment to vary over time. We propose 2 strategies for controlling the variation of individual ranks. In our first variant, we penalize the fluctuation of the rankings over time by adding a penalty directly to the optimization function. In our second variant we allow the rank change at most once. We show that the first variant can be solved exactly in polynomial time while the second variant is NP-hard, and in fact inapproximable. However, we develop an iterative method, where we first fix the change point and optimize the ranks, and then fix the ranks and optimize the change points, and reiterate until convergence. We show empirically that the algorithms are reasonably fast in practice, and that the obtained rankings are sensible. Code related to this paper is available at: https://bitbucket.org/orlyanalytics/temporalagony/ .

Nikolaj Tatti
Risk-Averse Matchings over Uncertain Graph Databases

In this work we study a problem that naturally arises in the context of several important applications, such as online dating, kidney exchanges, and team formation. Given an uncertain, weighted (hyper)graph, how can we efficiently find a (hyper)matching with high expected reward, and low risk? We introduce a novel formulation for finding matchings with maximum expected reward and bounded risk under a general model of uncertain weighted (hyper)graphs. Given that our optimization problem is NP-hard, we turn our attention to designing efficient approximation algorithms. For the case of uncertain weighted graphs, we provide a $$\frac{1}{3}$$ -approximation algorithm, and a $$\frac{1}{5}$$ -approximation algorithm with near optimal run time. For the case of uncertain weighted hypergraphs, we provide a $$\varOmega (\frac{1}{k})$$ -approximation algorithm, where k is the rank of the hypergraph (i.e., any hyperedge includes at most k nodes), that runs in almost (modulo log factors) linear time.We complement our theoretical results by testing our algorithms on a wide variety of synthetic experiments, where we observe in a controlled setting interesting findings on the trade-off between reward, and risk. We also provide an application of our formulation for providing recommendations of teams that are likely to collaborate, and have high impact. Code related to this paper is available at: https://github.com/tsourolampis/risk-averse-graph-matchings .

Charalampos E. Tsourakakis, Shreyas Sekar, Johnson Lam, Liu Yang
Discovering Urban Travel Demands Through Dynamic Zone Correlation in Location-Based Social Networks

Location-Based Social Networks (LBSN), which enable mobile users to announce their locations by checking-in to Points-of-Interests (POI), has accumulated a huge amount of user-POI interaction data. Compared to traditional sensor data, check-in data provides the much-needed information about trip purpose, which is critical to motivate human mobility but was not available for travel demand studies. In this paper, we aim to exploit the rich check-in data to model dynamic travel demands in urban areas, which can support a wide variety of mobile business solutions. Specifically, we first profile the functionality of city zones using the categorical density of POIs. Second, we use a Hawkes Process-based State-Space formulation to model the dynamic trip arrival patterns based on check-in arrival patterns. Third, we developed a joint model that integrates Pearson Product-Moment Correlation (PPMC) analysis into zone gravity modeling to perform dynamic Origin-Destination (OD) prediction. Last, we validated our methods using real-world LBSN and transportation data of New York City. The experimental results demonstrate the effectiveness of the proposed method for modeling dynamic urban travel demands. Our method achieves a significant improvement on OD prediction compared to baselines. Code related to this paper is available at: https://github.com/nicholasadam/PKDD2018-dynamic-zone-correlation .

Wangsu Hu, Zijun Yao, Sen Yang, Shuhong Chen, Peter J. Jin
Social-Affiliation Networks: Patterns and the SOAR Model

Given a social-affiliation network – a friendship graph where users have many, binary attributes e.g., check-ins, page likes or group memberships – what rules do its structural properties such as edge or triangle counts follow, in relation to its attributes? More challengingly, how can we synthetically generate networks which provably satisfy those rules or patterns? Our work attempts to answer these closely-related questions in the context of the increasingly prevalent social-affiliation graphs. Our contributions are two-fold: (a) Patterns: we discover three new rules (power laws) in the properties of attribute-induced subgraphs, substructures which connect the friendship structure to affiliations; (b) Model: we propose SOAR– short for SOcial-Affiliation graphs via Recursion– a stochastic model based on recursion and self-similarity, to provably generate graphs obeying the observed patterns. Experiments show that: (i) the discovered rules are useful in detecting deviations as anomalies and (ii) SOAR is fast and scales linearly with network size, producing graphs with millions of edges and attributes in only a few seconds. Code related to this paper is available at: www.github.com/dhivyaeswaran/soar .

Dhivya Eswaran, Reihaneh Rabbany, Artur W. Dubrawski, Christos Faloutsos
ONE-M: Modeling the Co-evolution of Opinions and Network Connections

How do opinions of individuals on controversial issues such as marijuana and gay marriage and their underlying social network connections evolve over time? Do people alter their network to have more like-minded friends or do they change their own opinions? Does the society eventually develop echo chambers? In this paper, we study dynamically evolving networks and changing user opinions to answer these questions. Our contributions are as follows: (a) Discovering Evolution of Polarization in Networks: We present evidence of growing divide among users based on their opinions who eventually form homophilic groups (b) Studying Opinion and Network Co-Evolution: We present observations of how individuals change opinions and position themselves in dynamically changing networks (c) Forecasting Persistence and Change in Opinions and Network: We propose ONE-M to forecast individual beliefs and persistence or dissolution of social ties. Using a unique real-world network dataset including periodic user surveys, we show that ONE-M performs with high accuracy, while outperforming the baseline approaches. Code related to this paper is available at: https://github.com/anigam/ONE-M and Data related to this paper is available at: http://netsense.nd.edu/ .

Aastha Nigam, Kijung Shin, Ashwin Bahulkar, Bryan Hooi, David Hachen, Boleslaw K. Szymanski, Christos Faloutsos, Nitesh V. Chawla
Think Before You Discard: Accurate Triangle Counting in Graph Streams with Deletions

Given a stream of edge additions and deletions, how can we estimate the count of triangles in it? If we can store only a subset of the edges, how can we obtain unbiased estimates with small variances?Counting triangles (i.e., cliques of size three) in a graph is a classical problem with applications in a wide range of research areas, including social network analysis, data mining, and databases. Recently, streaming algorithms for triangle counting have been extensively studied since they can naturally be used for large dynamic graphs. However, existing algorithms cannot handle edge deletions or suffer from low accuracy.Can we handle edge deletions while achieving high accuracy? We propose ThinkD, which accurately estimates the counts of global triangles (i.e., all triangles) and local triangles associated with each node in a fully dynamic graph stream with edge additions and deletions. Compared to its best competitors, ThinkD is (a) Accurate: up to 4.3 $${\times }$$ more accurate within the same memory budget, (b) Fast: up to 2.2 $${\times }$$ faster for the same accuracy requirements, and (c) Theoretically sound: always maintaining unbiased estimates with small variances. Code related to this paper is available at: https://github.com/kijungs/thinkd .

Kijung Shin, Jisu Kim, Bryan Hooi, Christos Faloutsos
Semi-supervised Blockmodelling with Pairwise Guidance

Blockmodelling is an important technique for detecting underlying patterns in graphs. Existing blockmodelling algorithms are unsupervised and cannot take advantage of the existing information that might be available about objects that are known to be similar. This background information can help finding complex patterns, such as hierarchical or ring blockmodel structures, which are difficult for traditional blockmodelling algorithms to detect. In this paper, we propose a new semi-supervised framework for blockmodelling, which allows background information to be incorporated in the form of pairwise membership information. Our proposed framework is based on the use of Lagrange multipliers and can be incorporated into existing iterative blockmodelling algorithms, enabling them to find complex blockmodel patterns in graphs. We demonstrate the utility of our framework for discovering complex patterns, via experiments over a range of synthetic and real data sets. Code related to this paper is available at: https://people.eng.unimelb.edu.au/mganji/ .

Mohadeseh Ganji, Jeffrey Chan, Peter J. Stuckey, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, Laurence Park

Kernel Methods

Frontmatter
Large-Scale Nonlinear Variable Selection via Kernel Random Features

We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets. Code related to this paper is available at: https://bitbucket.org/dmmlgeneva/srff_pytorch .

Magda Gregorová, Jason Ramapuram, Alexandros Kalousis, Stéphane Marchand-Maillet
Fast and Provably Effective Multi-view Classification with Landmark-Based SVM

We introduce a fast and theoretically founded method for learning landmark-based SVMs in a multi-view classification setting which leverages the complementary information of the different views and linearly scales with the size of the dataset. The proposed method – called MVL-SVM – applies a non-linear projection to the dataset through multi-view similarity estimates w.r.t. a small set of randomly selected landmarks, before learning a linear SVM in this latent space joining all the views. Using the uniform stability framework, we prove that our algorithm is robust to slight changes in the training set leading to a generalization bound depending on the number of views and landmarks. We also show that our method can be easily adapted to a missing-view scenario by only reconstructing the similarities to the landmarks. Empirical results, both in complete and missing view settings, highlight the superior performances of our method, in terms of accuracy and execution time, w.r.t. state of the art techniques. Code related to this paper is available at: https://github.com/vzantedeschi/multiviewLSVM .

Valentina Zantedeschi, Rémi Emonet, Marc Sebban
Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent

Kernel methods are a popular choice for classification problems, but when solving large-scale learning tasks computing the quadratic kernel matrix quickly becomes infeasible. To circumvent this problem, the Nyström method that approximates the kernel matrix using only a smaller sample of the kernel matrix has been proposed. Other techniques to speed up kernel learning include stochastic first order optimization and conditioning. We introduce Nyström-SGD, a learning algorithm that trains kernel classifiers by minimizing a convex loss function with conditioned stochastic gradient descent while exploiting the low-rank structure of a Nyström kernel approximation. Our experiments suggest that the Nyström-SGD enables us to rapidly train high-accuracy classifiers for large-scale classification tasks. Code related to this paper is available at: https://bitbucket.org/Whadup/kernelmachine/ .

Lukas Pfahler, Katharina Morik

Learning Paradigms

Frontmatter
Hyperparameter Learning for Conditional Kernel Mean Embeddings with Rademacher Complexity Bounds

Conditional kernel mean embeddings are nonparametric models that encode conditional expectations in a reproducing kernel Hilbert space. While they provide a flexible and powerful framework for probabilistic inference, their performance is highly dependent on the choice of kernel and regularization hyperparameters. Nevertheless, current hyperparameter tuning methods predominantly rely on expensive cross validation or heuristics that is not optimized for the inference task. For conditional kernel mean embeddings with categorical targets and arbitrary inputs, we propose a hyperparameter learning framework based on Rademacher complexity bounds to prevent overfitting by balancing data fit against model complexity. Our approach only requires batch updates, allowing scalable kernel hyperparameter tuning without invoking kernel approximations. Experiments demonstrate that our learning framework outperforms competing methods, and can be further extended to incorporate and learn deep neural network weights to improve generalization. (Source code available at: https://github.com/Kelvin-Hsu/cake ).

Kelvin Hsu, Richard Nock, Fabio Ramos
Deep Learning Architecture Search by Neuro-Cell-Based Evolution with Function-Preserving Mutations

The design of convolutional neural network architectures for a new image data set is a laborious and computational expensive task which requires expert knowledge. We propose a novel neuro-evolutionary technique to solve this problem without human interference. Our method assumes that a convolutional neural network architecture is a sequence of neuro-cells and keeps mutating them using function-preserving operations. This novel combination of approaches has several advantages. We define the network architecture by a sequence of repeating neuro-cells which reduces the search space complexity. Furthermore, these cells are possibly transferable and can be used in order to arbitrarily extend the complexity of the network. Mutations based on function-preserving operations guarantee better parameter initialization than random initialization such that less training time is required per network architecture. Our proposed method finds within 12 GPU hours neural network architectures that can achieve a classification error of about 4% and 24% with only 5.5 and 6.5 million parameters on CIFAR-10 and CIFAR-100, respectively. In comparison to competitor approaches, our method provides similar competitive results but requires orders of magnitudes less search time and in many cases less network parameters.

Martin Wistuba
VC-Dimension Based Generalization Bounds for Relational Learning

In many applications of relational learning, the available data can be seen as a sample from a larger relational structure (e.g. we may be given a small fragment from some social network). In this paper we are particularly concerned with scenarios in which we can assume that (i) the domain elements appearing in the given sample have been uniformly sampled without replacement from the (unknown) full domain and (ii) the sample is complete for these domain elements (i.e. it is the full substructure induced by these elements). Within this setting, we study bounds on the error of sufficient statistics of relational models that are estimated on the available data. As our main result, we prove a bound based on a variant of the Vapnik-Chervonenkis dimension which is suitable for relational data.

Ondřej Kuželka, Yuyi Wang, Steven Schockaert
Robust Super-Level Set Estimation Using Gaussian Processes

This paper focuses on the problem of determining as large a region as possible where a function exceeds a given threshold with high probability. We assume that we only have access to a noise-corrupted version of the function and that function evaluations are costly. To select the next query point, we propose maximizing the expected volume of the domain identified as above the threshold as predicted by a Gaussian process, robustified by a variance term. We also give asymptotic guarantees on the exploration effect of the algorithm, regardless of the prior misspecification. We show by various numerical examples that our approach also outperforms existing techniques in the literature in practice.

Andrea Zanette, Junzi Zhang, Mykel J. Kochenderfer
Scalable Nonlinear AUC Maximization Methods

The area under the ROC curve (AUC) is a widely used measure for evaluating classification performance on heavily imbalanced data. The kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines because of their capability in modeling the complex nonlinear structures underlying most real-world data. However, the high training complexity renders the kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström approximation. Our first algorithm maximizes the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, the second-order batch AUC maximization method becomes expensive to optimize for extremely massive datasets. This motivates us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines while they are able to surpass or at least match the AUC performance of the kernelized AUC machines. We also show experimentally that the proposed stochastic AUC classifier is able to reach the optimal solution, while the other state-of-the-art online and stochastic AUC maximization methods are prone to suboptimal convergence. Code related to this paper is available at: https://sites.google.com/view/majdikhalid/ .

Majdi Khalid, Indrakshi Ray, Hamidreza Chitsaz

Matrix and Tensor Analysis

Frontmatter
Lambert Matrix Factorization

Many data generating processes result in skewed data, which should be modeled by distributions that can capture the skewness. In this work we adopt the flexible family of Lambert W distributions that combine arbitrary standard distribution with specific nonlinear transformation to incorporate skewness. We describe how Lambert W distributions can be used in probabilistic programs by providing stable gradient-based inference, and demonstrate their use in matrix factorization. In particular, we focus in modeling logarithmically transformed count data. We analyze the weighted squared loss used by state-of-the-art word embedding models to learn interpretable representations from word co-occurrences and show that a generative model capturing the essential properties of those models can be built using Lambert W distributions.

Arto Klami, Jarkko Lagus, Joseph Sakaya
Identifying and Alleviating Concept Drift in Streaming Tensor Decomposition

Tensor decompositions are used in various data mining applications from social network to medical applications and are extremely useful in discovering latent structures or concepts in the data. Many real-world applications are dynamic in nature and so are their data. To deal with this dynamic nature of data, there exist a variety of online tensor decomposition algorithms. A central assumption in all those algorithms is that the number of latent concepts remains fixed throughout the entire stream. However, this need not be the case. Every incoming batch in the stream may have a different number of latent concepts, and the difference in latent concepts from one tensor batch to another can provide insights into how our findings in a particular application behave and deviate over time. In this paper, we define “concept” and “concept drift” in the context of streaming tensor decomposition, as the manifestation of the variability of latent concepts throughout the stream. Furthermore, we introduce SeekAndDestroy (The method name is after (and a tribute to) Metallica’s song from their first album (who also owns the copyright for the name)), an algorithm that detects concept drift in streaming tensor decomposition and is able to produce results robust to that drift. To the best of our knowledge, this is the first work that investigates concept drift in streaming tensor decomposition. We extensively evaluate SeekAndDestroy on synthetic datasets, which exhibit a wide variety of realistic drift. Our experiments demonstrate the effectiveness of SeekAndDestroy, both in the detection of concept drift and in the alleviation of its effects, producing results with similar quality to decomposing the entire tensor in one shot. Additionally, in real datasets, SeekAndDestroy outperforms other streaming baselines, while discovering novel useful components. Code related to this paper is available at: https://github.com/ravdeep003/conceptDrift .

Ravdeep Pasricha, Ekta Gujral, Evangelos E. Papalexakis
MASAGA: A Linearly-Convergent Stochastic First-Order Method for Optimization on Manifolds

We consider the stochastic optimization of finite sums over a Riemannian manifold where the functions are smooth and convex. We present MASAGA, an extension of the stochastic average gradient variant SAGA on Riemannian manifolds. SAGA is a variance-reduction technique that typically outperforms methods that rely on expensive full-gradient calculations, such as the stochastic variance-reduced gradient method. We show that MASAGA achieves a linear convergence rate with uniform sampling, and we further show that MASAGA achieves a faster convergence rate with non-uniform sampling. Our experiments show that MASAGA is faster than the recent Riemannian stochastic gradient descent algorithm for the classic problem of finding the leading eigenvector corresponding to the maximum eigenvalue. Code related to this paper is available at: https://github.com/IssamLaradji/MASAGA .

Reza Babanezhad, Issam H. Laradji, Alireza Shafaei, Mark Schmidt
Block CUR: Decomposing Matrices Using Groups of Columns

A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. In many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster and in a biometric data analysis setting using real-world user data from content testing.

Urvashi Oswal, Swayambhoo Jain, Kevin S. Xu, Brian Eriksson

Online and Active Learning

Frontmatter
: Online Spectral Learning for Single Topic Models

We study the problem of learning a latent variable model online from a stream of data. Latent variable models are popular because they can explain observed data through unobserved concepts. These models have traditionally been studied in the offline setting. In the online setting, online expectation maximization (EM) is arguably the most popular approach for learning latent variable models. Although online EM is computationally efficient, it typically converges to a local optimum. In this work, we develop a new online learning algorithm for latent variable models, which we call $$\mathtt{SpectralLeader}$$ . $$\mathtt{SpectralLeader}$$ converges to the global optimum, and we derive a sublinear upper bound on its n-step regret in a single topic model. In both synthetic and real-world experiments, we show that $$\mathtt{SpectralLeader}$$ performs similarly to or better than online EM with tuned hyper-parameters.

Tong Yu, Branislav Kveton, Zheng Wen, Hung Bui, Ole J. Mengshoel
Online Learning of Weighted Relational Rules for Complex Event Recognition

Systems for symbolic complex event recognition detect occurrences of events in time using a set of event definitions in the form of logical rules. The Event Calculus is a temporal logic that has been used as a basis in event recognition applications, providing among others, connections to techniques for learning such rules from data. We advance the state-of-the-art by combining an existing online algorithm for learning crisp relational structure with an online method for weight learning in Markov Logic Networks (MLN). The result is an algorithm that learns complex event patterns in the form of Event Calculus theories in the MLN semantics. We evaluate our approach on a challenging real-world application for activity recognition and show that it outperforms both its crisp predecessor and competing online MLN learners in terms of predictive performance, at the price of a small increase in training time. Code related to this paper is available at: https://github.com/nkatzz/OLED .

Nikos Katzouris, Evangelos Michelioudakis, Alexander Artikis, Georgios Paliouras
Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees

Deep Reinforcement Learning (DRL) has achieved impressive success in many applications. A key component of many DRL models is a neural network representing a Q function, to estimate the expected cumulative reward following a state-action pair. The Q function neural network contains a lot of implicit knowledge about the RL problems, but often remains unexamined and uninterpreted. To our knowledge, this work develops the first mimic learning framework for Q functions in DRL. We introduce Linear Model U-trees (LMUTs) to approximate neural network predictions. An LMUT is learned using a novel on-line algorithm that is well-suited for an active play setting, where the mimic learner observes an ongoing interaction between the neural net and the environment. Empirical evaluation shows that an LMUT mimics a Q function substantially better than five baseline methods. The transparent tree structure of an LMUT facilitates understanding the network’s learned strategic knowledge by analyzing feature influence, extracting rules, and highlighting the super-pixels in image inputs. Code related to this paper is available at: https://github.com/Guiliang/uTree_mimic_mountain_car .

Guiliang Liu, Oliver Schulte, Wang Zhu, Qingcan Li
Online Feature Selection by Adaptive Sub-gradient Methods

The overall goal of online feature selection is to iteratively select, from high-dimensional streaming data, a small, “budgeted” number of features for constructing accurate predictors. In this paper, we address the online feature selection problem using novel truncation techniques for two online sub-gradient methods: Adaptive Regularized Dual Averaging (ARDA) and Adaptive Mirror Descent (AMD). The corresponding truncation-based algorithms are called B-ARDA and B-AMD, respectively. The key aspect of our truncation techniques is to take into account the magnitude of feature values in the current predictor, together with their frequency in the history of predictions. A detailed regret analysis for both algorithms is provided. Experiments on six high-dimensional datasets indicate that both B-ARDA and B-AMD outperform two advanced online feature selection algorithms, OFS and SOFS, especially when the number of selected features is small. Compared to sparse online learning algorithms that use $$\ell _1$$ regularization, B-ARDA is superior to $$\ell _1$$ -ARDA, and B-AMD is superior to Ada-Fobos. Code related to this paper is available at: https://github.com/LUCKY-ting/online-feature-selection .

Tingting Zhai, Hao Wang, Frédéric Koriche, Yang Gao
Frame-Based Optimal Design

Optimal experimental design (OED) addresses the problem of selecting an optimal subset of the training data for learning tasks. In this paper, we propose to efficiently compute OED by leveraging the geometry of data: We restrict computations to the set of instances lying on the border of the convex hull of all data points. This set is called the frame. We (i) provide the theoretical basis for our approach and (ii) show how to compute the frame in kernel-induced feature spaces. The latter allows us to sample optimal designs for non-linear hypothesis functions without knowing the explicit feature mapping. We present empirical results showing that the performance of frame-based OED is often on par or better than traditional OED approaches, but its solution can be computed up to twenty times faster.

Sebastian Mair, Yannick Rudolph, Vanessa Closius, Ulf Brefeld
Hierarchical Active Learning with Proportion Feedback on Regions

Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the feedback. A region is defined as a hyper-cubic subspace of the input feature space and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0, 1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To learn a classifier from region-based feedback we develop an active learning framework that hierarchically divides the input space into smaller and smaller regions. In each iteration we split the region with the highest potential to improve the classification models. This iterative process allows us to gradually learn more refined classification models from more specific regions with more accurate proportions. Through experiments on numerous datasets we demonstrate that our approach offers a new and promising active learning direction that can outperform existing active learning approaches especially in situations when labeling budget is limited and small. Code related to this paper is available at: https://github.com/patrick-luo/hierarchical-active-learning.git .

Zhipeng Luo, Milos Hauskrecht

Pattern and Sequence Mining

Frontmatter
An Efficient Algorithm for Computing Entropic Measures of Feature Subsets

Entropic measures such as conditional entropy or mutual information have been used numerous times in pattern mining, for instance to characterize valuable itemsets or approximate functional dependencies. Strangely enough the fundamental problem of designing efficient algorithms to compute entropy of subsets of features (or mutual information of feature subsets relatively to some target feature) has received little attention compared to the analog problem of computing frequency of itemsets. The present article proposes to fill this gap: it introduces a fast and scalable method that computes entropy and mutual information for a large number of feature subsets by adopting the divide and conquer strategy used by FP-growth – one of the most efficient frequent itemset mining algorithm. In order to illustrate its practical interest, the algorithm is then used to solve the recently introduced problem of mining reliable approximate functional dependencies. It finally provides empirical evidences that in the context of non-redundant pattern extraction, the proposed method outperforms existing algorithms for both speed and scalability. Code related to this chapter is available at: https://github.com/P-Fred/HFP-Growth .

Frédéric Pennerath
Anytime Subgroup Discovery in Numerical Domains with Guarantees

Subgroup discovery is the task of discovering patterns that accurately discriminate a class label from the others. Existing approaches can uncover such patterns either through an exhaustive or an approximate exploration of the pattern search space. However, an exhaustive exploration is generally unfeasible whereas approximate approaches do not provide guarantees bounding the error of the best pattern quality nor the exploration progression (“How far are we of an exhaustive search”). We design here an algorithm for mining numerical data with three key properties w.r.t. the state of the art: (i) It yields progressively interval patterns whose quality improves over time; (ii) It can be interrupted anytime and always gives a guarantee bounding the error on the top pattern quality and (iii) It always bounds a distance to the exhaustive exploration. After reporting experimentations showing the effectiveness of our method, we discuss its generalization to other kinds of patterns. Code related to this paper is available at: https://github.com/Adnene93/RefineAndMine .

Aimene Belfodil, Adnene Belfodil, Mehdi Kaytoue
Discovering Spatio-Temporal Latent Influence in Geographical Attention Dynamics

We address the problem of modeling the occurrence process of events for visiting attractive places, called points-of-interest (POIs), in a sightseeing city in the setting of a continuous time-axis and a continuous spatial domain, which is referred to as modeling geographical attention dynamics. By combining a Hawkes process with a time-varying Gaussian mixture model in a novel way and incorporating the influence structure depending on time slots as well, we propose a probabilistic model for discovering the spatio-temporal influence structure among major sightseeing areas from the viewpoint of geographical attention dynamics, and aim to accurately predict POI visit events in the near future. We develop an efficient method of inferring the parameters in the proposed model from the observed sequence of POI visit events, and present an analysis method for the geographical attention dynamics. Using real data of POI visit events in a Japanese sightseeing city, we demonstrate that the proposed model outperforms conventional models in terms of predictive accuracy, and uncover the spatio-temporal influence structure among major sightseeing areas in the city from the perspective of geographical attention dynamics.

Minoru Higuchi, Kanji Matsutani, Masahito Kumano, Masahiro Kimura
Mining Periodic Patterns with a MDL Criterion

The quantity of event logs available is increasing rapidly, be they produced by industrial processes, computing systems, or life tracking, for instance. It is thus important to design effective ways to uncover the information they contain. Because event logs often record repetitive phenomena, mining periodic patterns is especially relevant when considering such data. Indeed, capturing such regularities is instrumental in providing condensed representations of the event sequences.We present an approach for mining periodic patterns from event logs while relying on a Minimum Description Length (MDL) criterion to evaluate candidate patterns. Our goal is to extract a set of patterns that suitably characterises the periodic structure present in the data. We evaluate the interest of our approach on several real-world event log datasets. Code related to this paper is available at: https://github.com/nurblageij/periodic-patterns-mdl .

Esther Galbrun, Peggy Cellier, Nikolaj Tatti, Alexandre Termier, Bruno Crémilleux
Revisiting Conditional Functional Dependency Discovery: Splitting the “C” from the “FD”

Many techniques for cleaning dirty data are based on enforcing some set of integrity constraints. Conditional functional dependencies (CFDs) are a combination of traditional Functional dependencies (FDs) and association rules, and are widely used as a constraint formalism for data cleaning. However, the discovery of such CFDs has received limited attention. In this paper, we regard CFDs as an extension of association rules, and present three general methodologies for (approximate) CFD discovery, each using a different way of combining pattern mining for discovering the conditions (the “C” in CFD) with FD discovery. We discuss how existing algorithms fit into these three methodologies, and introduce new techniques to improve the discovery process. We show that the right choice of methodology improves performance over the traditional CFD discovery method CTane. Code related to this paper is available at: https://github.com/j-r77/cfddiscovery , https://codeocean.com/2018/06/20/discovering-conditional-functional-dependencies/code .

Joeri Rammelaere, Floris Geerts
Sqn2Vec: Learning Sequence Representation via Sequential Patterns with a Gap Constraint

When learning sequence representations, traditional pattern-based methods often suffer from the data sparsity and high-dimensionality problems while recent neural embedding methods often fail on sequential datasets with a small vocabulary. To address these disadvantages, we propose an unsupervised method (named Sqn2Vec) which first leverages sequential patterns (SPs) to increase the vocabulary size and then learns low-dimensional continuous vectors for sequences via a neural embedding model. Moreover, our method enforces a gap constraint among symbols in sequences to obtain meaningful and discriminative SPs. Consequently, Sqn2Vec produces significantly better sequence representations than a comprehensive list of state-of-the-art baselines, particularly on sequential datasets with a relatively small vocabulary. We demonstrate the superior performance of Sqn2Vec in several machine learning tasks including sequence classification, clustering, and visualization.

Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, Dinh Phung
Mining Tree Patterns with Partially Injective Homomorphisms

One of the main differences between inductive logic programming (ILP) and graph mining lies in the pattern matching operator applied: While it is mainly defined by relational homomorphism (i.e., subsumption) in ILP, subgraph isomorphism is the most common pattern matching operator in graph mining. Using the fact that subgraph isomorphisms are injective homomorphisms, we bridge the gap between ILP and graph mining by considering a natural transition from homomorphisms to subgraph isomorphisms that is defined by partially injective homomorphisms, i.e., which require injectivity only for subsets of the vertex pairs in the pattern. Utilizing positive complexity results on deciding homomorphisms from bounded tree-width graphs, we present an algorithm mining frequent trees from arbitrary graphs w.r.t. partially injective homomorphisms. Our experimental results show that the predictive performance of the patterns obtained is comparable to that of ordinary frequent subgraphs. Thus, by preserving much from the advantageous properties of homomorphisms and subgraph isomorphisms, our approach provides a trade-off between efficiency and predictive power.

Till Hendrik Schulz, Tamás Horváth, Pascal Welke, Stefan Wrobel

Probabilistic Models and Statistical Methods

Frontmatter
Variational Bayes for Mixture Models with Censored Data

In this paper, we propose a variational Bayesian algorithm for mixture models that can deal with censored data, which is the data under the situation that the exact value is known only when the value is within a certain range and otherwise only partial information is available. The proposed algorithm can be applied to any mixture model whose component distribution belongs to exponential family; it is a natural generalization of the variational Bayes that deals with “standard” samples whose values are known. We confirm the effectiveness of the proposed algorithm by experiments on synthetic and real world data.

Masahiro Kohjima, Tatsushi Matsubayashi, Hiroyuki Toda
Exploration Enhanced Expected Improvement for Bayesian Optimization

Bayesian optimization (BO) is a sample-efficient method for global optimization of expensive, noisy, black-box functions using probabilistic methods. The performance of a BO method depends on its selection strategy through an acquisition function. This must balance improving our understanding of the function in unknown regions (exploration) with locally improving on known promising samples (exploitation). Expected improvement (EI) is one of the most widely used acquisition functions for BO. Unfortunately, it has a tendency to over-exploit, meaning that it can be slow in finding new peaks. We propose a modification to EI that will allow for increased early exploration while providing similar exploitation once the system has been suitably explored. We also prove that our method has a sub-linear convergence rate and test it on a range of functions to compare its performance against the standard EI and other competing methods. Code related to this paper is available at: https://github.com/jmaberk/BO_with_E3I .

Julian Berk, Vu Nguyen, Sunil Gupta, Santu Rana, Svetha Venkatesh
A Left-to-Right Algorithm for Likelihood Estimation in Gamma-Poisson Factor Analysis

Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R , https://doi.org/10.7910/DVN/GDTAAC .

Joan Capdevila, Jesús Cerquides, Jordi Torres, François Petitjean, Wray Buntine
Causal Inference on Multivariate and Mixed-Type Data

How can we discover whether X causes Y, or vice versa, that Y causes X, when we are only given a sample over their joint distribution? How can we do this such that X and Y can be univariate, multivariate, or of different cardinalities? And, how can we do so regardless of whether X and Y are of the same, or of different data type, be it discrete, numeric, or mixed? These are exactly the questions we answer. We take an information theoretic approach, based on the Minimum Description Length principle, from which it follows that first describing the data over cause and then that of effect given cause is shorter than the reverse direction. Simply put, if Y can be explained more succinctly by a set of classification or regression trees conditioned on X, than in the opposite direction, we conclude that X causes Y. Empirical evaluation on a wide range of data shows that our method, Crack, infers the correct causal direction reliably and with high accuracy on a wide range of settings, outperforming the state of the art by a wide margin. Code related to this paper is available at: http://eda.mmci.uni-saarland.de/crack .

Alexander Marx, Jilles Vreeken

Recommender Systems

Frontmatter
POLAR: Attention-Based CNN for One-Shot Personalized Article Recommendation

In this paper, we propose POLAR, an attention-based CNN combined with one-shot learning for personalized article recommendation. Given a query, POLAR uses an attention-based CNN to estimate the relevance score between the query and related articles. The attention mechanism can help significantly improve the relevance estimation. For example, on AMiner, this can help achieve a +5.0% improvement in terms of NDCG@3. One more challenge in personalized article recommendation is how to collect statistically sufficient training data for a recommendation model. POLAR combines a one-shot learning function into the recommendation model, which further gains significant improvements. For example, on AMiner, with only 1.6 feedbacks on average, POLAR achieves 2.7% improvement by NDCG@3. We evaluate the proposed POLAR on three different datasets: AMiner, Patent, and RARD. Experimental results demonstrate the effectiveness of the proposed model. Recently, we have successfully deployed POLAR into AMiner as the recommendation engine for article recommendation, which further confirms the effectiveness of the proposed model. Data related to this paper is available at: https://doi.org/10.6084/m9.figshare.7297319 .

Zhengxiao Du, Jie Tang, Yuhui Ding
Learning Multi-granularity Dynamic Network Representations for Social Recommendation

With the rapid proliferation of online social networks, personalized social recommendation has become an important means to help people discover useful information over time. However, the cold-start issue and the special properties of social networks, such as rich temporal dynamics, heterogeneous and complex structures with millions of nodes, render the most commonly used recommendation approaches (e.g. Collaborative Filtering) inefficient. In this paper, we propose a novel multi-granularity dynamic network embedding (m-DNE) model for the social recommendation which is capable of recommending relevant users and interested items. In order to support online recommendation, we construct a heterogeneous user-item (HUI) network and incrementally maintain it as the social network evolves. m-DNE jointly captures the temporal semantic effects, social relationships and user behavior sequential patterns in a unified way by embedding the HUI network into a shared low dimensional space. Meanwhile, multi-granularity proximities which include the second-order proximity and the community-aware high-order proximity of nodes, are introduced to learn more informative and robust network representations. Then, with an efficient search method, we use the encoded representation of temporal contexts to generate recommendations. Experimental results on several real large-scale datasets show its advantages over other state-of-the-art methods.

Peng Liu, Lemei Zhang, Jon Atle Gulla
GeoDCF: Deep Collaborative Filtering with Multifaceted Contextual Information in Location-Based Social Networks

In this study we investigate the recommendation problem with multifaceted contextual information to overcome the scarcity of users’ check-in data in Location-based Social Networks. To generate accurate personalized Point-of-Interest (POI) recommendations in the presence of data scarcity, we account for both users’ and POIs’ contextual information such as the social influence of friends, as well as the geographical and sequential transition influence of POIs on user’s check-in behavior. We first propose a multi-view learning strategy to capture the multifaceted contextual information of users and POIs along with users’ check-in data. Then, we feed the learned user and POI latent vectors to a deep neural framework, to capture their non-linear correlations. Finally, we formulate the objective function of our geo-based deep collaborative filtering model (GeoDCF) as a Bayesian personalized ranking problem to focus on the top-k recommendation task and we learn the parameters of our model via backpropagation. Our experiments on real-world datasets confirm that GeoDCF achieves high recommendation accuracy, significantly outperforming other state-of-the-art methods. Furthermore, we confirm the influence of both users’ and POIs’ contextual information on our GeoDCF model. The evaluation datasets are publicly available at: http://snap.stanford.edu/data/loc-gowalla.html , https://sites.google.com/site/yangdingqi/home/foursquare-dataset .

Dimitrios Rafailidis, Fabio Crestani
Personalized Thread Recommendation for MOOC Discussion Forums

Social learning, i.e., students learning from each other through social interactions, has the potential to significantly scale up instruction in online education. In many cases, such as in massive open online courses (MOOCs), social learning is facilitated through discussion forums hosted by course providers. In this paper, we propose a probabilistic model for the process of learners posting on such forums, using point processes. Different from existing works, our method integrates topic modeling of the post text, timescale modeling of the decay in post excitation over time, and learner topic interest modeling into a single model, and infers this information from user data. Our method also varies the excitation levels induced by posts according to the thread structure, to reflect typical notification settings in discussion forums. We experimentally validate the proposed model on three real-world MOOC datasets, with the largest one containing up to 6,000 learners making 40,000 posts in 5,000 threads. Results show that our model excels at thread recommendation, achieving significant improvement over a number of baselines, thus showing promise of being able to direct learners to threads that they are interested in more efficiently. Moreover, we demonstrate analytics that our model parameters can provide, such as the timescales of different topic categories in a course.

Andrew S. Lan, Jonathan C. Spencer, Ziqi Chen, Christopher G. Brinton, Mung Chiang
Inferring Continuous Latent Preference on Transition Intervals for Next Point-of-Interest Recommendation

Temporal information plays an important role in Point-of-Interest (POI) recommendations. Most existing studies model the temporal influence by utilizing the observed check-in time stamps explicitly. With the conjecture that transition intervals between successive check-ins may carry more information for diversified behavior patterns, we propose a probabilistic factor analysis model to incorporate three components, namely, personal preference, distance preference, and transition interval preference. They are modeled by an observed third-rank transition tensor, a distance constraint, and a continuous latent variable, respectively. In our framework, the POI recommendation and the transition interval for user’s very next move can be inferred simultaneously by maximizing the posterior probability of the overall transitions. Expectation Maximization (EM) algorithm is used to tune the model parameters. We demonstrate that the proposed methodology achieves substantial gains in terms of prediction on next move and its expected time over state-of-the-art baselines. Code related to this paper is available at: https://github.com/skyhejing/ECML-PKDD-2018 .

Jing He, Xin Li, Lejian Liao, Mingzhong Wang

Transfer Learning

Frontmatter
Feature Selection for Unsupervised Domain Adaptation Using Optimal Transport

In this paper, we propose a new feature selection method for unsupervised domain adaptation based on the emerging optimal transportation theory. We build upon a recent theoretical analysis of optimal transport in domain adaptation and show that it can directly suggest a feature selection procedure leveraging the shift between the domains. Based on this, we propose a novel algorithm that aims to sort features by their similarity across the source and target domains, where the order is obtained by analyzing the coupling matrix representing the solution of the proposed optimal transportation problem. We evaluate our method on a well-known benchmark data set and illustrate its capability of selecting correlated features leading to better classification performances. Furthermore, we show that the proposed algorithm can be used as a pre-processing step for existing domain adaptation techniques ensuring an important speed-up in terms of the computational time while maintaining comparable results. Finally, we validate our algorithm on clinical imaging databases for computer-aided diagnosis task with promising results. Code related to this paper is available at: https://leogautheron.github.io/ and Data related to this paper is available at: https://github.com/LeoGautheron/ECML2018-FeatureSelectionOptimalTransport

Leo Gautheron, Ievgen Redko, Carole Lartizien
Web-Induced Heterogeneous Transfer Learning with Sample Selection

Transfer learning algorithms utilize knowledge from a data-rich source domain to learn a model in the target domain where labeled data is scarce. This paper presents a novel solution for the challenging and interesting problem of Heterogeneous Transfer Learning (HTL) where the source and target task have heterogeneous feature and label spaces. Contrary to common space based HTL algorithms, the proposed HTL algorithm adapts source data for the target task. The correspondence required for aligning the heterogeneous features of the source and target domain is obtained through labels across two domains that are semantically aligned using web-induced knowledge. The experimental results suggest that the proposed algorithm performs significantly better than state-of-the-art transfer approaches on three diverse real-world transfer tasks.

Sanatan Sukhija, Narayanan C. Krishnan
Towards More Reliable Transfer Learning

Multi-source transfer learning has been proven effective when within-target labeled data is scarce. Previous work focuses primarily on exploiting domain similarities and assumes that source domains are richly or at least comparably labeled. While this strong assumption is never true in practice, this paper relaxes it and addresses challenges related to sources with diverse labeling volume and diverse reliability. The first challenge is combining domain similarity and source reliability by proposing a new transfer learning method that utilizes both source-target similarities and inter-source relationships. The second challenge involves pool-based active learning where the oracle is only available in source domains, resulting in an integrated active transfer learning framework that incorporates distribution matching and uncertainty sampling. Extensive experiments on synthetic and two real-world datasets clearly demonstrate the superiority of our proposed methods over several baselines including state-of-the-art transfer learning methods. Code related to this paper is available at: https://github.com/iedwardwangi/ReliableMSTL .

Zirui Wang, Jaime Carbonell
Differentially Private Hypothesis Transfer Learning

In recent years, the focus of machine learning has been shifting to the paradigm of transfer learning where the data distribution in the target domain differs from that in the source domain. This is a prevalent setting in real-world classification problems and numerous well-established theoretical results in the classical supervised learning paradigm will break down under this setting. In addition, the increasing privacy protection awareness restricts access to source domain samples and poses new challenges for the development of privacy-preserving transfer learning algorithms. In this paper, we propose a novel differentially private multiple-source hypothesis transfer learning method for logistic regression. The target learner operates on differentially private hypotheses and importance weighting information from the sources to construct informative Gaussian priors for its logistic regression model. By leveraging a publicly available auxiliary data set, the importance weighting information can be used to determine the relationship between the source domain and the target domain without leaking source data privacy. Our approach provides a robust performance boost even when high quality labeled samples are extremely scarce in the target data set. The extensive experiments on two real-world data sets confirm the performance improvement of our approach over several baselines. Data related to this paper is available at: http://qwone.com/~jason/20Newsgroups/ and https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html .

Yang Wang, Quanquan Gu, Donald Brown
Information-Theoretic Transfer Learning Framework for Bayesian Optimisation

Transfer learning in Bayesian optimisation is a popular way to alleviate “cold start” issue. However, most of the existing transfer learning algorithms use overall function space similarity, not a more aligned similarity measure for Bayesian optimisation based on the location of the optima. That makes these algorithms fragile to noisy perturbations, and even simple scaling of function values. In this paper, we propose a robust transfer learning based approach that transfer knowledge of the optima using a consistent probabilistic framework. From the finite samples for both source and target, a distribution on the optima is computed and then divergence between these distributions are used to compute similarities. Based on the similarities a mixture distribution is constructed, which is then used to build a new information-theoretic acquisition function in a manner similar to Predictive Entropy Search (PES). The proposed approach also offers desirable “no bias” transfer learning in the limit. Experiments on both synthetic functions and a set of hyperparameter tuning tests clearly demonstrate the effectiveness of our approach compared to the existing transfer learning methods. Code related to this paper is available at: https://github.com/AnilRamachandran/ITTLBO.git and Data related to this paper is available at: https://doi.org/10.7910/DVN/LRNLZV .

Anil Ramachandran, Sunil Gupta, Santu Rana, Svetha Venkatesh
A Unified Framework for Domain Adaptation Using Metric Learning on Manifolds

We present a novel framework for domain adaptation, whereby both geometric and statistical differences between a labeled source domain and unlabeled target domain can be reconciled using a unified mathematical framework that exploits the curved Riemannian geometry of statistical manifolds. We exploit a simple but important observation that as the space of covariance matrices is both a Riemannian space as well as a homogeneous space, the shortest path geodesic between two covariances on the manifold can be computed analytically. Statistics on the SPD matrix manifold, such as the geometric mean of two SPD matries can be reduced to solving the well-known Riccati equation. We show how the Ricatti-based solution can be constrained to not only reduce the statistical differences between the source and target domains, such as aligning second order covariances and minimizing the maximum mean discrepancy, but also the underlying geometry of the source and target domains using diffusions on the underlying source and target manifolds. Our solution also emerges as a consequence of optimal transport theory, which shows that the optimal transport mapping between source and target distributions that are multivariate Gaussians is a function of the geometric mean of the source and target covariances, a quantity that also minimizes the Wasserstein distance. A key strength of our proposed approach is that it enables integrating multiple sources of variation between source and target in a unified way, by reducing the combined objective function to a nested set of Ricatti equations where the solution can be represented by a cascaded series of geometric mean computations. In addition to showing the theoretical optimality of our solution, we present detailed experiments using standard transfer learning testbeds from computer vision comparing our proposed algorithms to past work in domain adaptation, showing improved results over a large variety of previous methods. Code related to this paper is available at: https://github.com/sridharmahadevan/Geodesic-Covariance-Alignment .

Sridhar Mahadevan, Bamdev Mishra, Shalini Ghosh
Backmatter
Metadaten
Titel
Machine Learning and Knowledge Discovery in Databases
herausgegeben von
Michele Berlingerio
Francesco Bonchi
Thomas Gärtner
Neil Hurley
Georgiana Ifrim
Copyright-Jahr
2019
Electronic ISBN
978-3-030-10928-8
Print ISBN
978-3-030-10927-1
DOI
https://doi.org/10.1007/978-3-030-10928-8