Contextual Word Embeddings Clustering Through Multiway Analysis: A Comparative Study

Transformer-based contextual word embedding models are widely used to improve several NLP tasks such as text classification and question answering. Knowledge about these multi-layered models is growing in the literature, with several studies trying to understand what is learned by each of the layers. However, little is known about how to combine the information provided by these different layers in order to make the most of the deep Transformer models. Even less is known about how to best use these models for unsupervised text mining tasks such as clustering. We address both questions in this paper, and propose to study several multiway-based methods for simultaneously leveraging the word representations provided by all the layers. We show that some of them are capable of performing word clustering in an effective and interpretable way. We evaluate their performance across a wide variety of Transformer models, datasets, multiblock techniques and tensor-decomposition methods commonly used to tackle three-way data.

Mira Ait-Saada, Mohamed Nadif

Transferable Deep Metric Learning for Clustering

Clustering in high-dimensional spaces is a difficult task; the usual distance metrics may no longer be appropriate under the curse of dimensionality. Indeed, the choice of the metric is crucial, and it is highly dependent on the dataset characteristics. However, a single metric could be used to correctly perform clustering on multiple datasets of different domains. We propose to do so, providing a framework for learning a transferable metric. We show that we can learn a metric on a labelled dataset, then apply it to cluster a different dataset, using an embedding space that characterises a desired clustering in the generic sense. We learn and test such metrics on several datasets of variable complexity (synthetic, MNIST, SVHN, omniglot) and achieve results competitive with the state-of-the-art while using only a small number of labelled training datasets and shallow networks.

Mohamed Alami Chehboune, Rim Kaddah, Jesse Read

Spatial Graph Convolution Neural Networks for Water Distribution Systems

We investigate the task of missing value estimation in graphs as given by water distribution systems (WDS) based on sparse signals as a representative machine learning challenge in the domain of critical infrastructure. The underlying graphs have a comparably low node degree and high diameter, while information in the graph is globally relevant, hence graph neural networks face the challenge of long-term dependencies. We propose a specific architecture based on message passing which displays excellent results for a number of benchmark tasks in the WDS domain. Further, we investigate a multi-hop variation, which requires considerably fewer resources and opens an avenue towards big WDS graphs.

Inaam Ashraf, Luca Hermes, André Artelt, Barbara Hammer

Data-Centric Perspective on Explainability Versus Performance Trade-Off

The performance versus interpretability trade-off has been well-established in the literature for many years in the context of machine learning models. This paper demonstrates its twin, namely the data-centric performance versus interpretability trade-off. In a case study of bearing fault diagnosis, we found that substituting the original acceleration signal with a demodulated version offers a higher level of interpretability, but it comes at the cost of significantly lower classification performance. We demonstrate these results on two different datasets and across four different machine learning algorithms. Our results suggest that “there is no free lunch,” i.e., the contradictory relationship between interpretability and performance should be considered earlier in the analysis process than it is typically done in the literature today; in other words, already in the preprocessing and feature extraction step.

Amirhossein Berenji, Sławomir Nowaczyk, Zahra Taghiyarrenani

Towards Data Science Design Patterns

We propose data flow diagrams to model data science design patterns and demonstrate, using a number of explanatory patterns, how they can be used to explain and document data science best practices, aid data science education, and enable validation of data science processes.

Michael R. Berthold, Dashiell Brookhart, Schalk Gerber, Satoru Hayasaka, Maarit Widmann

Diverse Paraphrasing with Insertion Models for Few-Shot Intent Detection

In contrast to classic autoregressive generation, insertion-based models can predict multiple tokens at a time in an order-free way, which makes their generation uniquely controllable: it can be constrained to strictly include an ordered list of tokens. We propose to exploit this feature in a new diverse paraphrasing framework: first, we extract important tokens or keywords in the source sentence; second, we augment them; third, we generate new samples around them by using insertion models. We show that the generated paraphrases are competitive with state-of-the-art autoregressive paraphrasers, not only in diversity but also in quality. We further investigate their potential to create new pseudo-labelled samples for data augmentation, using a meta-learning classification framework, and find equally competitive results. In addition to proving non-autoregressive (NAR) viability for paraphrasing, we contribute our open-source framework as a starting point for further research into controllable NAR generation.

Raphaël Chevasson, Charlotte Laclau, Christophe Gravier

LEMON: Alternative Sampling for More Faithful Explanation Through Local Surrogate Models

Local surrogate learning is a popular and successful method for machine learning explanation. It uses synthetic transfer data to approximate a complex reference model. The sampling technique used for this transfer data has a significant impact on the provided explanation, but remains relatively unexplored in literature. In this work, we explore alternative sampling techniques in pursuit of more faithful and robust explanations, and present LEMON: a sampling technique that samples directly from the desired distribution instead of reweighting samples as done in other explanation techniques (e.g., LIME). Next, we evaluate our technique in a synthetic and UCI dataset-based experiment, and show that our sampling technique yields more faithful explanations compared to current state-of-the-art explainers.

Dennis Collaris, Pratik Gajane, Joost Jorritsma, Jarke J. van Wijk, Mykola Pechenizkiy

GASTeN: Generative Adversarial Stress Test Networks

Concerns with the interpretability of ML models are growing as the technology is used in increasingly sensitive domains (e.g., health and public administration). Synthetic data can be used to understand models better, for instance, if the examples are generated close to the frontier between classes. However, data augmentation techniques, such as Generative Adversarial Networks (GAN), have been mostly used to generate training data that leads to better models. We propose a variation of GANs that, given a model, generates realistic data that is classified with low confidence by a given classifier. The generated examples can be used in order to gain insights on the frontier between classes. We empirically evaluate our approach on two well-known image classification benchmark datasets, MNIST and Fashion MNIST. Results show that the approach is able to generate images that are closer to the frontier when compared to the original ones, but still realistic. Manual inspection confirms that some of those images are confusing even for humans.

Luís Cunha, Carlos Soares, André Restivo, Luís F. Teixeira

Learning Permutation-Invariant Embeddings for Description Logic Concepts

Concept learning deals with learning description logic concepts from background knowledge and input examples. The goal is to learn a concept that covers all positive examples, while not covering any negative examples. This non-trivial task is often formulated as a search problem within an infinite quasi-ordered concept space. Although state-of-the-art models have been successfully applied to tackle this problem, their large-scale applications have been severely hindered due to their excessive exploration incurring impractical runtimes. Here, we propose a remedy for this limitation. We reformulate the learning problem as a multi-label classification problem and propose a neural embedding model (NERO) that learns permutation-invariant embeddings for sets of examples tailored towards predicting $$F_1$$ scores of pre-selected description logic concepts. By ranking such concepts in descending order of predicted scores, a possible goal concept can be detected within a few retrieval operations, i.e., no excessive exploration. Importantly, top-ranked concepts can be used to start the search procedure of state-of-the-art symbolic models in multiple advantageous regions of a concept space, rather than starting it in the most general concept $$\top$$. Our experiments on 5 benchmark datasets with 770 learning problems firmly suggest that NERO significantly (p-value $$<1\%$$) outperforms the state-of-the-art models in terms of $$F_1$$ score, the number of explored concepts, and the total runtime. We provide an open-source implementation of our approach (https://github.com/dice-group/Nero).

Caglar Demir, Axel-Cyrille Ngonga Ngomo
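
The ranking step described in the abstract above can be illustrated with a small sketch. It is not the NERO model: NERO predicts $$F_1$$ scores with a neural network, whereas this sketch computes them exactly from hypothetical coverage sets. The concept names, individuals, and the helper `f1_score` are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def f1_score(covered: Set[str], positives: Set[str], negatives: Set[str]) -> float:
    """F1 score of a concept, given the individuals it covers."""
    tp = len(covered & positives)   # positives correctly covered
    fp = len(covered & negatives)   # negatives wrongly covered
    fn = len(positives - covered)   # positives missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical learning problem: three positive and two negative examples.
positives = {"a", "b", "c"}
negatives = {"d", "e"}

# Hypothetical pre-selected concepts and the individuals each one covers.
concepts: Dict[str, Set[str]] = {
    "Top": {"a", "b", "c", "d", "e"},
    "Mother": {"a", "b"},
    "Parent": {"a", "b", "c", "d"},
}

# Rank concepts in descending order of score; the best ones would seed the search.
ranked: List[str] = sorted(
    concepts,
    key=lambda name: f1_score(concepts[name], positives, negatives),
    reverse=True,
)
print(ranked)
```

In this toy problem the most general concept ranks last, which hints at why starting a search from a higher-ranked concept could save exploration.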

Diffusion Transport Alignment

The integration of multimodal data presents a challenge in cases where the study of a given phenomenon by different instruments or conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains of the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited for cases where the data contains domain-specific regions, i.e., there is not a counterpart for a certain portion of the data in the other domain. We propose Diffusion Transport Alignment (DTA), a semi-supervised manifold alignment method that exploits prior knowledge of correspondences between only a few points to align the domains. After building a diffusion process, DTA finds a transportation plan between data measured from two heterogeneous domains with different feature spaces, which, by assumption, share a similar geometrical structure coming from the same underlying data generating process. DTA can also compute a partial alignment in a data-driven fashion, resulting in accurate alignments when some data are measured in only one domain. We empirically demonstrate that DTA outperforms other methods in aligning multiview data in this semi-supervised setting. We also show that the alignment obtained by DTA can improve the performance of machine learning tasks, such as domain adaptation, inter-domain feature mapping, and exploratory data analysis, while outperforming competing methods.

Andrés F. Duque, Guy Wolf, Kevin R. Moon

Mind the Gap: Measuring Generalization Performance Across Multiple Objectives

Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal, which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.

Matthias Feurer, Katharina Eggensperger, Edward Bergman, Florian Pfisterer, Bernd Bischl, Frank Hutter
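
The problem raised in the abstract above, that models selected as Pareto-optimal on the validation set may no longer be Pareto-optimal on the test set, can be reproduced with a few lines. This is only a sketch of the issue, not the evaluation protocol proposed in the paper; the scores and the helper `pareto_indices` are hypothetical.

```python
from typing import List, Tuple

# Each model is scored as (error, inference_time); both objectives are minimized.
Point = Tuple[float, float]


def pareto_indices(points: List[Point]) -> List[int]:
    """Return the indices of points that no other point dominates."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(qk <= pk for qk, pk in zip(q, p))
            and any(qk < pk for qk, pk in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical scores of four candidate models on validation and test data.
valid = [(0.10, 5.0), (0.12, 3.0), (0.20, 1.0), (0.15, 4.0)]
test = [(0.14, 5.0), (0.12, 3.0), (0.21, 1.0), (0.16, 4.0)]

# Models returned by the optimizer: the Pareto front on the validation set.
selected = pareto_indices(valid)

# Re-evaluate only the selected models on the test set.
test_of_selected = [test[i] for i in selected]
surviving = [selected[k] for k in pareto_indices(test_of_selected)]

print("selected on validation:", selected)
print("still Pareto-optimal on test:", surviving)
```

Here one of the selected models becomes dominated once test scores are used, so a quality indicator computed naively on the test points would mix optimal and non-optimal models.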

Effects of Locality and Rule Language on Explanations for Knowledge Graph Embeddings

Knowledge graphs (KGs) are key tools in many AI-related tasks such as reasoning or question answering. This has, in turn, propelled research in link prediction in KGs, the task of predicting missing relationships from the available knowledge. Solutions based on KG embeddings have shown promising results in this matter. On the downside, these approaches are usually unable to explain their predictions. While some works have proposed to compute post-hoc rule explanations for embedding-based link predictors, these efforts have mostly resorted to rules with unbounded atoms, e.g., $$\textit{bornIn}(x,y) \Rightarrow \textit{residence}(x,y)$$, learned on a global scope, i.e., the entire KG. None of these works has considered the impact of rules with bounded atoms such as $$\textit{nationality}(x,\textit{England}) \Rightarrow \textit{speaks}(x, \textit{English})$$, or the impact of learning from regions of the KG, i.e., local scopes. We therefore study the effects of these factors on the quality of rule-based explanations for embedding-based link predictors. Our results suggest that more specific rules and local scopes can improve the accuracy of the explanations. Moreover, these rules can provide further insights about the inner workings of KG embeddings for link prediction.

Luis Galárraga

Shapley Values with Uncertain Value Functions

We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.

Raoul Heese, Sascha Mücke, Matthias Jakobs, Thore Gerlach, Nico Piatkowski
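
As background for the abstract above, the following sketch computes exact Shapley values for a small cooperative game and shows that adding a deterministic shift to the value function shifts the Shapley values linearly. Reading the shift as the mean of a random effect is our illustration based on linearity; it is not necessarily the construction used by the authors. The game, the weights, and the function names are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for players 0..n-1 (exponential cost in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for r in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                # Marginal contribution of player i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


WEIGHTS = [1.0, 2.0, 3.0]  # hypothetical individual contributions


def base_value(s: FrozenSet[int]) -> float:
    """Noiseless toy game: additive weights plus a bonus if 0 and 1 cooperate."""
    bonus = 4.0 if {0, 1} <= s else 0.0
    return sum(WEIGHTS[p] for p in s) + bonus


def shifted_value(s: FrozenSet[int]) -> float:
    """Same game plus a deterministic shift (assumed mean of a random effect)."""
    return base_value(s) + 0.5 * len(s)


phi_base = shapley_values(3, base_value)
phi_shifted = shapley_values(3, shifted_value)
print(phi_base, phi_shifted)
```

The enumeration visits every coalition, so it is only practical for a handful of players; with a noisy value function each evaluation would additionally need repeated sampling, which is consistent with the higher computational effort the abstract mentions.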

Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors

Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.

Edith Heiter, Bo Kang, Ruth Seurinck, Jefrey Lijffijt

On the Change of Decision Boundary and Loss in Learning with Concept Drift

Concept drift, i.e., the change of the data generating distribution, can render machine learning models inaccurate. Many technologies for learning with drift rely on the interleaved test-train error (ITTE) as a quantity to evaluate model performance and trigger drift detection and model updates. Online learning theory mainly focuses on providing generalization bounds for future loss. Usually, these bounds are too loose to be of practical use. Improving them further is not easily possible as they are tight in many cases. In this work, a new theoretical framework focusing on more practical questions is presented: change of training result, optimal models, and ITTE in the presence (and type) of drift. We support our theoretical findings with empirical evidence for several learning algorithms, models, and datasets.

Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, Barbara Hammer
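
The interleaved test-train error (ITTE) mentioned in the abstract above follows a test-then-train loop: each incoming sample is first used to evaluate the current model and only afterwards to update it. A minimal sketch, assuming a squared loss, a hypothetical running-mean predictor, and a synthetic stream with one abrupt drift:

```python
from typing import Iterable, List, Tuple


class RunningMean:
    """Hypothetical online model: predicts the mean of all targets seen so far."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def predict(self, x: float) -> float:
        return self.total / self.count if self.count else 0.0

    def update(self, x: float, y: float) -> None:
        self.total += y
        self.count += 1


def interleaved_test_train_losses(
    stream: Iterable[Tuple[float, float]], model: RunningMean
) -> List[float]:
    """Return the per-sample squared losses of a test-then-train run."""
    losses = []
    for x, y in stream:
        losses.append((model.predict(x) - y) ** 2)  # test on the new sample first
        model.update(x, y)                          # then train on it
    return losses


# Synthetic stream: the target jumps from 0 to 10 at time step 4 (abrupt drift).
stream = [(float(t), 0.0) for t in range(4)] + [(float(t), 10.0) for t in range(4, 8)]

losses = interleaved_test_train_losses(stream, RunningMean())
itte = sum(losses) / len(losses)
print(losses, itte)
```

The per-sample loss spikes at the drift point and then decays as the model adapts, which is the kind of signal drift detectors built on the ITTE react to.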

AID4HAI: Automatic Idea Detection for Healthcare-Associated Infections from Twitter, a Framework Based on Active Learning and Transfer Learning

This research is an interdisciplinary work between data scientists, innovation management researchers and experts from Swedish academia and a hygiene and health company. Based on this collaboration, we have developed a novel package for automatic idea detection with the motivation of controlling and preventing healthcare-associated infections (HAI). The principal idea of this study is to use machine learning methods to extract informative ideas from social media to assist healthcare professionals in reducing the rate of HAI. Therefore, the proposed package offers a corpus of data collected from Twitter, associated expert-created labels, and software implementation of an annotation framework based on the Active Learning paradigm. We employed Transfer Learning and built a two-step deep neural network model that incrementally extracts the semantic representation of the collected text data using the BERTweet language model in the first step and classifies these representations as informative or non-informative using a multi-layer perceptron (MLP) in the second step. The package is called AID4HAI (Automatic Idea Detection for controlling and preventing Healthcare-Associated Infections) and is made fully available (software code and the collected data) through a public GitHub repository (https://github.com/XaraKar/AID4HAI). We believe that sharing our ideas and releasing these ready-to-use tools contributes to the development of the field and inspires future research.

Zahra Kharazian, Mahmoud Rahat, Fábio Gama, Peyman Sheikholharam Mashhadi, Sławomir Nowaczyk, Tony Lindgren, Sindri Magnússon

Explanations for Itemset Mining by Constraint Programming: A Case Study Using ChEMBL Data

In sensitive applications, such as drug development, offering experts an explanation for why data mining operations arrive at certain results adds a very valuable facet. In this work we benefit from modelling the task as a Constraint Satisfaction Problem (CSP) twice: by adding multiple constraints to the mining process and by deriving pattern failure explanations. We illustrate experimentally how to apply our method on data originally retrieved from the ChEMBL database [14]. We also report some interesting dependencies discovered by our method which are not easy to observe when analysing data manually.

Maksim Koptelov, Albrecht Zimmermann, Patrice Boizumault, Ronan Bureau, Jean-Luc Lamotte

Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification

Machine translation systems are used today to break down linguistic barriers. People from different countries and languages can now interact with each other thanks to state-of-the-art translators from prominent software companies like Google and Microsoft. However, these tools are also used to expand the audience for phishing attacks and scam emails, or to generate fake reviews to promote a product on different e-commerce platforms. In all these cases, detecting whether a text has been translated can be crucial information. In this work, we tackle the problem of the detection of translated texts from different angles. On top of addressing the classic task of machine translation detection, we investigate and find common patterns across different machine translation systems unrelated to the original text’s source language. Then, we show that it is possible to identify the machine translation system used to generate a translated text with high performance (F1-score 88.5%) and that it is also possible to identify the source language of the original text. We perform our tasks over two datasets that we use to evaluate our models: Books, a new dataset we built from scratch based on excerpts of novels, and the well-known Europarl dataset, based on proceedings of the European Parliament.

Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, Luca Sabatini, Francesco Sassi

Geolet: An Interpretable Model for Trajectory Classification

The large and diverse availability of mobility data enables the development of predictive models capable of recognizing various types of movements. Through a variety of GPS devices, any moving entity, animal, person, or vehicle can generate spatio-temporal trajectories. This data is used to infer migration patterns, manage traffic in large cities, and monitor the spread and impact of diseases, all critical situations that necessitate a thorough understanding of the underlying problem. Researchers, businesses, and governments use mobility data to make decisions that affect people’s lives in many ways, employing accurate but opaque deep learning models that are difficult to interpret from a human standpoint. To address these limitations, we propose Geolet, a human-interpretable machine-learning model for trajectory classification. We use discriminative sub-trajectories extracted from mobility data to turn trajectories into a simplified representation that can be used as input by any machine learning classifier. We test our approach against state-of-the-art competitors on real-world datasets. Geolet outperforms black-box models in terms of accuracy while being orders of magnitude faster than its interpretable competitors.

Cristiano Landi, Francesco Spinnato, Riccardo Guidotti, Anna Monreale, Mirco Nanni

An Investigation of Structures Responsible for Gender Bias in BERT and DistilBERT

In recent years, large Transformer-based Pre-trained Language Models (PLM) have changed the Natural Language Processing (NLP) landscape, by pushing the performance boundaries of the state-of-the-art on a wide variety of tasks. However, this performance gain goes along with an increase in complexity, and as a result, the size of such models (up to billions of parameters) represents a constraint for their deployment on embedded devices or short-inference time tasks. To cope with this situation, compressed models emerged (e.g. DistilBERT), democratizing their usage in a growing number of applications that impact our daily lives. A crucial issue is the fairness of the predictions made by both PLMs and their distilled counterparts. In this paper, we propose an empirical exploration of this problem by formalizing two questions: (1) Can we identify the neural mechanism(s) responsible for gender bias in BERT (and by extension DistilBERT)? (2) Does distillation tend to accentuate or mitigate gender bias (e.g. is DistilBERT more prone to gender bias than its uncompressed version, BERT)? Our findings are the following: (I) one cannot identify a specific layer that produces bias; (II) every attention head uniformly encodes bias; except in the context of underrepresented classes with a high imbalance of the sensitive attribute; (III) this subset of heads is different as we re-fine tune the network; (IV) bias is more homogeneously produced by the heads in the distilled model.

Thibaud Leteno, Antoine Gourru, Charlotte Laclau, Christophe Gravier

Discovering Diverse Top-K Characteristic Lists

In this work, we define the new problem of finding diverse top-k characteristic lists to provide different statistically robust explanations of the same dataset. This type of problem is often encountered in complex domains, such as medicine, in which a single model cannot consistently explain the already established ground truth, needing a diversity of models. We propose a solution for this new problem based on Subgroup Discovery (SD). Moreover, the diversity is described in terms of coverage and descriptions. The characteristic lists are obtained using an extension of SD, in which a subgroup identifies a set of relations between attributes (description) with respect to an attribute of interest (target). In particular, the generation of these characteristic lists is driven by the Minimum Description Length (MDL) principle, which is based on the idea that the best explanation of the data is the one that achieves the greatest compression. Finally, we also propose an algorithm called GMSL which is simple and easy to interpret and obtains a collection of diverse top-k characteristic lists.

Antonio Lopez-Martinez-Carrasco, Hugo M. Proença, Jose M. Juarez, Matthijs van Leeuwen, Manuel Campos

Online Influence Forest for Streaming Anomaly Detection

As the digital world grows, data is being collected at high speed on a continuous and real-time scale. Hence, the imbalanced and evolving scenario that learning from streaming data introduces remains a challenge. As the research field is still open to consistent strategies that assess continuous and evolving data properties, this paper proposes an unsupervised, online, and incremental anomaly detection ensemble of influence trees that implement adaptive mechanisms to deal with inactive or saturated leaves. This proposal features the fourth standardized moment, also known as kurtosis, as the splitting criterion and the isolation score, Shannon’s information content, and the influence function of an instance as the anomaly score. In addition to improving interpretability, this proposal is also evaluated on publicly available datasets, providing a detailed discussion of the results.

Inês Martins, João S. Resende, João Gama
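
Two of the quantities named in the abstract above have standard definitions that are easy to state in code: kurtosis as the fourth standardized moment, and Shannon's information content of an event with probability p. The sketch below only computes these two quantities; how the paper combines them inside its trees and anomaly score is not shown here, and the sample data are hypothetical.

```python
import math
from typing import Sequence


def kurtosis(xs: Sequence[float]) -> float:
    """Fourth standardized moment: E[(x - mean)^4] / (E[(x - mean)^2])^2."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / (m2 ** 2)


def information_content(p: float) -> float:
    """Shannon information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# Hypothetical feature values: one without and one with an extreme value.
balanced = [-1.0, 1.0, -1.0, 1.0]
with_outlier = [0.0, 0.0, 0.0, 0.0, 10.0]

print(kurtosis(balanced), kurtosis(with_outlier))
print(information_content(0.25))
```

A feature containing an extreme value has a higher kurtosis than a balanced one, and rarer events carry more bits of information, which gives some intuition for why both quantities are natural ingredients for anomaly detection.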

APs: A Proxemic Framework for Social Media Interactions Modeling and Analysis

In this paper, we introduce a novel way to model and analyze social media interactions by leveraging the proxemics theory. Proxemics is the science that studies the effect of space and distance on interactions and behaviors. It is generally applied to the physical space, but we hypothesize that adapting it to social media could provide a generic way to model and analyze the various kinds of interactions taking place in this virtual space. We designed a proxemic-based framework aiming to guide the analysis of data from a social media corpus that can be contextualized to a given application domain. We start by formally redefining proxemics in the context of social media and we leverage this redefinition to design a generic and extensible proxemic-based trajectory model dedicated to social media. We also propose novel proxemic distances applicable to this model. Finally, we apply this proxemic framework to the field of tourism. The application to this use case demonstrates our framework’s flexibility and effectiveness in modeling and analyzing social media interactions.

Maxime Masson, Philippe Roose, Christian Sallaberry, Rodrigo Agerri, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle

User Authentication via Multifaceted Mouse Movements and Outlier Exposure

Gaining information about how users interact with systems is key to behavioural biometrics. Particularly mouse movements of users have been proven beneficial to authentication tasks for being inexpensive and non-intrusive. State-of-the-art approaches consider this problem an instance of supervised classification tasks. In this paper, we argue that the problem is actually closer to unsupervised one-class classification tasks. We thus propose to view behavioural user authentication as an unsupervised task and learn individual models using data from a single user only. We further show that, by being purely unsupervised, losses in performance can be counterbalanced by augmenting additional data into the training processes (outlier exposure). Empirical results show that our approach is very effective and outperforms the state-of-the-art in several performance metrics.

Jennifer J. Matthiesen, Hanne Hastedt, Ulf Brefeld

Explaining Black Box Reinforcement Learning Agents Through Counterfactual Policies

Despite the increased attention to explainable AI, explainability methods for understanding reinforcement learning (RL) agents have not been extensively studied. Failing to understand the agent’s behavior may cause reduced productivity in human-agent collaborations, or mistrust in automated RL systems. RL agents are trained to optimize a long term cumulative reward, and in this work we formulate a novel problem on how to generate explanations on when an agent could have taken another action to optimize an alternative reward. More concretely, we aim at answering the question: What does an RL agent need to do differently to achieve an alternative target outcome? We introduce the concept of a counterfactual policy, as a policy trained to explain in which states a black box agent could have taken an alternative action to achieve another desired outcome. The usefulness of counterfactual policies is demonstrated in two experiments with different use-cases, and the results suggest that our solution can provide interpretable explanations.

Maria Movin, Guilherme Dinis Junior, Jaakko Hollmén, Panagiotis Papapetrou

A GNN-Based Architecture for Group Detection from Spatio-Temporal Trajectory Data

Detecting and analyzing group behavior from spatio-temporal trajectories is an interesting topic in various domains, such as autonomous driving, urban computing, and social sciences. This paper revisits the group detection problem from spatio-temporal trajectories and proposes “WavenetNRI”, a graph neural network (GNN) based method. The proposed WavenetNRI extends the previously proposed neural relational inference (NRI) method (an unsupervised learning approach for inferring interactions from observational data) in two directions: (1) symmetric edge features and edge updating processes are applied to generate symmetric edge representations corresponding to the symmetric binary group relationships; (2) a gated dilated residual causal convolutional (GD-RCC) block is adopted to capture both short- and long-term dependencies of the edge feature sequences. We evaluated the performance of the proposed model on three simulation datasets and three real-world pedestrian datasets, using the Group Mitre metric to measure the quality of the predicted groups. We compared WavenetNRI with four baseline methods, including two clustering-based and two classification-based methods. In these experiments, NRI and WavenetNRI outperformed all other baselines on the group-interaction simulation datasets, while NRI performed slightly better than WavenetNRI. On the pedestrian datasets, WavenetNRI outperformed the other classification-based baselines. However, it did not compete against the clustering-based methods. Our ablation study showed that while both proposed changes cannot be effective at the same time, either of them can improve the performance of the original NRI on one dataset type.

Maedeh Nasri, Zhizhou Fang, Mitra Baratchi, Gwenn Englebienne, Shenghui Wang, Alexander Koutamanis, Carolien Rieffe
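
The core operation behind the GD-RCC block named in the abstract above is a dilated causal convolution: the output at time t depends only on inputs at t, t - d, t - 2d, and so on, where d is the dilation. The sketch below shows that single operation on a one-dimensional sequence; the gating and residual connections of the full block are deliberately omitted, and the function name and inputs are hypothetical.

```python
from typing import List, Sequence


def dilated_causal_conv(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before the start."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: never look at future or pre-sequence positions
                acc += w * x[idx]
        out.append(acc)
    return out


# With dilation 2 and two taps, each output adds the current value
# and the value two steps earlier.
sequence = [1.0, 2.0, 3.0, 4.0, 5.0]
result = dilated_causal_conv(sequence, weights=[1.0, 1.0], dilation=2)
print(result)
```

Stacking such layers with increasing dilation widens the receptive field quickly, which is the usual motivation for using them to cover both short and long time spans.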

Discovering Rule Lists with Preferred Variables

Interpretable machine learning focuses on learning models that are inherently understandable by humans. Even such interpretable models, however, must be trustworthy for domain experts to adopt them. This requires not only accurate predictions, but also reliable explanations that do not contradict a domain expert’s knowledge. When considering rule-based models, for example, rules may include certain variables either due to artefacts in the data, or due to the search heuristics used. When such rules are provided as explanations, this may lead to distrust. We investigate whether human guidance could benefit interpretable machine learning when it comes to learning models that provide both accurate predictions and reliable explanations. The form of knowledge that we consider is that of preferred variables, i.e., variables that the domain expert deems important enough to be given higher priority than the other variables. We study this question for the task of multiclass classification, use probabilistic rule lists as interpretable models, and use the minimum description length (MDL) principle for model selection. We propose S-Classy, an algorithm based on beam search that learns rule lists and takes preferred variables into account. We compare S-Classy to its baseline method, i.e., without using preferred variables, and empirically demonstrate that adding preferred variables does not harm predictive performance, while it does result in the preferred variables being used in rules higher up in the learned rule lists.

Ioanna Papagianni, Matthijs van Leeuwen

Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling

Many text classification tasks face a severe class imbalance problem that limits the ability to train high-performance models. This is partly due to the small number of instances in the minority class, so that the minority class patterns are not well-represented. A common approach in such cases is to resort to data augmentation techniques; however, these have shown mixed results on text data. Our proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa) to obtain overrepresented minority class(es) in the training dataset. We evaluate our approach on three real-world hate speech datasets and compare it to four commonly used approaches: training on the “natural” class distribution, a class weighting approach, and two oversampling approaches: minority oversampling and backtranslation. Our results confirm that the OpSaLa approach yields better models while the labeling budget stays the same.

Andraž Pelicon, Syrielle Montariol, Petra Kralj Novak

The Other Side of Compression: Measuring Bias in Pruned Transformers

Social media platforms have become popular worldwide. Online discussion forums attract users because of their easy access, freedom of speech, and ease of communication. Yet there are also possible negative aspects of such communication, including hostile and hateful language. While fast and effective solutions for detecting inappropriate language online are constantly being developed, there is little research focusing on the bias of compressed language models that are commonly used nowadays. In this work, we evaluate bias in compressed models trained on Gab and Twitter speech data and estimate to what extent these pruned models capture the relevant context when classifying the input text as hateful, offensive or neutral. Results of our experiments show that transformer-based encoders with 70% or fewer preserved weights are prone to gender, racial, and religious identity-based bias, even if the performance loss is insignificant. We suggest a supervised attention mechanism to counter bias amplification using ground truth per-token hate speech annotation. The proposed method allows pruning BERT, RoBERTa and their distilled versions up to 50% while preserving 90% of their initial performance according to bias and plausibility scores.

Irina Proskurina, Guillaume Metzler, Julien Velcin

Dropping Incomplete Records is (not so) Straightforward

A straightforward approach to handling missing values is dropping incomplete records from the dataset. However, for many forms of missingness, this method is known to affect the center and spread of the data distribution. In this paper, we perform an extensive empirical evaluation of the effect of the drop method on the data distribution. In particular, we analyze two scenarios that are likely to occur in practice but are not often considered in simulation studies: 1) when features are skewed rather than symmetrically distributed and 2) when multiple forms of missingness occur simultaneously in one feature. Furthermore, we investigate implications of the drop method for classification accuracy and demonstrate that dropping incomplete records is doubtful, even when test cases are dropped as well.

Rianne M. Schouten, Victoria Taşcău, Gabriel G. Ziegler, Davide Casano, Marco Ardizzone, Michael-Angelos Erotokritou

Meta-learning for Automated Selection of Anomaly Detectors for Semi-supervised Datasets

In anomaly detection, a prominent task is to induce a model, learned solely from normal data, that identifies anomalies. Generally, one is interested in finding an anomaly detector that correctly identifies anomalies, i.e., data points that do not belong to the normal class, without raising too many false alarms. Which anomaly detector is best suited depends on the dataset at hand and thus needs to be tailored. The quality of an anomaly detector may be assessed via confusion-based metrics such as the Matthews correlation coefficient (MCC). However, since only normal data is available during training in a semi-supervised setting, such metrics are not accessible. To facilitate automated machine learning for anomaly detectors, we propose to employ meta-learning to predict MCC scores using the metrics that can be computed with normal data only and to order anomaly detectors using the predicted scores for selection. First promising results can be obtained considering the hypervolume and the false positive rate as meta-features.

David Schubert, Pritha Gupta, Marcel Wever
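
As an illustration of the confusion-based metric named in the abstract above, the following minimal sketch computes the Matthews correlation coefficient from confusion-matrix counts. It uses the standard textbook definition, not code from the paper; the function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions made here for illustration.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Here "positive" means "anomaly". Returning 0.0 when a marginal count
    is zero is a common convention and an assumption of this sketch.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # A detector that flags 8 of 10 anomalies and raises 5 false alarms
    # on 90 normal points.
    print(mcc(tp=8, fp=5, tn=85, fn=2))
```

Computing this score requires labelled anomalies, which is why, as the abstract notes, it is not available when only normal data can be used during training.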

Should We Consider On-Demand Analysis in Scale-Free Networks?

Networks are structures used in many fields for which it is necessary to have analytical systems. Often, the size of networks increases over time so that the connectivity of the nodes follows a power law. This scale-free nature also causes analytical queries to be concentrated on nodes with higher connectivity. Rather than computing the query results for each node in advance, this paper considers an on-demand approach to evaluate its potential gain. To this end, we propose a cost model dedicated to scale-free networks for which we compute the cost for both the offline and on-demand systems. It is reasonable in an on-demand approach to cache part of the results on the fly. We study three policies, both theoretically and on real-world networks: caching nothing, caching everything and minimizing the total cost. Experiments show that the on-demand approach is relevant if some of the results are cached, especially when the query load is low and the query complexity is reasonable.

Arnaud Soulet

ROCKAD: Transferring ROCKET to Whole Time Series Anomaly Detection

The analysis of time series data is of high relevance in fields like manufacturing, health, automotive, or science. In this paper, we propose ROCKAD, a kernel-based approach for semi-supervised whole time series anomaly detection, i.e. the assignment of a single anomaly score to an entire time series. Our key idea is to use ROCKET as an unsupervised feature extractor and to train a single as well as an ensemble of k-nearest neighbors anomaly detectors to deduce an anomaly score. To the best of our knowledge, this is the first approach to transfer the ideas of ROCKET to the task of anomaly detection. We systematically evaluate ROCKAD for univariate time series and show it is statistically significantly better compared to baseline methods. Additionally, we show in a case study that ROCKAD is also applicable to multivariate time series.

Andreas Theissler, Manuel Wengert, Felix Gerschner
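
The pipeline outlined in the abstract above (random convolutional kernels as an unsupervised feature extractor, followed by a k-nearest-neighbours anomaly score) can be sketched roughly as follows. This is a heavily simplified illustration and not the authors' implementation: the kernel sampling is a reduced ROCKET-style scheme without padding, only a single k-NN detector is shown rather than the ensemble, the score is assumed to be the mean distance to the k nearest normal training series, and all function names are hypothetical.

```python
import numpy as np


def random_kernels(n_kernels, series_len, rng):
    """Sample ROCKET-style random kernels as (weights, bias, dilation) tuples.

    Assumes series_len >= 12 so that the longest kernel (length 11) fits.
    """
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice([7, 9, 11]))
        weights = rng.normal(size=length)
        weights -= weights.mean()  # centre the weights
        bias = rng.uniform(-1.0, 1.0)
        # Largest exponent such that the dilated kernel still fits the series.
        max_exp = max(np.log2((series_len - 1) / (length - 1)), 0.0)
        dilation = int(2 ** rng.uniform(0.0, max_exp))
        kernels.append((weights, bias, dilation))
    return kernels


def transform(X, kernels):
    """Map each series (row of X) to two features per kernel: max and PPV."""
    n_series, series_len = X.shape
    feats = np.empty((n_series, 2 * len(kernels)))
    for j, (weights, bias, dilation) in enumerate(kernels):
        out_len = series_len - (len(weights) - 1) * dilation
        # idx[t, i] is the input position hit by kernel tap i at output step t.
        idx = np.arange(out_len)[:, None] + dilation * np.arange(len(weights))[None, :]
        conv = X[:, idx] @ weights + bias  # shape (n_series, out_len)
        feats[:, 2 * j] = conv.max(axis=1)
        feats[:, 2 * j + 1] = (conv > 0).mean(axis=1)  # proportion of positive values
    return feats


def knn_anomaly_scores(train_feats, test_feats, k):
    """Score each test series by its mean distance to the k nearest training series."""
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 4.0 * np.pi, 64)
    # "Normal" training data: noisy sine waves (synthetic, for illustration only).
    normal = np.sin(t) + 0.1 * rng.normal(size=(30, 64))
    anomaly = 3.0 * rng.normal(size=(1, 64))
    test = np.vstack([normal[:1], anomaly])

    kernels = random_kernels(100, 64, rng)
    scores = knn_anomaly_scores(transform(normal, kernels), transform(test, kernels), k=3)
    print(scores)  # the second (anomalous) series should receive the larger score
```

Because only normal series are needed to fit the detector, the sketch matches the semi-supervised setting described in the abstract; feature scaling and the ensemble variant are omitted for brevity.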

Out-of-Distribution Generalisation with Symmetry-Based Disentangled Representations

Learning disentangled representations is suggested to help with generalisation in AI models. This is particularly obvious for combinatorial generalisation, the ability to combine familiar factors to produce new unseen combinations. Disentangling such factors should provide a clear method to generalise to novel combinations, but recent empirical studies suggest that this does not really happen in practice. Disentanglement methods typically assume i.i.d. training and test data, but for combinatorial generalisation we want to generalise towards factor combinations that can be considered out-of-distribution (OOD). There is a misalignment between the distribution of the observed data and the structure that is induced by the underlying factors. A promising direction to address this misalignment is symmetry-based disentanglement, which is defined as disentangling symmetry transformations that induce a group structure underlying the data. Such a structure is independent of the (observed) distribution of the data and thus provides a sensible language to model OOD factor combinations as well. We investigate the combinatorial generalisation capabilities of a symmetry-based disentanglement model (LSBD-VAE) compared to traditional VAE-based disentanglement models. We observe that both types of models struggle with generalisation in more challenging settings, and that symmetry-based disentanglement appears to show no obvious improvement over traditional disentanglement. However, we also observe that even if LSBD-VAE assigns low likelihood to OOD combinations, the encoder may still generalise well by learning a meaningful mapping reflecting the underlying group structure.

Loek Tonnaer, Mike Holenderski, Vlado Menkovski

Forecasting Electricity Prices: An Optimize Then Predict-Based Approach

We are interested in electricity price forecasting at the European scale. The electricity market is ruled by price regulation mechanisms that make it possible to adjust production to demand, as electricity is difficult to store. These mechanisms ensure the highest price for producers, the lowest price for consumers and a zero energy balance by setting day-ahead prices, i.e. prices for the next 24 h. Most studies have focused on learning increasingly sophisticated models to predict the next day’s 24 hourly prices for a given zone. However, the zones are interdependent and this last point has hitherto been largely underestimated. In the following, we show that estimating the energy cross-border transfer by solving an optimization problem and integrating it as input of a model improves the performance of the price forecasting for several zones together.

Léonard Tschora, Erwan Pierre, Marc Plantevit, Céline Robardet

A Similarity-Guided Framework for Error-Driven Discovery of Patient Neighbourhoods in EMA Data

Recent advances in technology and societal changes have increased the amount of patient data that is being collected remotely, outside of hospitals. As technology makes it possible to collect Ecological Momentary Assessments (EMAs) of patient symptoms remotely, personalised predictors have become especially relevant in the field of medicine. However, focusing a predictive model on a single patient’s data comes with sometimes extreme trade-offs on the amount of data available for training. While it is possible to mitigate this loss of data by including data from similar patients, the concept of similarity itself may be poorly defined in cases where patient data are available in two modalities: one that is fixed and relatively static (e.g., age and gender), and one that is more dynamic (instantaneous symptom severity). Including data from users with similar EMA data and disease characteristics has been explored with respect to building personalised predictors of the near future of a patient. We propose a method to build personalised predictors by discovering a neighbourhood for each user that decreases the prediction error of a model over that user’s data. This method is useful not just for building better personalised predictors, but may also serve as a starting point for future investigations into what properties are shared by patients whose EMA data predict each other. We test our method on two EMA datasets, and show that it achieves significantly better RMSE than a single non-personalised global model, and that our framework provides better predictions than the global model for 82%–89% of the users across the two datasets.

Vishnu Unnikrishnan, Miro Schleicher, Clara Puga, Ruediger Pryss, Carsten Vogel, Winfried Schlee, Myra Spiliopoulou

QBERT: Generalist Model for Processing Questions

Using a single model across various tasks is beneficial for training and applying deep neural sequence models. We address the problem of developing generalist representations of text that can be used to perform a range of different tasks rather than being specialised to a single application. We focus on processing short questions and developing an embedding for these questions that is useful on a diverse set of problems, such as question topic classification, equivalent question recognition, and question answering. This paper introduces QBERT, a generalist model for processing questions. With QBERT, we demonstrate how we can train a multi-task network that performs all question-related tasks and achieves performance similar to that of its corresponding single-task models.

Zhaozhen Xu, Nello Cristianini

On Compositionality in Data Embedding

Representing data items as vectors in a space is a common practice in machine learning, where it often goes under the name of “data embedding”. This representation is typically learnt from known relations that exist in the original data, such as co-occurrence of words, or connections in graphs. A property of these embeddings is known as compositionality, whereby the vector representation of an item can be decomposed into different parts, which can be understood separately. This property, first observed in the case of word embeddings, could help with various challenges of modern AI: detection of unwanted bias in the representation, explainability of AI decisions based on these representations, and the possibility of performing analogical reasoning or counterfactual question answering. One important direction of research is to understand the origins, properties and limitations of compositional data embeddings, with the idea of going beyond word embeddings. In this paper, we propose two methods to test for this property, demonstrating their use in the case of sentence embedding and knowledge graph embedding.

Zhaozhen Xu, Zhijin Guo, Nello Cristianini
