
Discovery Science

27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Part I

  • 2025
  • Book

About this Book

The two-volume set LNAI 15243 + 15244 constitutes the proceedings of the 27th International Conference on Discovery Science, DS 2024, which took place in Pisa, Italy, during October 14–16, 2024. The 53 full papers were carefully reviewed and selected from 121 submissions. They are organized in the following topical sections: Part I: LLM, text analytics, and ethical aspects of AI; natural language processing, sequential data, and science discovery; data-driven science discovery methodologies; graph neural networks, graph theory, unsupervised learning, and regression; Part II: tree models and causal discovery; security and anomaly detection; computer vision and explainable AI; classification models; SoBigData++: city for citizens and explainable AI; SoBigData++: societal debates and misinformation analysis.

Table of Contents

  1. Frontmatter

  2. LLM, Text Analytics, and Ethical Aspects of AI

    1. Frontmatter

    2. Exploiting Large Language Models for Enhanced Review Classification Explanations Through Interpretable and Multidimensional Analysis

      Cristian Cosentino, Merve Gündüz-Cüre, Fabrizio Marozzo, Şule Öztürk-Birim
      Abstract
      In today’s digital world, user-generated reviews play a pivotal role across diverse industries, providing invaluable insights into consumer experiences, preferences, and concerns. These reviews heavily influence the strategic decisions of businesses. Advanced machine learning techniques, including Large Language Models (LLMs) like BERT and GPT, have greatly facilitated the analysis of this vast amount of unstructured data, enabling the extraction of actionable insights. However, while achieving high classification accuracy is crucial, the demand for explainability has gained prominence. It is essential to comprehend the reasoning behind classification decisions to effectively utilize user-generated content analytics. This paper presents a methodology that leverages interpretable and multidimensional classification to generate explanations from user reviews. Compared to basic explanations readily available through systems like ChatGPT, our methodology delves deeper into the classification of reviews across various dimensions (such as sentiment, emotion, and topics addressed) to produce more comprehensive explanations for user review classifications. Experimental results demonstrate the precision of our methodology in explaining why a particular review was classified in a specific manner.
    3. Large Language Models-Based Local Explanations of Text Classifiers

      Fabrizio Angiulli, Francesco De Luca, Fabio Fassetti, Simona Nisticó
      Abstract
      The widespread diffusion of text black box classifiers in many areas of human activity poses the need for explainable artificial intelligence techniques specifically tailored for this challenging domain. One of the seminal eXplainable Artificial Intelligence (XAI) techniques is LIME, standing for Local Interpretable Model-agnostic Explanations. In the text classification scenario, LIME maps the input instance sentence and its neighbors into a bag of words, while a linear regressor is used as interpretable model.
      However, this strategy has some notable drawbacks. Since neighboring sentences can be obtained only as subsets of the input one, they may fail to properly describe the decision boundary in the locality of the input sentence, besides being potentially not meaningful. Moreover, the returned explanation solely consists of either confirming the importance of the presence of a specific term or declaring the removal of a specific term relevant.
      In this work, we try to overcome the above limitations by proposing \(\text{LLiMe}\), an extension of the basic LIME approach that exploits recent advances in Large Language Models (LLMs) to perform a classifier-driven generation of the neighborhood of the input instance. In our approach, neighbors can employ a vocabulary larger than that imposed by the sentence under consideration. Moreover, we provide a neighborhood generation procedure that better captures the decision boundary in the locality of the sentence, and an explanation generation procedure that returns the most relevant set of term-operation pairs, each consisting of a specific term and an edit operation to apply to it that most influences the decision of the black-box predictor. In this respect, our approach provides the user with a richer and easier-to-interpret explanation than standard LIME.
      Experiments conducted on real datasets demonstrate the effectiveness of our technique in providing suitable, relevant, and interpretable explanations.
    4. Evaluating the Reliability of Self-explanations in Large Language Models

      Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren
      Abstract
      This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations – extractive and counterfactual – using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model’s decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
    5. Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

      Riccardo Cantini, Giada Cosenza, Alessio Orsino, Domenico Talia
      Abstract
      Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.
    6. Play it Straight: An Intelligent Data Pruning Technique for Green-AI

      • Open Access
      Francesco Scala, Sergio Flesca, Luigi Pontieri
      Abstract
      The escalating climate crisis demands urgent action to mitigate the environmental impact of energy-intensive technologies, including Artificial Intelligence (AI). Lowering AI’s environmental impact requires adopting energy-efficient approaches for training Deep Neural Networks (DNNs). One such approach is to use Dataset Pruning (DP) methods to reduce the number of training instances, and thus the total energy consumed. Numerous DP methods have been proposed in the literature (e.g., GraNd and Craig), with the ultimate aim of speeding up model training. On the other hand, Active Learning (AL) approaches, originally conceived to repeatedly select the best data to be labeled by a human expert (from a large collection of unlabeled data), can be exploited as well to train a model on a relatively small subset of (informative) examples. However, despite allowing for reducing the total amount of training data, most DP methods and pure AL-based schemes entail costly computations that may strongly limit their energy saving potential. In this work, we empirically study the effectiveness of DP and AL methods in curbing energy consumption in DNN training, and propose a novel approach to DNN learning, named Play it straight, which efficiently combines data selection methods and AL-like incremental training. Play it straight is shown to outperform traditional DP and AL approaches, achieving a better trade-off between accuracy and energy efficiency.
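The score-based dataset pruning idea described above can be sketched in a few lines. This is our own illustration, not code from the paper: `scores` stands in for any per-example informativeness score (for instance, a gradient-norm score in the spirit of GraNd), and the function name and keep fraction are hypothetical.

```python
import numpy as np

def prune_dataset(scores, keep_fraction=0.5):
    """Return indices of the highest-scoring training examples.

    `scores` is any per-example informativeness score; only the
    top `keep_fraction` of examples is kept for training.
    """
    scores = np.asarray(scores, dtype=float)
    k = int(len(scores) * keep_fraction)
    # Sort descending and keep the k most informative examples.
    return np.argsort(scores)[::-1][:k]
```

Training on only the retained indices is what reduces the total energy consumed, at the cost of computing the scores themselves.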
    7. Evaluation of Geographical Distortions in Language Models

      Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin
      Abstract
      Language models now constitute essential tools for improving efficiency in many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions by comparing geographical and semantic distances. Experiments are conducted with these four indicators on eight widely used language models, and their implementations are available on GitHub (https://github.com/tetis-nlp/geographical-biases-in-llms). Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
    8. AutoML-Guided Fusion of Entity and LLM-Based Representations for Document Classification

      Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj
      Abstract
      Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (i.e. embeddings) do not efficiently exploit these resources. Dense and robust representations of documents are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification. Further, by considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated using five strong LLM baselines on six diverse real-life datasets. The code is freely available at https://github.com/bkolosk1/bablfusion.git.
    9. Open-Set Named Entity Recognition: A Preliminary Study

      Angelo Impedovo, Giuseppe Rizzo, Antonio Di Mauro
      Abstract
      In Natural Language Processing, Named Entity Recognition (NER) is a critical task that aims to identify entities of interest in a given text. NER is typically solved by discerning entity tokens from non-entity ones via multi-class classifiers. However, training such models may be challenging due to the prevalence of non-entity tokens. To address this issue, in this paper, we investigated the effectiveness of an open-set recognizer, a machine learning model that, generalizing a multi-class classifier, recognizes only entity tokens and rejects non-entity ones. This paper demonstrates that open-set recognizers are an effective approach to address the token recognition problem. Indeed, we compared a traditional token recognizer based on Conditional Random Field with a state-of-the-art instance-based open-set recognizer, and our evaluation shows that the open-set recognizer outperforms the traditional token recognizer.
  3. Natural Language Processing, Sequential Data and Science Discovery

    1. Frontmatter

    2. Forecasting with Deep Learning: Beyond Average of Average of Average Performance

      Vitor Cerqueira, Luis Roque, Carlos Soares
      Abstract
      Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models, particularly about conditions in which this relative performance differs from the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step ahead forecasting versus multi-step ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance in benchmark datasets. We conducted extensive experiments that show NHITS generally performs best, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches for multi-step ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of evaluating forecasts from multiple dimensions.
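For reference, the single-score metric mentioned above, SMAPE, can be computed as follows. This is a standard textbook formulation, not code from the paper:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error:
    the mean of 2*|y - yhat| / (|y| + |yhat|) over all points.
    (Undefined when both y and yhat are zero at the same point.)"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(2.0 * np.abs(y_true - y_pred)
                         / (np.abs(y_true) + np.abs(y_pred))))
```

Averaging this score over all series and horizons is exactly the aggregation the paper argues can hide condition-specific differences between models.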
    3. Multivariate Asynchronous Shapelets for Imbalanced Car Crash Predictions

      Mario Bianchi, Francesco Spinnato, Riccardo Guidotti, Daniele Maccagnola, Antonio Bencini Farina
      Abstract
      Real-time vehicle safety and performance monitoring through crash data recorders is transforming mobility-related businesses. In this work, we collaborate with Generali Italia to improve their in-development automatic decision-making system, designed to assist operators in handling customer car crashes. Currently, Generali uses a deep learning model that can accurately alert operators of potential crashes, but its black-box nature can hinder the operator’s trustworthiness in the model. Given these limitations, we propose MARS, an interpretable shapelet-based classifier using novel multivariate asynchronous shapelets. We show that MARS can handle Generali’s highly irregular and imbalanced time series dataset, outperforming state-of-the-art classifiers and anomaly detection algorithms, including Generali’s black-box system. Further, we validate MARS on multivariate datasets from the UEA repository, demonstrating its competitiveness with existing techniques and providing examples of the explanations MARS can produce.
    4. Soft Hoeffding Tree: A Transparent and Differentiable Model on Data Streams

      Kirsten Köbschall, Lisa Hartung, Stefan Kramer
      Abstract
      We propose soft Hoeffding trees (SoHoT) as a new differentiable and transparent model for possibly infinite and changing data streams. Stream mining algorithms such as Hoeffding trees grow based on the incoming data stream, but they currently lack the adaptability of end-to-end deep learning systems. End-to-end learning can be desirable if a feature representation is learned by a neural network and used in a tree, or if the outputs of trees are further processed in a deep learning model or workflow. Different from Hoeffding trees, soft trees can be integrated into such systems due to their differentiability, but are neither transparent nor explainable. Our novel model combines the extensibility and transparency of Hoeffding trees with the differentiability of soft trees. We introduce a new gating function to regulate the balance between univariate and multivariate splits in the tree. Experiments are performed on 20 data streams, comparing SoHoT to standard Hoeffding trees, Hoeffding trees with limited complexity, and soft trees applying a sparse activation function for sample routing. The results show that soft Hoeffding trees outperform Hoeffding trees in estimating class probabilities and, at the same time, maintain transparency compared to soft trees, with relatively small losses in terms of AUROC and cross-entropy. We also demonstrate how to trade off transparency against performance using a hyperparameter, obtaining univariate splits at one end of the spectrum and multivariate splits at the other.
    5. Meta-learning Loss Functions of Parametric Partial Differential Equations Using Physics-Informed Neural Networks

      Michail Koumpanakis, Ricardo Vilalta
      Abstract
      This paper proposes a new way to learn Physics-Informed Neural Network loss functions using Generalized Additive Models. We apply our method by meta-learning parametric partial differential equations (PDEs) on the Burgers' and 2D heat equations. The goal is to learn a new loss function for each parametric PDE using meta-learning. The derived loss function replaces the traditional data loss, allowing us to learn each parametric PDE more efficiently, improving the meta-learner’s performance and convergence.
    6. VADA: A Data-Driven Simulator for Nanopore Sequencing

      Jonas Niederle, Simon Koop, Marc Pagès-Gallego, Vlado Menkovski
      Abstract
      Nanopore sequencing offers the ability for real-time analysis of long DNA sequences at a low cost, enabling new applications such as early detection of cancer. Due to the complex nature of nanopore measurements and the high cost of obtaining ground truth datasets, there is a need for nanopore simulators. Existing simulators rely on handcrafted rules and parameters and do not learn an internal representation that would allow for analyzing underlying biological factors of interest. Instead, we propose VADA, a purely data-driven method for simulating nanopores based on an autoregressive latent variable model. We embed subsequences of DNA and introduce a conditional prior to address the challenge of a collapsing conditioning. We experiment with an auxiliary regressor on the latent variable to encourage our model to learn an informative latent representation. We empirically demonstrate that our model achieves competitive simulation performance on experimental nanopore data. Moreover, we show our model learns an informative latent representation that is predictive of the DNA labels. We hypothesize that other biological factors of interest, beyond the DNA labels, can potentially be extracted from such a learned latent representation.
  4. Data-Driven Science Discovery Methodologies

    1. Frontmatter

    2. Differential Equation Discovery of Robotic Swarm as Active Matter

      Roman Titov, Alexander Hvatov
      Abstract
      Numerous modeling approaches treat active matter through various mathematical analogs. However, most of these approaches do not adequately address the physical interactions between the particles that constitute active matter. In this paper, we propose several models of robot swarm interactions that can be derived using differential equation discovery: a simple model of individual robot motion, a model of single robot motion with interaction forces as external inputs, and a model of the displacement field as a continuous active matter analog. These models can enhance our understanding of the underlying physics of robot swarm interactions today and contribute to future studies of active matter.
    3. Science-Gym: A Simple Testbed for AI-Driven Scientific Discovery

      Mattia Cerrato, Nicholas Schmitt, Lennart Baur, Edward Finkelstein, Selina Jukic, Lars Münzel, Felix Peter Paul, Pascal Pfannes, Benedikt Rohr, Julius Schellenberg, Philipp Wolf, Stefan Kramer
      Abstract
      Automating scientific discovery has been one of the motivating tasks in the development of AI methods. The task of Equation Discovery (also called Symbolic Regression) is to learn a free-form symbolic equation from experimental data. Equation Discovery benchmarks, however, assume the experimental data as given. Recent successes in protein folding and material optimization, powered by advancements in, amongst others, reinforcement learning and deep learning, have renewed the broader community’s interest in applications of AI in science. Nonetheless, these successful applications do not necessarily lead to an improved understanding of the underlying phenomena, just as super-human chess engines do not necessarily lead to an improved understanding of chess theory and practice. In this paper, we propose Science-Gym: a new testbed for basic physics understanding. To the best of our knowledge, Science-Gym is the first scientific discovery benchmark that requires agents to autonomously perform data collection, experimental design, and discover the underlying equations of phenomena. Science-Gym is a Python software library with Gym-compatible bindings. It offers six scientific simulations that reproduce basic physics and epidemiology principles: the law of the lever, projectile motion, the inclined plane, Lagrangian points in space, brachistochrones, and the SIRV model. In these environments, agents may be evaluated not only on their ability in, e.g., balancing objects on the two beams of a lever, but more importantly on finding equations that describe the overall behavior of the dynamical system at hand.
    4. Latent Embedding Based on a Transcription-Decay Decomposition of mRNA Dynamics Using Self-supervised CoxPH

      Martin Špendl, Tomaž Curk, Blaž Zupan
      Abstract
      The discovery of patterns of molecular signatures in genomic profiles is one of the most essential data-driven research approaches in cancer biology. Gene expression data, measured by estimates of mRNA levels, contain tens of thousands of features, making dimensionality reduction a critical step in data analysis. This data-driven selection of genes and markers can be greatly improved by incorporating domain knowledge. For example, autoencoder-based approaches that incorporate the gene set hierarchy into the neural network architecture design can improve both accuracy and interpretability. Alternatively, domain knowledge could be incorporated into the loss functions used to train models. To this end, we propose a novel, biologically inspired loss function for autoencoders based on the first-order dynamics of mRNA expression. By decomposing the steady state of expression into transcription and mRNA decay rates, we model mRNA lifetime as a survival problem. Our approach borrows from Cox proportional hazard partial likelihood to model transcription rates and the risk of decay of individual genes. We show that the resulting autoencoders can improve the clustering of cancer patients and cell lines and drug response prediction.
    5. Social Isolation, Digital Connection: COVID-19’s Impact on Twitter Ego Networks

      Kamer Cekini, Elisabetta Biondi, Chiara Boldrini, Andrea Passarella, Marco Conti
      Abstract
      One of the most impactful measures to fight the COVID-19 pandemic in its early years was the lockdown, implemented by governments to reduce physical contact among people and minimize opportunities for the virus to spread. As people were compelled to limit their physical interactions and stay at home, they turned to online social platforms to alleviate feelings of loneliness. Ego networks represent how people organize their relationships under the human cognitive constraints that limit the number of meaningful interactions a person can maintain. Physical contacts were disrupted during the lockdown, causing socialization to shift almost entirely to online platforms. Our research aimed to investigate the impact of lockdown measures on online ego network structures, potentially caused by the increased cognitive effort invested in online social networks. In particular, we examined a large Twitter dataset of users, covering 7 years of their activities. We found that during the lockdown there was an increase in network sizes and a richer structure in social circles, with relationships becoming more intimate. Moreover, we observe that, after the lockdown measures were relaxed, these features returned to their pre-lockdown values.
    6. SwitchPath: Enhancing Exploration in Neural Networks Learning Dynamics

      Antonio Di Cecco, Andrea Papini, Carlo Metta, Marco Fantozzi, Silvia Giulia Galfré, Francesco Morandin, Maurizio Parton
      Abstract
      We introduce SwitchPath, a novel stochastic activation function that enhances neural network exploration, performance, and generalization by probabilistically toggling between the activation of a neuron and its negation. SwitchPath draws inspiration from the analogies between neural networks and decision trees, as well as from the exploratory and regularizing properties of Dropout. Unlike Dropout, which intermittently reduces network capacity by deactivating neurons, SwitchPath maintains continuous activation, allowing networks to dynamically explore alternative information pathways while fully utilizing their capacity. Building on the concept of \(\epsilon \)-greedy algorithms to balance exploration and exploitation, SwitchPath enhances generalization capabilities over traditional activation functions. The exploration of alternative paths happens during training without sacrificing computational efficiency. This paper presents the theoretical motivations, practical implementations, and empirical results, showcasing all the described advantages of SwitchPath over established stochastic activation mechanisms.
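The toggling mechanism described in the abstract can be illustrated with a minimal sketch. This is our own reading of the idea, not the authors' implementation: with probability `p` a unit emits the negation of its activation, and otherwise the activation itself, so the unit is never silenced the way Dropout silences it.

```python
import numpy as np

def switchpath_relu(x, p=0.1, rng=None):
    """Illustrative sketch: with probability p each unit outputs
    the negation of its ReLU activation; otherwise the activation
    itself. Unlike Dropout, no unit is ever set to an inactive path
    with zero capacity."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.maximum(np.asarray(x, dtype=float), 0.0)
    flip = rng.random(a.shape) < p
    return np.where(flip, -a, a)
```

At `p = 0` this reduces to plain ReLU, while larger `p` injects more exploration of negated pathways during training.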
  5. Graph Neural Network, Graph Theory, Unsupervised Learning and Regression

    1. Frontmatter

    2. Analyzing Explanations of Deep Graph Networks Through Node Centrality and Connectivity

      Michele Fontanesi, Alessio Micheli, Marco Podda, Domenico Tortorella
      Abstract
      Explanations at the node level produced for Deep Graph Networks (DGNs), i.e., neural networks for graph learning, are commonly used to investigate the relationships between the input graphs and their associated predictions. However, they can also provide relevant information concerning the underlying architecture trained to solve the inductive task. In this work, we analyze explanations generated for convolutional and recursive DGN architectures through the notion of node centrality and graph connectivity as means to gain novel insights on the inductive biases distinguishing these architectural classes of neural networks. We adopt Explainable AI (XAI) to perform model inspection and we compare the retrieved explanations with node centrality and graph connectivity to identify the class assignment policy learned by each model to solve multiple XAI graph classification tasks. Our experimental results indicate that the inductive bias of convolutional DGNs tends towards recognizing high-order graph structures, while the inductive bias of recursive and contractive DGNs tends towards recognizing low-order graph structures.
    3. Interpretable Graph Neural Networks for Heterogeneous Tabular Data

      Amr Alkhatib, Henrik Boström
      Abstract
      Many machine learning algorithms for tabular data produce black-box models, which prevent users from understanding the rationale behind the model predictions. In their unconstrained form, graph neural networks fall into this category, and they have further limited abilities to handle heterogeneous data. To overcome these limitations, an approach is proposed, called IGNH (Interpretable Graph Neural Network for Heterogeneous tabular data), which handles both categorical and numerical features, while constraining the learning process to generate exact feature attributions together with the predictions. A large-scale empirical investigation is presented, showing that the feature attributions provided by IGNH align with Shapley values that are computed post hoc. Furthermore, the results show that IGNH outperforms two powerful machine learning algorithms for tabular data, Random Forests and TabNet, while competing favourably with XGBoost.
    4. A Systematization of the Wagner Framework: Graph Theory Conjectures and Reinforcement Learning

      Flora Angileri, Giulia Lombardi, Andrea Fois, Renato Faraone, Carlo Metta, Michele Salvi, Luigi Amedeo Bianchi, Marco Fantozzi, Silvia Giulia Galfrè, Daniele Pavesi, Maurizio Parton, Francesco Morandin
      Abstract
      In 2021, Adam Zsolt Wagner proposed an approach to disprove conjectures in graph theory using Reinforcement Learning (RL). Wagner frames a conjecture as f(G) < 0 for every graph G, for a certain invariant f; one can then play a single-player graph-building game, where at each turn the player decides whether to add an edge or not. The game ends when all edges have been considered, resulting in a certain graph \(G_T\), and \(f(G_T)\) is the final score of the game; RL is then used to maximize this score. This brilliant idea is as simple as it is innovative, and it lends itself to systematic generalization. Several different single-player graph-building games can be employed, along with various RL algorithms. Moreover, RL maximizes the cumulative reward, allowing for step-by-step rewards instead of a single final score, provided the final cumulative reward represents the quantity of interest \(f(G_T)\). In this paper, we discuss these and various other choices that can be significant in Wagner’s framework. As a contribution to this systematization, we present four distinct single-player graph-building games. Each game employs both a step-by-step reward system and a single final score. We also propose a principled approach to select the most suitable neural network architecture for any given conjecture and introduce a new dataset of graphs labeled with their Laplacian spectra. The games have been implemented as environments in the Gymnasium framework, and along with the dataset and a simple interface to play with the environments, are available at https://github.com/CuriosAI/graph_conjectures.
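The single-player graph-building game at the heart of Wagner's framework can be sketched as follows. The policy and the invariant `f` here are toy placeholders (`f` simply counts edges), not any of the four games presented in the paper:

```python
import itertools
import numpy as np

def play_graph_game(n, policy, f):
    """One episode of the game: for each of the n*(n-1)/2 candidate
    edges of an n-vertex graph, the policy decides whether to add it.
    The final score of the episode is f(G_T), the invariant evaluated
    on the resulting graph."""
    adj = np.zeros((n, n), dtype=int)
    for i, j in itertools.combinations(range(n), 2):
        if policy(adj, (i, j)):
            adj[i, j] = adj[j, i] = 1
    return adj, f(adj)

# Toy example: always add the edge; the score counts edges.
always_add = lambda adj, edge: True
num_edges = lambda adj: int(adj.sum() // 2)
```

An RL agent would replace `always_add` with a learned policy trained to maximize the score, so that a positive score on some \(G_T\) refutes the conjecture \(f(G) < 0\).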
    5. Utility vs Usability: Towards a Search for Balance in Subgroup Discovery Problems

      Reynald Eugenie, Erick Stattner
      Abstract
      In recent years, significant strides have been made in the field of subgroup discovery through methods that extract subgroups faster and with high utility levels. However, while the most effective approaches extract subgroups with more complex descriptions that improve utility by maximizing the quality criterion, the usability of the extracted patterns is central to their understanding and application in the field. In this paper, we focus on the SD-CEDI approach, known for identifying the most relevant subgroups, but whose subgroup descriptions are based on discontinuous attribute intervals. We propose and study various strategies designed to add usability to the extracted patterns, and we thus highlight the dilemma between utility and usability, which can be seen as a balance to be struck between adding value and degrading quality.
    6. Revisiting Silhouette Aggregation

      John Pavlopoulos, Georgios Vardakas, Aristidis Likas
      Abstract
      The Silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its clustering assignment. To assess the quality of the clustering of the whole dataset, the scores of all the points in the dataset are typically (micro) averaged into a single value. An alternative path, however, that is rarely employed, is to average first at the cluster level and then (macro) average across clusters. As we illustrate in this work with a synthetic example, the typical micro-averaging strategy is sensitive to cluster imbalance while the overlooked macro-averaging strategy is far more robust. By investigating macro-Silhouette further, we find that uniform sub-sampling, the only available strategy in existing libraries, harms the measure’s robustness against imbalance. We address this issue by proposing a per-cluster sampling method. An empirical analysis on eight real-world datasets in two clustering tasks reveals the disagreement between the two coefficients for imbalanced datasets.
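      The micro- versus macro-averaging distinction in this abstract is easy to see on a small imbalanced example. The sketch below is an editorial illustration (not the authors' code): it computes per-point silhouettes by hand for 1-D data, then averages at the point level (micro) and at the cluster level (macro).

```python
import numpy as np

def silhouette_values(X, labels):
    # Per-point silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
    # a(i) is the mean distance to the point's own cluster and b(i) the
    # smallest mean distance to any other cluster.
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.abs(X[:, None] - X[None, :])  # pairwise distances (1-D data)
    s = np.zeros(len(X))
    for i, c in enumerate(labels):
        own = labels == c
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        b = min(D[i, labels == k].mean() for k in set(labels) if k != c)
        s[i] = (b - a) / max(a, b)
    return s

# Imbalanced toy clustering: one large tight cluster, one small loose one.
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 10.0, 12.0]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
s = silhouette_values(X, labels)
micro = s.mean()  # point-level average, dominated by the large cluster
macro = np.mean([s[np.array(labels) == k].mean() for k in set(labels)])
```

      The micro score here is pulled toward the large, well-separated cluster, while the macro score gives the small, looser cluster equal weight, which is the robustness property the paper investigates.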
    7. Combining SHAP-Driven Co-clustering and Shallow Decision Trees to Explain XGBoost

      Ruggero G. Pensa, Anton Crombach, Sergio Peignier, Christophe Rigotti
      Abstract
      Transparency is a non-functional requirement of machine learning that promotes interpretable or easily explainable outcomes. Unfortunately, interpretable classification models, such as linear, rule-based, and decision tree models, are superseded by more accurate but complex learning paradigms, such as deep neural networks and ensemble methods. For tabular data classification, more specifically, models based on gradient-boosted tree ensembles, such as XGBoost, are still competitive compared to deep learning ones, so they are often preferred to the latter. However, they share the same interpretability issues, due to the complexity of the learnt model and, consequently, of the predictions. While the problem of computing local explanations is largely addressed, the problem of extracting global explanations is scarcely investigated. Existing solutions consist of computing some feature importance score, or extracting approximate surrogate trees from the learnt forest, or even using a black-box explainability method. However, those methods either have poor fidelity or their comprehensibility is questionable. In this paper, we propose to fill this gap by leveraging the strong theoretical basis of the SHAP framework in the context of co-clustering and feature selection. As a result, we are able to extract shallow decision trees that explain XGBoost with competitive fidelity and higher comprehensibility compared to two recent state-of-the-art competitors.
    8. Fast and Understandable Nonlinear Supervised Dimensionality Reduction

      Anri Patron, Rafael Savvides, Lauri Franzon, Hoang Phuc Hau Luu, Kai Puolamäki
      Abstract
      In supervised machine learning, feature creation and dimensionality reduction are essential tasks. Carefully chosen features allow simpler model structures, such as linear models, while decreasing the number of features is often used to reduce overfitting. Classical unsupervised dimensionality reduction methods such as principal component analysis may find features irrelevant to the machine learning task. Supervised dimensionality reduction methods, such as canonical correlation analysis, can construct linear projections of the original features informed by the prediction targets. Still, typically, the dimensionality of these projections is restricted to that of the target variables. On the other hand, deep learning-based approaches (either supervised or unsupervised) can construct high-performing features that are not understandable and often slow to train. We propose a novel supervised dimensionality reduction method, called Gradient Boosting Mapping (gbmap), a fast alternative to linear methods in which we make a minimal alteration (nonlinear transformation) to the linear projections designed to retain understandability. gbmap is fast to compute, provides high-quality, understandable features, and automatically ignores directions in the original data features irrelevant to the prediction task. gbmap is a good alternative to “too simple” linear methods and “too complex” black box methods.
    9. MORE–PLR: Multi-Output Regression Employed for Partial Label Ranking

      Santo M. A. R. Thies, Juan C. Alfaro, Viktor Bengs
      Abstract
      The partial label ranking (PLR) problem is a supervised learning scenario where the learner predicts a ranking with ties of the labels for a given input instance. It generalizes the well-known label ranking (LR) problem, which only allows for strict rankings. So far, previous learning approaches for PLR have primarily adapted LR methods to accommodate ties in predictions. This paper proposes using multi-output regression (MOR) to address the PLR problem by treating ranking positions as multivariate targets, an approach that has received little attention in both LR and PLR. To effectively employ this approach, we introduce several post-hoc layers that convert MOR results into a ranking, potentially including ties. This framework produces a range of learning approaches, which we demonstrate in experimental evaluations to be competitive with the current state-of-the-art PLR methods.
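      The post-hoc conversion step described in this abstract, from real-valued multi-output regression scores to a ranking with ties, can be illustrated with a minimal sketch. This is a hypothetical layer invented for illustration, not the paper's method: it sorts labels by predicted score and merges scores closer than a tolerance `tol` into the same rank.

```python
def scores_to_ranking(scores, tol=0.25):
    # Hypothetical post-hoc layer: turn per-label regression scores into a
    # ranking with ties by sorting labels and grouping consecutive scores
    # that differ by at most tol.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks, rank, prev = [0] * len(scores), 1, None
    for idx in order:
        if prev is not None and scores[idx] - prev > tol:
            rank += 1
        ranks[idx] = rank
        prev = scores[idx]
    return ranks

# Labels 0 and 1 score close together, as do labels 2 and 3: two tied groups.
ranking = scores_to_ranking([1.1, 1.2, 3.0, 2.9])
```

      Any regressor with one output per label can feed such a layer; the paper's contribution is a family of such conversion layers and their empirical comparison with state-of-the-art PLR methods.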
  6. Backmatter

Title
Discovery Science
Edited by
Dino Pedreschi
Anna Monreale
Riccardo Guidotti
Roberto Pellungrini
Francesca Naretto
Copyright year
2025
Electronic ISBN
978-3-031-78977-9
Print ISBN
978-3-031-78976-2
DOI
https://doi.org/10.1007/978-3-031-78977-9

