
2023 | Book

Machine Learning and Principles and Practice of Knowledge Discovery in Databases

International Workshops of ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part II

Editors: Irena Koprinska, Paolo Mignone, Riccardo Guidotti, Szymon Jaroszewicz, Holger Fröning, Francesco Gullo, Pedro M. Ferreira, Damian Roqueiro, Gaia Ceddia, Slawomir Nowaczyk, João Gama, Rita Ribeiro, Ricard Gavaldà, Elio Masciari, Zbigniew Ras, Ettore Ritacco, Francesca Naretto, Andreas Theissler, Przemyslaw Biecek, Wouter Verbeke, Gregor Schiele, Franz Pernkopf, Michaela Blott, Ilaria Bordino, Ivan Luciano Danesi, Giovanni Ponti, Lorenzo Severini, Annalisa Appice, Giuseppina Andresini, Ibéria Medeiros, Guilherme Graça, Lee Cooper, Naghmeh Ghazaleh, Jonas Richiardi, Diego Saldana, Konstantinos Sechidis, Arif Canakoglu, Sara Pido, Pietro Pinoli, Albert Bifet, Sepideh Pashami

Publisher: Springer Nature Switzerland

Book Series: Communications in Computer and Information Science


About this book

This volume constitutes the papers of several workshops held in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022, which took place in Grenoble, France, during September 19–23, 2022.
The 73 revised full papers and 6 short papers presented in this book were carefully reviewed and selected from 143 submissions. This book presents the following workshops:
Workshop on Data Science for Social Good (SoGood 2022)
Workshop on New Frontiers in Mining Complex Patterns (NFMCP 2022)
Workshop on Explainable Knowledge Discovery in Data Mining (XKDD 2022)
Workshop on Uplift Modeling (UMOD 2022)
Workshop on IoT, Edge and Mobile for Embedded Machine Learning (ITEM 2022)
Workshop on Mining Data for Financial Application (MIDAS 2022)
Workshop on Machine Learning for Cybersecurity (MLCS 2022)
Workshop on Machine Learning for Buildings Energy Management (MLBEM 2022)
Workshop on Machine Learning for Pharma and Healthcare Applications (PharML 2022)
Workshop on Data Analysis in Life Science (DALS 2022)
Workshop on IoT Streams for Predictive Maintenance (IoT-PdM 2022)

Table of Contents

Frontmatter

Workshop on Mining Data for Financial Application (MIDAS 2022)

Frontmatter
Multi-task Learning for Features Extraction in Financial Annual Reports

For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information. This textual data can provide valuable weak signals, for example through stylistic features, which can complement the quantitative data on financial performance or on Environmental, Social and Governance (ESG) criteria. In this work, we use various multi-task learning methods for financial text classification, focusing on financial sentiment, objectivity, forward-looking sentence prediction and ESG-content detection. We propose different methods to combine the information extracted from training jointly on different tasks; our best-performing method highlights the positive effect of explicitly adding auxiliary task predictions as features for the final target task during the multi-task training. Next, we use these classifiers to extract textual features from annual reports of FTSE350 companies and investigate the link between ESG quantitative scores and these features.

Syrielle Montariol, Matej Martinc, Andraž Pelicon, Senja Pollak, Boshko Koloski, Igor Lončarski, Aljoša Valentinčič, Katarina Sitar Šuštar, Riste Ichev, Martin Žnidaršič
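The best-performing method above explicitly adds auxiliary task predictions as features for the final target task. A minimal, hypothetical sketch of that feature-augmentation step (function names and the toy auxiliary models are ours, not the authors'):

```python
def augment_with_aux_predictions(rows, aux_models):
    """Append each auxiliary model's prediction to every feature row,
    so the target-task classifier can condition on them."""
    return [row + [model(row) for model in aux_models] for row in rows]

# Toy auxiliary "classifiers": objectivity and ESG-content flags in {0, 1}.
objectivity = lambda row: 1.0 if row[0] > 0.5 else 0.0
esg_content = lambda row: 1.0 if row[1] > 0.5 else 0.0

rows = [[0.9, 0.1], [0.2, 0.8]]
augmented = augment_with_aux_predictions(rows, [objectivity, esg_content])
# Each row gains one extra feature per auxiliary task.
```

In the paper this happens inside multi-task training; the sketch only illustrates the data-flow idea of feeding auxiliary outputs forward as inputs.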
What to Do with Your Sentiments in Finance

This paper presents some practical ideas for making use of financial news-based sentiment indicators in trading, portfolio selection, assets’ industry classification and risk management.

Argimiro Arratia

Open Access

On the Development of a European Tracker of Societal Issues and Economic Activities Using Alternative Data

We provide an overview of the development of a tracker of economic activities and societal issues across EU member states that mines alternative data sources and can be used to complement official statistics. The alternative datasets considered include Google Searches, Dow Jones Data, News and Analytics (DNA), and the Global Dataset of Events, Language and Tone (GDELT). After an overview of the methodology under current development, some preliminary findings are also given.

Sergio Consoli, Marco Colagrossi, Francesco Panella, Luca Barbaglia
Privacy-Preserving Machine Learning in Life Insurance Risk Prediction

The application of machine learning to insurance risk prediction requires learning from sensitive data. This raises multiple ethical and legal issues, one of the most relevant being privacy. However, privacy-preserving methods can potentially hinder the predictive potential of machine learning models. In this paper, we present preliminary experiments with life insurance data using two privacy-preserving techniques: discretization and encryption. Our objective is to assess the impact of such privacy-preservation techniques on the accuracy of ML models. We instantiate the problem in three general but plausible use cases involving the prediction of insurance claims within a 1-year horizon. Our preliminary experiments suggest that discretization and encryption have negligible impact on the accuracy of ML models.

Klismam Pereira, João Vinagre, Ana Nunes Alonso, Fábio Coelho, Melânia Carvalho
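One of the two techniques studied above, discretization, can be sketched as equal-width binning that replaces exact sensitive values with coarse bin indices (the bin count and attribute are illustrative assumptions, not the paper's setup):

```python
def discretize(values, n_bins):
    """Replace exact sensitive values by equal-width bin indices,
    reducing how much any single record reveals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [23, 35, 47, 59, 71]
bins = discretize(ages, 4)   # coarse bins hide exact ages
```

A model trained on `bins` sees only which interval each value falls into, which is the trade-off the paper measures against predictive accuracy.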
Financial Distress Model Prediction Using Machine Learning: A Case Study on Indonesia’s Consumers Cyclical Companies

Machine learning has been gradually introduced into corporate financial distress prediction, and several prediction models have been developed. Financial distress affects the sustainability of a company’s operations and undermines the rights and interests of its stakeholders, also harming the national economy and society. Therefore, we developed an accurate predictive model for financial distress. Using 17 financial attributes obtained from the financial statements of Indonesia’s consumer cyclical companies, we developed a machine learning model for predicting financial distress using decision tree, logistic regression, LightGBM, and k-nearest neighbor algorithms. The overall accuracy of the proposed model ranged from 0.60 to 0.87, which improved when using one-year prior growth data of the financial attributes.

Niken Prasasti Martono, Hayato Ohwada
Improve Default Prediction in Highly Unbalanced Context

Finding a model to predict the default of a firm is a well-known topic in the financial and data science communities. Bankruptcy prediction has been studied in the literature for more than fifty years. Despite the plethora of studies, predicting the failure of a company remains a hard task. We dedicated a special effort to the analysis of the highly unbalanced context that characterizes bankruptcy prediction. Imbalanced classes are a common problem in machine learning classification that is typically addressed by removing the imbalance in the training set. We conjecture that this is not always the best choice and propose the use of a slightly unbalanced training set, showing that this approach contributes to improving performance.

Stefano Piersanti
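The proposal above is to train on a slightly unbalanced set rather than rebalancing to 1:1. A sketch of undersampling the majority class to a chosen ratio (the 3:1 target is an illustrative parameter, not the paper's value):

```python
import random

def undersample_majority(majority, minority, ratio, seed=0):
    """Keep all minority samples; keep ratio * len(minority) majority
    samples, leaving the training set slightly unbalanced."""
    rng = random.Random(seed)
    keep = min(len(majority), int(ratio * len(minority)))
    return rng.sample(majority, keep) + list(minority)

healthy = list(range(1000))   # non-defaulting firms (toy IDs)
default = list(range(50))     # defaulting firms (toy IDs)
train = undersample_majority(healthy, default, ratio=3.0)  # 3:1, not 1:1
```

Setting `ratio=1.0` recovers the conventional fully balanced training set the paper argues against as a universal default.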
Towards Explainable Occupational Fraud Detection

Occupational fraud within companies currently causes losses of around 5% of company revenue each year. While enterprise resource planning systems can enable automated detection of occupational fraud through recording large amounts of company data, the use of state-of-the-art machine learning approaches in this domain is limited by their untraceable decision process. In this study, we evaluate whether machine learning combined with explainable artificial intelligence can provide both strong performance and decision traceability in occupational fraud detection. We construct an evaluation setting that assesses the comprehensibility of machine learning-based occupational fraud detection approaches, and evaluate both performance and comprehensibility of multiple approaches with explainable artificial intelligence. Our study finds that high detection performance does not necessarily indicate good explanation quality, but specific approaches provide both satisfactory performance and decision traceability, highlighting the suitability of machine learning for practical application in occupational fraud detection and the importance of research evaluating both performance and comprehensibility together.

Julian Tritscher, Daniel Schlör, Fabian Gwinner, Anna Krause, Andreas Hotho
Towards Data-Driven Volatility Modeling with Variational Autoencoders

In this study, we show how S&P 500 Index volatility surfaces can be modeled in a purely data-driven way using variational autoencoders. The approach autonomously learns concepts such as the volatility level, smile, and term structure without leaning on hypotheses from traditional volatility modeling techniques. In addition to introducing notable improvements to an existing variational autoencoder approach for the reconstruction of both complete and incomplete volatility surfaces, we showcase three practical use cases to highlight the relevance of this approach to the financial industry. First, we show how the latent space learned by the variational autoencoder can be used to produce synthetic yet realistic volatility surfaces. Second, we demonstrate how entire sequences of synthetic volatility surfaces can be generated to stress test and analyze an options portfolio. Third and last, we detect anomalous surfaces in our options dataset and pinpoint exactly which subareas are divergent.

Thomas Dierckx, Jesse Davis, Wim Schoutens
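The third use case above, anomaly detection, typically flags surface cells whose reconstruction error is large, which pinpoints the divergent subareas. A schematic sketch with a stand-in reconstruction (the decoder output and threshold here are placeholders, not the authors' model):

```python
def anomalous_cells(surface, reconstruction, threshold):
    """Return (row, col) positions where the squared reconstruction
    error exceeds the threshold, pinpointing divergent subareas."""
    return [(i, j)
            for i, row in enumerate(surface)
            for j, v in enumerate(row)
            if (v - reconstruction[i][j]) ** 2 > threshold]

# Tiny 2x2 "volatility surface" (strike x maturity) with one spiked cell.
surface        = [[0.20, 0.18], [0.19, 0.45]]
reconstruction = [[0.20, 0.18], [0.19, 0.21]]  # stand-in for VAE output
flags = anomalous_cells(surface, reconstruction, threshold=0.01)
```

In the paper the reconstruction comes from the trained variational autoencoder; only the cell-wise error comparison is shown here.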
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint

We present a new clustering algorithm for financial reports that is based on the reports’ formatting and style. The algorithm uses layout and content information to automatically generate as many clusters as needed. This allows us to reduce the effort of labeling the reports in order to train text-based machine learning models for extracting person or company names, addresses, financial categories, etc. In addition, the algorithm produces a set of sub-clusters inside each cluster, where each sub-cluster corresponds to a set of reports made by the same author (person or firm). The information about sub-clusters allows us to evaluate changes in authorship over time. We have applied the algorithm to a dataset of over 38,000 financial reports (the last Annual Account presented by each company) from the Luxembourg Business Registers (LBR) and found 2,165 clusters containing between 2 and 850 documents, with a median of 4 and an average of 14. When adding 2,500 new documents to the existing cluster set (previous annual accounts presented by companies), we found that 67.3% of the financial reports were placed in the correct cluster and sub-cluster. Of the remaining documents, 65% were placed in a different sub-cluster because the company changed its formatting style, which is expected and correct behavior. Finally, by labeling 11% of the entire dataset, we can replicate these labels for up to 72% of the dataset while keeping high feature coverage.

Braulio C. Blanco Lambruschini, Mats Brorsson, Maciej Zurad
InFi-BERT 1.0: Transformer-Based Language Model for Indian Financial Volatility Prediction

In recent years, BERT-like pretrained neural language models have been successfully developed and utilized for multiple financial domain-specific tasks. These domain-specific pre-trained models are effective at learning the specialized language used in financial contexts. In this paper, we consider the task of textual regression for the purpose of forecasting financial volatility from financial texts, and design InFi-BERT (Indian Financial BERT), a transformer-based pre-trained language model built with a domain-adaptive pre-training approach, which effectively learns linguistic context from the annual financial reports of Indian companies. In addition, we present the first Indian financial corpus for the task of volatility prediction. With detailed experimentation and result analysis, we demonstrate that our model outperforms the base model as well as previous domain-specific models for the financial volatility forecasting task.

Sravani Sasubilli, Mridula Verma

Workshop on Machine Learning for Cybersecurity (MLCS 2022)

Frontmatter
Intrusion Detection Using Ensemble Models

A massive amount of work has been carried out in the field of Intrusion Detection Systems (IDS). Predictive models are used to identify various attacks on network traffic, and several machine learning approaches have been used to prevent malware attacks or network intrusions. However, single classifiers have several limitations which cause low performance in the classification between normal traffic and attacks; in other words, they are not strong enough to be used in practical settings. This is why researchers seek more robust and higher-performing models. Examples of such stronger models are ensembles, which take advantage of the characteristics of different base models by combining them. The main goal of using ensemble classifiers is to achieve higher performance. In this paper, we propose two novel ensemble solutions for a network intrusion problem. We use pairs of strong and weak learners based on five different classifiers and combine them using weights derived through a Particle Swarm Optimization algorithm. We propose a voting and a stacking scheme to obtain the final predictions. We show the overwhelming advantage of our proposed stacking solution in the context of an intrusion detection problem across multiple performance assessment metrics including F1-Score, AUCROC and G-Mean, a rare outcome in this type of problem. Another interesting outcome of this work is the finding that the majority voting scheme is not competitive in the studied scenario.

Tina Yazdizadeh, Shabnam Hassani, Paula Branco
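The voting scheme above combines base classifiers with weights derived by Particle Swarm Optimization. A sketch of the weighted soft-voting step only, with fixed example weights standing in for the PSO output:

```python
def weighted_soft_vote(probas, weights):
    """Combine per-model attack probabilities into one weighted score."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probas, weights)) / total

# Attack probabilities from five base models; weights stand in for
# values a PSO run would produce.
probas  = [0.9, 0.8, 0.4, 0.7, 0.6]
weights = [2.0, 1.5, 0.5, 1.0, 1.0]
score = weighted_soft_vote(probas, weights)
label = "attack" if score >= 0.5 else "normal"
```

The stacking variant the paper favors would instead feed the five probabilities into a meta-classifier rather than averaging them.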
Domain Adaptation with Maximum Margin Criterion with Application to Network Traffic Classification

A fundamental assumption in machine learning is that training and test samples follow the same distribution. Therefore, to train a machine learning-based network traffic classifier, it is necessary to use samples obtained from the desired network. Collecting enough training data, however, can be challenging in many cases. Domain adaptation allows samples from other networks to be utilized: to satisfy the aforementioned assumption, it reduces the distance between the distribution of the samples in the desired network and that of the available samples in other networks. However, the applications present in two different networks can differ considerably. Taking this into account, we present a new domain adaptation method for classifying network traffic. We use labeled samples from one network and adapt them to the few labeled samples from the desired network; in other words, we adapt shared applications while preserving information about non-shared applications. To demonstrate the efficacy of our method, we construct five different cross-network datasets using the Brazil dataset. The results indicate the effectiveness of adapting samples between different domains using the proposed method.

Zahra Taghiyarrenani, Hamed Farsi
Evaluation of the Limit of Detection in Network Dataset Quality Assessment with PerQoDA

Machine learning is recognised as a relevant approach to detect attacks and other anomalies in network traffic. However, there are still no suitable network datasets that would enable effective detection. On the other hand, the preparation of a network dataset is not easy due to privacy reasons but also due to the lack of tools for assessing their quality. In a previous paper, we proposed a new method for data quality assessment based on permutation testing. This paper presents a parallel study on the limits of detection of such an approach. We focus on the problem of network flow classification and use well-known machine learning techniques. The experiments were performed using publicly available network datasets.

Katarzyna Wasielewska, Dominik Soukup, Tomáš Čejka, José Camacho
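The permutation-testing idea behind the quality assessment above can be sketched as: score a classifier against the real labels, then against many label shufflings; if the real score does not clearly beat the permuted ones, the dataset carries little label-relevant signal. Names and the toy scorer here are ours, not PerQoDA's internals:

```python
import random

def permutation_pvalue(features, labels, score_fn, n_perm=200, seed=0):
    """Fraction of label permutations scoring at least as well as the
    real labels (a small p-value indicates real signal in the data)."""
    rng = random.Random(seed)
    real = score_fn(features, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if score_fn(features, shuffled) >= real:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction

# Toy scorer: accuracy of thresholding a single flow feature at 0.5.
acc = lambda X, y: sum((x > 0.5) == bool(t) for x, t in zip(X, y)) / len(y)
X = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = [0, 0, 0, 1, 1, 1]     # perfectly separable: expect a small p-value
p = permutation_pvalue(X, y, acc)
```

The paper's contribution concerns the detection limits of this kind of test on network datasets; the sketch shows only the basic permutation mechanism.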
Towards a General Model for Intrusion Detection: An Exploratory Study

Exercising Machine Learning (ML) algorithms to detect intrusions is nowadays the de-facto standard for data-driven detection tasks. This activity requires the expertise of researchers, practitioners, or company employees, who also have to gather labeled data to learn and evaluate the model that will then be deployed into a specific system. Reducing the expertise and time required to craft intrusion detectors is a tough challenge, which in turn would have an enormous beneficial impact on the domain. This paper conducts an exploratory study that aims at understanding to what extent it is possible to build an intrusion detector that is general enough to be learned once and then applied to different systems with minimal to no effort. We recap the issues that may prevent building general detectors and propose software architectures that have the potential to overcome them. Then, we perform an experimental evaluation using several binary ML classifiers and a total of 16 feature learners on 4 public attack datasets. Results show that a model learned on one dataset or system does not generalize well as-is to other datasets or systems, showing poor detection performance. Instead, building a unique model that is then tailored to a specific dataset or system may achieve good classification performance, requiring less data and far less expertise from the final user.

Tommaso Zoppi, Andrea Ceccarelli, Andrea Bondavalli

Workshop on Machine Learning for Buildings Energy Management (MLBEM 2022)

Frontmatter
Conv-NILM-Net, a Causal and Multi-appliance Model for Energy Source Separation

Non-Intrusive Load Monitoring (NILM) seeks to save energy by estimating individual appliance power usage from a single aggregate measurement. Deep neural networks have become increasingly popular for solving NILM problems. However, most models in use are designed for load identification rather than online source separation. Among source separation models, most use a single-task learning approach in which a neural network is trained exclusively for each appliance. This strategy is computationally expensive and ignores the fact that multiple appliances can be active simultaneously, as well as the dependencies between them. The remaining models are not causal, which is important for real-time application. Inspired by Conv-TasNet, a model for speech separation, we propose Conv-NILM-net, a fully convolutional framework for end-to-end NILM. Conv-NILM-net is a causal model for multi-appliance source separation. Our model is tested on two real datasets, REDD and UK-DALE, and clearly outperforms the state of the art while keeping a significantly smaller size than the competing models.

Mohamed Alami C., Jérémie Decock, Rim kaddah, Jesse Read
Domestic Hot Water Forecasting for Individual Housing with Deep Learning

The share of energy used to heat water represents around 15% of consumption in European houses. To improve energy efficiency, smart heating systems could benefit from accurate domestic hot water consumption forecasting in order to adapt their heating profile. However, forecasting the hot water consumption of a single accommodation can be difficult, since the data are generally highly non-smooth and present large variations from day to day. We propose to tackle this issue with three deep learning approaches, recurrent neural networks, 1-dimensional convolutional neural networks and multi-head attention, to perform one-day-ahead prediction of hot water consumption for an individual residence. Moreover, similarly to the Transformer architecture, we experiment with enriching the last two approaches with various forms of position encoding to include the order of the sequence in the data. The models achieved satisfying performance in terms of MSE on an individual residence dataset, showing that this approach is promising for conceiving building energy management systems based on deep forecasting models.

Paul Compagnon, Aurore Lomet, Marina Reyboz, Martial Mermillod
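Among the position encodings one might experiment with here is the sinusoidal form from the Transformer; a minimal version for a consumption sequence (the sequence length and dimension are illustrative, not the paper's settings):

```python
import math

def sinusoidal_encoding(seq_len, dim):
    """Transformer-style position encoding: even dimensions use sine,
    odd dimensions cosine, with geometrically spaced frequencies."""
    enc = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc

pe = sinusoidal_encoding(seq_len=24, dim=8)  # e.g. one day of hourly readings
```

Each hourly reading's feature vector would be combined with its row of `pe`, giving the convolutional and attention models access to sequence order.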

Workshop on Machine Learning for Pharma and Healthcare Applications (PharML 2022)

Frontmatter
Detecting Drift in Healthcare AI Models Based on Data Availability

There is increasing interest in the use of AI in healthcare due to its potential for diagnosis and disease prediction. However, healthcare data is not static and is likely to change over time, leading a non-adaptive model to poor decision-making. A drift detector in the overall learning framework is therefore essential to guarantee reliable products on the market. Most drift detection algorithms assume that ground-truth labels are available immediately after prediction, since these methods often work by monitoring model performance. However, especially in real-world clinical contexts, this is not always the case, as collecting labels is often time consuming and requires experts’ input. This paper investigates methodologies to address drift detection depending on which information is available during the monitoring process. We explore the topic from a regulatory standpoint, showing challenges and approaches to monitoring algorithms in healthcare with subsequent batch updates of data. The paper explores three different aspects of drift detection: drift based on performance (when labels are available), drift based on model structure (indicating causes of drift), and drift based on changes in underlying data characteristics (distribution and correlation) when labels are not available.

Ylenia Rotalinti, Allan Tucker, Michael Lonergan, Puja Myles, Richard Branson
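When labels are unavailable, the third detection mode above monitors the underlying data distribution. One common sketch of such a check is the Population Stability Index over binned feature values (the bin counts and the 0.2 threshold are conventional illustrations, not taken from the paper):

```python
import math

def psi(expected_counts, observed_counts):
    """Population Stability Index between two binned distributions;
    values above ~0.2 are conventionally read as significant drift."""
    e_tot, o_tot = sum(expected_counts), sum(observed_counts)
    score = 0.0
    for e, o in zip(expected_counts, observed_counts):
        e_p = max(e / e_tot, 1e-6)   # avoid log(0) on empty bins
        o_p = max(o / o_tot, 1e-6)
        score += (o_p - e_p) * math.log(o_p / e_p)
    return score

baseline = [50, 30, 20]   # training-time histogram of some clinical feature
current  = [20, 30, 50]   # shifted histogram observed during monitoring
drifted = psi(baseline, current) > 0.2
```

A monitoring pipeline would compute this per feature on each data batch and alert when the index crosses the chosen threshold.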
Assessing Different Feature Selection Methods Applied to a Bulk RNA Sequencing Dataset with Regard to Biomedical Relevance

High-throughput RNA sequencing (RNA-Seq) allows for the profiling of thousands of transcripts in multiple samples. For the analysis of the generated RNA-Seq datasets, standard and well-established methods exist, which are however limited by (i) the high dimensionality of the data, with most of the expression profiles being uninformative, and (ii) an imbalanced sample-to-feature ratio. This complicates downstream analyses of these data and the application of methods such as Machine Learning (ML) classification. Therefore, selecting the features that carry the essential information is important. The standard method of informative feature selection is differential gene expression (DGE) analysis, which is often conducted in a univariate fashion and ignores interactions between expression profiles. ML-based feature selection methods, on the other hand, are capable of addressing these shortcomings. Here, we have applied five different ML-based feature selection methods and conventional DGE analysis to a high-dimensional bulk RNA-Seq dataset of PBMCs of healthy children and of children affected with Atopic Dermatitis (AD), and evaluated the resulting feature lists. The similarities between the feature lists were assessed with three similarity coefficients. The selected genetic features were subjected to a Gene Ontology (GO) functional enrichment analysis, and the significantly enriched GO terms were evaluated by applying a semantic similarity analysis combined with binary cut clustering. In addition, comparisons with consensus gene lists associated with AD were performed, and the previous identification of the selected features in related studies was assessed. We found that genetic features selected with ML-based methods were, in general, of higher biomedical relevance. We argue that ML-based feature selection, followed by a careful evaluation of the selected feature sets, extends the possibilities of precision medicine to discover biomarkers.

Damir Zhakparov, Kathleen Moriarty, Nonhlanhla Lunjani, Marco Schmid, Carol Hlela, Michael Levin, Avumile Mankahla, SOS-ALL Consortium, Cezmi Akdis, Liam O’Mahony, Katja Baerenfaller, Damian Roqueiro
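Similarity between selected-feature lists can be quantified with set-overlap coefficients; a sketch of the Jaccard coefficient, one plausible choice (the paper uses three such coefficients, unnamed here, and the gene lists below are invented examples):

```python
def jaccard(features_a, features_b):
    """|A ∩ B| / |A ∪ B| between two selected-feature lists."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical feature lists from DGE analysis vs. an ML-based selector.
dge_hits = ["IL4", "IL13", "FLG", "CCL17"]
ml_hits  = ["IL13", "FLG", "CCL17", "TSLP", "IL22"]
overlap = jaccard(dge_hits, ml_hits)   # 3 shared genes out of 6 total
```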
Predicting Drug Treatment for Hospitalized Patients with Heart Failure

Heart failure and acute heart failure, the sudden onset or worsening of symptoms related to heart failure, are leading causes of hospital admission in the elderly. Treatment of heart failure is a complex problem that needs to consider a combination of factors such as the clinical manifestation and comorbidities of the patient. Machine learning approaches exploiting patient data may potentially improve disease management for heart failure patients. However, there is a lack of treatment prediction models for these patients. Hence, in this study, we propose a workflow to stratify patients based on clinical features and predict the drug treatment for hospitalized patients with heart failure. Initially, we train the k-medoids and DBSCAN clustering methods on an extract from the MIMIC III dataset. Subsequently, we carry out a multi-label treatment prediction by assigning new patients to the pre-defined clusters. The empirical evaluation shows that k-medoids and DBSCAN successfully identify patient subgroups, with different treatments in each subgroup. DBSCAN outperforms k-medoids in patient stratification, yet the performance for treatment prediction is similar for both algorithms. Therefore, our work supports that clustering algorithms, specifically DBSCAN, have the potential to successfully perform patient profiling and predict individualized drug treatment for patients with heart failure.

Linyi Zhou, Ioanna Miliou
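The prediction step above assigns a new patient to a pre-defined cluster and reads off that cluster's treatments. A schematic nearest-medoid assignment (the features, medoids and drug labels are invented for illustration and correspond to the k-medoids variant, not DBSCAN):

```python
def assign_to_cluster(patient, medoids):
    """Return the index of the nearest medoid (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(medoids)), key=lambda k: dist(patient, medoids[k]))

# Two toy clusters with their multi-label treatment sets.
medoids = [[0.2, 0.1], [0.8, 0.9]]
treatments = [{"diuretic"}, {"diuretic", "beta-blocker"}]

cluster = assign_to_cluster([0.75, 0.85], medoids)
predicted = treatments[cluster]
```

Multi-label prediction here simply means the new patient inherits the set of drugs characteristic of their assigned subgroup.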
A Workflow for Generating Patient Counterfactuals in Lung Transplant Recipients

Lung transplantation is a critical procedure performed in end-stage pulmonary patients. The number of lung transplantations performed in the USA in the last decade has been rising, but the survival rate is still lower than that of other solid organ transplantations. First, this study aims to employ machine learning models to predict patient survival after lung transplantation. Additionally, the aim is to generate counterfactual explanations based on these predictions to help clinicians and patients understand the changes needed to increase the probability of survival after the transplantation and better comply with normative requirements. We use data derived from the UNOS database, particularly the lung transplantations performed in the USA between 2019 and 2021. We formulate the problem and define two data representations, with the first being a representation that describes only the lung recipients and the second the recipients and donors. We propose an explainable ML workflow for predicting patient survival after lung transplantation. We evaluate the workflow based on various performance metrics, using five classification models and two counterfactual generation methods. Finally, we demonstrate the potential of explainable ML for resource allocation, predicting patient mortality, and generating explainable predictions for lung transplantation.

Franco Rugolon, Maria Bampa, Panagiotis Papapetrou
Few-Shot Learning for Identification of COVID-19 Symptoms Using Generative Pre-trained Transformer Language Models

Since the onset of the COVID-19 pandemic, social media users have shared their personal experiences related to the viral infection. Their posts contain rich information about symptoms that may provide useful hints for advancing the body of medical research and supplement discoveries from clinical settings. Identification of symptom expressions in social media text is challenging, partially due to the lack of annotated data. In this study, we investigate utilizing few-shot learning with generative pre-trained transformer language models to identify COVID-19 symptoms in Twitter posts. Our results show that large language models are promising for more accurately identifying symptom expressions in Twitter posts with a small amount of annotation effort, and our method can be applied to other medical and health applications where abundant unlabeled data is available.

Keyuan Jiang, Minghao Zhu, Gordon R. Bernard
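Few-shot use of a generative model typically means prepending a handful of labelled demonstrations to the input. A sketch of prompt assembly only (the example tweets and label format are invented, and no model call is shown):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate labelled demonstrations, then the unlabelled query,
    leaving the model to complete the final 'Symptoms:' line."""
    parts = [f"Tweet: {text}\nSymptoms: {label}" for text, label in examples]
    parts.append(f"Tweet: {query}\nSymptoms:")
    return "\n\n".join(parts)

examples = [
    ("Lost my sense of smell this week", "loss of smell"),
    ("Day 3 of this dry cough and fever", "cough; fever"),
]
prompt = build_few_shot_prompt(examples, "Woke up with a pounding headache")
```

The assembled `prompt` would be sent to a generative pre-trained transformer, whose completion after the final "Symptoms:" is parsed as the predicted symptom list.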
A Light-Weight Deep Residual Network for Classification of Abnormal Heart Rhythms on Tiny Devices

Automatic classification of abnormal heart rhythms using electrocardiogram (ECG) signals has been a popular research area in medicine. In spite of reporting good accuracy, the available deep learning-based algorithms are resource-hungry and cannot be effectively used for continuous patient monitoring on portable devices. In this paper, we propose an optimized light-weight algorithm for real-time classification of normal sinus rhythm, Atrial Fibrillation (AF), and other abnormal heart rhythms using single-lead ECG on resource-constrained, low-powered tiny edge devices. A deep Residual Network (ResNet) architecture with an attention mechanism is proposed as the baseline model, which is duly compressed using a set of collaborative optimization techniques. Results show that the baseline model outperforms the state-of-the-art algorithms on the open-access PhysioNet Challenge 2017 database. The optimized model is successfully deployed on a commercial microcontroller for real-time ECG analysis with minimal impact on performance.

Rohan Banerjee, Avik Ghose
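Compression pipelines for tiny devices commonly include post-training weight quantization; a minimal symmetric int8 sketch (a generic illustration of the idea, not the authors' specific collaborative optimization pipeline):

```python
def quantize_int8(weights):
    """Map float weights to int8 values with one symmetric per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)      # integers in [-127, 127], 4x smaller than float32
w_hat = dequantize(q, s)     # reconstruction error bounded by half a step
```

Storing `q` plus one scale per tensor is what lets a microcontroller hold a network that would not fit in float32.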

Workshop on Data Analysis in Life Science (DALS 2022)

Frontmatter
I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads

Long-read sequencing technologies demonstrate high potential for de novo discovery of complex transcript isoforms, but high error rates pose a significant challenge. Existing error correction methods rely on clustering reads based on isoform-level alignment and cannot be efficiently scaled. We propose a new method, I-CONVEX, that performs fast, alignment-free isoform clustering with almost linear computational complexity, and leads to better consensus accuracy on simulated, synthetic, and real datasets.

Sina Baharlouei, Meisam Razaviyayn, Elizabeth Tseng, David Tse
Italian Debate on Measles Vaccination: How Twitter Data Highlight Communities and Polarity

Social media platforms such as Twitter, Facebook, and YouTube have proven to be valuable sources of information, offering freely collectible public opinions. Due to the recent outbreak of the monkeypox disease, and in light of the historical pandemic that affected the whole world, we examine the issue of understanding Italian opinion towards vaccination against diseases that have apparently disappeared. To address this issue, we study the flow of information on the measles vaccine by looking at Twitter data. We discovered that vaccine skeptics have a higher tweeting activity, and that the hashtags used by the three classes of users (pro-vaccine, anti-vaccine, and neutral) fall into three different communities, corresponding to the groups identified by opinion polarization towards the vaccine. By analyzing how hashtags are shared across different communities, we show that communication exists only in the neutral-opinion community.

Cynthia Ifeyinwa Ugwu, Sofia Casarin

3rd Workshop and Tutorial on Streams for Predictive Maintenance (IoT-PdM 2022)

Frontmatter
Online Anomaly Explanation: A Case Study on Predictive Maintenance

Predictive maintenance applications are increasingly complex, with interactions between many components. Black-box models based on deep learning techniques are popular approaches due to their predictive accuracy. This paper presents an architecture that uses an online rule learning algorithm to explain when the black-box model predicts rare events. The system can present global explanations that model the black-box model and local explanations that describe why the black-box model predicts a failure. We evaluate the proposed system using four real-world public transport data sets, presenting illustrative examples of explanations.

Rita P. Ribeiro, Saulo Martiello Mastelini, Narjes Davari, Ehsan Aminian, Bruno Veloso, João Gama
Fault Forecasting Using Data-Driven Modeling: A Case Study for Metro do Porto Data Set

The demand for high-performance solutions for anomaly detection and fault forecasting is increasing in industry. Detecting and forecasting faults from time-series data is a critical task in Internet of Things (IoT) data mining. Classical fault detection approaches based on physical modelling are limited to a few measurable output variables, and accurate physical modelling of vehicle dynamics requires substantial prior information about the system. Data-driven modelling techniques, in contrast, represent the system's dynamics accurately from collected data. Experimental results on large-scale data sets from Metro do Porto subsystems verify that our method delivers high-quality fault detection and forecasting. In addition, a health indicator obtained from the principal component analysis of the forecasting output is used to predict the remaining useful life.
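To illustrate the final step, a health indicator can be derived by projecting multivariate forecasts onto their first principal component. The sketch below does this with a plain SVD on simulated drift data; it is an assumption-laden toy, not the paper's pipeline:

```python
import numpy as np

def health_indicator(signals):
    # Project multivariate signals onto the first principal component and
    # rescale the score to [0, 1] as a scalar health indicator.
    X = signals - signals.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    score = X @ Vt[0]
    return (score - score.min()) / (score.max() - score.min() + 1e-12)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
# Simulated degradation: all four sensor channels drift upward plus noise.
signals = np.outer(t, np.ones(4)) + 0.05 * rng.normal(size=(200, 4))
hi = health_indicator(signals)
```

Because PC1 captures the shared drift, the indicator tracks degradation over time; extrapolating it to a failure threshold is one common way to estimate remaining useful life.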

Narjes Davari, Bruno Veloso, Rita P. Ribeiro, João Gama
An Online Data-Driven Predictive Maintenance Approach for Railway Switches

This paper proposes an online data-driven predictive maintenance approach for railway switches, using data logs obtained from the interlocking system of the railway infrastructure. The approach is described in detail and consists of a two-phase process: anomaly detection and remaining useful life prediction. It is applied to and validated on a real case study, the Metro do Porto, for which seven months of data are available, and proves satisfactory in detecting anomalies. The results open possibilities for further studies and for validating the remaining useful life prediction on a more extensive dataset.

Emanuel Sousa Tomé, Rita P. Ribeiro, Bruno Veloso, João Gama
curr2vib: Modality Embedding Translation for Broken-Rotor Bar Detection

Thanks to recent advances in sensor technology and the Internet of Things, the operation of machinery can be monitored using a growing number of sources and modalities. In this study, we demonstrate that multi-modal translation can transfer knowledge from a modality with a higher level of applicability (more useful for solving a specific task) but a lower level of accessibility (how easy and affordable it is to collect information from this modality) to another one with higher accessibility but lower applicability. Unlike the fusion of multiple modalities, which requires all modalities to be available at deployment, our proposed method depends only on the more accessible one, reducing the cost of instrumentation equipment. The presented case study demonstrates that the proposed method allows us to replace five acceleration sensors with three current sensors while increasing classification accuracy by more than 1%.
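One minimal way to picture modality translation is a linear map, learned on paired training data, from the accessible modality (current) to the less accessible one (vibration), then used with current sensors alone at deployment. The paper's model is certainly richer than this; shapes and names here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(3, 5))         # hidden current -> vibration map (toy)
X_current = rng.normal(size=(100, 3))    # 3 current-sensor features
Y_vibration = X_current @ W_true         # 5 vibration-sensor features (paired)

# Least-squares estimate of the translation from the paired training data.
W_hat, *_ = np.linalg.lstsq(X_current, Y_vibration, rcond=None)

# At deployment, vibration-like features are synthesized from current alone,
# so the acceleration sensors are no longer needed.
Y_pred = X_current @ W_hat
err = np.abs(Y_pred - Y_vibration).max()
```

The design point the abstract makes is exactly this asymmetry: training needs both modalities once, but the deployed system only keeps the cheaper instrumentation.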

Amirhossein Berenji, Zahra Taghiyarrenani, Sławomir Nowaczyk
Incorporating Physics-Based Models into Data Driven Approaches for Air Leak Detection in City Buses

In this work-in-progress paper, two types of physics-based models, for assessing elastic and non-elastic air leakage processes, were evaluated and compared with conventional statistical methods for detecting air leaks in city buses via a data-driven approach. We have access to data streamed from a pressure sensor located in the air tanks of a few city buses during their daily operation. The air tank in these buses supplies compressed air to drive various components, e.g. the air brake, suspension, doors, and gearbox. We fitted three physics-based models only to the leakage segments extracted from the air pressure signal and used the fitted model parameters as expert features for detecting air leaks. Furthermore, statistical moments of these fitted parameters over predetermined time intervals were compared, in a classification setting, to conventional statistical features on raw pressure values, in discriminating samples before and after the repair of air leak problems. The results of this exploratory study, on six air leak cases, show that the fitted parameters of the physics-based models are useful for discriminating samples with air leak faults from the fault-free samples observed right after the repair was performed. A comparison based on the ANOVA F-score shows that the proposed features based on fitted parameters of physics-based models outrank the conventional features, with the features of the non-elastic leakage model performing best.
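The abstract does not give the concrete model equations. Assuming, for illustration only, that a leakage segment follows a simple exponential pressure decay p(t) = p0 · exp(−t/τ), the fitted pair (p0, τ) is the kind of expert feature the paper describes:

```python
import numpy as np

def fit_leak_model(t, p):
    # Fit p(t) = p0 * exp(-t / tau) by linear regression on log-pressure;
    # the fitted (p0, tau) become expert features for the classifier.
    slope, intercept = np.polyfit(t, np.log(p), 1)
    return np.exp(intercept), -1.0 / slope

# Simulated leakage segment: tank pressure decaying from 8 bar with tau = 3 s.
t = np.linspace(0, 10, 50)
p = 8.0 * np.exp(-t / 3.0)
p0, tau = fit_leak_model(t, p)
```

A segment from a leaky tank would yield a much smaller τ (faster decay) than a healthy one, which is why such fitted parameters can separate pre-repair from post-repair samples.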

Yuantao Fan, Hamid Sarmadi, Sławomir Nowaczyk
Towards Geometry-Preserving Domain Adaptation for Fault Identification

In most industries, the working conditions of equipment vary significantly from one site to another, from one time of year to another, and so on. This variation poses a severe challenge for data-driven fault identification methods: it introduces a change in the data distribution, contradicting the underlying assumption of most machine learning methods that training and test samples follow the same distribution. Domain Adaptation (DA) methods aim to address this problem by minimizing the distribution distance between training (source) and test (target) samples. However, in predictive maintenance this idea is complicated by the fact that different classes (fault categories) also vary across domains. Most state-of-the-art DA methods assume that the data in the target domain is complete, i.e., that we have access to examples from all possible classes or fault categories during adaptation. In reality, this is often very difficult to guarantee. There is therefore a need for a domain adaptation method that can align the source and target domains even with access to an incomplete set of test data. This paper presents our work in progress: we propose an approach for this setting based on maintaining the geometry information of source samples during adaptation. This way, the model can capture the relationships between different fault categories and preserve them in the constructed domain-invariant feature space, even when some classes are entirely missing. We examine this idea on artificial data sets to demonstrate the effectiveness of the geometry-preserving transformation, and have also started investigations on real-world predictive maintenance datasets such as CWRU.
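A standard choice for the distribution distance that DA methods minimize is the (squared) maximum mean discrepancy. The paper's geometry-preserving approach is not reproduced here, but the sketch below shows how MMD separates a shifted target domain from an aligned one:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    # Biased estimate of squared maximum mean discrepancy under an RBF
    # kernel: small when X and Y come from the same distribution.
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(2)
src = rng.normal(0.0, 1.0, size=(200, 2))          # source domain samples
tgt_shifted = rng.normal(2.0, 1.0, size=(200, 2))  # target with domain shift
tgt_same = rng.normal(0.0, 1.0, size=(200, 2))     # target without shift
```

Minimizing such a distance aligns the domains; the paper's contribution is to do so while additionally preserving the geometric relations among source classes, so that fault categories missing from the target are not collapsed.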

Zahra Taghiyarrenani, Sławomir Nowaczyk, Sepideh Pashami, Mohamed-Rafik Bouguelia
A Systematic Approach for Tracking the Evolution of XAI as a Field of Research

The increasing use of AI methods in various applications has raised concerns about their explainability and transparency. Many solutions have been developed within the last few years to explain either the model itself or the decisions provided by the model. However, the number of contributions in the field of eXplainable AI (XAI) is increasing at such a pace that it is almost impossible for a newcomer to identify key ideas, track the field's evolution, or find promising new research directions. Typically, survey papers serve as a starting point, providing a feasible entry into a research area. However, this is not trivial for fields with exponential growth in the literature, such as XAI. For instance, we analyzed 23 surveys in the XAI domain published within the last three years and, surprisingly, found no common conceptualization among them. This makes XAI one of the most challenging research areas to enter. To address this problem, we propose a systematic approach that enables newcomers to identify the principal ideas and track their evolution. The proposed method automates the retrieval of relevant papers, extracts their semantic relationships, and creates a temporal graph of ideas through post-analysis of citation graphs. The main outcome of our method is the Field's Evolution Graph (FEG), which can be used to find the core idea of each approach in the field, see how a given concept has developed and evolved over time, observe how different notions interact, and perceive how a new paradigm emerges through the combination of multiple ideas. As a demonstration, we show that FEG successfully identifies the field's key articles, such as LIME or Grad-CAM, and maps out their evolution and relationships.

Samaneh Jamshidi, Sławomir Nowaczyk, Hadi Fanaee-T, Mahmoud Rahat
Frequent Generalized Subgraph Mining via Graph Edit Distances

In this work, we propose a method for computing generalized frequent subgraph patterns based on the graph edit distance. Graph data is often equipped with semantic information in the form of an ontology, for example when dealing with linked data or knowledge graphs. Previous work suggests exploiting this semantic information to compute frequent generalized patterns, i.e. patterns for which the total frequency of all more specific patterns exceeds the frequency threshold. However, the problem of computing the frequency of a generalized pattern has not yet been fully addressed.
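To make the generalized-frequency question concrete, the toy sketch below counts a transaction graph as supporting a generalized pattern when a brute-force graph edit distance, with ontology-aware substitution costs, stays within a budget. The ontology, the cost scheme, and the budget are all invented for illustration and are not the paper's definitions:

```python
from itertools import permutations

ONTOLOGY = {"dog": "animal", "cat": "animal"}  # toy is-a hierarchy (assumed)

def subst_cost(pat_label, data_label):
    # Matching is free when labels agree, or when the pattern label
    # generalizes the data label according to the ontology.
    if pat_label == data_label or ONTOLOGY.get(data_label) == pat_label:
        return 0
    return 1

def edit_distance(pat, g):
    # Brute-force GED for equally sized labeled graphs: minimize label
    # substitutions plus edge insert/delete costs over all node bijections.
    p_nodes, p_edges = pat
    g_nodes, g_edges = g
    best = None
    for perm in permutations(range(len(g_nodes))):
        cost = sum(subst_cost(p_nodes[i], g_nodes[perm[i]])
                   for i in range(len(p_nodes)))
        mapped = {frozenset((perm[a], perm[b])) for a, b in p_edges}
        actual = {frozenset(e) for e in g_edges}
        cost += len(mapped ^ actual)  # edges present in one graph only
        best = cost if best is None else min(best, cost)
    return best

def generalized_frequency(pattern, graphs, budget=0):
    # A transaction graph supports the generalized pattern if its edit
    # distance to the pattern stays within the budget.
    return sum(edit_distance(pattern, g) <= budget for g in graphs)

# Pattern "animal - animal" is supported by "dog - cat" but not "car - bike".
pattern = (["animal", "animal"], [(0, 1)])
g1 = (["dog", "cat"], [(0, 1)])
g2 = (["car", "bike"], [(0, 1)])
freq = generalized_frequency(pattern, [g1, g2])
```

Real frequent-pattern miners of course avoid this factorial enumeration; the sketch only fixes the semantics of "a generalized pattern occurs in a graph".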

Richard Palme, Pascal Welke
Backmatter
Metadata
Title
Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Editors
Irena Koprinska
Paolo Mignone
Riccardo Guidotti
Szymon Jaroszewicz
Holger Fröning
Francesco Gullo
Pedro M. Ferreira
Damian Roqueiro
Gaia Ceddia
Slawomir Nowaczyk
João Gama
Rita Ribeiro
Ricard Gavaldà
Elio Masciari
Zbigniew Ras
Ettore Ritacco
Francesca Naretto
Andreas Theissler
Przemyslaw Biecek
Wouter Verbeke
Gregor Schiele
Franz Pernkopf
Michaela Blott
Ilaria Bordino
Ivan Luciano Danesi
Giovanni Ponti
Lorenzo Severini
Annalisa Appice
Giuseppina Andresini
Ibéria Medeiros
Guilherme Graça
Lee Cooper
Naghmeh Ghazaleh
Jonas Richiardi
Diego Saldana
Konstantinos Sechidis
Arif Canakoglu
Sara Pido
Pietro Pinoli
Albert Bifet
Sepideh Pashami
Copyright Year
2023
Electronic ISBN
978-3-031-23633-4
Print ISBN
978-3-031-23632-7
DOI
https://doi.org/10.1007/978-3-031-23633-4
