## About this book

This book constitutes the refereed proceedings of the 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, held as a virtual event in June 2021.

The 28 full papers presented together with 30 short papers were selected from 138 submissions. The papers are grouped in topical sections on image analysis; predictive modelling; temporal data analysis; unsupervised learning; planning and decision support; deep learning; natural language processing; and knowledge representation and rule mining.

## Table of contents

### The Myth of Complete AI-Fairness

Just recently, IBM invited me to participate in a panel titled “Will AI ever be completely fair?” My first reaction was that it surely would be a very short panel, as the only possible answer is ‘no’. In this short paper, I wish to further motivate my position in that debate: “AI will never be completely fair. Nothing ever is. The point is not complete fairness, but the need to establish metrics and thresholds for fairness that ensure trust in AI systems”.

Virginia Dignum

### A Petri Dish for Histopathology Image Analysis

With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens traditionally examined manually under a microscope by pathologists. However, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images make it difficult to quickly iterate over model designs. Throughout scientific history, many significant research directions have leveraged small-scale experimental setups as petri dishes to efficiently evaluate exploratory ideas. In this paper, we introduce a minimalist histopathology image analysis dataset (MHIST), an analogous petri dish for histopathology image analysis. MHIST is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists and an annotator agreement level. MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, we use MHIST to study natural questions such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance. By introducing MHIST, we hope to not only help facilitate the work of current histopathology imaging researchers, but also make the field more accessible to the general community. Our dataset is available at https://bmirds.github.io/MHIST.

Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour

### fMRI Multiple Missing Values Imputation Regularized by a Recurrent Denoiser

Functional Magnetic Resonance Imaging (fMRI) is a neuroimaging technique with pivotal importance due to its scientific and clinical applications. As with any widely used imaging modality, there is a need to ensure its quality, with missing values being highly frequent due to the presence of artifacts or sub-optimal imaging resolutions. Our work focuses on missing-value imputation for multivariate signal data. To do so, a new imputation method is proposed, consisting of two major steps: spatial-dependent signal imputation and time-dependent regularization of the imputed signal. A novel layer, to be used in deep learning architectures, is proposed in this work, bringing back the concept of chained equations for multiple imputation [26]. Finally, a recurrent layer is applied to tune the signal, such that it captures its true patterns. Both operations yield improved robustness compared with state-of-the-art alternatives. The code is made available on GitHub.

David Calhas, Rui Henriques

### Bayesian Deep Active Learning for Medical Image Analysis

Deep learning has achieved state-of-the-art performance in medical image analysis but requires a large number of labelled images to perform adequately. However, such labelled images are costly to acquire in time, labour, and human expertise. We propose a novel, practical Bayesian active learning approach using Dropweights and an overall bias-corrected uncertainty measure to suggest which unlabelled image to annotate. Experiments were done on brain tumour MR images, microscopic cell image classification, fluoro-chromogenic cytokeratin-Ki-67 double staining cancer images, and retina fundus image segmentation tasks. We demonstrate that our active learning technique is as successful as, or better than, other existing active learning approaches on high-dimensional data at reducing the image labelling effort significantly. We believe that a practical Bayesian deep active learning framework that works with very few annotated samples will help clinicians obtain fast and accurate image annotations with confidence.

Biraja Ghoshal, Stephen Swift, Allan Tucker

### A Topological Data Analysis Mapper of the Ovarian Folliculogenesis Based on MALDI Mass Spectrometry Imaging Proteomics

Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry Imaging (MALDI-MSI), also referred to as molecular histology, is an emerging omics technique that allows the simultaneous, label-free detection of thousands of peptides in their tissue localization and generates high-dimensional data. This technology requires the development of advanced computational methods to deepen our knowledge of relevant biological processes, such as those involved in reproductive biology. The mammalian ovary cyclically undergoes morpho-functional changes. From puberty, at each ovarian cycle, a group of pre-antral follicles (type 4, T4) is recruited and grows to the pre-ovulatory (T8) stage, until ovulation of mature oocytes. The correct follicle growth and acquisition of oocyte developmental competence are strictly related to a continuous, but still poorly understood, molecular crosstalk between the gamete and the surrounding follicle cells. Here, we tested the use of advanced clustering and visual analytics approaches on MALDI-MSI data for the in-situ identification of the protein signature of growing follicles, from the pre-antral T4 to the pre-ovulatory T8. Specifically, we first analyzed follicle MALDI-MSI data with PCA, tSNE and UMAP approaches, and then we developed a framework that employs Topological Data Analysis (TDA) Mapper to detect spatially and temporally related clusters and to pinpoint differentially expressed proteins. TDA Mapper is an unsupervised machine learning method suited to the analysis of high-dimensional data that are embedded into a graph model. Interestingly, the graph structure revealed protein patterns in clusters containing different follicle types, highlighting putative factors that drive follicle growth.

Giulia Campi, Giovanna Nicora, Giulia Fiorentino, Andrew Smith, Fulvio Magni, Silvia Garagna, Maurizio Zuccotti, Riccardo Bellazzi

### Predicting Kidney Transplant Survival Using Multiple Feature Representations for HLAs

Kidney transplantation can significantly enhance living standards for people suffering from end-stage renal disease. A significant factor that affects graft survival time (the time until the transplant fails and the patient requires another transplant) for kidney transplantation is the compatibility of the Human Leukocyte Antigens (HLAs) between the donor and recipient. In this paper, we propose new biologically-relevant feature representations for incorporating HLA information into machine learning-based survival analysis algorithms. We evaluate our proposed HLA feature representations on a database of over 100,000 transplants and find that they improve prediction accuracy by about 1%, modest at the patient level but potentially significant at a societal level. Accurate prediction of survival times can improve transplant survival outcomes, enabling better allocation of donors to recipients and reducing the number of re-transplants due to graft failure with poorly matched donors.

Mohammadreza Nemati, Haonan Zhang, Michael Sloma, Dulat Bekbolsynov, Hong Wang, Stanislaw Stepkowski, Kevin S. Xu

### Sum-Product Networks for Early Outbreak Detection of Emerging Diseases

Recent research in syndromic surveillance has focused primarily on monitoring specific, known diseases, concentrating on a certain clinical picture under surveillance. Outbreaks of emerging infectious diseases with different symptom patterns are likely to be missed by such a surveillance system. In contrast, monitoring all available data for anomalies makes it possible to detect any kind of outbreak, including infectious diseases with as yet unknown syndromic clinical pictures. In this work, we propose to model the joint probability distribution of syndromic data with sum-product networks (SPNs), which are able to capture correlations in the monitored data and can even take environmental factors into account, such as the current influenza infection rate. In contrast to the conventional use of SPNs, we present a new approach to detect anomalies by evaluating p-values on the learned model. Our experiments on synthetic and real data with synthetic outbreaks show that SPNs are able to improve upon state-of-the-art techniques for detecting outbreaks of emerging diseases.

Moritz Kulessa, Bennet Wittelsbach, Eneldo Loza Mencía, Johannes Fürnkranz

### Catching Patient’s Attention at the Right Time to Help Them Undergo Behavioural Change: Stress Classification Experiment from Blood Volume Pulse

The CAPABLE project aims to improve the wellbeing of cancer patients managed at home via a coaching system that recommends personalized, evidence-based health behavioural change interventions and supports patients' compliance. Focusing on managing stress via a deep breathing intervention, we hypothesise that patients are more likely to perform suggested breathing exercises when they need calming down. To prompt them at the right time, we developed a machine-learning stress detector based on blood volume pulse, which can be measured via consumer-grade smartwatches. We used the publicly available WESAD dataset to evaluate it. A simple 1D CNN achieves a 0.837 average F1-score in binary stress vs. non-stress classification and 0.653 in stress vs. amusement vs. neutral classification, reaching state-of-the-art performance. Personalisation of the population model via fine-tuning on a small number of annotated patient-specific samples yields a 12% improvement in stress vs. amusement vs. neutral classification. In future work we will include additional context information to further refine the timing of the prompt and adjust the exercise level.

Aneta Lisowska, Szymon Wilk, Mor Peleg

### Primary Care Datasets for Early Lung Cancer Detection: An AI Led Approach

Cancer is one of the most common and serious medical conditions, with significant challenges in the detection of cancer originating from the non-specific nature of symptoms and very low prevalence. For general practitioners (GPs), this can be particularly important, as they are the primary contact for patients for most medical conditions. This places high significance on using the data available to a GP to design decision support tools that will aid GPs in detecting cancer as early as possible. With pathology data being one of the datasets available in the GP electronic medical record (EMR), our work targets this type of data in an attempt to incorporate an early cancer detection tool in existing GP practices. We focus on utilizing full blood count pathology results to design features that can be used in an early cancer detection model 3 to 6 months ahead of standard diagnosis. This research focuses initially on lung cancer but can be extended to other types of cancer. Additional challenges are present in this type of data due to the irregular and infrequent nature of doing pathology tests, which are also considered in designing the AI solution. Our findings demonstrate that hematological measures from pathology data are a suitable choice for a cancer detection tool that can deliver early cancer diagnosis up to 6 months ahead for up to 8 out of 10 patients, in a way that is easily incorporated in current GP practice.

Goce Ristanoski, Jon Emery, Javiera Martinez Gutierrez, Damien McCarthy, Uwe Aickelin

### Addressing Extreme Imbalance for Detecting Medications Mentioned in Twitter User Timelines

Tweets mentioning medications are valuable for efforts in digital epidemiology to supplement traditional methods of monitoring public health. A major obstacle, however, is to differentiate them from the large majority of tweets on other topics posted in a user’s timeline: solving the infamous ‘needle in a haystack’ problem. While deep learning models have significantly improved classification, their performance and inference processing time remain low on extremely imbalanced corpora where the tweets of interest are less than 1% of all tweets. In this study, we empirically evaluate under-sampling, fine-tuning, and filtering heuristics to train such classifiers. Using a corpus of 212 Twitter timelines (181,607 tweets with only 0.2% tweets mentioning a medication), our results show that combining these heuristics is necessary to impact the classifier’s performance. In our intrinsic evaluation, a classifier based on a lexicon and a BERT-base neural network achieved a 0.838 F1-score, a score similar to the score achieved by the best classifier on this task during the #SMM4H’20 competition, but it processed the corpus 28 times faster - a positive result, since processing speed is still a roadblock to deploying classifiers on large cohorts of Twitter users needed for pharmacovigilance. In our extrinsic evaluation, our classifier helped a labeler to extract the spans of medications more accurately and achieved a 0.76 Strict F1-score. To the best of our knowledge, this is the first evaluation of medications extraction from Twitter timelines and it establishes the first benchmark for future studies.

Davy Weissenbacher, Siddharth Rawal, Arjun Magge, Graciela Gonzalez-Hernandez

### ICU Days-to-Discharge Analysis with Machine Learning Technology

ICU management depends on the level of occupation and the length of stay of the patients. Daily prediction of the days to discharge (DTD) of ICU patients is essential to that management. Previous studies showed a low predictive capability of internists and ML-generated models. Therefore, more elaborate combinations of ML technologies are required. Here, we present four approaches to the analysis of the DTDs of ICU patients from different perspectives: heterogeneity quantification, biomarker identification, phenotype recognition, and prediction. Several ML-based methods are proposed for each approach and were tested with the data of 3,973 patients of a Spanish ICU. Results confirm the complexity of analyzing DTDs with intelligent data analysis methods.

### Transformers for Multi-label Classification of Medical Text: An Empirical Comparison

Recent advancements in machine learning-based multi-label medical text classification techniques have been used to help enhance healthcare and aid better patient care. This research is motivated by transformers’ success in natural language processing tasks, and the opportunity to further improve performance for medical-domain specific tasks by exploiting models pre-trained on health data. We consider transfer learning involving fine-tuning of pre-trained models for predicting medical codes, formulated as a multi-label problem. We find that domain-specific transformers outperform state-of-the-art results for multi-label problems with the number of labels ranging from 18 to 158, for a fixed sequence length. Additionally, we find that, for longer documents and/or number of labels greater than 300, traditional neural networks still have an edge over transformers. These findings are obtained by performing extensive experiments on the semi-structured eICU data and the free-form MIMIC III data, and applying various transformers including BERT, RoBERTa, and Longformer variations. The electronic health record data used in this research exhibits a high level of label imbalance. Considering individual label accuracy, we find that for eICU data medical-domain specific RoBERTa models achieve improvements for more frequent labels. For infrequent labels, in both datasets, traditional neural networks still perform better.

Vithya Yogarajan, Jacob Montiel, Tony Smith, Bernhard Pfahringer

### Semantic Web Framework to Computerize Staged Reflex Testing Protocols to Mitigate Underutilization of Pathology Tests for Diagnosing Pituitary Disorders

The complex and insidious presentation of certain health conditions, such as pituitary disorders, makes it challenging for primary care providers (PCPs) to render a timely diagnosis, often delaying appropriate treatment for years. In contemporary clinical laboratories, laboratory interventions can appropriately add on extra tests to help confirm or rule out complex disorders. For these protocols to be clinically valid and economically efficient, they require combining knowledge of abnormal test result patterns and patient health data to automatically “reflex” add-on tests and issue comments subsequent to their results. In this paper, we present a Semantic Web based framework for the computerization of reflex testing protocols. To avoid casting too wide a net in terms of add-on tests, a reflex (testing) protocol may include an arbitrary number of stages, where test result patterns in stage n can trigger add-on tests in stage n+1. Our evaluation applies a computerized reflex protocol for pituitary dysfunction to 1-year retrospective data, and compares its accuracy and financial cost with a combined reflex/reflective approach that included manual laboratory clinician intervention.

William Van Woensel, Manal Elnenaei, Syed Ali Imran, Syed Sibte Raza Abidi
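
The staged idea in the abstract above, where result patterns in stage n trigger add-on tests in stage n+1, can be sketched in plain Python. This is only an illustration under stated assumptions, not the Semantic Web framework the paper describes: the function `run_reflex_protocol`, the test names `A`, `B`, `C`, the thresholds, and the `LAB` lookup are all hypothetical.

```python
def run_reflex_protocol(stages, initial_results, lab):
    """Walk through the stages; stop as soon as a stage triggers no add-on tests.

    stages: list of stages, each a list of (pattern, add_on_tests) rules,
            where pattern is a function over the results known so far.
    lab:    hypothetical lookup that stands in for running an add-on test.
    """
    results = dict(initial_results)
    ordered = []
    for rules in stages:
        add_ons = []
        for pattern, tests in rules:
            if pattern(results):
                for test in tests:
                    # Skip tests that already have a result or are already queued.
                    if test not in results and test not in add_ons:
                        add_ons.append(test)
        if not add_ons:
            break  # nothing triggered, so later stages are never reached
        for test in add_ons:
            results[test] = lab[test]
            ordered.append(test)
    return ordered, results


# Hypothetical two-stage protocol with made-up tests and thresholds.
STAGES = [
    [(lambda r: r["A"] < 1.0, ["B"])],             # a low A triggers add-on test B
    [(lambda r: r.get("B", 0.0) > 10.0, ["C"])],   # a high B triggers add-on test C
]
LAB = {"B": 12.0, "C": 3.0}

ordered, final_results = run_reflex_protocol(STAGES, {"A": 0.5}, LAB)
print(ordered)
```

Stopping at the first stage that triggers nothing is what keeps the protocol from "casting too wide a net": later, more specialised tests are only ordered when earlier results justify them.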

### Using Distribution Divergence to Predict Changes in the Performance of Clinical Predictive Models

Clinical predictive models are vulnerable to degradation in performance due to changes in the distribution of the data (distribution divergence) at application time. Significant reductions in model performance can lead to suboptimal medical decisions and harm to patients. Distribution divergence in healthcare data can arise from changes in medical practice, patient demographics, equipment, and measurement standards. However, estimating model performance at application time is challenging when labels are not readily available, which is often the case in healthcare. One solution to this challenge is to develop unsupervised methods of measuring distribution divergence that are predictive of changes in performance of clinical models. In this article, we investigate the capability of divergence metrics that can be computed without labels in estimating model performance under conditions of distribution divergence. In particular, we examine two popular integral probability metrics, i.e., Wasserstein distance and maximum mean discrepancy, and measure their correlation with model performance in the context of predicting mortality and prolonged stay in the intensive care unit (ICU). When models were trained on data from one hospital’s ICU and assessed on data from ICUs in other hospitals, model performance was significantly correlated with the degree of divergence across hospitals as measured by the distribution divergence metrics. Moreover, regression models could predict model performance from divergence metrics with small errors.
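
To make the two label-free divergence measures named above more concrete, here is a minimal pure-Python sketch for one-dimensional samples. It is not the authors' code: the function names, the RBF kernel with `gamma=1.0`, the equal-sample-size restriction, and the toy data are assumptions chosen for illustration.

```python
import math


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size empirical samples this is the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (a - b) ** 2)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


# Toy stand-ins for one feature measured at a source and a target hospital.
source = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 0.5 for x in source]

print(wasserstein_1d(source, shifted))  # 0.5: every value moved by 0.5
print(mmd_squared(source, shifted))
```

Neither function needs outcome labels, which is the property the abstract relies on: the divergence can be computed at application time and then related to model performance by a separate regression.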

### Analysis of Health Screening Records Using Interpretations of Predictive Models

Health screening is conducted in many countries to track general health conditions and find asymptomatic patients. In recent years, large-scale data analyses on health screening records have been utilized to predict patients’ future health conditions. While such predictions are significantly important, it is also of great interest for medical researchers to identify factors that could deteriorate patients’ medical conditions in the future. For this purpose, we propose to use interpretations of trained predictive models. Specifically, we trained machine learning models to predict future diabetes stages, then applied permutation importance, SHapley Additive exPlanations (SHAP), and a sensitivity analysis to extract features that contribute to aggravation. Among the trained models, XGBoost performed best in terms of the Matthews correlation coefficient. Permutation importance and SHAP showed that the model makes good predictions using a number of attributes conventionally known to be related to diabetes, but also those not commonly used in the diagnosis of diabetes. A sensitivity analysis showed that the predictions’ changes were mostly consistent with our intuition on how daily behavior affects type 2 diabetes’s aggravation.

Yuki Oba, Taro Tezuka, Masaru Sanuki, Yukiko Wagatsuma
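
Permutation importance, one of the interpretation methods mentioned above, can be sketched in a few lines of plain Python. The toy model, the two unnamed features, and the data below are hypothetical and are not taken from the paper; the sketch only shows the mechanism of shuffling one feature and measuring the drop in accuracy.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original one.
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


# Hypothetical data: feature 0 determines the label, feature 1 is noise.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0],
        [6.0, 2.0], [7.0, 8.0], [8.0, 4.0], [9.0, 6.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return int(row[0] > 5.0)


importance_signal = permutation_importance(toy_model, rows, labels, feature=0)
importance_noise = permutation_importance(toy_model, rows, labels, feature=1)
print(importance_signal, importance_noise)
```

A feature the model ignores gets an importance of exactly zero, because shuffling it cannot change any prediction; a feature the model depends on loses accuracy when shuffled.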

### Seasonality in Infection Predictions Using Interpretable Models for High Dimensional Imbalanced Datasets

Seasonality plays a significant role in the prevalence of infectious diseases. We evaluate the performance of different approaches used to deal with seasonality in clinical prediction models, including a new proposal based on sliding windows. Class imbalance, high dimensionality and interpretable models are also considered since they are common traits of clinical datasets. We tested these approaches with four datasets: two created synthetically and two extracted from the MIMIC-III database. Our results corroborate that clinical prediction models for infections can be improved by considering the effect of seasonality. However, the techniques employed to obtain the best results are highly dependent on the dataset.

Bernardo Cánovas-Segura, Antonio Morales, Jose M. Juárez, Manuel Campos

### Monitoring Quality of Life Indicators at Home from Sparse, and Low-Cost Sensor Data

Supporting older people, many of whom live with chronic conditions and cognitive and physical impairments, to live independently at home is of increasing importance due to ageing demographics. To aid independent living at home, much effort is being directed at reliably detecting activities from sensor data to monitor people’s quality of life or to enhance self-management of their own health. Current efforts typically leverage large numbers of sensors to overcome challenges in the accurate detection of activities. In this work, we report on the results of machine learning models based on data collected with a small number of low-cost, off-the-shelf passive sensors that were retrofitted in real homes, some with more than a single occupant. Models were developed from sensor data to recognize activities of daily living, such as eating and dressing, as well as meaningful activities, such as reading a book and socializing. We found that a Recurrent Neural Network was most accurate in recognizing activities. However, many activities remain difficult to detect, in particular meaningful activities, which are characterized by high levels of individual personalization.

Dympna O’Sullivan, Rilwan Basaru, Simone Stumpf, Neil Maiden

### Detection of Parkinson's Disease Early Progressors Using Routine Clinical Predictors

Parkinson's disease (PD) is a progressive, neurodegenerative disease characterised by the presence of motor and non-motor symptoms and signs. The symptoms of PD tend to begin very gradually and then become progressively more severe. The rate of PD progression is hard to predict and is different from one person to another. Namely, while in some patients the disease develops fast in just a few years from the diagnosis, in some the disease takes a more idle course and progresses slowly. We aimed to identify patients that develop severe motor symptoms within four years from PD diagnosis (early progressors) and separate them from those in whom severe symptoms develop beyond this point. We used data from the Parkinson’s Progression Markers Initiative (PPMI) dataset to calculate motor progression of the disease by the use of motor scores as assessed by MDS-UPDRS III. The predictors were defined as baseline scores of selected clinical variables and the difference between motor scores at 1-year after enrolment in the study and the same scores at baseline. The rationale for predictor selection was that they should be readily available in routine clinical practice. We tested four different classifiers: logistic regression, decision tree, random forest, and gradient boosting. The best performing classifier was the logistic regression with an area under the ROC curve of 81%. We believe this can be the basis for a reliable and explainable classifier, using only standard clinical variables, for identifying early progressors with high recall (80%) three years in advance.

Marco Cotogni, Lucia Sacchi, Dejan Georgiev, Aleksander Sadikov

### Detecting Mild Cognitive Impairment Using Smooth Pursuit and a Modified Corsi Task

Over 50 million people today live with some form of dementia, as it is the most common neurodegenerative disease in the world. Mild cognitive impairment (MCI) is a stage before dementia symptoms overtly manifest. An estimated 10–15% of patients diagnosed with MCI annually convert to Alzheimer’s dementia. Early detection of MCI is imperative as disease-modifying therapies in development could have the potential to significantly delay disease progression before dementia symptoms develop. There is evidence that observing oculomotor movements during different neuropsychological tasks can serve as a biomarker for MCI. A clinical study with 105 participants was performed at several centres in Ljubljana, Slovenia. All the participants underwent an extensive neurological and psychological evaluation and were, on the basis of this evaluation, divided into two groups: cognitively impaired and healthy controls. At the same time the participants performed several short tasks on the computer screen, including smooth pursuit dot tracking and a modified version of the Corsi block-tapping test. During the tasks, performed using their gaze alone, their eye movements were recorded with an eye-tracker. The eye-tracking data were analysed and a number of features describing the gaze behaviour were proposed. These features were used to construct several machine learning models to predict whether a person exhibits signs of cognitive impairment or not. A model based on a random forest classifier achieved the best performance, with 80% classification accuracy and an area under the ROC curve of 85%.

Alessia Gerbasi, Vida Groznik, Dejan Georgiev, Lucia Sacchi, Aleksander Sadikov

### Neural Clinical Event Sequence Prediction Through Personalized Online Adaptive Learning

Clinical event sequences consist of thousands of clinical events that represent records of patient care in time. Developing accurate prediction models for such sequences is of great importance for defining representations of a patient state and for improving patient care. One important challenge of learning a good predictive model of clinical sequences is patient-specific variability. Based on underlying clinical complications, each patient’s sequence may consist of different sets of clinical events. However, population-based models learned from such sequences may not accurately predict patient-specific dynamics of event sequences. To address the problem, we develop a new adaptive event sequence prediction framework that learns to adjust its prediction for individual patients through an online model update.

Jeong Min Lee, Milos Hauskrecht

### Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings

Current surveillance methods may not capture the full extent of COVID-19 spread in high-risk settings like food establishments. Thus, we propose a new method for surveillance that identifies COVID-19 cases among food establishment workers from news reports via web-scraping and natural language processing (NLP). First, we used web-scraping to identify a broader set of articles (n = 67,078) related to COVID-19 based on keyword mentions. In this dataset, we used an open-source NLP platform (ClarityNLP) to extract location, industry, case, and death counts automatically. These articles were vetted and validated by CDC subject matter experts (SMEs) to identify those containing COVID-19 outbreaks in food establishments. CDC and Georgia Tech Research Institute SMEs provided a human-labeled test dataset containing 388 articles to validate our algorithms. Then, to improve quality, we fine-tuned a pretrained RoBERTa instance, a bidirectional transformer language model, to classify articles containing ≥ 1 positive COVID-19 cases in food establishments. The application of RoBERTa decreased the number of articles from 67,078 to 1,112 and classified articles (≥ 1 positive COVID-19 cases in food establishments) with 88% accuracy in the human-labeled test dataset. Therefore, by automating the pipeline of web-scraping and COVID-19 case prediction using RoBERTa, we enable an efficient human-in-the-loop process by which COVID-19 data could be manually collected from articles flagged by our model, thus reducing the human labor requirements. Furthermore, our approach could be used to predict and monitor locations of COVID-19 development by geography and could also be extended to other industries and news article datasets of interest.

Joseph Miano, Charity Hilton, Vasu Gangrade, Mary Pomeroy, Jacqueline Siven, Michael Flynn, Frances Tilashalski

### Deep Kernel Learning for Mortality Prediction in the Face of Temporal Shift

Neural models, with their ability to provide novel representations, have shown promising results in prediction tasks in healthcare. However, patient demographics, medical technology, and quality of care change over time. This often leads to a drop in the performance of neural models for prospective patients, especially in terms of their calibration. The deep kernel learning (DKL) framework may be robust to such changes as it combines neural models with Gaussian processes, which are aware of prediction uncertainty. Our hypothesis is that out-of-distribution test points will result in probabilities closer to the global mean and hence prevent overconfident predictions. This in turn, we hypothesise, will result in better calibration on prospective data. This paper investigates DKL’s behaviour when facing a temporal shift, which was naturally introduced when an information system that feeds a cohort database was changed. We compare DKL’s performance to that of a neural baseline based on recurrent neural networks. We show that DKL indeed produced better-calibrated predictions. We also confirm that DKL’s predictions were indeed less sharp. In addition, DKL’s discrimination ability was even improved: its AUC was 0.746 (±0.014 std), compared to 0.739 (±0.028 std) for the baseline. The paper demonstrates the importance of including uncertainty in neural computing, especially for prospective use.

Miguel Rios, Ameen Abu-Hanna

### Model Evaluation Approaches for Human Activity Recognition from Time-Series Data

There are many evaluation metrics and methods that can be used to quantify and predict a model’s future performance on previously unknown data. In the area of Human Activity Recognition (HAR), the methodology used to determine the training, validation, and test data can have a significant impact on the reported accuracy. HAR data sets typically contain few test subjects with the data from each subject separated into fixed-length segments. Due to the potential leakage of subject-specific information into the training set, cross-validation techniques can yield erroneously high classification accuracy. In this work, we examine how variations in evaluation methods impact the reported classification accuracy of a 1D-CNN using two popular HAR data sets. Source code is available at https://github.com/imics-lab/model_evaluation_for_HAR.

Lee B. Hinkle, Vangelis Metsis
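
The subject-leakage problem described in the abstract above is commonly avoided by splitting on subjects rather than on individual segments. The following is a minimal sketch of a leave-one-subject-out split. It is not taken from the authors' repository; the function name and the toy subject identifiers are hypothetical.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject that produced segment i, so all segments
    of the held-out subject end up in the test fold and none in training.
    """
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out in sorted(by_subject):
        test_indices = by_subject[held_out]
        train_indices = sorted(
            index
            for subject, indices in by_subject.items()
            if subject != held_out
            for index in indices
        )
        yield held_out, train_indices, test_indices


# Hypothetical toy data: six fixed-length segments from three subjects.
segment_subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(leave_one_subject_out(segment_subjects))
```

A random split over segments would instead place segments from the same subject on both sides of the split, which is the leakage the abstract warns about.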

### Unsupervised Learning to Subphenotype Heart Failure Patients from Electronic Health Records

Heart failure (HF) is a deadly disease and its prevalence is slowly increasing. The sub-types of HF are currently mostly determined by the so-called ejection fraction (EF). In this work, we try to find novel subgroups of heart failure following a completely data-driven approach of clustering patients based on their electronic health records (EHRs). Using a validated phenotyping algorithm, we were able to identify 14,334 adult patients with heart failure in our database. We derived embeddings of patients using two different strategies: one processing aggregated clinical features using principal component analysis (PCA) and uniform manifold approximation and projection (UMAP), and one where we learn embeddings from the sequence of medical events using a long short-term memory (LSTM) autoencoder. We then evaluated different clustering strategies, such as k-means and agglomerative hierarchical clustering, to derive the most informative subtypes. The results were compared using metrics such as the silhouette coefficient and by comparing outcomes such as hospitalization and EF between the clusters. In the most promising result, we were able to identify 3 subclusters using the aggregated data approach in combination with UMAP as the dimension reduction method and k-means as the clustering method. Patients in cluster 1 had the lowest number of hospital days and comorbidities, while patients in cluster 3 had a significantly higher number of hospital days together with a higher prevalence of comorbidities such as chronic kidney disease and atrial fibrillation. Patients in cluster 2 had a high prevalence of drug allergies in their medical history.

Melanie Hackl, Suparno Datta, Riccardo Miotto, Erwin Bottinger

### Stratification of Parkinson’s Disease Patients via Multi-view Clustering

Parkinson’s disease is a neurodegenerative disease characterised by heterogeneity of the sets of symptoms patients experience and the trajectories of disease progression. The PPMI study includes patients’ symptoms describing different aspects of patients’ lives, i.e. motor, non-motor, and autonomic symptoms. This paper proposes a multi-view clustering approach for determining groups of Parkinson’s disease patients from the PPMI study with distinct disease trajectories over 4 years. The proposed multi-view clustering approach searches for groups of patients who share similar disease progression trajectories over multiple types of symptoms. We detected two groups of patients with different disease progression trajectories and significant differences in severity of motor, non-motor, and autonomic symptoms. While we did not detect any significant differences between the patients from the two groups based on their demographics, medication treatment, or disease types, we identified over-sensitivity to bright light as a possible early screening symptom for the type of disease progression.

Anita Valmarska, Nada Lavrač, Marko Robnik–Šikonja

### Disentangled Hyperspherical Clustering for Sepsis Phenotyping

Sepsis is a heterogeneous disease. Clustering sepsis patients into homogeneous subgroups with characteristic phenotypes may help in studying disease progression and in providing targeted therapies. Existing clustering methods use many or all input variables, whereas clinicians investigating subgroup treatment prefer clusters defined by few variables. To address this gap, we propose a soft F-statistic loss that promotes disentangled clusters that differ on a small subset of features. Empirical and qualitative results demonstrate that our method achieves the desired property better than competing methods.

Cheng Cheng, Jason Kennedy, Christopher Seymour, Jeremy C. Weiss

### Phenotypes for Resistant Bacteria Infections Using an Efficient Subgroup Discovery Algorithm

The phenotyping process consists of selecting sets of patients of special interest and identifying their key characteristics. Subgroup Discovery (SD) is a suitable supervised approach for this task. In this work, we propose a two-step process with an efficient SD algorithm (VLA4SD) for an exhaustive exploration of the search space with very effective pruning based on equivalence classes. We use the Coverage and the Incremental Response Rate quality measures to evaluate general and interesting subgroups. The suitability of our approach has been tested by identifying phenotypes of patients in the MIMIC-III open access database.

Antonio Lopez-Martinez-Carrasco, Jose M. Juarez, Manuel Campos, Bernardo Canovas-Segura

### Predicting Drug-Drug Interactions from Heterogeneous Data: An Embedding Approach

Most approaches for predicting drug-drug interactions (DDIs) have focused on text. We present the first work that uses multiple types of drug structure data: images, string representations, and relationship representations. We exploit recent advances in deep networks to integrate these varied sources of input in predicting DDIs. Our empirical evaluations clearly demonstrate the efficacy of combining heterogeneous data in predicting DDIs.

Devendra Singh Dhami, Siwen Yan, Gautam Kunapuli, David Page, Sriraam Natarajan

### Detection of Junctional Ectopic Tachycardia by Central Venous Pressure

Central venous pressure (CVP) is the blood pressure in the venae cavae, near the right atrium of the heart. This signal waveform is commonly collected in clinical settings, and yet there has been limited discussion of using this data for automatically detecting and monitoring arrhythmia and other cardiac events. In this paper, we introduce a signal processing and feature engineering pipeline for CVP waveform analysis. Through a case study on pediatric junctional ectopic tachycardia (JET), we show that our extracted CVP features reliably detect JET with comparable results to the more commonly used electrocardiogram (ECG) features. This machine learning pipeline can thus improve the clinical diagnosis and ICU monitoring of arrhythmia. It can also corroborate and complement the ECG-based diagnosis, especially when the ECG measurements are unavailable or corrupted.

Xin Tan, Yanwan Dai, Ahmed Imtiaz Humayun, Haoze Chen, Genevera I. Allen, Parag N. Jain MD

### A Cautionary Tale on Using Covid-19 Data for Machine Learning

Introduction: Good quality and real-time epidemiological COVID-19 data are paramount to fight this pandemic through statistical/machine-learning based decision-making support mechanisms. Aims: Evaluate the resources available and used to gather COVID-19 epidemiological data by Portuguese health authorities from the onset of the pandemic until December 2020. The analysis focused on two main topics: (a) work processes at the Public Health Unit (PHU) level and (b) registry forms for epidemiological reporting and control procedures. Recommendations on requirements to overcome problems related to data integration and interoperability in order to build robust decision-making support mechanisms will also be produced. Methods: For topic (a), we reviewed the Portuguese Directorate-General of Health (DGS) guidelines for data treatment. For topic (b), we analysed the forms used during the first and second waves, while comparing them with DGS metadata provided to researchers. Results: On topic (a), we detected the use of two complementary and non-interoperable systems. Further, the workflow does not seem to promote data quality and facilitates the occurrence of communication problems between health professionals. On topic (b), we found 27 deleted questions, 6 new questions, 1 displaced question, and 1 text modification between the 2 form versions. Discussion: Neither the workflow nor the data gathering methods are well suited for the generation of good quality data. They do not effectively support Public Health Professionals (PHP), nor do they provide the elements for subsequent data analysis. The use of data by decision-making support mechanisms demands careful planning of the data used to depict reality, and this condition is not met by the currently used forms.

Diogo Nogueira-Leite, João Miguel Alves, Manuel Marques-Cruz, Ricardo Cruz-Correia

### MitPlan 2.0: Enhanced Support for Multi-morbid Patient Management Using Planning

The complexity of patient care is growing due to an ageing population. As chronic illnesses become more common, the incidence of multi-morbidity increases. Generating disease management plans for multi-morbid patients requires the integration of multiple evidence-based interventions, represented as clinical practice guidelines (CPGs), each of which is designed to treat a single condition. Our previous work developed a mitigation framework called MitPlan that represented the generation of treatment as a planning problem. The framework used the Planning Domain Definition Language (PDDL) to represent clinical and patient information needed to identify and mitigate adverse interactions resulting from the concurrent application of multiple CPGs for a given patient encounter. In this paper we describe MitPlan 2.0, which supports shared decision-making by identifying a treatment plan optimized according to patient preferences, treatment cost, or the patient’s perceived adherence to medication. It mitigates adverse interactions using planning constructs, eliminating the need for procedural handling of adverse interactions, and as such provides flexible and comprehensive decision support at the point of care. We demonstrate MitPlan 2.0’s extended capabilities within the context of atrial fibrillation, using synthetic scenarios that approximate real-world clinical use cases.

Martin Michalowski, Malvika Rao, Szymon Wilk, Wojtek Michalowski, Marc Carrier

### Explanations in Digital Health: The Case of Supporting People Lifestyles

Systems that aim to support users in behavior change are expected to implement strategies that can both motivate users and gain their trust, such as the use of human-understandable justifications for the system’s decisions. While the literature has dedicated great effort to making system decisions accurate, less attention has been given to the problem of explaining to the user the reasons for a decision. This work presents a SPARQL-based reasoner enabling explainability in systems designed to support users in following healthy lifestyles. Our results demonstrate that users who received such information were able to reduce unhealthy behaviors over time.

Milene Santos Teixeira, Ivan Donadello, Mauro Dragoni

### Predicting Medical Interventions from Vital Parameters: Towards a Decision Support System for Remote Patient Monitoring

Cardiovascular diseases, and heart failure in particular, are the main cause of non-communicable disease mortality in the world. Constant patient monitoring enables better medical treatment as it allows practitioners to react on time and provide the appropriate treatment. Telemedicine can provide constant remote monitoring so patients can stay in their homes, only requiring medical sensing equipment and network connections. A limiting factor for telemedical centers is the number of patients that can be monitored simultaneously. We aim to increase this number by implementing a decision support system. This paper investigates a machine learning model to estimate a risk score based on patient vital parameters that allows sorting all cases every day to help practitioners focus their limited capacities on the most severe cases. The model we propose reaches an AUCROC of 0.84, whereas the baseline rule-based model reaches an AUCROC of 0.73. Our results indicate that the usage of deep learning to improve the efficiency of telemedical centers is feasible. This way more patients could benefit from better health care through remote monitoring.

Kordian Gontarska, Weronika Wrazen, Jossekin Beilharz, Robert Schmid, Lauritz Thamsen, Andreas Polze

### CAncer PAtients Better Life Experience (CAPABLE) First Proof-of-Concept Demonstration

The CAncer PAtient Better Life Experience (CAPABLE) project combines the most advanced technologies for data and knowledge management with a socio-psychological approach, to develop a coaching system for improving the quality of life of cancer patients managed at home. The team includes complementary expertise in data- and knowledge-driven AI, data integration, telemedicine and decision support. The time is right to fully exploit Artificial Intelligence for cancer care and bring the benefits right to patients’ homes. CAPABLE relies on predictive models based on both retrospective and prospective data, integrated with computer interpretable guidelines and made available to oncologists. CAPABLE’s Virtual Coach component identifies unexpected needs and provides patient-specific decision support and lifestyle guidance to improve the mental and physical wellbeing of patients. The demo, designed around a use-case scenario developed with clinicians involved in the project, addresses the ESMO Diarrhea guideline. It revolves around a prototypical fictional patient named Maria. Maria, 66, is affected by renal cell carcinoma and moderate insomnia. The demo follows Maria during the first three days of using the CAPABLE system. This allows the audience to understand the scope and innovation behind this AI-based decision-support and coaching system that personalizes lifestyle and medication interventions for patients, their carers, and clinicians.

Enea Parimbelli, Matteo Gabetta, Giordano Lanzola, Francesca Polce, Szymon Wilk, David Glasspool, Alexandra Kogan, Roy Leizer, Vitali Gisko, Nicole Veggiotti, Silvia Panzarasa, Rowdy de Groot, Manuel Ottaviano, Lucia Sacchi, Ronald Cornet, Mor Peleg, Silvana Quaglini

### Sensitivity and Specificity Evaluation of Deep Learning Models for Detection of Pneumoperitoneum on Chest Radiographs

Deep learning has great potential to assist with detecting and triaging critical findings such as pneumoperitoneum on medical images. To be clinically useful, the performance of this technology still needs to be validated for generalizability across different types of imaging systems. This retrospective study included 1,287 chest X-ray images of patients who underwent initial chest radiography at 13 different hospitals between 2011 and 2019. State-of-the-art deep learning models were trained on a subset of this dataset, and the automated classification performance was evaluated on the rest of the dataset by measuring the AUC, sensitivity, and specificity. All deep learning models performed well in identifying radiographs with pneumoperitoneum, with DenseNet161 achieving the highest AUC of 95.7%, a specificity of 89.9%, and a sensitivity of 91.6%. The DenseNet161 model was able to accurately classify radiographs from different imaging systems (accuracy of 90.8%), even though it was trained on images captured from a specific imaging system at a single institution. This result suggests the generalizability of our model for learning salient features in chest X-ray images to detect pneumoperitoneum, independent of the imaging system. If verified in clinical settings, this model could assist practitioners with the diagnosis and management of patients with this urgent condition.

Manu Goyal, Judith Austin-Strohbehn, Sean J. Sun, Karen Rodriguez, Jessica M. Sin, Yvonne Y. Cheung, Saeed Hassanpour
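
The sensitivity, specificity, and accuracy reported in the abstract above follow the standard confusion-matrix definitions. The sketch below computes them for binary labels. The function name and the toy labels are hypothetical, and the numbers are unrelated to the study's results. It assumes both classes occur in the ground truth, since otherwise a denominator would be zero.

```python
def sensitivity_specificity_accuracy(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)  # share of positive cases detected
    specificity = tn / (tn + fp)  # share of negative cases correctly cleared
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical toy labels: 1 = finding present, 0 = finding absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec, acc = sensitivity_specificity_accuracy(y_true, y_pred)
```

Unlike these three threshold-dependent measures, the AUC summarizes performance over all decision thresholds, which is why the abstract reports it separately.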

### An Application of Recurrent Neural Networks for Estimating the Prognosis of COVID-19 Patients in Northern Italy

Hospital overloads and limited healthcare resources (ICU beds, ventilators, etc.) are fundamental issues related to the outbreak of the COVID-19 pandemic. Machine learning techniques can help hospitals to recognise in advance the patients at risk of death, and consequently to allocate their resources more efficiently. In this paper we present a tool based on Recurrent Neural Networks to predict the risk of death for hospitalised patients with COVID-19. The features used in our predictive models consist of demographic information, several laboratory tests, and a score that indicates the severity of the pulmonary damage observed in chest X-ray exams. The networks were trained and tested using data from 2000 patients hospitalised in Lombardy, the region most affected by COVID-19 in Italy. The experimental results show good performance in solving the addressed task.

Mattia Chiari, Alfonso E. Gerevini, Matteo Olivato, Luca Putelli, Nicholas Rossetti, Ivan Serina

### Recurrent Neural Network to Predict Renal Function Impairment in Diabetic Patients via Longitudinal Routine Check-up Data

People affected by diabetes are at a high risk of developing diabetic nephropathy, which, in turn, is the leading cause of end-stage chronic kidney disease worldwide. Predicting the onset of renal complications as early as possible, when kidney function is still intact, is of paramount importance for therapy selection due to the existence of a class of antidiabetic agents (SGLT2 inhibitors) with known nephroprotective properties. In the present work, we study the anthropometric and laboratory data of 28,955 diabetic patients followed for a median of 6.6 years (IQR 4.7–7.8) by 14 Italian diabetes outpatient clinics. We develop a deep learning model, based on the incorporation of variable-length longitudinal baseline data via recurrent layers, to predict the onset of impaired kidney function (KDOQI stage ≥ 3). We adopt a multi-label output-coding system to address the irregularity and sparsity in the sampling of endpoints induced by the real-life structure of the data. Using the cumulative/dynamic AUROC with respect to a variable prediction horizon of 1 to 7 years, we compare the proposed model against the predictor of imminent deterioration of kidney function used in clinical practice, i.e., the estimated glomerular filtration rate (eGFR), and a set of year-specific logistic regressions trained on a single baseline visit. The proposed deep learning model generally outperforms both benchmarks, especially in the medium-to-long term, with AUROC ranging from 0.841 to 0.895. Supplementary analyses confirm the effective encoding of sequence data within the network.

Enrico Longato, Gian Paolo Fadini, Giovanni Sparacino, Angelo Avogaro, Barbara Di Camillo

### Counterfactual Explanations for Survival Prediction of Cardiovascular ICU Patients

In recent years, machine learning methods have been rapidly implemented in the medical domain. However, current state-of-the-art methods usually produce opaque, black-box models. To address the lack of model transparency, substantial attention has been given to developing interpretable machine learning methods. In the medical domain, counterfactuals can provide example-based explanations for predictions, and show practitioners the modifications required to change a prediction from an undesired to a desired state. In this paper, we propose a counterfactual explanation solution for predicting the survival of cardiovascular ICU patients, by representing their electronic health record as a sequence of medical events, and generating counterfactuals by adapting a text style-transfer technique. Experimental results on the MIMIC-III dataset strongly suggest that text style-transfer methods can be effectively adapted for the problem of counterfactual explanations in healthcare applications and can achieve competitive performance in terms of counterfactual validity, BLEU-4 and local outlier metrics.

Zhendong Wang, Isak Samsten, Panagiotis Papapetrou

### Improving the Performance of Melanoma Detection in Dermoscopy Images Using Deep CNN Features

Automated deep learning approaches, mainly based on convolutional neural networks (CNNs), have recently attracted significant attention for diagnosing skin cancer (melanoma) from dermoscopic images. However, learning efficient features with these models has been challenging because ample data are unavailable. To address this problem, in this paper, we propose an improved automated system that derives visual features from a contemporary pre-trained deep CNN model (MobileNet) to identify melanoma from dermoscopic images. Further, skin lesion classification is performed using a set of classifiers. The method introduces boundary localization and cropping that help in generating more relevant features. Our proposed method has been validated on the PH² dataset for the classification of non-melanoma and melanoma cases. The experimental results reveal that the suggested approach obtained promising performance compared to state-of-the-art methods.

Himanshu K. Gajera, Mukesh A. Zaveri, Deepak Ranjan Nayak

### Mobile Aided System of Deep-Learning Based Cataract Grading from Fundus Images

Cataract is an ocular disease which requires early detection to avoid reaching a higher severity level. However, there is a worldwide shortage of ophthalmologists and medical imaging devices, which prevents early cataract detection. Our main objective is to propose a high-performance method of cataract grading with low computational cost, making it suitable for mobile devices. The main contribution consists in extracting features through a transfer-learned and fine-tuned MobileNet-V2 model, and deducing the cataract grade using a random forest classifier. The evaluation is conducted using a dataset of 590 fundus images, where 91.43% sensitivity, 89.58% specificity, 90.68% accuracy and 92.75% precision are achieved. In addition, the method implemented on a smartphone requires an average execution time of 1.41 s. Implemented as a smartphone app and paired with an optical lens for retina capture, the method provides a mobile-aided grading system that facilitates cataract diagnosis.

Yaroub Elloumi

### Uncertainty Estimation in SARS-CoV-2 B-Cell Epitope Prediction for Vaccine Development

B-cell epitopes play a key role in stimulating B-cells, triggering the primary immune response which results in antibody production as well as the establishment of long-term immunity in the form of memory cells. Consequently, being able to accurately predict appropriate linear B-cell epitope regions would pave the way for the development of new protein-based vaccines. Knowing how much confidence there is in a prediction is also essential for gaining clinicians’ trust in the technology. In this article, we propose a calibrated uncertainty estimation in deep learning to approximate variational Bayesian inference using MC-DropWeights to predict epitope regions using data from the immune epitope database. Applied to SARS-CoV-2, the method can predict B-cell epitopes more reliably than standard methods. This will make it possible to identify safe and effective vaccine candidates to combat Covid-19.

Bhargab Ghoshal, Biraja Ghoshal, Stephen Swift, Allan Tucker

### Attention-Based Explanation in a Deep Learning Model For Classifying Radiology Reports

Although deep learning techniques have obtained remarkable results in clinical text analysis, the delicacy of this application domain also requires that these models can be easily understood by hospital staff. The attention mechanism, which assigns numerical weights representing the contribution of each word to the predictive task, can be exploited for identifying the textual evidence the prediction is based on. In this paper, we investigate the explainability of an attention-based classification model for radiology reports collected from an Italian hospital. The identified explanations are compared with a set of manual annotations made by domain experts in order to analyze the usefulness of the attention mechanism in our context.

Luca Putelli, Alfonso E. Gerevini, Alberto Lavelli, Roberto Maroldi, Ivan Serina

### Evaluation of Encoder-Decoder Architectures for Automatic Skin Lesion Segmentation

Melanoma is one of the most severe skin cancer types due to its high mortality rate, which can reach 70%. An early diagnosis of the disease is crucial as it increases the ten-year survival rate up to 97%. The segmentation of skin lesions is one of the essential steps of the diagnosis process for accurate melanoma detection. However, even for specialist doctors, segmenting these lesions is costly and challenging due to the wide variety of stains, which can have irregular edges and different dimensions and colors, and due to the large number of exams to analyze. This paper aims to compare encoder-decoder architectures based on popular convolutional neural networks for segmenting dermoscopic images in order to assist in the automatic diagnosis process.

José G. P. Lima, Geraldo Braz Junior, João D. S. de Almeida, Caio E. F. Matos

### A Novel Deep Learning Model for COVID-19 Detection from Combined Heterogeneous X-ray and CT Chest Images

COVID-19 originally started in Wuhan city in China. The disease rapidly became a worldwide pandemic, causing a respiratory illness with symptoms such as coughing, fever, and in more severe cases difficulty in breathing. With the current testing processes, it is very difficult and sometimes impossible to manage and provide the necessary treatment to suspected patients since the number of infected people is rapidly increasing. Hence, an artificial-intelligence-driven system can be an assistive tool to provide accurate diagnosis using radiology imaging techniques. In this paper, we put forward a new deep learning architecture, called DarkCovidNet-NRC, which integrates Nested Residual Connections (NRCs) into a DarkCovidNet model in order to classify chest images and to detect COVID-19 cases. The proposed architecture is validated with the K-fold cross-validation technique on X-ray and CT chest datasets separately and then combined. The experimental results reveal that the suggested model performs very well in the medical classification task and that it competes with the state of the art in multiple performance metrics, achieving an accuracy of 0.9609 and a precision of 0.978 on the combined dataset.

Amir Bouden, Ahmed Ghazi Blaiech, Khaled Ben Khalifa, Asma Ben Abdallah, Mohamed Hédi Bedoui

### An Experiment Environment for Definition, Training and Evaluation of Electrocardiogram-Based AI Models

The use of artificial intelligence (AI) for analysis of electrocardiogram (ECG) data has recently gained much interest in the AI and medical communities. The discussed models have been shown to deliver high diagnostic sensitivity and specificity for detection of various cardiac diseases including rhythm disorders and ischemic events. However, the experiments leading to these results are often difficult to reproduce outside of the original experimental setup, and researchers who want to externally validate such results or use them as starting points for new experiments are forced to develop their own models from scratch. We therefore propose a software environment that makes it possible to build, train and evaluate AI models for ECG classification in a reproducible manner and supports sharing of experiment configurations among researchers. The environment further provides simple connection to publicly available data sources of validated ECG recordings. It offers various validation techniques such as bootstrapping and cross-validation. A proof of concept is given for a deep learning model consisting of a convolutional neural network for the classification of acute myocardial infarction based on ECG data.

Nils Gumpfer, Joshua Prim, Dimitri Grün, Jennifer Hannig, Till Keller, Michael Guckert

### Enhancing the Value of Counterfactual Explanations for Deep Learning

Counterfactual examples can be used to explain a specific clinical prediction from a deep learning model by identifying what kind of feature changes would produce a different result, i.e. flipping the prediction’s classification. On-going research seeks to refine the metrics for discovering counterfactual examples, given a specific input to a deep learning model. Our work enhances this by using feature importance to reveal how much individual feature changes in the counterfactual example contribute to flipping the prediction’s classification, compared with the original. Our approach does not depend on the specific metrics used for generating the counterfactual examples, so it is general. It can be used either to gain further insight when the counterfactual examples have already been generated or to influence the generation of the counterfactual examples. We illustrate this novel approach with a healthcare example.

Yan Jia, John McDermid, Ibrahim Habli

### A Multi-instance Multi-label Weakly Supervised Approach for Dealing with Emerging MeSH Descriptors

The constant evolution of the Medical Subject Headings (MeSH) vocabulary, and specifically the changes in its descriptors, brings forth a number of issues that need automation. The main one is that changed descriptors often lack proper ground-truth articles. Therefore, learning models which demand strong supervision are not directly applicable, which makes prediction for such changes far from straightforward. The importance of this problem is reinforced by its multi-label nature and the fine-grained character of the examined class descriptors, factors that demand a lot of human resources. In this work, we alleviate these issues by retrieving insights from a source of information about those descriptors present in MeSH in order to create a weakly labeled training set. Furthermore, we exploit short-text information per article, implementing an averaging transformation on the corresponding sentence embeddings and applying a similarity mechanism for assigning weak labels to our formatted data set; hence we named our approach WeakMeSH. The benefits of applying the proposed end-to-end approach are examined on a large-scale subset of the BioASQ 2018 data set consisting of 900 thousand instances, investigating two separate groups of MeSH changes: brand new and complex changes. Tested on the BioASQ 2020 data set, our approach proved highly competitive against several other approaches that can either distill weak information on their own or apply alternative transformations to the one proposed.

Nikolaos Mylonas, Stamatis Karlos, Grigorios Tsoumakas

### Demographic Aware Probabilistic Medical Knowledge Graph Embeddings of Electronic Medical Records

Medical knowledge graphs (KGs) constructed from Electronic Medical Records (EMR) contain abundant information about patients and medical entities. The utilization of KG embedding models on these data has proven to be efficient for different medical tasks. However, existing models do not properly incorporate patient demographics and most of them ignore the probabilistic features of the medical KG. In this paper, we propose DARLING (Demographic Aware pRobabiListic medIcal kNowledge embeddinG), a demographic-aware medical KG embedding framework that explicitly incorporates demographics in the medical entities space by associating patient demographics with a corresponding hyperplane. Our framework leverages the probabilistic features within the medical entities for learning their representations through demographic guidance. We evaluate DARLING through link prediction for treatments and medicines, on a medical KG constructed from EMR data, and illustrate its superior performance compared to existing KG embedding models.

Aynur Guluzade, Endri Kacupaj, Maria Maleshkova

### Modeling and Representation by Graphs of the Reasoning of an Emergency Doctor: Symptom Checker MedVir

This article deals with the symptom checker MedVir, which is modeled on the reasoning of an emergency physician. This reasoning is very particular because the physician often has no knowledge of the patient and does not have much time to evaluate the situation. Decisions need to be made rapidly based on diagnostic hypotheses and an estimation of the severity of the patient’s condition. We present a ten-step model of the reasoning of an emergency physician by a four-layer network composed of what we call a “neuronal entity” and a question prioritization algorithm which checks the most important questions. This “neuronal entity” generalizes the neuron concept but differs from those usually used in machine learning. Visualization by graphs displays all the characteristics of each neuron, and each synapse thickness corresponds to the argumentative strength of a question. Hence, these graphs could be very useful in the training of physicians and health professionals.

Loïc Etienne, Francis Faux, Olivier Roecker

### Effect of Depth Order on Iterative Nested Named Entity Recognition Models

This paper studies the effect of the order of depth of mention on nested named entity recognition (NER) models. NER is an essential task in the extraction of biomedical information, and nested entities are common since medical concepts can assemble to form larger entities. Conventional NER systems only predict disjoint entities. Thus, iterative models for nested NER use multiple predictions to enumerate all entities, imposing a predefined order from largest to smallest or smallest to largest. We design an order-agnostic iterative model and a procedure to choose a custom order during training and prediction. We propose a modification of the Transformer architecture to take into account the entities predicted in the previous steps. We provide a set of experiments to study the model’s capabilities and the effects of the order on performance. Finally, we show that the smallest-to-largest order gives the best results.

Perceval Wajsbürt, Yoann Taillé, Xavier Tannier

### The Effectiveness of Phrase Skip-Gram in Primary Care NLP for the Prediction of Lung Cancer

Neural models that use context-dependency in the learned text are computationally expensive. We compare the effectiveness (predictive performance) and efficiency (computational effort) of a context-independent Phrase Skip-Gram (PSG) model and a contextualized Hierarchical Attention Network (HAN) model for early prediction of lung cancer using free-text patient files from Dutch primary care physicians. The performance of PSG (AUROC 0.74 (0.69–0.79)) was comparable to that of HAN (AUROC 0.73 (0.68–0.78)); PSG achieved better calibration, had far fewer parameters (301 versus > 300k), and was much faster (36 versus 460 s). This demonstrates an important case in which the complex contextualized neural models were not required.

Torec T. Luik, Miguel Rios, Ameen Abu-Hanna, Henk C. P. M. van Weert, Martijn C. Schut

### Customized Neural Predictive Medical Text: A Use-Case on Caregivers

Predictive text can speed up authoring of everyday tasks, such as writing an SMS or a URL. When deployed in a clinical setting, it can enable practitioners to compile diagnostic text reports more quickly, hence allowing them to be more time-efficient when examining patients. The language used by medical practitioners when authoring clinical reports is, however, far from uniform, varying not only between practitioners but also between medical units. In this paper, we demonstrate this clinical language variation by showing that a model trained on texts written by some physicians may not work for predicting the text of others. We use a dataset created out of the clinical notes of 17 caregivers to show that language models trained on the notes of each caregiver outperform the ones trained with texts from several caregivers.

John Pavlopoulos, Panagiotis Papapetrou

### Outlier Detection for GP Referrals in Otorhinolaryngology

Medical referrals come in unstructured text form, and it is a challenge to classify them and find outliers among them. While anomaly detection in the text mining domain is not unusual, it is difficult to apply in public health because it requires precision, especially regarding the medical terms used. This paper proposes the use of ensembled machine learning algorithms to perform clinical text mining on the referrals and find outlying referrals based on control parameters. The result is a set of ICD codes that can be traced back to the relevant referral for the clinician to investigate further.

Chee Keong Wee, Nathan Wee

### The Champollion Project: Automatic Structuration of Clinical Features from Medical Records

Cancer is one of the leading causes of mortality worldwide, and as populations age, the burden is growing. Treating increasing numbers of patients enables us to gather detailed medical records. Databases with exhaustive, high-quality structured data are thus an essential resource for cancer researchers and provide invaluable information to clinicians whenever they need to treat their patients. In addition, these databases fuel our data strategy as the cornerstone of our digital healthcare ecosystem, and they provide crucial support for the development of Artificial Intelligence-related projects. Feeding such databases and registries requires manual curation to ensure their quality over time. Finding alternatives to manual structuration is essential because around 80% of the relevant clinical information is contained in open text and it is costly to maintain teams of curators given the growing volumes of data generated every year. In this article we describe an Artificial Intelligence system developed at Institut Curie, capable of structuring clinical features from unstructured Electronic Health Records. Our system allows us to structure clinical data with reduced manual labor and with accuracy comparable to that of expert clinicians, empowering our data ecosystem and improving the support we can give to clinicians and researchers.

Oliver Hijano Cubelos, Thomas Balezeau, Julien Guerin

### Modelling and Assessment of One-Drug Dose Titration

In healthcare, medical errors are quantified, and wrong-dose prescriptions are among them. Drug dose titration (DT) is the process by which dosage is progressively adjusted to the patient until a steady dose is reached. Depending on the clinical disease, drug, and patient, dose titration can follow different procedures. Once modeled, these procedures can serve for clinical homogenization, standardization, decision support, and retrospective analysis. Here, we propose a language to model dose titration procedures. The language was used to formalize single-drug titration of chronic and acute cases, and to perform retrospective analysis of the drug titration processes on 1,000 cases treated with Bisoprolol and 2,430 cases treated with Ramipril, in order to identify different types of drug titration deviations from standard DT methods.

David Riaño, Aida Kamišalić

### TransICD: Transformer Based Code-Wise Attention Model for Explainable ICD Coding

International Classification of Diseases (ICD) coding, which refers to tagging medical notes with diagnosis codes, has been shown to be effective and crucial to the billing system in the medical sector. Currently, ICD codes are assigned to a clinical note manually, which is likely to cause many errors. Moreover, training skilled coders also requires time and human resources. Therefore, automating the ICD code determination process is an important task. With the advancement of artificial intelligence theory and computational hardware, machine learning has emerged as a suitable solution to automate this process. In this project, we apply a transformer-based architecture to capture the interdependence among the tokens of a document and then use a code-wise attention mechanism to learn code-specific representations of the entire document. Finally, these representations are fed to separate dense layers for corresponding code prediction. Furthermore, to handle the imbalance in the code frequency of clinical datasets, we employ a label distribution aware margin (LDAM) loss function. The experimental results on the MIMIC-III dataset show that our proposed model outperforms other baselines by a significant margin. In particular, our best setting achieves a micro-AUC score of 0.923 compared to 0.868 for bidirectional recurrent neural networks. We also show that by using the code-wise attention mechanism, the model can provide more insights about its prediction, and thus it can support clinicians in making reliable decisions. Our code is available online (https://github.com/biplob1ly/TransICD).

Biplob Biswas, Thai-Hoang Pham, Ping Zhang
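
The label distribution aware margin (LDAM) loss mentioned in the TransICD abstract can be illustrated with a small sketch. In the general LDAM formulation, each class receives a margin proportional to n_j^(-1/4), where n_j is the class frequency, so rarer classes get larger margins. The per-code binary cross-entropy adaptation below, the function names, and the `max_margin` default are assumptions made for illustration; this is not taken from the authors' repository.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** (-1/4), rescaled so the
    largest margin (the rarest class) equals `max_margin`."""
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]


def ldam_bce_loss(logits, labels, margins):
    """Mean binary cross-entropy over codes, with the margin subtracted
    from the logit of every positive label (assumed multi-label variant)."""
    total = 0.0
    for z, y, m in zip(logits, labels, margins):
        z_adj = z - m * y  # only positive labels are penalized by the margin
        # Numerically stable BCE with logits.
        total += max(z_adj, 0.0) - z_adj * y + math.log1p(math.exp(-abs(z_adj)))
    return total / len(logits)
```

Under these assumptions, a rare code must be predicted with a larger logit than a frequent one to reach the same loss, which pushes the decision boundary away from under-represented codes.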

### Improving Prediction of Low-Prior Clinical Events with Simultaneous General Patient-State Representation Learning

Low-prior targets are common among many important clinical events, which introduces the challenge of having enough data to support learning of their predictive models. Many prior works have addressed this problem by first building a general patient-state representation model, and then adapting it to a new low-prior prediction target. In this schema, there is potential for the predictive performance to be hindered by the misalignment between the general patient-state model and the target task. To overcome this challenge, we propose a new method that simultaneously optimizes a shared model through multi-task learning of both the low-prior supervised target and general purpose patient-state representation (GPSR). More specifically, our method improves prediction performance of a low-prior task by jointly optimizing a shared model that combines the loss of the target event and a broad range of generic clinical events. We study the approach in the context of Recurrent Neural Networks (RNNs). Through extensive experiments on multiple clinical event targets using MIMIC-III [8] data, we show that the inclusion of general patient-state representation tasks during model training improves the prediction of individual low-prior targets.

Matthew Barren, Milos Hauskrecht

### Identifying Symptom Clusters Through Association Rule Mining

Cancer patients experience many symptoms throughout their cancer treatment and sometimes suffer from lasting effects post-treatment. Patient-Reported Outcome (PRO) surveys provide a means for monitoring patients’ symptoms during and after treatment. Symptom cluster (SC) research seeks to understand these symptoms and their relationships in order to define new treatment and disease management methods that improve patients’ quality of life. This paper introduces association rule mining (ARM) as a novel alternative for identifying symptom clusters. We compare the results to prior research and find that while some of the SCs are similar, ARM uncovers more nuanced relationships between symptoms, such as anchor symptoms that serve as connections between interference and cancer-specific symptoms.

Mikayla Biggs, Carla Floricel, Lisanne Van Dijk, Abdallah S. R. Mohamed, C. David Fuller, G. Elisabeta Marai, Xinhua Zhang, Guadalupe Canahuate
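
As a brief illustration of the association rule mining idea in the abstract above, the sketch below computes the three standard rule measures (support, confidence, and lift) over sets of co-reported symptoms. The function names and any symptom data passed to them are hypothetical; the abstract does not describe the authors' actual pipeline, thresholds, or survey items.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; values above 1
    suggest the symptoms co-occur more often than chance would predict."""
    base = support(transactions, consequent)
    if base == 0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / base
```

Each transaction here would be the set of symptoms one patient reports in one survey; rules with high support, confidence, and lift are candidates for a symptom cluster.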

### A Probabilistic Approach to Extract Qualitative Knowledge for Early Prediction of Gestational Diabetes

Qualitative influence statements are often provided a priori to guide learning; we address the challenging reverse task and automatically extract them from a learned probabilistic model. We apply our Qualitative Knowledge Extraction method toward early prediction of gestational diabetes on clinical study data. Our empirical results demonstrate that the extracted rules are both interpretable and valid.

Athresh Karanam, Alexander L. Hayes, Harsha Kokel, David M. Haas, Predrag Radivojac, Sriraam Natarajan

### Backmatter
