
2019 | Book

Machine Learning and Knowledge Discovery in Databases

European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part III

Editors: Ulf Brefeld, Edward Curry, Elizabeth Daly, Brian MacNamee, Alice Marascu, Fabio Pinelli, Michele Berlingerio, Neil Hurley

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The three-volume proceedings, LNAI 11051–11053, constitute the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, held in Dublin, Ireland, in September 2018.

The 131 regular papers presented in Parts I and II were carefully reviewed and selected from 535 submissions; a further 52 papers appear in the applied data science, nectar and demo tracks.

The contributions were organized in topical sections named as follows:
Part I: adversarial learning; anomaly and outlier detection; applications; classification; clustering and unsupervised learning; deep learning; ensemble methods; and evaluation.
Part II: graphs; kernel methods; learning paradigms; matrix and tensor analysis; online and active learning; pattern and sequence mining; probabilistic models and statistical methods; recommender systems; and transfer learning.
Part III: ADS data science applications; ADS e-commerce; ADS engineering and design; ADS financial and security; ADS health; ADS sensing and positioning; nectar track; and demo track.

Table of Contents

Frontmatter

ADS Data Science Applications

Frontmatter
Neural Article Pair Modeling for Wikipedia Sub-article Matching

Nowadays, editors tend to separate different subtopics of a long Wikipedia article into multiple sub-articles. This separation seeks to improve human readability. However, it also has a deleterious effect on many Wikipedia-based tasks that rely on the article-as-concept assumption, which requires each entity (or concept) to be described solely by one article. This underlying assumption significantly simplifies knowledge representation and extraction, and it is vital to many existing technologies such as automated knowledge base construction, cross-lingual knowledge alignment, semantic search and data lineage of Wikipedia entities. In this paper we provide an approach to match the scattered sub-articles back to their corresponding main-articles, with the intent of facilitating automated Wikipedia curation and processing. The proposed model adopts a hierarchical learning structure that combines multiple variants of neural document pair encoders with a comprehensive set of explicit features. A large crowdsourced dataset is created to support the evaluation and feature extraction for the task. Based on this dataset, the proposed model achieves promising cross-validation results and significantly outperforms previous approaches. Large-scale serving on the entire English Wikipedia also proves the practicability and scalability of the proposed model by effectively extracting a vast collection of newly paired main and sub-articles. Code related to this paper is available at: https://github.com/muhaochen/subarticle .

Muhao Chen, Changping Meng, Gang Huang, Carlo Zaniolo
LinNet: Probabilistic Lineup Evaluation Through Network Embedding

Which of your team’s possible lineups has the best chances against each of your opponent’s possible lineups? To answer this question, we develop LinNet (which stands for LINeup NETwork). LinNet exploits the dynamics of a directed network that captures the performance of lineups during their matchups. The nodes of this network represent the different lineups, while an edge from node B to node A exists if lineup $\lambda_A$ has outperformed lineup $\lambda_B$. We further annotate each edge with the corresponding performance margin (point margin per minute). We then utilize this structure to learn a set of latent features for each node (i.e., lineup) using the node2vec framework. Using these latent, learned features, LinNet builds a logistic regression model for the probability of lineup $\lambda_A$ outperforming lineup $\lambda_B$. We evaluate the proposed method using NBA lineup data from the five seasons between 2007–08 and 2011–12. Our results indicate that our method has an out-of-sample accuracy of 68%. In comparison, utilizing simple network centrality metrics (i.e., PageRank) achieves an accuracy of just 53%, while using the adjusted plus-minus of the players in the lineup for the same prediction problem provides an accuracy of only 55%. We have also explored the adjusted lineups’ plus-minus as our predictors and obtained an accuracy of 59%. Furthermore, the probability output of LinNet is well-calibrated, as indicated by the Brier score and the reliability curve. One of the main benefits of LinNet is its generic nature, which allows it to be applied in different sports, since the only input required is the lineups’ matchup network, i.e., no sport-specific features are needed.

Konstantinos Pelechrinis
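
To make the LinNet pipeline above concrete, here is a minimal Python sketch under stated assumptions: it uses the third-party node2vec package (not necessarily the author's implementation), a toy three-lineup network, and illustrative names such as pair_features.

```python
# Sketch of the LinNet pipeline: build the lineup matchup network,
# embed nodes with node2vec, and fit a logistic regression on matchups.
# Assumes the third-party `node2vec` package (pip install node2vec);
# all data and names below are illustrative placeholders.
import networkx as nx
import numpy as np
from node2vec import Node2Vec
from sklearn.linear_model import LogisticRegression

# Directed edge B -> A means lineup A outperformed lineup B;
# the edge weight stores the point margin per minute.
G = nx.DiGraph()
G.add_edge("lineup_B", "lineup_A", weight=0.12)
G.add_edge("lineup_C", "lineup_A", weight=0.05)
G.add_edge("lineup_A", "lineup_C", weight=0.03)

# Learn latent lineup features with node2vec random walks.
n2v = Node2Vec(G, dimensions=16, walk_length=20, num_walks=50)
emb = n2v.fit(window=5, min_count=1)

def pair_features(a, b):
    # Represent a matchup by the difference of the two lineup embeddings.
    return emb.wv[a] - emb.wv[b]

# Train on observed matchups: label 1 if the first lineup outperformed.
X = np.array([pair_features("lineup_A", "lineup_B"),
              pair_features("lineup_B", "lineup_A")])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([pair_features("lineup_A", "lineup_C")]))
```
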
Improving Emotion Detection with Sub-clip Boosting

With the emergence of systems such as Amazon Echo, Google Home, and Siri, voice has become a prevalent mode for humans to interact with machines. Emotion detection from voice promises to transform a wide range of applications, from adding emotional awareness to voice assistants, to creating more sensitive robotic helpers for the elderly. Unfortunately, due to individual differences, emotion expression varies dramatically, making it a challenging problem. To tackle this challenge, we introduce the Sub-Clip Classification Boosting (SCB) Framework, a multi-step methodology for emotion detection from non-textual features of audio clips. SCB features a highly effective sub-clip boosting methodology for classification that, unlike traditional boosting using feature subsets, instead works at the sub-instance level. Multiple sub-instance classifications increase the likelihood that an emotion cue will be found within a voice clip, even if its location varies between speakers. First, each parent voice clip is decomposed into overlapping sub-clips. Each sub-clip is then independently classified. Further, the Emotion Strength of the sub-classifications is scored to form a sub-classification and strength pair. Finally, we design a FilterBoost-inspired “Oracle”, which utilizes sub-classification and Emotion Strength pairs to determine the parent clip classification. To tune the classification performance, we explore the relationships between sub-clip properties, such as length and overlap. Evaluation on 3 prominent benchmark datasets demonstrates that our SCB method consistently outperforms all state-of-the-art methods across diverse languages and speakers. Code related to this paper is available at: https://arcgit.wpi.edu/toto/EMOTIVOClean .

Ermal Toto, Brendan J. Foley, Elke A. Rundensteiner
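
As a rough illustration of the sub-clip voting scheme described above (not the authors' exact SCB/Oracle design), the sketch below decomposes a signal into overlapping sub-clips and aggregates strength-weighted sub-classifications; featurize and clf are assumed placeholders for a feature extractor and a trained scikit-learn-style classifier.

```python
# Minimal sketch of sub-clip voting: split a clip into overlapping
# sub-clips, classify each independently, and let strength-weighted
# votes decide the parent label.
import numpy as np

def sub_clips(signal, length, overlap):
    """Decompose a 1-D audio signal into overlapping sub-clips."""
    step = length - overlap
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, step)]

def classify_parent(signal, clf, featurize, length=16000, overlap=8000):
    """Aggregate per-sub-clip (label, strength) pairs into a parent label."""
    votes = {}
    for clip in sub_clips(signal, length, overlap):
        probs = clf.predict_proba([featurize(clip)])[0]
        label, strength = int(np.argmax(probs)), float(np.max(probs))
        votes[label] = votes.get(label, 0.0) + strength  # weighted vote
    return max(votes, key=votes.get)
```
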
Machine Learning for Targeted Assimilation of Satellite Data

Optimizing the utilization of huge data sets is a challenging problem for weather prediction. To a significant degree, prediction accuracy is determined by the data used in model initialization, assimilated from a variety of observational platforms. At present, the volume of weather data collected in a given day greatly exceeds the ability of assimilation systems to make use of it. Typically, data is ingested uniformly at the highest fixed resolution that enables the numerical weather prediction (NWP) model to deliver its prediction in a timely fashion. In order to make more efficient use of newly available high-resolution data sources, we seek to identify regions of interest (ROI) where increased data quality or volume is likely to significantly enhance weather prediction accuracy. In particular, we wish to improve the utilization of data from the recently launched Geostationary Operational Environmental Satellite (GOES)-16, which provides orders of magnitude more data than its predecessors. To achieve this, we demonstrate a method for locating tropical cyclones using only observations of precipitable water, which is evaluated using the Global Forecast System (GFS) weather prediction model. Most state-of-the-art hurricane detection techniques rely on multiple feature sets, including wind speed, wind direction, temperature, and IR emissions, potentially from multiple data sources. In contrast, we demonstrate that this model is able to achieve comparable performance on historical tropical cyclone data sets, using only observations of precipitable water.

Yu-Ju Lee, David Hall, Jebb Stewart, Mark Govett
From Empirical Analysis to Public Policy: Evaluating Housing Systems for Homeless Youth

There are nearly 2 million homeless youth in the United States each year. Coordinated entry systems are being used to provide homeless youth with housing assistance across the nation. Despite these efforts, the number of youth still homeless or unstably housed remains very high. Motivated by this fact, we initiate a first study to understand and analyze the current governmental housing systems for homeless youth. In this paper, we aim to provide answers to the following questions: (1) What is the current governmental housing system for assigning homeless youth to different housing assistance? (2) Can we infer the current assignment guidelines of the local housing communities? (3) What is the result and outcome of the current assignment process? (4) Can we predict whether the youth will be homeless after receiving the housing assistance? To answer these questions, we first provide an overview of the current housing systems. Next, we use simple and interpretable machine learning tools to infer the decision rules of the local communities and evaluate the outcomes of such assignment. We then determine whether the vulnerability features/rubrics can be used to predict youth’s homelessness status after receiving housing assistance. Finally, we discuss the policy recommendations from our study for the local communities and the U.S. Housing and Urban Development (HUD).

Hau Chan, Eric Rice, Phebe Vayanos, Milind Tambe, Matthew Morton
Discovering Groups of Signals in In-Vehicle Network Traces for Redundancy Detection and Functional Grouping

Modern vehicles exchange signals across multiple ECUs in order to run various functionalities. With increasing functional complexity, the number of distinct signals has grown too large to be analyzed manually. During development of a car, only subsets of such signals are relevant per analysis and functional group. Moreover, historical growth has led to redundancies in signal specifications which need to be discovered. Both tasks can be solved through the discovery of groups. While the analysis of in-vehicle signals is increasingly studied, the grouping of relevant signals as a basis for those tasks has been examined less. We therefore present and extensively evaluate a processing and clustering approach for semi-automated grouping of in-vehicle signals based on traces recorded from fleets of cars.

Artur Mrowca, Barbara Moser, Stephan Günnemann

ADS E-commerce

Frontmatter
Speeding Up the Metabolism in E-commerce by Reinforcement Mechanism Design

In a large E-commerce platform, all the participants compete for impressions under the allocation mechanism of the platform. Existing methods mainly focus on the short-term return based on current observations instead of the long-term return. In this paper, we formally establish the lifecycle model for products by defining the introduction, growth, maturity and decline stages and their transitions throughout the whole life period. Based on this model, we further propose a reinforcement learning based mechanism design framework for impression allocation, which incorporates a first-principal-component-based permutation and a novel experience generation method, to maximize the short-term as well as the long-term return of the platform. With the power of trial-and-error, it is possible to recognize in advance the potentially hot products in the introduction stage as well as the potentially slow-selling products in the decline stage, so the metabolism can be sped up by an optimal impression allocation strategy. We evaluate our algorithm on a simulated environment built on one of the largest E-commerce platforms, and a significant improvement has been achieved in comparison with the baseline solutions. Code related to this paper is available at: https://github.com/WXFMAV/lifecycle_open .

Hua-Lin He, Chun-Xiang Pan, Qing Da, An-Xiang Zeng
Discovering Bayesian Market Views for Intelligent Asset Allocation

Along with the advance of opinion mining techniques, public mood has been found to be a key element for stock market prediction. However, how market participants’ behavior is affected by public mood has been rarely discussed. Consequently, there has been little progress in leveraging public mood for the asset allocation problem, where a trusted and interpretable approach is preferred. In order to address the issue of incorporating public mood analyzed from social media, we propose to formalize public mood into market views, because market views can be integrated into the modern portfolio theory. In our framework, the optimal market views will maximize returns in each period with a Bayesian asset allocation model. We train two neural models to generate the market views, and benchmark the model performance against other popular asset allocation strategies. Our experimental results suggest that the formalization of market views significantly increases the profitability (5% to 10% annually) of the simulated portfolio at a given risk level.

Frank Z. Xing, Erik Cambria, Lorenzo Malandri, Carlo Vercellis
Intent-Aware Audience Targeting for Ride-Hailing Service

As the market for ride-hailing services is increasing dramatically, an efficient audience targeting system (which aims to identify a group of recipients for a particular message) for ride-hailing services is in high demand for marketing campaigns. In this paper, we describe the details of our deployed system for intent-aware audience targeting on Baidu Maps for ride-hailing services. The objective of the system is to predict user intent for requesting a ride and then send corresponding coupons to the user. For this purpose, we develop a hybrid model that combines an LSTM model and a GBDT model to handle sequential map query data and heterogeneous non-sequential data, which leads to a significant improvement in the performance of the intent prediction. We verify the effectiveness of our method on a large real-world dataset and conduct a large-scale online marketing campaign on the Baidu Maps app. We present an in-depth analysis of the model’s performance and trade-offs. Both offline experiments and the online marketing campaign evaluation show that our method performs consistently well in predicting user intent for a ride request and can significantly increase the click-through rate (CTR) of vehicle coupon targeting compared with baseline methods.

Yuan Xia, Jingbo Zhou, Jingjia Cao, Yanyan Li, Fei Gao, Kun Liu, Haishan Wu, Hui Xiong
A Recurrent Neural Network Survival Model: Predicting Web User Return Time

The size of a website’s active user base directly affects its value. Thus, it is important to monitor and influence a user’s likelihood to return to a site. Essential to this is predicting when a user will return. Current state-of-the-art approaches to this problem come in two flavors: (1) Recurrent Neural Network (RNN) based solutions and (2) survival analysis methods. We observe that both techniques are severely limited when applied to this problem. Survival models can only incorporate aggregate representations of users instead of automatically learning a representation directly from a raw time series of user actions. RNNs can automatically learn features, but cannot be directly trained with examples of non-returning users, who have no target value for their return time. We develop a novel RNN survival model that removes the limitations of the state-of-the-art methods. We demonstrate that this model can successfully be applied to return time prediction on a large e-commerce dataset with a superior ability to discriminate between returning and non-returning users than either method applied in isolation. Code related to this paper is available at: https://github.com/grobgl/rnnsm .

Georg L. Grob, Ângelo Cardoso, C. H. Bryan Liu, Duncan A. Little, Benjamin Paul Chamberlain
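
The following PyTorch sketch illustrates the core idea of marrying an RNN with survival analysis: a GRU encodes raw user-action sequences, and an exponential survival likelihood handles censored, non-returning users. It is a simplified stand-in under these assumptions, not the paper's exact model.

```python
# Sketch of an RNN survival model: a GRU encodes the action time series,
# and an exponential survival likelihood lets the model train on
# censored (non-returning) users as well. All data here is random.
import torch
import torch.nn as nn

class RNNSurvival(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        _, h = self.rnn(x)                        # h: (layers, batch, hidden)
        rate = torch.nn.functional.softplus(self.head(h[-1]))
        return rate.squeeze(-1)                   # hazard rate lambda > 0

def censored_nll(rate, t, observed):
    """Exponential survival negative log-likelihood.

    observed == 1: returned at time t -> -log f(t) = -log(rate) + rate * t
    observed == 0: censored at time t -> -log S(t) = rate * t
    """
    return (rate * t - observed * torch.log(rate + 1e-8)).mean()

model = RNNSurvival(n_features=4)
x = torch.randn(8, 20, 4)                         # batch of action sequences
t = torch.rand(8) * 30                            # days to return / censoring
observed = torch.randint(0, 2, (8,)).float()
loss = censored_nll(model(x), t, observed)
loss.backward()
```
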
Implicit Linking of Food Entities in Social Media

Dining is an important part of people’s lives, which explains why food-related microblogs and reviews are popular in social media. Identifying food entities in food-related posts is important for food lover profiling and food (or restaurant) recommendations. In this work, we conduct Implicit Entity Linking (IEL) to link food-related posts to food entities in a knowledge base. In IEL, we link posts even if they do not contain explicit entity mentions. We first show empirically that food venues are entity-focused and each associated with a limited number of food entities. Hence same-venue posts are likely to share common food entities. Drawing from these findings, we propose an IEL model which incorporates venue-based query expansion of test posts and venue-based prior distributions over entities. In addition, our model assigns larger weights to words that are more indicative of entities. Our experiments on Instagram captions and food reviews show that our proposed model outperforms competitive baselines.

Wen-Haw Chong, Ee-Peng Lim
A Practical Deep Online Ranking System in E-commerce Recommendation

User online shopping experience in modern e-commerce websites critically relies on real-time personalized recommendations. However, building a productionized recommender system still remains challenging due to a massive collection of items, a huge number of online users, and requirements for recommendations to be responsive to user actions. In this work, we present our relevant, responsive, and scalable deep online ranking system (DORS) that we developed and deployed in our company. DORS is implemented in a three-level architecture which includes (1) candidate retrieval, which retrieves a broad set of candidates with various business rules enforced; (2) a deep neural network ranking model that takes advantage of available user- and item-specific features and their interactions; (3) multi-arm bandit based online re-ranking that dynamically takes user real-time feedback and re-ranks the final recommended items at scale. Given a user as a query, DORS is able to precisely capture users’ real-time purchasing intents and help users reach product purchases. Both offline and online experimental results show that DORS provides more personalized online ranking results and makes more revenue.

Yan Yan, Zitao Liu, Meng Zhao, Wentao Guo, Weipeng P. Yan, Yongjun Bao
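
A toy sketch of the three-level flow described above, with placeholder business rules, a placeholder scoring function, and an epsilon-greedy stand-in for the paper's multi-arm bandit re-ranker:

```python
# Sketch of a three-level online ranking flow: rule-based candidate
# retrieval, a learned ranking score, and a simple epsilon-greedy
# re-rank of the head of the list. All names and data are illustrative.
import random

def retrieve(items, rules):
    return [it for it in items if all(rule(it) for rule in rules)]

def rank(candidates, score_fn):
    return sorted(candidates, key=score_fn, reverse=True)

def bandit_rerank(ranked, ctr_estimates, head=10, epsilon=0.1):
    """Epsilon-greedy re-rank of the head: explore a random order with
    probability epsilon, otherwise exploit online CTR estimates."""
    head_items, tail = ranked[:head], ranked[head:]
    if random.random() < epsilon:
        random.shuffle(head_items)
    else:
        head_items.sort(key=lambda it: ctr_estimates.get(it, 0.0),
                        reverse=True)
    return head_items + tail

items = ["sku1", "sku2", "sku3", "sku4"]
rules = [lambda it: it != "sku4"]                 # e.g., in-stock filter
dnn_scores = {"sku1": 0.9, "sku2": 0.4, "sku3": 0.7}
ranked = rank(retrieve(items, rules), dnn_scores.get)
print(bandit_rerank(ranked, ctr_estimates={"sku2": 0.30}, head=2))
```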

ADS Engineering and Design

Frontmatter
Helping Your Docker Images to Spread Based on Explainable Models

Docker is on the rise in today’s enterprise IT. It permits shipping applications inside portable containers, which run from so-called Docker images. Docker images are distributed in public registries, which also monitor their popularity. The popularity of an image impacts its actual usage, and hence the potential revenues for its developers. In this paper, we present a solution based on interpretable decision and regression trees for estimating the popularity of a given Docker image, and for understanding how to improve an image to increase its popularity. The results presented in this work can provide valuable insights to Docker developers, helping them spread their images. Code related to this paper is available at: https://github.com/di-unipi-socc/DockerImageMiner .

Riccardo Guidotti, Jacopo Soldani, Davide Neri, Antonio Brogi, Dino Pedreschi
ST-DenNetFus: A New Deep Learning Approach for Network Demand Prediction

Network demand prediction is of great importance to network planning and to dynamically allocating network resources based on the predicted demand. It can be very challenging, as it is affected by many complex factors, including spatial dependencies, temporal dependencies, and external factors (such as regions’ functionality and crowd patterns, as shown in this paper). We propose a deep learning based approach, called ST-DenNetFus, to predict network demand (i.e., uplink and downlink throughput) in every region of a city. ST-DenNetFus is an end-to-end architecture for capturing unique properties from spatio-temporal data. ST-DenNetFus employs various branches of dense neural networks for capturing temporal closeness, period, and trend properties. For each of these properties, dense convolutional neural units are used for capturing the spatial properties of the network demand across various regions in a city. Furthermore, ST-DenNetFus introduces extra branches for fusing external data sources of various dimensionalities that have not been considered before in the network demand prediction problem. In our case, these external factors are the crowd mobility patterns, temporal functional regions, and the day of the week. We present an extensive experimental evaluation of the proposed approach using two types of network throughput (uplink and downlink) in New York City (NYC), where ST-DenNetFus outperforms four well-known baselines.

Haytham Assem, Bora Caglayan, Teodora Sandra Buda, Declan O’Sullivan
On Optimizing Operational Efficiency in Storage Systems via Deep Reinforcement Learning

This paper deals with the application of deep reinforcement learning to optimize the operational efficiency of a solid state storage rack. Specifically, we train an on-policy and model-free policy gradient algorithm called the Advantage Actor-Critic (A2C). We deploy a dueling deep network architecture to extract features from the sensor readings off the rack and devise a novel utility function that is used to control the A2C algorithm. Experiments show performance gains greater than 30% over the default policy for deterministic as well as random data workloads.

Sunil Srinivasa, Girish Kathalagiri, Julu Subramanyam Varanasi, Luis Carlos Quintela, Mohamad Charafeddine, Chi-Hoon Lee
Automating Layout Synthesis with Constructive Preference Elicitation

Layout synthesis refers to the problem of arranging objects subject to design preferences and structural constraints. Applications include furniture arrangement, space partitioning (e.g. subdividing a house into rooms), urban planning, and other design tasks. Computer-aided support systems are essential tools for architects and designers to produce custom, functional layouts. Existing systems, however, do not learn the designer’s preferences, and therefore fail to generalize across sessions or instances. We propose addressing layout synthesis by casting it as a constructive preference elicitation task. Our solution employs a coactive interaction protocol, whereby the system and the designer interact by mutually improving each other’s proposals. The system iteratively recommends layouts to the user, and learns the user’s preferences by observing her improvements to the recommendations. We apply our system to two design tasks, furniture arrangement and space partitioning, and report promising quantitative and qualitative results on both. Code related to this paper is available at: https://github.com/unitn-sml/constructive-layout-synthesis/tree/master/ecml18 .

Luca Erculiani, Paolo Dragone, Stefano Teso, Andrea Passerini
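
The coactive protocol described above can be summarized by a perceptron-style update. The sketch below is a generic coactive-learning loop (in the spirit of Shivaswamy and Joachims), with phi and designer_improve as assumed placeholders rather than the paper's actual components.

```python
# Sketch of a coactive interaction loop: the system proposes the layout
# that maximizes its current utility estimate, the designer returns an
# improved layout, and a perceptron-style update moves the weights
# toward the improvement. phi() is an illustrative layout feature map.
import numpy as np

def coactive_loop(candidates, phi, designer_improve, n_rounds=10, dim=4):
    w = np.zeros(dim)
    for _ in range(n_rounds):
        proposal = max(candidates, key=lambda c: w @ phi(c))
        improved = designer_improve(proposal)   # user's (partial) improvement
        w += phi(improved) - phi(proposal)      # coactive perceptron update
    return w

# Toy usage: layouts are 2-D points, the designer nudges toward (1, 1).
cands = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
phi = lambda c: np.concatenate([c, c ** 2])
improve = lambda c: np.clip(c + 0.5, 0, 1)
print(coactive_loop(cands, phi, improve, dim=4))
```
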
Configuration of Industrial Automation Solutions Using Multi-relational Recommender Systems

Building complex automation solutions, common to process industries and building automation, requires the selection of components early on in the engineering process. Typically, recommender systems guide the user in the selection of appropriate components and, in doing so, take into account various levels of context information. Many popular shopping basket recommender systems are based on collaborative filtering. While generating personalized recommendations, these methods rely solely on observed user behavior and are usually context free. Moreover, their limited expressiveness makes them less valuable when used for setting up complex engineering solutions. Product configurators based on deterministic, handcrafted rules may better tackle these use cases. However, besides being rather static and inflexible, such systems are laborious to develop and require domain expertise. In this work, we study various approaches to generate recommendations when building complex engineering solutions. Our aim is to exploit statistical patterns in the data that contain a lot of predictive power and are considerably more flexible than strict, deterministic rules. To achieve this, we propose a generic recommendation method for complex, industrial solutions that incorporates both past user behavior and semantic information in a joint knowledge base. This results in a graph-structured, multi-relational data description – commonly referred to as a knowledge graph. In this setting, predicting user preference towards an item corresponds to predicting an edge in this graph. Despite its simplicity concerning data preparation and maintenance, our recommender system proves to be powerful, as shown in extensive experiments with real-world data where our model outperforms several state-of-the-art methods. Furthermore, once our model is trained, recommending new items can be performed efficiently. This ensures that our method can operate in real time when assisting users in configuring new solutions.

Marcel Hildebrandt, Swathi Shyam Sunder, Serghei Mogoreanu, Ingo Thon, Volker Tresp, Thomas Runkler
Learning Cheap and Novel Flight Itineraries

We consider the problem of efficiently constructing cheap and novel round trip flight itineraries by combining legs from different airlines. We analyse the factors that contribute to the price of such itineraries and find that many result from the combination of just 30% of airlines, and that the closer the departure of such itineraries is to the user’s search date, the more likely they are to be cheaper than the tickets from one airline. We use these insights to formulate the problem as a trade-off between the recall of cheap itinerary constructions and the costs associated with building them. We propose a supervised learning solution with location embeddings which achieves an AUC = 80.48, a substantial improvement over simpler baselines. We discuss various practical considerations for dealing with the staleness and the stability of the model and present the design of the machine learning pipeline. Finally, we present an analysis of the model’s performance in production and its impact on Skyscanner’s users.

Dmytro Karamshuk, David Matthews
Towards Resource-Efficient Classifiers for Always-On Monitoring

Emerging applications such as natural user interfaces or smart homes create a rising interest in electronic devices that have always-on sensing and monitoring capabilities. As these devices typically have limited computational resources and require battery-powered operation, the challenge lies in the development of processing and classification methods that can operate under extremely scarce resource conditions. To address this challenge, we propose a two-layered computational model which enables an enhanced trade-off between computational cost and classification accuracy: the bottom layer consists of a selection of state-of-the-art classifiers, each having a different computational cost to generate the required features and to evaluate the classifier itself. For the top layer, we propose to use a Dynamic Bayesian network, which allows us not only to reason about the outputs of the various bottom-layer classifiers, but also to take into account additional information from the past to determine the present state. Furthermore, we introduce the use of the Same-Decision Probability to reason about the added value of the bottom-layer classifiers and selectively activate their computations, dynamically exploiting the computational cost versus classification accuracy trade-off space. We validate our methods on the real-world SINS database, where domestic activities are recorded with an acoustic sensor network, as well as on the Human Activity Recognition (HAR) benchmark dataset.

Jonas Vlasselaer, Wannes Meert, Marian Verhelst

ADS Financial/Security

Frontmatter
Uncertainty Modelling in Deep Networks: Forecasting Short and Noisy Series

Deep Learning is a consolidated, state-of-the-art Machine Learning tool to fit a function $y=f(x)$ when provided with large data sets of examples $\{(x_i, y_i)\}$. However, in regression tasks, the straightforward application of Deep Learning models provides a point estimate of the target. In addition, the model does not take into account the uncertainty of a prediction. This represents a great limitation for tasks where communicating an erroneous prediction carries a risk. In this paper we tackle a real-world problem of forecasting impending financial expenses and incomings of customers, while displaying predictable monetary amounts on a mobile app. In this context, we investigate whether we would obtain an advantage by applying Deep Learning models with a heteroscedastic model of the variance of a network’s output. Experimentally, we achieve a higher accuracy than non-trivial baselines. More importantly, we introduce a mechanism to discard low-confidence predictions, which means that they will not be visible to users. This should help enhance the user experience of our product.

Axel Brando, Jose A. Rodríguez-Serrano, Mauricio Ciprian, Roberto Maestre, Jordi Vitrià
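
A minimal PyTorch sketch of the heteroscedastic idea: the network outputs a mean and a log-variance, is trained with the Gaussian negative log-likelihood, and predictions with high predicted variance are discarded. The threshold and architecture are illustrative, not the paper's.

```python
# Sketch of a heteroscedastic regression head: predict a mean and a
# log-variance per input, train with the Gaussian NLL, and hide
# low-confidence predictions at serving time. Data here is random.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

def gaussian_nll(out, y):
    mu, log_var = out[:, 0], out[:, 1]
    return (0.5 * log_var + 0.5 * (y - mu) ** 2 / log_var.exp()).mean()

x, y = torch.randn(32, 10), torch.randn(32)
loss = gaussian_nll(net(x), y)
loss.backward()

# At serving time, display only predictions whose variance is small.
with torch.no_grad():
    out = net(x)
    confident = out[:, 1].exp() < 1.0    # illustrative threshold
```
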
Using Reinforcement Learning to Conceal Honeypot Functionality

Automated malware employs honeypot-detecting mechanisms within its code. Once honeypot functionality has been exposed, malware such as botnets will cease the attempted compromise. Subsequent malware variants employ similar techniques to evade detection by known honeypots. This reduces the potential size of a captured dataset and the scope of subsequent analysis. This paper presents findings on the deployment of a honeypot that uses reinforcement learning to conceal functionality. The adaptive honeypot learns the best responses to overcome initial detection attempts by implementing a reward function with the goal of maximising attacker command transitions. The paper demonstrates that the honeypot quickly identifies the best response to overcome initial detection and subsequently increases attack command transitions. It also examines the structure of a captured botnet and charts the learning evolution of the honeypot for repetitive automated malware. Finally, it suggests changes to an existing taxonomy governing honeypot development, based on the learning evolution of the adaptive honeypot. Code related to this paper is available at: https://github.com/sosdow/RLHPot .

Seamus Dowling, Michael Schukat, Enda Barrett
Flexible Inference for Cyberbully Incident Detection

We study detection of cyberbully incidents in online social networks, focusing on session-level analysis. We propose several variants of a customized convolutional neural network (CNN) approach, which processes users’ comments largely independently in the front-end layers, while also accounting for possible conversational patterns. The front-end layer’s outputs are then combined by one of our designed output layers, namely either a max layer or a novel sorting layer proposed here. Our CNN models outperform existing baselines and are able to achieve classification accuracy of up to 84.29% for cyberbullying and 83.08% for cyberaggression.

Haoti Zhong, David J. Miller, Anna Squicciarini
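
To illustrate the proposed output layers, the PyTorch sketch below implements a sorting-based session head: per-comment scores are sorted, and a learned linear layer over the sorted scores generalizes a plain max. The per-comment CNN front end is assumed and omitted.

```python
# Sketch of a sorting output layer: per-comment scores for a session
# are sorted, and a learned linear layer over the sorted scores
# produces the session-level prediction.
import torch
import torch.nn as nn

class SortingHead(nn.Module):
    def __init__(self, n_comments):
        super().__init__()
        self.out = nn.Linear(n_comments, 1)

    def forward(self, comment_scores):           # (batch, n_comments)
        srt, _ = torch.sort(comment_scores, dim=1, descending=True)
        return torch.sigmoid(self.out(srt))      # session-level probability

head = SortingHead(n_comments=20)
scores = torch.randn(4, 20)                      # outputs of per-comment CNNs
print(head(scores).shape)                        # torch.Size([4, 1])
```
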
Solving the False Positives Problem in Fraud Prediction Using Automated Feature Engineering

In this paper, we present an automated feature engineering based approach to dramatically reduce false positives in fraud prediction. False positives plague the fraud prediction industry: it is estimated that only 1 in 5 transactions declared as fraud is actually fraudulent, and roughly 1 in every 6 customers has had a valid transaction declined in the past year. To address this problem, we use the Deep Feature Synthesis algorithm to automatically derive behavioral features based on the historical data of the card associated with a transaction. We generate 237 features (>100 behavioral patterns) for each transaction, and use a random forest to learn a classifier. We tested our machine learning model on data from a large multinational bank and compared it to their existing solution. On unseen data of 1.852 million transactions, we were able to reduce the false positives by 54% and provide savings of 190K euros. We also assess how to deploy this solution, and whether it necessitates streaming computation for real-time scoring. We found that our solution can maintain similar benefits even when historical features are computed once every 7 days.

Roy Wedge, James Max Kanter, Kalyan Veeramachaneni, Santiago Moral Rubio, Sergio Iglesias Perez
Learning Tensor-Based Representations from Brain-Computer Interface Data for Cybersecurity

Understanding, modeling, and explaining neural data is a challenging task. In this paper, we learn tensor-based representations of electroencephalography (EEG) data to classify and analyze the underlying neural patterns related to phishing detection tasks. Specifically, we conduct a phishing detection experiment to collect the data, and apply tensor factorization to it for feature extraction and interpretation. Traditional feature extraction techniques, like power spectral density, autoregressive models, and the Fast Fourier transform, can only represent data in either the spatial or the temporal dimension; our tensor modeling, however, leverages both spatial and temporal traits in the input data. We perform a comprehensive analysis of the neural data and show the practicality of multi-way neural data analysis. We demonstrate that using tensor-based representations, we can classify real and phishing websites with accuracy as high as 97%, which outperforms state-of-the-art approaches in the same task by 21%. Furthermore, the extracted latent factors are interpretable, and provide insights with respect to the brain’s response to real and phishing websites.

Md. Lutfor Rahman, Sharmistha Bardhan, Ajaya Neupane, Evangelos Papalexakis, Chengyu Song
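
A small sketch of the tensor-based representation idea, assuming the TensorLy library and random stand-in EEG data: trials are stacked into a channels × time × trials tensor, a CP (PARAFAC) decomposition extracts latent factors, and the trial-mode factors can feed a classifier.

```python
# Sketch of tensor-based EEG feature extraction with TensorLy: run a
# rank-R CP decomposition on a channels x time x trials tensor and use
# the trial-mode factors as classifier features. Data here is random.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

eeg = tl.tensor(np.random.randn(32, 256, 40))   # channels x time x trials
weights, factors = parafac(eeg, rank=5)
trial_features = factors[2]                     # (40 trials, 5 factors)
print(trial_features.shape)
```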

ADS Health

Frontmatter
Can We Assess Mental Health Through Social Media and Smart Devices? Addressing Bias in Methodology and Evaluation

Predicting mental health from smartphone and social media data on a longitudinal basis has recently attracted great interest, with very promising results being reported across many studies [3, 9, 13, 26]. Such approaches have the potential to revolutionise mental health assessment, if their development and evaluation follow a real-world deployment setting. In this work we take a closer look at state-of-the-art approaches, using different mental health datasets and indicators, different feature sources and multiple simulations, in order to assess their ability to generalise. We demonstrate that under a pragmatic evaluation framework, none of the approaches deliver or even approach the reported performances. In fact, we show that current state-of-the-art approaches can barely outperform the most naïve baselines in the real-world setting, posing serious questions not only about their deployment ability, but also about the contribution of the derived features for the mental health assessment task and how to make better use of such data in the future.

Adam Tsakalidis, Maria Liakata, Theo Damoulas, Alexandra I. Cristea
AMIE: Automatic Monitoring of Indoor Exercises

Patients with sports-related injuries need to learn to perform rehabilitative exercises with correct movement patterns. Unfortunately, the feedback a physiotherapist can provide is limited by the number of physical therapy appointments. We study the feasibility of a system that automatically provides feedback on correct movement patterns to patients using a Microsoft Kinect camera and Machine Learning techniques. We discuss several challenges related to the Kinect’s proprietary software, the Kinect data’s heterogeneity, and the Kinect data’s temporal component. We introduce AMIE, a machine learning pipeline that detects the exercise being performed, the exercise’s correctness, and if applicable, the mistake that was made. To evaluate AMIE, ten participants were instructed to perform three types of typical rehabilitation exercises (squats, forward lunges and side lunges) demonstrating both correct movement patterns and frequent types of mistakes, while being recorded with a Kinect. AMIE detects the type of exercise almost perfectly with 99% accuracy and the type of mistake with 73% accuracy. Code related to this paper is available at: https://dtai.cs.kuleuven.be/software/amie .

Tom Decroos, Kurt Schütte, Tim Op De Beéck, Benedicte Vanwanseele, Jesse Davis
Rough Set Theory as a Data Mining Technique: A Case Study in Epidemiology and Cancer Incidence Prediction

A big challenge in epidemiology is to perform data pre-processing, specifically feature selection, on large-scale data sets with a high-dimensional feature set. In this paper, this challenge is tackled by using a recently established distributed and scalable version of Rough Set Theory (RST). It considers epidemiological data that has been collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set used aggregates about 5,495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real-world big data applications as it can offer insights into the selected risk factors, speed up the learning process, ensure the performance of the cancer incidence prediction model without huge information loss, and simplify the learned model for epidemiologists. Code related to this paper is available at: https://github.com/zeinebchelly/Sp-RST .

Zaineb Chelly Dagdia, Christine Zarges, Benjamin Schannes, Martin Micalef, Lino Galiana, Benoît Rolland, Olivier de Fresnoye, Mehdi Benchoufi
Bayesian Best-Arm Identification for Selecting Influenza Mitigation Strategies

Pandemic influenza has the epidemic potential to kill millions of people. While various preventive measures exist (i.a., vaccination and school closures), deciding on strategies that lead to their most effective and efficient use remains challenging. To this end, individual-based epidemiological models are essential to assist decision makers in determining the best strategy to curb epidemic spread. However, individual-based models are computationally intensive and it is therefore pivotal to identify the optimal strategy using a minimal amount of model evaluations. Additionally, as epidemiological modeling experiments need to be planned, a computational budget needs to be specified a priori. Consequently, we present a new sampling technique to optimize the evaluation of preventive strategies using fixed budget best-arm identification algorithms. We use epidemiological modeling theory to derive knowledge about the reward distribution which we exploit using Bayesian best-arm identification algorithms (i.e., Top-two Thompson sampling and BayesGap). We evaluate these algorithms in a realistic experimental setting and demonstrate that it is possible to identify the optimal strategy using only a limited number of model evaluations, i.e., 2-to-3 times faster compared to the uniform sampling method, the predominant technique used for epidemiological decision making in the literature. Finally, we contribute and evaluate a statistic for Top-two Thompson sampling to inform the decision makers about the confidence of an arm recommendation. Code related to this paper is available at: https://plibin-vub.github.io/epidemic-bandits .

Pieter J. K. Libin, Timothy Verstraeten, Diederik M. Roijers, Jelena Grujic, Kristof Theys, Philippe Lemey, Ann Nowé
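
As a compact illustration of one of the evaluated algorithms, here is a Top-two Thompson sampling sketch with Gaussian posteriors, where each arm is a mitigation strategy and each pull one costly epidemiological model run. The numbers and the beta parameter value are illustrative.

```python
# Sketch of Top-two Thompson sampling: sample from each arm's posterior
# and pick the best-looking arm, but with probability (1 - beta)
# re-sample until a *different* arm looks best, to force exploration
# of the current challenger.
import numpy as np

rng = np.random.default_rng(0)

def ttts_pick(means, stds, beta=0.5, max_tries=100):
    """Return the index of the arm (strategy) to evaluate next."""
    best = int(np.argmax(rng.normal(means, stds)))
    if rng.random() < beta:
        return best
    for _ in range(max_tries):
        other = int(np.argmax(rng.normal(means, stds)))
        if other != best:
            return other
    return best

# Posterior over each strategy's expected outcome (illustrative values).
means = np.array([0.2, 0.5, 0.4])
stds = np.array([0.3, 0.3, 0.3])
print(ttts_pick(means, stds))
```
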
Hypotensive Episode Prediction in ICUs via Observation Window Splitting

Hypotension, defined as dangerously low blood pressure, is a significant risk factor in intensive care units (ICUs) and requires a prompt therapeutic intervention. The goal of our research is to predict an impending Hypotensive Episode (HE) by time series analysis of continuously monitored physiological vital signs. Our prognostic model is based on the last Observation Window (OW) at the prediction time. Existing clinical episode prediction studies used a single OW of 5–120 min to extract predictive features, with no significant improvement reported when longer OWs were used. In this work we have developed the In-Window Segmentation (InWiSe) method for time series prediction, which splits a single OW into several sub-windows of equal size. The resulting feature set combines the features extracted from each observation sub-window, and this combined set is used by the Extreme Gradient Boosting (XGBoost) binary classifier to produce an episode prediction model. We evaluate the proposed approach on three retrospective ICU datasets (extracted from the MIMIC II, Soroka and Hadassah databases) using cross-validation on each dataset separately, as well as by cross-dataset validation. The results show that InWiSe is superior to existing methods in terms of the area under the ROC curve (AUC).

Elad Tsur, Mark Last, Victor F. Garcia, Raphael Udassin, Moti Klein, Evgeni Brotfain
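
A minimal sketch of the In-Window Segmentation idea, assuming the xgboost package and synthetic data: the observation window is split into equal sub-windows, per-sub-window statistics are concatenated, and XGBoost is trained on the combined feature set. The statistics chosen here are illustrative, not the paper's feature set.

```python
# Sketch of In-Window Segmentation: split the observation window into
# equal sub-windows, extract simple statistics from each, concatenate,
# and train an XGBoost classifier on the combined features.
import numpy as np
from xgboost import XGBClassifier

def inwise_features(window, n_sub=4):
    """window: (timesteps, signals) -> concatenated sub-window stats."""
    feats = []
    for sub in np.array_split(window, n_sub):
        feats.extend([sub.mean(axis=0), sub.std(axis=0),
                      sub.min(axis=0), sub.max(axis=0)])
    return np.concatenate(feats)

# Illustrative data: 60-minute windows of 3 vital signs, binary HE label.
windows = np.random.randn(100, 60, 3)
labels = np.random.randint(0, 2, 100)
X = np.array([inwise_features(w) for w in windows])
clf = XGBClassifier(n_estimators=50).fit(X, labels)
```
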
Equipment Health Indicator Learning Using Deep Reinforcement Learning

Predictive Maintenance (PdM) is gaining popularity in industrial operations as it leverages the power of Machine Learning and the Internet of Things (IoT) to predict the future health status of equipment. Health Indicator Learning (HIL) plays an important role in PdM as it learns a health curve representing the health conditions of equipment over time, so that health degradation can be visually monitored and optimal planning performed accordingly to minimize the equipment downtime. However, HIL is a hard problem due to the fact that there is usually no way to access the actual health of the equipment during most of its operation. Traditionally, HIL is addressed by hand-crafting domain-specific performance indicators or through physical modeling, which is expensive and inapplicable for some industries. In this paper, we propose a purely data-driven approach for solving the HIL problem based on Deep Reinforcement Learning (DRL). Our key insight is that the HIL problem can be mapped to a credit assignment problem. Then DRL learns from failures by naturally backpropagating the credit of failures into intermediate states. In particular, given the observed time series of sensor, operating and event (failure) data, we learn a sequence of health indicators that represent the underlying health conditions of physical equipment. We demonstrate that the proposed methods significantly outperform the state-of-the-art methods for HIL and provide explainable insights about the equipment health. In addition, we propose the use of the learned health indicators to predict when the equipment is going to reach its end-of-life, and demonstrate how an explainable health curve is far more useful for a decision maker than a single-number prediction by a black-box model. The proposed approach has a great potential in a broader range of systems (e.g., economical and biological) as a general framework for the automatic learning of the underlying performance of complex systems.

Chi Zhang, Chetan Gupta, Ahmed Farahat, Kosta Ristovski, Dipanjan Ghosh

ADS Sensing/Positioning

Frontmatter
PBE: Driver Behavior Assessment Beyond Trajectory Profiling

Nowadays, the increasing number of car accidents calls for better driver behavior analysis and risk assessment for travel safety, auto insurance pricing and smart city applications. Traditional approaches largely use GPS data to assess drivers. However, it is difficult to assess time-varying driving behaviors at a fine granularity. In this paper, we employ the increasingly popular On-Board Diagnostic (OBD) equipment, which measures semantic-rich vehicle information, to extract detailed trajectory and behavior data for analysis. We propose the PBE system, which consists of a Trajectory Profiling Model (PM), a Driver Behavior Model (BM) and a Risk Evaluation Model (EM). PM profiles trajectories for reminding drivers of danger in real time. The labeled trajectories can be utilized to boost the training of BM and EM for driver risk assessment when data is incomplete. BM evaluates the driving risk using fine-grained driving behaviors on a trajectory level. Its output, incorporating the time-varying pattern, is combined with the driver-level demographic information for the final driver risk assessment in EM. Meanwhile, the whole PBE system also considers real-world cost-sensitive application scenarios. Extensive experiments on the real-world dataset demonstrate that the performance of PBE in risk assessment outperforms traditional systems by at least 21%.

Bing He, Xiaolin Chen, Dian Zhang, Siyuan Liu, Dawei Han, Lionel M. Ni
Accurate WiFi-Based Indoor Positioning with Continuous Location Sampling

The ubiquity of WiFi access points and the sharp increase in WiFi-enabled devices carried by humans have paved the way for WiFi-based indoor positioning and location analysis. Locating people in indoor environments has numerous applications in robotics, crowd control, indoor facility optimization, and automated environment mapping. However, existing WiFi-based positioning systems suffer from two major problems: (1) their accuracy and precision are limited due to inherent noise induced by indoor obstacles, and (2) they only occasionally provide location estimates, namely when a WiFi-equipped device emits a signal. To mitigate these two issues, we propose a novel Gaussian process (GP) model for WiFi signal strength measurements. It allows for simultaneous smoothing (increasing accuracy and precision of estimators) and interpolation (enabling continuous sampling of location estimates). Furthermore, simple and efficient smoothing methods for location estimates are introduced to improve localization performance in real-time settings. Experiments are conducted on two data sets from a large real-world commercial indoor retail environment. Results demonstrate that our approach provides significant improvements in terms of precision and accuracy with respect to unfiltered data. Ultimately, the GP model realizes continuous location sampling with consistently high quality location estimates.

J. E. van Engelen, J. J. van Lier, F. W. Takes, H. Trautmann
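
The smoothing-plus-interpolation idea can be sketched with scikit-learn's Gaussian process tools: an RBF kernel smooths the noisy RSSI readings of one device-access point pair, a white-noise kernel absorbs measurement noise, and prediction at arbitrary timestamps enables continuous sampling. Data values are illustrative.

```python
# Sketch of GP smoothing/interpolation of WiFi signal strength over
# time: fit a GP to sparse, noisy RSSI observations and query it at a
# regular grid of timestamps to obtain denoised, continuous estimates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t_obs = np.array([[0.0], [4.0], [9.0], [13.0]])     # seconds
rssi = np.array([-62.0, -60.5, -71.0, -69.0])       # observed dBm

kernel = RBF(length_scale=5.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel).fit(t_obs, rssi)

# Continuous sampling: estimate the (smoothed) signal every second.
t_query = np.arange(0.0, 14.0, 1.0).reshape(-1, 1)
mean, std = gp.predict(t_query, return_std=True)
print(mean[:5], std[:5])
```
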
Human Activity Recognition with Convolutional Neural Networks

The problem of automatic identification of physical activities performed by human subjects is referred to as Human Activity Recognition (HAR). There exist several techniques to measure motion characteristics during these physical activities, such as Inertial Measurement Units (IMUs). IMUs have a cornerstone position in this context, and are characterized by usage flexibility, low cost, and reduced privacy impact. With the use of inertial sensors, it is possible to sample some measures such as acceleration and angular velocity of a body, and use them to learn models that are capable of correctly classifying activities to their corresponding classes. In this paper, we propose to use Convolutional Neural Networks (CNNs) to classify human activities. Our models use raw data obtained from a set of inertial sensors. We explore several combinations of activities and sensors, showing how motion signals can be adapted to be fed into CNNs by using different network architectures. We also compare the performance of different groups of sensors, investigating the classification potential of single, double and triple sensor systems. The experimental results obtained on a dataset of 16 lower-limb activities, collected from a group of participants with the use of five different sensors, are very promising.

Antonio Bevilacqua, Kyle MacDonald, Aamina Rangarej, Venessa Widjaya, Brian Caulfield, Tahar Kechadi
Urban Sensing for Anomalous Event Detection: Distinguishing Between Legitimate Traffic Changes and Abnormal Traffic Variability

Sensors deployed in different parts of a city continuously record traffic data, such as vehicle flows and pedestrian counts. We define an unexpected change in the traffic counts as an anomalous local event. Reliable discovery of such events is very important in real-world applications such as real-time crash detection or traffic congestion detection. One of the main challenges to detecting anomalous local events is to distinguish them from legitimate global traffic changes, which happen due to seasonal effects, weather and holidays. Existing anomaly detection techniques often raise many false alarms for these legitimate traffic changes, making such techniques less reliable. To address this issue, we introduce an unsupervised anomaly detection system that represents relationships between different locations in a city. Our method uses training data to estimate the traffic count at each sensor location given the traffic counts at the other locations. The estimation error is then used to calculate the anomaly score at any given time and location in the network. We test our method on two real traffic datasets collected in the city of Melbourne, Australia, for detecting anomalous local events. Empirical results show the greater robustness of our method to legitimate global changes in traffic count than four benchmark anomaly detection methods examined in this paper. Data related to this paper are available at: https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/147696bb47544a209e0a5e79e165d1b0_0 .

Masoomeh Zameni, Mengyi He, Masud Moshtaghi, Zahra Ghafoori, Christopher Leckie, James C. Bezdek, Kotagiri Ramamohanarao
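
A bare-bones sketch of the estimation-error idea on synthetic counts: each sensor is predicted from all others with a linear model (a stand-in for the paper's estimator), and the scaled residual serves as the anomaly score, so city-wide legitimate changes that move all sensors together stay unflagged.

```python
# Sketch of cross-location anomaly scoring: predict each sensor's count
# from all other sensors and use the normalized residual as the score.
import numpy as np
from sklearn.linear_model import LinearRegression

counts = np.random.poisson(100, size=(1000, 5)).astype(float)  # time x sensor
models, residual_std = [], []
for s in range(counts.shape[1]):
    X = np.delete(counts, s, axis=1)
    model = LinearRegression().fit(X, counts[:, s])
    models.append(model)
    residual_std.append((counts[:, s] - model.predict(X)).std() + 1e-8)

def anomaly_score(observation, s):
    """Normalized estimation error for sensor s at one time step."""
    X = np.delete(observation, s).reshape(1, -1)
    return abs(observation[s] - models[s].predict(X)[0]) / residual_std[s]

print(anomaly_score(counts[0], 2))
```
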
Combining Bayesian Inference and Clustering for Transport Mode Detection from Sparse and Noisy Geolocation Data

Large-scale and real-time transport mode detection is an open challenge for smart transport research. Although massive mobility data is collected from smartphones, mining mobile network geolocation is non-trivial, as it is sparse, coarse and noisy data for which real transport labels are unknown. In this study, we process billions of Call Detail Records from the Greater Paris area and present the first method for transport mode detection of any traveling device. Cellphone trajectories, which are anonymized and aggregated, are constructed as sequences of visited locations, called sectors. Clustering and Bayesian inference are combined to estimate transport probabilities for each trajectory. First, we apply clustering on sectors. Features are constructed using spatial information from mobile networks and transport networks. Then, we extract a subset of 15% of sectors having road and rail labels (e.g., train stations), while the remaining sectors are multi-modal. The proportion of labels per cluster is used to calculate transport probabilities given each visited sector. Thus, with Bayesian inference, each record updates the transport probability of the trajectory, without requiring the exact itinerary. For validation, we use a travel survey to compare daily average trips per user. With Pearson correlations reaching 0.96 for road and rail trips, the model appears performant and robust to noise and sparsity.

Danya Bachir, Ghazaleh Khodabandelou, Vincent Gauthier, Mounim El Yacoubi, Eric Vachon
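
The per-record Bayesian update can be sketched in a few lines; the sector likelihoods below are illustrative placeholders for the cluster label proportions the paper derives.

```python
# Sketch of the per-record Bayesian update: each visited sector
# contributes a likelihood P(sector | mode), and the trajectory's mode
# posterior is updated record by record.
import numpy as np

modes = ["road", "rail"]
# P(sector | mode) per sector, e.g. from cluster label proportions.
sector_likelihood = {
    "sector_near_station": np.array([0.15, 0.85]),
    "sector_on_highway":   np.array([0.90, 0.10]),
    "sector_mixed":        np.array([0.50, 0.50]),
}

def trajectory_posterior(visited, prior=np.array([0.5, 0.5])):
    post = prior.copy()
    for sector in visited:               # one Bayesian update per record
        post = post * sector_likelihood[sector]
        post = post / post.sum()
    return dict(zip(modes, post))

print(trajectory_posterior(["sector_mixed", "sector_near_station"]))
```
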
CentroidNet: A Deep Neural Network for Joint Object Localization and Counting

In precision agriculture, counting and precise localization of crops is important for optimizing crop yield. In this paper we introduce CentroidNet, a Fully Convolutional Neural Network (FCNN) architecture specifically designed for object localization and counting. A field of vectors pointing to the nearest object centroid is trained and combined with a learned segmentation map to produce accurate object centroids by majority voting. This is tested on a crop dataset made using a UAV (drone) and on a cell-nuclei dataset which was provided by a Kaggle challenge. We define the mean Average F1 score (mAF1) for measuring the trade-off between precision and recall. CentroidNet is compared to the state-of-the-art networks YOLOv2 and RetinaNet, which share similar properties. The results show that CentroidNet obtains the best F1 score. We also explicitly show that CentroidNet can seamlessly switch between patches of images and full-resolution images without the need for retraining.

K. Dijkstra, J. van de Loosdrecht, L. R. B. Schomaker, M. A. Wiering
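
The voting-based decoding can be sketched as follows, with random arrays standing in for the network's predicted vector field and segmentation map: every foreground pixel casts a vote at the location its vector points to, and vote peaks become centroids.

```python
# Sketch of centroid voting: each foreground pixel adds its predicted
# offset vector to its own coordinate and votes there; strongly
# supported vote peaks are taken as object centroids.
import numpy as np

h, w = 64, 64
vectors = np.random.randn(h, w, 2) * 3          # per-pixel offset to centroid
segmentation = np.random.rand(h, w) > 0.5       # per-pixel object mask

votes = np.zeros((h, w))
for y in range(h):
    for x in range(w):
        if not segmentation[y, x]:
            continue
        ty = int(round(y + vectors[y, x, 0]))
        tx = int(round(x + vectors[y, x, 1]))
        if 0 <= ty < h and 0 <= tx < w:
            votes[ty, tx] += 1                  # majority voting

# Centroids = strongly supported vote peaks (threshold is illustrative).
centroids = np.argwhere(votes > 10)
print(len(centroids), "candidate centroids")
```
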
Deep Modular Multimodal Fusion on Multiple Sensors for Volcano Activity Recognition

Nowadays, with the development of sensor techniques and the growth in the number of volcanic monitoring systems, more and more data about volcanic sensor signals are gathered. This results in a need for mining these data to study the mechanism of volcanic eruptions. This paper focuses on Volcano Activity Recognition (VAR), where the inputs are multiple sensor data obtained from the volcanic monitoring system in the form of time series, and the output is the volcano status, which is either explosive or not explosive. It is hard even for experts to extract handcrafted features from these time series. To solve this problem, we propose a deep neural network architecture called VolNet which adapts Convolutional Neural Networks to each time series to extract non-handcrafted feature representations that are powerful for discriminating between classes. Taking advantage of VolNet as a building block, we propose a simple but effective fusion model called Deep Modular Multimodal Fusion (DMMF), which adapts data grouping as the guidance for designing the architecture of the fusion model. Different from conventional multimodal fusion, where the features are concatenated all at once at the fusion step, DMMF fuses relevant modalities in different modules separately, in a hierarchical fashion. We conducted extensive experiments to demonstrate the effectiveness of VolNet and DMMF on volcanic sensor datasets obtained from the Sakurajima volcano, which are the biggest volcanic sensor datasets in Japan. The experiments showed that DMMF outperformed the current state-of-the-art fusion model, increasing the F-score by up to 1.9% on average.

Hiep V. Le, Tsuyoshi Murata, Masato Iguchi

Nectar Track

Frontmatter
Matrix Completion Under Interval Uncertainty: Highlights

We present an overview of inequality-constrained matrix completion, with a particular focus on alternating least-squares (ALS) methods. The simple and seemingly obvious addition of inequality constraints seems to improve the statistical performance of matrix completion in a number of applications, such as collaborative filtering under interval uncertainty, robust statistics, event detection, and background modelling in computer vision. An ALS algorithm MACO by Marecek et al. outperforms others, including Sparkler, the implementation of Li et al. Code related to this paper is available at: http://optml.github.io/ac-dc/ .

Jakub Marecek, Peter Richtarik, Martin Takac
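
To make the inequality-constrained objective concrete, here is a numpy sketch in which entries known only up to an interval are penalized when the current low-rank estimate leaves the interval; plain gradient steps stand in for the ALS solver of MACO.

```python
# Sketch of inequality-constrained matrix completion: squared error on
# exactly observed entries, plus a penalty whenever an interval-
# constrained entry's estimate U[i] @ V[j] leaves its [lo, hi] box.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 15, 3
U, V = rng.normal(size=(m, r)), rng.normal(size=(n, r))

obs = [(0, 0, 1.5), (3, 7, -0.5)]               # (i, j, exact value)
box = [(1, 2, 0.0, 1.0), (5, 5, -1.0, 0.0)]     # (i, j, lower, upper)

lr = 0.05
for _ in range(500):
    for i, j, val in obs:
        err = U[i] @ V[j] - val
        U[i], V[j] = U[i] - lr * err * V[j], V[j] - lr * err * U[i]
    for i, j, lo, hi in box:
        pred = U[i] @ V[j]
        # Nonzero gradient only when the prediction violates the box.
        err = max(pred - hi, 0.0) + min(pred - lo, 0.0)
        U[i], V[j] = U[i] - lr * err * V[j], V[j] - lr * err * U[i]

print(U[1] @ V[2])   # should now fall (near) inside [0, 1]
```
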
A Two-Step Approach for the Prediction of Mood Levels Based on Diary Data

The analysis of diary data can increase insights into patients suffering from mental disorders and can help to personalize online interventions. We propose a two-step approach for such an analysis. We first categorize free-text diary data into activity categories by applying a bag-of-words approach and explore recurrent neural networks to support this task. In a second step, we develop partial ordered logit models with varying levels of heterogeneity among clients to predict their mood. We estimate the parameters of these models by employing MCMC techniques and compare the models regarding their predictive performance. This two-step approach leads to increased interpretability of the relationships between various activity categories and the individual mood level.

Vincent Bremer, Dennis Becker, Tobias Genz, Burkhardt Funk, Dirk Lehr
Best Practices to Train Deep Models on Imbalanced Datasets—A Case Study on Animal Detection in Aerial Imagery

We introduce recommendations to train a Convolutional Neural Network for grid-based detection on a dataset that has a substantial class imbalance. These include curriculum learning, hard negative mining, a special border class, and more. We evaluate the recommendations on the problem of animal detection in aerial images, where we obtain an increase in precision from 9% to 40% at high recalls, compared to state-of-the-art. Data related to this paper are available at: http://doi.org/10.5281/zenodo.609023 .

Benjamin Kellenberger, Diego Marcos, Devis Tuia
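One of the recommendations, hard negative mining, can be sketched as a loss that keeps every positive cell but only the highest-loss negatives in each batch. The function below is a generic illustration (the shapes and the 3:1 negative ratio are assumptions), not the paper's exact training code.

```python
import torch

def hard_negative_loss(logits, targets, neg_ratio=3):
    """Keep all positives but only the highest-loss negatives per batch."""
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    pos = targets > 0.5
    n_neg = int(neg_ratio * max(pos.sum().item(), 1))
    neg_loss = loss[~pos]
    # Mine the hardest negatives: those with the largest loss values.
    hard_neg, _ = torch.topk(neg_loss, min(n_neg, neg_loss.numel()))
    return (loss[pos].sum() + hard_neg.sum()) / (pos.sum() + hard_neg.numel())
```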
Deep Query Ranking for Question Answering over Knowledge Bases

We study question answering systems over knowledge graphs, which map an input natural language question to candidate formal queries. Often, a ranking mechanism is used to discern the queries with higher similarity to the given question. Given the intrinsic complexity of natural language, finding the most accurate formal counterpart is a challenging task. In our recent paper [1], we leveraged Tree-LSTMs to exploit the syntactic structure of the input question as well as of the candidate formal queries to compute their similarities. An empirical study shows that taking the structural information of the input question and candidate query into account enhances performance compared to the baseline system. Code related to this paper is available at: https://github.com/AskNowQA/SQG .

Hamid Zafar, Giulio Napolitano, Jens Lehmann
Machine Learning Approaches to Hybrid Music Recommender Systems

Music recommender systems have become a key technology for accessing increasingly large music catalogs in online music streaming services, online music shops, and private collections. The interaction of users with large music catalogs is a complex phenomenon researched from different disciplines. We survey our work on the machine learning and data mining aspects of hybrid music recommender systems (i.e., systems that integrate different recommendation techniques). We have proposed hybrid music recommender systems that are robust to the so-called “cold-start problem” for new music items, favoring the discovery of relevant but non-popular music. We have also thoroughly studied the specific task of music playlist continuation, analyzing fundamental playlist characteristics, song feature representations, and the relationship between playlists and the songs therein.

Andreu Vall, Gerhard Widmer

Demo Track

Frontmatter
IDEA: An Interactive Dialogue Translation Demo System Using Furhat Robots

We showcase IDEA, an Interactive DialoguE trAnslation system using Furhat robots. Its novel contributions are: (i) it is a web service-based application combining translation, speech recognition, and speech synthesis services; (ii) it is a task-oriented hybrid machine translation system combining statistical and neural machine learning methods for domain-specific named entity (NE) recognition and translation; and (iii) it provides a user-friendly interactive interface using a Furhat robot with speech input, speech output, head movement, and facial emotions. IDEA is a case-study demo that can efficiently and accurately assist customers and agents speaking different languages in reaching an agreement in a hotel-booking dialogue.

Jinhua Du, Darragh Blake, Longyue Wang, Clare Conran, Declan Mckibben, Andy Way
RAPID: Real-time Analytics Platform for Interactive Data Mining

Twitter is a popular social networking site that generates a large volume and variety of tweets; a key challenge is thus to filter and track relevant tweets and identify the main topics discussed in real time. For this purpose, we developed the Real-time Analytics Platform for Interactive Data mining (RAPID) system, which provides an effective data collection mechanism through query expansion, along with numerous analysis and visualization capabilities for understanding user interactions, tweeting behaviours, discussion topics, and other social patterns. A video related to this paper is available at: https://youtu.be/1APLeLT_t8w .

Kwan Hui Lim, Sachini Jayasekara, Shanika Karunasekera, Aaron Harwood, Lucia Falzon, John Dunn, Glenn Burgess
Interactive Time Series Clustering with COBRAS^TS

Time series are ubiquitous, resulting in substantial interest in time series data mining. Clustering is one of the most widely used techniques in this setting. Recent work has shown that time series clustering can benefit greatly from small amounts of supervision in the form of pairwise constraints. Such constraints can be obtained by asking the user to answer queries of the following type: should these two instances be in the same cluster? Answering “yes” results in a must-link constraint; answering “no” results in a cannot-link constraint. In this paper we present an interactive clustering system that exploits such constraints. It is implemented on top of the recently introduced COBRAS^TS method. The system repeats the following steps until a satisfactory clustering is obtained: it presents several pairwise queries to the user through a visual interface, uses the resulting pairwise constraints to improve the clustering, and shows the new clustering to the user. Our system is readily available and comes with an easy-to-use interface, making it an effective tool for anyone interested in analyzing time series data. Code related to this paper is available at: https://bitbucket.org/toon_vc/cobras_ts/src .

Toon Van Craenendonck, Wannes Meert, Sebastijan Dumančić, Hendrik Blockeel
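The query-constrain-refine loop the system builds on can be sketched generically as follows. This illustrates the style of interaction only; `cluster_fn` and `oracle` are placeholder callables, the random pair selection stands in for the real query-selection logic, and the actual cobras_ts package exposes a different interface.

```python
import itertools
import random

def interactive_clustering(instances, cluster_fn, oracle, budget=20):
    """Generic must-link / cannot-link query loop (not the cobras_ts API).

    cluster_fn(instances, ml, cl) -> clustering honouring the constraints
    oracle(i, j) -> True if instances i and j belong to the same cluster
    """
    ml, cl = [], []                                  # constraint stores
    clustering = cluster_fn(instances, ml, cl)
    pairs = list(itertools.combinations(range(len(instances)), 2))
    random.shuffle(pairs)                            # stand-in for informed
    for i, j in pairs[:budget]:                      # query selection
        (ml if oracle(i, j) else cl).append((i, j))  # user answers yes / no
        clustering = cluster_fn(instances, ml, cl)   # refine with feedback
    return clustering
```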
pysubgroup: Easy-to-Use Subgroup Discovery in Python

This paper introduces the pysubgroup package for subgroup discovery in Python. Subgroup discovery is a well-established data mining task that aims to identify describable subsets of the data that show an interesting distribution with respect to a certain target concept. The presented package provides an easy-to-use, compact, and extensible implementation of state-of-the-art mining algorithms, interestingness measures, and visualizations. Since it builds directly on the established pandas data analysis library, a de facto standard for data science in Python, it integrates seamlessly into preprocessing and exploratory data analysis steps. Code related to this paper is available at: http://florian.lemmerich.net/pysubgroup .

Florian Lemmerich, Martin Becker
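A typical session, loosely following the package's documented quickstart, might look like this. The class names (`BinaryTarget`, `WRAccQF`, `BeamSearch`) are taken from the package documentation and may differ across versions; the toy data is invented.

```python
import pandas as pd
import pysubgroup as ps

# Toy data: which describable subgroups have unusually high churn?
data = pd.DataFrame({
    "age":     [22, 35, 58, 41, 30, 63, 27, 49],
    "plan":    ["A", "B", "A", "B", "A", "B", "A", "B"],
    "churned": [1, 0, 1, 0, 1, 1, 0, 1],
})
target = ps.BinaryTarget("churned", True)
searchspace = ps.create_selectors(data, ignore=["churned"])
task = ps.SubgroupDiscoveryTask(
    data, target, searchspace,
    result_set_size=3, depth=2, qf=ps.WRAccQF())
result = ps.BeamSearch().execute(task)
print(result.to_dataframe())  # top subgroups with their quality scores
```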
An Advert Creation System for Next-Gen Publicity

With the rapid proliferation of multimedia data on the internet, video creation for viewers has risen sharply. Viewers, however, can skip the advertisement breaks in these videos using ad blockers and ‘skip ad’ buttons, bringing online marketing and publicity to a standstill. In this paper, we demonstrate a system that can effectively integrate a new advertisement into a video sequence. We use state-of-the-art techniques from deep learning and computational photogrammetry for effective detection of existing adverts and seamless integration of new adverts into video sequences. This is helpful for targeted advertisement, paving the path for next-gen publicity. A video related to this paper is available at: https://youtu.be/zaKpJZhBVL4 .

Atul Nautiyal, Killian McCabe, Murhaf Hossari, Soumyabrata Dev, Matthew Nicholson, Clare Conran, Declan McKibben, Jian Tang, Wei Xu, François Pitié
VHI: Valve Health Identification for the Maintenance of Subsea Industrial Equipment

Subsea valves are a key piece of equipment in the extraction of oil and natural gas: they control the flow of fluids by opening and closing passageways, and a malfunctioning valve can lead to significant operational losses. In this paper, we describe VHI, a system designed to assist maintenance engineers with condition-based monitoring services for valves. VHI addresses the maintenance challenge in two ways: a supervised approach that predicts impending valve failure, and an unsupervised approach that identifies and highlights anomalies, i.e., unusual valve behaviour. The supervised approach is suitable for valves with a long operational history, while the unsupervised approach suits valves with no operational history.

M. Atif Qureshi, Luis Miralles-Pechuán, Jing Su, Jason Payne, Ronan O’Malley
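The abstract does not name VHI's unsupervised detector, so the sketch below substitutes a scikit-learn Isolation Forest over hypothetical per-cycle valve features, purely to illustrate the unsupervised route for valves with no failure history.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-cycle features, e.g. travel time, torque, overshoot.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
features[::50] += 4                      # inject some unusual cycles

detector = IsolationForest(contamination=0.02, random_state=0).fit(features)
flags = detector.predict(features)       # -1 marks anomalous valve behaviour
print(f"{(flags == -1).sum()} suspicious cycles flagged")
```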
Tiler: Software for Human-Guided Data Exploration

Understanding the relations in a dataset is important for the successful application of data mining and machine learning methods. This paper describes tiler, a software tool for interactive visual exploratory data analysis that realises the interactive Human-Guided Data Exploration framework. tiler allows a user to formulate different hypotheses concerning the relations in a dataset. Data samples corresponding to these hypotheses are then compared visually, allowing the user to gain insight into the relations in the dataset. The exploration process is iterative, and the user gradually builds up his or her understanding of the data. Code related to this paper is available at: https://github.com/aheneliu/tiler .

Andreas Henelius, Emilia Oikarinen, Kai Puolamäki
ADAGIO: Interactive Experimentation with Adversarial Attack and Defense for Audio

Adversarial machine learning research has recently demonstrated the feasibility of confusing automatic speech recognition (ASR) models by introducing acoustically imperceptible perturbations into audio samples. To help researchers and practitioners better understand the impact of such attacks, and to provide tools for evaluating and crafting strong defenses for their models, we present Adagio, the first tool designed to allow interactive experimentation with adversarial attacks and defenses on an ASR model in real time, both visually and aurally. Adagio incorporates AMR and MP3 audio compression techniques as defenses, which users can interactively apply to attacked audio samples. We show that these techniques, which are based on psychoacoustic principles, effectively eliminate targeted attacks, reducing the attack success rate from 92.5% to 0%. We will demonstrate Adagio and invite the audience to try it on the Mozilla Common Voice dataset. Code related to this paper is available at: https://github.com/nilakshdas/ADAGIO .

Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Li Chen, Michael E. Kounavis, Duen Horng Chau
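The compression defense is easy to reproduce in spirit: round-trip a (possibly attacked) sample through a lossy codec, which tends to destroy imperceptible adversarial perturbations. The sketch below uses pydub (which needs ffmpeg for MP3 support) as one possible tool; it is not necessarily what Adagio uses internally, and the file names are placeholders.

```python
from pydub import AudioSegment  # requires ffmpeg for MP3 support

def mp3_defense(in_wav, out_wav, bitrate="64k"):
    """Round-trip audio through lossy MP3 compression as a defense."""
    audio = AudioSegment.from_wav(in_wav)
    audio.export("tmp.mp3", format="mp3", bitrate=bitrate)  # lossy step
    AudioSegment.from_mp3("tmp.mp3").export(out_wav, format="wav")

mp3_defense("attacked_sample.wav", "defended_sample.wav")
```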
ClaRe: Classification and Regression Tool for Multivariate Time Series

As sensing and monitoring technology becomes more and more common, multiple scientific domains have to deal with big multivariate time series data. Whether one works in finance, life science and health, engineering, sports, or child psychology, being able to analyze and model multivariate time series has become highly important. As a result, there is increased interest in multivariate time series methodologies, to which the data mining and machine learning communities have responded with a vast literature on new time series methods. However, a major challenge is commonly overlooked: most of the broad audience of end users lack the knowledge to implement and use such methods. To bridge the gap between users and multivariate time series methods, we introduce the ClaRe dashboard. This open-source, web-based tool provides a broad audience with a new, intuitive data mining methodology for regression and classification tasks over time series. Code related to this paper is available at: https://github.com/parastelios/Accordion-Dashboard .

Ricardo Cachucho, Stylianos Paraschiakos, Kaihua Liu, Benjamin van der Burgh, Arno Knobbe
Industrial Memories: Exploring the Findings of Government Inquiries with Neural Word Embedding and Machine Learning

We present a text mining system to support the exploration of large volumes of text detailing the findings of government inquiries. Despite their historical significance and potential societal impact, the key findings of inquiries are often hidden within lengthy documents and remain inaccessible to the general public. We transform the findings of the Irish government’s inquiry into industrial schools and, using word embeddings, text classification, and visualization, present an interactive web-based platform that enables exploration of the text to uncover new historical insights. Code related to this paper is available at: https://industrialmemories.ucd.ie .

Susan Leavy, Emilie Pine, Mark T. Keane
Monitoring Emergency First Responders’ Activities via Gradient Boosting and Inertial Sensor Data

During operations, emergency first response teams spend much time communicating their current location and status to their leader over noisy radio communication systems. We are developing a modular system to provide as much of that information as possible to team leaders. One component of the system is a human activity recognition (HAR) algorithm, which applies an ensemble of gradient boosted decision trees (GBTs) to features extracted from inertial data captured by a wireless-enabled device, to infer which activity a first responder is engaged in. An easy-to-use smartphone application can monitor up to four first responders’ activities, visualise the current activity, and inspect the GBT output in more detail.

Sebastian Scheurer, Salvatore Tedesco, Òscar Manzano, Kenneth N. Brown, Brendan O’Flynn
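A stripped-down version of such a pipeline, with generic sliding-window statistics and scikit-learn's gradient boosting standing in for the paper's exact features and ensemble, might look like this (the synthetic signals are placeholders for real accelerometer data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def window_features(signal, win=128, step=64):
    """Mean/std/min/max per sliding window of a 1-D inertial signal."""
    feats = []
    for s in range(0, len(signal) - win + 1, step):
        w = signal[s:s + win]
        feats.append([w.mean(), w.std(), w.min(), w.max()])
    return np.array(feats)

# Synthetic stand-in for accelerometer magnitude from two activities.
rng = np.random.default_rng(1)
still = rng.normal(1.0, 0.05, 5000)        # standing still
moving = rng.normal(1.0, 0.6, 5000)        # e.g. crawling or walking
X = np.vstack([window_features(still), window_features(moving)])
y = np.array([0] * (len(X) // 2) + [1] * (len(X) - len(X) // 2))

clf = GradientBoostingClassifier().fit(X, y)
print(clf.score(X, y))                     # training accuracy on the toy data
```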
Visualizing Multi-document Semantics via Open Domain Information Extraction

Faced with the overwhelming amounts of data in the 24/7 stream of new articles appearing online, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we present a system that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. We rely on a series of natural language processing methods, including open-domain information extraction, a special filtering method to maintain only meaningful relationships, and a heuristic to form graphs with a high coverage rate of topic entities and concepts. Our graph visualization then allows users to explore these connections. In our experiments, we rely on a large collection of news crawled from the Web and show how connections within this data can be explored. Code related to this paper is available at: https://shengyp.github.io/vmse .

Yongpan Sheng, Zenglin Xu, Yafang Wang, Xiangyu Zhang, Jia Jia, Zhonghui You, Gerard de Melo
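The graph-building step can be illustrated with networkx: open-IE style triples become labelled edges, and a crude connectivity filter stands in for the paper's coverage heuristic. The triples below are invented for the example.

```python
import networkx as nx

# Made-up (subject, relation, object) triples from hypothetical articles.
triples = [
    ("Acme Corp", "acquired", "Widget Ltd"),
    ("Widget Ltd", "based in", "Dublin"),
    ("Acme Corp", "led by", "J. Smith"),
]
G = nx.DiGraph()
for subj, rel, obj in triples:
    G.add_edge(subj, obj, label=rel)

# Keep only nodes connected to the main topic entities (a crude stand-in
# for the paper's coverage heuristic over topic entities and concepts).
topics = {"Acme Corp"}
keep = set().union(*(nx.node_connected_component(G.to_undirected(), t)
                     for t in topics))
print(G.subgraph(keep).edges(data=True))
```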
Backmatter
Metadata
Title
Machine Learning and Knowledge Discovery in Databases
Editors
Prof. Dr. Ulf Brefeld
Edward Curry
Elizabeth Daly
Dr. Brian MacNamee
Alice Marascu
Fabio Pinelli
Michele Berlingerio
Neil Hurley
Copyright Year
2019
Electronic ISBN
978-3-030-10997-4
Print ISBN
978-3-030-10996-7
DOI
https://doi.org/10.1007/978-3-030-10997-4
