About this book

This book constitutes the refereed proceedings of the IFIP TC 5, WG 8.4, 8.9, 12.9 International Cross-Domain Conference for Machine Learning and Knowledge Extraction, CD-MAKE 2017, held in Reggio, Italy, in August/September 2017.
The 24 revised full papers presented were carefully reviewed and selected for inclusion in this volume. The papers deal with fundamental questions and theoretical aspects and cover a wide range of topics in the field of machine learning and knowledge extraction. They are organized in the following topical sections: MAKE topology; MAKE smart factory; MAKE privacy; MAKE VIS; MAKE AAL; and MAKE semantics.



MAKE Topology


On Distance Mapping from non-Euclidean Spaces to Euclidean Spaces

Most Machine Learning techniques traditionally rely on some form of Euclidean distance, computed in a Euclidean space (typically \(\mathbb {R}^{d}\)). In more general cases, data might not live in a classical Euclidean space, and it can be difficult (or impossible) to find a direct representation for it in \(\mathbb {R}^{d}\). Therefore, a distance mapping from a non-Euclidean space to a canonical Euclidean space is needed. We present in this paper a possible distance-mapping algorithm, such that the behavior of the pairwise distances in the mapped Euclidean space is preserved, compared to those in the original non-Euclidean space. Experimental results of the mapping algorithm are discussed on a specific type of dataset made of timestamped GPS coordinates. The comparison of the original and mapped distances, as well as the standard errors of the mapped distributions, is discussed.
Wei Ren, Yoan Miche, Ian Oliver, Silke Holtmanns, Kaj-Mikael Bjork, Amaury Lendasse
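
To make the idea of comparing original and mapped pairwise distances concrete, the following is a minimal sketch and not the algorithm of the paper above. It assumes the non-Euclidean distance is the great-circle (haversine) distance between GPS points and swaps in a simple equirectangular projection as the mapping to \(\mathbb {R}^{2}\). The function names, the Earth-radius constant and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def project(p, ref_lat):
    """Equirectangular projection of a (lat, lon) point to planar (x, y) in km."""
    lat, lon = math.radians(p[0]), math.radians(p[1])
    return (EARTH_RADIUS_KM * lon * math.cos(math.radians(ref_lat)),
            EARTH_RADIUS_KM * lat)


def max_relative_error(points):
    """Worst relative gap between mapped and original pairwise distances.

    Assumes all points are distinct, so no original distance is zero.
    """
    ref_lat = sum(p[0] for p in points) / len(points)
    mapped = [project(p, ref_lat) for p in points]
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d_orig = haversine_km(points[i], points[j])
            d_map = math.dist(mapped[i], mapped[j])  # Euclidean distance
            worst = max(worst, abs(d_map - d_orig) / d_orig)
    return worst
```

For points a few kilometres apart the projected distances stay very close to the great-circle ones; the gap grows as the points spread out, which is the kind of behaviour a distance-mapping evaluation would examine.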

Some Remarks on the Algebraic Properties of Group Invariant Operators in Persistent Homology

Topological data analysis is a new approach to processing digital data, focusing on the fact that topological properties are quite important for efficient data comparison. In particular, persistent topology and homology are relevant mathematical tools in TDA, and their study is attracting more and more researchers. As a matter of fact, in many applications data can be represented by continuous real-valued functions defined on a topological space X, and persistent homology can be efficiently used to compare these data by describing the homological changes of the sub-level sets of those functions. However, persistent homology is invariant under the action of the group \(\mathrm {Homeo}(X)\) of all self-homeomorphisms of X, while in many cases an invariance with respect to a proper subgroup G of \(\mathrm {Homeo}(X)\) is preferable. Interestingly, it has been recently proved that this restricted invariance can be obtained by applying G-invariant non-expansive operators to the considered functions. As a consequence, in order to proceed along this line of research we need methods to build G-invariant non-expansive operators. According to this perspective, in this paper we prove some new results about the algebra of GINOs.
Patrizio Frosini, Nicola Quercioli
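
As an illustration of how persistent homology describes the changes in the sub-level sets of a function, the sketch below computes 0-dimensional persistence pairs for a function sampled on a line. It is a simplified, assumed setting (a path graph, connected components only) and does not implement the group-invariant operators studied in the paper above; the function name is hypothetical.

```python
def sublevel_persistence_0d(values):
    """0-dimensional persistence pairs (birth, death) of the sub-level sets
    of a function sampled on the path graph 0 - 1 - ... - (n - 1)."""
    n = len(values)
    parent = [None] * n   # None means "not yet in the sub-level set"
    birth = {}            # component root -> value at which it was born

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Sweep the threshold upwards: add samples in order of increasing value.
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                a, b = find(i), find(j)
                if a == b:
                    continue
                # Elder rule: the younger component (larger birth) dies here.
                old, young = (a, b) if birth[a] <= birth[b] else (b, a)
                if birth[young] < values[i]:  # skip zero-persistence pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # The oldest component is never merged away.
    pairs.append((min(values), float("inf")))
    return sorted(pairs)
```

Reversing the sample order, which corresponds to reflecting the line, leaves the pairs unchanged. This is a small instance of the invariance under self-homeomorphisms mentioned in the abstract.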

Decentralized Computation of Homology in Wireless Sensor Networks Using Spanning Trees

When deploying a wireless sensor network over an area of interest, the information on signal coverage is critical. It has been shown that even when the geometric position and orientation of individual nodes are not known, useful information on coverage can still be deduced based on connectivity data. In recent years, homological criteria have been introduced to verify complete signal coverage, given only the network communication graph. However, their algorithmic implementation has been limited due to the high computational complexity of centralized algorithms, and the high demand for communication in decentralized solutions, where a network employs the processing power of its nodes to check the coverage autonomously. To mitigate these problems, known approaches impose certain limitations on network topologies. In this paper, we propose a novel distributed algorithm which uses spanning trees to verify homology-based network coverage criteria, and works for arbitrary network topologies. We demonstrate that its communication demands are suitable even for low-bandwidth wireless sensor networks.
Domen Šoberl, Neža Mramor Kosta, Primož Škraba
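
The link between spanning trees and homology can be sketched in a centralized toy setting; this is not the distributed algorithm proposed in the paper above. Assuming a simple undirected communication graph stored as an adjacency dictionary (a hypothetical representation), every edge left out of a spanning forest closes exactly one independent cycle, so the number of such edges equals the first Betti number of the graph.

```python
from collections import deque


def spanning_forest(adj):
    """BFS spanning forest of a simple undirected graph {node: set(neighbors)}.

    Returns (tree_edges, non_tree_edges) as sets of frozenset({u, v}).
    """
    visited, tree = set(), set()
    for root in adj:
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    tree.add(frozenset((u, v)))
                    queue.append(v)
    all_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    # Each non-tree edge closes exactly one independent cycle.
    return tree, all_edges - tree
```

In a coverage criterion, cycles bounded by triangles of mutually connected nodes would additionally be discarded; that step is omitted here.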

Detecting and Ranking API Usage Pattern in Large Source Code Repository: A LFM Based Approach

Code examples are key resources for helping programmers learn correct Application Programming Interface (API) usage efficiently. However, the official documentation of most frameworks and libraries fails to provide sufficient and adequate code examples. Thus, programmers spend considerable effort browsing websites and extracting API usage examples from them. To reduce this effort, this paper proposes a graph-based, pattern-oriented mining approach, LFM-OUPD (Local fitness measure for detecting overlapping usage patterns), that recommends suitable API code examples based on data analytics. API method queries are accepted from programmers, and the corresponding code files are collected from a related API dataset. The detailed structural links among API method elements in the conceptual source code are captured to generate a code graph structure. Lancichinetti et al. proposed an overlapping community detection algorithm (Local fitness measure, LFM) based on the local optimization of a fitness function. In LFM-OUPD, a mining algorithm based on LFM is presented to explore the division of method sequences in the directed source code element graph and detect candidates of different API usage patterns. Then a ranking approach is applied to obtain appropriate API usage patterns and code example candidates. A case study on Google Guava is conducted to evaluate the effectiveness of this approach.
Jitong Zhao, Yan Liu
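
The local fitness function at the core of LFM can be written down compactly. The sketch below uses the fitness of Lancichinetti et al., f = k_in / (k_in + k_out)^alpha, where k_in is the total internal degree of the community and k_out the total external degree. It grows a community greedily from a seed on an undirected graph. This is a simplification under stated assumptions: the node-removal step of LFM is omitted, and the directed code graph and ranking of LFM-OUPD are not modelled. Function names are illustrative.

```python
def fitness(adj, community, alpha=1.0):
    """LFM community fitness k_in / (k_in + k_out) ** alpha."""
    k_in = k_out = 0
    for u in community:
        for v in adj[u]:
            if v in community:
                k_in += 1   # each internal edge is counted from both ends
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def grow_community(adj, seed, alpha=1.0):
    """Greedily add the neighbor with the largest positive fitness gain."""
    community = {seed}
    while True:
        current = fitness(adj, community, alpha)
        frontier = {v for u in community for v in adj[u]} - community
        best, best_gain = None, 0.0
        for v in sorted(frontier):  # sorted for deterministic tie-breaking
            gain = fitness(adj, community | {v}, alpha) - current
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            return community
        community.add(best)
```

On two triangles joined by a single edge, growing from any node of one triangle stops at that triangle, because adding the bridge node would lower the fitness.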

MAKE Smart Factory


Towards a Framework for Assistance Systems to Support Work Processes in Smart Factories

Increasingly, production processes are enabled and controlled by Information Technology (IT), a development also referred to as “Industry 4.0”. IT thereby contributes to flexible and adaptive production processes, and in this sense factories become “smart factories”. In line with this, IT increasingly supports human workers via various assistance systems. This support aims both to help workers execute their tasks better and to reduce the effort and time required when working. However, due to the large spectrum of assistance systems, it is hard to acquire an overview and to select an adequate system for a smart factory based on meaningful criteria. We therefore synthesize a set of comparison criteria into a consistent framework and demonstrate the application of our framework by classifying three examples.
Michael Fellmann, Sebastian Robert, Sebastian Büttner, Henrik Mucha, Carsten Röcker

Managing Complexity: Towards Intelligent Error-Handling Assistance Trough Interactive Alarm Flood Reduction

The current trend of integrating machines and factories into cyber-physical systems (CPS) creates an enormous complexity for operators of such systems. In particular, the search for the root cause of cascading failures becomes highly time-consuming. Within this paper, we address the question of how to help human users understand the root causes of such situations better and faster. We propose a concept of interactive alarm flood reduction and present the implementation of a first vertical prototype for such a system. We consider this prototype as a first artifact to be discussed by the research community and aim towards an incremental further development of the system in order to support humans in complex error situations.
Sebastian Büttner, Paul Wunderlich, Mario Heinz, Oliver Niggemann, Carsten Röcker

Online Self-disclosure: From Users’ Regrets to Instructional Awareness

Unlike the offline world, the online world is devoid of well-evolved norms of interaction which guide socialization and self-disclosure. Therefore, it is difficult for members of online communities like Social Network Sites (SNSs) to control the scope of their actions and predict others’ reactions to them. Consequently users might not always anticipate the consequences of their online activities and often engage in actions they later regret. Regrettable and negative self-disclosure experiences can be considered as rich sources of privacy heuristics and a valuable input for the development of privacy awareness mechanisms. In this work, we introduce a Privacy Heuristics Derivation Method (PHeDer) to encode regrettable self-disclosure experiences into privacy best practices. Since information about the impact and the frequency of unwanted incidents (such as job loss, identity theft or bad image) can be used to raise users’ awareness, this method (and its conceptual model) puts special focus on the risks of online self-disclosure. At the end of this work, we provide assessment on how the outcome of the method can be used in the context of an adaptive awareness system for generating tailored feedback and support.
N. E. Díaz Ferreyra, Rene Meis, Maritta Heisel

MAKE Privacy


Decision Tree Rule Induction for Detecting Covert Timing Channels in TCP/IP Traffic

The detection of covert channels in communication networks is a current security challenge. By clandestinely transferring information, covert channels are able to circumvent security barriers, compromise systems, and facilitate data leakage. A set of statistical methods called DAT (Descriptive Analytics of Traffic) has been previously proposed as a general approach for detecting covert channels. In this paper, we implement and evaluate DAT detectors for the specific case of covert timing channels. Additionally, we propose machine learning models to induce classification rules and enable the fine parameterization of DAT detectors. A testbed has been created to reproduce main timing techniques published in the literature; consequently, the testbed allows the evaluation of covert channel detection techniques. We specifically applied Decision Trees to infer DAT-rules, achieving high accuracy and detection rates. This paper is a step forward for the actual implementation of effective covert channel detection plugins in modern network security devices.
Félix Iglesias, Valentin Bernhardt, Robert Annessi, Tanja Zseby

Practical Estimation of Mutual Information on Non-Euclidean Spaces

We propose, in this paper, to address the issue of measuring the impact of privacy and anonymization techniques, by measuring the data loss between “before” and “after”. The proposed approach therefore focuses on data usability rather than on ensuring that the data is sufficiently anonymized. We use Mutual Information as the measurement criterion for this approach, and detail how we propose to measure Mutual Information over non-Euclidean data, in practice, using two possible existing estimators. We test this approach using toy data to illustrate the effects of some well-known anonymization techniques on the proposed measure.
Yoan Miche, Ian Oliver, Wei Ren, Silke Holtmanns, Anton Akusok, Amaury Lendasse
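
The "before" versus "after" comparison can be illustrated with the simplest possible estimator. The sketch below is a plug-in estimate of Mutual Information for paired discrete samples, which is an assumption made for brevity and not one of the estimators for non-Euclidean data used in the paper above. The age values and the 20-year generalization are invented toy data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Toy data (hypothetical): exact ages, then ages generalized to 20-year bins.
ages = [23, 27, 31, 38, 45, 52, 58, 64]
generalized = [age // 20 for age in ages]

mi_before = mutual_information(ages, ages)         # information in the raw data
mi_after = mutual_information(ages, generalized)   # what survives generalization
```

Here the raw attribute carries 3 bits about itself, while the generalized version retains about 1.41 bits, so the drop quantifies the usability lost to this anonymization step.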

IntelliAV: Toward the Feasibility of Building Intelligent Anti-malware on Android Devices

Android is targeted the most by malware coders as the number of Android users is increasing. Although there are many Android anti-malware solutions available in the market, almost all of them are based on malware signatures, and more advanced solutions based on machine learning techniques are not deemed to be practical for the limited computational resources of mobile devices. In this paper we aim to show not only that the computational resources of consumer mobile devices allow deploying an efficient anti-malware solution based on machine learning techniques, but also that such a tool provides an effective defense against novel malware, for which signatures are not yet available. To this end, we first propose the extraction of a set of lightweight yet effective features from Android applications. Then, we embed these features in a vector space, and use a pre-trained machine learning model on the device for detecting malicious applications. We show that without resorting to any signatures, and relying only on a training phase involving a reasonable set of samples, the proposed system outperforms many commercial anti-malware products, as well as providing slightly better performances than the most effective commercial products.
Mansour Ahmadi, Angelo Sotgiu, Giorgio Giacinto

DO NOT DISTURB? Classifier Behavior on Perturbed Datasets

Exponential trends in data generation are presenting today’s organizations, economies and governments with challenges never encountered before, especially in the field of privacy and data security. One crucial trade-off regulators are facing concerns the simultaneous need to publish personal information for the sake of statistical analysis and Machine Learning in order to increase quality levels in areas like medical services, while at the same time protecting the identity of individuals. A key European measure will be the introduction of the General Data Protection Regulation (GDPR) in 2018, giving customers the ‘right to be forgotten’, i.e. having their data deleted on request. As this could lead to a competitive disadvantage for European companies, it is important to understand which effects the deletion of significant data points has on the performance of ML techniques. In a previous paper we introduced a series of experiments applying different algorithms to a binary classification problem under anonymization as well as perturbation. In this paper we extend those experiments to multi-class classification and introduce outlier removal as an additional scenario. While the results of our previous work were mostly in line with our expectations, our current experiments revealed unexpected behavior over a range of different scenarios. A surprising conclusion of those experiments is that classification on an anonymized dataset with outliers removed beforehand can almost compete with classification on the original, un-anonymized dataset. This could soon lead to competitive Machine Learning pipelines on anonymized datasets for real-world usage in the marketplace.
Bernd Malle, Peter Kieseberg, Andreas Holzinger

A Short-Term Forecast Approach of Public Buildings’ Power Demands upon Multi-source Data

Due to the significant increase in global electricity demand and the growing urban population, electricity consumption in cities has attracted increasing attention. Given that public buildings account for a large proportion of electricity consumption, accurate prediction of their consumption is crucial to rational electricity allocation and supply. This paper studies the possibility of using urban multi-source data such as POI, pedestrian volume etc. to predict buildings’ electricity consumption. From the multiple datasets, the key influencing factors are extracted to forecast the buildings’ electric power demands using a probabilistic graphical algorithm named EMG. Our methodology is applied to display the relationships between the factors and to forecast the daily electric power demands of nine public buildings, including hotels, shopping malls, and office buildings, in the city of Hangzhou, China, over a period of one month. Computational experiments are conducted, and the results favor our approach.
Shubing Shan, Buyang Cao

MAKE VIS


On the Challenges and Opportunities in Visualization for Machine Learning and Knowledge Extraction: A Research Agenda

We describe a selection of challenges at the intersection of machine learning and data visualization and outline a subjective research agenda based on professional and personal experience. The unprecedented increase in the amount, variety and value of data has been significantly transforming the way that scientific research is carried out and businesses operate. Data science has emerged as a practice to enable this data-intensive innovation by gathering together and advancing the knowledge from fields such as statistics, machine learning, knowledge extraction, data management, and visualization. Within it, visualization plays a unique and maybe the ultimate role as an approach to facilitate cooperation between human and computer, and in particular to enable the analysis of diverse and heterogeneous data using complex computational methods where algorithmic results are challenging to interpret and operationalize. Whilst algorithm development is surely at the center of the whole pipeline in disciplines such as Machine Learning and Knowledge Discovery, it is visualization which ultimately makes the results accessible to the end user. Visualization can thus be seen as a mapping from arbitrarily high-dimensional abstract spaces to lower dimensions, and it plays a central and critical role in interacting with machine learning algorithms, particularly in interactive machine learning (iML), which includes the human in the loop. The central goal of the CD-MAKE VIS workshop is to spark discussions at this intersection of visualization, machine learning and knowledge discovery and bring together experts from these disciplines. This paper discusses a perspective on the challenges and opportunities in the integration of these disciplines and presents a number of directions and strategies for further research.
Cagatay Turkay, Robert Laramee, Andreas Holzinger

Quantitative Externalization of Visual Data Analysis Results Using Local Regression Models

Both interactive visualization and computational analysis methods are useful for data studies and an integration of both approaches is promising to successfully combine the benefits of both methodologies. In interactive data exploration and analysis workflows, we need successful means to quantitatively externalize results from data studies, amounting to a particular challenge for the usually qualitative visual data analysis. In this paper, we propose a hybrid approach in order to quantitatively externalize valuable findings from interactive visual data exploration and analysis, based on local linear regression models. The models are built on user-selected subsets of the data, and we provide a way of keeping track of these models and comparing them. As an additional benefit, we also provide the user with the numeric model coefficients. Once the models are available, they can be used in subsequent steps of the workflow. A model-based optimization can then be performed, for example, or more complex models can be reconstructed using an inversion of the local models. We study two datasets to exemplify the proposed approach, a meteorological data set for illustration purposes and a simulation ensemble from the automotive industry as an actual case study.
Krešimir Matković, Hrvoje Abraham, Mario Jelović, Helwig Hauser
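
The idea of fitting a local linear model to a user-selected subset and reporting its coefficients can be sketched as follows. This is a minimal illustration, assuming a one-dimensional interval selection (a "brush") on a single explanatory variable and ordinary least squares; the function names and the toy data are hypothetical and do not come from the paper above.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes at least two distinct x values
    return slope, mean_y - slope * mean_x


def fit_selection(data, x_min, x_max):
    """Fit a local model only on the points whose x lies in the brushed interval."""
    subset = [(x, y) for x, y in data if x_min <= x <= x_max]
    return fit_line(subset)


# Toy data (hypothetical): rising up to x = 4, falling afterwards.
data = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), (5, 7), (6, 6), (7, 5), (8, 4)]
rising = fit_selection(data, 0, 4)    # slope 2, intercept 0
falling = fit_selection(data, 5, 8)   # slope -1, intercept 12
```

Keeping the coefficient pairs for each selection gives a numeric record of what was seen visually, which can then be compared across selections or reused in later workflow steps.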

Analysis of Online User Behaviour for Art and Culture Events

Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. Many researchers have exploited this as a source for understanding user behaviour and profiles in various settings. In this paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events. We propose a specific analysis pipeline that aims at examining the profile of online users, based on the textual content they published online. The pipeline covers the following aspects: data extraction and enrichment, topic modeling, user clustering, and prediction of interest. We show our approach at work for the monitoring of participation in a large-scale artistic installation that attracted more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). We report our findings and discuss the pros and cons of the work.
Behnam Rahdari, Tahereh Arabghalizi, Marco Brambilla

On Joint Representation Learning of Network Structure and Document Content

Inspired by the advancements of representation learning for natural language processing, learning continuous feature representations of nodes in networks has recently gained attention. Similar to word embeddings, node embeddings have been shown to capture certain semantics of the network structure. Combining both research directions into a joint representation learning of network structure and document content seems a promising direction to increase the quality of the learned representations. However, research is typically focused on either word or network embeddings and few approaches that learn a joint representation have been proposed. We present an overview of that field, starting at word representations, moving over document and network node representations to joint representations. We make the connections between the different models explicit and introduce a novel model for learning a joint representation. We present different methods for the novel model and compare the presented approaches in an evaluation. This paper explains how the different models recently proposed in the literature relate to each other and compares their performance.
Jörg Schlötterer, Christin Seifert, Michael Granitzer

MAKE AAL


Ambient Assisted Living Technologies from the Perspectives of Older People and Professionals

Ambient Assisted Living (AAL) and Ambient Intelligence technologies support older people in living an independent and confident life through innovative ICT-based products, services, and systems. Despite significant advancement in AAL technologies and smart systems, they have still not found their way into nursing homes for older people. The reasons are manifold. On the one hand, the development of such systems fails to address the requirements of older people and of the caregivers of the organization; on the other hand, older people are unwilling to make use of assistive systems. A qualitative study was performed at a nursing home to understand the needs and requirements of the residents and caregivers and their perspectives on the existing AAL technologies.
Deepika Singh, Johannes Kropf, Sten Hanke, Andreas Holzinger

Human Activity Recognition Using Recurrent Neural Networks

Human activity recognition using smart home sensors is one of the bases of ubiquitous computing in smart environments and a topic undergoing intense research in the field of ambient assisted living. The increasingly large number of data sets calls for machine learning methods. In this paper, we introduce a deep learning model that learns to classify human activities without using any prior knowledge. For this purpose, a Long Short Term Memory (LSTM) Recurrent Neural Network was applied to three real-world smart home datasets. The results of these experiments show that the proposed approach outperforms the existing ones in terms of accuracy and performance.
Deepika Singh, Erinc Merdivan, Ismini Psychoula, Johannes Kropf, Sten Hanke, Matthieu Geist, Andreas Holzinger

Modeling Golf Player Skill Using Machine Learning

In this study we apply machine learning techniques to modeling golf player skill using a dataset consisting of 277 golfers. The dataset includes 28 quantitative metrics, related to the club head at impact and ball flight, captured using a Doppler radar. For modeling, cost-sensitive decision trees and random forests are used to distinguish between less skilled players and very good ones, i.e., Hackers and Pros. The results show that both random forests and decision trees achieve high predictive performance with regard to true positive rate, accuracy and area under the ROC curve. A detailed interpretation of the decision trees shows that they concur with modern swing theory, e.g., consistency is very important, while face angle, club path and dynamic loft are the most important evaluated swing factors when distinguishing between Hackers and Pros. Most of the Hackers could be identified by a rather large deviation in one of these values compared to the Pros. Hackers who had less variation in these aspects of the swing could instead be identified by a steeper swing plane and a lower club speed. The importance of the swing plane is an interesting finding, since it was not expected and is not easy to explain.
Rikard König, Ulf Johansson, Maria Riveiro, Peter Brattberg

Predicting Chronic Heart Failure Using Diagnoses Graphs

Predicting the onset of heart disease is of obvious importance as doctors try to improve the general health of their patients. If it were possible to identify high-risk patients before their heart failure diagnosis, doctors could use that information to implement preventative measures to keep a heart failure diagnosis from becoming a reality. Integration of Electronic Medical Records (EMRs) into clinical practice has enabled the use of computational techniques for personalized healthcare at scale. The larger goal of such modeling is to pivot from reactive medicine to preventative care and early detection of adverse conditions. In this paper, we present a trajectory-based disease progression model to detect chronic heart failure. We validate our work on a database of Medicare records of 1.1 million elderly US patients. Our supervised approach allows us to assign likelihood of chronic heart failure for an unseen patient’s disease history and identify key disease progression trajectories that intensify or diminish said likelihood. This information will be a tremendous help as patients and doctors try to understand what are the most dangerous diagnoses for those who are susceptible to heart failure. Using our model, we demonstrate some of the most common disease trajectories that eventually result in the development of heart failure.
Saurabh Nagrecha, Pamela Bilo Thomas, Keith Feldman, Nitesh V. Chawla

MAKE Semantics


A Declarative Semantics for P2P Systems

This paper investigates the problem of data integration among Peer-to-Peer (P2P) deductive databases and presents a declarative semantics that generalizes previous proposals in the literature. Basically, by following the classical approach, the objective of a generic peer, joining a P2P system, is to enrich its knowledge by importing as much knowledge as possible while preventing inconsistency anomalies. This basic idea is extended in the present paper by allowing each peer to select between two different settings. It can either declare its local database to be sound but not complete, or declare it to be unsound. In the first case the peer considers its own knowledge more trustable than the knowledge imported from the rest of the system i.e. it gives preference to its knowledge with respect to the knowledge that can be imported from other peers. In the second case the peer considers its own knowledge as trustable as the knowledge that can be imported from the rest of the system i.e. it does not give any preference to its knowledge with respect to the knowledge that can be imported from other peers.
Luciano Caroprese, Ester Zumpano

Improving Language-Dependent Named Entity Detection

Named Entity Recognition (NER) and Named Entity Linking (NEL) are two research areas that have shown big advancements in recent years. The majority of this research is based on the English language. Hence, some of these improvements are language-dependent and do not necessarily lead to better results when applied to other languages. Therefore, this paper discusses TOMO, an approach to language-aware named entity detection and evaluates it for the German language. This also required the development of a German gold standard dataset, which was based on the English dataset used by the OKE 2016 challenge. An evaluation of the named entity detection task using the web-based platform GERBIL was undertaken and results show that our approach produced higher F1 values than the other annotators did. This indicates that language-dependent features do improve the overall quality of the spotter.
Gerald Petz, Werner Wetzlinger, Dietmar Nedbal
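
Since the comparison above is reported in terms of F1 values, a short sketch of how such a score is obtained may help. It assumes exact-match scoring of detected entity spans against a gold standard, with spans given as (start, end) character offsets; the matching rules of the evaluation platform may differ, and the example spans are invented.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of annotations."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # annotations found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical (start, end) spans of named entities in one document.
gold_spans = {(0, 6), (17, 23), (30, 41)}
predicted_spans = {(0, 6), (17, 23), (45, 50), (52, 60)}
scores = precision_recall_f1(gold_spans, predicted_spans)
```

With two of four predictions correct and two of three gold spans found, precision is 0.5, recall is 2/3 and F1 is 4/7, showing how F1 penalizes an annotator that over-generates spans.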

Towards the Automatic Detection of Nutritional Incompatibilities Based on Recipe Titles

The present paper reports experimental work on the automatic detection of nutritional incompatibilities of cooking recipes based on their titles. Such incompatibilities, viewed as medical or cultural issues, have become a major concern in western societies. The language of gastronomy represents an important challenge because of its elusiveness, its metaphors, and sometimes its catchy style. Recipe title processing brings together the analysis of short and domain-specific texts. We tackle these issues by building our algorithm on the basis of a common knowledge lexical semantic network. The experiment is reproducible. It uses freely available resources.
Nadia Clairet, Mathieu Lafourcade

The More the Merrier - Federated Learning from Local Sphere Recommendations

With Google’s Federated Learning & Facebook’s introduction of client-side NLP into their chat service, the era of client-side Machine Learning is upon us. While interesting ML approaches beyond the realm of toy examples were hitherto confined to large data-centers and powerful GPUs, exponential trends in technology and the introduction of billions of smartphones enable sophisticated processing swarms of even hand-held devices. Such approaches hold several promises: 1. Without the need for powerful server infrastructures, even small companies could scale to millions of users easily and cost-efficiently; 2. Since data used only in the learning process never needs to leave the client, personal information can be used free of privacy and data security concerns; 3. Since privacy is preserved automatically, the full range of personal information on the client device can be utilized for learning; and 4. Without round-trips to the server, results like recommendations can be made available to users much faster, resulting in enhanced user experience. In this paper we propose an architecture for federated learning from personalized, graph-based recommendations computed on client devices, collectively creating & enhancing a global knowledge graph. In this network, individual users will ‘train’ their local recommender engines, while a server-based voting mechanism aggregates the developing client-side models, preventing over-fitting on highly subjective data from tarnishing the global model.
Bernd Malle, Nicola Giuliani, Peter Kieseberg, Andreas Holzinger
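
A server-side voting step of the kind described above could look roughly like the following. The abstract does not specify the voting rule, so this sketch assumes a simple majority vote over recommendation edges proposed by clients; the function name, the `min_share` parameter and the edge representation are all hypothetical.

```python
from collections import Counter


def aggregate_by_vote(client_edges, min_share=0.5):
    """Keep an edge in the global graph only if more than `min_share`
    of the clients propose it (assumed majority rule)."""
    # Each client gets one vote per edge, even if it lists the edge twice.
    votes = Counter(edge for edges in client_edges for edge in set(edges))
    threshold = min_share * len(client_edges)
    return {edge for edge, count in votes.items() if count > threshold}


# Hypothetical local models: each client proposes (item, item) edges.
clients = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("x", "y")],
    [("a", "b"), ("b", "c")],
]
global_edges = aggregate_by_vote(clients)
```

The edge proposed by a single client is dropped, which is one simple way to keep highly subjective local preferences out of the shared model.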

