About this Book

The BIRS Workshop “Advances in Interactive Knowledge Discovery and Data Mining in Complex and Big Data Sets” (15w2181), held in July 2015 in Banff, Canada, was dedicated to stimulating a cross-domain, integrative machine-learning approach and an appraisal of “hot topics” toward the grand challenge of achieving useful and usable computational intelligence, with a focus on real-world problems such as those in the health domain. This encompasses learning from prior data, extracting and discovering knowledge, generalizing the results, fighting the curse of dimensionality, and ultimately disentangling the underlying explanatory factors in complex data, i.e., making sense of data within the context of the application domain.

The workshop aimed to contribute advances in promising novel areas, such as the intersection of machine learning and topological data analysis. History has shown that the overlap between seemingly disparate fields is most often key to stimulating new insights and further advances. This is particularly true for the extremely broad field of machine learning.

Table of Contents

Frontmatter

Towards Integrative Machine Learning and Knowledge Extraction

This volume is a result of workshop 15w2181, “Advances in Interactive Knowledge Discovery and Data Mining in Complex and Big Data Sets”, at the Banff International Research Station for Mathematical Innovation and Discovery. The workshop was dedicated to bringing together experts with diverse backgrounds but with one common goal: to understand intelligence for the successful design, development, and evaluation of algorithms that can learn from data, extract knowledge from experience, and improve their learning behaviour over time – much as humans do. Knowledge discovery, data mining, machine learning, and artificial intelligence are used more or less synonymously, with no strict definitions or boundaries. “Integrative” means supporting not only the machine learning & knowledge extraction pipeline, ranging from dealing with data in arbitrarily high-dimensional spaces to the visualization of results in a lower dimension accessible to a human; it also means taking into account seemingly disparate fields, which can be very fruitful when brought together for solving problems in complex application domains (e.g., health informatics). Here we want to emphasize that the most important findings in machine learning will be those we do not know yet. In this paper we provide: (1) a short motivation for the integrative approach; (2) brief summaries of the presentations given in Banff; and (3) some personally flavoured, subjective future research outlooks, e.g., on the combination of geometrical approaches with machine learning.

Andreas Holzinger, Randy Goebel, Vasile Palade, Massimo Ferri

Machine Learning and Knowledge Extraction in Digital Pathology Needs an Integrative Approach

During the last decade, pathology has benefited from the rapid progress of image-digitizing technologies, which led to the development of scanners capable of producing so-called whole-slide images (WSI). These can be explored by a pathologist on a computer screen in a manner comparable to the conventional microscope and can be used for diagnostics, research, archiving, and also education and training. Digital pathology is not merely the transformation of the classical microscopic analysis of histological slides into a digital visualization. It is a disruptive innovation that will dramatically change medical workflows in the coming years and help to foster personalized medicine. A pathologist becomes really powerful when augmented by machine learning, e.g., by support vector machines, random forests, and deep learning. The ultimate benefit of digital pathology is to enable learning, knowledge extraction, and prediction from a combination of heterogeneous data, i.e., the histological image, the patient history, and the *omics data. These challenges call for an integrated/integrative machine learning approach fostering transparency, trust, acceptance, and the ability to explain step by step why a decision has been made.

Andreas Holzinger, Bernd Malle, Peter Kieseberg, Peter M. Roth, Heimo Müller, Robert Reihs, Kurt Zatloukal

Comparison of Public-Domain Software and Services for Probabilistic Record Linkage and Address Standardization

Probabilistic record linkage (PRL) refers to the process of matching records from various data sources, such as database tables, with some missing or corrupted index values. A human is often kept in the loop to review cases that an algorithm cannot match. PRL can be applied to join or de-duplicate records, or to impute missing data, resulting in better overall data quality. An important subproblem in PRL is parsing a field such as an address into its components, e.g., street number, street name, city, state, and zip code. Data analysis techniques such as natural language processing and machine learning methods are often gainfully employed in both PRL and address standardization to achieve higher linking or prediction accuracy. This work compares the performance of four reputable PRL packages freely available in the public domain, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluate the baseline performance and sensitivity of four address-parsing web services, including the Data Science Toolkit, Geocoder.us, Google Maps APIs, and the U.S. address parser. Finally, we present some of the strengths and limitations of the software and services we have evaluated.
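
The chapter compares existing packages rather than presenting an algorithm of its own; purely as an illustration of the scoring idea behind probabilistic record linkage, here is a minimal Fellegi–Sunter-style sketch in Python. The field names, m/u probabilities, and decision threshold are all invented for the example, not taken from the chapter.

```python
import math

# Illustrative m/u probabilities per field: m = P(fields agree | true match),
# u = P(fields agree | non-match). Values here are made up for the example.
FIELD_PARAMS = {
    "last_name":  (0.95, 0.01),
    "zip_code":   (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}

def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum log-likelihood ratios over fields (Fellegi-Sunter style)."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)               # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))   # disagreement weight
    return total

a = {"last_name": "smith", "zip_code": "60601", "birth_year": 1980}
b = {"last_name": "smith", "zip_code": "60601", "birth_year": 1981}
print(f"weight = {match_weight(a, b):.2f}")
# Above a chosen threshold -> candidate match; borderline scores would be
# routed to the human reviewer mentioned in the abstract.
```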

Sou-Cheng T. Choi, Yongheng Lin, Edward Mulrow

Better Interpretable Models for Proteomics Data Analysis Using Rule-Based Mining

Recent advances in -omics technology have yielded large data-sets in many areas of biology, such as mass-spectrometry-based proteomics. However, analyzing these data is still a challenging task, mainly due to their very high dimensionality and high noise content. One of the main objectives of the analysis is the identification of relevant patterns (or features) which can be used for classifying new samples as healthy or diseased. So, a method is required to find easily interpretable models in these data. To achieve this goal, we have adapted the disjunctive association rule mining algorithm TitanicOR to identify emerging patterns in our mass spectrometry proteomics data-sets. Comparison to five state-of-the-art methods shows that our method is advantageous in terms of identifying the inter-dependency between the features, and in terms of the TP-rate and precision of the selected features. We further demonstrate the applicability of our algorithm to one previously published clinical data-set.
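
TitanicOR itself is not reproduced here; the following minimal Python sketch only illustrates the basic rule-mining vocabulary the abstract relies on, namely the support and confidence of a rule over binary feature data. The toy samples and feature names are invented.

```python
# Toy binary data: each sample is the set of peptide features it contains,
# plus a class label. Data and feature names are invented for illustration.
samples = [
    {"p1", "p2", "label:diseased"},
    {"p1", "p3", "label:diseased"},
    {"p2", "p3", "label:healthy"},
    {"p1", "p2", "label:diseased"},
]

def support(itemset: set) -> float:
    """Fraction of samples containing every item in the itemset."""
    return sum(itemset <= s for s in samples) / len(samples)

def confidence(antecedent: set, consequent: set) -> float:
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# Score simple one-feature rules "feature -> diseased".
for feat in ("p1", "p2", "p3"):
    ante, cons = {feat}, {"label:diseased"}
    print(feat,
          "support:", round(support(ante | cons), 2),
          "confidence:", round(confidence(ante, cons), 2))
```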

Fahrnaz Jayrannejad, Tim O. F. Conrad

Probabilistic Logic Programming in Action

Probabilistic Programming (PP) has recently emerged as an effective approach for building complex probabilistic models. Until recently PP was mostly focused on functional programming, but Probabilistic Logic Programming (PLP) now forms a significant subfield. In this paper we aim to present a quick overview of the features of current languages and systems for PLP. We first present the basic semantics of probabilistic logic programs and then consider extensions for dealing with infinite structures and continuous random variables. To show the modeling features of PLP in action, we present several examples: a simple generator of random 2D tile maps, an encoding of Markov Logic Networks, the truel game, the coupon collector problem, the one-dimensional random walk, latent Dirichlet allocation, and the Indian GPA problem. These examples show the maturity of PLP.
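
None of the chapter's PLP encodings are shown here; as a language-neutral illustration of one of the listed examples, the coupon collector problem, here is a plain-Python Monte Carlo sketch. A PLP system would express the same model declaratively and obtain the expectation by inference rather than by hand-written sampling loops.

```python
import random

def draws_to_collect(n_coupons: int) -> int:
    """Number of uniform draws until all n_coupons types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n_coupons:
        seen.add(random.randrange(n_coupons))
        draws += 1
    return draws

# Estimate the expected number of draws for 5 coupon types.
# The exact value is n * H_n = 5 * (1 + 1/2 + 1/3 + 1/4 + 1/5) ≈ 11.42.
trials = 10_000
estimate = sum(draws_to_collect(5) for _ in range(trials)) / trials
print(f"estimated E[draws] ≈ {estimate:.2f}")
```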

Arnaud Nguembang Fadja, Fabrizio Riguzzi

Persistent Topology for Natural Data Analysis — A Survey

Natural data offer a hard challenge to data analysis. One set of tools is being developed by several teams to face this difficult task: persistent topology. After a brief introduction to this theory, some applications to the analysis and classification of cells, liver and skin lesions, music pieces, gait, oil and gas reservoirs, cyclones, galaxies, bones, brain connections, languages, and handwritten and gestured letters are shown.
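
As a purely illustrative companion to the survey, the following self-contained Python sketch computes 0-dimensional persistence, i.e., the birth and death of connected components as a distance threshold grows, over an invented point set. Real applications would use a dedicated library and higher-dimensional features.

```python
import math
from itertools import combinations

def zero_dim_persistence(points):
    """0-dimensional persistence: every point's component is born at scale 0;
    a component dies when an edge of the growing Rips filtration merges it
    into another one (tracked with a union-find structure)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # All pairwise edges sorted by length: the filtration order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # two components merge: one dies
            parent[ri] = rj
            deaths.append(length)
    # One bar (0, d) per merge; a single component lives forever.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]      # two well-separated pairs
print(zero_dim_persistence(pts))
# Short bars die at distance 1 (within-pair merges); the long bar dies at
# distance 5, reflecting the gap between the two clusters.
```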

Massimo Ferri

Predictive Models for Differentiation Between Normal and Abnormal EEG Through Cross-Correlation and Machine Learning Techniques

Currently, in hospitals and medical clinics, increasingly large amounts of data are being recorded, usually derived from clinical examinations and procedures. An example of such stored data is the electroencephalogram (EEG), which is of high importance for the various diseases that affect the brain. These data are stored to keep the patient’s clinical history and to help medical experts perform future procedures, such as discovering patterns of specific diseases. However, the growth of medical data storage makes manual analysis unfeasible. Also, the EEG can contain patterns that are difficult to observe with the naked eye. In this work, a cross-correlation technique was applied for feature extraction from a set of 200 EEG segments. Afterwards, predictive models were built using the machine learning algorithms J48, 1NN, and BP-MLP (backpropagation based on the multilayer perceptron), which implement a decision tree, nearest neighbor, and an artificial neural network, respectively. The models were evaluated using 10-fold cross-validation and contingency-table methods. The evaluation results showed that the model built with J48 performed better and was more likely to correctly classify EEG segments in this study than 1NN and BP-MLP, reaching 98.50% accuracy.
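
The chapter's exact feature-extraction procedure is not reproduced here; the sketch below only illustrates the general idea of using the peak of a normalized cross-correlation as a scalar feature for a classifier. The signals, template frequency, and window length are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def xcorr_peak(segment: np.ndarray, reference: np.ndarray) -> float:
    """Peak of the normalized cross-correlation between two signals,
    used here as a single scalar feature for a classifier."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-12)
    ref = (reference - reference.mean()) / (reference.std() + 1e-12)
    corr = np.correlate(seg, ref, mode="full") / len(seg)
    return float(corr.max())

# Synthetic stand-ins for an EEG segment and a reference template.
t = np.linspace(0, 1, 256)
reference = np.sin(2 * np.pi * 10 * t)                  # 10 Hz template
segment = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * rng.standard_normal(256)

print(f"feature value: {xcorr_peak(segment, reference):.3f}")
# A vector of such values (one per template or channel) would then be fed
# to classifiers such as a decision tree or k-NN.
```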

Jefferson Tales Oliva, João Luís Garcia Rosa

A Brief Philosophical Note on Information

I will start by posing a question that came to my attention when, some years ago, I realized the importance of Machine Learning for the future theoretical and applied fields of Computer Science.

Vincenzo Manca

Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

From medical charts to the national census, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation, bringing healthcare into the digital age. Ranging from electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare today generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, the ability to derive accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While a significant effort has been undertaken to develop models able to process the volume of data obtained during the analysis of millions of digitized patient records, it is important to remember that volume represents only one aspect of the data. In fact, drawing on data from an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter focuses on highlighting such challenges and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we present a discussion of the data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques.

Keith Feldman, Louis Faust, Xian Wu, Chao Huang, Nitesh V. Chawla

A Fast Semi-Automatic Segmentation Tool for Processing Brain Tumor Images

Segmentation, the process of delineating boundaries and features within images, is a vital part of both the clinical assessment and the computational analysis of brain cancers. Here, we present an open-source algorithm (MITKats), built on the Medical Imaging Interaction Toolkit, offering user-friendly and expedient tools for semi-automatic segmentation. To evaluate its performance against competing algorithms, we applied MITKats to MRIs of 38 high-grade glioma cases from publicly available benchmarks. The similarity of the segmentations to expert-delineated ground truths approached the discrepancy among different manual raters, the theoretically maximal precision. The average time spent on each segmentation was 5 min, making MITKats between 4 and 11 times faster than competing semi-automatic algorithms while retaining similar accuracy. We conclude with remarks on the utility of segmentation for medical data analysis as well as its further challenges.
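
The abstract does not name the similarity measure used to compare segmentations with the expert-delineated ground truths; the Dice coefficient is a common choice for such comparisons and is sketched below over small invented binary masks (this is not code from MITKats).

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Invented 2D masks standing in for a tumor segmentation and a ground truth.
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True
truth = np.zeros((8, 8), dtype=bool)
truth[3:7, 3:7] = True
print(f"Dice = {dice_coefficient(pred, truth):.3f}")  # 2*9 / (16+16) = 0.562
```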

Andrew X. Chen, Raúl Rabadán

Topological Characteristics of Oil and Gas Reservoirs and Their Applications

We demonstrate applications of the topological characteristics of oil and gas reservoirs, considered as three-dimensional bodies, to geological modeling.

V. A. Baikov, R. R. Gilmanov, I. A. Taimanov, A. A. Yakovlev

Convolutional and Recurrent Neural Networks for Activity Recognition in Smart Environment

Convolutional Neural Networks (CNNs) are very useful for the fully automatic extraction of discriminative features from raw sensor data. This is an important problem in activity recognition, which is of enormous interest in ambient sensor environments due to its universality across various applications. Activity recognition in smart homes uses large amounts of time-series sensor data to infer daily living activities, and extracting effective features from those data is a challenging task. In this paper we demonstrate the use of a CNN and compare the results with Long Short-Term Memory (LSTM) recurrent neural networks and other machine learning algorithms, including Naive Bayes, Hidden Markov Models, Hidden Semi-Markov Models, and Conditional Random Fields. The experimental results on publicly available smart home datasets demonstrate that the performance of the 1D-CNN is similar to LSTM and better than the other probabilistic models.
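
The paper's actual architecture and hyperparameters are not given in the abstract; the following PyTorch sketch shows only the generic shape of a 1D-CNN over windows of multichannel sensor data, with every layer size, sensor count, and window length invented for illustration.

```python
import torch
import torch.nn as nn

class Tiny1DCNN(nn.Module):
    """Minimal 1D-CNN for window-based activity classification."""
    def __init__(self, n_sensors: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_sensors, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # -> (batch, 64, 1)
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_sensors, window_length)
        return self.classifier(self.features(x).squeeze(-1))

# Invented shapes: 16 sensors, windows of 128 time steps, 8 activity classes.
model = Tiny1DCNN(n_sensors=16, n_classes=8)
dummy = torch.randn(4, 16, 128)               # a batch of 4 sensor windows
print(model(dummy).shape)                     # torch.Size([4, 8]) -> logits
```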

Deepika Singh, Erinc Merdivan, Sten Hanke, Johannes Kropf, Matthieu Geist, Andreas Holzinger

Backmatter
