
2014 | Book

Interactive Knowledge Discovery and Data Mining in Biomedical Informatics

State-of-the-Art and Future Challenges

Edited by: Andreas Holzinger, Igor Jurisica

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

One of the grand challenges in our digital world is the large, complex and often weakly structured data sets, together with massive amounts of unstructured information. This “big data” challenge is most evident in biomedical informatics: the trend towards precision medicine has resulted in an explosion in the amount of generated biomedical data sets. Although human experts are very good at pattern recognition in dimensions of ≤ 3, most of the data is high-dimensional, which often makes manual analysis impossible; neither the medical doctor nor the biomedical researcher can memorize all these facts. A synergistic combination of methodologies and approaches from two fields offers ideal conditions for unraveling these problems: Human–Computer Interaction (HCI) and Knowledge Discovery/Data Mining (KDD), with the goal of supporting human capabilities with machine learning.

This state-of-the-art survey is an output of the HCI-KDD expert network and features 19 carefully selected and reviewed papers related to seven hot and promising research areas: Area 1: Data Integration, Data Pre-processing and Data Mapping; Area 2: Data Mining Algorithms; Area 3: Graph-based Data Mining; Area 4: Entropy-Based Data Mining; Area 5: Topological Data Mining; Area 6: Data Visualization; and Area 7: Privacy, Data Protection, Safety and Security.

Table of Contents

Frontmatter
Knowledge Discovery and Data Mining in Biomedical Informatics: The Future Is in Integrative, Interactive Machine Learning Solutions
Abstract
Biomedical research is drowning in data, yet starving for knowledge. Current challenges in biomedical research and clinical practice include information overload – the need to combine vast amounts of structured, semi-structured and weakly structured data with vast amounts of unstructured information – and the need to optimize workflows, processes and guidelines, to increase capacity while reducing costs and improving efficiency. In this paper we provide a very short overview of interactive and integrative solutions for knowledge discovery and data mining. In particular, we emphasize the benefits of including the end user in the “interactive” knowledge discovery process. We describe some of the most important challenges, including the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building realistic models. The HCI-KDD approach, which is a synergistic combination of methodologies and approaches from two areas, Human–Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD), offers ideal conditions for solving these challenges, with the goal of supporting human intelligence with machine intelligence. There is an urgent need for integrative and interactive machine learning solutions, because no medical doctor or biomedical researcher can keep pace today with the increasingly large and complex data sets – often called “Big Data”.
Andreas Holzinger, Igor Jurisica
Visual Data Mining: Effective Exploration of the Biological Universe
Abstract
Visual Data Mining (VDM) is supported by interactive and scalable network visualization and analysis, which in turn enables effective exploration and communication of ideas within multiple biological and biomedical fields. Large networks, such as the protein interactome or transcriptional regulatory networks, contain hundreds of thousands of objects and millions of relationships. These networks are continuously evolving as new knowledge becomes available, and their content is richly annotated and can be presented in many different ways. Attempting to discover knowledge and new theories within these complex data sets can involve many workflows, such as accurately representing many formats of source data, merging heterogeneous and distributed data sources, complex database searching, integrating results from multiple computational and mathematical analyses, and effectively visualizing properties and results. Our experience with biology researchers has required us to address their needs and requirements in the design and development of a scalable and interactive network visualization and analysis platform, NAViGaTOR, now in its third major release.
David Otasek, Chiara Pastrello, Andreas Holzinger, Igor Jurisica
Darwin or Lamarck? Future Challenges in Evolutionary Algorithms for Knowledge Discovery and Data Mining
Abstract
Evolutionary Algorithms (EAs) are a fascinating branch of computational intelligence with much potential for use in many application areas. The fundamental principle of EAs is to use ideas inspired by the biological mechanisms observed in nature, such as selection and genetic changes, to find the best solution for a given optimization problem. Generally, EAs use iterative processes, growing a population of solutions selected in a guided random search and using parallel processing, in order to achieve a desired result. Such population-based approaches, for example particle swarm and ant colony optimization (inspired by biology), are among the most popular metaheuristic methods being used in machine learning, along with others such as simulated annealing (inspired by thermodynamics). In this paper, we provide a short survey on the state of the art of EAs, beginning with some background on the theory of evolution and contrasting the original ideas of Darwin and Lamarck; we then continue with a discussion of the analogy between biological and computational sciences, and briefly describe some fundamentals of EAs, including Genetic Algorithms, Genetic Programming, Evolution Strategies, Swarm Intelligence Algorithms (i.e., Particle Swarm Optimization, Ant Colony Optimization, Bacteria Foraging Algorithms, the Bees Algorithm, Invasive Weed Optimization), Memetic Search, Differential Evolution Search, Artificial Immune Systems, the Gravitational Search Algorithm, and the Intelligent Water Drops Algorithm. We conclude with a short description of the usefulness of EAs for Knowledge Discovery and Data Mining tasks and present some open problems and challenges to further stimulate research.
Katharina Holzinger, Vasile Palade, Raul Rabadan, Andreas Holzinger
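To make the population-based search loop described in this abstract concrete, here is a minimal Python sketch of a generic evolutionary algorithm, selection of the fitter half plus Gaussian mutation, applied to a toy one-dimensional fitness function; the objective function, population size and mutation scale are illustrative assumptions and do not reproduce any specific method surveyed in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Toy objective to be maximized (illustrative only).
    return np.exp(-x**2) + 0.3 * np.cos(5 * x)

pop = rng.uniform(-3, 3, size=50)              # initial population of candidate solutions
for generation in range(100):
    scores = fitness(pop)
    # Selection: keep the fitter half of the population.
    parents = pop[np.argsort(scores)[-25:]]
    # Variation: create children by mutating randomly chosen parents.
    children = rng.choice(parents, size=25) + rng.normal(0, 0.1, size=25)
    pop = np.concatenate([parents, children])

best = pop[np.argmax(fitness(pop))]
print(f"best solution found: x = {best:.3f}, fitness = {fitness(best):.3f}")
```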
On the Generation of Point Cloud Data Sets: Step One in the Knowledge Discovery Process
Abstract
Computational geometry and topology are areas which have much potential for the analysis of arbitrarily high-dimensional data sets. In order to apply geometric or topological methods, one must first generate a representative point cloud data set from the original data source, or at least a metric or distance function which defines a distance between the elements of a given data set. Consequently, the first question is: How do we get point cloud data sets? Or, more precisely: What is the optimal way of generating such data sets? The answers to these questions are not trivial. If a natural image is taken as an example, we are concerned more with the content, with the shape of the relevant data represented by this image, than with its mere matrix of pixels. Once a point cloud has been generated from a data source, it can be used as input for the application of graph theory and computational topology. In this paper we first describe the case of natural point clouds, i.e. where the data are already represented by points; we then provide some fundamentals of medical images, particularly dermoscopy, confocal laser scanning microscopy, and total-body photography; we describe the use of graph-theoretic concepts for image analysis, give some medical background on skin cancer, and concentrate on the challenges of dealing with lesion images. We discuss some relevant algorithms, including the watershed algorithm, region splitting (graph cuts), and region merging (minimum spanning tree), and finally describe some open problems and future challenges.
Andreas Holzinger, Bernd Malle, Marcus Bloice, Marco Wiltgen, Massimo Ferri, Ignazio Stanganelli, Rainer Hofmann-Wellenhof
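As a hedged sketch of the first step discussed in this chapter, turning a data source into a point cloud, the following Python example extracts the bright pixels of a small synthetic grey-scale image as 2-D points and then builds a minimum spanning tree over them, the graph structure mentioned above for region merging; the synthetic image and the intensity threshold are illustrative assumptions, not the chapter's pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

# Synthetic 32x32 grey-scale "lesion" image: a bright blob on a dark background.
yy, xx = np.mgrid[0:32, 0:32]
image = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / 40.0)

# Step 1: turn the image into a point cloud by keeping the coordinates of bright pixels.
points = np.column_stack(np.nonzero(image > 0.5)).astype(float)

# Step 2: build a complete weighted graph on the points (Euclidean distances)
# and extract its minimum spanning tree, as used for region merging.
distances = squareform(pdist(points))
mst = minimum_spanning_tree(distances)

print(f"point cloud size: {len(points)} points")
print(f"MST edges: {mst.nnz}, total edge length: {mst.sum():.2f}")
```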
Adapted Features and Instance Selection for Improving Co-training
Abstract
High-quality, labeled data is essential for successfully applying machine learning methods to real-world problems. However, in many cases, the amount of labeled data is insufficient and labeling that data is expensive or time consuming. Co-training algorithms, which use unlabeled data in order to improve classification, have proven to be effective in such cases. Generally, co-training algorithms work by using two classifiers, trained on two different views of the data, to label large amounts of unlabeled data, and hence they help minimize the human effort required to label new data. In this paper we propose simple and effective strategies for improving the basic co-training framework. The proposed strategies improve two aspects of the co-training algorithm: the manner in which the feature set is partitioned and the method of selecting additional instances. An experimental study over 25 datasets shows that the proposed strategies are especially effective for imbalanced datasets. In addition, in order to better understand the inner workings of the co-training process, we provide an in-depth analysis of the effects of classifier error rates and of the performance imbalance between the two “views” of the data. We believe this analysis offers insights that could be used for future research.
Gilad Katz, Asaf Shabtai, Lior Rokach
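The basic co-training loop summarized above can be sketched in a few lines of Python: two classifiers are trained on two feature views of a small labeled pool and, in each round, add their most confident predictions on unlabeled instances to the training set. The dataset, the naive Bayes classifiers, the view split and the confidence threshold are illustrative assumptions; the chapter's improved partitioning and instance-selection strategies are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
view1, view2 = X[:, :10], X[:, 10:]                    # two feature "views" (here: a simple split)

labeled = rng.choice(len(y), size=30, replace=False)   # small initial labeled pool
y_train = np.full(len(y), -1)                          # -1 marks unlabeled instances
y_train[labeled] = y[labeled]

clf1, clf2 = GaussianNB(), GaussianNB()
for _ in range(10):                                    # co-training rounds
    idx = y_train != -1
    clf1.fit(view1[idx], y_train[idx])
    clf2.fit(view2[idx], y_train[idx])
    for clf, view in ((clf1, view1), (clf2, view2)):
        pool = np.where(y_train == -1)[0]
        if len(pool) == 0:
            break
        proba = clf.predict_proba(view[pool])
        confident = pool[proba.max(axis=1) > 0.95]     # each view labels its confident instances
        if confident.size:
            y_train[confident] = clf.predict(view[confident])

print("remaining unlabeled instances:", int((y_train == -1).sum()))
```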
Knowledge Discovery and Visualization of Clusters for Erythromycin Related Adverse Events in the FDA Drug Adverse Event Reporting System
Abstract
In this paper, a research study to discover hidden knowledge in the reports of the public release of the Food and Drug Administration (FDA)’s Adverse Event Reporting System (FAERS) for erythromycin is presented. Erythromycin is an antibiotic used to treat certain infections caused by bacteria. Bacterial infections can cause significant morbidity and mortality, and the costs of treatment are known to be a burden on health institutions around the world. Since erythromycin is of great interest in medical research, the relationships between patient demographics, adverse event outcomes, and the adverse events of this drug were analyzed. The FDA’s FAERS database was used to create a dataset for cluster analysis in order to gain some statistical insights. The reports contained within the dataset cover 3792 (44.1%) female and 4798 (55.8%) male patients. The mean patient age is 41.759 years. The most frequently reported adverse event is oligohydramnios and the most frequent adverse event outcome is OT (Other). The dataset was analyzed by cluster analysis using the DBSCAN algorithm, and from the results a number of clusters and associations were obtained, which are reported here. It is believed that medical researchers and pharmaceutical companies can utilize these results and test these relationships within their clinical studies.
Pinar Yildirim, Marcus Bloice, Andreas Holzinger
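Since the FAERS extract itself is not reproduced here, the following Python sketch only illustrates the kind of density-based cluster analysis reported in the chapter, running DBSCAN on a synthetic table of patient age and adverse-event counts; the features, their distributions and the DBSCAN parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for report-level features (age in years, number of reported events).
ages = np.concatenate([rng.normal(40, 5, 200), rng.normal(70, 5, 100)])
n_events = np.concatenate([rng.poisson(2, 200), rng.poisson(6, 100)])
X = StandardScaler().fit_transform(np.column_stack([ages, n_events]))

# Density-based clustering; reports in low-density regions are labeled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
for label in sorted(set(labels)):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {np.sum(labels == label)} reports")
```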
On Computationally-Enhanced Visual Analysis of Heterogeneous Data and Its Application in Biomedical Informatics
Abstract
With the advance of new data acquisition and generation technologies, the biomedical domain is becoming increasingly data-driven. Thus, understanding the information in large and complex data sets has been the focus of several research fields, such as statistics, data mining, machine learning, and visualization. While the first three fields predominantly rely on computational power, visualization relies mainly on human perceptual and cognitive capabilities for extracting information. Data visualization, similar to Human–Computer Interaction, aims at an appropriate interaction between human and data in order to exploit data sets interactively. Specifically in the analysis of complex data sets, visualization researchers have integrated computational methods to enhance the interactive processes. In this state-of-the-art report, we investigate how such an integration is carried out. We study the related literature with respect to the underlying analytical tasks and the methods of integration. In addition, we focus on how such methods are applied to the biomedical domain and present a concise overview within our taxonomy. Finally, we discuss some open problems and future challenges.
Cagatay Turkay, Fleur Jeanquartier, Andreas Holzinger, Helwig Hauser
A Policy-Based Cleansing and Integration Framework for Labour and Healthcare Data
Abstract
Large amounts of data are collected by public administrations and healthcare organizations; the integration of the data scattered across several information systems can facilitate the comprehension of complex scenarios and support the activities of decision makers.
Unfortunately, the quality of information system archives is very poor, as widely reported in the existing literature. Data cleansing is one of the most frequently used data improvement techniques. Data can be cleansed in several ways; the optimal choice, however, depends strictly on the integration and analysis processes to be performed. Therefore, the design of a data analysis process should consider the data integration, cleansing, and analysis activities in a holistic way. However, in the existing literature, the data integration and cleansing issues have mostly been addressed in isolation.
In this paper we describe how a model-based cleansing framework is extended to also address integration activities. The combined approach facilitates the rapid prototyping, development, and evaluation of data pre-processing activities. Furthermore, the combined use of formal methods and visualization techniques strongly empowers the data analyst, who can effectively evaluate how cleansing and integration activities affect the data analysis. An example focusing on labour and healthcare data integration is presented.
Roberto Boselli, Mirko Cesarini, Fabio Mercorio, Mario Mezzanzanica
Interactive Data Exploration Using Pattern Mining
Abstract
We live in the era of data and need tools to discover valuable information in large amounts of data. The goal of exploratory data mining is to provide as much insight into the given data as possible. Within this field, pattern set mining aims at revealing structure in the form of sets of patterns. Although pattern set mining has been shown to be an effective solution to the infamous pattern explosion, important challenges remain.
One of the key challenges is to develop principled methods that allow user- and task-specific information to be taken into account, by directly involving the user in the discovery process. This way, the resulting patterns will be more relevant and interesting to the user. To achieve this, pattern mining algorithms will need to be combined with techniques from both visualisation and human-computer interaction. Another challenge is to establish techniques that perform well under constrained resources, as existing methods are usually computationally intensive. Consequently, they are only applied to relatively small datasets and on fast computers.
The ultimate goal is to make pattern mining practically more useful, by enabling the user to interactively explore the data and identify interesting structure. In this paper we describe the state of the art, discuss open problems, and outline promising future directions.
Matthijs van Leeuwen
Resources for Studying Statistical Analysis of Biomedical Data and R
Abstract
The past decade has seen explosive growth in digitized medical data. This trend offers medical practitioners an unparalleled opportunity to identify effectiveness of treatments for patients using summary statistics and to offer patients more personalized medical treatments based on predictive analytics. To exploit this opportunity, statisticians and computer scientists need to work and communicate effectively with medical practitioners to ensure proper measurement data, collection of sufficient volumes of heterogeneous data to ensure patient privacy, and understanding of probabilities and sources of errors associated with data sampling. Interdisciplinary collaborations between scientists are likely to lead to the development of more effective methods for explaining probabilities, possible errors, and risks associated with treatment options to patients. This chapter introduces some online resources to help medical practitioners with little or no background in summary and predictive statistics learn basic statistical concepts and implement data analysis on their personal computers using R, a high-level computer language that requires relatively little training. Readers who are only interested in understanding basic statistical concepts may want to skip the subsection on R.
Mei Kobayashi
A Kernel-Based Framework for Medical Big-Data Analytics
Abstract
The recent trend towards standardization of Electronic Health Records (EHRs) represents a significant opportunity and challenge for medical big-data analytics. The challenge typically arises from the nature of the data, which may be heterogeneous, sparse, very high-dimensional, incomplete and inaccurate. Of these, standard pattern recognition methods can typically address the issues of high dimensionality, sparsity and inaccuracy. The remaining issues of incompleteness and heterogeneity, however, are problematic; data can be as diverse as handwritten notes, blood-pressure readings and MR scans, and typically very little of this data will be co-present for each patient at any given time interval.
We therefore advocate a kernel-based framework as being most appropriate for handling these issues, using the neutral point substitution method to accommodate missing inter-modal data. For pre-processing of image-based MR data we advocate a Deep Learning solution for contextual areal segmentation, with edit-distance based kernel measurement then used to characterize relevant morphology.
David Windridge, Miroslaw Bober
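As a hedged illustration of the general kernel-based idea (not the neutral point substitution method advocated in the chapter), the following Python sketch computes one RBF kernel per synthetic data modality, combines them by summation, and trains an SVM on the precomputed combined kernel; the data, labels and kernel choices are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
# Two synthetic "modalities" per patient, e.g. lab values and image-derived features.
labs = rng.normal(size=(n, 5))
imaging = rng.normal(size=(n, 20))
y = (labs[:, 0] + imaging[:, 0] > 0).astype(int)   # toy label depending on both modalities

# One kernel per modality; the combined kernel is their (unweighted) sum.
K = rbf_kernel(labs) + rbf_kernel(imaging)

train, test = np.arange(150), np.arange(150, n)
clf = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
accuracy = clf.score(K[np.ix_(test, train)], y[test])
print(f"test accuracy with the combined kernel: {accuracy:.2f}")
```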
On Entropy-Based Data Mining
Abstract
In the real world, we are confronted not only with complex and high-dimensional data sets, but usually with noisy, incomplete and uncertain data, where the application of traditional methods of knowledge discovery and data mining always entails the danger of modeling artifacts. Information entropy was originally introduced by Shannon (1949) as a measure of uncertainty in the data. Since then, many different types of entropy methods have emerged, with a large number of different purposes and possible application areas. In this paper, we briefly discuss the applicability of entropy methods for use in knowledge discovery and data mining, with particular emphasis on biomedical data. We present a very short overview of the state of the art, with a focus on four methods: Approximate Entropy (ApEn), Sample Entropy (SampEn), Fuzzy Entropy (FuzzyEn), and Topological Entropy (FiniteTopEn). Finally, we discuss some open problems and future research challenges.
Andreas Holzinger, Matthias Hörtenhuber, Christopher Mayer, Martin Bachler, Siegfried Wassertheurer, Armando J. Pinho, David Koslicki
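To make one of the four listed measures concrete, here is a small, hedged Python implementation of Sample Entropy (SampEn) in its usual form, the negative logarithm of the ratio of template matches of length m+1 to matches of length m within a tolerance r; the test signals and the parameter values (m = 2, r = 0.2 times the standard deviation) are illustrative assumptions.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """SampEn = -ln(A/B), where B counts pairs of length-m templates within
    tolerance r (Chebyshev distance) and A counts the same for length m+1.
    Self-matches are excluded."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)

    def count_matches(length):
        n_templates = len(x) - m          # same number of templates for m and m+1
        templates = np.array([x[i:i + length] for i in range(n_templates)])
        count = 0
        for i in range(n_templates):
            dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += np.sum(dist <= r)
        return count

    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B)

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 1000))     # highly regular signal
noisy = regular + rng.normal(0, 0.5, 1000)             # same signal with added noise
print(f"SampEn regular: {sample_entropy(regular):.3f}")
print(f"SampEn noisy:   {sample_entropy(noisy):.3f}")  # expected to be larger
```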
Sparse Inverse Covariance Estimation for Graph Representation of Feature Structure
Abstract
The access to more information provided by modern high-throughput measurement systems has made it possible to investigate finer details of complex systems. However, it has also increased the number of features, and thereby the dimensionality of the data, to be processed in data analysis. Higher dimensionality makes it particularly challenging to understand complex systems, by blowing up the number of possible configurations of features we need to consider. Structure learning with Gaussian Markov random fields can provide a remedy, by identifying the conditional independence structure of features in a form that is easy to visualize and understand. The learning is based on a convex optimization problem, called sparse inverse covariance estimation, for which many efficient algorithms have been developed in the past few years. When dimensions are much larger than sample sizes, structure learning requires considering statistical stability, where connections to data mining arise in terms of discovering common or rare subgraphs as patterns. The outcome of structure learning can be visualized as graphs, represented according to additional information if required, providing a perceivable way to investigate complex feature spaces.
Sangkyun Lee
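A minimal, hedged Python sketch of sparse inverse covariance estimation as described above, using scikit-learn's GraphicalLasso on synthetic data: nonzero off-diagonal entries of the estimated precision matrix are read as edges of the conditional-independence graph. The data-generating model and the regularization strength alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6

# Synthetic data with a known sparse dependency structure: feature 1 depends on
# feature 0, feature 3 on feature 2; the remaining features are independent noise.
X = rng.normal(size=(n_samples, n_features))
X[:, 1] += 0.8 * X[:, 0]
X[:, 3] += 0.8 * X[:, 2]

model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_

# Edges of the conditional-independence graph: nonzero off-diagonal entries.
edges = [(i, j) for i in range(n_features) for j in range(i + 1, n_features)
         if abs(precision[i, j]) > 1e-6]
print("estimated edges:", edges)
```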
Multi-touch Graph-Based Interaction for Knowledge Discovery on Mobile Devices: State-of-the-Art and Future Challenges
Abstract
Graph-based knowledge representation has been a hot topic for some years and still has much research potential, particularly regarding advances in the application of graph theory for creating benefits in the biomedical domain. Graphs are among the most powerful tools for mapping structures within a given data set and for recognizing relationships between specific data objects. Many advantages of graph-based data structures can be found in the applicability of methods from network analysis, topology and data mining (e.g. the small-world phenomenon, cluster analysis). In this paper we present the state of the art in graph-based approaches for multi-touch interaction on mobile devices and highlight some open problems to stimulate further research and future developments. This is particularly important in the medical domain, as a conceptual graph analysis may provide novel insights into hidden patterns in data and hence support interactive knowledge discovery.
Andreas Holzinger, Bernhard Ofner, Matthias Dehmer
Intelligent Integrative Knowledge Bases: Bridging Genomics, Integrative Biology and Translational Medicine
Abstract
Successful application of translational medicine will require understanding the complex nature of disease, fueled by effective analysis of multidimensional ’omics’ measurements and systems-level studies. In this paper, we present a perspective on the intelligent integrative knowledge base (I2KB) for data management, statistical analysis and knowledge discovery related to human disease. By building a bridge between patient associations, clinicians, experimentalists and modelers, I2KB will facilitate the emergence and propagation of systems medicine studies, which are a prerequisite for large-scale clinical trials, efficient diagnosis, disease screening, drug target evaluation and the development of new therapeutic strategies.
Hoan Nguyen, Julie D. Thompson, Patrick Schutz, Olivier Poch
Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges
Abstract
Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text that has been entered in a non-standardized format, which poses many challenges for the processing of such data. For the clinical doctor, the written text in the medical findings – not images or multimedia data – is still the basis for decision making. However, the steadily increasing volumes of unstructured information require machine learning approaches for data mining, i.e. text mining. This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we provide some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.
Andreas Holzinger, Johannes Schantl, Miriam Schroettner, Christin Seifert, Karin Verspoor
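As a hedged illustration of one of the statistical methods listed, Latent Semantic Analysis, the following Python sketch applies TF-IDF weighting followed by truncated SVD to a handful of toy clinical-style sentences and prints the top terms of each latent topic; the corpus and the number of components are illustrative assumptions, not material from the chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for short clinical notes (illustrative only).
documents = [
    "patient reports chest pain and shortness of breath",
    "ecg shows signs of myocardial infarction",
    "skin lesion with irregular borders, suspected melanoma",
    "dermoscopy of the pigmented lesion was performed",
    "chest pain resolved after treatment, ecg normal",
]

# Latent Semantic Analysis: TF-IDF term-document matrix followed by truncated SVD.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)

terms = tfidf.get_feature_names_out()
for k, component in enumerate(lsa.components_):
    top_terms = [terms[i] for i in component.argsort()[-4:][::-1]]
    print(f"latent topic {k}: {top_terms}")
```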
Protecting Anonymity in Data-Driven Biomedical Science
Abstract
With the formidable recent improvements in data processing and information retrieval, knowledge discovery/data mining, business intelligence, content analytics and other upcoming empirical approaches have enormous potential, particularly for the data-intensive biomedical sciences. For results derived using empirical methods, the underlying data set should be made available, at least to the reviewers during the review process, to ensure the quality of the research, to prevent fraud or errors, and to enable the replication of studies. However, particularly in medicine and the life sciences, this leads to a conflict: the disclosure of research data raises considerable privacy concerns, since researchers have the full responsibility to protect their (volunteer) subjects and must therefore adhere to the respective ethical policies. One solution to this problem lies in protecting sensitive information in medical data sets by applying appropriate anonymization. This paper provides an overview of the most important and well-researched approaches and discusses open research problems in this area, with the goal of acting as a starting point for further investigation.
Peter Kieseberg, Heidelinde Hobel, Sebastian Schrittwieser, Edgar Weippl, Andreas Holzinger
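One widely used formal criterion behind such anonymization is k-anonymity, which requires every combination of quasi-identifying attributes to occur at least k times in the released table. The following Python sketch merely checks that property on a hypothetical toy table with pandas; it is not the chapter's method, and real anonymization additionally requires generalizing or suppressing the quasi-identifiers until the desired k is reached.

```python
import pandas as pd

# Hypothetical released research table (illustrative values only).
records = pd.DataFrame({
    "age_group":  ["30-39", "30-39", "30-39", "40-49", "40-49", "40-49"],
    "zip_prefix": ["80**",  "80**",  "80**",  "81**",  "81**",  "81**"],
    "diagnosis":  ["flu", "asthma", "flu", "diabetes", "flu", "asthma"],
})

quasi_identifiers = ["age_group", "zip_prefix"]

def k_anonymity(df, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifier combinations."""
    return int(df.groupby(quasi_ids).size().min())

k = k_anonymity(records, quasi_identifiers)
print(f"the table is {k}-anonymous with respect to {quasi_identifiers}")
```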
Biobanks – A Source of Large Biological Data Sets: Open Problems and Future Challenges
Abstract
Biobanks are collections of biological samples (e.g. tissues, blood and derivatives, other body fluids, cells, DNA, etc.) and their associated data. Consequently, human biobanks represent collections of human samples and data and are of fundamental importance for scientific research, as they are an excellent resource for accessing and measuring biological constituents that can be used to monitor the status and trends of both health and disease. Most -omics data rely on secure access to these collections of stored human samples to provide the basis for establishing the ranges and frequencies of expression. However, there are many open questions and future challenges associated with the large amounts of heterogeneous data, ranging from pre-processing, data integration and data fusion to knowledge discovery and data mining, along with a strong focus on privacy, data protection, safety and security.
Berthold Huppertz, Andreas Holzinger
On Topological Data Mining
Abstract
Humans are very good at pattern recognition in dimensions of ≤ 3. However, most data, e.g. in the biomedical domain, are of much higher dimension than 3, which makes manual analyses awkward and sometimes practically impossible. Mapping higher-dimensional data into lower dimensions is a major task in Human–Computer Interaction and Interactive Data Visualization, and a concerted effort, including recent advances in computational topology, may contribute to making sense of such data. Topology has its roots in the works of Euler and Gauss but was, for a long time, a part of purely theoretical mathematics. Within the last ten years, computational topology has rapidly gained interest amongst computer scientists. Topology is basically the study of abstract shapes and spaces and the mappings between them. It originated from the study of geometry and set theory. Topological methods can be applied to data represented by point clouds, that is, finite subsets of the n-dimensional Euclidean space. We can think of the input as a sample of some unknown space which one wishes to reconstruct and understand, and we must distinguish between the ambient (embedding) dimension n and the intrinsic dimension of the data. Whilst n is usually high, the intrinsic dimension, being of primary interest, is typically small. Therefore, knowing the intrinsic dimensionality of data can be seen as a first step towards understanding its structure. Consequently, applying topological techniques to data mining and knowledge discovery is a hot and promising future research area.
Andreas Holzinger
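To illustrate the distinction between ambient and intrinsic dimension drawn above, the following hedged Python sketch samples a point cloud from a two-dimensional linear subspace embedded in ten dimensions and estimates its intrinsic dimension with a simple PCA-based criterion (the number of components needed to explain 95% of the variance); the data, the threshold and the estimator itself are illustrative assumptions, and PCA captures only linear structure, whereas the topological methods discussed in the chapter also address nonlinear shape.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, ambient_dim, intrinsic_dim = 500, 10, 2

# Point cloud: samples from a 2-dimensional linear subspace of R^10, plus small noise.
latent = rng.normal(size=(n_points, intrinsic_dim))
embedding = rng.normal(size=(intrinsic_dim, ambient_dim))
cloud = latent @ embedding + 0.01 * rng.normal(size=(n_points, ambient_dim))

# PCA-based estimate: number of principal components needed for 95% of the variance.
centered = cloud - cloud.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
estimate = int(np.searchsorted(explained, 0.95) + 1)

print(f"ambient dimension: {ambient_dim}, estimated intrinsic dimension: {estimate}")
```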
Backmatter
Metadata
Title
Interactive Knowledge Discovery and Data Mining in Biomedical Informatics
Edited by
Andreas Holzinger
Igor Jurisica
Copyright year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-43968-5
Print ISBN
978-3-662-43967-8
DOI
https://doi.org/10.1007/978-3-662-43968-5
