
2021 | Book

Heterogeneous Data Management, Polystores, and Analytics for Healthcare

VLDB Workshops, Poly 2020 and DMAH 2020, Virtual Event, August 31 and September 4, 2020, Revised Selected Papers

Edited by: Prof. Dr. Vijay Gadepally, Timothy Mattson, Michael Stonebraker, Tim Kraska, Fusheng Wang, Gang Luo, Jun Kong, Alevtina Dubovitskaya

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes revised selected papers from two VLDB workshops: The International Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances, Poly 2020, and the 6th International Workshop on Data Management and Analytics for Medicine and Healthcare, DMAH 2020, which were held virtually on August 31 and September 4, 2020.

For Poly 2020, 4 full and 3 short papers were accepted from 10 submissions; for DMAH 2020, 7 full and 2 short papers were accepted from a total of 15 submissions. The papers were organized in topical sections as follows: Privacy, Security and/or Policy Issues for Heterogeneous Data; COVID-19 Data Analytics and Visualization; Deep Learning based Biomedical Data Analytics; NLP based Learning from Unstructured Data; Biomedical Data Modelling and Prediction.

Table of contents

Frontmatter

Poly 2020: Privacy, Security and/or Policy Issues for Heterogeneous Data

Frontmatter
A Polystore Based Database Operating System (DBOS)
Abstract
Current operating systems are complex systems that were designed before today’s computing environments existed. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges of current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on a data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today’s most important applications and propose a plan for the development of a DBOS proof of concept.
Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia
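
To make the data-centric design concrete, here is a minimal sketch of the idea, assuming an invented process table (the table and column names are illustrative, not from the paper): OS state lives in database tables, and a stateless scheduling task is just a query over them.

    # Sketch of the DBOS idea: OS state is stored in database tables, and a
    # stateless scheduler operates on it purely via queries. The names
    # process_table, pid, state, priority are invented for illustration.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE process_table "
               "(pid INTEGER PRIMARY KEY, state TEXT, priority INTEGER)")
    db.executemany("INSERT INTO process_table VALUES (?, ?, ?)",
                   [(1, "READY", 5), (2, "BLOCKED", 9), (3, "READY", 7)])

    # Scheduling decision as a query: run the highest-priority READY process.
    (pid,) = db.execute("SELECT pid FROM process_table WHERE state = 'READY' "
                        "ORDER BY priority DESC LIMIT 1").fetchone()
    db.execute("UPDATE process_table SET state = 'RUNNING' WHERE pid = ?", (pid,))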
Polypheny-DB: Towards Bridging the Gap Between Polystores and HTAP Systems
Abstract
Polystore databases allow data to be stored in different formats and data models and offer several query languages. While such polystore systems are highly beneficial for various analytical workloads, they provide limited support for transactional and for mixed OLTP and OLAP workloads, in contrast to hybrid transactional and analytical processing (HTAP) systems. In this paper, we present Polypheny-DB, a modular polystore that jointly supports analytical and transactional workloads, including update operations, and that thus takes one step towards bridging the gap between polystore and HTAP systems.
Marco Vogt, Nils Hansen, Jan Schönholz, David Lengweiler, Isabel Geissmann, Sebastian Philipp, Alexander Stiemer, Heiko Schuldt
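
The gap the paper addresses can be illustrated with a toy routing rule (this is not Polypheny-DB's actual mechanism, just a sketch of the underlying idea): a polystore serving mixed workloads must send each statement to an engine suited to it.

    # Illustrative only: route point writes to a row store (OLTP path) and
    # aggregation queries to a column store (OLAP path).
    def route(statement: str) -> str:
        s = statement.strip().upper()
        if s.startswith(("INSERT", "UPDATE", "DELETE")):
            return "row_store"        # transactional path
        if "GROUP BY" in s or "AVG(" in s or "SUM(" in s:
            return "column_store"     # analytical path
        return "row_store"

    assert route("UPDATE patients SET age = 42 WHERE id = 7") == "row_store"
    assert route("SELECT dept, AVG(salary) FROM emp GROUP BY dept") == "column_store"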
Persona Model Transfer for User Activity Prediction Across Heterogeneous Domains
Abstract
In this keynote talk, we present our project on cross-domain digital marketing, where we assume totally different service domains such as a Web advertisement domain and an e-commerce domain. Cross-domain approaches are useful in situations where a domain does not have enough data to develop an accurate prediction model of user activities. Our idea is to transfer a persona (user) model from a domain with richer data to a target domain with less data, i.e., one with a worse prediction model. This project is technically very challenging since we assume totally different domains in which users’ activities differ. We present some recent achievements of our project and also discuss our future plans.
Takahiro Hara
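
A minimal sketch of the transfer idea, assuming synthetic data and a generic scikit-learn classifier (the authors' actual persona models are more elaborate): fit on the data-rich source domain, then warm-start fine-tuning on the data-poor target domain.

    # Toy transfer learning: the source-domain fit provides the initialization
    # for the target-domain fit via warm_start. Data here is random stand-ins.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    X_src, y_src = rng.normal(size=(5000, 20)), rng.integers(0, 2, 5000)  # rich domain
    X_tgt, y_tgt = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)    # sparse domain

    model = SGDClassifier(loss="log_loss", warm_start=True)
    model.fit(X_src, y_src)    # learn a persona model on the source domain
    model.fit(X_tgt, y_tgt)    # warm-started fine-tuning on the target domain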
PolyMigrate: Dynamic Schema Evolution and Data Migration in a Distributed Polystore
Abstract
In recent years, polystore databases have been proposed to cope with the challenges stemming from increasingly dynamic and heterogeneous workloads. A polystore database provides a logical schema to the application, but materializes data in different data stores, different data models, and different physical schemas. When the access pattern to data changes, the polystore can decide to migrate data from one store to another or from one data model to another. This necessitates a schema evolution in one or several data stores and the subsequent migration of data. Similarly, when applications change, the global schema might have to be changed as well, with similar consequences for local data stores in terms of schema evolution and data migration. However, the aspect of schema evolution in a polystore database has so far largely been neglected. In this paper, we present the challenges imposed by schema evolution and data migration in Polypheny-DB, a distributed polystore database. With our work-in-progress approach, called PolyMigrate, we show how schema evolution and data migration affect the different layers of a distributed polystore, and we identify different approaches to effectively and efficiently propagate these changes to the underlying stores.
Alexander Stiemer, Marco Vogt, Heiko Schuldt, Uta Störl
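
One illustrative schema-evolution step of the kind the paper studies, sketched against a single SQLite store (the table name and backfill rule are invented): the logical schema gains a column, so the physical store must be altered and its existing data migrated.

    # Toy schema evolution + data migration against one underlying store; in a
    # polystore, this change would have to be propagated to every store.
    import sqlite3

    store = sqlite3.connect(":memory:")
    store.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
    store.execute("INSERT INTO patient VALUES (1, 'Ada')")

    # Logical change: add 'age'. Physical migration: alter table, backfill.
    store.execute("ALTER TABLE patient ADD COLUMN age INTEGER")
    store.execute("UPDATE patient SET age = -1 WHERE age IS NULL")  # sentinel backfill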
An Architecture for the Development of Distributed Analytics Based on Polystore Events
Abstract
To balance the requirements for data consistency and availability, organisations increasingly migrate towards hybrid data persistence architectures (called polystores throughout this paper) comprising both relational and NoSQL databases. The EC-funded H2020 TYPHON project offers facilities for designing and deploying such polystores, otherwise a complex, technically challenging and error-prone task. In addition, it is nowadays increasingly important for organisations to be able to extract business intelligence by monitoring data stored in polystores. In this paper, we propose a novel approach that facilitates the extraction of analytics in a distributed manner by monitoring polystore queries as they arrive for execution. Beyond the analytics architecture, we present a pre-execution authorisation mechanism. We also report on preliminary scalability evaluation experiments, which demonstrate the linear scalability of the proposed architecture.
Athanasios Zolotas, Konstantinos Barmpis, Fady Medhat, Patrick Neubauer, Dimitris Kolovos, Richard F. Paige
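
A toy sketch of the event-driven design, with invented names rather than the TYPHON API: analytics subscribers observe each query event as it arrives, and an authorisation hook can reject a query before execution.

    # Illustrative event bus: subscribers compute analytics over query events,
    # and authorise() runs as a pre-execution check.
    from collections import Counter

    subscribers = []
    table_hits = Counter()

    def on_query(event):                  # an analytics subscriber
        table_hits[event["table"]] += 1

    def authorise(event):                 # pre-execution authorisation hook
        return event["user"] != "blocked_user"

    subscribers.append(on_query)

    def execute(event):
        if not authorise(event):
            raise PermissionError("query rejected before execution")
        for s in subscribers:
            s(event)                      # distribute the event to analytics
        # ... actual execution against the underlying stores would follow ...

    execute({"user": "alice", "table": "orders"})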
Towards Data Discovery by Example
Abstract
Data scientists today have to query an avalanche of multi-source data (e.g., data lakes, company databases) for diverse analytical tasks. Data discovery is labor-intensive, as users have to find the right tables, and combinations thereof, to answer their queries. Data discovery systems automatically find and link (e.g., via joins) tables across various sources to aid users in finding the data they need. In this paper, we outline our ongoing efforts to build a data-discovery-by-example system, DICE, that iteratively searches for new tables guided by user-provided data examples. Additionally, DICE asks users to validate results to improve the discovery process over multiple iterations.
El Kindi Rezig, Allan Vanterpool, Vijay Gadepally, Benjamin Price, Michael Cafarella, Michael Stonebraker
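
The discovery-by-example loop can be sketched in a few lines (a toy stand-in for DICE, with invented tables): rank candidate tables by overlap with the user's example values, then surface the best candidate for user validation in the next iteration.

    # Toy example-guided table search; real systems index values at scale.
    tables = {
        "hospitals": {"boston", "cambridge", "nyc"},
        "weather":   {"rain", "snow"},
        "clinics":   {"boston", "nyc", "austin"},
    }
    examples = {"boston", "nyc"}          # user-provided data examples

    ranked = sorted(tables, key=lambda t: len(tables[t] & examples), reverse=True)
    print(ranked[0])                      # best candidate, shown for validation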
The Transformers for Polystores - The Next Frontier for Polystore Research
Abstract
What if we could solve one of the most complex challenges of polystore research by applying a technique originating in a completely different domain, and originally developed to solve a completely different set of problems? What if we could replace many of the components that make today’s polystore with components that only understand query languages and data in terms of matrices and vectors? This is the vision that we propose as the next frontier for polystore research, and as the opportunity to explore attention-based transformer deep learning architecture as the means for automated source-target query and data translation, with no or minimal hand-coding required, and only through training and transfer learning.
Edmon Begoli, Sudarshan Srinivasan, Maria Mahbub
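
A sketch of the proposed framing, under the assumption that query translation is cast as ordinary sequence-to-sequence learning (the whitespace tokenization and the single training pair below are invented for illustration):

    # Treat source-to-target query translation, e.g. SQL -> MongoDB find(),
    # as seq2seq learning with a standard transformer. One forward pass only;
    # real training would loop over many pairs with a loss and optimizer.
    import torch
    import torch.nn as nn

    pairs = [("SELECT name FROM users WHERE age > 30",
              'db.users.find({"age": {"$gt": 30}}, {"name": 1})')]

    vocab = {tok: i for i, tok in enumerate(sorted({t for s, d in pairs
                                                    for t in (s + " " + d).split()}))}
    model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                           num_decoder_layers=2, batch_first=True)
    emb = nn.Embedding(len(vocab), 64)

    src = emb(torch.tensor([[vocab[t] for t in pairs[0][0].split()]]))
    tgt = emb(torch.tensor([[vocab[t] for t in pairs[0][1].split()]]))
    out = model(src, tgt)    # decoder states from which target tokens are predicted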

DMAH 2020: COVID-19 Data Analytics and Visualization

Frontmatter
Open-World COVID-19 Data Visualization [Extended Abstract]
Abstract
As COVID-19 becomes a dangerous pandemic worldwide, there is an urgent need to understand all aspects of it through data visualization. As part of a larger COVID-19 response by KAIST, we have worked with students on generating interesting COVID-19 visualizations including demographic trends, patient behaviors, and effects of mitigation policies. A major challenge we experienced is that, in an open world setting where it is not even clear which datasets are available and useful, generating the right visualizations becomes an extremely tedious process. Traditional data visualization recommendation systems usually assume that the datasets are given, and that the visualizations have a clear objective. We contend that such assumptions do not hold in a COVID-19 setting where one needs to iteratively adjust two moving targets: deciding which datasets to use, and generating useful visualizations with the selected datasets. We thus propose interesting research challenges that can help automate this process.
Hyunseung Hwang, Steven Euijong Whang

DMAH 2020: Deep Learning based Biomedical Data Analytics

Frontmatter
Privacy-Preserving Knowledge Transfer with Bootstrap Aggregation of Teacher Ensembles
Abstract
There is a need to transfer knowledge among institutions and organizations to save effort in annotation and labeling or in enhancing task performance. However, knowledge transfer is difficult because of restrictions that are in place to ensure data security and privacy. Institutions are not allowed to exchange data or perform any activity that may expose personal information. With the leverage of a differential privacy algorithm in a high-performance computing environment, we propose a new training protocol, Bootstrap Aggregation of Teacher Ensembles (BATE), which is applicable to various types of machine learning models. The BATE algorithm is based on and provides enhancements to the PATE algorithm, maintaining competitive task performance scores on complex datasets with underrepresented class labels.
We conducted a proof-of-concept study of information extraction from cancer pathology report data from four cancer registries and performed comparisons between four scenarios: no collaboration, non-privacy-preserving collaboration, the PATE algorithm, and the proposed BATE algorithm. The results showed that the BATE algorithm maintained competitive macro-averaged F1 scores, demonstrating that the suggested algorithm is an effective yet privacy-preserving method for machine learning and deep learning solutions.
Hong-Jun Yoon, Hilda B. Klasky, Eric B. Durbin, Xiao-Cheng Wu, Antoinette Stroup, Jennifer Doherty, Linda Coyle, Lynne Penberthy, Christopher Stanley, J. Blair Christian, Georgia D. Tourassi
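
The PATE-style aggregation that BATE builds on can be sketched as follows (simplified; the vote vector and privacy parameter are invented): each teacher, trained on its own partition of the private data, votes on a label, and Laplace noise on the vote counts protects privacy before the label is released to a student model.

    # Noisy-max aggregation of teacher votes (the core of PATE-style privacy).
    import numpy as np

    rng = np.random.default_rng(0)
    teacher_votes = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])  # 10 teachers, classes {0,1}

    counts = np.bincount(teacher_votes, minlength=2).astype(float)
    epsilon = 1.0
    counts += rng.laplace(scale=1.0 / epsilon, size=counts.shape)  # noisy counts
    student_label = int(np.argmax(counts))  # privacy-preserving label for the student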
An Intelligent and Efficient Rehabilitation Status Evaluation Method: A Case Study on Stroke Patients
Abstract
Care for chronic patients faces challenges including high cost, a lack of professionals, and insufficient rehabilitation status evaluation. Computer-supported cooperative work (CSCW) is capable of alleviating these issues, as it allows healthcare physicians (HCPs) to quantify the workload and thus to enhance rehabilitation care quality. This study designs Pose-AMGRU, a deep-learning-based pose recognition algorithm combining a pose-attention mechanism and a Gated Recurrent Unit (GRU), to efficiently monitor the pose of rehabilitating patients and give guidance to HCPs. To further substantiate the acceptance of our computer-supported method, we develop a multi-fusion theoretical model to determine factors that may influence HCPs’ acceptance and verify the usefulness of the method. Experimental results show that Pose-AMGRU achieves an accuracy of 98.61% on the KTH dataset and 100% on the rehabilitation action dataset, outperforming other algorithms. Pose-AMGRU runs at up to 14.75 FPS on a GTX 1060 graphics card, which is suitable for the home rehabilitation setting. As for acceptance, we verified the positive relationship between the computer-supported method and acceptance, and our model shows decent generalizability for stroke patient care at the Second Affiliated Hospital of Zhengzhou University.
Yao Tong, Hang Yan, Xin Li, Gang Chen, Zhenxiang Zhang
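
A minimal PyTorch sketch of an attention-over-GRU classifier in the spirit of Pose-AMGRU (layer sizes and input shapes are invented, not the authors' architecture): a GRU encodes a sequence of pose keypoints, and learned attention weights pool the hidden states.

    # Toy attention + GRU pose-sequence classifier.
    import torch
    import torch.nn as nn

    class AttnGRU(nn.Module):
        def __init__(self, n_keypoints=34, hidden=64, n_classes=6):
            super().__init__()
            self.gru = nn.GRU(n_keypoints, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                        # x: (batch, frames, keypoints)
            h, _ = self.gru(x)                       # (batch, frames, hidden)
            w = torch.softmax(self.attn(h), dim=1)   # attention weights over frames
            return self.head((w * h).sum(dim=1))     # weighted pooling -> class logits

    logits = AttnGRU()(torch.randn(2, 30, 34))       # 2 clips of 30 frames each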
Multiple Interpretations Improve Deep Learning Transparency for Prostate Lesion Detection
Abstract
Detecting suspicious lesions in MRI imaging is a critical task in preventing deaths from cancer. Deep learning systems have produced remarkable accuracy for the task of detecting lesions in MRI images. Although these systems show remarkable performance, they often ignore an indispensable component which is interpretability. Interpretability is essential for many deep learning applications in medicine because of ethical, monetary, and legal factors. Interpretation also builds a necessary degree of trust and transparency between the doctor, patient, and system. This work proposes a framework for the interpretation of medical deep learning systems. The proposed approach is based on the idea that it is advantageous to use different interpretation techniques to show multiple views of reasoning behind the classification. This work demonstrates deep learning interpretations for various patient data modalities using the proposed Multiple Views of Interpretation for Deep Learning framework.
Mehmet A. Gulum, Christopher M. Trombley, Mehmed Kantardzic
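
The "multiple views" idea can be illustrated with two independent interpretations of one toy prediction (the model and image below are random stand-ins, not the authors' framework): gradient saliency and occlusion sensitivity, shown side by side rather than relying on either alone.

    # Two complementary interpretation views for the same prediction.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))
    img = torch.randn(1, 1, 16, 16, requires_grad=True)

    score = model(img)[0, 1]
    score.backward()
    saliency = img.grad.abs().squeeze()        # view 1: gradient saliency map

    occlusion = torch.zeros(16, 16)
    with torch.no_grad():
        base = model(img)[0, 1]
        for i in range(0, 16, 4):
            for j in range(0, 16, 4):          # view 2: occlusion sensitivity
                patch = img.clone()
                patch[..., i:i+4, j:j+4] = 0
                occlusion[i:i+4, j:j+4] = base - model(patch)[0, 1]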

DMAH 2020: NLP Based Learning from Unstructured Data

Frontmatter
Tracing State-Level Obesity Prevalence from Sentence Embeddings of Tweets: A Feasibility Study
Abstract
Twitter data has been shown to be broadly applicable for public health surveillance. Previous public health studies based on Twitter data have largely relied on keyword matching or topic models for clustering relevant tweets. However, both methods suffer from the short length of the texts and the unpredictable noise that naturally occurs in user-generated content. In response, we introduce a deep learning approach that uses hashtags as a form of supervision and learns tweet embeddings for extracting informative textual features. In this case study, we address the specific task of estimating state-level obesity from dietary-related textual features. Our approach yields estimates that strongly correlate with government data and outperforms the keyword-matching baseline. The results also demonstrate the potential of discovering risk factors using the textual features. This method is general-purpose and can be applied to a wide range of Twitter-based public health studies.
Xiaoyi Zhang, Rodoniki Athanasiadou, Narges Razavian
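
A simplified sketch of hashtag supervision (the paper learns dense tweet embeddings; this toy version uses TF-IDF features and invented tweets): hashtags act as weak labels that shape the learned textual representation.

    # Hashtags as weak supervision for a text model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = ["grilled chicken salad tonight", "double bacon cheeseburger run",
              "green smoothie every morning", "fried chicken and waffles"]
    hashtags = ["#healthy", "#fastfood", "#healthy", "#fastfood"]  # weak labels

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(tweets, hashtags)             # hashtags supervise the representation
    print(clf.predict(["kale smoothie bowl"]))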
Enhancing Medical Word Sense Inventories Using Word Sense Induction: A Preliminary Study
Abstract
Correctly interpreting an ambiguous word in a given context is a critical step for medical natural language processing tasks. Medical word sense disambiguation assumes that all meanings (senses) of an ambiguous word are predetermined in a sense inventory. However, the sense inventory sometimes does not cover all senses or becomes outdated as new concepts arise in the practice of medicine. Obtaining all word senses is therefore a prerequisite for word sense disambiguation. A classical method for word sense induction is string expansion, a rule-based method that searches the corpus for full forms of an abbreviation or acronym. Yet, it cannot be applied to ambiguous words that are not abbreviations. In this paper, we study methods that can semi-automatically discover word senses from a large-scale medical corpus, regardless of whether the word is an abbreviation. We conducted a comparative evaluation of four unsupervised data-driven methods: context clustering, two types of word clustering, and sparse coding in word vector space. Overall, sparse coding outperforms the other methods, demonstrating the feasibility of using sparse coding to discover more complete word senses. By comparing the senses discovered by sparse coding with those in the sense inventory, we observed new word senses. For more than half of the ambiguous words in the MSH WSD data set (a sense inventory maintained by the National Library of Medicine), sparse coding detected more than one new word sense. This result shows an opportunity to enhance medical word sense inventories with unsupervised data-driven methods.
Qifei Dong, Yue Wang
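
A small sketch of sparse coding for sense induction, assuming random stand-in vectors in place of real context embeddings: each occurrence of an ambiguous word is decomposed into a sparse combination of dictionary atoms, and recurrently active atoms suggest distinct candidate senses.

    # Sparse coding over (stand-in) context vectors of one ambiguous word.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    context_vectors = rng.normal(size=(200, 50))   # stand-in for real embeddings

    dl = DictionaryLearning(n_components=8, transform_algorithm="lasso_lars",
                            transform_alpha=0.5, random_state=0)
    codes = dl.fit_transform(context_vectors)      # sparse code per occurrence
    atom_usage = (np.abs(codes) > 1e-9).sum(axis=0)
    print(atom_usage)                              # frequently used atoms ~ candidate senses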

DMAH 2020: Biomedical Data Modelling and Prediction

Frontmatter
Teaching Analytics Medical-Data Common Sense
Abstract
The availability of Electronic Medical Records (EMR) has spawned the development of analytics designed to assist caregivers in monitoring, diagnosing, and treating patients. The long-term adoption of these tools hinges upon caregivers’ confidence in them and, subsequently, their robustness to data anomalies. Unfortunately, both complex machine-learning-based tools, which require copious amounts of data to train, and a simple trend graph presented in a patient-centered dashboard may be sensitive to noisy data. While a caregiver would dismiss a heart rate of 2000, a medical analytic relying on it may fail or mislead its users. Developers should endow their systems with medical-data common sense to shield them from improbable values; to do so effectively, they need the ability to identify such values. We motivate the need to teach analytics common sense by evaluating how anomalies impact visual analytics, the score-based sepsis analytics SOFA and qSOFA, and a machine-learning-based sepsis predictor. We then describe the anomalous patterns designers should look for in medical data using a popular public medical research database, MIMIC-III. For each data type, we highlight methods to find these patterns. For numerical data, statistical methods are limited to high-throughput scenarios and large aggregations. Since deployed analytics monitor a single patient and must rely on a limited amount of data, rule-based methods are needed. In light of the dearth of medical guidelines to support such systems, we outline the dimensions along which they should be defined.
Tomer Sagi, Nitzan Shmueli, Bruce Friedman, Ruth Bergman
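
The kind of rule-based screen the paper argues for can be sketched in a few lines (the ranges below are illustrative only, not clinical guidance): single-patient streams are too small for statistics, so explicit rules shield the analytic from impossible values.

    # Rule-based plausibility screen for incoming vital signs.
    PLAUSIBLE = {"heart_rate": (20, 300), "spo2": (50, 100), "temp_c": (25, 45)}

    def screen(signal: str, value: float):
        lo, hi = PLAUSIBLE[signal]
        return value if lo <= value <= hi else None   # drop/flag improbable values

    assert screen("heart_rate", 72) == 72
    assert screen("heart_rate", 2000) is None          # the paper's own example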
CDRGen: A Clinical Data Registry Generator (Formal and/or Technical Paper)
Abstract
In the health sector, data analysis is typically performed by specialty, using clinical data stored in a Clinical Data Registry (CDR) specific to that medical specialty. Therefore, if we want to analyze data from a new specialty, it is necessary to create a new CDR, which is usually done from scratch. Although the data stored in CDRs depends on the medical specialty, the data typically has a common structure and the operations over it are similar (e.g., entering and viewing patient data). These characteristics make it possible to automate the creation of new CDRs. In this paper, we present a software system for automatic CDR generation, called CDRGen, that relies on a metadata specification language to describe the data to be collected and stored, as well as the types of supported users and their permissions for accessing data. CDRGen parses the input specification language and generates the code needed for a functional CDR. The specification language is defined on top of a metamodel that describes the metadata of a generic CDR. The metamodel was designed taking into account the analysis of eleven existing CDRs. The experimental assessment of CDRGen indicates that: (i) developers can create new CDRs more efficiently (in less than 2% of the typical time); (ii) CDRGen creates the user interface functionalities to enter and access data, and the database to store that data; and (iii) its specification language is highly expressive, enabling the inclusion of a large variety of data types. Our solution will help developers create new CDRs for different specialties in a fast and easy way, without the need to create everything from scratch.
Pedro Alves, Manuel J. Fonseca, João D. Pereira, Helena Galhardas
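
A toy sketch of the generation idea (CDRGen's actual specification language and outputs are far richer): a metadata spec describing the registry's fields is parsed, and the storage schema is generated rather than hand-written.

    # Generate a registry's storage schema from a metadata specification.
    # The spec format and field names here are invented for illustration.
    spec = {
        "registry": "cardiology_cdr",
        "fields": [("patient_id", "INTEGER"), ("ejection_fraction", "REAL"),
                   ("visit_date", "TEXT")],
    }

    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in spec["fields"])
    ddl = f"CREATE TABLE {spec['registry']} ({cols});"
    print(ddl)   # CDRGen additionally generates the UI and access control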
Prediction of lncRNA-Disease Associations from Tripartite Graphs
Abstract
The discovery of novel lncRNA-disease associations may provide valuable input to the understanding of disease mechanisms at the lncRNA level, as well as to the detection of biomarkers for disease diagnosis, treatment, prognosis, and prevention. Unfortunately, due to cost and time complexity, the number of possible disease-related lncRNAs verified by traditional biological experiments is very limited. Computational approaches for the prediction of potential disease-lncRNA associations can effectively decrease the time and cost of biological experiments. We propose an approach for the prediction of lncRNA-disease associations based on neighborhood analysis performed on a tripartite graph built upon lncRNAs, miRNAs, and diseases. The main idea is to discover hidden relationships between lncRNAs and diseases through the exploration of their interactions with intermediate molecules (e.g., miRNAs) in the tripartite graph, based on the consideration that, while only a few lncRNA-disease associations are known so far, plenty of interactions between lncRNAs and other molecules, as well as associations of the latter with diseases, are available. The effectiveness of our approach is demonstrated by its ability to identify, on real datasets, associations missed by competitors.
Mariella Bonomo, Armando La Placa, Simona E. Rombo
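
The neighborhood analysis can be sketched with a toy tripartite graph (molecule names and the scoring rule are invented simplifications): an lncRNA and a disease that share many intermediate miRNA neighbors receive a high candidate-association score.

    # Score lncRNA-disease pairs by shared miRNA neighbors in a tripartite graph.
    lnc_to_mirna = {"lncA": {"miR1", "miR2", "miR3"}, "lncB": {"miR3"}}
    mirna_to_disease = {"miR1": {"glioma"}, "miR2": {"glioma", "asthma"},
                        "miR3": {"asthma"}}

    def score(lnc: str, disease: str) -> int:
        # count miRNAs interacting with the lncRNA that are linked to the disease
        return sum(disease in mirna_to_disease.get(m, set())
                   for m in lnc_to_mirna[lnc])

    print(score("lncA", "glioma"), score("lncB", "glioma"))   # 2 vs 0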

DMAH 2020: Invited Paper

Frontmatter
Parameter Sensitivity Analysis for the Progressive Sampling-Based Bayesian Optimization Method for Automated Machine Learning Model Selection
Abstract
As a key component of automating the entire process of applying machine learning to solve real-world problems, automated machine learning model selection is greatly needed. Many automated methods have been proposed for machine learning model selection, but their inefficiency poses a major problem for handling large data sets. To expedite automated machine learning model selection and lower its resource requirements, we developed a progressive sampling-based Bayesian optimization (PSBO) method to efficiently automate the selection of machine learning algorithms and hyper-parameter values. Our PSBO method showed good performance in our previous tests and has 20 parameters. Each parameter has its own default value and impacts our PSBO method’s performance. It is unclear, for each of these parameters, how much room for improvement there is over its default value, how sensitive our PSBO method’s performance is to it, and what its safe range is. In this paper, we perform a sensitivity analysis of these 20 parameters to answer these questions. Our results show that these parameters’ default values work well, with little room for improvement over them. Also, each of these parameters has a reasonably large safe range, within which our PSBO method’s performance is insensitive to parameter value changes.
Weipeng Zhou, Gang Luo
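
A generic sketch of one-at-a-time sensitivity analysis of the kind the paper performs (the parameter names and objective below are placeholders, not PSBO's actual 20 parameters): sweep one parameter while holding the others at their defaults and record how much performance moves.

    # One-at-a-time parameter sweep with a stand-in objective function.
    defaults = {"sample_fraction": 0.1, "n_rounds": 20}

    def performance(params):      # placeholder for running the method end to end
        return 0.9 - abs(params["sample_fraction"] - 0.1)

    results = {v: performance({**defaults, "sample_fraction": v})
               for v in (0.02, 0.05, 0.1, 0.2, 0.4)}
    spread = max(results.values()) - min(results.values())
    print(results, f"sensitivity spread = {spread:.3f}")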
Backmatter
Metadata
Title
Heterogeneous Data Management, Polystores, and Analytics for Healthcare
Edited by
Prof. Dr. Vijay Gadepally
Timothy Mattson
Michael Stonebraker
Tim Kraska
Fusheng Wang
Gang Luo
Jun Kong
Alevtina Dubovitskaya
Copyright year
2021
Electronic ISBN
978-3-030-71055-2
Print ISBN
978-3-030-71054-5
DOI
https://doi.org/10.1007/978-3-030-71055-2