
About this Book

This book constitutes the refereed proceedings of the 23rd International Conference on Scientific and Statistical Database Management, SSDBM 2011, held in Portland, OR, USA, in July 2011. The 26 long and 12 short papers presented together with 15 posters were carefully reviewed and selected from 80 submissions. The topics covered are ranked search; temporal data and queries; workflow and provenance; querying graphs; clustering and data mining; architectures and privacy; and applications and models.

Table of Contents

Frontmatter

Keynote Address

The Architecture of SciDB

SciDB is an open-source analytical database oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multidimensional array data model. We have been working on the code for two years, most recently with the help of venture capital backing. Release 11.06 (June 2011) is downloadable from our website (SciDB.org).

This paper presents the main design decisions of SciDB. It focuses on our decisions concerning a high-level, SQL-like query language, the issues facing our query optimizer and executor, and efficient storage management for arrays. The paper also discusses implementation of features not usually present in DBMSs, including version control, uncertainty and provenance.

Michael Stonebraker, Paul Brown, Alex Poliakov, Suchi Raman

Ranked Search

Location-Based Instant Search

Location-based keyword search has become an important part of our daily life. Such a query asks for records satisfying both a spatial condition and a keyword condition. State-of-the-art techniques extend a spatial tree structure by adding keyword information. In this paper we study location-based instant search, where a system searches based on a partial query a user has typed in. We first develop a new indexing technique, called filtering-effective hybrid index (FEH), that judiciously uses two types of keyword filters based on their selectiveness to do powerful pruning. Then, we develop indexing and search techniques that store prefix information on the FEH index and efficiently answer partial queries. Our experiments show a high efficiency and scalability of these techniques.

Shengyue Ji, Chen Li

Continuous Inverse Ranking Queries in Uncertain Streams

This paper introduces a scalable approach for continuous inverse ranking on uncertain streams. An uncertain stream is a stream of object instances with confidences, e.g. observed positions of moving objects derived from a sensor. The confidence value assigned to each instance reflects the likelihood that the instance conforms with the current true object state. The inverse ranking query retrieves the rank of a given query object according to a given score function. In this paper we present a framework that is able to update the query result very efficiently, as the stream provides new observations of the objects. We will theoretically and experimentally show that the query update can be performed in linear time complexity. We conduct an experimental evaluation on synthetic data, which demonstrates the efficiency of our approach.

Thomas Bernecker, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, Andreas Zuefle

Finding Haystacks with Needles: Ranked Search for Data Using Geospatial and Temporal Characteristics

The past decade has seen an explosion in the number and types of environmental sensors deployed, many of which provide a continuous stream of observations. Each individual observation consists of one or more sensor measurements, a geographic location, and a time. With billions of historical observations stored in diverse databases and in thousands of datasets, scientists have difficulty finding relevant observations. We present an approach that creates consistent geospatial-temporal metadata from large repositories of diverse data by blending curated and automated extracts. We describe a novel query method over this metadata that returns ranked search results to a query with geospatial and temporal search criteria. Lastly, we present a prototype that demonstrates the utility of these ideas in the context of an ocean and coastal-margin observatory.

V. M. Megler, David Maier

Using Medians to Generate Consensus Rankings for Biological Data

Faced with the deluge of data available in biological databases, it becomes increasingly difficult for scientists to obtain reasonable sets of answers to their biological queries. A critical example appears in medicine, where physicians frequently need to get information about genes associated with a given disease. When they pose such queries to Web portals (e.g., Entrez NCBI) they usually get huge numbers of answers which are not ranked, making them very difficult to exploit. In recent years several ranking approaches have been proposed, but none of them is considered the most promising.

Instead of considering ranking methods as alternative approaches, we propose to generate a consensus ranking that highlights the common points of a set of rankings while minimizing their disagreements. Our work is based on the concept of median, originally defined on permutations: given m permutations and a distance function, the median problem is to find a permutation that is the closest to the m given permutations. We have investigated the problem of computing a median of a set of m rankings with different elements and ties, under a generalized Kendall-τ distance. This problem is known to be NP-hard. In this paper, we present a new heuristic for the problem and we demonstrate the benefit of our approach on real queries using four different ranking methods.

Availability: http://bioguide-project.net/bioconsert

Sarah Cohen-Boulakia, Alain Denise, Sylvie Hamel
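
To make the underlying distance concrete, the sketch below (illustrative Python, not the authors' code) computes the classical Kendall-τ distance between two permutations and picks, as a crude consensus baseline, the input ranking with the smallest total distance to the others; the paper's generalized distance additionally handles ties and rankings over different element sets, and its heuristic searches beyond the input rankings.

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Count pairs of elements ordered differently by the two rankings.

    r1, r2: lists giving the same set of elements in ranked order.
    (Illustrative only: the paper's generalized distance also handles
    ties and rankings over different element sets.)"""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    disagreements = 0
    for a, b in combinations(r1, 2):
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            disagreements += 1
    return disagreements

def best_input_ranking(rankings):
    """Crude consensus baseline: the input ranking minimizing the total
    distance to all others (the true median need not be an input)."""
    return min(rankings,
               key=lambda r: sum(kendall_tau_distance(r, s) for s in rankings))

if __name__ == "__main__":
    rankings = [["g1", "g2", "g3"], ["g2", "g1", "g3"], ["g1", "g3", "g2"]]
    print(best_input_ranking(rankings))  # ['g1', 'g2', 'g3']
```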

A Truly Dynamic Data Structure for Top-k Queries on Uncertain Data

Top-k queries allow end-users to focus on the most important (top-k) answers amongst those which satisfy the query. In traditional databases, a user-defined score function assigns a score value to each tuple and a top-k query returns the k tuples with the highest scores. In uncertain databases, the top-k answer depends not only on the scores but also on the membership probabilities of tuples. Several top-k definitions covering different aspects of the score-probability interplay have been proposed in the recent past [20, 13, 6, 18]. Most of the existing work in this research field focuses on developing efficient algorithms for answering top-k queries on static uncertain data. Any change (insertion or deletion of a tuple, or a change in the membership probability or score of a tuple) in the underlying data forces re-computation of query answers. Such re-computations are not practical considering the dynamic nature of data in many applications. In this paper, we propose a truly dynamic data structure that uses the ranking function PRF^e(α) proposed by Li et al. [18] under the generally adopted model of x-relations [21]. PRF^e can effectively approximate various other top-k definitions on uncertain data based on the value of the parameter α. An x-relation consists of a number of x-tuples, where an x-tuple is a set of mutually exclusive tuples (up to a constant number) called alternatives. Each x-tuple in a relation randomly instantiates into one tuple from its alternatives. For an uncertain relation with N tuples, our structure can answer top-k queries in O(k log N) time, handle an update in O(log N) time, and takes O(N) space. Finally, we evaluate the practical efficiency of our structure on both synthetic and real data.

Manish Patil, Rahul Shah, Sharma V. Thankachan
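
For intuition about the x-relation model, the following sketch (hypothetical Python, exponential-time by design) enumerates the possible worlds of a tiny x-relation and computes the probability that a given tuple appears among the top-k by score; the paper's contribution is a dynamic structure that avoids such enumeration and supports queries in O(k log N) time under PRF^e(α).

```python
from itertools import product

def possible_worlds(x_relation):
    """Enumerate (world, probability) pairs for a tiny x-relation.

    x_relation: list of x-tuples; each x-tuple is a list of
    (tuple_id, score, probability) alternatives that are mutually
    exclusive.  With probability 1 - sum(p) the x-tuple yields nothing."""
    choices = []
    for xt in x_relation:
        opts = [(alt, alt[2]) for alt in xt]
        none_p = 1.0 - sum(p for _, p in opts)
        if none_p > 1e-12:
            opts.append((None, none_p))
        choices.append(opts)
    for combo in product(*choices):
        world = [alt for alt, _ in combo if alt is not None]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def prob_in_top_k(x_relation, tuple_id, k):
    """Probability that tuple_id ranks among the k highest-scoring tuples,
    by brute-force world enumeration (for intuition only)."""
    total = 0.0
    for world, prob in possible_worlds(x_relation):
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        if any(t[0] == tuple_id for t in top):
            total += prob
    return total

if __name__ == "__main__":
    rel = [
        [("t1", 90, 0.6), ("t2", 80, 0.4)],  # one x-tuple, two alternatives
        [("t3", 85, 0.7)],                    # may also instantiate to nothing
    ]
    print(round(prob_in_top_k(rel, "t3", 1), 3))  # 0.28
```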

Temporal Data and Queries

Efficient Storage and Temporal Query Evaluation in Hierarchical Data Archiving Systems

Data archiving has been commonly used in many fields for data backup and analysis purposes. Although comprehensive application software, new computing and storage technologies, and the Internet have made it easier to create, collect and store all types of data, the meaningful storing, accessing, and managing of database archives in a cost-effective way remains extremely challenging. In this paper, we focus on hierarchical data archiving, which is widely used in the scientific field and in web data management. First, we propose a novel compaction scheme for archiving hierarchical data. By compacting both data and timestamps, our scheme substantially reduces not only the amount of needed storage, but also the incremental archiving time. Second, we design a temporal query language to support data retrieval from the compact data archives. Third, as compaction on data and timestamps may bring significant overhead to query evaluation, we investigate how to optimize such overhead by exploiting the characteristics of the queries and of the archived hierarchical data. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of both our efficient storage and query optimization techniques.

Hui (Wendy) Wang, Ruilin Liu, Dimitri Theodoratos, Xiaoying Wu

Update Propagation in a Streaming Warehouse

Streaming warehouses are used to monitor complex systems such as data centers, web site complexes, and world-wide networks, gathering and correlating rich collections of events and measurements. Ideally, a streaming warehouse provides both historical data, for deep analysis, and real-time data for rapid response to emerging opportunities or problems. The highly temporal nature of the data and the need to support parallel processing naturally leads to extensive use of horizontal partitioning to manage base tables and layers of materialized views. In this paper, we consider the problem of determining when to propagate updates from base tables to dependent views on a partition-wise basis using autonomous updates. We provide a correctness theory for propagating updates to materialized views, simple algorithms which correctly propagate updates, and examples of algorithms which do not. We extend these results to accommodate needs of production warehouses: repartitioning of tables, mutual consistency, and merge tables. We measure the update propagation delays incurred by two different update propagation algorithms in test and production DataDepot warehouses, and find that only those update propagation algorithms which impose no scheduling restrictions are acceptable for use in a real-time streaming warehouse.

Theodore Johnson, Vladislav Shkapenyuk

Efficient Processing of Multiple DTW Queries in Time Series Databases

Dynamic Time Warping (DTW) is a widely used distance measure for time series that has been successfully used in science and many other application domains. As DTW is computationally expensive, there is a strong need for efficient query processing algorithms. Such algorithms exist for single queries. In many of today’s applications, however, large numbers of queries arise at any given time. Existing DTW techniques do not process multiple DTW queries simultaneously, a serious limitation which slows down overall processing.

In this paper, we propose an efficient processing approach for multiple DTW queries. We base our approach on the observation that algorithms in areas such as data mining and interactive visualization incur many queries that share certain characteristics. Our solution exploits these shared characteristics by pruning database time series with respect to sets of queries, and we prove a lower-bounding property that guarantees no false dismissals. Our technique can be flexibly combined with existing DTW lower bounds or other single DTW query speed-up techniques for further runtime reduction. Our thorough experimental evaluation demonstrates substantial performance gains for multiple DTW queries.

Hardy Kremer, Stephan Günnemann, Anca-Maria Ivanescu, Ira Assent, Thomas Seidl
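
For reference, the standard single-pair DTW distance is the dynamic-programming recurrence sketched below (illustrative Python with a squared point-wise cost); the paper's techniques prune entire sets of such computations across multiple queries using lower bounds, which the sketch does not show.

```python
import math

def dtw(a, b):
    """Classical O(len(a)*len(b)) dynamic-programming DTW distance
    between two numeric sequences (squared point-wise cost)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # expansion in a
                                 D[i][j - 1],      # expansion in b
                                 D[i - 1][j - 1])  # match
    return math.sqrt(D[n][m])

if __name__ == "__main__":
    print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 3]))  # small warp -> 0.0
```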

Probabilistic Time Consistent Queries over Moving Objects

Recently, the wide usage of inexpensive mobile devices, along with broad deployment of wireless and positioning technology, has enabled many important applications such as Delay Tolerant Networks (DTN). In these applications, the positions of mobile nodes are dynamically changing, and are often imprecise due to the inaccuracy of positioning devices. Therefore, it is crucial to efficiently and effectively monitor mobile nodes (modeled as uncertain moving objects). In this paper, we propose a novel query, called the probabilistic time consistent query (PTCQ). In particular, a PTCQ retrieves uncertain moving objects that consistently satisfy query predicates within a future period with high confidence. We present effective pruning methods to reduce the search space of PTCQs, and seamlessly integrate them into an efficient query procedure. Moreover, to facilitate query processing, we specifically design a data structure, namely the UC-Grid, to index uncertain moving objects. The structure construction is based on a formal cost model to minimize the query cost. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approaches to answer PTCQs.

Xiang Lian, Lei Chen

Workflows and Provenance

Knowledge Annotations in Scientific Workflows: An Implementation in Kepler

Scientific research products are the result of long-term collaborations between teams. Scientific workflows are capable of helping scientists in many ways, including collecting information about how research was conducted (e.g., scientific workflow tools often collect and manage information about datasets used and data transformations). However, knowledge about why data was collected is rarely documented in scientific workflows. In this paper we describe a prototype system built to support the collection of scientific expertise that influences scientific analysis. Through evaluating a scientific research effort underway at the Pacific Northwest National Laboratory, we identified features that would most benefit PNNL scientists in documenting how and why they conduct their research, making this information available to the entire team. The prototype system was built by enhancing the Kepler Scientific Workflow System to create knowledge-annotated scientific workflows and to publish them as semantic annotations.

Aída Gándara, George Chin, Paulo Pinheiro da Silva, Signe White, Chandrika Sivaramakrishnan, Terence Critchlow

Improving Workflow Fault Tolerance through Provenance-Based Recovery

Scientific workflow systems are frequently used to execute a variety of long-running computational pipelines prone to premature termination due to network failures, server outages, and other faults. Researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. Here we present a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. Our approach facilitates fast workflow replay using only such commonly recorded provenance data. We also propose a checkpoint extension to standard provenance models to significantly reduce the computation needed to reset the workflow to a consistent state, thus resulting in much shorter re-execution times. Our work generalizes the rescue-DAG approach used by DAGMan to richer workflow models that may contain stateless and stateful multi-invocation actors as well as workflow loops.

Sven Köhler, Sean Riddle, Daniel Zinn, Timothy McPhillips, Bertram Ludäscher

ProPub: Towards a Declarative Approach for Publishing Customized, Policy-Aware Provenance

Data provenance, i.e., the lineage and processing history of data, is becoming increasingly important in scientific applications. Provenance information can be used, e.g., to explain, debug, and reproduce the results of computational experiments, or to determine the validity and quality of data products. In collaborative science settings, it may be infeasible or undesirable to publish the complete provenance of a data product. We develop a framework that allows data publishers to “customize” provenance data prior to exporting it. For example, users can specify which parts of the provenance graph are to be included in the result and which parts should be hidden, anonymized, or abstracted. However, such user-defined provenance customization needs to be carefully counterbalanced with the need to faithfully report all relevant data and process dependencies. To this end, we propose ProPub (Provenance Publisher), a framework and system which allows the user (i) to state provenance publication and customization requests, (ii) to specify provenance policies that should be obeyed, (iii) to check whether the policies are satisfied, and (iv) to repair policy violations and reconcile conflicts between user requests and provenance policies should they occur. In the ProPub approach, policies as well as customization requests are expressed as logic rules. By using a declarative, logic-based framework, ProPub can first check and then enforce integrity constraints (ICs), e.g., by rejecting inconsistent user requests, or by repairing violated ICs according to a given conflict resolution strategy.

Saumen C. Dey, Daniel Zinn, Bertram Ludäscher

Provenance-Enabled Automatic Data Publishing

Scientists are increasingly being called upon to publish their data as well as their conclusions. Yet computational science often necessarily occurs in exploratory, unstructured environments. Scientists are as likely to use one-off scripts, legacy programs, and volatile collections of data and parametric assumptions as they are to frame their investigations using easily reproducible workflows. The ES3 system can capture the provenance of such unstructured computations and make it available so that the results of such computations can be evaluated in the overall context of their inputs, implementation, and assumptions. Additionally, we find that such provenance can serve as an automatic “checklist” whereby the suitability of data (or other computational artifacts) for publication can be evaluated. We describe a system that, given the request to publish a particular computational artifact, traverses that artifact’s provenance and applies rule-based tests to each of the artifact’s computational antecedents to determine whether the artifact’s provenance is robust enough to justify its publication. Generically, such tests check for proper curation of the artifacts, which specifically can mean such things as: source code checked into a source control system; data accessible from a well-known repository; etc. Minimally, publish requests yield a report on an object’s fitness for publication, although such reports can easily drive an automated cleanup process that remedies many of the identified shortcomings.

James Frew, Greg Janée, Peter Slaughter

Panel I

A Panel Discussion on Data Intensive Science: Moving towards Solutions

Over the past several years, a number of groups, including the National Academy of Engineering, have identified grand challenge problems facing scientists from around the world [1]. While addressing these problems will have global impact, solutions are years away at best – and the next set of challenges are likely to be even harder to solve. Because of the complexity of questions being asked, meeting these challenges requires large, multi-disciplinary teams working closely together for extended periods of time. Enabling this new type of science, involving distributed teams that need to collaborate despite vastly different backgrounds and interests, is the cornerstone of Data Intensive Science.

Terence Critchlow

Querying Graphs

Querying Shortest Path Distance with Bounded Errors in Large Graphs

Shortest paths and shortest path distances are important primitive queries for users exploring a large graph. In this paper, we propose a new approach to answer shortest path and shortest path distance queries efficiently with an error bound. The error bound is controlled by a user-specified parameter, and the online query efficiency is achieved with preprocessing performed offline. In the offline preprocessing, we take a reference node embedding approach which computes the single-source shortest paths from each reference node to all the other nodes. To guarantee the user-specified error bound, we design a novel coverage-based reference node selection strategy, and show that selecting the optimal set of reference nodes is NP-hard. We propose a greedy selection algorithm which exploits the submodular property of the formulated objective function, and use a graph partitioning-based heuristic to further reduce the offline computational complexity of reference node embedding.

In the online query answering, we use the precomputed distances to provide a lower bound and an upper bound of the true shortest path distance based on the triangle inequality. In addition, we propose a linear algorithm which computes the approximate shortest path between two nodes within the error bound. We perform extensive experimental evaluation on a large-scale road network and a social network and demonstrate the effectiveness and efficiency of our proposed methods.

Miao Qiao, Hong Cheng, Jeffrey Xu Yu
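
The online bounds described above follow directly from the triangle inequality over the precomputed reference-node distances; a minimal sketch (illustrative Python, assuming the per-node distance vectors have already been computed offline) is:

```python
def distance_bounds(dist_to_refs_u, dist_to_refs_v):
    """Lower/upper bounds on d(u, v) from precomputed shortest-path
    distances to a common set of reference nodes (triangle inequality):
        max_r |d(u,r) - d(v,r)|  <=  d(u,v)  <=  min_r (d(u,r) + d(v,r))
    """
    lower = max(abs(du - dv) for du, dv in zip(dist_to_refs_u, dist_to_refs_v))
    upper = min(du + dv for du, dv in zip(dist_to_refs_u, dist_to_refs_v))
    return lower, upper

if __name__ == "__main__":
    # hypothetical distances from u and v to three reference nodes
    print(distance_bounds([2.0, 5.0, 7.0], [4.0, 3.0, 6.0]))  # (2.0, 6.0)
```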

PG-Join: Proximity Graph Based String Similarity Joins

In many applications, for example, in data integration scenarios, strings must be matched if they are similar. String similarity joins, which match all pairs of similar strings from two datasets, are of particular interest and have recently received much attention in the database research community. Most approaches, however, assume a global similarity threshold; all string pairs that exceed the threshold form a match in the join result. The global threshold approach has two major problems: (a) the threshold depends on the (mostly unknown) data distribution, (b) often there is no single threshold that is good for all string pairs.

In this paper we propose the PG-Join algorithm, a novel string similarity join that requires no configuration and uses an adaptive threshold. PG-Join computes a so-called proximity graph to derive an individual threshold for each string. Computing the proximity graph efficiently is essential for the scalability of PG-Join. To this end we develop a new and fast algorithm, PG-I, that computes the proximity graph in two steps: first an efficient approximation is computed, then the approximation error is fixed incrementally until the adaptive threshold is stable. Our extensive experiments on real-world and synthetic data show that PG-I is up to five times faster than the state-of-the-art algorithm and suggest that PG-Join is a useful and effective join paradigm.

Michail Kazimianec, Nikolaus Augsten

A Flexible Graph Pattern Matching Framework via Indexing

In recent years, pattern matching has become an important graph analysis tool in various applications. In existing models, each edge in the query pattern represents the same relationship, e.g., the two endpoint vertices have to be connected or the distance between them should be within a certain uniform threshold. However, various real-world applications may require edges representing different relationships or distances, some longer and others shorter. Therefore, we introduce the flexible pattern matching model, where a range [min_e, max_e] is associated with each edge e in the query pattern, meaning that the minimum distance between the matched endpoints of e must lie in the range [min_e, max_e]. A novel pattern matching algorithm utilizing two types of indices is devised. In addition to the traditional pattern matching scheme, a top-k match generation model is also proposed. Extensive empirical studies have been conducted to show the effectiveness and efficiency of our indices and methods.

Wei Jin, Jiong Yang

Subgraph Search over Massive Disk Resident Graphs

Due to their wide applications, subgraph queries have attracted a lot of attention in the database community. In this paper, we focus on subgraph queries over a large graph G. Different from existing feature-based approaches, we propose a bitmap structure based on edges to index the graph. At run time, we decompose the query Q into a set of adjacent edge pairs (AEPs) and develop edge join (EJ) algorithms to address the AEP subqueries. The bitmap index reduces both I/O and CPU cost. More importantly, the bitmap index has linear space complexity instead of the exponential complexity of feature-based approaches, which confirms its good scalability. Extensive experiments show that our method significantly outperforms existing ones in both online and offline performance.

Peng Peng, Lei Zou, Lei Chen, Xuemin Lin, Dongyan Zhao

BR-Index: An Indexing Structure for Subgraph Matching in Very Large Dynamic Graphs

Subgraph indexing, i.e., finding all occurrences of a query graph Q in a very large connected database graph G, has become an important research problem with great practical implications. To the best of our knowledge, most subgraph indexing methods focus on static database graphs. However, in many real applications, database graphs change over time. In this paper, we propose an indexing structure, the BR-index, for large dynamic graphs. The large database graph is partitioned into a set of overlapping index regions. Features (small subgraphs) are extracted from these regions and used to index them. The updates to G can be localized to a small number of these regions. To further improve the efficiency of updates and query processing, several novel techniques and data structures are introduced, including the feature lattice, maximal features, and overlapping regions. Experiments show that the BR-index outperforms alternatives in both queries and updates.

Jiong Yang, Wei Jin

Clustering and Data Mining

CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud

The problem of efficient and high-quality clustering of extreme-scale datasets with complex clustering structures continues to be one of the most challenging data analysis problems. An innovative use of the data cloud provides a unique opportunity to address this challenge. In this paper, we propose the CloudVista framework to address (1) the problems caused by using sampling in existing approaches and (2) the latency problems caused by cloud-side processing in interactive cluster visualization. The CloudVista framework aims to explore the entire large dataset stored in the cloud with the help of the visual frame data structure and the previously developed VISTA visualization model. The latency of processing large data is addressed by the RandGen algorithm, which generates a series of related visual frames in the cloud without the user's intervention, and by a hierarchical exploration model supported by cloud-side subset processing. An experimental study shows this framework is effective and efficient for visually exploring clustering structures in extreme-scale datasets stored in the cloud.

Keke Chen, Huiqi Xu, Fengguang Tian, Shumin Guo

Efficient Selectivity Estimation by Histogram Construction Based on Subspace Clustering

Modern databases have to cope with multi-dimensional queries. For efficient processing of these queries, query optimization relies on multi-dimensional selectivity estimation techniques. These techniques in turn typically rely on histograms. A core challenge of histogram construction is the detection of regions with a density higher than that of their surroundings. In this paper, we show that subspace clustering algorithms, which detect such regions, can be used to build high quality histograms in multi-dimensional spaces. The clusters are transformed into a memory-efficient histogram representation, while preserving most of the information for the selectivity estimation. We derive a formal criterion for our transformation of clusters into buckets that minimizes the introduced estimation error. In practice, finding optimal buckets is hard, so we propose a heuristic. Our experiments show that our approach is efficient in terms of both runtime and memory usage. Overall, we demonstrate that subspace clustering enables multi-dimensional selectivity estimation with low estimation errors.

Andranik Khachatryan, Emmanuel Müller, Klemens Böhm, Jonida Kopper

Finding Closed MEMOs

Current literature lacks a thorough study on the discovery of meeting patterns in moving object datasets. We (a) introduced MEMO, a more precise definition of meeting patterns, (b) proposed three new algorithms based on a novel data-driven approach to extract closed MEMOs from moving object datasets and (c) implemented and evaluated them along with the algorithm previously reported in [6], whose performance has never been evaluated. Experiments using real-world datasets revealed that our filter-and-refinement algorithm outperforms the others in many realistic settings.

Htoo Htet Aung, Kian-Lee Tan

Density Based Subspace Clustering over Dynamic Data

Modern data are often high dimensional and dynamic. Subspace clustering aims at finding the clusters and the dimensions of the high-dimensional feature space in which these clusters exist. So far, subspace clustering methods are mainly static and cannot address the dynamic nature of modern data. In this paper, we propose a dynamic subspace clustering method, which extends the density-based projected clustering algorithm PreDeCon to dynamic data. The proposed method efficiently examines only those clusters that might be affected by a population update. Both single and batch updates are considered.

Hans-Peter Kriegel, Peer Kröger, Irene Ntoutsi, Arthur Zimek

Hierarchical Clustering for Real-Time Stream Data with Noise

In stream data mining, stream clustering algorithms provide summaries of the relevant data objects that arrived in the stream. The model size of the clustering, i.e. the granularity, is usually determined by the speed (data per time) of the data stream. For varying streams, e.g. daytime or seasonal changes in the amount of data, most algorithms have to heavily restrict their model size such that they can handle the minimal time allowance. Recently the first anytime stream clustering algorithm has been proposed that flexibly uses all available time and dynamically adapts its model size. However, the method exhibits several drawbacks: no noise detection is performed, since every point is treated equally, and new concepts can only emerge within existing ones. In this paper we propose the LiarTree algorithm, which is capable of anytime clustering and at the same time robust against noise and novelty, to deal with arbitrary data streams.

Philipp Kranen, Felix Reidl, Fernando Sanchez Villaamil, Thomas Seidl

Architectures and Privacy

Energy Proportionality and Performance in Data Parallel Computing Clusters

Energy consumption in datacenters has recently become a major concern due to rising operational costs and scalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumed by the server nodes should be proportional to the amount of work performed. For data parallelism and fault tolerance purposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A covering set is a group of nodes that together contain at least one replica of each data block needed for performing computing tasks. In this work, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizes energy consumption while placing the remaining nodes in low-power standby mode. Our algorithms can also discover covering sets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that they can discover k-covering sets, i.e., sets of nodes that contain at least k replicas of each data block. Our experimental results show that we can achieve substantial energy savings without significant performance loss in diverse cluster configurations and working environments.

Jinoh Kim, Jerry Chou, Doron Rotem
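
A covering set can be approximated greedily, as in classical set cover; the sketch below (illustrative Python, ignoring the energy weights and the k-covering generalization treated in the paper) repeatedly picks the node holding replicas of the most still-uncovered blocks.

```python
def greedy_covering_set(block_replicas):
    """Greedy approximation of a covering set: pick nodes until every
    data block has at least one replica on a chosen node.

    block_replicas: dict mapping block id -> set of node ids holding a replica.
    Returns a set of node ids kept active; all other nodes could stand by."""
    uncovered = set(block_replicas)
    chosen = set()
    node_blocks = {}  # node -> blocks it stores
    for blk, nodes in block_replicas.items():
        for n in nodes:
            node_blocks.setdefault(n, set()).add(blk)
    while uncovered:
        # pick the node covering the most still-uncovered blocks
        best = max(node_blocks, key=lambda n: len(node_blocks[n] & uncovered))
        gained = node_blocks[best] & uncovered
        if not gained:
            raise ValueError("some block has no replica on any node")
        chosen.add(best)
        uncovered -= gained
    return chosen

if __name__ == "__main__":
    replicas = {"b1": {"n1", "n2"}, "b2": {"n2", "n3"}, "b3": {"n3", "n4"}}
    print(greedy_covering_set(replicas))  # e.g. {'n2', 'n3'}
```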

Privacy Preserving Group Linkage

The problem of privacy preserving record linkage is to find the intersection of records from two parties, while not revealing any private records to each other. Recently, group linkage has been introduced to measure the similarity of groups of records [19]. When we extend traditional privacy preserving record linkage methods to group linkage measurement, group membership privacy becomes vulnerable: record identity could be discovered from unlinked groups. In this paper, we introduce threshold privacy preserving group linkage (TPPGL) schemes, in which both parties only learn whether or not the groups are linked. Therefore, our approach is secure under group membership inference attacks. In experiments, we show that using the proposed TPPGL schemes, group membership privacy is well protected against inference attacks with a reasonable overhead.

Fengjun Li, Yuxin Chen, Bo Luo, Dongwon Lee, Peng Liu

Dynamic Anonymization for Marginal Publication

Marginal publication is an important technique for helping researchers improve their understanding of the correlation between published attributes. Without careful treatment, however, marginal publication carries a high risk of privacy leakage. Solutions such as ANGEL can eliminate such risks, but unfortunately at the cost of query accuracy. To improve the data utility of marginal publication while ensuring privacy, we propose a new technique called dynamic anonymization. We present the details of the technique and the theoretical properties of the proposed approach. Extensive experiments on real data show that our technique allows highly effective data analysis, while offering strong privacy guarantees.

Xianmang He, Yanghua Xiao, Yujia Li, Qing Wang, Wei Wang, Baile Shi

Pantheon: Exascale File System Search for Scientific Computing

Modern scientific computing generates petabytes of data in billions of files that must be managed. These files are often organized, by name, in a hierarchical directory tree common to most file systems. As the scale of data has increased, this has proven to be a poor method of file organization. Recent tools allow users to navigate files based on file metadata attributes, providing more meaningful organization. In order to search this metadata, it is often stored on separate metadata servers. This solution has drawbacks, however, due to the multi-tiered architecture of many large-scale storage solutions: as data is moved between tiers of storage and/or modified, the overhead incurred to maintain consistency between these tiers and the metadata server becomes very large. As scientific systems continue to push towards exascale, this problem will become more pronounced. A simpler option is to bypass the overhead of the metadata server and use the metadata storage inherent to the file system, but this approach currently has few tools that operate at large scale. This paper introduces the prototype for Pantheon, a file system search tool designed to use the metadata storage within the file system itself, bypassing the overhead of metadata servers. Pantheon is also designed with the scientific community's push towards exascale computing in mind. Pantheon combines hierarchical partitioning, query optimization, and indexing to perform efficient metadata searches over large-scale file systems.

Joseph L. Naps, Mohamed F. Mokbel, David H. C. Du

Massive-Scale RDF Processing Using Compressed Bitmap Indexes

The Resource Description Framework (RDF) is a popular data model for representing linked data sets arising from the web, as well as large scientific data repositories such as UniProt. RDF data intrinsically represents a labeled and directed multi-graph. SPARQL is a query language for RDF that expresses subgraph pattern-finding queries on this implicit multigraph in a SQL-like syntax. SPARQL queries generate complex intermediate join queries; to compute these joins efficiently, this paper presents a new strategy based on bitmap indexes. We store the RDF data in column-oriented compressed bitmap structures, along with two dictionaries. We find that our bitmap index-based query evaluation approach is up to an order of magnitude faster than the state-of-the-art system RDF-3X, for a variety of SPARQL queries on gigascale RDF data sets.

Kamesh Madduri, Kesheng Wu
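
To illustrate the general idea of bitmap-indexed triple pattern matching (not the paper's compressed, dictionary-encoded implementation), the toy sketch below keeps one bitmap per (position, term) pair and answers a SPARQL-style pattern by bitwise AND:

```python
class BitmapTripleIndex:
    """Toy bitmap index over RDF triples: one bit per triple, one bitmap per
    (position, term) pair.  Pattern matching becomes bitwise AND.
    (Python ints stand in for the compressed bitmaps used in practice.)"""

    def __init__(self, triples):
        self.triples = list(triples)
        self.maps = {}  # (position, term) -> int bitmap
        for i, triple in enumerate(self.triples):
            for pos, term in enumerate(triple):
                key = (pos, term)
                self.maps[key] = self.maps.get(key, 0) | (1 << i)

    def match(self, pattern):
        """pattern: (s, p, o) with None acting as a variable/wildcard."""
        bits = (1 << len(self.triples)) - 1  # start with all triples
        for pos, term in enumerate(pattern):
            if term is not None:
                bits &= self.maps.get((pos, term), 0)
        return [self.triples[i] for i in range(len(self.triples)) if bits >> i & 1]

if __name__ == "__main__":
    idx = BitmapTripleIndex([
        ("alice", "knows", "bob"),
        ("alice", "worksAt", "lbl"),
        ("bob", "knows", "carol"),
    ])
    print(idx.match(("alice", None, None)))  # both triples about alice
    print(idx.match((None, "knows", None)))  # the two 'knows' triples
```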

Database-as-a-Service for Long-Tail Science

Database technology remains underused in science, especially in the long tail — the small labs and individual researchers that collectively produce the majority of scientific output. These researchers increasingly require iterative, ad hoc analysis over ad hoc databases but cannot individually invest in the computational and intellectual infrastructure required for state-of-the-art solutions.

We describe a new “delivery vector” for database technology called SQLShare that emphasizes ad hoc integration, query, sharing, and visualization over pre-defined schemas. To empower non-experts to write complex queries, we synthesize example queries from the data itself and explore limited English hints to augment the process. We integrate collaborative visualization via a web-based service called VizDeck that uses automated visualization techniques with a card game metaphor to allow creation of interactive visual dashboards in seconds with zero programming.

We present data on the initial uptake and usage of the system and report preliminary results from testing out new features with the datasets collected during the initial pilot deployment. We conclude that the SQLShare system and associated services have the potential to increase uptake of relational database technology in the long tail of science.

Bill Howe, Garret Cole, Emad Souroush, Paraschos Koutris, Alicia Key, Nodira Khoussainova, Leilani Battle

Panel II

Data Scientists, Data Management and Data Policy

US science agencies have, or will soon have, a requirement that externally funded projects have “data management plans.” Projects with a large budget or a tradition of data access and repositories do not see the impact as significant. However, the impact of the requirement can be particularly challenging for single investigators and small collaborations, especially in multidisciplinary research. These data represent what is known as Dark Data (Heidorn, 2008) in the long tail of science, where the data sets may be relatively small and the funding and expertise for handling them are also small. But just developing tools or placing computer scientists with the investigators is not sufficient.

Sylvia Spengler

Applications and Models

Context-Aware Parameter Estimation for Forecast Models in the Energy Domain

Continuous balancing of energy demand and supply is a fundamental prerequisite for the stability and efficiency of energy grids. This balancing task requires accurate forecasts of future electricity consumption and production at any point in time. For this purpose, database systems need to be able to rapidly process forecasting queries and to provide accurate results in short time frames. However, time series from the electricity domain pose the challenge that measurements are constantly appended to the time series. A naive maintenance approach for such evolving time series would re-estimate the employed mathematical forecast model from scratch for each new measurement, which is very time consuming. We speed up forecast model maintenance by exploiting the particularities of electricity time series to reuse previously employed forecast models and their parameter combinations. These parameter combinations, and information about the context in which they were valid, are stored in a repository. We compare the current context with contexts from the repository to retrieve parameter combinations that were valid in similar contexts as starting points for further optimization. An evaluation shows that our approach improves the maintenance process, especially for complex models, by providing more accurate forecasts in less time than comparable estimation methods.

Lars Dannecker, Robert Schulze, Matthias Böhm, Wolfgang Lehner, Gregor Hackenbroich

Implementing a General Spatial Indexing Library for Relational Databases of Large Numerical Simulations

Large multi-terabyte numerical simulations of different physical systems consist of billions of particles or grid points and hundreds to thousands of snapshots. Increasingly these data sets are stored in large object-relational databases. Most statistical analyses involve extracting various spatio-temporal subsets. Existing built-in spatial indexes in commercial systems lack essential features required for many applications in the physical sciences. We describe a library that we have implemented in several languages and platforms (Java/Oracle, C#/SQL Server) based on generic space-filling curves, implemented as plug-ins. The index provides a mapping of higher dimensional space into the standard linear B-tree index of any relational database. The architecture allows intersections with different geometric primitives. The library has been used for cosmological N-body simulations and isotropic turbulence, providing sub-second response time over datasets exceeding several tens of terabytes. The library can also address complex space-time challenges, like temporal look-back into past light-cones of cosmological simulations.

Gerard Lemson, Tamás Budavári, Alexander Szalay
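
The core trick of such libraries is mapping multidimensional coordinates onto a one-dimensional key that a standard B-tree can index; a minimal two-dimensional Morton (Z-order) sketch in Python, assuming integer grid coordinates, looks like this (the library described supports higher dimensions and pluggable space-filling curves):

```python
def interleave_bits(x, y, bits=16):
    """Map a 2-D integer grid cell (x, y) onto a 1-D Morton (Z-order) key
    by interleaving the bits of its coordinates, so that a standard B-tree
    over the key preserves much of the spatial locality."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x contributes the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)   # y contributes the odd bits
    return key

if __name__ == "__main__":
    # nearby cells get nearby keys more often than row-major order would give
    print(interleave_bits(3, 5))  # 0b100111 = 39
```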

Histogram and Other Aggregate Queries in Wireless Sensor Networks

Wireless Sensor Networks (WSNs) are typically used to collect values of some phenomena in a monitored area. In many applications, users are interested in statistical summaries of the observed data, e.g., histograms reflecting the distribution of the collected values. In this paper we make two main contributions: (1) an efficient algorithm for answering histogram queries in a WSN, and (2) a method for efficiently using the obtained histogram to also process other types of aggregate queries. Our experimental results show that our proposed solutions are able to substantially extend the lifespan of the WSN.

Khaled Ammar, Mario A. Nascimento

Efficient In-Database Maintenance of ARIMA Models

Forecasting is an important analysis task, and there is a need to integrate time series models and estimation methods into database systems. The main issue is the computationally expensive maintenance of model parameters when new data are inserted. In this paper, we examine how an important class of time series models, the AutoRegressive Integrated Moving Average (ARIMA) models, can be maintained with respect to inserts. We propose a novel approach, on-demand estimation, for the efficient maintenance of maximum likelihood estimates from numerically implemented estimators. We present an extensive experimental evaluation on both real and synthetic data, which shows that our approach yields a substantial speedup while sacrificing only a limited amount of predictive accuracy.

Frank Rosenthal, Wolfgang Lehner

Recipes for Baking Black Forest Databases: Building and Querying Black Hole Merger Trees from Cosmological Simulations

Large-scale N-body simulations play an important role in advancing our understanding of the formation and evolution of large structures in the universe. These computations require a large number of particles, on the order of 10-100 billion, to realistically model phenomena such as the formation of galaxies. Among these particles, black holes play a dominant role in the formation of these structures. The properties of the black holes need to be assembled into merger-tree histories to model the process whereby two or more black holes merge to form a larger one. In the past, these analyses have been carried out with custom approaches that no longer scale to the size of black hole datasets produced by current cosmological simulations. We present algorithms and strategies to store a forest of black hole merger trees in relational databases (RDBMSs). We implemented this approach and present results with datasets containing 0.5 billion time series records belonging to over 2 million black holes. We demonstrate that this is a feasible approach to support interactive analysis and enable flexible exploration of black hole forest datasets.

Julio López, Colin Degraf, Tiziana DiMatteo, Bin Fu, Eugene Fink, Garth Gibson

CrowdLabs: Social Analysis and Visualization for the Sciences

Managing and understanding the growing volumes of scientific data is one of the most challenging issues scientists face today. As analyses get more complex and large interdisciplinary groups need to work together, knowledge sharing becomes essential to support effective scientific data exploration. While science portals and visualization Web sites have provided a first step towards this goal, by aggregating data from different sources and providing a set of pre-designed analyses and visualizations, they have important limitations. Often, these sites are built manually and are not flexible enough to support the vast heterogeneity of data sources, analysis techniques, data products, and the needs of different user communities. In this paper we describe CrowdLabs, a system that adopts the model used by social Web sites, allowing users to share not only data but also computational pipelines. The shared repository opens up many new opportunities for knowledge sharing and re-use, exposing scientists to tasks that provide examples of sophisticated uses of algorithms they would not have access to otherwise. CrowdLabs combines a set of usable tools and a scalable infrastructure to provide a rich collaborative environment for scientists, taking into account the requirements of computational scientists, such as accessing high-performance computers and manipulating large amounts of data.

Phillip Mates, Emanuele Santos, Juliana Freire, Cláudio T. Silva

Posters

Heidi Visualization of R-tree Structures over High Dimensional Data

High dimensional index structures are used to efficiently answer range queries in large databases. Visualization of such index structures helps in: (a) visualization of the data set in the hierarchical format of the index structure, (b) “explorative querying” of the data set, similar to explorative browsing on the web, and (c) index structure diagnostics: visualizing the structure along with its performance statistics enables the user to make changes to the structure for better performance. To the best of our knowledge, there is no such visualization for high dimensional index structures.

Shraddha Agrawal, Soujanya Vadapalli, Kamalakar Karlapalem

Towards Efficient and Precise Queries over Ten Million Asteroid Trajectory Models

The new generation of telescopes under construction return to the same area of the sky with sufficient frequency to enable tracking of moving objects such as asteroids, near-earth objects, and comets [4,5]. To detect these moving objects, one image may be subtracted from another (separated by several days or weeks) to differentiate variable and moving sources from the dense background of stars and galaxies. Moving sources may then be identified by querying against a database of expected positions of known asteroids. At a high-level, this task maps onto executing the query: “Return all known asteroids that are expected to be located within a given region at a given time.” We consider the problem of querying for asteroids in a specified interval in space and time, specifically as applied to populating the simulations of the data flow from the Large Synoptic Survey Telescope (LSST).

Yusra AlSayyad, K. Simon Krughoff, Bill Howe, Andrew J. Connolly, Magdalena Balazinska, Lynne Jones

Keyword Search Support for Automating Scientific Workflow Composition

Keyword search is indispensable for locating relevant documents on the web. Yet, at the same time, we have also grown aware of its limitations. It is often difficult to reach esoteric information obscured deep within various domains. For example, consider an earth science student who needs to identify how much a certain area in a nearby park has eroded since 1940 for a school project. Certainly, if this exact erosion information had previously been published on a web page, a search engine could probably locate it effortlessly. But due to the exactness of this query, the likelihood that such a web page exists is slim. However, avenues for obtaining this information exist in the form of scientific workflows, which can be implemented using web service composition.

David Chiu, Travis Hall, Farhana Kabir, Gagan Agrawal

FastQuery: A General Indexing and Querying System for Scientific Data

Modern scientific applications are producing vast amounts of data [2]. In many cases, the essential information needed for understanding scientific processes is stored in a relatively small number of data records. Efficiently locating the interesting data records is indispensable to the overall analysis procedure. The data structures most effective at performing these search operations are known as indices [2]. This work implements a flexible way of applying such an index to a range of different scientific data.

Jerry Chou, Kesheng Wu, Prabhat

Retrieving Accurate Estimates to OLAP Queries over Uncertain and Imprecise Multidimensional Data Streams

In this paper, we introduce a novel framework for estimating OLAP queries over uncertain and imprecise multidimensional data streams, along with three relevant research contributions: (i) a probabilistic data stream model, which describes both precise and imprecise multidimensional data stream readings in terms of nice confidence-interval-based Probability Distribution Functions (PDF); (ii) a possible-world semantics for uncertain and imprecise multidimensional data streams, which is based on an innovative data-driven approach that exploits “natural” features of OLAP data, such as the presence of clusters and high correlations; (iii) an innovative approach for providing theoretically-founded estimates to OLAP queries over uncertain and imprecise multidimensional data streams that exploits the well-recognized probabilistic estimators theory.

Alfredo Cuzzocrea

Hybrid Data-Flow Graphs for Procedural Domain-Specific Query Languages

Domain-specific query languages (DSQLs) let users express custom business logic. Relational databases provide only a limited set of options for executing business logic, usually stored procedures or a series of queries with some glue code. Both methods have drawbacks, and business logic is therefore often still executed on the application side, transferring large amounts of data between the application and the database, which is expensive. We translate a DSQL into a hybrid data-flow execution plan containing relational operators mixed with procedural ones. A cost model is used to drive the translation towards an optimal mixture of relational and procedural plan operators.

Bernhard Jaecksch, Franz Faerber, Frank Rosenthal, Wolfgang Lehner

Scalable and Automated Workflow in Mining Large-Scale Severe-Storm Simulations

The simulation of large-scale complex systems, such as modeling the effects of hurricanes or storms in coastal environments, typically requires a large amount of computing resources in addition to data storage capacity. To make an efficient prediction of the potential storm surge height for an incoming hurricane, surrogate models, which are computationally cheap and can reach a comparable level of accuracy with simulations, are desired. In this paper, we present a scalable and automated workflow for surrogate modeling with hurricane-related simulation data.

Lei Jiang, Gabrielle Allen, Qin Chen

Accurate Cost Estimation Using Distribution-Based Cardinality Estimates for Multi-dimensional Queries

Cardinality estimation is crucial for query optimization. The optimizer uses cardinality estimates to compute query-plan costs. Histograms are one of the most popular data structures used for cardinality estimation [2]. Because histograms compress the data set, the cardinality estimates they issue are not exact. We model these estimates as random variables, and denote the cardinality of query q by card(q).

Andranik Khachatryan, Klemens Böhm
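
As a simple concrete case of why histogram-based estimates are inexact, the one-dimensional equi-width sketch below (illustrative Python; the poster targets multi-dimensional queries) estimates a range cardinality by assuming uniformity within each bucket:

```python
def build_equi_width_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: per-bucket counts of values over [lo, hi)."""
    counts = [0] * num_buckets
    width = (hi - lo) / num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)
        counts[b] += 1
    return counts, width, lo

def estimate_range_cardinality(hist, a, b):
    """Estimate |{v : a <= v < b}| assuming uniformity inside each bucket;
    the error depends on how well the buckets capture the true distribution."""
    counts, width, lo = hist
    est = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        est += c * overlap / width
    return est

if __name__ == "__main__":
    data = [1, 2, 2, 3, 8, 9, 9, 9]
    hist = build_equi_width_histogram(data, 2, 0, 10)  # buckets [0,5), [5,10)
    print(estimate_range_cardinality(hist, 0, 5))      # exact here: 4.0
```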

Session-Based Browsing for More Effective Query Reuse

Scientists today are able to generate and collect data at an unprecedented scale [1]. Afterwards, scientists analyze and explore these datasets. Composing SQL queries, however, is a significant challenge for scientists because most are not database experts.

Nodira Khoussainova, YongChul Kwon, Wei-Ting Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu

The ETLMR MapReduce-Based ETL Framework

This paper presents ETLMR, a parallel Extract–Transform–Load (ETL) programming framework based on MapReduce. It has built-in support for high-level ETL-specific constructs including star schemas, snowflake schemas, and slowly changing dimensions (SCDs). ETLMR gives both high programming productivity and high ETL scalability.

Xiufeng Liu, Christian Thomsen, Torben Bach Pedersen

Top-k Similarity Search on Uncertain Trajectories

Similarity search on uncertain trajectories has a wide range of applications. To the best of our knowledge, no previous work has addressed similarity search on uncertain trajectories. In this paper, we provide a complete set of techniques for spatiotemporal similarity search on uncertain trajectories.

Chunyang Ma, Hua Lu, Lidan Shou, Gang Chen, Shujie Chen

Fast and Accurate Trajectory Streams Clustering

Trajectory data streams are huge amounts of data pertaining to the time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the ability to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data stream clustering, which has proven to be really challenging as we deal with data (trajectories) for which the order of elements is relevant.

Elio Masciari

Data-Driven Multidimensional Design for OLAP

OLAP is a popular technology for querying scientific and statistical databases, but its success heavily depends on a proper design of the underlying multidimensional (MD) databases (i.e., databases based on the fact/dimension paradigm). Different approaches to automatically identify facts are available nowadays, but all MD design methods rely on discovering functional dependencies (FDs) to identify dimensions. However, an unbounded FD search generates a combinatorial explosion, and accordingly these methods produce MD schemas with too many dimensions whose meaning has not been analyzed in advance. In contrast, (i) we use the available ontological knowledge to drive the FD search and avoid the combinatorial explosion, and (ii) we only propose dimensions of interest to analysts by performing a statistical study of the data.

Oscar Romero, Alberto Abelló

An Adaptive Outlier Detection Technique for Data Streams

This work presents an adaptive outlier detection technique for data streams, called Automatic Outlier Detection for Data Streams (A-ODDS), which identifies outliers with respect to all the received data points (the global context) as well as temporally close data points (the local context), where the local context is selected based on time and on changes in the data distribution.

Shiblee Sadik, Le Gruenwald

Power-Aware DBMS: Potential and Challenges

Energy consumption has become a first-class optimization goal in computing system design and implementation. Database systems, being a major consumer of computing resources (thus energy) in modern data centers, also face the challenges of going green. In this position paper, we describe our vision on this new direction of database system research, and report the results of our recent work on this topic. We describe our ideas on the key issues in designing a power-aware DBMS and sketch our solutions to such issues. Specifically, we believe that the ability for the DBMS to dynamically adjust various knobs to satisfy energy (and performance) goals is the main technical challenge in this paradigm. To address that challenge, we propose dynamic modeling and tuning techniques based on formal feedback control theory. Our preliminary data clearly show that the energy savings can be significant.

Yi-cheng Tu, Xiaorui Wang, Zichen Xu

Backmatter
