
2010 | Book

Scientific and Statistical Database Management

22nd International Conference, SSDBM 2010, Heidelberg, Germany, June 30–July 2, 2010. Proceedings

Edited by: Michael Gertz, Bertram Ludäscher

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Talks

Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis

As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are proliferating. No fewer than a dozen companies selling parallel database products have launched in the past few years to meet this demand. At the same time, MapReduce-based options, such as the open source Hadoop framework, are becoming increasingly popular, and a plethora of research publications in the past two years demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks.

Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of the popular implementations of these systems. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB for both traditional decision support workloads (i.e., TPC-H) and also scientific data management (analyzing the Uniprot protein sequence, function, and annotation data).

Daniel J. Abadi
Emerging Trends and Converging Technologies in Data Intensive Scalable Computing

There is today wide agreement that data-intensive scalable computing methods are essential to advancing research in many disciplines. Such methods are expected to play an increasingly important role in providing support for well-informed technical decisions and policies. They are therefore of great scientific and social importance.

Roger S. Barga

Query Processing

Deriving Spatio-temporal Query Results in Sensor Networks

Tracking moving objects in relation to regions of interest, e.g., for pollution control or habitat monitoring, is an important application of Sensor Networks (SN). Research on Moving Object Databases has resulted in sophisticated mechanisms for querying moving objects and regions declaratively. Applying these results to SN in a straightforward way is not possible: First, sensor nodes typically can only determine that an object is in their vicinity, but not its exact position. Second, nodes may fail, or areas may be unobservable. All this is problematic because the evaluation of spatio-temporal queries requires precise knowledge about object positions. In this paper we specify meaningful results of spatio-temporal queries, given those SN-specific phenomena, and show how to derive them from object detections by sensor nodes. We distinguish between objects which definitely fulfill the query and those that could possibly do so, but where these inaccuracies stand in the way of a definite answer. We study both spatio-temporal predicates and spatio-temporal developments, i.e., sequences of predicates describing complex movement patterns of objects.

Markus Bestehorn, Klemens Böhm, Patrick Bradley, Erik Buchmann
Efficient and Adaptive Distributed Skyline Computation

Skyline queries have attracted considerable attention over the last few years, mainly due to their ability to return interesting objects without the need for user-defined scoring functions. In this work, we study the problem of distributed skyline computation and propose an adaptive algorithm towards controlling the degree of parallelism and the required network traffic. In contrast to state-of-the-art methods, our algorithm efficiently handles diverse preferences imposed on attributes. The key idea is to partition the data using a grid scheme and, for each query, to build on-the-fly a dependency graph among partitions which can help in effective pruning. Our algorithm operates in two modes: (i) full-parallel mode, where processors are activated simultaneously, or (ii) cascading mode, where processors are activated in a cascading manner using propagation of intermediate results, thus reducing network traffic and potentially increasing throughput. Performance evaluation results, based on real-life and synthetic data sets, demonstrate the scalability with respect to the number of processors and database size.
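
Skyline computation ultimately rests on the pairwise dominance test illustrated in the minimal sketch below; the grid partitioning, dependency graph, and distributed modes described in the abstract are not reproduced, and all names are illustrative assumptions (smaller values are taken to be preferable on every attribute).

def dominates(a, b):
    """True if point a dominates point b: a is <= b everywhere and < somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Naive block-nested-loop skyline: keep points not dominated by any other point."""
    result = []
    for p in points:
        if any(dominates(q, p) for q in points if q is not p):
            continue
        result.append(p)
    return result

if __name__ == "__main__":
    data = [(1, 9), (3, 3), (4, 8), (5, 1), (4, 4)]
    print(skyline(data))  # [(1, 9), (3, 3), (5, 1)]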

George Valkanas, Apostolos N. Papadopoulos
On the Efficient Construction of Multislices from Recurrences

Recurrences are defined as sets of time instants associated with events and they are present in many application domains, including public transport schedules and personal calendars. Because of their large size, recurrences are rarely stored explicitly, but some form of compact representation is used. Multislices are a compact representation that is well suited for storage in relational databases. A multislice is a set of time slices where each slice employs a hierarchy of time granularities to compactly represent multiple recurrences.

In this paper we investigate the construction of multislices from recurrences. We define the compression ratio of a multislice, show that different construction strategies produce multislices with different compression ratios, and prove that the construction of minimal multislices, i.e., multislices with a maximal compression ratio, is an NP-hard problem. We propose a scalable algorithm, termed LMerge, for the construction of multislices from recurrences. Experiments with real-world recurrences from public transport schedules confirm the scalability and usefulness of LMerge: the generated multislices are very close to minimal multislices, achieving an average compression ratio of approx. 99%. A comparison with a baseline algorithm that iteratively merges pairs of mergeable slices shows significant improvements of LMerge over the baseline approach.

Romans Kasperovics, Michael H. Böhlen, Johann Gamper
Optimizing Query Processing in Cache-Aware Wireless Sensor Networks

Most models for Wireless Sensor Networks (WSNs) assume the existence of a base station where query results could in principle be cached; however, the opportunity for re-using such cached data to minimize data traffic in the WSN has not been well explored thus far. Aiming to fill this gap, we propose an approach that first clips the original query into a polygon after selectively choosing a good subset of the cached queries for reuse. Next, this polygon is partitioned into sub-queries that are then submitted to the WSN. These two problems are interconnected and lead to a highly combinatorial problem that justifies the use of efficient and effective heuristics. This paper presents algorithms for each of these problems that are used within a cost-driven optimization search in order to find a set of sub-queries that minimizes the cost of in-network query processing. Experimental results show that our heuristic solution is orders of magnitude faster than an exhaustive search, and yields no more than a 10% loss compared to the optimal query processing.

Mario A. Nascimento, Romulo A. E. Alencar, Angelo Brayner
Approximate Query Answering and Result Refinement on XML Data

Today, many economic decisions are based on the fast analysis of XML data. Yet, the time to process analytical XML queries is typically high. Although current XML techniques focus on the optimization of query processing, none of them supports early approximate feedback as is possible in relational Online Aggregation systems.

In this paper, we introduce a system that provides fast estimates for XML aggregation queries. During processing, these estimates and the assigned confidence bounds improve continuously. In our evaluation, we show that, without significantly increasing the overall execution time, our system returns accurate guesses of the final answer long before traditional systems are able to produce output.

Katja Seidler, Eric Peukert, Gregor Hackenbroich, Wolfgang Lehner
Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries

In this paper, we present a novel method to efficiently process top-k spatial queries with conjunctive Boolean constraints on textual content. Our method combines an R-tree with an inverted index by including spatial references in posting lists. The result is a disk-resident, dual-index data structure that is used to proactively prune the search space. R-tree nodes are visited in best-first order. A node entry is placed in the priority queue if there exists at least one object that satisfies the Boolean condition in the subtree pointed to by the entry; otherwise, the subtree is not further explored. We show via extensive experimentation with real spatial databases that our method outperforms alternative techniques while scaling to a large number of objects.
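
As a rough illustration of the pruning rule described above, the sketch below runs a best-first search over a hierarchy in which every entry carries the set of keywords occurring somewhere in its subtree, so subtrees that cannot satisfy a conjunctive keyword condition are never enqueued. This is not the paper's dual-index structure; the Entry layout, keyword summaries, and distance bounds are assumptions for the sketch.

import heapq

class Entry:
    def __init__(self, min_dist, keywords, children=None, obj=None):
        self.min_dist = min_dist        # lower bound on the distance to the query location
        self.keywords = keywords        # set of keywords present somewhere below this entry
        self.children = children or []  # child entries (empty for object-level entries)
        self.obj = obj                  # payload for object-level entries, else None

def top_k_spatial_boolean(root, required, k):
    """Return up to k nearest objects containing all keywords in the set `required`."""
    heap = [(root.min_dist, id(root), root)]
    results = []
    while heap and len(results) < k:
        dist, _, entry = heapq.heappop(heap)
        if entry.obj is not None:                   # object entries pop in distance order
            results.append((dist, entry.obj))
            continue
        for child in entry.children:
            if required <= child.keywords:          # pruning: skip subtrees that cannot match
                heapq.heappush(heap, (child.min_dist, id(child), child))
    return results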

Ariel Cary, Ouri Wolfson, Naphtali Rishe

Scientific Data Management and Analysis

A Framework for Moving Sensor Data Query and Retrieval of Dynamic Atmospheric Events

One challenge in Earth science research is the accurate and efficient ad-hoc query and retrieval of Earth science satellite sensor data based on user-defined criteria to study and analyze atmospheric events such as tropical cyclones. The problem can be formulated as a spatio-temporal join query to identify the spatio-temporal location where moving sensor objects and dynamic atmospheric event objects intersect, either precisely or within a user-defined proximity. In this paper, we describe an efficient query and retrieval framework to handle the problem of identifying the spatio-temporal intersecting positions for satellite sensor data retrieval. We demonstrate the effectiveness of our proposed framework using sensor measurements from QuikSCAT (wind field measurement) and TRMM (precipitation vertical profile measurements) satellites, and the trajectories of the tropical cyclones occurring in the North Atlantic Ocean in 2009.

Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu, Markus Schneider
Client + Cloud: Evaluating Seamless Architectures for Visual Data Analytics in the Ocean Sciences

Science is becoming data-intensive, requiring new software architectures that can exploit resources at all scales: local GPUs for interactive visualization, server-side multi-core machines with fast processors and large memories, and scalable, pay-as-you-go cloud resources. Architectures that seamlessly and flexibly exploit all three platforms are largely unexplored. Informed by a long-term collaboration with ocean scientists, we articulate a suite of representative visual data analytics workflows and use them to design and implement a multi-tier immersive visualization system. We then analyze a variety of candidate architectures spanning all three platforms, articulate their tradeoffs and requirements, and evaluate their performance. We conclude that although “pushing the computation to the data” is generally the optimal strategy, no one single architecture is optimal in all cases and client-side processing cannot be made obsolete by cloud computing. Rather, rich visual data analytics applications benefit from access to a variety of cross-scale, seamless “client + cloud” architectures.

Keith Grochow, Bill Howe, Mark Stoermer, Roger Barga, Ed Lazowska
Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster

Scientists’ ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprised of 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.

YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, Sarah Loebman
Database Design for High-Resolution LIDAR Topography Data

The advent of high-resolution mapping technologies such as airborne Light Detection and Ranging (LIDAR) has revolutionized the study of processes acting on the earth’s surface. However, the massive volumes of data produced by LIDAR technology pose significant technical challenges in terms of the management and web-based distribution of these datasets. This paper provides a case study in the use of relational database technology for serving large airborne LIDAR “point cloud” datasets, as part of the National Science Foundation funded OpenTopography facility. We have experimented with the use of spatial extensions in the database as well as implementation solutions from a single partition database on a supercomputer resource to a multi-partition implementation on a shared-nothing commodity cluster for management of these terabyte scale datasets. We also describe future directions being pursued to support binary data formats and for scaling to larger system configurations.

Viswanath Nandigam, Chaitan Baru, Christopher Crosby
PetaScope: An Open-Source Implementation of the OGC WCS Geo Service Standards Suite

A growing number of scientific data archives set up Web interfaces for researchers and sometimes for the general public. While these services often deploy sophisticated search facilities, these are constrained to the metadata level, where conventional SQL/XML technology can be leveraged; no comparable retrieval support is available on original observation or simulation data.

For raster data in the earth sciences, this is overcome by the 2008 Open Geospatial Consortium (OGC) Web Coverage Processing Service (WCPS) standard. It defines a retrieval language on multi-dimensional raster data for ad-hoc navigation, extraction, aggregation, and analysis. WCPS is part of the Web Coverage Service (WCS) suite, which additionally contains a simple retrieval service and an upload and update service for raster data.

In this contribution we present PetaScope, an open-source implementation of the complete WCS suite. Most of the suite’s functionality is available already, and a first code version has been released. An online showcase is being built, and the system will soon undergo real-life evaluation on mass data. After briefly introducing the WCPS language concept, we discuss the architecture of the service stack based on examples publicly available on the demo website.

Andrei Aiordăchioaie, Peter Baumann
Towards Archaeo-informatics: Scientific Data Management for Archaeobiology

In this short paper we outline a novel and rich application domain for scientific data management: Archaeobiology. We give a short introduction to the key quests and issues in this research domain and derive a list of data management and data analysis tasks that require original contributions from the Computer Science community and can initiate the birth of a new research discipline which we call Archaeo-Informatics. Furthermore, we describe a prototype for scientific data management called OSSOBOOK that meets many of the identified requirements of this application domain. In particular, OSSOBOOK is a distributed database system specially designed for Archaeobiology data management and analysis. Finally, we give some future perspectives which are intended to serve as input for novel challenges for the Computer Science community.

Hans-Peter Kriegel, Peer Kröger, Christiaan Hendrikus van der Meijden, Henriette Obermaier, Joris Peters, Matthias Renz

Data Mining

DESSIN: Mining Dense Subgraph Patterns in a Single Graph

Currently, a large amount of data can be best represented as graphs, e.g., social networks, protein interaction networks, etc. The analysis of these networks is an urgent research problem with great practical applications. In this paper, we study the particular problem of finding frequently occurring dense subgraph patterns in a large connected graph. Due to the ambiguous nature of occurrences of a pattern in a graph, we devise a novel frequent pattern model for a single graph. For this model, the widely used Apriori property no longer holds. However, we are able to identify several important properties, i.e., small diameter, reachability, and fast calculation of automorphism. These properties enable us to employ an index-based method to locate all occurrences of a pattern in a graph and a depth-first search method to find all patterns. Concluding this work, a large number of real and synthetic data sets are used to show the effectiveness and efficiency of the DESSIN method.

Shirong Li, Shijie Zhang, Jiong Yang
Discovery of Evolving Convoys

Traditionally, a convoy is defined as a set of moving objects that are close to each other for a period of time. Existing techniques, following this traditional definition, cannot find evolving convoys with dynamic members and do not have any monitoring aspect in their design. We propose new concepts of dynamic convoys and evolving convoys, which reflect real-life scenarios, and develop algorithms to discover evolving convoys in an incremental manner.

Htoo Htet Aung, Kian-Lee Tan
Finding Top-k Similar Pairs of Objects Annotated with Terms from an Ontology

With the growing focus on semantic searches, an increasing number of standardized ontologies are being designed to describe data. We investigate the querying of objects described by a tree-structured ontology. Specifically, we consider the case of finding the top-k best pairs of objects that have been annotated with terms from such an ontology when the object descriptions are available only at runtime. We consider three distance measures. The first one defines the object distance as the minimum pairwise distance between the sets of terms describing them, and the second one defines the distance as the average pairwise term distance. The third and most useful distance measure, the earth mover's distance, finds the best way of matching the terms and computes the distance corresponding to this best matching. We develop lower bounds that can be aggregated progressively and utilize them to speed up the search for top-k object pairs when the earth mover's distance is used. For the minimum pairwise distance, we devise an algorithm that runs in O(D + Tk log k) time, where D is the total information size and T is the number of terms in the ontology. We also develop a best-first search strategy for the average pairwise distance that utilizes lower bounds generated in an ordered manner. Experiments on real and synthetic datasets demonstrate the practicality and scalability of our algorithms.
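
The first two object distances mentioned above can be sketched directly, given any term-to-term distance such as path length in the ontology tree; the earth mover's distance variant and the progressive lower bounds are not reproduced, and the function names are illustrative only.

from itertools import product

def min_pairwise(terms_a, terms_b, term_dist):
    """Minimum distance over all term pairs from the two annotation sets."""
    return min(term_dist(s, t) for s, t in product(terms_a, terms_b))

def avg_pairwise(terms_a, terms_b, term_dist):
    """Average distance over all term pairs from the two annotation sets."""
    pairs = list(product(terms_a, terms_b))
    return sum(term_dist(s, t) for s, t in pairs) / len(pairs)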

Arnab Bhattacharya, Abhishek Bhowmick, Ambuj K. Singh
Identifying the Most Influential User Preference from an Assorted Collection

A conventional skyline query requires no query point, and usually employs a MIN or MAX annotation only to prefer smaller or larger values on each dimension. A relative skyline query, in contrast, is issued with a combination of a query point and a set of preference annotations for all involved dimensions. Due to the relative dominance definition in a relative skyline query, there exist various such combinations, which we call user preferences. It is also often interesting to identify, from an assorted user preference collection, the most influential preference that leads to the largest relative skyline. We call this problem the most influential preference query. In this paper we propose a complete set of techniques to solve these novel and useful problems within a uniform framework. We first formalize different preference annotations that can be imposed on a dimension by a relative skyline query user. We then propose an effective transformation to handle all these annotations in a uniform way. Based on the transformation, we adapt the well-established Branch-and-Bound Skyline (BBS) algorithm to process relative skyline queries with assorted user preferences. In order to process the most influential preference queries, we develop two aggregation R-tree based algorithms. We conduct extensive experiments on both real and synthetic datasets to evaluate our proposals.

Hua Lu, Linhao Xu
MC-Tree: Improving Bayesian Anytime Classification

In scientific databases large amounts of data are collected to create knowledge repositories for deriving new insights or planning further experiments. These databases can be used to train classifiers that later categorize new data tuples. However, the large amounts of data might yield a time-consuming classification process, e.g., for nearest neighbor or kernel density estimators. Anytime classifiers bypass this drawback by being interruptible at any time, while the quality of the result improves with higher time allowances. Interruptible classifiers are especially useful when newly arriving data has to be classified on demand, e.g., during a running experiment. A statistical approach to anytime classification has recently been proposed using Bayes classification on kernel density estimates.

In this paper we present a novel data structure called MC-Tree (Multi-Class Tree) that significantly improves Bayesian anytime classification. The tree stores a hierarchy of mixture densities that represent objects from several classes. Data transformations are used during tree construction to optimize the condition of the tree with respect to multiple classes. Anytime classification is achieved through novel query dependent model refinement approaches that take the entropy of the current mixture components into account. We show in experimental evaluation that the MC-Tree outperforms previous approaches in terms of anytime classification accuracy.

Philipp Kranen, Stephan Günnemann, Sergej Fries, Thomas Seidl
Non-intrusive Quality Analysis of Monitoring Data

Any large-scale operational system running over a variety of devices requires a monitoring mechanism to assess the health of the overall system. The Technical Infrastructure Monitoring System (TIM) at CERN is one such system, and monitors a wide variety of devices and their properties, such as electricity supplies, device temperatures, liquid flows, etc. Without adequate quality assurance, the data collected from such devices leads to false positives and false negatives, reducing the effectiveness of the monitoring system. The quality must, however, be measured in a non-intrusive way, so that the critical path of the data flow is not affected by the quality computation. The quality computation should also scale to large volumes of incoming data. To address these challenges, we develop a new statistical module, which monitors the data collected by TIM and reports its quality to the operators. The statistical module uses Oracle RDBMS as the underlying store, and builds hierarchical summaries on the basic events to scale to the volume of data. It has built-in fault-tolerance capability to recover from multiple computation failures. In this paper, we describe the design of the statistical module, and its usefulness for all parties involved with TIM: the system administrators, the operators using the system to monitor the devices, and the engineers responsible for attaching them to the system. We present concrete examples of how the software module helped with the monitoring, configuration and design of TIM since its introduction last year.

Mark Brightwell, Anastasia Ailamaki, Anna Suwalska
Visual Decision Support for Ensemble Clustering

The continuing growth of data leads to major challenges for data clustering in scientific data management. Clustering algorithms must handle high data volumes/dimensionality, while users need assistance during their analyses. Ensemble clustering provides robust, high-quality results and eases the algorithm selection and parameterization. Drawbacks of available concepts are the lack of facilities for result adjustment and the missing support for result interpretation. To tackle these issues, we have already published an extended algorithm for ensemble clustering that uses soft clusterings. In this paper, we propose a novel visualization, tightly coupled to this algorithm, that provides assistance for result adjustments and allows the interpretation of clusterings for data sets of arbitrary size.

Martin Hahmann, Dirk Habich, Wolfgang Lehner

Indexes and Data Representation

An Indexing Scheme for Fast and Accurate Chemical Fingerprint Database Searching

Rapid chemical database searching is important for drug discovery. Chemical compounds are represented as long fixed-length bit vectors called fingerprints. The vectors record the presence or absence of particular features or substructures of the corresponding molecules. In a typical drug discovery application, several thousands of query fingerprints are screened for similarity against a database of millions of fingerprints to identify suitable drug candidates. The existing methods of full database scan and range search take considerable amounts of time for such a task. We present a new index-based search method called “ChemDex” (Chemical fingerprint inDexing) for speeding up the fingerprint database search. We propose a novel chain scoring scheme to calculate the Tanimoto (Jaccard) scores of the fingerprints using an early-termination strategy. We tested our proposed method using 1,000 randomly selected query fingerprints on the NCBI PubChem database containing about 19.5 million fingerprints. Experimental results show that ChemDex is up to 109.9 times faster than the full database scan method, and up to 2.1 times faster than the state-of-the-art range search method for memory-based retrieval. For disk-based retrieval, it is up to 145.7 times and 1.7 times faster than the full scan and the range search, respectively. The speedup is achieved without any loss of accuracy, as ChemDex generates exactly the same results as the full scan and the range search.
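
For context, the sketch below shows Tanimoto (Jaccard) scoring on bit-vector fingerprints together with the simple popcount-based upper bound that lets a threshold search skip candidates before the full comparison; it is only an illustration of the score itself, not of ChemDex's chain scoring scheme, and all names are assumptions.

def tanimoto(a: int, b: int) -> float:
    """Tanimoto score of two fingerprints stored as Python ints (bit vectors, Python 3.10+)."""
    inter = (a & b).bit_count()
    union = (a | b).bit_count()
    return inter / union if union else 1.0

def upper_bound(bits_a: int, bits_b: int) -> float:
    """T(a, b) can never exceed min(|a|, |b|) / max(|a|, |b|)."""
    lo, hi = sorted((bits_a, bits_b))
    return lo / hi if hi else 1.0

def search(query: int, database: list[int], threshold: float) -> list[int]:
    """Indices of fingerprints whose Tanimoto score with the query reaches the threshold."""
    qbits = query.bit_count()
    hits = []
    for i, fp in enumerate(database):
        if upper_bound(qbits, fp.bit_count()) < threshold:
            continue                       # early rejection: the bound is already too low
        if tanimoto(query, fp) >= threshold:
            hits.append(i)
    return hits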

Zeyar Aung, See-Kiong Ng
BEMC: A Searchable, Compressed Representation for Large Seismic Wavefields

State-of-the-art numerical solvers in the Earth Sciences produce multi-terabyte datasets per execution. Operating on increasingly large datasets becomes challenging due to insufficient data bandwidth. Queries result in I/O access patterns that are difficult to handle. BEMC is a new mechanism that allows querying and processing wavefields in the compressed representation.

This approach combines well-known spatial-indexing techniques with novel compressed representations, thus reducing I/O bandwidth requirements. A new compression approach based on boundary integral representations exploits properties of the simulated domain. Frequency domain representation further compresses the data by eliminating temporal redundancy found in wave propagation data.

This representation enables the transformation of a large I/O workload into a massively parallel, CPU-intensive computation. Queries to this representation result in largely sequential I/O accesses. Although decompression places heavy demands on the CPU, it exhibits parallelism well-suited for many-core processors. We evaluate our approach in the context of data analysis for Earth Sciences datasets.

Julio López, Leonardo Ramírez-Guzmán, Jacobo Bielak, David O’Hallaron
Dynamic Data Reorganization for Energy Savings in Disk Storage Systems

High performance computing (HPC) systems utilize parallel file systems that are striped over a large number of disks to address the I/O performance requirements of data-intensive applications. These disks are typically configured into RAID groups for resiliency. Such arrangements of on-line disk storage systems constitute one of the major consumers of power in HPC centers. Many disk power management (DPM) schemes have been suggested, whereby the power consumed by these disks is reduced by spinning them down after they experience long idle periods. Spinning the disks down and up results in additional energy and response time costs. For that reason, DPM schemes are effective only if the disks experience relatively long idle periods and the scheme does not introduce a severe response time penalty. In this paper we focus on RAID storage systems where, depending on the RAID level, groups of disks are formed into RAID units. We introduce a dynamic block exchange algorithm which switches data between such units based on the observed workload, such that frequently accessed blocks end up residing on a few “hot” units, thus allowing the majority of RAID groups to experience longer idle periods. We validated the effectiveness of the algorithm with several real-life and synthetic traces, showing power savings of up to 50% with very small response time penalties.

Ekow Otoo, Doron Rotem, Shih-Chiang Tsao
Organization of Data in Non-convex Spatial Domains

We present a technique for organizing data in spatial databases with non-convex domains based on an automatic characterization using the medial-axis transform (MAT). We define a tree based on the MAT and enumerate its branches to partition space and define a linear order on the partitions. This ordering clusters data in a manner that respects the complex shape of the domain. The ordering has the property that all data down any branch of the medial axis, regardless of the geometry of the sub-region, are contiguous on disk. Using this data organization technique, we build a system to provide efficient data discovery and analysis of the observational and model data sets of the Chesapeake Bay Environmental Observatory (CBEO). On typical CBEO workloads in which scientists query contiguous substructures of the Chesapeake Bay, we improve query processing performance by a factor of two when compared with orderings derived from space filling curves.

Eric Perlman, Randal Burns, Michael Kazhdan, Rebecca R. Murphy, William P. Ball, Nina Amenta
PrefIndex: An Efficient Supergraph Containment Search Technique

Graphs are widely used in many applications to model complex data structures. In this paper, we study the problem of supergraph containment search. To avoid the NP-complete subgraph isomorphism test, most existing works follow the filtering-verification framework and select graph features to build effective indexes, which filter false results (graphs) before conducting the costly verification. However, searching features multiple times in the query graphs yields huge redundant computation, which has led to the emergence of the computation-sharing framework. This paper follows the roadmap of the computation-sharing framework to efficiently process supergraph containment queries. Firstly, database graphs are clustered into disjoint groups for sharing the computation cost within each group. While it is shown to be NP-hard to maximize the computation-sharing benefits of a clustering, an efficient algorithm is developed to approximate the optimal solution with an approximation factor of $\frac{1}{2}$. A novel prefix-sharing indexing technique, PrefIndex, is then proposed, based on which an efficient query processing algorithm integrating both filtering and verification is developed. Finally, PrefIndex is enhanced with multi-level sharing and suffix-sharing to further avoid redundant computation. An extensive empirical study demonstrates the efficiency and scalability of our techniques, which achieve orders of magnitude speed-up against the state-of-the-art techniques.

Gaoping Zhu, Xuemin Lin, Wenjie Zhang, Wei Wang, Haichuan Shang
Supporting Web-Based Visual Exploration of Large-Scale Raster Geospatial Data Using Binned Min-Max Quadtree

Traditionally, environmental scientists are limited to simple display and animation of large-scale raster geospatial data derived from remote sensing instrumentation and model simulation outputs. Identifying regions that satisfy certain range criteria, e.g., temperature between [t1,t2) and precipitation between [p1,p2), plays an important role in query-driven visualization and visual exploration in general. In this study, we have proposed a Binned Min-Max Quadtree (BMMQ-Tree) to index large-scale numeric raster geospatial data and efficiently process queries that identify regions of interest, taking advantage of the approximate nature of visualization-related queries. We have also developed an end-to-end system that allows users to visually and interactively explore large-scale raster geospatial data in a Web-based environment by integrating our query processing backend and a commercial Web-based Geographical Information System (Web-GIS). Experiments using real global environmental data have demonstrated the efficiency of the proposed BMMQ-Tree. Both experiences and lessons learnt from the development of the prototype system and experiments on the real dataset are reported.
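
The range-criteria pruning idea behind a min-max quadtree can be sketched as follows, assuming each node records the minimum and maximum cell value of its region; the binning of the actual BMMQ-Tree is omitted and the class and field names are illustrative only.

class QuadNode:
    def __init__(self, region, vmin, vmax, children=None):
        self.region = region            # (row0, col0, row1, col1) covered by this node
        self.vmin, self.vmax = vmin, vmax
        self.children = children or []  # empty list for leaf nodes

def query_regions(node, lo, hi, out):
    """Collect regions whose cell values fall inside [lo, hi), pruning with min/max."""
    if node.vmax < lo or node.vmin >= hi:
        return                              # no cell in this subtree can qualify
    if lo <= node.vmin and node.vmax < hi:
        out.append(node.region)             # every cell qualifies: report the whole region
        return
    for child in node.children:
        query_regions(child, lo, hi, out)
    if not node.children:
        out.append(node.region)             # leaf with partial overlap: approximate answer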

Jianting Zhang, Simin You

Scientific Workflow and Provenance

Bridging Workflow and Data Provenance Using Strong Links

As scientists continue to migrate their work to computational methods, it is important to track not only the steps involved in the computation but also the data consumed and produced. While this provenance information can be captured, in existing approaches, it often contains only weak references between data and provenance. When data files or provenance are moved or modified, it can be difficult to find the data associated with the provenance or to find the provenance associated with the data. We propose a persistent storage mechanism that manages input, intermediate, and output data files, strengthening the links between provenance and data. This mechanism provides better support for reproducibility because it ensures the data referenced in provenance information can be readily located. Another important benefit of such management is that it allows caching of intermediate data which can then be shared with other users. We present an implemented infrastructure for managing data in a provenance-aware manner and demonstrate its application in scientific projects.

David Koop, Emanuele Santos, Bela Bauer, Matthias Troyer, Juliana Freire, Cláudio T. Silva
LIVE: A Lineage-Supported Versioned DBMS

This paper presents LIVE, a complete DBMS designed for applications with many stored derived relations, and with a need for simple versioning capabilities when base data is modified. Target applications include, for example, scientific data management and data integration. A key feature of LIVE is the use of lineage (provenance) to support modifications and versioning in this environment. In our system, lineage significantly facilitates both: (1) efficient propagation of modifications from base to derived data; and (2) efficient execution of a wide class of queries over versioned, derived data. LIVE is fully implemented; detailed experimental results are presented that validate our techniques.

Anish Das Sarma, Martin Theobald, Jennifer Widom
Optimizing Resource Allocation for Scientific Workflows Using Advance Reservations

Scientific applications are increasingly faced with very large volumes of data and complex, resource-intensive workflows that process or analyze these data. The recent interest in web services and service-oriented architectures has strongly facilitated the development of individual workflow activities as well as their composition and the distributed execution of complete workflows. However, in many applications concurrent scientific workflows may be served by multiple competing providers, with each of them offering only limited resources. At the same time, these workflows need to be executed in a predictable manner, with dedicated Quality of Service guarantees. In this paper, we introduce an approach to Advance Resource Reservation for service-oriented complex scientific workflows that optimizes resource consumption based on user-defined criteria (e.g., cost or time). It exploits optimization techniques using genetic algorithms for finding optimal or near-optimal allocations in a distributed system. The approach takes into account the locality of services and, in particular, enforces constraints imposed by control or data flow dependencies within workflows. Finally, we provide a comprehensive evaluation of the effectiveness of the proposed approach.

Christoph Langguth, Heiko Schuldt
A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows

Fault-tolerance and failure recovery in scientific workflows are still relatively young topics. The work done in the domain so far mostly applies classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Often scientific workflow systems simply rely on the fault-tolerance capabilities provided by their third-party subcomponents such as schedulers, Grid resources, or the underlying operating systems. When failures occur at the underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and the ability of the system to recover from those failures may be limited. In this paper, we present an architecture that tries to address this for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, and through a supporting implementation of more comprehensive failure coverage and recovery options. We discuss our framework in the context of the failures observed in two production-level Kepler-based workflows, specifically XGC and S3D. The framework is divided into three major components: (i) a general contingency Kepler actor that provides recovery block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. This framework takes advantage of the provenance data collected by the Kepler-based workflows to detect failures and help in fault-tolerance decision making.

Pierre Mouallem, Daniel Crawl, Ilkay Altintas, Mladen Vouk, Ustun Yildiz
Provenance Context Entity (PaCE): Scalable Provenance Tracking for Scientific RDF Data

The Resource Description Framework (RDF) format is being used by a large number of scientific applications to store and disseminate their datasets. Provenance information, describing the source or lineage of the datasets, is playing an increasingly significant role in ensuring data quality, computing trust values of datasets, and ranking query results. Current provenance tracking approaches using the RDF reification vocabulary suffer from a number of known issues, including lack of formal semantics, use of blank nodes, and application-dependent interpretation of reified RDF triples. In this paper, we introduce a new approach called Provenance Context Entity (PaCE) that uses the notion of provenance context to create provenance-aware RDF triples. We also define the formal semantics of PaCE through a simple extension of the existing RDF(S) semantics that ensures compatibility of PaCE with existing Semantic Web tools and implementations. We have implemented the PaCE approach in the Biomedical Knowledge Repository (BKR) project at the US National Library of Medicine. The evaluations demonstrate a minimum of 49% reduction in the total number of provenance-specific RDF triples generated using the PaCE approach as compared to RDF reification. In addition, performance for complex queries improves by three orders of magnitude and remains comparable to the RDF reification approach for simpler provenance queries.

Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, Amit Sheth, Krishnaprasad Thirunarayan
Taverna, Reloaded

The Taverna workflow management system is an open source project with a history of widespread adoption within multiple experimental science communities, and a long-term ambition of effectively supporting the evolving need of those communities for complex, data-intensive, service-based experimental pipelines. This short paper describes how the recently overhauled technical architecture of Taverna addresses issues of efficiency, scalability, and extensibility, and presents performance results based on a collection of synthetic workflows, as well as a concrete case study involving a production workflow in the area of cancer research.

Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Wei Tan, Alexandra Nenadic, Ian Dunlop, Alan Williams, Tom Oinn, Carole Goble

Similarity

Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?

The performance of similarity measures for search, indexing, and data mining applications tends to degrade rapidly as the dimensionality of the data increases. The effects of the so-called ‘curse of dimensionality’ have been studied by researchers for data sets generated according to a single data distribution. In this paper, we study the effects of this phenomenon on different similarity measures for multiply-distributed data. In particular, we assess the performance of shared-neighbor similarity measures, which are secondary similarity measures based on the rankings of data objects induced by some primary distance measure. We find that rank-based similarity measures can result in more stable performance than their associated primary distance measures.
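
The generic construction behind a shared-neighbor (secondary) similarity measure can be sketched as follows: two objects count as similar if their k-nearest-neighbor lists under some primary distance overlap strongly. The specific measures evaluated in the paper are not reproduced; the Euclidean primary distance and the function names are assumptions for the sketch.

from math import dist   # Euclidean distance as an example primary measure

def knn_ids(points, i, k):
    """Indices of the k nearest neighbors of points[i] under the primary distance."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def shared_neighbor_similarity(points, i, j, k):
    """Fraction of shared neighbors among the k-NN lists of objects i and j."""
    return len(knn_ids(points, i, k) & knn_ids(points, j, k)) / k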

Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Optimizing All-Nearest-Neighbor Queries with Trigonometric Pruning

Many applications require determining the k-nearest neighbors for multiple query points simultaneously. This task is known as the all-(k)-nearest-neighbor (AkNN) query. In this paper, we suggest a new method for efficient AkNN query processing which is based on spherical approximations for indexing and query set representation. In this setting, we propose trigonometric pruning, which enables a significant decrease of the remaining search space for a query. Employing this new pruning method, we considerably speed up AkNN queries.

Tobias Emrich, Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Marisa Thoma
Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects.

We present PETER, a prefix-tree-based indexing algorithm supporting approximate search and approximate joins. Our tool supports Hamming and edit distance as similarity measures and is available as a C++ library, as a Unix command line tool, and as a cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster by orders of magnitude compared to the command-line tools. Compared to RDBMS, it computes similarity joins in minutes for which UDFs did not finish within a day, and outperforms the built-in join methods even in the exact case.
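
The two similarity measures mentioned above, with the kind of cheap early-exit and length filters that let candidates be rejected before a full comparison, can be sketched as follows; the compressed prefix tree of PETER itself is not reproduced, and the function names are illustrative only.

def hamming_within(a: str, b: str, k: int) -> bool:
    """True if the Hamming distance of equal-length strings a, b is at most k."""
    if len(a) != len(b):
        return False
    errors = 0
    for x, y in zip(a, b):
        if x != y:
            errors += 1
            if errors > k:          # early termination once the threshold is exceeded
                return False
    return True

def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (no banding optimizations)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def edit_within(a: str, b: str, k: int) -> bool:
    """Length filter first: strings differing in length by more than k cannot match."""
    return abs(len(a) - len(b)) <= k and edit_distance(a, b) <= k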

Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser
Similarity Estimation Using Bayes Ensembles

Similarity search and data mining often rely on distance or similarity functions in order to provide meaningful results and semantically meaningful patterns. However, standard distance measures like Lp-norms are often not capable of accurately mirroring the expected similarity between two objects. To bridge the so-called semantic gap between feature representation and object similarity, the distance function has to be adjusted to the current application context or user. In this paper, we propose a new probabilistic framework for estimating a similarity value based on a Bayesian setting. In our framework, distance comparisons are modeled based on distribution functions on the difference vectors. To combine these functions, a similarity score is computed by an ensemble of weak Bayesian learners for each dimension in the feature space. To find independent dimensions of maximum meaning, we apply a space transformation based on eigenvalue decomposition. In our experiments, we demonstrate that our new method shows promising results compared to related Mahalanobis learners on several test data sets w.r.t. nearest-neighbor classification and precision-recall graphs.

Tobias Emrich, Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Marisa Thoma
Subspace Similarity Search: Efficient k-NN Queries in Arbitrary Subspaces

There are abundant scenarios for applications of similarity search in databases where the similarity of objects is defined for a subset of attributes, i.e., in a subspace, only. While much research has been done in efficient support of single column similarity queries or of similarity queries in the full space, scarcely any support of similarity search in subspaces has been provided so far. The three existing approaches are variations of the sequential scan. Here, we propose the first index-based solution to subspace similarity search in arbitrary subspaces.

Thomas Bernecker, Tobias Emrich, Franz Graf, Hans-Peter Kriegel, Peer Kröger, Matthias Renz, Erich Schubert, Arthur Zimek

Data Stream Processing

Continuous Skyline Monitoring over Distributed Data Streams

To monitor skylines over dynamic data, one needs to continuously update the skyline query results in order to reflect the new data values. This paper tackles the problem of continuous skyline monitoring on a central query server over dynamic data from multiple data sites. Simply sending the updates of tuple values to the server is cost-prohibitive. Therefore, we propose an approach where the central server collaborates with the data sites to monitor the possible skyline changes. By doing so, the processing load is distributed over all the nodes instead of only on the central server. Furthermore, the approach can minimize the bandwidth consumption between the server and the data sites, which is often critical in a widely distributed environment. Extensive experiments demonstrate that our proposal is efficient and effective.

Hua Lu, Yongluan Zhou, Jonas Haustad
Propagation of Densities of Streaming Data within Query Graphs

Data Stream Systems (DSS) use cost models to determine whether a DSS can cope with a given workload and to optimize query graphs. However, certain relevant input parameters of these models are often unknown or highly imprecise. Selectivities especially are stream-dependent and application-specific parameters.

In this paper, we describe a method that supports selectivity estimation considering the attribute value distributions of input streams. The novelty of our approach is the propagation of the probability distributions through the query graph in order to give estimates for the inner nodes of the graph. For the most common stream operators, we establish formulas that describe their output distribution as a function of their input distributions. For unknown operators like User-Defined Operators (UDOs), we introduce a method to measure the influence of these operators on arbitrary probability distributions. This method is able to do most of the computational work before the query is deployed and introduces minimal overhead at runtime. Our evaluation framework facilitates the appropriate combination of both methods and allows modeling of almost arbitrary query graphs.

Michael Daum, Frank Lauterwald, Philipp Baumgärtel, Klaus Meyer-Wegener
Spatio-temporal Event Stream Processing in Multimedia Communication Systems

Emerging multimedia communication environments, such as Environment-to-Environment (E2E) systems, require detecting complex events in environments using multimodal sensory data. Based on these spatio-temporal events, systems select and send data from appropriate sensors. Most existing stream processing systems consider temporal streams of alphanumeric data and provide efficient approaches to deal with queries in these environments. In cases where events are detected in different sensory data types, including audio and video collected at different locations, new approaches need to be developed to represent, combine, and process events to answer queries. In this paper, we present our approach to managing event stream processing to address the needs of a real-time E2E system being developed in our laboratory. We introduce the modeling of our problem, and describe in detail the filtering and matching algorithms for querying spatio-temporal event streams. Experimental results demonstrate the efficacy and efficiency of our approach.

Mingyan Gao, Xiaoyan Yang, Ramesh Jain, Beng Chin Ooi
Stratified Reservoir Sampling over Heterogeneous Data Streams

Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of sub-streams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee a statistically sufficient number of tuples from each sub-stream to be included in the reservoir, and this can damage the statistical quality of the sample. In this paper, we deal with this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. We particularly consider situations in which the stratified reservoir sample is needed to obtain reliable estimates at the level of either the entire data stream or individual sub-streams. The first challenge in this stratification is to achieve an optimal allocation of a fixed-size reservoir to individual sub-streams. The second challenge is to adaptively adjust the allocation as sub-streams appear in, or disappear from, the input stream and as their statistical properties change over time. We present a stratified reservoir sampling algorithm designed to meet these challenges, and demonstrate through experiments the superior sample quality and the adaptivity of the algorithm.
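
Classic reservoir sampling, the building block that the stratified variant extends, can be sketched as below (Vitter's Algorithm R); the paper's per-sub-stream reservoirs and adaptive size re-allocation are not shown, and the function name is an assumption for the sketch.

import random

def reservoir_sample(stream, n):
    """Return a uniform random sample of n items from a stream of unknown length."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)
        else:
            j = random.randint(0, t)      # keep the new item with probability n/(t+1)
            if j < n:
                reservoir[j] = item
    return reservoir

if __name__ == "__main__":
    print(reservoir_sample(range(10_000), 5))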

Mohammed Al-Kateb, Byung Suk Lee
Tree Induction over Perennial Objects

We study tree induction over a stream of perennial objects. Perennial objects are dynamic in nature and cannot be forgotten. The objects come from a multi-table stream, e.g., streams of Customer and Transaction. As the Transactions arrive, the perennial Customers’ profiles grow and accumulate over time. To perform tree induction, we propose a tree induction algorithm that can handle perennial objects. The algorithm also encompasses a method that identifies and adapts to concept drift in the stream. We have also incorporated a conventional classifier (kNN) at the leaves to further improve the classification accuracy of our algorithm. We have evaluated our method on a synthetic dataset and the PKDD Challenge 1999 dataset.

Zaigham Faraz Siddiqui, Myra Spiliopoulou
Backmatter
Metadata
Title
Scientific and Statistical Database Management
Edited by
Michael Gertz
Bertram Ludäscher
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-13818-8
Print ISBN
978-3-642-13817-1
DOI
https://doi.org/10.1007/978-3-642-13818-8
