Invited Presentation

The Scientific Data Management Center: Providing Technologies for Large Scale Scientific Exploration

Managing scientific data has been identified by the scientific community as one of the most important emerging needs because of the sheer volume and increasing complexity of data being collected. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages from the initial data acquisition to the final analysis of the data. Fortunately, the data management problems encountered by most scientific domains are common enough to be addressed through shared technology solutions. Based on community input, the SDM center has identified three significant requirements. First, more efficient access to storage systems is needed. In particular, parallel file system and I/O system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualization engine. These processes are complicated by the fact that scientific data are structured differently for specific application domains, and are stored in specialized file formats. Second, scientists require technologies to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis and searches over extremely large data sets. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. Furthermore, interactive analysis requires indexing techniques for efficiently searching and selecting subsets of interest are needed. Finally, generating the data, collecting and storing the results, keeping track of data provenance, data post-processing, and analysis of results is a tedious, fragmented process. Workflow tools for automation of this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration.

Arie Shoshani

Improving the End-User Experience

Query Recommendations for Interactive Database Exploration

Relational database systems are becoming increasingly popular in the scientific community to support the interactive exploration of large volumes of data. In this scenario, users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyze the data and mine it for interesting information. First-time users, however, may not have the necessary knowledge to know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. To assist users in this context, we draw inspiration from Web recommender systems and propose the use of personalized query recommendations. The idea is to track the querying behavior of each user, identify which parts of the database may be of interest for the corresponding data analysis task, and recommend queries that retrieve relevant data. We discuss the main challenges in this novel application of recommendation systems, and outline a possible solution based on collaborative filtering. Preliminary experimental results on real user traces demonstrate that our framework can generate effective query recommendations.

Gloria Chatzopoulou, Magdalini Eirinaki, Neoklis Polyzotis

Scientific Mashups: Runtime-Configurable Data Product Ensembles

Mashups are gaining popularity as a rapid-development, re-use-oriented programming model to replace monolithic, bottom-up application development. This programming style is attractive for the “long tail” of scientific data management applications, characterized by exploding data volumes, increasing requirements for data sharing and collaboration, but limited software engineering budgets.

We observe that scientists already routinely construct a primitive, static form of mashup—an ensemble of related visualizations that convey a specific scientific message encoded as, e.g., a Powerpoint slide. Inspired by their ubiquity, we adopt these conventional data-product ensembles as a core model, endow them with interactivity, publish them online, and allow them to be repurposed at runtime by non-programmers.

We observe that these scientific mashups must accommodate a wider audience than commerce-oriented and entertainment-oriented mashups. Collaborators, students (K12 through graduate), the public, and policy makers are all potential consumers, but each group has a different level of domain sophistication. We explore techniques for adapting one mashup for different audiences by attaching additional context, assigning defaults, and re-skinning component products.

Existing mashup frameworks (and scientific workflow systems) emphasize an expressive “boxes-and-arrows” abstraction suitable for engineering individual products but overlook requirements for organizing products into synchronized ensembles or repurposing them for different audiences.

In this paper, we articulate these requirements for scientific mashups, describe an architecture for composing mashups as interactive, reconfigurable, web-based, visualization-oriented data product ensembles, and report on an initial implementation in use at an Ocean Observatory.

Bill Howe, Harrison Green-Fishback, David Maier

View Discovery in OLAP Databases through Statistical Combinatorial Optimization

The capability of OLAP database software systems to handle data complexity comes at a high price for analysts, presenting them a combinatorially vast space of views of a relational database. We respond to the need to deploy technologies sufficient to allow users to guide themselves to areas of local structure by casting the space of “views” of an OLAP database as a combinatorial object of all projections and subsets, and “view discovery” as an search process over that lattice. We equip the view lattice with statistical information theoretical measures sufficient to support a combinatorial optimization process. We outline “hop-chaining” as a particular view discovery algorithm over this object, wherein users are guided across a permutation of the dimensions by searching for successive two-dimensional views, pushing seen dimensions into an increasingly large background filter in a “spiraling” search process. We illustrate this work in the context of data cubes recording summary statistics for radiation portal monitors at US ports.

Cliff Joslyn, John Burke, Terence Critchlow, Nick Hengartner, Emilie Hogan

Designing a Geo-scientific Request Language - A Database Approach

A large part of sensor, image, and statistics data in the Earth Sciences naturally come as data points aligned to regular grids of some dimension, including 1-D time series, 2-D imagery, 3-D image time series and volumetric data, and 4-D spatio-temporal data. Frequently repositories have to host objects of multi-Terabyte size, in the future multi-Petabytes. Serving this information category, large multi-dimensional arrays, is a task of increasing importance among the Earth Sciences. Still, however, ad-hoc implementations with focused functionality prevail which lack the flexibility of SQL-enabled databases.

The Web Coverage Processing Service (WCPS) geo raster model and language allows navigation, extraction, and ad-hoc analysis on multi-dimensional geoscientific raster data which can be geo-referenced or not. The request language has been designed to offer some key properties considered advantageous by the database community: it is formalized, declarative, data independent, optimizable, and safe in evaluation. WCPS has been adopted as an international standard by the Open GeoSpatial Consortium (OGC) in December 2008. The reference implementation is almost finished. Actually, the embedding into the modular framework of the OGC geo service standards has posed particular con straints which the design had to respect. We discuss conceptual model, query language, and the context using real-life use case scenarios.

Peter Baumann

SEEDEEP: A System for Exploring and Querying Scientific Deep Web Data Sources

A recent and emerging trend in scientific data dissemination involves online databases that are hidden behind query forms, thus forming what is referred to as the

deep web

. In this paper, we propose SEEDEEP, a System for Exploring and quErying scientific DEEP web data sources. SEEDEEP is able to automatically mine deep web data source schemas, integrate heterogeneous data sources, answer cross-source keyword queries, and incorporates features like caching and fault-tolerance. Currently, SEEDEEP integrates 16 deep web data sources in the biological domain. We demonstrate how an integrated model for correlated deep web data sources is constructed, how a complex cross-source keyword query is answered efficiently and correctly, and how important performance issues are addressed.

Fan Wang, Gagan Agrawal

Expressing OLAP Preferences

Multidimensional databases play a relevant role in statistical and scientific applications, as well as in business intelligence systems. Their users express complex OLAP queries, often returning huge volumes of facts, sometimes providing little or no information. Thus, expressing preferences could be highly valuable in this domain. The OLAP domain is representative of an unexplored class of preference queries, characterized by three peculiarities: preferences can be expressed on both numerical and categorical domains; they can also be expressed on the aggregation level of facts; the space on which preferences are expressed includes both elemental and aggregated facts. In this paper we propose a preference algebra for OLAP, that takes into account the three peculiarities above.

Matteo Golfarelli, Stefano Rizzi

Indexing, Physical Design, and Energy

Energy Smart Management of Scientific Data

Scientific data centers comprised of high-powered computing equipment and large capacity disk storage systems consume considerable amount of energy. Dynamic power management techniques (DPM) are commonly used for saving energy in disk systems. These involve powering down disks that exhibit long idle periods and placing them in standby mode. A file request from a disk in standby mode will incur both energy and performance penalties as it takes energy (and time) to spin up the disk before it can serve a file. For this reason, DPM has to make decisions as to when to transition the disk into standby mode such that the energy saved is greater than the energy needed to spin it up again and the performance penalty is tolerable. The length of the idle period until the DPM decides to power down a disk is called idleness threshold.

In this paper, we study both analytically and experimentally dynamic power management techniques that save energy subject to performance constraints on file access costs. Based on observed workloads of scientific applications and disk characteristics, we provide a methodology for determining file assignment to disks and computing idleness thresholds that result in significant improvements to the energy saved by existing DPM solutions while meeting response time constraints. We validate our methods with simulations that use traces taken from scientific applications.

Ekow Otoo, Doron Rotem, Shih-Chiang Tsao

Data Parallel Bin-Based Indexing for Answering Queries on Multi-core Architectures

The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel.

We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture–for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS’s performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column’s base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g. GPUs).

Luke J. Gosink, Kesheng Wu, E. Wes Bethel, John D. Owens, Kenneth I. Joy

Finding Regions of Interest in Large Scientific Datasets

Consider a scientific range query, such as

find all places in Africa where yesterday the temperature was over 35 degrees and it rained

. In theory, one can answer such queries by returning all geographic points that satisfy the query condition. However, in practice, users do not find this low-level answer very useful; instead they require the points to be consolidated into regions, i.e., sets of points that all satisfy the query conditions and are adjacent in the underlying mesh. In this paper, we show that when a high-quality index is used to find the points and a good traditional connected component labeling algorithm is used to build the regions, the cost of consolidating the points into regions dominates range query response time. We then show how to find query result points and consolidate them into regions in expected time that is sublinear in the number of result points. This seemingly miraculous speedup comes from a point lookup phase that uses bitmap indexes and produces a compressed bitmap as the intermediate query result, followed by a region consolidation phase that operates directly on the intermediate query result bitmap and exploits the spatial properties of the underlying mesh to greatly reduce the cost of consolidating the result points into regions. Our experiments with real-world scientific data demonstrate that in practice, our approach to region consolidation is over 10 times faster than a traditional connected component algorithm.

Rishi Rakesh Sinha, Marianne Winslett, Kesheng Wu

Adaptive Physical Design for Curated Archives

We introduce AdaptPD, an automated physical design tool that improves database performance by continuously monitoring changes in the workload and adapting the physical design to suit the incoming workload. Current physical design tools are offline and require specification of a representative workload. AdaptPD is “always on” and incorporates online algorithms which profile the incoming workload to calculate the relative benefit of transitioning to an alternative design. Efficient query and transition cost estimation modules allow AdaptPD to quickly decide between various design configurations. We evaluate AdaptPD with the SkyServer Astronomy database using queries submitted by SkyServer’s users. Experiments show that AdaptPD adapts to changes in the workload, improves query performance substantially over offline tools, and introduces minor computational overhead.

Tanu Malik, Xiaodan Wang, Debabrata Dash, Amitabh Chaudhary, Anastasia Ailamaki, Randal Burns

MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions

High-dimensional indexing has been very popularly used for performing similarity search over various data types such as multimedia (audio/image/video) databases, document collections, time-series data, sensor data and scientific databases. Because of the

curse of dimensionality

, it is already known that well-known data structures like kd-tree, R-tree, and M-tree suffer in their performance over high-dimensional data space which is inferior to a brute-force approach

linear scan

. In this paper, we focus on an approximate nearest neighbor search for two different types of queries:

r-Range search

and

k-NN search

. Adapting a novel concept of a

ring

structure, we define a new index structure

MLR-Index

(

M

ulti-

L

ayer

R

ing-based

Index

) in a metric space and propose time and space efficient algorithms with high accuracy. Evaluations through comprehensive experiments comparing with the best-known high-dimensional indexing method

LSH

show that our approach is faster for a similar accuracy, and shows higher accuracy for a similar response time than

LSH

.

Rahul Malik, Sangkyum Kim, Xin Jin, Chandrasekar Ramachandran, Jiawei Han, Indranil Gupta, Klara Nahrstedt

Application Experience

B-Fabric: An Open Source Life Sciences Data Management System

The advances in information technology boosted life sciences research towards systems biology which aims at studying complex interactions in biological systems in an integrative way. Steadily improving high-throughput instruments (genome sequencers, mass spectrometers etc.) and analysis software produce large amounts of experimental data. In this paper, we report on experiences with developing and running a life sciences data management system in a productive environment.

Can Türker, Fuat Akal, Dieter Joho, Ralph Schlapbach

Design and Implementation of Metadata System in PetaShare

As the size of scientific and commercial datasets grows, it becomes imperative that an expressive metadata framework to be developed to facilitate access to the semantics of the data. However, the drive to enhance expressiveness and maximize performance often turns out to be contradictory. Hence, the design of metadata framework needs to take into consideration many factors in order to achieve balance between expressive power and performance. In this paper, we discuss the metadata framework developed for PetaShare and explain the design and implementation decisions we made in the process.

Xinqi Wang, Tevfik Kosar

Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA

With an increasingly large amount of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large scale alignment and less effective with the sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolutional rates among pairs of nucleotide positions using phylogenetic and evolutionary relationships of the organisms of aligned sequences. With a novel data schema to manage relevant information within a relational database, our method, implemented with a Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times bigger and 50% better sensitivity than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.

Weijia Xu, Stuart Ozer, Robin R. Gutell

Invited Presentation

What Makes Scientific Workflows Scientific?

A

scientific workflow

is the description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies [5]. While workflows have a long history in the database community as well as in business process modeling (where they are also known as

business workflows

), and despite some early works on scientific workflows [3,10], the area has only recently begun to fully flourish (e.g., see [1,2,9,7,4,11]). Similar to scientific data management which has different characteristics from traditional business data management [8], scientific workflows exhibit new challenges and opportunities that distinguish them from business workflows. We present an overview of these challenges and opportunities, covering a number of issues such as different models of computation, scalable data and process management, and data provenance and lineage handling in scientific workflows.

Bertram Ludäscher

Workflow

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

Technological success has ushered in massive amounts of data for scientific analysis. To enable effective utilization of these data sets for all classes of users, supporting intuitive data access and manipulation interfaces is crucial. This paper describes an autonomous scientific workflow system that enables high-level, natural language based, queries over low-level data sets. Our technique involves a combination of natural language processing, metadata indexing, and a semantically-aware workflow composition engine which dynamically constructs workflows for answering queries based on service and data availability. A specific contribution of this work is a metadata registration scheme that allows for a unified index of heterogeneous metadata formats and service annotations. Our approach thus avoids a standardized format for storing all data sets or the implementation of a federated, mediator-based, querying framework. We have evaluated our system using a case study from the geospatial domain to show functional results. Our evaluation supports the potential benefits which our approach can offer to scientific workflow systems and other domain-specific, data intensive applications.

David Chiu, Gagan Agrawal

Exploring Scientific Workflow Provenance Using Hybrid Queries over Nested Data and Lineage Graphs

Existing approaches for representing the provenance of scientific workflow runs largely ignore computation models that work over structured data, including XML. Unlike models based on transformation semantics, these computation models often employ update semantics, in which only a portion of an incoming XML stream is modified by each workflow step. Applying conventional provenance approaches to such models results in provenance information that is either too coarse (e.g., stating that one version of an XML document depends entirely on a prior version) or potentially incorrect (e.g., stating that each element of an XML document depends on every element in a prior version). We describe a generic provenance model that naturally represents workflow runs involving processes that work over nested data collections and that employ update semantics. Moreover, we extend current query approaches to support our model, enabling queries to be posed not only over data lineage relationships, but also over versions of nested data structures produced during a workflow run. We show how hybrid queries can be expressed against our model using high-level query constructs and implemented efficiently over relational provenance storage schemes.

Manish Kumar Anand, Shawn Bowers, Timothy McPhillips, Bertram Ludäscher

Data Integration with the DaltOn Framework – A Case Study

Data integration has gained a great attention in current scientific applications due to the increasingly high volume of heterogeneous data and proliferation of diverse data generating devices such as sensors. Recently evolved workflow systems contributed a lot towards scientific data integration by exploiting ontologies. Even though they offer good means for modeling computational workflows, they were proved not to be sufficiently strong in addressing data related issues in a transparent and structured manner. The DaltOn system improves the productivity of scientists by providing a framework which copes with these issues in a transparent and well structured manner. In this paper we will elaborate its application in a real world scenario taken from meteorological research where data are retrieved from a sensor network and are integrated into a central scientific database.

Stefan Jablonski, Bernhard Volz, M. Abdul Rehman, Oliver Archner, Olivier Curé

Experiment Line: Software Reuse in Scientific Workflows

Over the last years, scientists have been using scientific workflows to build computer simulations to support the development of new theories. Due to the increasing use of scientific workflows in production environments, the composition of workflows and their executions can no longer be performed in an ad-hoc manner. Although current scientific workflow management systems support the execution of workflows, they present limitations regarding the composition of workflows when it comes to using different levels of abstractions. This paper introduces the concept of experiment line which is a systematic approach for the composition of scientific workflows that represents an in-silico experiment. An experiment line is inspired on the software engineering reuse discipline and allows the composition of scientific workflows at different levels of abstractions, which characterizes both the in-silico experiment and different workflow variations that are related to the experiment.

Eduardo Ogasawara, Carlos Paulino, Leonardo Murta, Cláudia Werner, Marta Mattoso

Tracking Files in the Kepler Provenance Framework

Workflow Management Systems (WFMS), such as Kepler, are proving to be an important tool in scientific problem solving. They can automate and manage complex processes and huge amounts of data produced by petascale simulations. Typically, the produced data need to be properly visualized and analyzed by scientists in order to achieve the desired scientific goals. Both run-time and post analysis may benefit from, even require, additional meta-data – provenance information. One of the challenges in this context is the tracking of the data files that can be produced in very large numbers during stages of the workflow, such as visualizations. The Kepler provenance framework collects all or part of the raw information flowing through the workflow graph. This information then needs to be further parsed to extract meta-data of interest. This can be done through add-on tools and algorithms. We show how to automate tracking specific information such as data files locations.

Pierre Mouallem, Roselyne Barreto, Scott Klasky, Norbert Podhorszki, Mladen Vouk

BioBrowsing: Making the Most of the Data Available in Entrez

One of the most popular ways to access public biological data is using portals, like Entrez (NCBI) which allows users to navigate through the data of 34 major biological sources following cross-references. In this process, data entries are inspected one after the other and cross-references to additional data available in other sources may be followed. This navigational process may be time-consuming and may not be easily reproduced from one entry to another. Most importantly, only a few sources are initially queried, biologists do not exploit all the richness of the data provided by Entrez, and in particular they may not explore alternative source paths that provide complementary information.

In this paper, we introduce BioBrowsing, a tool providing scientists with access to the data obtained when all the combinations between NCBI sources have been followed. Querying is done on-the-fly (no warehousing). As new sources and links between sources appear in Entrez, BioBrowsing has a module able to update automatically the schema used by its query engine. Finally, BioBrowsing makes it possible for users to define profiles as a way of focusing the results on users specific interests.

Availability:

http://bioguide-project.net/biobrowsing

Sarah Cohen-Boulakia, Kevin Masini

Using Workflow Medleys to Streamline Exploratory Tasks

To analyze and understand the growing wealth of scientific data, complex workflows need to be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. However, constructing these workflows is a non-trivial task, especially for users who do not have programming expertise. This problem is compounded for exploratory tasks, where the workflows need to be iteratively refined. In this paper, we introduce

workflow medleys

, a new approach for manipulating collections of workflows. We propose a workflow manipulation language that includes operations that are common in exploratory tasks and present a visual interface designed for this language. We briefly discuss how medleys have been applied in two (real) applications.

Emanuele Santos, David Koop, Huy T. Vo, Erik W. Anderson, Juliana Freire, Cláudio Silva

Query Processing

Experiences on Processing Spatial Data with MapReduce

The amount of information in spatial databases is growing as more data is made available. Spatial databases mainly store two types of data: raster data (satellite/aerial digital images), and vector data (points, lines, polygons). The complexity and nature of spatial databases makes them ideal for applying parallel processing. MapReduce is an emerging massively parallel computing model, proposed by Google. In this work, we present our experiences in applying the MapReduce model to solve two important spatial problems: (a) bulk-construction of R-Trees and (b) aerial image quality computation, which involve vector and raster data, respectively. We present our results on the scalability of MapReduce, and the effect of parallelism on the quality of the results. Our algorithms were executed on a Google&IBM cluster, which became available to us through an NSF-supported program. The cluster supports the Hadoop framework – an open source implementation of MapReduce. Our results confirm the excellent scalability of the MapReduce framework in processing parallelizable problems.

Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe

Optimization and Execution of Complex Scientific Queries over Uncorrelated Experimental Data

Scientific experiments produce large volumes of data represented as complex objects that describe independent events such as particle collisions. Scientific analyses can be expressed as queries selecting objects that satisfy complex local conditions over properties of each object. The conditions include joins, aggregate functions, and numerical computations. Traditional query processing where data is loaded into a database does not perform well, since it takes time and space to load and index data. Therefore, we developed SQISLE to efficiently process in one pass large queries selecting complex objects from sources. Our contributions include runtime query optimization strategies, which during query execution collect runtime query statistics, reoptimize the query using collected statistics, and dynamically switch optimization strategies. Furthermore, performance is improved by query rewrites, temporary view materializations, and compile time evaluation of query fragments. We demonstrate that queries in SQISLE perform close to hard-coded C++ implementations of the same analyses.

Ruslan Fomkin, Tore Risch

Comprehensive Optimization of Declarative Sensor Network Queries

We present a sensor network query processing architecture that covers all the query optimization phases that are required to map a declarative query to executable code. The architecture is founded on the view that a sensor network truly is a distributed computing infrastructure, albeit a very constrained one. As such, we address the problem of how to develop a comprehensive optimizer for an expressive declarative continuous query language over acquisitional streams as one of finding extensions to classical distributed query processing techniques that contend with the peculiarities of sensor networks as an environment for distributed computing.

Ixent Galpin, Christian Y. A. Brenninkmeijer, Farhana Jabeen, Alvaro A. A. Fernandes, Norman W. Paton

Efficient Evaluation of Generalized Tree-Pattern Queries with Same-Path Constraints

Querying XML data is based on the specification of structural patterns which in practice are formulated using XPath. Usually, these structural patterns are in the form of trees (Tree-Pattern Queries – TPQs). Requirements for flexible querying of XML data including XML data from scientific applications have motivated recently the introduction of query languages that are more general and flexible than TPQs. These query languages correspond to a fragment of XPath larger than TPQs for which efficient non-main-memory evaluation algorithms are not known.

In this paper, we consider a query language, called Partial Tree-Pattern Query (PTPQ) language, which generalizes and strictly contains TPQs. PTPQs represent a broad fragment of XPath which is very useful in practice. We show how PTPQs can be represented as directed acyclic graphs augmented with “same-path” constraints. We develop an original polynomial time holistic algorithm for PTPQs under the inverted list evaluation model. To the best of our knowledge, this is the first algorithm to support the evaluation of such a broad structural fragment of XPath. We provide a theoretical analysis of our algorithm and identify cases where it is asymptotically optimal. In order to assess its performance, we design two other techniques that evaluate PTPQs by exploiting the state-of-the-art existing algorithms for smaller classes of queries. An extensive experimental evaluation shows that our holistic algorithm outperforms the other ones.

Xiaoying Wu, Dimitri Theodoratos, Stefanos Souldatos, Theodore Dalamagas, Timos Sellis

Mode Aware Stream Query Processing

Many scientific applications including environmental monitoring, outpatient health care research, and wild life tracking require real-time stream processing. While state-of-the-art techniques for processing window-constrained stream queries tend to employ the delta result strategy (to react to each and every change of the stream sensor measurements), some scientific applications only require to produce results periodically - making the complete result strategy a better choice. In this work, we analyze the trade-offs between the delta and the complete result query evaluation strategies. We then design a solution for hopping window query processing based on the above analysis. In particular, we propose query operators equipped with the ability to accept either delta or complete results as input and to produce either as output. Unlike prior works, these flexible operators can then be integrated within one mode aware query plan - taking advantage of both processing methodologies. Third, we design a mode assignment algorithm to optimally assign the input and output modes for each operator in the mode aware query plan. Lastly, mode assignment is integrated with a cost-based plan optimizer. The proposed techniques have been implemented within the WPI stream query engine, called CAPE. Our experimental results demonstrate that our solution routinely outperforms the state-of-the-art single-mode solutions for various arrival rate and query plan shapes.

Mingrui Wei, Elke Rundensteiner

Evaluating Reachability Queries over Path Collections

Several applications in areas such as biochemistry, GIS, involve storing and querying large volumes of sequential data stored as

path collections

. There is a number of interesting queries that can be posed on such data. This work focuses on reachability queries: given a path collection and two nodes

v

s

,

v

t

, determine whether a path from

v

s

to

v

t

exists and identify it. To answer these queries, the

path-first search

paradigm, which treats paths as first-class citizens, is proposed. To improve the performance of our techniques, two indexing structures that capture the reachability information of paths are introduced. Further, methods for updating a path collection and its indices are discussed. Finally, an extensive experimental evaluation verifies the advantages of our approach.

Panagiotis Bouros, Spiros Skiadopoulos, Theodore Dalamagas, Dimitris Sacharidis, Timos Sellis

Similarity Search

Easing the Dimensionality Curse by Stretching Metric Spaces

Queries over sets of complex elements are performed extracting features from each element, which are used in place of the real ones during the processing. Extracting a large number of significant features increases the representative power of the feature vector and improves the query precision. However, each feature is a dimension in the representation space, consequently handling more features worsen the dimensionality curse. The problem derives from the fact that the elements tends to distribute all over the space and a large dimensionality allows them to spread over much broader spaces. Therefore, in high-dimensional spaces, elements are frequently farther from each other, so the distance differences among pairs of elements tends to homogenize. When searching for nearest neighbors, the first one is usually not close, but as long as one is found, small increases in the query radius tend to include several others. This effect increases the overlap between nodes in access methods indexing the dataset. Both spatial and metric access methods are sensitive to the problem. This paper presents a general strategy applicable to metric access methods in general, improving the performance of similarity queries in high dimensional spaces. Our technique applies a function that “stretches” the distances. Thus, close objects become closer and far ones become even farther. Experiments using the metric access method Slim-tree show that similarity queries performed in the transformed spaces demands up to 70% less distance calculations, 52% less disk access and reduces up to 57% in total time when comparing with the original spaces.

Ives R. V. Pola, Agma J. M. Traina, Caetano Traina Jr.

Probabilistic Similarity Search for Uncertain Time Series

A probabilistic similarity query over uncertain data assigns to each uncertain database object

o

a probability indicating the likelihood that

o

meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types by upper and lower bounding the Euclidean distance.

Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Matthias Renz

Reverse k-Nearest Neighbor Search Based on Aggregate Point Access Methods

We propose an original solution for the general reverse

k

-nearest neighbor (R

k

NN) search problem in Euclidean spaces. Compared to the limitations of existing methods for the RkNN search, our approach works on top of Multi-Resolution Aggregate (MRA) versions of any index structures for multidimensional feature spaces where each non-leaf node is additionally associated with aggregate information like the sum of all leaf-entries indexed by that node. Our solution outperforms the state-of-the-art RkNN algorithms in terms of query execution times because it exploits advanced strategies for pruning index entries.

Hans-Peter Kriegel, Peer Kröger, Matthias Renz, Andreas Züfle, Alexander Katzdobler

Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation

For more than one decade, time series similarity search has been given a great deal of attention by data mining researchers. As a result, many time series representations and distance measures have been proposed. However, most existing work on time series similarity search focuses on finding shape-based similarity. While some of the existing approaches work well for short time series data, they typically fail to produce satisfactory results when the sequence is long. For long sequences, it is more appropriate to consider the similarity based on the higher-level structures. In this work, we present a histogram-based representation for time series data, similar to the “bag of words” approach that is widely accepted by the text mining and information retrieval communities. We show that our approach outperforms the existing methods in clustering, classification, and anomaly detection on several real datasets.

Jessica Lin, Yuan Li

Keynote Address

Cloud Computing for Science

Infrastructure-as-a-Service (IaaS) style cloud computing is emerging as a viable alternative to the acquisition and management of physical resources. This raises several questions. How can we take advantage of the opportunities it offers? Are the current commercial offerings suitable for science? What cloud capabilities are required by scientific applications? What additional infrastructure is needed to take advantage of cloud resources?

In this talk, I will describe several application projects using cloud computing in commercial and academic space and discuss the challenges and benefits of this approach in terms of performance, cost, and ease-of-use. I will also discuss our experiences with configuring and running the Science Clouds - a group of clouds in academic domain available to scientific projects configured using the Nimbus Toolkit. Finally, I will discuss the emerging trends and innovation opportunities in cloud computing.

Kate Keahey

Mining

Classification with Unknown Classes

Given a data set D, such that (

X

i

,

y

i

) ∈

D

,

y

i

∈ ℝ, we are interested in first dividing the range of

y

i

, i.e. (

y

max

−

y

min

), (where

y

max

is the maximum of all

y

i

corresponding to (

X

i

,

y

i

) ∈

D

and

y

min

is the minimum of all

y

i

corresponding to (

X

i

,

y

i

) ∈

D

), into contiguous ranges which can be thought of as classes and then for a new point,

X

j

, predicting which range (class) it falls into. The problem is difficult, because neither the size of each range nor the number of ranges, is known a-priori.

This was a practical problem that arose when we wanted to predict the execution time of a query in a database. For our purposes, an accurate prediction was not required, while a time range was sufficient and the time ranges were unknown a-priori.

To solve this problem we introduce a binary tree structure called

Class Discovery Tree

. We have used this technique successfully for predicting the execution times of a query and this is slated for incorporation into a commercial, enterprise level Database Management System.

In this paper, we discuss our solution and validate it on two more real life data sets. In the first one, we compare our result with a naive approach and in the second, with the published results. In both cases, our approach is superior.

Chetan Gupta, Song Wang, Umeshwar Dayal, Abhay Mehta

HSM: Heterogeneous Subspace Mining in High Dimensional Data

Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the “curse of dimensionality” pose additional challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes.

In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogeneity of the attributes. We extend an indexing structure for continuous attributes such that HSM indexing adapts to different attribute types. In our experiments we show that HSM efficiently mines patterns in arbitrary subspaces of heterogeneous high dimensional data.

Emmanuel Müller, Ira Assent, Thomas Seidl

Split-Order Distance for Clustering and Classification Hierarchies

Clustering and classification hierarchies are organizational structures of a set of objects. Multiple hierarchies may be derived over the same set of objects, which makes distance computation between hierarchies an important task. In this paper, we model the classification and clustering hierarchies as rooted, leaf-labeled, unordered trees. We propose a novel distance metric Split-Order distance to evaluate the organizational structure difference between two hierarchies over the same set of leaf objects. Split-Order distance reflects the order in which subsets of the tree leaves are differentiated from each other and can be used to explain the relationships between the leaf objects. We also propose an efficient algorithm for computing Split-Order distance between two trees in

O

(

n

2

d

4

) time, where

n

is the number of leaves, and

d

is the maximum number of children of any node. Our experiments on both real and synthetic data demonstrate the efficiency and effectiveness of our algorithm.

Qi Zhang, Eric Yi Liu, Abhishek Sarkar, Wei Wang

Combining Multiple Interrelated Streams for Incremental Clustering

Many data mining applications analyze structured data that span across many tables

and

accumulate in time. Incremental mining methods have been devised to adapt patterns to new tuples. However, they have been designed for data in one table only. We propose a method for incremental clustering on multiple interrelated streams - a “

multi-table stream

”: its components are streams that reference each other, arrive at different speeds and have attributes of a priori unknown value ranges. Our approach encompasses solutions for the maintenance of cach-es and sliding windows over the individual streams, the propagation of foreign keys across streams, the transformation of all streams into a single-table stream, and an incremental clustering algorithm that operates over that stream. We evaluate our method on two real datasets and show that it approximates well the performance of an ideal method that possesses unlimited resources and knows the future.

Zaigham Faraz Siddiqui, Myra Spiliopoulou

Improving Relation Extraction by Exploiting Properties of the Target Relation

In this paper we demonstrate and quantify the advantage gained by allowing relation extraction algorithms to make use of information about the cardinality of the target relation. The two algorithms presented herein differ only in their assumption about the nature of the target relation (one-to-many or many-to-many). The algorithms are tested on the same relation to show the degree of advantage gained by their differing assumptions. Comparison of the performance of the two algorithms on a one-to-many domain demonstrates the existence of several, previously undocumented behaviors which can be used to improve the performance of relation extraction algorithms. The first is a distinct, inverted u-shape in the initial portion of the recall curve of the many-to-many algorithm. The second is that, as the number of seeds increases, the rate of improvement of the two algorithms descreases to approach the rate at which new information is added via the seeds.

Eric Normand, Kevin Grant, Elias Ioup, John Sample

Cor-Split: Defending Privacy in Data Re-publication from Historical Correlations and Compromised Tuples

Several approaches have been proposed for privacy preserving data publication. In this paper we consider the important case in which a certain view over a dynamic dataset has to be released a number of times during its history. The insufficiency of techniques used for one-shot publication in the case of subsequent releases has been previously recognized, and some new approaches have been proposed. Our research shows that relevant privacy threats, not recognized by previous proposals, can occur in practice. In particular, we show the cascading effects that a single (or a few) compromised tuples can have in data re-publication when coupled with the ability of an adversary to recognize historical correlations among released tuples. A theoretical study of the threats leads us to a defense algorithm, implemented as a significant extension of the

m-invariance

technique. Extensive experiments using publicly available datasets show that the proposed technique preserves the utility of published data and effectively protects from the identified privacy threats.

Daniele Riboni, Claudio Bettini

A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

Data summarization is an important data mining task which aims to find a compact description of a dataset. Emerging applications place special requirements to the data summarization techniques including the ability to find concise and informative summary from high dimensional data, the ability to deal with different types of attributes such as binary, categorical and numeric attributes, end-user comprehensibility of the summary, insensibility to noise and missing values and scalability with the data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed to summarize high-dimensional data. We formulate this problem in a bipartite graph scheme, mapping objects (data records) and values of attributes into two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, the capability of representativeness is measured using the MDL principle, which helps to yield a highly intuitive summary with the most informative objects of the input data. While the problem of finding the optimal summary with minimal representation cost is computationally infeasible, an approximate optimal summary is achieved by a heuristic algorithm whose computation cost is quadratic to the size of data and linear to the dimensionality of data. In addition, several techniques are developed to improve both quality of the resultant summary and efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.

Guanhua Chen, Xiuli Ma, Dongqing Yang, Shiwei Tang, Meng Shuai

Spatial Data

Region Extraction and Verification for Spatial and Spatio-temporal Databases

Newer spatial technologies, such as spatio-temporal databases, geo-sensor networks, and other remote sensing methods, require mechanisms to efficiently process spatial data and identify (and in some cases fix) data items that do not conform to rigorously defined spatial data type definitions. In this paper, we propose an

$O(n \lg n )$

time complexity algorithm that examines a spatial configuration, eliminates any portions of the configuration that violate the definition of spatial regions, and constructs a valid region out of the remaining configuration.

Mark McKenney

Identifying the Most Endangered Objects from Spatial Datasets

Real-life spatial objects are usually described by their geographic locations (e.g., longitude and latitude), and multiple quality attributes. Conventionally, spatial data are queried by two orthogonal aspects: spatial queries involve geographic locations only; skyline queries are used to retrieve those objects that are not dominated by others on all quality attributes. Specifically, an object

p

i

is said to dominate another object

p

j

if

p

i

is no worse than

p

j

on all quality attributes and better than

p

j

on at least one quality attribute. In this paper, we study a novel query that combines both aspects meaningfully. Given two spatial datasets

P

and

S

, and a neighborhood distance

δ

, the

most endangered object query

(MEO) returns the object

s

∈

S

such that within the distance

δ

from

s

, the number of objects in

P

that dominate

s

is maximized. MEO queries appropriately capture the needs that neither spatial queries nor skyline queries alone have addressed. They have various practical applications such as business planning, online war games, and wild animal protection. Nevertheless, the processing of MEO queries is challenging and it cannot be efficiently evaluated by existing solutions. Motivated by this, we propose several algorithms for processing MEO queries, which can be applied in different scenarios where different indexes are available on spatial datasets. Extensive experimental results on both synthetic and real datasets show that our proposed advanced spatial join solution achieves the best performance and it is scalable to large datasets.

Hua Lu, Man Lung Yiu

Constraint-Based Learning of Distance Functions for Object Trajectories

With the drastic increase of object trajectory data, the analysis and exploration of trajectories has become a major research focus with many applications. In particular, several approaches have been proposed in the context of similarity-based trajectory retrieval. While these approaches try to be comprehensive by considering the different properties of object trajectories at different degrees, the distance functions are always pre-defined and therefore do not support different views on what users consider (dis)similar trajectories in a particular domain.

In this paper, we introduce a novel approach to learning distance functions in support of similarity-based retrieval of multi-dimensional object trajectories. Our approach is more generic than existing approaches in that distance functions are determined based on constraints, which specify what object trajectory pairs the user considers similar or dissimilar. Thus, using a single approach, different distance functions can be determined for different users views. We present two learning techniques,

transformed Euclidean distance

and

transformed Dynamic Time Warping

. Both techniques determine a linear transformation of the attributes of multi-dimensional trajectories, based on the constraints specified by the user. We demonstrate the flexibility and efficiency of our approach with applications to clustering and classification on real and synthetic object trajectory datasets from different application domains.

Wei Yu, Michael Gertz

Springer Professional

About this book

Table of Contents

Frontmatter

Invited Presentation

The Scientific Data Management Center: Providing Technologies for Large Scale Scientific Exploration

Improving the End-User Experience

Query Recommendations for Interactive Database Exploration

Scientific Mashups: Runtime-Configurable Data Product Ensembles

View Discovery in OLAP Databases through Statistical Combinatorial Optimization

Designing a Geo-scientific Request Language - A Database Approach

SEEDEEP: A System for Exploring and Querying Scientific Deep Web Data Sources

Expressing OLAP Preferences

Indexing, Physical Design, and Energy

Energy Smart Management of Scientific Data

Data Parallel Bin-Based Indexing for Answering Queries on Multi-core Architectures

Finding Regions of Interest in Large Scientific Datasets

Adaptive Physical Design for Curated Archives

MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions

Application Experience

B-Fabric: An Open Source Life Sciences Data Management System

Design and Implementation of Metadata System in PetaShare

Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA

Invited Presentation

What Makes Scientific Workflows Scientific?

Workflow

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

Exploring Scientific Workflow Provenance Using Hybrid Queries over Nested Data and Lineage Graphs

Data Integration with the DaltOn Framework – A Case Study

Experiment Line: Software Reuse in Scientific Workflows

Tracking Files in the Kepler Provenance Framework

BioBrowsing: Making the Most of the Data Available in Entrez

Using Workflow Medleys to Streamline Exploratory Tasks

Query Processing

Experiences on Processing Spatial Data with MapReduce

Optimization and Execution of Complex Scientific Queries over Uncorrelated Experimental Data

Comprehensive Optimization of Declarative Sensor Network Queries

Efficient Evaluation of Generalized Tree-Pattern Queries with Same-Path Constraints

Mode Aware Stream Query Processing

Evaluating Reachability Queries over Path Collections

Similarity Search

Easing the Dimensionality Curse by Stretching Metric Spaces

Probabilistic Similarity Search for Uncertain Time Series

Reverse k-Nearest Neighbor Search Based on Aggregate Point Access Methods

Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation

Keynote Address

Cloud Computing for Science

Mining

Classification with Unknown Classes

HSM: Heterogeneous Subspace Mining in High Dimensional Data

Split-Order Distance for Clustering and Classification Hierarchies

Combining Multiple Interrelated Streams for Incremental Clustering

Improving Relation Extraction by Exploiting Properties of the Target Relation

Cor-Split: Defending Privacy in Data Re-publication from Historical Correlations and Compromised Tuples

A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

Spatial Data

Region Extraction and Verification for Spatial and Spatio-temporal Databases

Identifying the Most Endangered Objects from Spatial Datasets

Constraint-Based Learning of Distance Functions for Object Trajectories

Backmatter

Premium Partner