Skip to main content

Über dieses Buch

This two volume set LNCS 6587 and LNCS 6588 constitutes the refereed proceedings of the 16th International Conference on Database Systems for Advanced Applications, DASFAA 2011, held in Saarbrücken, Germany, in April 2010. The 53 revised full papers and 12 revised short papers presented together with 2 invited keynote papers, 22 demonstration papers, 4 industrial papers, 8 demo papers, and the abstract of 1 panel discussion, were carefully reviewed and selected from a total of 225 submissions. The topics covered are social network, social network and privacy, data mining, probability and uncertainty, stream processing, graph, XML, XML and graph, similarity, searching and digital preservation, spatial queries, query processing, as well as indexing and high performance.



Keynote Talks

Smarter Planet: Empower People with Information Insights

We are all now connected economically, technically and socially. Our planet is becoming smarter. Infusing intelligence into the way the world literally works the systems and processes that enable physical goods to be developed, manufactured, bought and sold services to be delivered everything from people and money to oil, water and electrons to move and billions of people to work and live. All these become possible via information integration scattering in many different data sources: from the sensors, on the web, in our personal devices, in documents and in databases, or hidden within application programs. Information is exploding with large amount of data generated every second. It creates many challenges in securely storing, managing, integrating, cleansing, analyzing and governing the massive generated information besides the privacy issue. This can be a difficult or time consuming endeavor. This talk describes some information-intensive tasks, choosing examples from such areas as healthcare, science, the business world and our personal lives. I will discuss the barriers to getting information together, delivering it to the people that need it, in a form they can understand, analyzing the diverse spectrum of information, giving insights to the decision makers. I will review key research on information integration and information interaction, indicate how the combination may enable real progress, and illustrate where research challenges remain.

Josephine Cheng

Database Scalability, Elasticity, and Autonomy in the Cloud

(Extended Abstract)

Cloud computing has emerged as an extremely successful paradigm for deploying web applications.






pricing, and

economies of scale

from large scale operations are the major reasons for the successful and widespread adoption of cloud infrastructures. Since a majority of cloud applications are data driven, database management systems (DBMSs) powering these applications form a critical component in the cloud software stack. In this article, we present an overview of our work on instilling these above mentioned “cloud features” in a database system designed to support a variety of applications deployed in the cloud: designing scalable database management architectures using the concepts of

data fission


data fusion

, enabling lightweight elasticity using low cost live database migration, and designing intelligent and autonomic controllers for system management without human intervention.

Divyakant Agrawal, Amr El Abbadi, Sudipto Das, Aaron J. Elmore

Ten Year Award

What Have We Learnt from Deductive Object-Oriented Database Research?

Deductive databases and object-oriented databases (DOOD) are two important extensions of the traditional relational database technology.

Deductive databases provide a rule-based language called Datalog


(Datalog with negation) that uses function-free Horn clauses with negation to express deductive rules [1], and is a simplified version of the logic programming language Prolog [2]. A deductive database consists of an extensional database and an intensional database. The extensional database (EDB) consists of the relations stored in a relational database whereas the intensional database (IDB) consists of a Datalog


program that is a set of deductive rules used to derive relations that are the logical consequences of the program and the extensional database. Datalog


is more expressive than pure relational query languages such as relational algebra and relational calculus as it supports recursive deductive rules and recursive queries. Moreover, deductive databases have a firm logical foundation that consists of both model-theoretic semantics in terms of the minimal model [3], the stable model [4], and the well-founded model [5], and proof-theoretic semantics in terms of bottom-up fixpoint semantics [2].

Mengchi Liu, Gillian Dobbie, Tok Wang Ling

Social Network

ECODE: Event-Based Community Detection from Social Networks

People regularly attend various social events to interact with other community members. For example, researchers attend conferences to present their work and to network with other researchers. In this paper, we propose an







tection algorithm ECODE to mine the underlying community substructures of social networks from event information. Unlike conventional approaches, ECODE makes use of content similarity-based



which are found to be more useful for community detection than the physical links. By performing partial computation between an event and its candidate relevant set instead of computing pair-wise similarities between all the events, ECODE is able to achieve significant computational speedup. Extensive experimental results and comparisons with other existing methods showed that our ECODE algorithm is both efficient and effective in detecting communities from social networks.

Xiao-Li Li, Aloysius Tan, Philip S. Yu, See-Kiong Ng

A User Similarity Calculation Based on the Location for Social Network Services

The online social network services have been growing rapidly over the past few years, and the social network services can easily obtain the locations of users with the recent increasing popularity of the GPS enabled mobile device. In the social network, calculating the similarity between users is an important issue. The user similarity has significant impacts to users, communities and service providers by helping them acquire suitable information effectively.

There are numerous factors such as the location, the interest and the gender to calculate the user similarity. The location becomes a very important factor among them, since nowadays the social network services are highly coupled with the mobile device which the user holds all the time. There have been several researches on calculating the user similarity. However, most of them did not consider the location. Even if some methods consider the location, they only consider the physical location of the user which cannot be used for capturing the user’s intention.

We propose an effective method to calculate the user similarity using the semantics of the location. By using the semantics of the location, we can capture the user’s intention and interest. Moreover, we can calculate the similarity between different locations using the hierarchical location category. To the best of our knowledge, this is the first research that uses the semantics of the location in order to calculate the user similarity. We evaluate the proposed method with a real-world use case: finding the most similar user of a user. We collected more than 251,000 visited locations over 591 users from foursquare. The experimental results show that the proposed method outperforms a popular existing method calculating the user similarity.

Min-Joong Lee, Chin-Wan Chung

Modeling User Expertise in Folksonomies by Fusing Multi-type Features

The folksonomy refers to the online collaborative tagging system which offers a new open platform for content annotation with uncontrolled vocabulary. As folksonomies are gaining in popularity, the expert search and spammer detection in folksonomies attract more and more attention. However, most of previous work are limited on some folksonomy features. In this paper, we introduce a generic and flexible user expertise model for expert search and spammer detection. We first investigate a comprehensive set of expertise evidences related to users, objects and tags in folksonomies. Then we discuss the rich interactions between them and propose a unified Continuous CRF model to integrate these features and interactions. This model’s applications for expert recommendation and spammer detection are also exploited. Extensive experiments are conducted on a real tagging dataset and demonstrate the model’s advantages over previous methods, both in performance and coverage.

Junjie Yao, Bin Cui, Qiaosha Han, Ce Zhang, Yanhong Zhou

Identifying Topic Experts and Topic Communities in the Blogspace

Blogs have become an important media of self-expression recently. Millions of people write blog posts, share their interests, give suggestions and form groups in blogspace. An important way to understand the development of blogspace is to identify topic experts as well as blog communities and to further find how they interact with each other. Topic experts are influential bloggers who usually publish “authoritative” opinions on a specific topic and influence their followers. Here we first discuss the challenge of efficient identifying topic experts and then propose a novel model to quantify topic experts. Based on the topic experts identified, we further propose a new approach to identify the related blog communities on that topic. Experiments are conducted and the results demonstrate that our approaches are very effective and efficient.

Xiaoling Liu, Yitong Wang, Yujia Li, Baile Shi

Social Network and Privacy

Utility-Oriented K-Anonymization on Social Networks

“Identity disclosure” problem on publishing social network data has gained intensive focus from academia. Existing


-anonymization algorithms on social network may result in nontrivial utility loss. The reason is that the number of the edges modified when anonymizing the social network is the only metric to evaluate utility loss, not considering the fact that different edge modifications have different impact on the network structure. To tackle this issue, we propose a novel

utility-oriented social network anonymization

scheme to achieve privacy protection with relatively low utility loss. First, a proper utility evaluation model is proposed. It focuses on the changes on social network topological feature, but not purely the number of edge modifications. Second, an efficient algorithm is designed to anonymize a given social network with relatively low utility loss. Experimental evaluation shows that our approach effectively generates anonymized social network with high utility.

Yazhe Wang, Long Xie, Baihua Zheng, Ken C. K. Lee

Distributed Privacy Preserving Data Collection

We study the distributed privacy preserving data collection problem: an untrusted data collector (e.g., a medical research institute) wishes to collect data (e.g., medical records) from a group of respondents (e.g., patients). Each respondent owns a multi-attributed record which contains both non-sensitive (e.g., quasi-identifiers) and sensitive information (e.g., a particular disease), and submits it to the data collector. Assuming


is the table formed by all the respondent data records, we say that the data collection process is privacy preserving if it allows the data collector to obtain a


-anonymized or


-diversified version of


without revealing the original records to the adversary.

We propose a distributed data collection protocol that outputs an anonymized table by generalization of quasi-identifier attributes. The protocol employs cryptographic techniques such as homomorphic encryption, private information retrieval and secure multiparty computation to ensure the privacy goal in the process of data collection. Meanwhile, the protocol is designed to leak limited but non-critical information to achieve practicability and efficiency. Experiments show that the utility of the anonymized table derived by our protocol is in par with the utility achieved by traditional anonymization techniques.

Mingqiang Xue, Panagiotis Papadimitriou, Chedy Raïssi, Panos Kalnis, Hung Keng Pung

Privacy Preserving Query Processing on Secret Share Based Data Storage

Database as a Service(DaaS) is a paradigm for data management in which the Database Service Provider(DSP), usually a professional third party for data management, can host the database as a service. Many security and query problems are brought about because of the possible untrusted or malicious DSP in this context. Most of the proposed papers are concentrated on using symmetric encryption to guarantee the confidentiality of the delegated data, and using partition based index to help execute the privacy preserving range query. However, encryption and decryption operations on large volume of data are time consuming, and query results always consist of many irrelevant data tuples. Different from encryption based scheme, in this paper, we present a secret share based scheme to guarantee the confidentiality of delegated data. And what is more important, we construct a privacy preserving index to accelerate query and to help return the exactly required data tuples. Finally we analyze the security properties and demonstrate the efficiency and query response time of our approach through empirical data.

XiuXia Tian, ChaoFeng Sha, XiaoLing Wang, AoYing Zhou

Node Protection in Weighted Social Networks

Weighted social network has a broad usage in the data mining fields, such as collaborative filtering, influence analysis, phone log analysis, etc. However, current privacy models which prevent node re-identification for the social network only dealt with unweighted graphs. In this paper, we make use of the special characteristic of edge weights to define a novel k-weighted-degree anonymous model. While keeping the weight utilities, this model helps prevent node re-identification in the weighted graph based on three distance functions which measure the nodes’ difference. We also design corresponding algorithms for each distance to achieve anonymity. Some experiments on real datasets show the effectiveness of our methods.

Mingxuan Yuan, Lei Chen

Data Mining I

An Unbiased Distance-Based Outlier Detection Approach for High-Dimensional Data

Traditional outlier detection techniques usually fail to work efficiently on high-dimensional data due to the curse of dimensionality. This work proposes a novel method for subspace outlier detection, that specifically deals with multidimensional spaces where feature relevance is a local rather than a global property. Different from existing approaches, it is not grid-based and dimensionality unbiased. Thus, its performance is impervious to grid resolution as well as the curse of dimensionality. In addition, our approach ranks the outliers, allowing users to select the number of desired outliers, thus mitigating the issue of high false alarm rate. Extensive empirical studies on real datasets show that our approach efficiently and effectively detects outliers, even in high-dimensional spaces.

Hoang Vu Nguyen, Vivekanand Gopalkrishnan, Ira Assent

A Relational View of Pattern Discovery

The elegant integration of pattern mining techniques into database remains an open issue. In particular, no language is able to manipulate data and patterns without introducing opaque operators or loop-like statement. In this paper, we cope with this problem using relational algebra to formulate pattern mining queries. We introduce several operators based on the notion of cover allowing to express a wide range of queries like the mining of frequent patterns. Beyond modeling aspects, we show how to reason on queries for characterizing and rewriting them for optimization purpose. Thus, we algebraically reformulate the principle of the levelwise algorithm.

Arnaud Giacometti, Patrick Marcel, Arnaud Soulet

Efficient Incremental Mining of Frequent Sequence Generators

Recently, mining sequential patterns, especially closed sequential patterns and generator patterns, has attracted much attention from both academic and industrial communities. In recent years, incremental mining of all sequential patterns (all closed sequential patterns) has been widely studied. However, to our best knowledge, there has not been any study for incremental mining of sequence generators. In this paper, by carefully examining the existing expansion strategies for mining sequential databases, we design a GenTree structure to keep track of the relevant mining information, and propose an efficient algorithm, IncGen, for incremental generator mining. We have conducted thorough experiment evaluation and the experimental results show that the IncGen algorithm outperforms state-of-the-art generator-mining method FEAT significantly.

Yukai He, Jianyong Wang, Lizhu Zhou

An Alternative Interestingness Measure for Mining Periodic-Frequent Patterns

Periodic-frequent patterns are a class of user-interest-based frequent patterns that exist in a transactional database. A frequent pattern can be said


if it appears at a regular user-specified interval in a database. In the literature, an approach has been proposed to extract periodic-frequent patterns that occur periodically throughout the database. However, it is generally difficult for a frequent pattern to appear periodically throughout the database without any interruption in many real-world applications. In this paper, we propose an improved approach by introducing a new interestingness measure to discover periodic-frequent patterns that occur almost periodically in the database. A pattern-growth algorithm has been proposed to discover the complete set of periodic-frequent patterns. Experimental results show that the proposed model is effective.

R. Uday Kiran, P. Krishna Reddy

Data Mining II

A Framework of Mining Semantic Regions from Trajectories

With the pervasive use of mobile devices with location sensing and positioning functions, such as Wi-Fi and GPS, people now are able to acquire present locations and collect their movement. As the availability of trajectory data prospers, mining activities hidden in raw trajectories becomes a hot research problem. Given a set of trajectories, prior works either explore density-based approaches to extract regions with high density of GPS data points or utilize time thresholds to identify users’ stay points. However, users may have different activities along with trajectories. Prior works only can extract one kind of activity by specifying thresholds, such as spatial density or temporal time threshold. In this paper, we explore both spatial and temporal relationships among data points of trajectories to extract semantic regions that refer to regions in where users are likely to have some kinds of activities. In order to extract semantic regions, we propose a sequential clustering approach to discover clusters as the semantic regions from individual trajectory according to the spatial-temporal density. Based on semantic region discovery, we develop a shared nearest neighbor (SNN) based clustering algorithm to discover the frequent semantic region where the moving object often stay, which consists of a group of similar semantic regions from multiple trajectories. Experimental results demonstrate that our techniques are more accurate than existing clustering schemes.

Chun-Ta Lu, Po-Ruey Lei, Wen-Chih Peng, Ing-Jiunn Su

STS: Complex Spatio-Temporal Sequence Mining in Flickr

Nowadays, due to the increasing user requirements of efficient and personalized services, a perfect travel plan is urgently needed. In this paper we propose a novel complex spatio-temporal sequence (STS) mining in Flickr, which retrieves the optimal STS in terms of distance, weight, visiting time, opening hour, scene features, etc.. For example, when a traveler arrives at a city, the system endow every scene with a weight automatically according to scene features and user’s profiles. Then several interesting scenes (e.g.,


















) with larger weights (e.g.,


















) will be chosen. The goal of our work is to provide the traveler with the optimal STS, which passes through as many chosen scenes as possible with the maximum weight and the minimum distance within his travel time (e.g., one day). The difficulty of mining STS lies in the consideration of the weight of each scene, and its difference for different users, as well as the travel time limitation. In this paper, we provide two approximate algorithms: a local optimization algorithm and a global optimization algorithm. Finally, we give an experimental evaluation of the proposed algorithms using real datasets in Flickr.

Chunjie Zhou, Xiaofeng Meng

Mining High Utility Mobile Sequential Patterns in Mobile Commerce Environments

Mining user behaviors in mobile environments is an emerging and important topic in data mining fields. Previous researches have combined moving paths and purchase transactions to find mobile sequential patterns. However, these patterns cannot reflect actual profits of items in transaction databases. In this work, we explore a new problem of mining high utility mobile sequential patterns by integrating mobile data mining with utility mining. To the best of our knowledge, this is the first work that combines mobility patterns with high utility patterns to find high utility mobile sequential patterns, which are mobile sequential patterns with their utilities. Two tree-based methods are proposed for mining high utility mobile sequential patterns. A series of analyses on the performance of the two algorithms are conducted through experimental evaluations. The results show that the proposed algorithms deliver better performance than the state-of-the-art one under various conditions.

Bai-En Shie, Hui-Fang Hsiao, Vincent S. Tseng, Philip S. Yu

Reasoning about Dynamic Delegation in Role Based Access Control Systems

This paper proposes a logic based framework that supports dynamic delegation for role based access control systems in a decentralised environment. It allows delegation of administrative privileges for both roles and access rights between roles. We have introduced the notion of trust in delegation and have shown how extended logic programs can be used to express and reason about roles and their delegations with trust degrees, roles’ privileges and their propagations, delegation depth as well as conflict resolution. Furthermore, our framework is able to enforce various role constraints such as separation of duties, role composition and cardinality constraints. The proposed framework is flexible and provides a sound basis for specifying and evaluating sophisticated role based access control policies in decentralised environments.

Chun Ruan, Vijay Varadharajan

Probability and Uncertainty

Robust Ranking of Uncertain Data

Numerous real-life applications are continually generating huge amounts of uncertain data (e.g., sensor or RFID readings). As a result, top-


queries that return only the


most promising probabilistic tuples become an important means to monitor and analyze such data. These “top” tuples should have both high scores in term of some ranking function, and high occurrence probability. The previous works on ranking semantics are not entirely satisfactory in the following sense: they either require user-specified parameters other than


, or cannot be evaluated efficiently in real-time scale, or even generating results violating the underlying probability model. In order to overcome all these deficiencies, we propose a new semantics called U-Pop


based on a simpler but more fundamental property inherent in the underlying probability model. We then develop an efficient algorithm to evaluate U-Pop


. Extensive experiments confirm that U-Pop


is able to ensure high ranking quality and to support efficient evaluation of top-


queries on probabilistic tuples.

Da Yan, Wilfred Ng

Probabilistic Image Tagging with Tags Expanded By Text-Based Search

Automatic image tagging automatically assigns image with semantic keywords called tags, which significantly facilitates image search and organization. Most of present image tagging approaches assign the query image with the tags derived from the visually similar images in the training dataset only. However, their scalabilities and performances are constrained by the limitation of using the training method and the fixed size tag vocabulary. In this paper, we proposed a search based probabilistic image tagging algorithm (CTSTag), in which the initially assigned tags are mined from the content-based search result and expanded from the text-based search results. Experiments on NUS-WIDE dataset show not only the performance of the proposed algorithm but also the advantage of image retrieval using the tagging result.

Xiaoming Zhang, Zi Huang, Heng Tao Shen, Zhoujun Li

Removing Uncertainties from Overlay Network

Overlay networks are widely used for peer-to-peer (P2P) systems and data center systems (cloud system). P2P and data center systems are in face of node frequently joining, leaving and failure, which leads to topological uncertainty and data uncertainty. Topological uncertainty refers to that overlay network is incomplete, i.e., failures of node and link (between two nodes). Data uncertainty refers to data inconsistency and inaccurate data placement. Existing P2P and data center systems have these two uncertainties, and uncertainties have an impact on querying latency. In this study, therefore, we first give probabilistic lower bounds of diameter and average query distance for overlay network in face of these two uncertainties. The querying latency of existing systems cannot be better than the bounds. Also, existing systems often suffer unsuccessful queries due to uncertainties. To support an efficient and accurate query, we propose a topology constructive method and a data placement strategy for removing two uncertainties from overlay network. Also, efficient algorithms are proposed to support range queries in an overlay network. The DeBruijn graph representing overlay network is used to construct a new system,


, based on proposed methods. Finally, experiments show that performances of


can exceed the probabilistic bounds, and they behave better than existing systems in terms of querying latency, querying costs, fault tolerance and maintenance overhead.

Ye Yuan, Deke Guo, Guoren Wang, Lei Chen

Probabilistic and Interactive Retrieval of Chinese Calligraphic Character Images Based on Multiple Features

This paper proposes an efficient probabilistic indexing scheme called Probabilistic Multiple-Feature-Tree(PMF-Tree) to facilitate an interactive retrieval of Chinese calligraphic manuscript images based on multiple features such as contour points, character styles and types of character. Different from conventional character retrieval and indexing methods [18] which only adopts shape similarity as a query metric, our proposed indexing algorithm allows user to choose the above three kinds of features they prefer to as query elements. Moreover, a probabilistic modal is introduced to refine the retrieval result. Comprehensive experiments are conducted to testify the effectiveness and efficiency of our proposed retrieval and indexing methods respectively.

Yi Zhuang, Nan Jiang, Hua Hu, Haiyang Hu, Guochang Jiang, Chengxiang Yuan

Stream Processing

Real-Time Diameter Monitoring for Time-Evolving Graphs

The goal of this work is to identify the diameter, the maximum distance between any two nodes, of graphs that evolve over time. This problem is useful for many applications such as improving the quality of P2P networks. Our solution, G-Scale, can track the diameter of time-evolving graphs in the most efficient and correct manner. G-Scale is based on two ideas: (1) It estimates the maximal distances at any time to filter unlikely nodes that cannot be associated with the diameter, and (2) It maintains answer node pairs by exploiting the distances from a newly added node to other nodes. Our theoretical analyses show that G-Scale guarantees exactness in identifying the diameter. We perform several experiments on real and large datasets. The results show that G-Scale can detect the diameter significantly faster than existing approaches.

Yasuhiro Fujiwara, Makoto Onizuka, Masaru Kitsuregawa

Handling ER-topk Query on Uncertain Streams

It is critical to manage uncertain data streams nowadays because data uncertainty widely exists in many applications, such as Web and sensor networks. The goal of this paper is to handle top-


query on uncertain data streams. Since the volume of a data stream is unbounded whereas the memory resource is limited, it is challenging to devise one-pass solutions that is both time- and space efficient. We have devised two structures to handle this issue, namely




. The


stores all candidate tuples, and the


is helpful to compute the expected rank of a tuple. The analysis in theory and extensive experimental results show the effectiveness and efficiency of the proposed solution.

Cheqing Jin, Ming Gao, Aoying Zhou

Seamless Event and Data Stream Processing: Reconciling Windows and Consumption Modes

For a number of stream applications, synergistic integration of stream as well as event processing is becoming a necessity. However, the relationship between windows and consumption modes has not been studied in the literature. A clear understanding of this relationship is important for integrating the two synergistically as well as detecting meaningful complex events using events generated by a stream processing system. In this paper, we analyze the notion of


introduced for stream processing and the notion of

consumption modes

introduced for event processing. Based on the analysis, this paper proposes several approaches for combining the two and investigates their ramifications. We present conclusions based on our analysis and an integrated architecture that currently supports one of the reconciled approaches.

Raman Adaikkalavan, Sharma Chakravarthy

Querying Moving Objects with Uncertainty in Spatio-Temporal Databases

Spatio-temporal uncertainty is a special feature of moving objects due to the inability of precisely capturing or predicting their continuously changing locations. Indeterminate locations of moving objects at time instants add uncertainty to their topological relationships. Spatio-temporal uncertainty is important in many applications, for example, to determine whether two moving objects could possibly meet. Previous approaches, such as the 3D cylinder model and the space-time prism model have been proposed to study the spatio-temporal uncertainty. However, topological relationships between uncertain moving objects have been rarely studied and defined formally. In this paper, we propose a model called

pendant model

, which captures the uncertainty of moving objects and represents it in a databases context. As an important part of this model, we define a concept called

spatio-temporal uncertainty predicate



) which expresses the development of topological relationships between moving objects with uncertainty as a binary predicate. The benefit of this approach is that the predicates can be used as selection conditions in query languages and integrated into databases. We show their use by query examples. We also give an efficient algorithm to compute an important STUP.

Hechen Liu, Markus Schneider

A Novel Hash-Based Streaming Scheme for Energy Efficient Full-Text Search in Wireless Data Broadcast

Full-Text Search

is one of the most important and popular query types in document retrieval systems. With the development of The Fourth Generation Wireless Network (4G),

wireless data broadcast

has gained a lot of interest because of its scalability, flexibility, and energy efficiencies for wireless mobile computing. How to apply full-text search to documents transmitted through wireless communications is thus a research topic of interest. In this paper, we propose a novel data streaming scheme (named


) with hash-based indexing and inverted list techniques to facilitate energy and latency efficient full-text search in wireless data broadcast. We are the first work utilizing hash technology for this problem, which takes much less access latency and tuning time comparing to the previous literature. We further extend the proposed scheme by merging the hashed word indices in order to reduce the total access latency (named


). An information retrieval protocol is developed to cope with these two schemes. The performances of




are examined both theoretically and empirically. Simulation results prove their efficiencies with respect to both energy consumption and access latency.

Kai Yang, Yan Shi, Weili Wu, Xiaofeng Gao, Jiaofei Zhong


Efficient Topological OLAP on Information Networks

We propose a framework for efficient OLAP on information networks with a focus on the most interesting kind, the

topological OLAP

(called “


”), which incurs topological changes in the underlying networks. T-OLAP operations generate new networks from the original ones by rolling up a subset of nodes chosen by certain constraint criteria. The key challenge is to efficiently compute measures for the newly generated networks and handle user queries with varied constraints. Two effective computational techniques,




are proposed to achieve efficient query processing and cube materialization. We also provide a T-OLAP query processing framework into which these techniques are weaved. To the best of our knowledge, this is the first work to give a framework study for topological OLAP on information networks. Experimental results demonstrate both the effectiveness and efficiency of our proposed framework.

Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, Hongyan Li

An Edge-Based Framework for Fast Subgraph Matching in a Large Graph

In subgraph matching, we want to find all subgraphs of a database graph that are isomorphic to a query graph. Subgraph matching requires subgraph isomorphism testing, which is NP-complete. Recently, some techniques specifically designed for subgraph matching in a large graph have been proposed. They are based on a filtering-and-verification framework. In the filtering phase, they filter out vertices that are not qualified for subgraph isomorphism testing. In the verification phase, subgraph isomorphism testing is performed and all matched subgraphs are returned to the user. We call them a vertex-based framework in the sense that they use vertex information when filtering out unqualified vertices. Edge information, however, can also be used for efficient subgraph matching. In this paper, we propose an edge-based framework for fast subgraph matching in a large graph. By using edge connectivity information, our framework not only filters out more vertices in the filtering phase, but also avoids unnecessary edge connectivity checking operations in the verification phase. The experimental results show that our method significantly outperforms existing approaches for subgraph matching in a large graph.

Sangjae Kim, Inchul Song, Yoon Joon Lee

Context-Sensitive Query Expansion over the Bipartite Graph Model for Web Service Search

As Service Oriented Architecture (SOA) matures, service consumption demand leads to an urgent requirement to service discovery. Unlike web documents, services are intended to be executed to achieve objectives and/or desired goals of users, which means to realize application requirements. This leads to the notion that service discovery should take into account the “application requirement” of service with service content (descriptions) which have been well explored. Content is defined by service developers, e.g. WSDL file and context is defined by service users, which is service usages to application requirement. We find context(application) information is more useful for query generation, especially for non-expert users. So in this paper, we propose to do context-sensitive query processing to resolve application-oriented queries for web service search engine. Context is modeled by a bipartite graph model to represent the mapping relationship between application space and service space. Application-oriented queries are resolved by query expansion based on the topic sensitive bipartite graph. The experiments verify the efficiency of our idea.

Rong Zhang, Koji Zettsu, Yutaka Kidawara, Yasushi Kiyoki

BMC: An Efficient Method to Evaluate Probabilistic Reachability Queries

Reachability query is a fundamental problem in graph databases. It answers whether or not there exists a path between a source vertex and a destination vertex and is widely used in various applications including road networks, social networks, world wide web and bioinformatics. In some emerging important applications, uncertainties may be inherent in the graphs. For instance, each edge in a graph could be associated with a probability to appear. In this paper, we study the reachability problem over such uncertain graphs in a threshold fashion, namely, to determine if a source vertex could reach a destination vertex with probabilty larger than a user specified probability value


. Finding reachability on uncertain graphs has been proved to be NP-Hard. We first propose novel and effective bounding techniques to obtain the upper bound of reachability probability between the source and destination. If the upper bound fails to prune the query, efficient dynamic Monte Carlo simulation technqiues will be applied to answer the probabilitistic reachability query with an accuracy guarantee. Extensive experiments over real and synthetic datasets are conducted to demonstrate the efficiency and effectiveness of our techniques.

Ke Zhu, Wenjie Zhang, Gaoping Zhu, Ying Zhang, Xuemin Lin


Improving XML Data Quality with Functional Dependencies

We study the problem of repairing XML functional dependency violations by making the smallest value modifications in terms of repair cost. Our cost model assigns a weight to each leaf node in the XML document, and the cost of a repair is measured by the total weight of the modified nodes. We show that it is beyond reach in practice to find optimum repairs: this problem is already NP-complete for a setting with a fixed DTD, a fixed set of functional dependencies, and equal weights for all the nodes in the XML document. To this end we provide an efficient two-step heuristic method to repair XML functional dependency violations. First, the initial violations are captured and fixed by leveraging the conflict hypergraph. Second, the remaining conflicts are resolved by modifying the violating nodes and their related nodes called determinants, in a way that guarantees no new violations. The experimental results demonstrate that our algorithm scales well and is effective in improving data quality.

Zijing Tan, Liyong Zhang

Identifying Relevant Matches with NOT Semantics over XML Documents

Keyword search over XML documents has been widely studied in recent years. It allows users to retrieve relevant data from XML documents without learning complicated query languages. SLCA (smallest lowest common ancestor)-based keyword search is a common mechanism to locate the desirable LCAs for the given query keywords, but the conventional SLCA-based keyword search is for AND-only semantics. In this paper, we extend the SLCA keyword search to a more general case, where the keyword query could be an arbitrary combination of AND, OR, and NOT operators. We further define the query result based on the




properties, and propose an efficient algorithm to figure out the SLCAs and the relevant matches. Since the keyword query becomes more complex, we also discuss the variations of the monotonicity and consistency properties in our framework. Finally, the experimental results show that the proposed algorithm runs efficiently and gives reasonable query results by measuring the processing time, scalability, precision, and recall.

Rung-Ren Lin, Ya-Hui Chang, Kun-Mao Chao

Evaluating Contained Rewritings for XPath Queries on Materialized Views

In this paper, we study the problem how to efficiently evaluate a set of contained rewritings on materialized views. Previous works focused on how to find a set of contained rewritings given a view and a query, but did not address how to evaluate the rewritings on materialized views. To evaluate a potential exponential number of contained rewritings, we design two algorithms, a basic algorithm and an optimized algorithm. Both algorithms are built on the observation that the exponential number of contained rewritings are actually composed by a linear number of component patterns. In the optimized algorithm, we further design four important pruning rules and several heuristic rules that can effectively reduce the number of component patterns we need to evaluate. The experiments demonstrate the efficiency of our algorithms.

Rui Zhou, Chengfei Liu, Jianxin Li, Junhu Wang, Jixue Liu

XStreamCluster: An Efficient Algorithm for Streaming XML Data Clustering

XML clustering finds many applications, ranging from storage to query processing. However, existing clustering algorithms focus on static XML collections, whereas modern information systems frequently deal with streaming XML data that needs to be processed online. Streaming XML clustering is a challenging task because of the high




efficiency requirements implicated for online approaches. In this paper we propose


, which addresses the two challenges using a two-layered optimization. The bottom layer employs Bloom filters to encode the XML documents, providing a space-efficient solution to memory usage. The top layer is based on Locality Sensitive Hashing and contributes to the computational efficiency. The theoretical analysis shows that the approximate solution of XStreamCluster generates similarly good clusters as the exact solution, with high probability. The experimental results demonstrate that XStreamCluster improves both memory efficiency and computational time by at least an order of magnitude without affecting clustering quality, compared to its variants and a baseline approach.

Odysseas Papapetrou, Ling Chen

XML and Graph

Efficient Evaluation of NOT-Twig Queries in Tree-Unaware Relational Databases

Despite a large body of work on


query processing in relational environment, systematic study of


-twig queries

has received little attention in the literature. Such queries contain not-predicates and are useful for many real-world applications. In this paper, we present an efficient strategy to evaluate


-twig queries on top of a dewey-based


system called


[11]. We extend the encoding scheme of


by adding two new labels, namely




, that enable us to


filter out elements satisfying a not-predicate by comparing their

ancestor group identifiers

. In this approach, a set of elements under the same common ancestor at a specific level in the


tree is assigned

same ancestor group identifier

. Based on this encoding scheme, we propose a novel


translation algorithm for


-twig query evaluation. Real and synthetic datasets are employed to demonstrate the superiority of our approach over industrial-strength


and native



Kheng Hong Soh, Sourav S. Bhowmick

A Hybrid Algorithm for Finding Top-k Twig Answers in Probabilistic XML

Uncertainty is inherently ubiquitous in data of real applications, and those uncertain data can be naturally represented by the XML. Matching twig pattern against XML data is a core problem, and on the background of probabilistic XML, each twig answer has a probabilistic value because of the uncertainty of data. The twig answers that have small probabilistic values are useless to the users, and the users only want to get the answers with the largest


probabilistic values. In this paper, we address the problem of finding twig answers with top-


probabilistic values against probabilistic XML documents directly. To cope with this problem, we propose a hybrid algorithm which takes both the probability value constraint and structural relationship constraint into account. The main idea of the algorithm is that the element with larger path probability value will more likely contribute to the twig answers with larger twig probability values, and at the same time lots of useless answers that do not satisfy the structural constraint can be filtered. Therefore the proposed algorithm can avoid lots of intermediate results, and find the top-


answers quickly. Experiments have been conducted to study the performance of the algorithm.

Bo Ning, Chengfei Liu

Optimizing Incremental Maintenance of Minimal Bisimulation of Cyclic Graphs

Graph-structured databases have numerous recent applications including the Semantic Web, biological databases and XML, among many others. In this paper, we study the maintenance problem of a popular structural index, namely


, of a possibly


data graph. In comparison, previous work mainly focuses on acyclic graphs. In the context of database applications, it is natural to compute minimal bisimulation with merging algorithms. First, we propose a maintenance algorithm for a minimal bisimulation of a cyclic graph, in the style of merging. Second, to prune the computation on non-bisimilar


s, we propose a feature-based optimization. The features are designed to be constructed and used more efficiently than bisimulation minimization. Third, we conduct an experimental study that verifies the effectiveness and efficiency of our algorithm. Our features-based optimization pruned 50% (on average) unnecessary bisimulation computation. Our experiment verifies that we extend the current state-of-the-art with a capability on handling cyclic graphs in maintenance of minimal bisimulation.

Jintian Deng, Byron Choi, Jianliang Xu, Sourav S. Bhowmick

Social Based Layouts for the Increase of Locality in Graph Operations

Graphs provide a natural data representation for analyzing the relationships among entities in many application areas. Since the analysis algorithms perform memory intensive operations, it is important that the graph layout is adapted to take advantage of the memory hierarchy.

Here, we propose layout strategies based on community detection to improve the in-memory data locality of generic graph algorithms. We conclude that the detection of communities in a graph provides a layout strategy that improves the performance of graph algorithms consistently over other state of the art strategies.

Arnau Prat-Pérez, David Dominguez-Sal, Josep L. Larriba-Pey

Generating Random Graphic Sequences

The graphs that arise from concrete applications seem to correspond to models with prescribed degree sequences. We present two algorithms for the uniform random generation of graphic sequences. We prove their correctness. We empirically evaluate their performance. To our knowledge these algorithms are the first non trivial algorithms proposed for this task. The algorithms that we propose are Markov chain Monte Carlo algorithms. Our contribution is the original design of the Markov chain and the empirical evaluation of mixing time.

Xuesong Lu, Stéphane Bressan


Weitere Informationen

Premium Partner

BranchenIndex Online

Die B2B-Firmensuche für Industrie und Wirtschaft: Kostenfrei in Firmenprofilen nach Lieferanten, Herstellern, Dienstleistern und Händlern recherchieren.



Best Practices für die Mitarbeiter-Partizipation in der Produktentwicklung

Unternehmen haben das Innovationspotenzial der eigenen Mitarbeiter auch außerhalb der F&E-Abteilung erkannt. Viele Initiativen zur Partizipation scheitern in der Praxis jedoch häufig. Lesen Sie hier  - basierend auf einer qualitativ-explorativen Expertenstudie - mehr über die wesentlichen Problemfelder der mitarbeiterzentrierten Produktentwicklung und profitieren Sie von konkreten Handlungsempfehlungen aus der Praxis.
Jetzt gratis downloaden!