
2005 | Book

Database Systems for Advanced Applications

10th International Conference, DASFAA 2005, Beijing, China, April 17-20, 2005. Proceedings

Edited by: Lizhu Zhou, Beng Chin Ooi, Xiaofeng Meng

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Keynotes

Data Stream Mining and Resource Adaptive Computation

The problem of data streams has gained importance in recent years because of advances in hardware technology. These advances have made it easy to store and record numerous transactions and activities in everyday life in an automated way. The ubiquitous presence of data streams in a number of practical domains has generated a lot of research in this area. Example applications include trade surveillance for security fraud and money laundering, network monitoring for intrusion detection, bio-surveillance against terrorist attacks, and others. Data is viewed as a continuous stream in these kinds of applications. Problems such as data mining, which have been widely studied for traditional data sets, cannot be easily solved for the data stream domain. This is because the large volume of data arriving in a stream renders most algorithms inefficient, as most mining algorithms require multiple scans of the data, which is unrealistic for stream data. More importantly, the characteristics of the data stream can change over time and the evolving patterns need to be captured. Furthermore, we need to consider the problem of resource allocation in mining data streams. Due to the large volume and the high speed of streaming data, mining algorithms must cope with the effects of system overload. Thus, how to achieve optimum results under various resource constraints becomes a challenging task. In this talk, I'll provide an overview, discuss the issues, and focus on how to mine evolving data streams and perform resource adaptive computation.

Philip S. Yu
Purpose Based Access Control for Privacy Protection in Database Systems

The development of privacy-preserving data management techniques has been the focus of intense research in the last few years. Such research has resulted in important notions and techniques, such as the notions of Hippocratic database systems and k-anonymity, and various privacy-preserving data mining techniques. However, much work still needs to be carried out to develop high-assurance privacy-preserving database management systems. An important requirement in the development of such systems is the need to provide comprehensive and accurate privacy-related metadata, such as data usage purposes. Such metadata represent the core of access control mechanisms specifically tailored towards privacy. In this talk we address this issue. We present a comprehensive approach to privacy-preserving access control based on the notion of purpose. Purpose information associated with a given data element specifies the intended use of the data element. Purpose information represents an important form of metadata, because data usage purpose is very often part of privacy policies, as in the case of policies expressed according to P3P. A key feature of our model is that it allows multiple purposes to be associated with each data element and also supports explicit prohibitions, thus allowing privacy officers to specify that some data should not be used for certain purposes. Another important issue to be addressed is the granularity of data labeling, that is, the units of data with which purposes can be associated. We address this issue in the context of relational databases and propose four different labeling schemes, each providing a different granularity. We also propose an approach to representing purpose information which results in very low storage overhead, and we exploit query modification techniques to support data access control based on purpose information. We conclude the talk by outlining future work, which includes the application of our purpose management techniques to complex data and their integration into RBAC.

Elisa Bertino
Complex Networks and Network Data Mining

We propose a new method for mapping important factors abstracted from a real complex network onto the topology of nodes and links. In this method, the effect of a node is denoted by a computable quality, such as the scale of a city in a traffic network, the throughput of a node in a communication network, the hit rate of a web site, and the prestige of an individual in a human relationship network. Likewise, the interaction between nodes is denoted by the distance or length of links, such as the geographic distance between two cities in a traffic network, the bandwidth between two communication nodes, the number of hyperlinks of a webpage, and the intensity of a friendship in a human relationship network. That is, topologically, two-factor operations over nodes and links are generalized to four-factor operations over nodes, links, distance, and quality. Using this four-factor method, we analyze networking data and simulate the optimization of web mining to form a mining engine by excluding redundant and irrelevant nodes. The method can reduce a complicated, messy web site structure to a new, concise, informative graph. In a prototype system for mining informative structures, several experiments on real networking data sets have shown encouraging results in both the discovered knowledge and the knowledge discovery rate.

Deyi Li

Bioinformatics

Indexing DNA Sequences Using q-Grams

We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to this interest, this paper presents a method for efficiently indexing DNA sequences based on q-grams to facilitate similarity search in a DNA database and sidestep the need for a linear scan of the entire database. A two-level index – hash table and c-trees – is proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance of the query sequence. Experimental results show that our method is efficient in detecting similar regions in a DNA sequence database with high sensitivity.

Xia Cao, Shuai Cheng Li, Anthony K. H. Tung
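The q-gram filtering idea can be illustrated in a few lines. The sketch below is a minimal, hypothetical single-level hash index (the paper's two-level hash-table/c-tree design is more elaborate): sequences sharing too few q-grams with the query cannot be within a small edit distance and are pruned. The function names and the `min_shared` threshold are illustrative assumptions.

```python
from collections import defaultdict

def qgrams(seq: str, q: int):
    """Yield all overlapping q-grams of a sequence."""
    for i in range(len(seq) - q + 1):
        yield seq[i:i + q]

def build_index(sequences: dict, q: int = 3):
    """Map each q-gram to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in sequences.items():
        for i, g in enumerate(qgrams(seq, q)):
            index[g].append((sid, i))
    return index

def candidate_sequences(index, query: str, q: int = 3, min_shared: int = 2):
    """Count shared q-grams per sequence; sequences similar to the query
    must share enough q-grams with it (count filtering)."""
    hits = defaultdict(int)
    for g in qgrams(query, q):
        for sid, _ in index.get(g, []):
            hits[sid] += 1
    return [sid for sid, n in hits.items() if n >= min_shared]

# Usage: index two toy sequences and probe with a similar query.
idx = build_index({"s1": "ACGTACGT", "s2": "TTTTCCCC"})
print(candidate_sequences(idx, "ACGTTCGT"))  # -> ['s1']
```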
PADS: Protein Structure Alignment Using Directional Shape Signatures

A novel data mining approach for similarity search and knowledge discovery in protein structure databases is proposed. PADS (Protein structure Alignment by Directional shape Signatures) incorporates the three-dimensional coordinates of the main atoms of each amino acid and extracts a geometrical shape signature along with the direction of each amino acid. As a result, each protein structure is represented by a series of multidimensional feature vectors capturing the local geometry, shape, direction, and biological properties of its amino acid molecules. Furthermore, a distance matrix is calculated and incorporated into a local alignment dynamic programming algorithm to find the similar portions of two given protein structures, followed by a sequence alignment step for more efficient filtration. The optimal superimposition of the detected similar regions is used to assess the quality of the results. The proposed algorithm is fast and accurate and hence could be used for analysis and knowledge discovery in large protein structures. The method has been compared with the results from CE, DALI, and CTSS using a representative sample of PDB structures. Several new structures not detected by the other methods are detected.

S. Alireza Aghili, Divyakant Agrawal, Amr El Abbadi
LinkageTracker: A Discriminative Pattern Tracking Approach to Linkage Disequilibrium Mapping

Linkage disequilibrium mapping is a process of inferring the disease gene location from observed associations of marker alleles in affected patients and normal controls. In reality, the presence of disease-associated chromosomes in the affected population is relatively low (usually 10% or less). Hence, it is a challenge to locate these disease genes on the chromosomes. In this paper, we propose an algorithm known as LinkageTracker for linkage disequilibrium mapping. Compared with some of the existing work, LinkageTracker is more robust and does not require any population ancestry information. Furthermore, our algorithm is shown to find the disease locations more accurately than a closely related existing work, reducing the average sum-square error by more than half (from 80.71 to 30.83) over one hundred trials. LinkageTracker was also applied to a real dataset of patients affected with haemophilia, and the disease gene locations found were consistent with several studies in genetic prediction.

Li Lin, Limsoon Wong, Tzeyun Leong, Pohsan Lai

Watermarking and Encryption

Query Optimization in Encrypted Database Systems

To ensure the privacy of data in relational databases, prior work has provided techniques to support data encryption and execute SQL queries over the encrypted data. However, the problem of how to put these techniques together in an optimal manner was not addressed, which is equivalent to having an RDBMS without a query optimizer. This paper models and solves that optimization problem.

Hakan Hacıgümüş, Bala Iyer, Sharad Mehrotra
Watermarking Spatial Trajectory Database

Protection of digital assets from piracy has received increasing interest as sensitive, valuable data need to be released. This paper addresses the problem of watermarking spatial trajectory databases. A formal definition of the problem is given and the potential attacks are analyzed. Then a novel watermarking method is proposed, which embeds the watermark information by introducing a small error into the trajectory shape rather than into particular data values. Experimental results justify the usefulness of the proposed method and give some empirical conclusions on the parameter settings.

Xiaoming Jin, Zhihao Zhang, Jianmin Wang, Deyi Li
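As a toy illustration of perturbation-based trajectory watermarking (not the paper's actual shape-based embedding scheme, and non-blind for simplicity, since extraction compares against the original), one can encode each bit as the sign of a small coordinate offset:

```python
def embed(points, bits, eps=0.001):
    """Nudge each y-coordinate up or down by eps to encode one bit."""
    return [(x, y + (eps if b else -eps))
            for (x, y), b in zip(points, bits)]

def extract(marked, original):
    """Recover bits by comparing against the original trajectory."""
    return [1 if ym > y else 0
            for (_, ym), (_, y) in zip(marked, original)]

# Usage: a four-point trajectory carrying four watermark bits.
traj = [(0.0, 1.0), (1.0, 1.5), (2.0, 1.2), (3.0, 0.8)]
bits = [1, 0, 1, 1]
print(extract(embed(traj, bits), traj) == bits)  # True
```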
Effective Approaches for Watermarking XML Data

Watermarking enables provable rights over content and has been successfully applied in multimedia applications. However, it is not trivial to apply known effective watermarking schemes to XML data, since noisy data may not be acceptable given its structures and node extents. In this paper, we present two different watermarking schemes for XML data: the selective approach and the compression approach. The former allows us to embed non-destructive hidden information content over XML data. The latter takes into account verbosity and the real-life need to update XML data. We conduct experiments on the efficiency and robustness of both approaches against different forms of attack, which show that our proposed watermarking schemes are reasonably efficient and effective.

Wilfred Ng, Ho-Lam Lau

XML Query Processing

A Unifying Framework for Merging and Evaluating XML Information

With the ever-increasing connection between XML information systems over the Web, users are able to obtain integrated sources of XML information in a cooperative manner, such as by developing an XML mediator schema or using eXtensible Stylesheet Language Transformation (XSLT). However, it is not trivial to evaluate the quality of such merged XML data, even when we have knowledge of the involved XML data sources. Herein, we present a unifying framework for merging XML data and study the quality issues of merged XML information. We capture the coverage of the object sources and the structural diversity of XML data objects with two metrics: the Information Completeness (IC) and the Data Complexity (DC) of the merged data.

Ho-Lam Lau, Wilfred Ng
Efficient Evaluation of Partial Match Queries for XML Documents Using Information Retrieval Techniques

We propose XIR, a novel method for processing partial match queries on heterogeneous XML documents using information retrieval (IR) techniques. A partial match query is defined as one having the descendant-or-self axis "//" in its path expression. In its general form, a partial match query has branch predicates forming branching paths. The objective of XIR is to efficiently support this type of query for large-scale documents of heterogeneous schemas. XIR builds on the conventional schema-level methods using relational tables and significantly improves their efficiency using two techniques: an inverted index technique and a novel prefix match join. The former indexes the labels in label paths as keywords in texts, and allows the label paths matching a query to be found more efficiently than with the string matching used in conventional methods. The latter supports branching path expressions, and allows the result nodes to be found more efficiently than with the containment joins used in conventional methods. We compare the efficiency of XIR with those of XRel and XParent using XML documents crawled from the Internet. The results show that XIR is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and by several factors for branching path expressions.

Young-Ho Park, Kyu-Young Whang, Byung Suk Lee, Wook-Shin Han
PathStack¬: A Holistic Path Join Algorithm for Path Query with Not-Predicates on XML Data

The evaluation of path queries forms the basis of complex XML query processing, which has attracted a lot of research attention. However, none of these works have examined the processing of more complex queries that contain not-predicates. In this paper, we present the first study on evaluating path queries with not-predicates. We propose an efficient holistic path join algorithm, PathStack¬, which has the following advantages: (1) it requires only one scan of the relevant data to evaluate path queries with not-predicates; (2) it does not generate any intermediate results; and (3) its memory space requirement is bounded by the longest path in the input XML document. We also present an improved variant of PathStack¬ that further minimizes unnecessary computations.

Enhua Jiao, Tok Wang Ling, Chee-Yong Chan

XML Coding and Metadata Management

An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML

A number of labeling schemes have been designed to facilitate XML querying, based on which the ancestor-descendant relationship between any two nodes can be determined quickly. Another important feature of XML is that its elements are intrinsically ordered. However, the label update cost of present labeling schemes is high: they have to re-label existing nodes or re-calculate certain values when inserting an order-sensitive element. Thus it is important to design a scheme that supports order-sensitive queries yet has a low label update cost. In this paper, we design a binary string prefix scheme which supports order-sensitive updates without any re-labeling or re-calculation. Theoretical analysis and experimental results also show that this scheme is compact compared to existing dynamic labeling schemes, and that it provides efficient support for both ordered and un-ordered queries.

Changqing Li, Tok Wang Ling
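The core of such a binary-string prefix scheme can be sketched in a few lines. Assuming, as one common formulation, that labels are binary strings ending in '1' and compared lexicographically, a new label can always be generated strictly between two neighbors without touching any existing label; the paper's exact rules may differ.

```python
def label_between(left: str, right: str) -> str:
    """Return a binary-string label strictly between left and right
    (both assumed to end with '1' and satisfy left < right)."""
    assert left < right and left.endswith("1") and right.endswith("1")
    if len(left) >= len(right):
        return left + "1"        # left cannot be a prefix of right here
    return right[:-1] + "01"     # shrink right's trailing '1' to '01'

# Usage: keep inserting between "01" and "1"; old labels never change.
a, b = "01", "1"
for _ in range(3):
    m = label_between(a, b)
    assert a < m < b             # lexicographic order is preserved
    print(m)                     # 011, 0111, 01111
    a = m
```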
Efficiently Coding and Indexing XML Document

In this paper, a novel and efficient numbering scheme is presented, which combines label path information and data path information and can efficiently support all kinds of queries. A compact index structure, named HiD, is also proposed. Query algorithms based on this index structure are introduced. Finally, comprehensive experiments are conducted to assess all the techniques in question.

Zhongming Han, Congting Xi, Jiajin Le
XQuery-Based TV-Anytime Metadata Management

Digital broadcasting is a novel paradigm for next-generation broadcasting. It offers a new opportunity for interactive services such as content-based browsing, non-linear navigation, and usage of user preferences and history. On the other hand, one of the important factors for this new broadcasting environment is interoperability among providers and consumers, since the environment is distributed. Therefore, standard metadata for digital broadcasting is required, and TV-Anytime metadata is one such standard. It is defined using XML Schema, so its instances are XML data. In order to achieve interoperability, a standard query language is also required, and XQuery, a forthcoming standard query language for XML data, is a natural choice. In this paper we propose an efficient XML data management system that supports TV-Anytime metadata, using XQuery as the query language. Since the volume of metadata would be very large in real situations, our system uses a relational database system as storage. We implement a prototype system and test its performance on various typical queries by comparing our system with other general-purpose systems.

Jong-Hyun Park, Byung-Kyu Kim, Yong-Hee Lee, Min-Woo Lee, Min-Ok Jung, Ji-Hoon Kang

Data Mining

Effective Database Transformation and Efficient Support Computation for Mining Sequential Patterns

In this paper, we introduce a novel algorithm for mining sequential patterns from transaction databases. Since the FP-tree based approach is efficient in mining frequent itemsets, we adapt it to find frequent 1-sequences. For efficient frequent k-sequence mining, every frequent 1-sequence is encoded as a unique symbol and the database is transformed into one in symbolic form. We observe that it is unnecessary to encode all the frequent 1-sequences, and we make full use of the discovered frequent 1-sequences to transform the database into one of minimal size. To discover the frequent k-sequences, we design a tree structure to store the candidates. Each customer sequence is then scanned to decide whether the candidates are frequent k-sequences. We propose a technique that avoids redundantly enumerating identical k-subsequences from a customer sequence to speed up the process. Moreover, the tree structure is designed in such a way that the supports of the candidates can be incremented for a customer sequence by a single sequential traversal of the tree. The experimental results show that our approach outperforms previous works in various aspects, including scalability and execution time.

Chung-Wen Cho, Yi-Hung Wu, Arbee L. P. Chen
Mining Succinct Systems of Minimal Generators of Formal Concepts

Formal concept analysis has become an active field of study for data analysis and knowledge discovery. A formal concept C is determined by its extent (the set of objects that fall under C) and its intent (the set of properties or attributes covered by C). The intent of C, also called a closed itemset, is the maximum set of attributes that characterize C. The minimal generators of C are the minimal subsets of C's intent which can similarly characterize C. This paper introduces the succinct system of minimal generators (SSMG) as a minimal representation of the minimal generators of all concepts, and gives an efficient algorithm for mining SSMGs. The SSMGs are useful for revealing the equivalence relationships among the minimal generators, which may be important for medical and other scientific discovery, and for revealing the extent-based semantic equivalence among associations. The SSMGs are also useful for losslessly reducing the size of the representation of all minimal generators, similar to the way that closed itemsets are useful for losslessly reducing the size of the representation of all frequent itemsets. The removal of redundancies will help human users grasp the structure and information in the concepts.

Guozhu Dong, Chunyu Jiang, Jian Pei, Jinyan Li, Limsoon Wong
A General Approach to Mining Quality Pattern-Based Clusters from Microarray Data

Pattern-based clustering has broad applications in microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly-overlapping clusters, which makes it hard for users to identify interesting patterns in the mining results. Moreover, there is no general model for pattern-based clustering: different kinds of patterns or different measures of pattern coherence may require different algorithms. In this paper, we address these two problems by proposing a general quality-driven approach to mining the top-k quality pattern-based clusters. We examine our quality-driven approach using real-world microarray data sets. The experimental results show that our method is general, effective and efficient.

Daxin Jiang, Jian Pei, Aidong Zhang

Data Generation and Understanding

Real Datasets for File-Sharing Peer-to-Peer Systems

The fundamental drawback of unstructured peer-to-peer (P2P) networks is the flooding-based query processing protocol that seriously limits their scalability. As a result, a significant amount of research work has focused on designing efficient search protocols that reduce the overall communication cost. What is lacking, however, is the availability of real data regarding the exact content of users' libraries and the queries that these users ask. Using trace-driven simulations will clearly generate more meaningful results and further illustrate the efficiency of a generic query processing protocol under a real-life scenario.

Motivated by this fact, we developed a Gnutella-style probe and collected detailed data over a period of two months. They involve around 4,500 users and contain the exact files shared by each user, together with any available metadata (e.g., artist for songs) and information about the nodes (e.g., connection speed). We also collected the queries initiated by these users. After filtering, the data were organized in XML format and are available to researchers. Here, we analyze this dataset and present its statistical characteristics. Additionally, as a case study, we employ it to evaluate two recently proposed P2P searching techniques.

Shen Tat Goh, Panos Kalnis, Spiridon Bakiras, Kian-Lee Tan
SemEQUAL: Multilingual Semantic Matching in Relational Systems

In an increasingly multilingual world, it is critical that information management tools organically support the simultaneous use of multiple natural languages. A pre-requisite for efficiently achieving this goal is that the underlying database engines must provide seamless matching of text data across languages. We propose here SemEQUAL, a new SQL functionality for semantic matching of multilingual attribute data. Our current implementation defines matches based on the standard WordNet linguistic ontologies. A performance evaluation of SemEQUAL, implemented using standard SQL:1999 features on a suite of commercial database systems, indicates unacceptably slow response times. However, by tuning the schema and index choices to match typical linguistic features, we show that the performance can be improved to a level commensurate with online user interaction.

A. Kumaran, Jayant R. Haritsa
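A toy, dictionary-based illustration of the semantic-matching idea: a data value matches a query concept if the concept is one of its hypernym (is-a) ancestors. The real operator works over WordNet ontologies inside SQL:1999; this taxonomy and the function name are invented purely for illustration.

```python
TAXONOMY = {                      # child -> parent (toy is-a edges)
    "beagle": "dog", "dog": "mammal", "cat": "mammal",
    "mammal": "animal",
}

def sem_equal(value: str, concept: str) -> bool:
    """True if `value` is `concept` or a descendant of it."""
    node = value
    while node is not None:
        if node == concept:
            return True
        node = TAXONOMY.get(node)  # climb one is-a level
    return False

print(sem_equal("beagle", "animal"))  # True
print(sem_equal("cat", "dog"))        # False
```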
A Metropolis Sampling Method for Drawing Representative Samples from Large Databases

In this paper, a sampling method based on the Metropolis algorithm is proposed. It is able to draw samples that have the same distribution as the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the quality of the samples by comparing their statistical properties with those of the underlying population. The experimental results show that the samples selected by our method are indeed representative.

Hong Guo, Wen-Chi Hou, Feng Yan, Qiang Zhu
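A minimal sketch of the underlying Metropolis idea, assuming rows are weighted by an unnormalized target density; the paper's sampler and its tuning are more involved, and all names here are illustrative.

```python
import random

def metropolis_sample(rows, weight, n_samples, burn_in=1000):
    """Draw rows with probability proportional to weight(row),
    using a symmetric uniform proposal over the table."""
    current = random.choice(rows)
    samples = []
    for step in range(burn_in + n_samples):
        proposal = random.choice(rows)
        # Metropolis acceptance test for a symmetric proposal.
        if random.random() < min(1.0, weight(proposal) / weight(current)):
            current = proposal
        if step >= burn_in:
            samples.append(current)
    return samples

# Usage: rows weighted by 'amount' appear in proportion to it.
rows = [{"id": i, "amount": i + 1} for i in range(5)]
sample = metropolis_sample(rows, lambda r: r["amount"], n_samples=5000)
counts = {i: 0 for i in range(5)}
for r in sample:
    counts[r["id"]] += 1
print(counts)  # roughly proportional to 1:2:3:4:5
```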

Panel

Stay Current and Relevant in Data Mining Research

In a recent editorial of the Bioinformatics journal, Dr. Pavel Pevzner, a pioneering researcher in the field of bioinformatics, made the following statement [1]:

For many years algorithms were taught exclusively to computer scientists, with relatively few students from other disciplines attending algorithms courses. A biology student in an algorithms class would be a surprising and unlikely (though not entirely unwelcome) guest in the 1990s. Things change; some biology students now take some sort of Algorithms 101. At the same time, curious computer science students often take Genetics 101.

Haixun Wang, Wei Wang

Music Retrieval

An Efficient Approach to Extracting Approximate Repeating Patterns in Music Databases

Pattern extraction from music strings is an important problem. The patterns extracted from music strings can be used as features for music retrieval or analysis. Previous works on music pattern extraction focus only on exact repeating patterns. However, music segments with minor differences may sound similar. The concept of the prototypical melody has therefore been proposed to represent these similar music segments. In musicology, the number of music segments that are similar to a prototypical melody indicates the importance of the prototypical melody to the music work. In this paper, a novel approach is developed to extract all the prototypical melodies in a music work. Our approach considers each music segment as a candidate for the prototypical melody and uses the edit distance to determine the set of music segments that are similar to this candidate. A lower bounding mechanism, which estimates the number of similar music segments for each candidate and prunes impossible candidates, is designed to speed up the process. Experiments are performed on a real data set and the results show a significant improvement of our approach over existing approaches in average response time.

Ning-Han Liu, Yi-Hung Wu, Arbee L. P. Chen
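The edit-distance test at the heart of this approach is the classic dynamic program. The sketch below omits the paper's lower-bounding pruning and uses an illustrative threshold `tau` to collect the segments similar to a candidate melody.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similar_segments(candidate: str, segments: list, tau: int = 2):
    """Segments within edit distance tau count toward the candidate's
    importance as a prototypical melody."""
    return [s for s in segments if edit_distance(candidate, s) <= tau]

print(similar_segments("CDEFG", ["CDEFG", "CDEG", "GABCD"]))
# -> ['CDEFG', 'CDEG']
```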
On Efficient Music Genre Classification

Automatic music genre classification has long been an important problem. However, there is a paucity of literature that addresses the issue, and in addition, the reported accuracy is fairly low. In this paper, we present an empirical study of a novel music descriptor generation method for efficient content-based music genre classification. Analysis and empirical evidence demonstrate that our approach outperforms state-of-the-art approaches in both the accuracy of genre classification with various machine learning algorithms and the efficiency of the training process. Furthermore, its effectiveness is robust against various kinds of audio alteration.

Jialie Shen, John Shepherd, Anne H. H. Ngu
Effectiveness of Note Duration Information for Music Retrieval

Content-based music information retrieval uses features extracted from music to answer queries. For melodic queries, the two main features are the pitch and duration of notes. The note pitch feature has been well researched, whereas duration has not been fully explored. In this paper, we discuss how the note duration feature can affect music retrieval effectiveness. Notes are represented by strings called standardisations. A standardisation is designed for approximate string matching and may not capture melodic information precisely. To represent pitches, we use a string of pitch differences. Our duration standardisation uses a string of five symbols representing the relative durations of adjacent notes. For both features, Smith-Waterman alignment is used for matching. We demonstrate how to combine the similarity of both features using a vector model. Results of our experiments in retrieval effectiveness show that note duration similarity by itself is not useful for effective music retrieval. Combining pitch and duration similarity using the vector model does not improve retrieval effectiveness over the use of pitch on its own.

Iman S. H. Suyoto, Alexandra L. Uitdenbogerd

Query Processing in Subscription Systems

A Self-Adaptive Model to Improve Average Response Time of Multiple-Event Filtering for Pub/Sub System

A publish/subscribe system captures the dynamic aspect of the specified information by notifying users of interesting events as soon as possible. Fast response time is important for event filtering, which requires multiple processing steps, and is also one of the important factors in providing good service to subscribers.

Generally the event arrival rate is time-varying and unpredictable. It is quite possible that no event arrives in one unit of time while multiple events arrive in another. When multiple events with different workloads arrive at the same time, the average response time of multiple-event filtering depends on the sequence of event-by-event filtering.

As far as we know, although significant research effort has been dedicated to techniques for single-event filtering, these techniques cannot efficiently filter multiple events with fast response times. In this paper, we first propose a multiple-event filtering algorithm based on the R-tree. By calculating the relative workload of each event, event-by-event filtering can be executed with a short-job-first policy so as to improve the average response time of multiple-event filtering. Furthermore, a self-adaptive model is proposed to filter multiple events in a dynamically changing environment.

The multiple-event filtering algorithm and the self-adaptive model are evaluated in a simulated environment. The results show that the average response time can be improved by up to nearly 50%. With the self-adaptive model, multiple events can be filtered with an average response time that is always the same as or close to the best possible time in a dynamically changing environment.

Botao Wang, Wang Zhang, Masaru Kitsuregawa
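The short-job-first intuition can be made concrete with a small sketch; the per-event workload estimates here stand in for the paper's R-tree-based cost calculation, and all names are illustrative.

```python
def schedule(events):
    """events: list of (event_id, estimated_workload). Returns the
    short-job-first processing order and the resulting average response
    time, where an event's response time is its completion time."""
    order = sorted(events, key=lambda e: e[1])   # short job first
    t, total = 0.0, 0.0
    for _, w in order:
        t += w                                   # completion time so far
        total += t
    return [e for e, _ in order], total / len(events)

# Usage: SJF beats arrival order for these three simultaneous events
# (average 6.0 vs. 32/3 ~ 10.7 in arrival order).
evts = [("e1", 9.0), ("e2", 1.0), ("e3", 3.0)]
print(schedule(evts))   # (['e2', 'e3', 'e1'], 6.0)
```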
Filter Indexing: A Scalable Solution to Large Subscription Based Systems

Filtering is a popular approach to reducing information traffic in subscription-based systems, especially when most subscribers are located on mobile devices. The filters are placed between the subscribers and the subscription server. With an increasing number of filters from subscribers, filtering may become a bottleneck and challenge the scalability of such systems. In this paper, we propose to use filter indexing to solve this problem. We propose and study four filter-indexing schemes: the Ad-Hoc Indexing Scheme (AIS), the Group Indexing Scheme (GIS), the Group-Sort Indexing Scheme (GSIS) and the B' Tree Indexing Scheme (BTIS). We evaluate the performance of these four indexing schemes with respect to scalability and other factors. Among the proposed schemes, we find that GSIS is the most efficient indexing scheme for searching, while BTIS has the best performance for updating and inserting filters.

Wanxia Xie, Shamkant B. Navathe, Sushil K. Prasad
Caching Strategies for Push-Based Broadcast Considering Consecutive Data Accesses with Think-Time

Recently, there has been increasing interest in research on push-based information systems that deliver data by broadcast in both wired and wireless environments. This paper proposes new caching strategies to reduce the response time of data access by assuming an environment where clients consecutively issue access requests for multiple data items with think time. The proposed strategies take into account each client’s access characteristics, such as correlation among data items and think-time between a data access and the next access request, and reduce the average response time by caching data items with long expected response time. Moreover, we evaluate the effectiveness of the proposed strategies by simulation experiments.

Wataru Uchida, Takahiro Hara, Shojiro Nishio

Extending XML

XDO2: A Deductive Object-Oriented Query Language for XML

In the past decade, researchers have combined deductive and object-oriented features to produce systems that are powerful and have excellent modeling capabilities. More recently, an XML query language, XTree, was proposed. Queries written in XTree are more compact, more convenient to write and easier to understand than queries written in XPath. In this paper, we introduce a novel XML query language, XDO2, that extends XTree with deductive features such as deductive rules and negation, and object-oriented features such as inheritance and methods. Our XDO2 language is more compact and convenient to use than current query languages for XML such as XQuery and XPath because it is based on XTree and supports (recursive) deductive rules and the not-predicate. An XDO2 database example is given to motivate the usefulness of the language. The formal treatment of the language's syntax and semantics is presented in the appendices.

Wei Zhang, Tok Wang Ling, Zhuo Chen, Gillian Dobbie
Extending XML with Nonmonotonic Multiple Inheritance

Schema descriptions of XML documents become more and more complicated, and schema documents become longer and longer, as the structure of XML documents becomes more and more complex. This is mainly because they cannot take full advantage of object-oriented modeling abilities. In this paper, we extend XML as follows to solve this problem. (1) We extend DTD's type system to provide richer built-in types. Moreover, a user-defined type can be declared using the ISA mechanism, in which an existing type is used as the base type and the set of values represented by the derived type is a subset of the values represented by the base type. (2) We extend DTD so that elements can be global as well as local. (3) We extend DTD with an element hierarchy with nonmonotonic inheritance to support super-element/sub-element relationships, overriding of elements or attributes inherited from super-elements, blocking of the inheritance of elements or attributes from super-elements, and conflict handling. (4) We extend XML with polymorphism, a fundamental feature of object-oriented data models, to support polymorphic elements, typing of references and polymorphic references. Although we extend DTD to support some key object-oriented features, no change to the syntax of XML documents is needed to fit our Extended DTD.

Guoren Wang, Mengchi Liu
Database Design with Equality-Generating Dependencies

In relational database systems, traditional normalization techniques (e.g., BCNF, 4NF) remove data redundancies from a single relation, but cannot detect and remove redundancies across multiple relations. However, redundancies among multiple relations are abundant, especially in integrated databases. In this paper, we first propose to detect such data redundancies using equality-generating dependencies (EGDs) and propose an extended normal form (ENF) of relational database schema with respect to EGDs. We show that a database has no potential data redundancies with respect to EGDs if and only if the schema is in ENF. For a common special class of EGDs, we provide a set of sound and complete inference rules. A normalization process is presented to losslessly transform a relational database schema into one in ENF. We then extend our EGDs and ENF to XML data, and show how similar data redundancy problems can be detected for data-centric XML documents.

Junhu Wang

Web Services

WDEE: Web Data Extraction by Example

Web data extraction systems in use today transform semi-structured Web documents and deliver structured documents to end users. Some systems provide a visual interface to users to generate the extraction rules. However, to end users, the visual effect of Web documents is lost during the transformation process. In this paper, we propose an approach that allows a user to query extracted documents without knowledge of formal query language. We bridge the gap between visual effect of Web documents and structured documents extracted by providing a QBE-like (Query by Example) interface named

Wdee

. The principle component of our method is the notion of a document schema. Document

schemata

are patterns of structures embedded in documents.

Wdee

generates tree skeletons based on schema information and a user may execute queries by input condition in the skeltons. By maintaining the mapping relation among schemata of Web documents and extracted documents, a visual example may be presented to end users. With the example,

Wdee

allows a user to construct tree skeletons in a manner that resembles the browsing of Web pages.

Zhao Li, Wee Keong Ng
Concept-Based Retrieval of Alternate Web Services

Web services have attracted much attention in recent years with the development of e-commerce technologies over the Internet. Although there are some standards and protocols for Web service technologies, such as WSDL, UDDI and SOAP, the core technologies underlying Web services need further study in order to make them practical and flexible. Efficient service management is the main task for service execution and service composition, and no good solution exists to date. In this paper, we present a concept-based method for service management, which is efficient for service selection and substitution. Our method takes advantage of lattices and retrieves the optimal alternates for a given Web service efficiently by employing formal concept analysis and concept lattices. Compared with previous methods, this method is more efficient and accurate because the underlying semantics of Web services and users' requirements are exploited during retrieval. Experimental results also verify the efficiency and scalability of our approach.

Dunlu Peng, Sheng Huang, Xiaoling Wang, Aoying Zhou
WSQuery: XQuery for Web Services Integration

Web services integration is one of the key issues in many B2B applications. Yet, current approaches to this problem tend to use a general-purpose programming language, thus incurring type mismatches between the type system of XML Schema and the object-oriented type system. In this paper, we present a uniform method for integrating Web services. We propose an extension of XQuery, referred to as WSQuery, which can contain Web service calls. We first present the conceptual evaluation strategy of WSQuery programs. Then, to speed up the evaluation, we propose to schedule Web service calls to exploit parallelism. We carry out dependency analysis to determine the dependency relations among Web service calls. The for loops, if branches and unknown parameters pose particular challenges for dependency analysis and scheduling; we use unfolding techniques to deal with them.

Zhimao Guo, Xiaoling Wang, Aoying Zhou

High-Dimensional Indexing

A New Indexing Method for High Dimensional Dataset

Indexing high dimensional datasets has attracted extensive attention from many researchers in the last decade. Since R-tree-type index structures are known to suffer from the "curse of dimensionality", Pyramid-tree-type index structures, which are based on the B-tree, have been proposed to break the curse. However, the number of pyramids is often insufficient to discriminate data points when the number of dimensions is high, and the effectiveness of such structures degrades dramatically as dimensionality increases. In this paper, we focus on one particular aspect of the "curse of dimensionality": the surface of a hypercube in a high dimensional space approaches 100% of the total hypercube volume as the number of dimensions approaches infinity. We propose a new indexing method based on this surface property. We prove that the Pyramid-tree technique is a special case of our method. The results of our experiments demonstrate the clear superiority of our novel method.

Jiyuan An, Yi-Ping Phoebe Chen, Qinying Xu, Xiaofang Zhou
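The surface-concentration claim admits a one-line derivation. In a unit hypercube, the points farther than ε from every face form an inner cube of side 1 − 2ε, so the fraction of the volume near the surface is:

```latex
% Fraction of a unit hypercube's volume within distance \epsilon
% of its surface, as a function of dimensionality d:
V_{\mathrm{surface}}(d,\epsilon) \;=\; 1 - (1 - 2\epsilon)^{d}
\;\xrightarrow[d\to\infty]{}\; 1 .
% Example: for \epsilon = 0.05 and d = 100,
% (1 - 2\epsilon)^{d} = 0.9^{100} \approx 2.7 \times 10^{-5},
% so about 99.997\% of the volume lies near the surface.
```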
BM+-Tree: A Hyperplane-Based Index Method for High-Dimensional Metric Spaces

In this paper, we propose a novel high-dimensional index method, the BM+-tree, to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data partitioning efficiency in a high-dimensional space by using a rotary binary hyperplane, which further partitions a subspace and can also take advantage of the twin-node concept used in the M+-tree. Compared with the key dimension concept in the M+-tree, the binary hyperplane is more effective in data filtering. High space utilization is achieved by dynamically performing data reallocation between twin nodes. In addition, a post-processing step is used after index building to ensure effective filtration. Experimental results using two types of real data sets illustrate a significantly improved filtering efficiency.

Xiangmin Zhou, Guoren Wang, Xiaofang Zhou, Ge Yu
Approaching the Efficient Frontier: Cooperative Database Retrieval Using High-Dimensional Skylines

Cooperative database retrieval is a challenging problem: top-k retrieval delivers manageable results only when a suitable compensation function (e.g. a weighted mean) is explicitly given. On the other hand, skyline queries offer intuitive querying to users, but result set sizes grow exponentially and hence can easily exceed manageable levels. We show how to combine the advantages of skyline queries and top-k retrieval in an interactive query processing scheme that uses user feedback on a manageable, representative sample of the skyline set to derive the most adequate weightings for subsequent focused top-k retrieval. Hence, each user's information needs are conveniently and intuitively obtained, and only a limited set of best matching objects is returned. We demonstrate our scheme's efficient performance, manageable result sizes, and the representativeness of the skyline sample. We also show how to effectively estimate users' compensation functions using their feedback. Our approach thus paves the way to intuitive and efficient cooperative retrieval with vague query predicates.

Wolf-Tilo Balke, Jason Xin Zheng, Ulrich Güntzer
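For reference, the skyline (Pareto) semantics used here can be stated in a few lines of naive pairwise-dominance code; production skyline algorithms, and the paper's sampling and feedback machinery, are far more refined.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension (smaller is
    better) and strictly better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Keep exactly the points dominated by no other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Usage: hotels as (price, distance); dominated options drop out.
print(skyline([(50, 8), (60, 5), (70, 4), (80, 5)]))
# -> [(50, 8), (60, 5), (70, 4)]; (80, 5) is dominated by (60, 5)
```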

Sensor and Stream Data Processing

False-Negative Frequent Items Mining from Data Streams with Bursting

False-negative frequent items mining from a high-speed transactional data stream is to find an approximate set of frequent items with respect to a minimum support threshold s. It controls the possibility of missing frequent items using a reliability parameter δ. The importance of false-negative frequent items mining is that it can exclude false positives and therefore significantly reduce the memory consumption of frequent itemset mining. The key issue in false-negative frequent items mining is how to minimize the possibility of missing frequent items. In this paper, we propose a new false-negative frequent items mining algorithm, called Loss-Negative, for handling bursts in data streams. The new algorithm consumes the least memory in comparison with other false-negative and false-positive frequent items algorithms. We present a theoretical bound for the new algorithm, and analyze how the possibility of missing frequent items is minimized in terms of two possibilities, namely the in-possibility and the out-possibility. The former concerns how a frequent item can possibly pass the first pruning. The latter concerns how long a frequent item can stay in memory while no occurrences of the item arrive in the data stream for a certain period. The newly proposed algorithm is superior to existing false-negative frequent items mining algorithms in terms of these two possibilities. We demonstrate the effectiveness of the new algorithm in this paper.

Zhihong Chong, Jeffrey Xu Yu, Hongjun Lu, Zhengjie Zhang, Aoying Zhou
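For contrast with the false-negative approach above, the classic false-positive baseline (lossy counting) can be sketched as follows: it never misses a frequent item but pays for that guarantee in memory, which is the cost that δ-controlled false-negative algorithms attack. This is a textbook sketch, not the paper's Loss-Negative algorithm.

```python
def lossy_counting(stream, epsilon):
    """Maintain per-item counts whose error is at most epsilon * N,
    where N is the number of items seen so far."""
    counts, deltas, bucket, n = {}, {}, 1, 0
    width = int(1 / epsilon)                     # bucket width
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:                       # end of bucket: prune
            for k in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[k], deltas[k]
            bucket += 1
    return counts, n

# Usage: report items with count >= (s - epsilon) * n for support s;
# every truly s-frequent item is guaranteed to be reported.
counts, n = lossy_counting("abacabadabra" * 50, epsilon=0.01)
s = 0.2
print({k: c for k, c in counts.items() if c >= (s - 0.01) * n})
```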
Adaptively Detecting Aggregation Bursts in Data Streams

Finding bursts in data streams is attracting much attention in the research community due to its broad applications. Existing burst detection methods suffer from the following problems: (1) the window size and absolute burst threshold, which are hard to determine a priori, must be given in advance; (2) only one-sided bursts, i.e. either increasing or decreasing bursts, can be detected; (3) bumps, which are changes in aggregation data caused by noise, are often reported as bursts, and the disturbance of bumps costs much effort in the subsequent exploration of mining results. In this paper, a general burst model is introduced to overcome these three problems. We develop an efficient algorithm for detecting adaptive aggregation bursts in a data stream given a burst ratio. With the help of a novel inverted histogram, the statistical summary is compressed to fit in limited main memory, so that bursts over windows of any length can be detected accurately and efficiently on-line. Theoretical analysis shows that the space and time complexity bounds of this method are relatively good, while experimental results demonstrate the applicability and efficiency of our algorithm in different application settings.

Aoying Zhou, Shouke Qin, Weining Qian
Communication-Efficient Implementation of Join in Sensor Networks

A sensor network is a wireless ad hoc network of resource-constrained sensor nodes. In this article, we address the problem of communication-efficient implementation of the SQL “join” operator in sensor networks. We design an optimal join-implementation algorithm that provably incurs minimum communication cost under certain reasonable assumptions. In addition, we design a much faster suboptimal heuristic that empirically delivers a near-optimal solution. We evaluate the performance of our designed algorithms through extensive simulations.

Vishal Chowdhary, Himanshu Gupta

Database Performance Issues

Zoned-RAID for Multimedia Database Servers

This paper proposes a novel fault-tolerant disk subsystem named Zoned-RAID (Z-RAID). Z-RAID improves the performance of the traditional RAID system by utilizing the zoning property of modern disks, which provides multiple zones with different data transfer rates within a disk. This study proposes to optimize the data transfer rate of the RAID system by constraining the placement of data blocks in multi-zone disks. We apply Z-RAID to multimedia database servers, such as video servers, that require a high data transfer rate as well as fault tolerance. Our analytical and experimental results demonstrate the superiority of Z-RAID over conventional RAID. Z-RAID provides a higher effective data transfer rate in normal mode with no disadvantage. In the presence of a disk failure, Z-RAID still performs as well as RAID.

Ali E. Dashti, Seon Ho Kim, Roger Zimmermann
Randomized Data Allocation in Scalable Streaming Architectures

IP-networked streaming media storage has been increasingly used as part of many applications. Random placement of data blocks has proven to be an effective approach to balancing heterogeneous workloads in multi-disk streaming architectures. However, the main disadvantage of this technique is that statistical variation can still result in short-term load imbalances in disk utilization. We propose a packet level randomization (PLR) technique to solve this challenge. We quantify the exact performance trade-off between the PLR approach and the traditional block level randomization (BLR) technique through both theoretical analysis and extensive simulation. Our results show that the PLR technique can achieve much better load balancing in scalable streaming architectures by using more memory space.

Kun Fu, Roger Zimmermann
Trace System of iSCSI Storage Access and Performance Improvement

In this paper, an IP-SAN access trace method is proposed and its evaluation is presented. IP-SAN and iSCSI are expected to remedy the problems of Fibre Channel (FC)-based SANs. Servers and storage work cooperatively, communicating through TCP/IP in an IP-SAN system, so an integrated analysis of both sides is significant for achieving better performance.

Our system can precisely point out the cause of performance degradation when IP-SAN is used for remote storage access. In experiments with parallel iSCSI access in a high-latency network, the total performance is limited by a parameter in an implementation of the SCSI layer in the iSCSI protocol stack. Based on the results obtained with our IP-SAN access trace system, the parameter in that layer is modified. As a result, a performance improvement of more than 30 times is achieved compared with the default-value case. It is thus effective to monitor all the layers in the iSCSI protocol stack and perform an integrated analysis using our system.

Saneyasu Yamaguchi, Masato Oguchi, Masaru Kitsuregawa
CoCache: Query Processing Based on Collaborative Caching in P2P Systems

In this paper, we propose CoCache, a P2P query processing architecture that enables sophisticated optimization techniques. CoCache differs from existing P2P query processing systems in three ways. First, a coordinator overlay network (CON) maintaining a summary of the whole system is constructed by applying DHT techniques to query plan trees. The CON protocol ensures efficiency in handling dynamic environments. Second, a preliminary cost-based optimization technique for retrieving appropriate cached copies of data is studied. With the help of the CON, we show the possibility of fine-grained optimization even in large-scale and dynamic environments. Third, a collaborative caching strategy is presented, with which even a small portion of cache storage on each peer may result in a great improvement in query processing performance. Extensive experiments over real-world and synthetic settings show the effectiveness and efficiency of CoCache.

Weining Qian, Linhao Xu, Shuigeng Zhou, Aoying Zhou

Clustering, Classification and Data Warehouses

Multi-represented kNN-Classification for Large Class Sets

The amount of stored information in modern database applications has increased tremendously in recent years. Besides their sheer number, the stored data objects are also more and more complex. Therefore, classification of these complex objects is an important data mining task that yields several new challenges. In many applications, the data objects provide multiple representations: e.g. proteins can be described by text, amino acid sequences or 3D structures. Additionally, many real-world applications need to distinguish thousands of classes. Last but not least, many complex objects are not directly expressible as feature vectors. To cope with all these requirements, we introduce a novel approach to the classification of multi-represented objects that is capable of distinguishing large numbers of classes. Our method is based on k-nearest-neighbor classification and employs density-based clustering as a new approach to reducing the training instances for instance-based classification. To predict the most likely class, our classifier employs a new method to use several object representations for making accurate class predictions. The introduced method is evaluated by classifying proteins according to the classes of Gene Ontology, one of the most established class systems for biomolecules, comprising several thousand classes.

Hans-Peter Kriegel, Alexey Pryakhin, Matthias Schubert
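A minimal sketch of the multi-represented k-NN idea: per-representation distances are combined with fixed illustrative weights before voting. The paper's combination method and its clustering-based reduction of training instances are more sophisticated; all names and weights here are assumptions.

```python
from collections import Counter

def knn_multi(query_reps, train, k=3, weights=(0.5, 0.5)):
    """train: list of (representations, label), where representations
    is a tuple of feature vectors, one per representation."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    scored = []
    for reps, label in train:
        # Weighted sum of per-representation Euclidean distances.
        d = sum(w * dist(q, r)
                for w, q, r in zip(weights, query_reps, reps))
        scored.append((d, label))
    top = sorted(scored)[:k]                    # k nearest neighbors
    return Counter(label for _, label in top).most_common(1)[0][0]

# Usage: two representations per object, two classes.
train = [(((0, 0), (1, 0)), "A"), (((0, 1), (1, 1)), "A"),
         (((5, 5), (9, 9)), "B"), (((6, 5), (8, 9)), "B")]
print(knn_multi(((0, 0.5), (1, 0.5)), train, k=3))  # -> "A"
```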
Enhancing SNNB with Local Accuracy Estimation and Ensemble Techniques

Naïve Bayes, the simplest Bayesian classifier, has shown excellent performance given its unrealistic independence assumption. This paper studies selective neighborhood-based naïve Bayes (SNNB) for lazy classification, and develops three variant algorithms, SNNB-G, SNNB-L, and SNNB-LV, all with linear computational complexity. The SNNB algorithms use a local learning strategy to alleviate the independence assumption. The underlying idea is, for a test example, first to construct multiple classifiers on its multiple neighborhoods with different radii, and then to select the classifier with the highest estimated accuracy to make the decision. Empirical results show that both SNNB-L and SNNB-LV generate more accurate classifiers than naïve Bayes and several other state-of-the-art classification algorithms, including C4.5, Naïve Bayes Tree, and Lazy Bayesian Rule. The SNNB-L and SNNB-LV algorithms are also computationally more efficient than the Lazy Bayesian Rule algorithm, especially on domains with high dimensionality.

Zhipeng Xie, Qing Zhang, Wynne Hsu, Mong Li Lee
MMPClust: A Skew Prevention Algorithm for Model-Based Document Clustering

To support very high dimensionality, model-based clustering is an intuitive choice for document clustering. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrade the quality of clustering. In this paper, the causes of skew are examined and determined to be an inappropriate initial model, the unfitness of the cluster model, and the interaction between the decentralization of estimation samples and an over-generalized cluster model. This paper proposes a skew-prevention document-clustering algorithm (MMPClust), which has two features: (1) a content-based cluster model is used to model each cluster better; (2) at the re-estimation step, the documents most relevant to the corresponding class are selected automatically for each cluster as the estimation samples to break this interaction. MMPClust has fewer restrictions and broader applicability in document clustering than previous methods.

Xiaoguang Li, Ge Yu, Daling Wang
Designing and Using Views to Improve Performance of Aggregate Queries (Extended Abstract)

Data-intensive systems routinely use derived data (e.g., indexes or materialized views) to improve query-evaluation performance. We present a system architecture for Query-Performance Enhancement by Tuning (QPET), which combines the design and use of derived data in an end-to-end approach to automated query-performance tuning. Our focus is on a tradeoff between (1) the amount of system resources spent on designing derived data and on keeping the data up to date, and (2) the degree of the resulting improvement in query performance. From the technical point of view, the novelty that we introduce is the combination of aggregate query rewriting techniques [1, 2] and view selection techniques [3] to achieve our goal.

Foto Afrati, Rada Chirkova, Shalu Gupta, Charles Loftis
Large Relations in Node-Partitioned Data Warehouses

A cheap shared-nothing context can provide significant speedup on large data warehouses, but partitioning and placement decisions are important in such systems, as repartitioning requirements can result in much less-than-linear speedup. This problem can be minimized if query workload and schemas are inputs to placement decisions. In this paper we analyze, with the help of a cost model, the problem of handling large relations in a node-partitioned data warehouse (NPDW) with a basic placement strategy that partitions facts horizontally and replicates dimensions. We then propose a strategy to improve performance and show both analytical and TPC-H results.

Pedro Furtado

Data Mining and Web Data Processing

Mining Frequent Tree-Like Patterns in Large Datasets

In this paper, we propose a novel data mining scheme to explore frequent hierarchical structure patterns, named tree-like patterns, using the relationship of each item on a sequence. With tree-like patterns, we can clearly identify the cause-and-effect relations between items. Finally, we compare the characteristics of our mined patterns with those of others. As a consequence, our tree-like patterns can be widely used to explore a variety of different applications.

Tzung-Shi Chen, Shih-Chun Hsu
An Efficient Approach for Mining Fault-Tolerant Frequent Patterns Based on Bit Vector Representations

In this paper, an algorithm called VB-FT-Mine (Vectors-Based Fault-Tolerant frequent patterns Mining) is proposed for mining fault-tolerant frequent patterns efficiently. In this approach, fault-tolerant appearing vectors are designed to represent the distribution of the candidate patterns contained in data sets with fault tolerance. The VB-FT-Mine algorithm applies a depth-first pattern-growing method to generate candidate patterns. The fault-tolerant appearing vectors of candidates are obtained systematically, and the algorithm quickly decides whether a candidate is a fault-tolerant frequent pattern by performing vector operations on bit vectors. The experimental results show that the VB-FT-Mine algorithm significantly outperforms the previously proposed FT-Apriori algorithm in execution time.

Jia-Ling Koh, Pei-Wy Yo
NNF: An Effective Approach in Medicine Paring Analysis of Traditional Chinese Medicine Prescriptions

Medicine Paring Analysis is one of the most important tasks in the research of Traditional Chinese Medicine prescriptions. The most essential and difficult step is to mine associations between different medicine items. This paper proposes an effective approach to solving this problem. The main contributions include: (1) proposing a novel data structure called the indexed frequent pattern tree (IFPT) to maintain the mined frequent patterns; (2) presenting an efficient algorithm called Nearest Neighbor First (NNF) to mine association rules from the IFPT; (3) designing and implementing two optimization strategies that avoid examining many subsets X of Y that cannot be the left part of any association rule of the form X ⇒ Y − X, thus achieving excellent performance; and (4) conducting extensive experiments which show that NNF runs far faster than the Apriori algorithm and has better scalability. Finally, we demonstrate the effectiveness of this method in Medicine Paring Analysis.

Li Chuan, Tang Changjie, Peng Jing, Hu Jianjun, Jiang Yongguang, Yong Xiaojia
From XML to Semantic Web

The present web exists in HTML and XML formats for people to browse. Recently there has been a trend towards the semantic web, where information can be processed and understood by agents. Most present research focuses on the translation from HTML to the semantic web, but seldom on XML. In this paper, we design a method to translate XML to the semantic web. Since ontologies play an important role in the semantic web, we first propose a generic model to organize ontologies; based on this model, we use three steps, viz. semantic translation, structural translation and schematic translation, to translate XML to the semantic web. The aim of the paper is to facilitate the wide use of the semantic web.

Changqing Li, Tok Wang Ling
A Hybrid Approach for Refreshing Web Page Repositories

Web pages change frequently and thus crawlers have to download them often. Various policies have been proposed for refreshing local copies of web pages. In this paper, we introduce a new sampling method that outperforms other change detection methods in our experiments. Change Frequency (CF) is a method that predicts how often pages change and, in the long run, achieves optimal efficiency in comparison with the sampling method. Here, we propose a new hybrid method that combines our new sampling approach with CF and show how the hybrid method improves the efficiency of change detection.

M. Ghodsi, O. Hassanzadeh, Sh. Kamali, M. Monemizadeh
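
For intuition, the following is a common Poisson-based change-frequency estimator in the spirit of the CF method (a standard estimator, not necessarily the exact formula used in the paper): from n revisits of a page, x of which found it changed, estimate the change rate and derive a revisit interval.

    import math

    def estimate_change_rate(n, x, visit_interval_days):
        # Bias-reduced estimator for the rate of a Poisson change process.
        return -math.log((n - x + 0.5) / (n + 0.5)) / visit_interval_days

    rate = estimate_change_rate(n=30, x=12, visit_interval_days=1.0)
    print(f"~{rate:.2f} changes/day -> revisit every {1 / rate:.1f} days")
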
Schema Driven and Topic Specific Web Crawling

We propose a new approach to discovering and extracting topic-specific hypertext resources from the WWW. The method, called schema-driven topical crawling, allows a user to define a schema and extraction rules for a specific domain of interest, and automatically searches for and extracts schema-relevant web pages. Unlike common approaches that surf solely on web pages, our approach lets the crawler surf on a virtual network composed of concept instances and relationships. To achieve this goal, we design an architecture that integrates several techniques, including a web extractor, a meta-search engine and query expansion, and provide a toolkit to support it.

Qi Guo, Hang Guo, Zhiqiang Zhang, Jing Sun, Jianhua Feng

Moving Object Databases

Towards Optimal Utilization of Main Memory for Moving Object Indexing

In moving object databases, existing disk-based indexes are unable to keep up with the high update rate while providing speedy retrieval at the same time. However, efficient management of moving-object databases can be achieved through aggressive use of main memory. In this paper, we propose an Integrated Memory Partitioning and Activity Conscious Twin-index (IMPACT) framework in which the moving object database is indexed by a pair of indexes based on the properties of the objects' movement: a main-memory structure manages active objects while a disk-based index handles inactive objects. As objects become active (or inactive), they dynamically migrate from one structure to the other. Moreover, the main memory is also organized into two partitions, one for the main-memory index and the other as buffers for the frequently accessed nodes of the disk-based index. Our experimental study shows that the IMPACT framework provides superior performance.

Bin Cui, Dan Lin, Kian-Lee Tan
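
The migration behaviour of the twin-index can be sketched as follows (an illustration of the idea, not the IMPACT implementation; the activity threshold and the dictionary stand-ins for the two indexes are invented):

    class TwinIndex:
        ACTIVE_UPDATES_PER_MIN = 5       # hypothetical activity threshold

        def __init__(self):
            self.memory = {}             # main-memory index: active objects
            self.disk = {}               # stand-in for the disk-based index

        def update(self, oid, pos, updates_per_min):
            active = updates_per_min >= self.ACTIVE_UPDATES_PER_MIN
            src, dst = ((self.disk, self.memory) if active
                        else (self.memory, self.disk))
            src.pop(oid, None)           # migrate if it sat in the other index
            dst[oid] = pos

    idx = TwinIndex()
    idx.update("car1", (3, 4), updates_per_min=10)   # active -> memory
    idx.update("car1", (5, 6), updates_per_min=1)    # cools down -> disk
    print(idx.memory, idx.disk)
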
Aqua: An Adaptive QUery-Aware Location Updating Scheme for Mobile Objects

Conventionally, the problem of location updates for moving objects has been addressed by adjusting the location reporting frequency or setting an uncertainty bound according to object mobility patterns. This induces an obvious tradeoff between the communication cost and the uncertainty bound in querying moving object locations. Most existing works focus on the object mobility pattern without exploring the interdependency between queries and location updates. Furthermore, they take the precision of query results for granted as a result of a negotiated deviation threshold for reporting. The Aqua (Adaptive QUery-Aware) location updating scheme proposed in this paper exploits the interdependency between queries and updates. In particular, our scheme is adaptive to changes in both object mobility patterns and query characteristics, thereby resulting in significant performance improvement in terms of communication cost and query processing precision. We performed simulation studies and demonstrated that Aqua produces desirable performance in most situations.

Jing Zhou, Hong Va Leong, Qin Lu, Ken C. K. Lee
A Spatial Index Using MBR Compression and Hashing Technique for Mobile Map Service

While the volumes of spatial data are tremendous and spatial operations are time-intensive, mobile devices have limited storage and low computational resources. Therefore, a spatial index for mobile map services should be small and should efficiently filter out the candidate objects of a spatial operation. This paper proposes a spatial index called MHF (Multilevel Hashing File) for mobile map services. The MHF has a simple structure for storage utilization and uses a hashing technique for search efficiency. This paper also designs a compression scheme for MBRs (Minimum Bounding Rectangles) called HMBR. Although the HMBR scheme reduces the volume of an MBR to almost a third, it still achieves good filtering efficiency because quantization loses no information for the small objects that make up the majority. Our experimental tests show that the proposed MHF with HMBR is appropriate for mobile devices in terms of index volume, the number of MBR comparisons, filtering efficiency and the execution time of spatial operations.

Jin-Deog Kim, Sang-Ho Moon, Jin-Oh Choi
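
The flavour of HMBR-style compression can be sketched as follows (the byte width, rounding rules and cell layout here are assumptions, not the paper's exact scheme): each MBR coordinate is stored as a one-byte offset within an enclosing cell, rounded outward so the compressed MBR still contains the original and therefore remains a safe filter.

    import math

    def compress_mbr(mbr, cell_origin, cell_size):
        (x1, y1, x2, y2), (ox, oy) = mbr, cell_origin
        step = cell_size / 256.0          # 256 quantization steps -> 1 byte
        def q(v, origin, round_up):
            return (math.ceil if round_up else math.floor)((v - origin) / step)
        return (max(0, q(x1, ox, False)), max(0, q(y1, oy, False)),
                min(255, q(x2, ox, True)), min(255, q(y2, oy, True)))

    # Four floats shrink to four bytes, at the price of a slightly larger box.
    print(compress_mbr((10.2, 20.7, 11.9, 22.3), (0.0, 0.0), cell_size=100.0))
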

Temporal Databases

Indexing and Querying Constantly Evolving Data Using Time Series Analysis

This paper introduces a new approach for efficiently indexing and querying constantly evolving data. Traditional index structures suffer from frequent update costs and perform unsatisfactorily when data changes constantly. Existing approaches try to reduce index update cost by using a simple linear or recursive function to describe the data evolution; in many applications, however, the data evolution is far too complex to be accurately described by a simple function. We propose to treat each constantly evolving data item as a time series and use the ARIMA (Autoregressive Integrated Moving Average) methodology to analyze and model it. The model enables effective forecasts for the data. The index is built on the forecasting intervals: as long as the data changes within its corresponding forecasting interval, only its current value in the leaf node needs to be updated and no further change to the index structure is required. The model parameters and the index structure can be adjusted dynamically. Experiments show that the forecasting interval index (FI-Index) significantly outperforms traditional indexes in high-update environments.

Yuni Xia, Sunil Prabhakar, Jianzhong Sun, Shan Lei
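
The update rule around forecasting intervals can be sketched as below; note that a simple moving-average forecast stands in for the paper's ARIMA model, and the interval width is invented:

    def forecast_interval(history, width):
        mean = sum(history) / len(history)
        return mean - width, mean + width

    history, width = [20.1, 20.4, 19.8, 20.2], 1.0
    lo, hi = forecast_interval(history, width)
    for new_value in (20.6, 23.5):
        if lo <= new_value <= hi:
            print(f"{new_value}: update the leaf value only")
        else:
            print(f"{new_value}: outside [{lo:.1f}, {hi:.1f}] -> adjust index")
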
Mining Generalized Spatio-Temporal Patterns

Spatio-temporal databases offer a rich repository and opportunities to develop techniques for discovering new types of spatio-temporal patterns. In this paper, we introduce a new class of spatio-temporal patterns, called generalized spatio-temporal patterns, to describe the repeated sequences of events that occur within small neighbourhoods. Such patterns are crucial to the understanding of habitual patterns. To discover this class of patterns, we develop an algorithm, GenSTMiner, based on the pattern-growth approach, and introduce optimization techniques to reduce the number of candidates generated and minimize the size of the projected databases. Our performance study indicates that GenSTMiner is highly efficient and outperforms PrefixSpan.

Junmei Wang, Wynne Hsu, Mong Li Lee
Exploiting Temporal Correlation in Temporal Data Warehouses

Data is typically incorporated in a data warehouse in increasing order of time. Furthermore, the MOLAP data cube tends to be sparse because of the large cardinality of the time dimension. We propose an approach to improve the efficiency of range aggregate queries on MOLAP data cubes in a temporal data warehouse by factoring out the time-related dimensions. These time-related dimensions are handled separately to take advantage of the monotonic trend over time. The proposed technique captures local data trends with respect to time by partitioning data points into blocks, and then uses a perfect binary block tree as an index structure to achieve logarithmic time complexity for both incremental updates and data retrievals. Experimental results establish the scalability and efficiency of the proposed approach on various datasets.

Ying Feng, Hua-Gang Li, Divyakant Agrawal, Amr El Abbadi
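
As a rough analogue of the index described above, a generic segment tree over time-ordered blocks also gives logarithmic incremental updates and range-sum queries over the time dimension (this standard structure stands in for the paper's perfect binary block tree; the data is invented):

    class SegTree:
        def __init__(self, n):           # n must be a power of two here
            self.n = n
            self.t = [0.0] * (2 * n)

        def update(self, i, delta):      # add delta at block i, O(log n)
            i += self.n
            while i:
                self.t[i] += delta
                i //= 2

        def query(self, lo, hi):         # sum over blocks [lo, hi), O(log n)
            s, lo, hi = 0.0, lo + self.n, hi + self.n
            while lo < hi:
                if lo & 1: s += self.t[lo]; lo += 1
                if hi & 1: hi -= 1; s += self.t[hi]
                lo //= 2; hi //= 2
            return s

    st = SegTree(8)
    for day, sales in enumerate([5, 3, 7, 2, 8, 1, 4, 6]):
        st.update(day, sales)
    print(st.query(2, 6))                # sales for days 2..5 -> 18.0
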

Semantics

Semantic Characterization of Real World Events

Reducing the latency of information delivery in an event-driven world has always been a challenge. It is often necessary to completely capture the attributes of events and the relationships between them, so that the retrieval of event-related information is efficient. In this paper, we discuss a formal system for representing and analyzing real world events to address these issues. The event representation discussed in this paper accounts for the important event attributes, namely time, space, and label. We introduce the notion of sequence templates, which not only provide event-related semantics but also help in semantically analyzing user queries. Finally, we discuss the design of our Query-Event Analysis System, an integrated system to (a) identify the best sequence template given a user query; (b) select events based on the best sequence template; and (c) determine content related to the selected events for delivery to users.

Aparna Nagargadde, Sridhar Varadarajan, Krithi Ramamritham
Learning Tree Augmented Naive Bayes for Ranking

Naive Bayes has been widely used in data mining as a simple and effective classification algorithm. Since its conditional independence assumption is rarely true, numerous algorithms have been proposed to improve naive Bayes, among which tree augmented naive Bayes (TAN) [3] achieves a significant improvement in terms of classification accuracy while maintaining efficiency and model simplicity. In many real-world data mining applications, however, an accurate ranking is more desirable than a classification, so it is interesting to ask whether TAN also achieves a significant improvement in terms of ranking, measured by AUC (the area under the Receiver Operating Characteristics curve) [8, 1]. Unfortunately, our experiments show that TAN performs even worse than naive Bayes in ranking. Responding to this fact, we present a novel learning algorithm, called forest augmented naive Bayes (FAN), by modifying the traditional TAN learning algorithm. We experimentally test our algorithm on all 36 data sets recommended by Weka [12] and compare it to naive Bayes, SBC [6], TAN [3], and C4.4 [10] in terms of AUC. The experimental results show that our algorithm significantly outperforms all the other algorithms in yielding accurate rankings. Our work provides an effective and efficient data mining algorithm for applications in which an accurate ranking is required.

Liangxiao Jiang, Harry Zhang, Zhihua Cai, Jiang Su
Finding Hidden Semantics Behind Reference Linkages: An Ontological Approach for Scientific Digital Libraries

The contents and topologies of inter-document linkages, such as citations and references among scientific literature, have received increasing research interest in recent years. Several technologies have been studied and applied to this meaningful information to improve the organization, analysis and evaluation of scientific digital libraries. In this paper, we present a CiteSeer-like system to access scientific papers in the computer science discipline via reference linking. Moreover, implicit semantics behind reference indices are mined and organized to improve the accessibility of scientific papers. To model scientific literature and their interlinked relationships, we develop a domain-specific ontology to analyze the contents and citation anchor contexts of scientific papers. In contrast to the abstract of a paper, which is written by the authors themselves, we introduce an automatic summary generation algorithm that creates objective descriptions from other scholars' perspectives based on the ontology. Semantic queries can also be posed to discover interesting patterns in scientific libraries, providing comprehensive and meaningful guidance for users.

Peixiang Zhao, Ming Zhang, Dongqing Yang, Shiwei Tang

XML Update and Query Patterns

Xandy: Detecting Changes on Large Unordered XML Documents Using Relational Databases

Previous work on change detection in XML documents is not suitable for large XML documents, as it requires a lot of memory to keep both versions of a document in memory. In this paper, we take a more conservative yet novel approach of using traditional relational database engines for detecting the changes to large unordered XML documents. We elaborate how we detect changes on unordered XML documents using a relational database. To this end, we have implemented a prototype system called Xandy that converts XML documents into relational tuples and detects the changes from these tuples by using SQL queries. Our experimental results show that the relational approach has better scalability than published algorithms such as X-Diff, while its result quality is comparable to that of X-Diff.

Erwin Leonardi, Sourav S. Bhowmick, Sanjay Madria
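
A minimal sketch of relational change detection in this spirit (the real system's schema and SQL are far more elaborate; the two-column shredding below is invented for the example): shred each XML version into tuples and let SQL report inserted and deleted leaves.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE v1(path TEXT, value TEXT);
        CREATE TABLE v2(path TEXT, value TEXT);
        INSERT INTO v1 VALUES ('/book/title', 'DASFAA'), ('/book/year', '2004');
        INSERT INTO v2 VALUES ('/book/title', 'DASFAA'), ('/book/year', '2005');
    """)
    inserted = con.execute("SELECT * FROM v2 EXCEPT SELECT * FROM v1").fetchall()
    deleted = con.execute("SELECT * FROM v1 EXCEPT SELECT * FROM v2").fetchall()
    print("inserted:", inserted, "deleted:", deleted)
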
FASST Mining: Discovering Frequently Changing Semantic Structure from Versions of Unordered XML Documents

In this paper, we present a FASST mining approach to extract frequently changing semantic structures (FASSTs), a subset of semantic substructures that change frequently, from versions of unordered XML documents. We propose a data structure, H-DOM+, and a FASST mining algorithm that incorporates semantics and takes advantage of related domain knowledge. The distinct feature of this approach is that the FASST mining process is guided by a user-defined concept hierarchy. Rather than mining all frequently changing structures, only those that are semantically meaningful are extracted. Our experimental results show that the H-DOM+ structure is compact and the FASST algorithm is efficient with good scalability. We also design a declarative FASST query language, FASSTQUEL, to make the FASST mining process interactive and flexible.

Qiankun Zhao, Sourav S. Bhowmick
Mining Positive and Negative Association Rules from XML Query Patterns for Caching

Recently, several approaches that mine frequent XML query patterns and cache their results have been proposed to improve query response time. However, the frequent XML query patterns mined by these approaches ignore the temporal sequence between user queries. In this paper, we take into account the temporal features of user queries to discover association rules, which indicate that when a user requests some information from the XML document, she/he will probably request some other information subsequently. We first cluster XML queries according to their semantics and then mine association rules between the clusters. Moreover, not only positive but also negative association rules are discovered to design an appropriate cache replacement strategy. The experimental results show that our approach considerably improves caching performance by significantly reducing query response time.

Ling Chen, Sourav S. Bhowmick, Liang-Tien Chia

Join Processing and View Management

Distributed Intersection Join of Complex Interval Sequences

In many different application areas, e.g. space observation systems or engineering systems of world-wide operating companies, there is a need for an efficient distributed intersection join in order to extract new and global knowledge. One way to carry out a global intersection join is to transmit all distributed information from the clients to a central server, leading to high transfer costs. In this paper, we present a new distributed intersection join for high-cardinality interval sequences which tries to minimize these transmission costs. Our approach is based on a suitable probability model for interval intersections which is used on the server as well as on the various clients. On the client sites, we group intervals together based on this probability model. These locally created approximations are sent to the server. The server ranks all intersecting approximations according to our probability model. As not all approximations have to be refined in order to decide whether two objects intersect, we fetch the exact information of the most promising approximations first. This strategy cuts down the transmission costs considerably, as shown by our experimental evaluation based on synthetic and real-world test data sets.

Hans-Peter Kriegel, Peter Kunath, Martin Pfeifle, Matthias Renz
Using Prefix-Trees for Efficiently Computing Set Joins

Joins on set-valued attributes (set joins) have numerous database applications. In this paper we propose PRETTI (PREfix Tree based seT joIn) – a suite of set join algorithms for containment, overlap and equality join predicates. Our algorithms use prefix trees and inverted indices. These structures are constructed on-the-fly if they are not already precomputed. This feature makes our algorithms usable for relations without indices and when joining intermediate results during join queries with more than two relations. Another feature of our algorithms is that results are output continuously during their execution and not just at the end. Experiments on real life datasets show that the total execution time of our algorithms is significantly less than that of previous approaches, even when the indices required by our algorithms are not precomputed.

Ravindranath Jampani, Vikram Pudi
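
The inverted-index half of the idea can be sketched as follows (the prefix-tree sharing of common prefixes, which PRETTI adds on top, is omitted for brevity; the data is invented): r joins s under the containment predicate when r's set is a subset of s's set.

    R = {"r1": {"a", "b"}, "r2": {"b", "c"}}
    S = {"s1": {"a", "b", "c"}, "s2": {"b", "d"}}

    inv = {}                          # item -> ids of the S-sets containing it
    for sid, items in S.items():
        for item in items:
            inv.setdefault(item, set()).add(sid)

    for rid, items in R.items():      # assumes every R-set is non-empty
        # A containing set must appear in the inverted list of every item.
        candidates = set.intersection(*(inv.get(item, set()) for item in items))
        for sid in sorted(candidates):
            print(rid, "is contained in", sid)
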
Maintaining Semantics in the Design of Valid and Reversible SemiStructured Views

Existing systems that support semistructured views do not maintain semantics during the process of designing views. Thus, there is no guarantee that the resulting views are valid and reversible. In this paper, we propose an approach to designing valid and reversible semistructured views. We employ four types of view operators, namely select, drop, join and swap, and develop a set of rules to maintain the semantics of the views when the swap operator is applied. We also examine the reversible view problem and develop rules to guarantee that the designed views are reversible. Finally, we examine the possible changes to the participation constraints of relationship types and propose rules to keep the participation constraints correct.

Ya Bing Chen, Tok Wang Ling, Mong Li Lee

Spatial Databases

DCbot: Finding Spatial Information on the Web

The WWW provides an overwhelming amount of information, which, when spatially indexed, can be a valuable additional data source for location-based applications. By manually building a spatial index, only a fraction of the available resources can be covered. This paper introduces a system for the automatic mapping of web pages to geographical locations. Our web robot uses several sets of domain-specific keywords, automatically learned lexical context rules, and a hierarchical catalogue of geographical locations that provides exact geographical coordinates. Spatially indexed web pages are used to construct Geographical Web Portals, which can be accessed by different location-based applications. In addition, we present experimental results demonstrating the quantity and the quality of automatically indexed web pages.

Mihály Jakob, Matthias Grossmann, Daniela Nicklas, Bernhard Mitschang
Improving Space-Efficiency in Temporal Text-Indexing

Support for temporal text-containment queries is of interest in a number of contexts. In previous papers we presented two approaches to temporal text-indexing, the V2X and ITTX indexes. In this paper, we first present improvements to these techniques. We then study the space usage of the indexing approaches based on both analytical models and results from indexing temporal text collections. The results show for which kinds of document collections the different techniques should be employed. They also show that, regarding space usage, the new ITTX/VIDPI technique proposed in this paper is in most cases superior to V2X, except for workloads with a high number of new documents relative to updated documents.

Kjetil Nørvåg, Albert Overskeid Nybø
Nearest Neighbours Search Using the PM-Tree

We introduce a method for searching the k nearest neighbours (k-NN) using the PM-tree. The PM-tree is a metric access method for similarity search in large multimedia databases. As an extension of the M-tree, the PM-tree structure exploits local dynamic pivots (as the M-tree does) as well as global static pivots (as used by LAESA-like methods). While in the M-tree a metric region is represented by a hyper-sphere, in the PM-tree the "volume" of a metric region is further reduced by a set of hyper-rings. As a consequence, the shape of a PM-tree metric region bounds the indexed objects more tightly, which in turn improves the overall search efficiency. Besides describing the PM-tree, we propose an optimal k-NN search algorithm. Finally, the efficiency of k-NN search is experimentally evaluated on large synthetic as well as real-world datasets.

Tomáš Skopal, Jaroslav Pokorný, Václav Snášel
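
The extra pruning power of the hyper-rings can be illustrated with the following test (notation assumed: a subtree stores, for a global pivot p, the interval [rmin, rmax] of distances from its objects to p):

    def ring_prunes(d_q_p, radius, rmin, rmax):
        # Objects within `radius` of the query q satisfy, by the triangle
        # inequality, d(q, p) - radius <= d(o, p) <= d(q, p) + radius; the
        # subtree can be skipped if that band misses the ring [rmin, rmax].
        return d_q_p + radius < rmin or d_q_p - radius > rmax

    print(ring_prunes(d_q_p=10.0, radius=2.0, rmin=3.0, rmax=7.0))  # True: skip
    print(ring_prunes(d_q_p=10.0, radius=4.0, rmin=3.0, rmax=7.0))  # False: visit
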

Enhancing Database Services

Deputy Mechanism for Workflow Views

Adapted from the concept of views in databases, workflow views are derived from workflows as a fundamental support for workflow inter-operability and visibility by external parties in an e-service environment. However, few works so far have focused on their realization mechanism, i.e. the communication between views and their source entities. In this paper, we extend the object deputy model to a workflow deputy model supporting the interaction of workflow views in a systematic way. In this workflow deputy model, we formally specify the deputy class and the deputy algebra for workflow classes. Following the process meta-model of XPDL, deputy operations are defined specifically for each kind of workflow component class. Based on this deputy mechanism, workflow views are presented in the form of deputy classes. Lastly, several modeling issues are discussed.

Zhe Shan, Qing Li, Yi Luo, Zhiyong Peng
Automatic Data Extraction from Data-Rich Web Pages

Extracting data from web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel technique for differentiating the roles of data items in Web pages, one of the key problems in our automatic extraction approach. The problem is resolved at various levels, namely semantic blocks, sections and data items, and several approaches are proposed to effectively identify the mapping between data items having the same role. Intensive experiments on real web sites show that the proposed technique can effectively help extract the desired data with high accuracy in most cases.

Dongdong Hu, Xiaofeng Meng
Customer Information Visualization via Customer Map

Many data mining techniques, which are non-visual methods, have proved their virtues on various kinds of customer data. However, visualization methods have hardly been applied to customer information, despite their ability to enable quick and easy knowledge discovery. In this paper, we propose a data visualization method for customer information using a customer map. To develop the customer map, we integrate customer data from numerous data sources, perform data analyses using data mining techniques, and finally visualize the information derived from those analyses. The customer map makes it possible to manage diverse and complex data sets under the unified goal of value creation through customers. It also affords quick observation of the current state and changes of the customer distribution, based on customer information, without preconceptions. We applied the customer map to a credit card company and suggest managerial implications drawn from the customer maps obtained from its data.

Ji Young Woo, Sung Min Bae, Chong Un Pyon, Sang Chan Park
Finding and Analyzing Database User Sessions

A database user session is a sequence of queries issued by a user (or an application) to achieve a certain task. Analysis of task-oriented database user sessions provides useful insight into the query behavior of database users. In this paper, we describe novel algorithms for identifying sessions from database traces and for grouping the sessions into different classes. We also present experimental results.

Qingsong Yao, Aijun An, Xiangji Huang

Recovery and Correctness

Time-Cognizant Recovery Processing for Embedded Real-Time Databases

Recovery processing in embedded real-time databases (ERTDBs) is more complex than in traditional databases. In this paper, the classifications and consistency constraints of data and transactions in embedded real-time databases are given first. Then, time-cognizant recovery principles for different classes of data and transactions are discussed. Based on these principles, a time-cognizant recovery scheme using real-time logging is presented, which is suitable for a class of embedded real-time database applications. Performance evaluations show that the suggested scheme outperforms traditional recovery techniques in two respects: the percentage of transactions missing their deadlines and the time during which the system denies service after a crash.

Guoqiong Liao, Yunsheng Liu, Yingyuan Xiao
An Efficient Phantom Protection Method for Multi-dimensional Index Structures

In order for a multi-dimensional index structure to be integrated into a commercial database system, efficient concurrency control techniques are necessary. The techniques must support all degrees of isolation offered by the database system. In particular, degree 3 isolation, called "no phantom read", protects search ranges from concurrent insertions and the rollbacks of deletions. In this paper, we propose a new phantom protection method for multi-dimensional index structures that uses a multi-level grid technique. The proposed mechanism is independent of the type of multi-dimensional index structure, i.e., it can be applied to all types, whether tree-based, file-based or hash-based. It also achieves low development cost and high concurrency with low lock overhead. Various experiments show that the proposed method outperforms existing phantom protection methods for multi-dimensional index structures.

Seok Il Song, Seok Jae Lee, Tae Ho Kang, Jae Soo Yoo
CMC: Combining Multiple Schema-Matching Strategies Based on Credibility Prediction

Schema matching is a key operation in data engineering, and combining multiple matching strategies is a very promising technique for it. To overcome the limitations of existing combination systems and to achieve better performance, in this paper the CMC system is proposed, which combines multiple matchers based on credibility prediction. We first predict the accuracy of each matcher on the current matching task and accordingly calculate each matcher's credibility. These credibilities are then used as weights in aggregating the matching results of the different matchers into a combined one. Our experiments on real world schemas validate the merits of our system.

KeWei Tu, Yong Yu
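
The aggregation step can be sketched as follows (the credibility-prediction step itself is replaced here by given weights, and all names and scores are invented): each matcher scores attribute pairs, and the scores are combined using the matchers' credibilities as weights.

    def combine(match_scores, credibilities):
        """match_scores: {matcher: {(attr_a, attr_b): score}}"""
        total = sum(credibilities.values())
        combined = {}
        for matcher, scores in match_scores.items():
            weight = credibilities[matcher] / total
            for pair, score in scores.items():
                combined[pair] = combined.get(pair, 0.0) + weight * score
        return combined

    scores = {"name_matcher": {("cust_name", "client"): 0.9},
              "type_matcher": {("cust_name", "client"): 0.4}}
    print(combine(scores, {"name_matcher": 0.8, "type_matcher": 0.2}))
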

XML Databases and Indexing

Translating XQuery to SQL Based on Query Forests

It is a difficult task to transform an XQuery posed against an XML view into SQL appropriate for the underlying relational schema. The main reason lies in the differences in schema modeling power and query syntax. In this paper, we propose to represent an XQuery as a set of for-trees and return-trees, collectively called query forests, based on the functionality of the path expressions specified in each clause. These trees capture the structural constraints of the input query and serve as the vehicle for combining the mapped SQL fragments into a complete SQL statement. A prototype has been implemented to show the effectiveness of the transformation process.

Ya-Hui Chang, Greg Liu, Sue-Shain Wu
A New Indexing Structure to Speed Up Processing XPath Queries

In this paper, the focus is on accelerating XPath location steps, in particular for evaluating regular path expressions with predicates, since they are a core component of many XML processing standards such as XSLT and XQuery. We present a new indexing structure, the Xp-tree, which is used to speed up the evaluation of XPath. Based on accelerating node access using a planar representation combined with a numbering scheme, we devise efficient derivative algorithms. Our experimental results demonstrate that the proposed method outperforms previous approaches that use an R-tree indexing mechanism for processing XML queries.

Jeong Hee Hwang, Van Trang Nguyen, Keun Ho Ryu
Translate Graphical XML Query Language to SQLX

Semi-structured data has attracted more and more attention with the emergence of XML, and it has aroused much enthusiasm for integrating XML and SQL in the database community. Due to the complexity of XQuery, graphical XML query languages have been developed to help users query XML data. In this paper, we propose a new XML-to-SQL solution based on ORA-SS, a rich semantic model for semi-structured data. We model the data with an ORA-SS schema and store it in an ORDB. Based on ORA-SS, we developed a graphical XML query language, GLASS, that expresses not only the query constraints and the reconstructed structure of the XML view but also the relational semantics in the XML view. This paper focuses on the translation algorithm from GLASS to SQLX, an XML extension of traditional SQL.

Wei Ni, Tok Wang Ling
GTree: An Efficient Grid-Based Index for Moving Objects

In mobile environments, efficiently tracking the changing positions of moving objects can substantially improve the quality of Location Based Services. This creates a high demand for indexes that support frequent updates. In this paper, we propose a novel grid-based index for moving objects, the GTree. Based on recursive partitioning of the space and lazy maintenance, the GTree maximizes the stability of the index structure while minimizing dynamic adjustment, and therefore significantly reduces the update overhead. Different from traditional top-down or bottom-up updates, we present a median-down approach, which effectively reduces the number of disk accesses. As an alternative, a bulk-loading technique is introduced. The experiments show that the GTree has good update performance as well as query performance.

Xiaoyuan Wang, Qing Zhang, Weiwei Sun
Adaptive Multi-level Hashing for Moving Objects

Although several sophisticated index structures for moving objects have been proposed, hashing methods based on a simple grid have been widely employed due to their simplicity. Since the performance of the hashing is largely affected by the size of a grid cell, the cell size should be chosen carefully with regard to the workload. In many real applications, however, the workload varies dynamically over time, for example the traffic at commuting time versus that at night. Basic hashing cannot handle this dynamic workload because the cell size cannot be changed during execution. In this paper, we propose adaptive multi-level hashing to support dynamic workloads efficiently. The proposed technique maintains two levels of hashes, one for fast-moving objects and the other for quasi-static objects. A moving object changes its level adaptively according to the degree of its movement.

Dongseop Kwon, Sangjun Lee, Wonik Choi, Sukho Lee
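
An illustrative two-level hash in this spirit (the cell sizes and the speed threshold are invented, and real systems hash into buckets rather than Python dicts): fast objects go to a coarse grid so that their frequent updates rarely change cells, quasi-static objects go to a fine grid for precise queries, and an object migrates when its speed crosses the threshold.

    FAST_CELL, SLOW_CELL, SPEED_THRESHOLD = 1000, 100, 5.0

    fast_level = {}    # oid -> coarse grid cell (fast-moving objects)
    slow_level = {}    # oid -> fine grid cell (quasi-static objects)

    def update(oid, x, y, speed):
        fast_level.pop(oid, None)      # drop any old entry from either level
        slow_level.pop(oid, None)
        if speed >= SPEED_THRESHOLD:
            fast_level[oid] = (int(x // FAST_CELL), int(y // FAST_CELL))
        else:
            slow_level[oid] = (int(x // SLOW_CELL), int(y // SLOW_CELL))

    update("bus7", 1530.0, 880.0, speed=12.0)   # fast -> coarse grid
    update("kiosk1", 1530.0, 880.0, speed=0.0)  # static -> fine grid
    print(fast_level, slow_level)
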
Backmatter
Metadata
Title
Database Systems for Advanced Applications
edited by
Lizhu Zhou
Beng Chin Ooi
Xiaofeng Meng
Copyright year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-32005-0
Print ISBN
978-3-540-25334-1
DOI
https://doi.org/10.1007/b107189
