
2004 | Book

Database Systems for Advanced Applications

9th International Conference, DASFAA 2004, Jeju Island, Korea, March 17-19, 2004. Proceedings

Edited by: YoonJoon Lee, Jianzhong Li, Kyu-Young Whang, Doheon Lee

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Papers

Genomic and Proteomic Databases and Applications: A Challenge for Database Technology

Biological science studies the phenomenon of life and encompasses an enormous variety of information. This wealth of information, generated, classified, and stored for centuries, has only recently become a major application of database technology. The first genome, that of the RNA bacteriophage MS2, was sequenced in 1976 in a truly heroic feat of direct determination of an RNA sequence (Fiers et al. 1976). But it was not until August 1995 that the complete genome sequence of the parasitic bacterium Haemophilus influenzae ushered in the era of real genomics, the study of complete genomes of cellular organisms (Fleischmann et al. 1995).

Shamkant B. Navathe, Upen Patil
Emergent Semantics Principles and Issues

Information and communication infrastructures underwent a rapid and extreme decentralization process over the past decade: from a world of statically and partially connected central servers rose an intricate web of millions of information sources loosely connected to one another. Today, we expect to witness the extension of this revolution with the wide adoption of meta-data standards like RDF or OWL underpinning the creation of a semantic web. Again, we hope for global properties to emerge from a multiplicity of pair-wise, local interactions, resulting eventually in a self-stabilizing semantic infrastructure. This paper represents an effort to summarize the conditions under which this revolution would take place, as well as an attempt to underline its main properties, limitations, and possible applications.

Karl Aberer, Philippe Cudré-Mauroux, Aris M. Ouksel, Tiziana Catarci, Mohand-Said Hacid, Arantza Illarramendi, Vipul Kashyap, Massimo Mecella, Eduardo Mena, Erich J. Neuhold, Olga De Troyer, Thomas Risse, Monica Scannapieco, Fèlix Saltor, Luca de Santis, Stefano Spaccapietra, Steffen Staab, Rudi Studer

Access Methods

Index Clustering for High-Performance Sequential Index Access

This paper presents an index clustering technique called segment-page clustering (SP-clustering). Relevant index pages are usually scattered widely across a disk due to dynamic page allocation, so many random disk accesses are required during query processing. The SP-clustering avoids this scattering by storing related nodes contiguously in a segment, i.e., a sequence of contiguous disk pages, and improves query performance by offering sequential disk access within a segment. A new cost model is also introduced to estimate the performance of the SP-clustering. It takes into account the physical adjacency of the pages read as well as the number of pages accessed. Experimental results demonstrate that the SP-clustering improves query performance by up to several times compared with traditional techniques in terms of total elapsed time.

Guang-Ho Cha
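A cost model of this kind distinguishes sequential from random page reads rather than merely counting pages. The toy model below only illustrates that idea; the constants and the formula are assumptions, not taken from the paper:

```python
# Illustrative I/O cost model: a random read pays a seek, a sequential read does not.
SEEK_MS = 10.0       # assumed average seek + rotational delay per run
TRANSFER_MS = 0.1    # assumed per-page transfer time

def io_cost(page_runs: list[int]) -> float:
    """page_runs: lengths of maximal runs of physically contiguous pages read.
    Each run costs one seek plus one transfer per page in the run."""
    return sum(SEEK_MS + TRANSFER_MS * run for run in page_runs)

# 100 scattered pages vs. the same 100 pages clustered in one segment:
scattered = io_cost([1] * 100)   # 100 seeks
clustered = io_cost([100])       # 1 seek, then sequential transfer
assert clustered < scattered
```

This is why a model based only on the number of pages accessed would rank both access patterns equally, while the elapsed time differs by a large factor.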
SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Statistical Accuracy Guarantee

Estimating the selectivity of a simple path expression (SPE) is essential for selecting the most efficient evaluation plans for XML queries. To estimate selectivity, we need an efficient and flexible structure to store a summary of the path expressions that are present in an XML document collection. In this paper we propose a new structure called SF-Tree to address the selectivity estimation problem. SF-Tree provides a flexible way for users to choose among accuracy, space requirement, and selectivity retrieval speed. It makes use of signature files to store the SPEs in a tree form to increase the selectivity retrieval speed and the accuracy of the retrieved selectivity. Our analysis shows that the probability of a selectivity estimation error decreases exponentially with respect to the error size.

Wai-Shing Ho, Ben Kao, David W. Cheung, YIP Chi Lap, Eric Lo
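Signature files classically encode a set of items by superimposed coding: each item hashes to a small bit mask, and the masks are OR-ed together, so containment can be tested bitwise with possible false positives but no false negatives. A minimal sketch of that classical idea (signature width, hash choice, and function names are illustrative assumptions, not the paper's):

```python
import hashlib

SIG_BITS = 64  # assumed signature width

def element_mask(name: str, k: int = 3) -> int:
    """Hash one path element to a k-bit mask (superimposed coding)."""
    mask = 0
    for i in range(k):
        h = hashlib.md5(f"{name}:{i}".encode()).digest()
        mask |= 1 << (int.from_bytes(h[:4], "big") % SIG_BITS)
    return mask

def path_signature(path: str) -> int:
    """OR together the masks of all elements in a simple path expression."""
    sig = 0
    for element in path.strip("/").split("/"):
        sig |= element_mask(element)
    return sig

def may_contain(sig: int, path: str) -> bool:
    """Bitwise containment test: no false negatives, possible false positives."""
    q = path_signature(path)
    return sig & q == q

doc_sig = path_signature("/book/author/name")
assert may_contain(doc_sig, "/book/author")   # a true subpath is always found
```

The accuracy/space/speed trade-off the abstract mentions corresponds to tuning parameters like the signature width and the number of bits set per element.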
UB-Tree Based Efficient Predicate Index with Dimension Transform for Pub/Sub System

For event filtering in publish/subscribe systems, significant research effort has been dedicated to techniques based on multiple one-dimensional indexes built on subscription attributes. Because such techniques are efficient only when the operators used in predicates are the equality operator (=) and the attributes used in subscriptions are fixed, the flexibility and expressiveness of publish/subscribe systems are limited. Event filtering on subscriptions that include not only the equality operator (=) but also inequality operators (<=, >=), without fixed attributes, is similar to querying a high-dimensional data space. In this paper, considering the dynamic maintenance and space efficiency of publish/subscribe systems, we propose an index structure for event filtering based on the UB-tree. By means of a dimension transform, event filtering is treated as a high-dimensional range query. The feasibility of the proposed index is evaluated in a simulated publish/subscribe environment. Results show that in almost all cases, the performance of our proposed index is four orders of magnitude faster than the counting algorithm. Because our index supports both the equality operator (=) and inequality operators (<=, >=), we conclude that our proposal is efficient and flexible for event filtering in publish/subscribe systems with a reasonable number of dimensions.

Botao Wang, Wang Zhang, Masaru Kitsuregawa
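UB-trees index points by their position on a space-filling curve (the Z-order, or Morton, curve), so a dimension transform of this kind reduces a multidimensional range query to a set of one-dimensional interval queries over the curve. An illustrative two-dimensional Z-address computation (a standard construction; the paper's exact transform may differ):

```python
def z_address(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a one-dimensional Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

# Nearby 2-D points tend to receive nearby one-dimensional keys:
assert z_address(0, 0) == 0
assert z_address(1, 0) == 1
assert z_address(0, 1) == 2
assert z_address(1, 1) == 3
```

A multidimensional range then maps to a union of Z-address intervals, which a one-dimensional structure such as a B-tree (the basis of the UB-tree) can scan efficiently.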

Query Processing in XML

Approximate Counting of Frequent Query Patterns over XQuery Stream

One efficient approach to improve the performance of XML management systems is to cache the frequently retrieved results. This entails the discovery of frequent query patterns that are issued by users. In this paper, we model user queries as a stream of XML query pattern trees and mine for frequent query patterns in a batch-wise manner. We design a novel data structure called D-GQPT to merge the pattern trees of the batches seen so far, and to dynamically mark the active portion of the current batch. With the D-GQPT, we are able to limit the enumeration of candidate trees to only the currently active pattern trees. We also design a summary data structure called ECTree to incrementally compute the frequent tree patterns over the query stream. Based on the above two constructs, we present a frequent query pattern mining algorithm called AppXQSMiner over the XML query stream. Experimental results show that the proposed approach is both efficient and scalable.

Liang Huai Yang, Mong Li Lee, Wynne Hsu
Counting Relaxed Twig Matches in a Tree

We consider the problem of accurately estimating the number of approximate XML answers for a given query, and propose an efficient method that (1) accurately computes selectivity estimates for each relaxed XML query, using a natural generalization of the correlated subpath tree (CST) summary structure, and (2) carefully combines these estimates by analyzing the nature of overlap between the different relaxed twig queries.

Dongwon Lee, Divesh Srivastava
XTree for Declarative XML Querying

How to query XML documents to extract and restructure information is an important issue in XML research. Currently, XQuery, based on XPath, is the most promising W3C standard. In this paper, we introduce a new set of syntax rules called XTree, which is a generalization of XPath. XTree has a tree structure, and a user can bind multiple variables in one XTree expression. It explicitly identifies list-valued variables and defines some natural built-in functions to manipulate them. XTree expressions can also be used in the result-construction part of a query, making it easy to read and comprehend. With these differences, XTree expressions are much more compact and more convenient to write and understand than XPath expressions. We also give algorithms to convert queries based on XTree expressions into standard XQuery queries.

Zhuo Chen, Tok Wang Ling, Mengchi Liu, Gillian Dobbie

Security and Integrity

On Addressing Efficiency Concerns in Privacy-Preserving Mining

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming than mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.

Shipra Agrawal, Vijay Krishnan, Jayant R. Haritsa
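Symbol-specific distortion of boolean market-basket data can be sketched as randomized response: each 1 or 0 is retained with its own probability, and true supports are later reconstructed by inverting the expected counts. The parameter values and helper names below are illustrative assumptions, not the paper's actual scheme:

```python
import random

def distort(row: list[int], p_keep_1: float = 0.9, p_keep_0: float = 0.9) -> list[int]:
    """Symbol-specific distortion: keep a 1 with probability p_keep_1 and a 0
    with probability p_keep_0; otherwise flip the bit (parameters assumed)."""
    out = []
    for bit in row:
        p_keep = p_keep_1 if bit == 1 else p_keep_0
        out.append(bit if random.random() < p_keep else 1 - bit)
    return out

def estimate_true_ones(c1: int, n: int, p1: float = 0.9, p0: float = 0.9) -> float:
    """Reconstruction step for a single item: invert the expectation
    E[c1] = p1*n1 + (1 - p0)*(n - n1)  =>  n1 = (c1 - (1 - p0)*n) / (p1 + p0 - 1)."""
    return (c1 - (1 - p0) * n) / (p1 + p0 - 1)

# If 400 of 1000 clients truly hold the item, we expect to observe about
# 0.9*400 + 0.1*600 = 420 ones; inverting the expectation recovers the true count:
assert abs(estimate_true_ones(420, 1000) - 400.0) < 1e-6
```

Reconstructing itemset supports (rather than single items) multiplies such correction terms, which is where the runtime cost and the optimization opportunities the abstract mentions arise.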
Efficient Execution of Aggregation Queries over Encrypted Relational Databases

Encryption is a common method to assure privacy of stored data. In many practical situations, decrypting data before applying logic compromises privacy. The challenge is to come up with logic transformation techniques and result-mapping methods so that the exact result of applying logic to data-in-the-clear is obtained by applying the transformed logic to encrypted data and mapping the result produced. In the scope of relational aggregation queries and in the presence of logical predicates, we show how to support needed transformations and mappings.

Hakan Hacıgümüş, Bala Iyer, Sharad Mehrotra
A MAC-Based Service Bundle Authentication Mechanism in the OSGi Service Platform

In this paper, we propose a service bundle authentication mechanism that considers the characteristics of the home gateway environment. We design a key exchange mechanism for exchanging the key that transports a service bundle safely during the bootstrapping step, which recognizes and initializes various devices. We then propose a service bundle authentication mechanism based on a MAC that uses a shared secret created in the bootstrapping step. The proposed MAC-based service bundle authentication mechanism is more efficient than PKI-based service bundle authentication or the RSH protocol on the OSGi service platform, which has restricted resources such as storage space and computing power.

Young-Gab Kim, Chang-Joo Moon, Dae-Ha Park, Doo-Kwon Baik
S-COI: The Secure Conflicts of Interest Model for Multilevel Secure Database Systems

The Chinese Wall policy is one of the well-known commercial security policies. It is used to specify information control when conflicts of interest arise. In other words, it maintains database security by classifying accessible data objects according to users’ interests and limiting access to data objects that could cause conflicts of interest. In this paper, we propose a new model that decreases conflicts between user transactions by classifying the data objects in a database system according to users’ interests. To achieve this goal, the Chinese Wall policy is reinterpreted and applied to transaction processing in multilevel secure database systems. A more flexible concurrency control protocol based on the proposed model is then suggested. Our model can be utilized as a flexible security policy that prevents performance degradation in large database systems.

Chanjung Park, Seog Park, Yoongu Kim

Query Processing in Temporal and Spatial Databases

Modeling Temporally Variable Transportation Networks

In this paper, a State-Based Dynamic Transportation Network (SBDTN) model is presented, which can be used to describe the spatio-temporal aspect of temporally variable transportation networks. The basic idea of this model is to associate a temporal attribute with every node or edge of the graph system so that state changes (such as traffic jams and blockages caused by temporary constructions) and topology changes (such as insertion and deletion of nodes or edges) can be expressed. Since the changes of the graph system are discrete, the temporal attribute can be expressed as a series of temporal units, and each temporal unit describes one single state of the node or edge during a certain period of time. The data model is given as a collection of data types and operations which can be plugged as attribute types into a DBMS to obtain a complete data model and query language.

Zhiming Ding, Ralf Hartmut Güting
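The idea of a temporal attribute as a series of temporal units can be sketched as follows; the type and field names are hypothetical, not the paper's:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalUnit:
    """One state of a node or edge during a half-open time interval."""
    start: float
    end: float          # exclusive; float('inf') for the current, open-ended state
    state: str          # e.g. "open", "jammed", "blocked"

def state_at(units: list[TemporalUnit], t: float) -> Optional[str]:
    """Return the edge's state at time t, or None if the edge did not exist then."""
    for u in units:
        if u.start <= t < u.end:
            return u.state
    return None

# An edge that was open, then jammed for ten time units, then open again:
edge_history = [
    TemporalUnit(0, 50, "open"),
    TemporalUnit(50, 60, "jammed"),
    TemporalUnit(60, float("inf"), "open"),
]
assert state_at(edge_history, 55) == "jammed"
assert state_at(edge_history, 70) == "open"
```

Topology changes fit the same scheme: an edge deleted at time t simply has no temporal unit covering times after t.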
Statistic Driven Acceleration of Object-Relational Space-Partitioning Index Structures

Relational index structures, such as the Relational Interval Tree or the Linear Quadtree, support efficient processing of queries on top of existing object-relational database systems. Furthermore, there exist effective and efficient models to estimate the selectivity and the I/O cost in order to guide the cost-based optimizer whether and how to include these index structures in the execution plan. By design, the models immediately fit common extensible indexing/optimization frameworks, and their implementations exploit the built-in statistics facilities of the database server. In this paper, we show how these statistics can also be used to accelerate the access methods themselves by reducing the number of generated join partners, which results in fewer logical reads and consequently improves the overall runtime. We cut down on the number of join partners by grouping different join partners together according to a statistics-driven grouping algorithm. Our experiments on an Oracle9i database yield an average speed-up between 20% and 10,000% for spatial collision queries on the Relational Interval Tree and on the Relational Quadtree.

Hans-Peter Kriegel, Peter Kunath, Martin Pfeifle, Matthias Renz
Path-Based Range Query Processing Using Sorted Path and Rectangle Intersection Approach

A path-based range query (PRQ) is a class of range query in which, given a set of points P in the two-dimensional plane that defines a path, we find the union of all points within a distance d of the points in P. The simple method of performing a repeated range query (RRQ), i.e., a standard range query for each query point, and combining the results is inefficient, as it searches the spatial index multiple times. We revisit the current method of solving a PRQ in one pass, which processes all the points in P simultaneously using two pruning rules, PointOut and NodeIn, based on the minimum and maximum geometric distances of the query points in P. We further present two techniques for PRQ, called sorted path and rectangle intersection, that improve the running time of the two pruning rules. Empirical results on real-life datasets support our theoretical finding that PRQ using the sorted path is better than the other approaches.

Hoong Kee Ng, Hon Wai Leong
Efficient Processing of Spatiotemporal Joins

Among other operations, a spatiotemporal DBMS should efficiently answer the spatiotemporal join. This paper presents an evaluation of spatiotemporal join algorithms using two recent structures: a partially persistent R-tree called the Temporal R-Tree (TR-tree) and the 2+3D R-Tree. Starting from spatial join algorithms, we present algorithms for processing spatiotemporal joins over time instants and intervals on both spatiotemporal data structures. Finally, we implement and test these new algorithms on generated spatiotemporal data sets. Our experiments show that our algorithms perform well even in extreme cases, demonstrating good scalability, especially for the TR-tree.

Geraldo Zimbrão, Jano Moreira de Souza, Victor Teixeira de Almeida

Semi-structured Databases

Skipping Strategies for Efficient Structural Joins

The structural join is considered a core operation in processing and optimizing XML queries. Recently, various techniques have been proposed for efficiently finding structural relationships between sets of nodes. This paper presents an adaptive algorithm for efficiently processing structural joins. In contrast to previous work, which usually relies on external index structures such as B-trees, our proposal does not require any such data structures. Hence, our strategy has lower overheads than previous techniques, and can be easily implemented and incorporated into any existing system. Experiments show that our method significantly outperforms previous algorithms.

Franky Lam, William M. Shui, Damien K. Fisher, Raymond K. Wong
Scaling SDI Systems via Query Clustering and Aggregation

XML-based Selective Dissemination of Information (SDI) systems aim to quickly deliver useful information to users based on their profiles or subscriptions. These subscriptions are specified in the form of XML queries. This paper investigates how clustering and aggregation of user queries can help scale SDI systems by reducing the number of document-subscription matchings required. We design a new distance function to measure the similarity of query patterns, and develop a filtering technique called YFilter* that is based on YFilter. Experimental results show that the proposed approach achieves high precision and high recall while reducing runtime requirements.

Xi Zhang, Liang Huai Yang, Mong Li Lee, Wynne Hsu
A Lightweight XML Repository Supporting Dynamic Partial Update of XML Data

Generic methods of storing and querying XML data are often not optimal or adequate for an embedded system with domain-specific requirements. Though embedded systems may have stricter resource and time constraints, they also present new opportunities for customization and optimization in handling XML data, because the characteristics and behaviours of the XML data in a specific domain are likely to be known a priori. In this paper, embedded XML repositories are considered in which large-scale XML data are split into smaller fragments for secure and efficient delivery from senders to receiving devices. The proposed XML repository for the receiving devices handles the XML data in such a way that the XML fragments can be efficiently reconstructed, searched, and retrieved, while dynamic updates and versioning of the XML fragments are supported with marginal costs.

Hyoseop Shin, Andrei V. Popov
On View Transformation Support for a Native XML DBMS

XML is becoming the standard data exchange format. View transformation of XML data is an important and frequent operation in XML data integration and publishing. In schema-based view transformation, users define a view schema over the sources to obtain view results. This declarative approach relieves users of writing complex scripts to perform view transformations. Currently available schema formats are unable to express views with complex semantic constraints. In this paper, we introduce a semantically expressive XML data model, the Object-Relationship-Attribute model for Semi-Structured data (ORA-SS), which allows users to define view schemas with rich semantic meaning. Combined with ORA-SS, we use a native XML DBMS, OrientStore, to perform accurate and efficient view transformation.

Daofeng Luo, Ting Chen, Tok Wang Ling, Xiaofeng Meng

Knowledge Discovery in Temporal and Spatial Databases

Similarity Search for Interval Time Sequences

Time sequences, which are ordered sets of observations, have been studied in various database applications. In this paper, we introduce a new class of time sequences in which each observation is represented by an interval rather than a number. Such sequences may arise in many situations. For instance, we may not be able to determine the exact value at a time point due to uncertainty or aggregation. In such a case, the observation may be better represented by a range of possible values. To the best of our knowledge, similarity search for interval time sequences has not been studied, and it poses a new challenge for research. We first address the issue of (dis)similarity measures for interval time sequences. We choose an $\mathcal{L}_1$ norm-based measure because it is semantically better than the alternatives. We next propose an efficient indexing technique for fast retrieval of similar interval time sequences from large databases. More specifically, we propose (1) to extract a segment-based feature vector for each sequence, and (2) to map each feature vector to either a point or a hyper-rectangle in a multi-dimensional feature space. We then show how existing multi-dimensional index structures such as the R-tree can be used for efficient query processing. Our proposed method guarantees that no false dismissals occur.

Byoung-Kee Yi, Jong-Won Roh
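The paper's exact $\mathcal{L}_1$ norm-based measure is not reproduced here, but one natural interval dissimilarity of that flavor charges the gap between two intervals (zero when they overlap) and sums that charge over the sequence. A sketch under that assumed definition:

```python
def interval_dist(a: tuple[float, float], b: tuple[float, float]) -> float:
    """L1-style distance between intervals [a0, a1] and [b0, b1]:
    the gap between them, or 0 if they overlap (an assumed definition)."""
    (a0, a1), (b0, b1) = a, b
    return max(0.0, max(a0, b0) - min(a1, b1))

def seq_dist(s: list, t: list) -> float:
    """Sum the per-observation interval distances over two equal-length sequences."""
    assert len(s) == len(t)
    return sum(interval_dist(x, y) for x, y in zip(s, t))

s = [(1, 3), (2, 4), (5, 7)]
t = [(2, 5), (6, 8), (5, 6)]
assert seq_dist(s, t) == 2.0   # only the second pair is disjoint: gap 6 - 4 = 2
```

A measure of this shape lower-bounds point-based distances consistent with the intervals, which is the kind of property that lets an index avoid false dismissals.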
Discovering Geographical Features for Location-Based Services

In applications such as location-based services, development planning, and area marketing, knowledge of frequent service requests that are always issued together is useful for decision and policy making. However, merely knowing the frequently co-located service requests may not suffice. We observe that these co-located service requests are often influenced by surrounding geographical features. By incorporating geographical features into the co-located service requests, we discover a new class of patterns called geographical-based NRS (Neighbouring service Request Sets), which reveal more information than co-located service requests alone. We design two algorithms, TwoPhaseGSS and AprioriGSS, to discover this new class of patterns. Experimental results demonstrate the efficiency and scalability of these algorithms.

Junmei Wang, Wynne Hsu, Mong Li Lee
Polygon and Polyline Join Using Raster Filters

Processing spatial joins efficiently is crucial to making the spatial data analysis process feasible. As pointed out in many works, the exact intersection test of two spatial objects is the most time-consuming and I/O-consuming step in processing spatial joins. The use of approximations can reduce the need to examine the exact geometry of spatial objects in order to find the intersecting ones. In previous works, approximations were proposed for spatial joins between similar objects: polygons × polygons or polylines × polylines. This work evaluates the benefits of using approximations in spatial joins performed on polygon and polyline sets. Also, a new algorithm is proposed to compare approximations of polygons and polylines. The experiments performed with real data sets showed performance gains that validate the effectiveness of the approach: the number of exact intersection tests was reduced by 59%, while the overall execution time and the number of disk accesses were both reduced by 48%.

Rodrigo Salvador Monteiro, Leonardo Guerreiro Azevedo, Geraldo Zimbrão, Jano Moreira de Souza
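The raster-filter idea can be sketched with a crude cell approximation: two objects can intersect only if their raster approximations share a grid cell, so the expensive exact geometry test runs only on the surviving candidate pairs. This sketch approximates each object by the grid cells of its bounding box rather than the paper's true raster signatures, and the cell size is an assumption:

```python
from math import floor

def raster_cells(bbox: tuple, cell: float = 1.0) -> set:
    """Grid cells overlapped by a bounding box (a crude raster approximation)."""
    x0, y0, x1, y1 = bbox
    return {(i, j)
            for i in range(floor(x0 / cell), floor(x1 / cell) + 1)
            for j in range(floor(y0 / cell), floor(y1 / cell) + 1)}

def raster_filter(bbox_a: tuple, bbox_b: tuple, cell: float = 1.0) -> bool:
    """True if the two approximations share a cell; only then is the
    expensive exact polygon/polyline intersection test needed."""
    return bool(raster_cells(bbox_a, cell) & raster_cells(bbox_b, cell))

assert raster_filter((0, 0, 2, 2), (1.5, 1.5, 3, 3))      # candidate pair survives
assert not raster_filter((0, 0, 0.4, 0.4), (5, 5, 6, 6))  # pruned without exact test
```

Richer raster signatures (e.g., per-cell codes distinguishing fully from partially covered cells) prune even more pairs than this plain occupancy test.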

XML and Multimedia Data Storage

Searching Proper Replications in Mapping XML into Relations

In this paper, we introduce two efficient search methods that heuristically find proper replications close to the optimal value. These replications can notably reduce the overall query cost when mapping XML into relations. The need for such an automatic tool arises from the complexity of XML: it is difficult to search for the optimal replication set to enhance query performance because of the mass of complex XML data and queries. Moreover, we observed that the search problem requires time exponential in the number of given queries. Another important implementation problem is that the exact query cost cannot be measured within a reasonable time: actually executing an XML query over the actual relational schema takes considerable time each time a replication is evaluated in the algorithms, so an effective estimation technique is needed. To overcome this obstacle, we consider the histogram-based estimation in commercial RDBMSs. It is very simple and supports accurate estimation. In addition, it handles the highly skewed structure of XML very well. Finally, the effect of the two search methods and the estimation is analyzed through experiments.

Jaehoon Kim, Seog Park
A Semantics-Based Versioning Scheme for Multimedia Data

As advanced Web-based applications require dealing with diverse types of multimedia data, it is necessary to store, retrieve, and manage multimedia data from disparate sources in a single data warehouse. In this paper, we present a semantics-based method to manage multimedia data. First, we propose a data model that represents multimedia contents and captures semantic relationships among different types of multimedia data. Second, we develop a version management technique for multimedia objects without loss of semantic information. The prototype system has been implemented on top of a native XML database system.

Hyon Hee Kim, Seung Soo Park
DiffXML: Change Detection in XML Data

In this paper, we introduce a method to map XML files to the relational data model by parsing XML files as DOM trees and storing value and path information for each node in relational tables. We present an algorithm called DiffXML, which uses SQL operations to detect changes between two versions of an XML file stored in a relational database. The value and path information for the XML files are also used to detect differences. DiffXML finds newly inserted, deleted, and updated nodes, and also detects the move of a subtree from one place to another in the XML DOM tree. We analyze the performance of DiffXML against some current commercial and research-prototype XML change detection tools.

Yan Chen, Sanjay Madria, Sourav Bhowmick

Temporal and Spatial Databases and Query Processing

Adaptive Quantization of the High-Dimensional Data for Efficient KNN Processing

In this paper, we present a novel index structure, called the SA-tree, to speed up the processing of high-dimensional K-nearest neighbor (KNN) queries. The SA-tree employs data clustering and compression; i.e., it utilizes the characteristics of each cluster to adaptively compress feature vectors into bit-strings. Hence, our proposed mechanism can reduce disk I/O and computational cost significantly, and adapt to different data distributions. We also develop efficient KNN search algorithms using the MinMax Pruning and Partial MinDist Pruning methods. We conducted extensive experiments to evaluate the SA-tree, and the results show that our approaches provide superior performance.

Bin Cui, Jing Hu, Hengtao Shen, Cui Yu
Indexing Iconic Image Database for Interactive Spatial Similarity Retrieval

Similarity-based retrieval of images is an important task in many image database applications. Interactive similarity retrieval is one way to resolve the fuzzy area involving the psychological and physiological factors of individuals during the retrieval process. A good interactive similarity system depends not only on a good measure system, but also on the structure of the image database and the retrieval process based on that structure. In this paper, we propose to use a digraph of most similar images as an index structure for iconic spatial similarity retrieval. Our approach makes use of simple feedback from the user and avoids the high re-computation cost of interactive retrieval algorithms. The interactive similarity retrieval process resembles a navigation through the image database guided by the system's measure and the user. The proposed approach prevents looping and is guaranteed to find the target image. It is straightforward and adapts to different similarity measures.

Xiao Ming Zhou, Chuan Heng Ang, Tok Wang Ling
Concurrent Updating of Large Spatial Objects

Update transactions executed in spatial databases are usually interactive and of long duration. When a spatial object has a geometry larger than the screen window, it is necessary to allow concurrent updates of such a large spatial object to improve the concurrency of spatial updates. Under existing locking protocols, a large spatial object cannot be updated concurrently. We propose a partial locking scheme that allows a transaction to set locks on parts of a large object. The partial locking scheme acquires a lock on part of the large object, so that several users can update the same object concurrently. The scheme improves the concurrency of updating a large object while maintaining the consistency of long-duration transactions that interactively update spatial data.

Youngduk Seo, Donghyun Kim, Bonghee Hong
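The partial locking idea can be caricatured as locking (object, part) pairs instead of whole objects; the class and method names below are hypothetical, and real protocols add lock modes and deadlock handling:

```python
class PartialLockManager:
    """Locks on (object, part) pairs so that two transactions can update
    different parts of the same large spatial object (an illustrative sketch)."""

    def __init__(self):
        self.locks = {}   # (object_id, part_id) -> owning transaction

    def acquire(self, txn: str, obj: str, part: int) -> bool:
        """Grant the part-level lock unless another transaction holds it."""
        key = (obj, part)
        if self.locks.get(key, txn) != txn:
            return False              # conflicting lock on the same part
        self.locks[key] = txn
        return True

    def release_all(self, txn: str):
        """Release every part-level lock held by the transaction."""
        self.locks = {k: v for k, v in self.locks.items() if v != txn}

lm = PartialLockManager()
assert lm.acquire("T1", "lake42", part=0)
assert lm.acquire("T2", "lake42", part=1)       # same object, different part: OK
assert not lm.acquire("T2", "lake42", part=0)   # same part: conflict detected
```

Under whole-object locking, T2's second acquire above would already have been refused; part-level granularity is exactly what lets both long transactions proceed.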
A Cost Model for Spatial Intersection Queries on RI-Trees

The efficient management of interval sequences represents a core requirement for many temporal and spatial database applications. With the Relational Interval Tree (RI-tree), an efficient access method has been proposed to process intersection queries of spatial objects encoded by interval sequences on top of existing object-relational database systems. This paper complements that approach with effective and efficient models to estimate the selectivity and the I/O cost of interval sequence intersection queries in order to guide the cost-based optimizer whether and how to include the RI-tree in the execution plan. By design, the models immediately fit common extensible indexing/optimization frameworks, and their implementations exploit the built-in statistics facilities of the database server. According to our experimental evaluation on an Oracle database, the average relative error of the estimated query results and costs lies in the range of 0% to 32%, depending on the size and the structural complexity of the query objects.

Hans-Peter Kriegel, Martin Pfeifle, Marco Pötke, Thomas Seidl

Web Computing

Template-Based Proxy Caching for Table-Valued Functions

Certain types of database-backed web sites heavily utilize user-defined functions in SQL queries. Unfortunately, existing web proxy caching schemes cannot handle these functions. In order to enable proxy caching for such web sites, we propose a template-based proxy caching framework, a function proxy, for table-valued functions. In our framework, function templates are registered with the proxy so that the proxy can answer new queries based on previously cached data in collaboration with the original web site. We identify a common class of function-embedded queries and present several active caching schemes for them. Our experiments with real web sites show the feasibility and usefulness of the function proxy. Additionally, we find that cache-intersecting queries may not be worth handling at the proxy, and that containment-based active caching is efficient and practical.

Qiong Luo, Wenwei Xue
On Improving Website Connectivity by Using Web-Log Data Streams

When people visit Websites, they want to access the content they are interested in efficiently and without delay. However, due to constant changes in site content and user patterns, the access efficiency of Websites cannot be kept optimal, especially during peak hours. In this paper, we first address the problems of Website access efficiency during peak hours and then propose new measures to evaluate access efficiency. An efficient algorithm is introduced to detect user access patterns using the Website topology and the Web-log data stream. With this method, we can modify a Website topology online so that the new topology improves the Website connectivity to adapt to current visitors’ access patterns. A real sports Website is used to evaluate the effectiveness of the proposed method in accelerating user access to related content. The evaluation results presented in this paper suggest that the method can feasibly and intelligently improve the connectivity of a Website online.

Edmond HaoCun Wu, Michael KwokPo Ng, Joshua ZheXue Huang
Ontological and Pragmatic Knowledge Management for Web Service Composition

The vision of the Semantic Web is to reduce manual discovery and usage of Web resources (documents and services) and to allow intelligent agents to automatically identify these Web resources, integrate them and execute them for achieving the intended goals of the user. The composed Web service is represented as a workflow, called service flow. This paper presents different types of compositional knowledge required for automatic Web service flow generation: operational (syntactic), semantic and pragmatic knowledge. Operational knowledge deals with the right output and input types for possible service composition, while semantic knowledge is domain-specific expert knowledge that governs the Web service compositionality. In addition, pragmatic knowledge allows the contextual as well as common sense knowledge to constrain the Web service composition. We show how to extend the current XML-based standards to represent the compositional knowledge or rules that play a role in service discovery and composition.

Soon Ae Chun, Yugyung Lee, James Geller
Web Page Grouping Based on Parameterized Connectivity

We propose a novel method for Web page grouping based only on hyperlink information. Because of the explosive growth of the Web, page grouping is expected to provide a general grasp of the Web for effective Web search and netsurfing. The Web can be regarded as a gigantic digraph where pages are vertices and links are arcs. Our method is a generalization of the decomposition into strongly connected components. Each group is constructed as a subset of a strongly connected component. Moreover, group sizes can be controlled by a parameter, called the threshold parameter. We call the resulting groups parameterized connected components. The algorithm is simple and admits parallelization. Notably, we apply Dijkstra’s shortest path algorithm in our method.

Tomonari Masada, Atsuhiro Takasu, Jun Adachi

Data Mining and Knowledge Discovery in Web I

Reducing Communication Cost in a Privacy Preserving Distributed Association Rule Mining

Data mining is a process that analyzes voluminous digital data in order to discover hidden but useful patterns. However, the discovery of such hidden patterns has statistical meaning and may often disclose some sensitive information. As a result, privacy has become one of the prime concerns in the data mining research community. Since distributed association mining discovers global association rules by combining local models from various distributed sites, breaches of data privacy happen more often than they do in centralized environments. In this work we present a methodology that generates global association rules without revealing confidential inputs, such as the statistical properties of individual sites, and yet retains a high level of accuracy in the resultant rules. One of the important outcomes of the proposed technique is that it reduces the overall communication costs. Performance evaluation of our proposed method shows that it reduces the communication cost significantly compared with some well-known distributed association rule mining algorithms. Furthermore, the global rule model generated by the proposed method is based on the exact global support of each itemset, and hence avoids the inconsistency that occurs when global models are generated from partial support counts of an itemset.

Mafruz Zaman Ashrafi, David Taniar, Kate Smith
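The idea of combining local support counts without revealing any individual site's statistics can be illustrated with the classic ring-based secure-sum building block. This is a sketch of the general principle only, not necessarily the paper's protocol, and the names are illustrative:

```python
import random

def secure_sum(local_counts, modulus=1 << 32):
    """Ring-based secure sum: the initiating site draws a random mask,
    each site adds its private count to the running total, and the
    initiator subtracts the mask at the end. No site ever sees another
    site's raw count, only a masked partial sum."""
    mask = random.randrange(modulus)
    running = mask
    for count in local_counts:   # each site adds its own local support
        running = (running + count) % modulus
    return (running - mask) % modulus

# Global support of an itemset computed from three sites' private counts:
print(secure_sum([120, 45, 87]))  # → 252
```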
A Novel Representation of Sequence Data Based on Structural Information for Effective Music Retrieval

In this paper, we propose a novel representation of sequences based on their structural information. A sequence is represented by a set of rules, which are derived from its subsequences. There are two types of subsequences of interest. One is called a frequent pattern, a subsequence appearing often enough in the sequence. The other is called a correlative pattern, a subsequence composed of highly correlated elements. The rules derived from the frequent patterns are called frequent rules, while the ones derived from the correlative patterns are called correlative rules. By considering music objects as sequences, we represent each of them as a set of rules and design a similarity function for effective music retrieval. The experimental results show that our approaches outperform Markov-model-based approaches in terms of average precision.

Chia-Hsiung Lee, Chung-Wen Cho, Yi-Hung Wu, Arbee L. P. Chen
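As a minimal illustration of the frequent-pattern notion, the following sketch extracts fixed-length contiguous substrings that occur often enough in a sequence. The paper's patterns (and its correlative variant) are more general; the names here are illustrative:

```python
from collections import Counter

def frequent_patterns(seq, length, min_count):
    """Contiguous subsequences of a fixed length that appear at least
    min_count times in the sequence."""
    counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
    return {p for p, n in counts.items() if n >= min_count}

# In the note sequence "abcabcab", only "ab" occurs three times:
print(frequent_patterns("abcabcab", 2, 3))  # → {'ab'}
```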
Managing and Mining Clinical Outcomes

In this paper, we describe clinical outcomes analysis for data in the Memorial Sloan-Kettering Cancer Center Sarcoma Database using relational data mining, and propose an infrastructure for managing cancer data for the Drexel University Cancer Epidemiology Server (DUCES), a network-based multi-institutional database that serves as a practical research tool for On-Line Analytic Mining (OLAM). We conducted data analysis using relational learning (or relational data mining) on cancer patients’ clinical records that have been collected prospectively for 20 years. We analyzed clinical data based not only on static events, such as disease-specific death for survival analysis, but also on temporal events with censored data. Rules extracted using relational learning were compared to results from statistical analysis, and the usefulness of the rules is assessed in the context of clinical medicine. The contribution of this paper is twofold: to show that rigorous data analysis using relational data mining provides valuable insights for clinical data assessment and complements traditional statistical analysis, and to propose an infrastructure to manage and mine clinical outcomes in multi-institutional organizations.

Hyoil Han, Il-Yeol Song, Xiaohua Hu, Ann Prestrud, Murray F. Brennan, Ari D. Brooks
An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-Tree Structures

In this study, a general incremental updating technique is proposed for maintaining the frequent itemsets discovered in a database under insertion, deletion, and modification of transactions. An efficient algorithm, called AFPIM (Adjusting FP-tree for Incremental Mining), is designed based on adjusting FP-tree structures. Our approach uses an FP-tree structure to store compact information about transactions involving frequent and pre-frequent items in the original database. In most cases, without rescanning the original database, the new FP-tree structure of the updated database can be obtained by adjusting the FP-tree of the original database according to the changed transactions. Experimental results show that AFPIM outperforms existing algorithms in terms of execution time.

Jia-Ling Koh, Shui-Feng Shieh
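The bookkeeping idea behind incremental maintenance, updating supports on insertion and deletion instead of rescanning, can be sketched as follows for single items. AFPIM's actual contribution, adjusting the FP-tree itself, is not modeled here, and the names are illustrative:

```python
from collections import Counter

class IncrementalMiner:
    """Maintain item supports under transaction insert/delete without
    rescanning the original database."""

    def __init__(self, transactions):
        self.n = len(transactions)
        self.counts = Counter(item for t in transactions for item in t)

    def insert(self, transaction):
        self.n += 1
        self.counts.update(transaction)

    def delete(self, transaction):
        self.n -= 1
        self.counts.subtract(transaction)

    def frequent(self, min_support):
        """Items whose relative support meets the threshold."""
        return {i for i, c in self.counts.items() if c / self.n >= min_support}

miner = IncrementalMiner([["a", "b"], ["a", "c"]])
miner.insert(["a"])
print(miner.frequent(1.0))  # → {'a'}
```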
A Collaborative Recommendation Based on Neural Networks

Collaborative filtering is one of the widely used methods for recommendation. It recommends an item to a user based on the reference users’ preferences for the target item or the target user’s preferences for the reference items. In this paper, we propose a neural network based collaborative filtering method. Our method builds a model by learning correlation between users or items using a multi-layer perceptron. We also investigate selection of the reference users or items based on similarity to improve performance. We finally demonstrate that our method outperforms the existing methods through experiments using the EachMovie data.

Myung Won Kim, Eun Ju Kim, Joung Woo Ryu
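For reference, the plain similarity-based collaborative filtering that the neural approach builds on can be sketched as follows. The paper replaces this correlation step with a learned multi-layer perceptron; all names here are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = [i for i in u if i in v]
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

def predict(target, reference_users, item):
    """Similarity-weighted average of the reference users' ratings
    for the target item."""
    pairs = [(cosine(target, u), u[item]) for u in reference_users if item in u]
    weight = sum(abs(w) for w, _ in pairs)
    if weight == 0:
        return None
    return sum(w * r for w, r in pairs) / weight

target = {"a": 5, "b": 3}
others = [{"a": 5, "b": 3, "c": 4}, {"a": 1, "b": 5, "c": 2}]
print(round(predict(target, others, "c"), 2))  # leans toward the similar user
```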

Query Processing and Optimization

On Incorporating Iceberg Queries in Query Processors

Iceberg queries are a special case of SQL queries involving GROUP BY and HAVING clauses, wherein the answer set is small relative to the database size. We present here a performance framework and a detailed evaluation within this framework of the efficiency of various iceberg query processing techniques. Based on these results, we provide a simple recipe algorithm that can be implemented in a query optimizer to make appropriate algorithmic choices for processing iceberg queries.

Krishna P. Leela, Pankaj M. Tolani, Jayant R. Haritsa
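The shape of an iceberg query, a GROUP BY whose HAVING clause keeps only the few groups above a threshold, can be sketched in Python (the data and names are illustrative):

```python
from collections import Counter

def iceberg(rows, key, threshold):
    """Keep only the groups whose count reaches the threshold --
    the small 'tip' of the full GROUP BY result."""
    counts = Counter(key(row) for row in rows)
    return {g: c for g, c in counts.items() if c >= threshold}

sales = [("tea",), ("tea",), ("tea",), ("coffee",), ("milk",), ("milk",)]
print(iceberg(sales, lambda row: row[0], 3))  # → {'tea': 3}
```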
A Multiple Continuous Query Optimization Method Based on Query Execution Pattern Analysis

Many data streams are provided through the network today, and continuous queries are often used to extract useful information from data streams. When a system must process many queries continuously, query optimization is quite important for their efficient execution. In this paper, we propose a novel multiple query optimization method for continuous queries based on query execution pattern analysis. In the advanced stream processing environment assumed in the paper, we use window operators to specify time intervals to select information of interest and the execution time specification to designate when the query should be evaluated. Queries having the same operators may share many intermediate results when they are executed at close instants, but may involve only disjoint data when executed at completely different instants. Thus, query execution timing as well as common subexpressions is a key to deciding an efficient query execution plan. The basic idea of the proposed method is to identify query execution patterns from data arrival logs of data streams and to make the most of this information in deciding an efficient query execution plan. The proposed query optimization scheme first analyzes data arrival logs and extracts query execution patterns. It then forms clusters of continuous queries such that queries in the same cluster are likely to be executed at close instants. Finally, it extracts common subexpressions from among queries in each cluster and decides the query execution plan. We also show experimental results using the prototype implementation, and discuss the effectiveness of the proposed approach.

Yousuke Watanabe, Hiroyuki Kitagawa
An Efficient Approach for Partial-Sum Queries in Data Cubes Using Hamming-Based Codes

A previous method supported partial-sum queries using a Covering Code (CCode) method with a seed table and an index look-up table. However, this approach suffers from a major drawback: the excessive memory space required for the index look-up table. Accordingly, this work proposes an efficient method, called the Hamming-Based Code (HBC) method. The HBC method uses Hamming-based codes to establish a seed table and to locate the seed nearest to a user’s partial-sum query. The total partial sum can then be determined from the seed value and the original cell values via addition or subtraction operations. In dynamic environments, only half of the seeds in the seed table need to change, and complete reconstruction is unnecessary. In the present analysis, the HBC method requires less storage than the CCode method.

Chien-I Lee, Yu-Chiang Li, Shin-Mu Tseng
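The seed-table mechanism can be sketched as follows: precompute sums over a few codeword subsets, then answer a query from the nearest codeword (in Hamming distance) plus per-cell corrections. This illustrates the mechanism only; the actual seed construction via Hamming-based codes is the paper's contribution, and the names are illustrative:

```python
def partial_sum(cells, query, seeds):
    """Answer a partial-sum query (a 0/1 vector over cells) using a
    precomputed table of seed (codeword) subsets."""
    # The seed nearest in Hamming distance needs the fewest corrections.
    best = min(seeds, key=lambda s: sum(a != b for a, b in zip(s, query)))
    total = sum(c for c, bit in zip(cells, best) if bit)
    for cell, q_bit, s_bit in zip(cells, query, best):
        if q_bit and not s_bit:
            total += cell   # cell wanted but missing from the seed
        elif s_bit and not q_bit:
            total -= cell   # cell in the seed but not in the query
    return total

cells = [1, 2, 3, 4]
seeds = [(1, 1, 0, 0), (0, 0, 1, 1)]
print(partial_sum(cells, (1, 1, 1, 0), seeds))  # → 6  (= 1 + 2 + 3)
```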
Optimising Mediator Queries to Distributed Engineering Systems

Data and computations of a CAD system have been wrapped by a mediator system using CORBA’s IIOP protocol. This allows ad hoc declarative mediator queries to be translated into optimised execution plans sending dynamic IIOP messages to a CAD server. Queries may call computational methods provided only by the CAD system. Dynamic query re-optimisation, taking into account what data is currently materialised in the mediator, substantially improves performance compared to static optimisation, despite the added optimisation time. The approach provides both increased flexibility and efficiency compared to the conventional navigational use of CORBA-based CAD interfaces. Furthermore, CAD independence is provided, since transparent queries and views can be defined over different CAD systems.

Mattias Nyström, Tore Risch
Automatic Generation of SQLX View Definitions from ORA-SS Views

Although XML is the dominant standard for publishing and exchanging data for Internet-based business applications, data is typically stored in relational or object-relational databases. Thus, it is necessary to define XML views over these traditional databases. Unfortunately, it is not easy for users to manually write SQLX queries to define the XML views. This paper describes a method to automatically generate SQLX view definitions from object-relational databases. We utilize the semantically rich ORA-SS data model to capture the schematic structure and semantics of the underlying data. Valid ORA-SS views are first designed on the ORA-SS schema, before they are mapped to XML views. The generated view definitions are SQL queries with XML extension (SQLX) that can be directly evaluated on object-relational databases to materialize the views. This approach removes the need to manually write executable view definitions for the XML views, and provides a user-friendly interface to retrieve XML data via views.

Ya Bing Chen, Tok Wang Ling, Mong Li Lee

Classification and Clustering I

Semi-supervised Text Classification Using Partitioned EM

Text classification using a small labeled set and a large unlabeled set is seen as a promising technique to reduce the labor-intensive and time-consuming effort of labeling training data in order to build accurate classifiers, since unlabeled data is easy to obtain from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many applications: in many real-life settings, a class may cover documents from many different topics, which violates the one-to-one correspondence assumption, and the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve this problem. The method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes, or partitions, are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method achieves a dramatic gain in classification performance.

Gao Cong, Wee Sun Lee, Haoran Wu, Bing Liu
FMACA: A Fuzzy Cellular Automata Based Pattern Classifier

This paper presents a pattern classifier to handle real valued patterns. A special class of Fuzzy Cellular Automata (FCA), referred to as Fuzzy Multiple Attractor Cellular Automata (FMACA), is employed to design the pattern classifier. The analysis reported in this paper has established the FMACA as an efficient pattern classifier for real valued patterns. Excellent classification accuracy and low memory overhead of FMACA based pattern classifier have been demonstrated through extensive experimental results.

Pradipta Maji, P Pal Chaudhuri
Music Classification Using Significant Repeating Patterns

With the popularity of multimedia applications, a large amount of music data has been accumulated on the Internet. Automatic classification of music data becomes a critical technique for providing efficient and effective retrieval of music data. In this paper, we propose a new approach for classifying music data based on its content. In this approach, we focus on monophonic music features represented as rhythmic and melodic sequences. Moreover, we use repeating patterns of music data to perform music classification. For each pattern discovered from a group of music data, we employ a series of measurements to estimate its usefulness for classifying this group. According to the patterns contained in a music piece, we determine which class it should be assigned to. We perform a series of experiments, and the results show that our approach performs better on average than the approach based on the probability distribution of contextual information in music.

Chang-Rong Lin, Ning-Han Liu, Yi-Hung Wu, Arbee L. P. Chen

Web Search I

Applying Co-training to Clickthrough Data for Search Engine Adaptation

The information on the World Wide Web is growing without bound. Users may have very diversified preferences in the pages they target through a search engine. It is therefore a challenging task to adapt a search engine to suit the needs of a particular community of users who share similar interests. In this paper, we propose a new algorithm, Ranking SVM in a Co-training Framework (RSCF). Essentially, the RSCF algorithm takes the clickthrough data containing the items in the search result that have been clicked on by a user as an input, and generates adaptive rankers as an output. By analyzing the clickthrough data, RSCF first categorizes the data as the labelled data set, which contains the items that have been scanned already, and the unlabelled data set, which contains the items that have not yet been scanned. The labelled data is then augmented with unlabelled data to obtain a larger data set for training the rankers. We demonstrate that the RSCF algorithm produces better ranking results than the standard Ranking SVM algorithm. Based on RSCF we develop a metasearch engine that comprises MSNSearch, Wisenut, and Overture, and carry out an online experiment to show that our metasearch engine outperforms Google.

Qingzhao Tan, Xiaoyong Chai, Wilfred Ng, Dik-Lun Lee
Visual Interface for Evaluating Internet Search Results

The summary-based approach is not an optimal solution for presenting search results over a massive collection of information. This traditional approach does not consider the user’s search behavior; furthermore, it lacks the capability to provide distributional and structural information about the contents, which is valuable in the user’s search activity. We have developed a visualization system that presents visual abstractions of Web pages as a search result. The illustration of a Web page shows the distributional pattern of search terms and the structural layout of the document compactly but informatively. The developed system reflects the significance of related hyperlinks in the visualization, and it also helps users locate search terms through distinct markers. In our pilot study, the participants showed a higher degree of preference for the visual interface over the traditional approach when exploring a huge information space, the Internet. The proposed encoding algorithm makes it possible to compare retrieved results intuitively, and it boosts the user’s understanding of the content of Web resources, ultimately assisting search activities.

Beomjin Kim
A Meta-search Method with Clustering and Term Correlation

A meta-search engine propagates user queries to its participant search engines following a server selection strategy. To facilitate server selection, the meta-search engine must keep concise descriptors about the document collections indexed by the participant search engines. Most existing approaches record in the descriptors information about what terms appear in a document collection, but they skip information about which documents a keyword appears in. This results in ineffective server ranking for multi-term queries, because a document collection may contain all of the query terms but not all of the terms appear in the same document. In this paper, we propose a server ranking approach in which each search engine’s document collection is divided into clusters by indexed terms. Furthermore, we keep the term correlation information in a cluster descriptor as a concise method to estimate the degree of term co-occurrence in a document set. We empirically show that combining clustering and term correlation analysis significantly improves search precision and that our approach effectively identifies the most relevant servers even with a naïve clustering method and a small number of clusters.

Dyce Jing Zhao, Dik Lun Lee, Qiong Luo

Classification and Clustering II

SUDEPHIC: Self-Tuning Density-Based Partitioning and Hierarchical Clustering

Clustering is one of the primary techniques in data mining, and finding the result the user expects is a major issue. However, dynamically specifying the parameters of clustering algorithms presents an obstacle for users. This paper first introduces a novel density-based partitioning and hierarchical algorithm, which makes it easy to employ a synthetic feedback mechanism in clustering. Additionally, by investigating the relation between the parameters and the clustering result, we propose a self-tuning technique for setting the parameters. Meanwhile, the density distribution within a cluster can be expressed in the result, allowing the user to inspect the cluster’s features. The algorithm is evaluated both in theory and in practice, and it outperforms many existing algorithms in both efficiency and quality.

Ding Zhou, Zunping Cheng, Chen Wang, Haofeng Zhou, Wei Wang, Baile Shi
Classification of Bio-medical Images Using Neuro Fuzzy Approach

The prime requirement for medical imaging systems is the ability to display images relating to a particular disease, and there is increasing interest in the use of image retrieval techniques to aid diagnosis by identifying similar past cases. Medical imaging is one area where computers have scored great success in biomedicine. Searching digital biomedical images is nevertheless a challenging problem: biomedical images, such as pathology slides, usually have higher resolution than general-purpose pictures, and retrieval of biomedical images will find wide usage in the next decade. In this paper, a neuro-fuzzy technique is presented for classifying biomedical images on the basis of a combined feature vector that merges color and texture features. The system uses pixel descriptors, which combine the human perception of color and texture into a single vector, for region extraction. The regions extracted using these pixel descriptors are fed as input to a neural network, which is trained to classify the images. The method also handles imprecision: users always provide partial information when posing queries, and we model this imprecision using fuzzy interpretation. The technique has been implemented and tested on a database of more than 100 pathological/biomedical images and is found to be quite satisfactory; some of the experimental results are reported in the paper. This technique can assist the medical community in diagnosing disease.

Shashikala Tapaswi, R. C. Joshi
Optimized Fuzzy Classification for Data Mining

Fuzzy rules are suitable for describing uncertain phenomena, natural for human understanding, and, in general, efficient for classification. In addition, fuzzy rules allow us to effectively classify data having non-axis-parallel decision boundaries, which is difficult for conventional attribute-based methods. In this paper, we propose a fuzzy rule generation method for classification that is optimized for both accuracy and comprehensibility (or rule complexity). We investigate the use of a genetic algorithm to determine an optimal set of membership functions for quantitative data. In our method, for a given set of membership functions a fuzzy decision tree is constructed, and its accuracy and rule complexity are evaluated and combined into the fitness function to be optimized. We have evaluated our algorithm on several benchmark data sets. The experimental results show that our method outperforms existing methods, including C4.5 and FID3.1, in both performance and comprehensibility of rules.

Myung Won Kim, Joung Woo Ryu

Web Search II

Supporting Exploratory Queries in Databases

Users of database applications, especially in the e-commerce domain, often resort to exploratory “trial-and-error” queries since the underlying data space is huge and unfamiliar, and there are several alternatives for search attributes in this space. For example, scouting for cheap airfares typically involves posing multiple queries, varying flight times, dates, and airport locations. Exploratory queries are problematic from the perspective of both the user and the server. For the database server, it results in a drastic reduction in effective throughput since much of the processing is duplicated in each successive query. For the client, it results in a marked increase in response times, especially when accessing the service through wireless channels. In this paper, we investigate the design of automated techniques to minimize the need for repetitive exploratory queries. Specifically, we present SAUNA, a server-side query relaxation algorithm that, given the user’s initial range query and a desired cardinality for the answer set, produces a relaxed query that is expected to contain the required number of answers. The algorithm incorporates a range-query-specific distance metric that is weighted to produce relaxed queries of a desired shape (e.g., aspect ratio preserving), and utilizes multi-dimensional histograms for query size estimation. A detailed performance evaluation of SAUNA over a variety of multi-dimensional data sets indicates that its relaxed queries can significantly reduce the costs associated with exploratory query processing.

Abhijit Kadlag, Amol V. Wanjari, Juliana Freire, Jayant R. Haritsa
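The core loop of cardinality-driven relaxation can be sketched in one dimension: widen the range until the answer set reaches the requested size. SAUNA additionally uses multi-dimensional histograms to estimate sizes and a weighted distance metric to preserve the query's shape; this sketch and its names are illustrative:

```python
def relax(lo, hi, data, k):
    """Widen the range [lo, hi] symmetrically until it holds at least
    k answers from the data set."""
    if k > len(data):
        raise ValueError("fewer than k points exist in the data")
    while sum(lo <= x <= hi for x in data) < k:
        lo -= 1
        hi += 1
    return lo, hi

# A range query around 5 relaxed until it also captures the point at 1:
print(relax(4, 6, [1, 5, 10], 2))  # → (1, 9)
```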
A Web Page Scoring Method for Local Web Search Engines

Web-page scoring is a method for improving Web search engines by assigning a score to each page according to its importance. The PageRank algorithm implemented for Google is a well-known, efficient scoring method for WWW search engines, but it is not efficient for searching a local Web. In the latter case, text matching is usually used for computing scores, and the hyperlink structure of Web pages is wasted. Although a method for scoring local Web pages called the HotLink method has been proposed, it is not well established, because the scores depend on how a tree structure, which is unknown, is extracted from the Web graph. In this paper, we solve this problem of the HotLink method by considering all shortest-path trees and taking the average score. As a result, the scores are independent of the selection of a tree, which makes them robust. We also propose an efficient algorithm to compute this average score in O(|V||E|) time, where V and E are the sets of pages and hyperlinks of a local Web graph, respectively. Experimental results show that our new scoring method captures important pages in a local Web.

Yohei Ikawa, Kunihiko Sadakane
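Each shortest-path tree in the averaging step can be built with one breadth-first search in O(|V| + |E|) time; repeating this from every page gives the O(|V||E|) total cited above. A minimal sketch with illustrative names:

```python
from collections import deque

def bfs_tree(graph, root):
    """Shortest-path (BFS) tree from one root, as a child -> parent map."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in parent:     # first visit = fewest hops from root
                parent[v] = u
                queue.append(v)
    return parent

web = {1: [2, 3], 2: [3], 3: []}
print(bfs_tree(web, 1))  # → {1: None, 2: 1, 3: 1}
```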
Discovering Aspects of Web Pages from Their Referential Contexts in the Web

There are an enormous number of Web pages of unknown authorship, and even though Web search engines precisely evaluate the relevancy of Web page contents, a user cannot be sure whether a search result shows credible information. Considering that a Web page is referred to by other pages in various contexts through links, these contexts indicate the reputation of the page. For example, some pages may refer to a company’s page as “an excellent local company” and still other pages may refer to it as “a member of a certain research project”, while the company’s page itself might contain only product and service information. Such references are called “aspects” of the Web page, as distinguished from the content of the page. In this paper, we propose an approach for discovering aspects for characterizing Web pages based on their contexts. We define criteria for selecting “aspectual” Web content based on (1) its strength of association with the page based on the logical structure of the Web (i.e. Web document structure and link structure), (2) its novelty of content compared to the page and (3) its typicality among multiple contexts. We evaluate how these criteria affect aspect discovery results. We also explain the grouping of Web pages based on aspect similarity. This helps us to find Web pages that are referred to in the same way even though their content is different.

Koji Zettsu, Yutaka Kidawara, Katsumi Tanaka

Mobile Databases I

A Log-Based Cache Consistency Control of Spatial Databases in Mobile Computing Environments

In mobile client/server computing environments, mobile clients access their server to get data of interest and then disconnect because of the high cost of wireless communication. Mobile clients usually keep their own local copies in order to reduce the overhead of communicating with the server. Updates to the server database may invalidate the map cached in mobile clients, yet it is not efficient to resend the entire copied map from the server to mobile clients to resolve the invalidation. This paper proposes a log-based update propagation method that propagates the server’s updates to the corresponding mobile clients by sending only the update logs. The log-based update propagation scheme raises new issues. First, the continuous growth of the update logs slows down the search for the log data relevant to a specific client. Second, transmitting the update logs to mobile clients over wireless communication incurs considerable overhead. To solve these problems, we define unnecessary logs and then suggest methods to remove them.

Kyounghwan An, Bonggi Jun, Jietae Cha, Bonghee Hong
Improving Concurrency Control in Mobile Databases

In this paper, we enhance the conventional Optimistic Concurrency Control algorithm (OCC) with an early termination mechanism on conflicting transactions. With the use of invalidation reports, conflicting transactions can be identified timely and terminated before reaching the validation phase. This mechanism is highly desirable, especially in the mobile environment, because allowing conflicting transactions to continue not only wastes constrained computing power and scarce bandwidth, but also exacerbates conflicts. The simulation results confirm our claims by showing strong performance gains according to various performance metrics.

Anand Yendluri, Wen-Chi Hou, Chih-Fang Wang
Just-in-Time Recommendation Using Multi-agents for Context-Awareness in Ubiquitous Computing Environment

Ubiquitous computing is a vision of the future in which computing technology becomes pervasive. One of the most important attributes of ubiquitous computing is context-awareness. Applications in which the user’s context changes continuously over time normally require just-in-time recommendation. This paper describes a new just-in-time recommendation method using multi-agents for context-awareness in a ubiquitous computing environment. For just-in-time recommendation, we locally store the recommendation information that the user is likely to need in the near future, based on predictions. To predict the information expected to be needed by the user, we use data mining methods. We propose a JITR Server Context Agent and a JITR Client Context Agent; they make just-in-time recommendations continuously as the user’s context changes, and overcome the limitation of storage. We explain our method with an example. Several experiments are performed, and the experimental results show that our method achieves good recommendation performance.

Joonhee Kwon, Sungrim Kim, Yongik Yoon

Parallel and Distributed Databases

LFU-K: An Effective Buffer Management Replacement Algorithm

This paper introduces a new approach to database disk buffering, called the LFU-K method. The LFU-K page replacement algorithm is an improvement of the Least Frequently Used (LFU) algorithm. The paper proposes a probability-theoretic model for the formal description of the LFU-K algorithm, and uses this model to derive estimates for the LFU-K parameters. The paper also describes an implementation of the LFU-2 policy. As we demonstrate by trace-driven simulation experiments, the LFU-2 algorithm provides significant improvement over conventional buffering algorithms for shared-nothing database systems.

Leonid B. Sokolinsky
Data Declustering with Replications

Declustering is used to distribute blocks of data among multiple devices, thus enabling parallel I/O access and reducing query response times. Many data declustering schemes have been proposed in the literature. However, these schemes are designed for non-replication systems, and thus they fail if any disk fails. Assuming that a single disk fails once every five years on average, a non-replication system with 100 disks would suffer a failure every 18 days. Data replication is a technique commonly used in multi-disk systems to enhance the availability of data during disk failures and, often as a second goal, to improve the I/O performance of read-intensive applications. In this paper, we propose the LOG data declustering scheme for systems with replication. Furthermore, we present a novel replication algorithm. Although the replication algorithm is designed for the LOG declustering scheme, it is also applicable to existing schemes such as DM, GFIB, and GRS. Finally, as demonstrated by our experimental results, the LOG scheme with the proposed replication algorithm provides a significant performance improvement over state-of-the-art data declustering schemes.

Yao Liu, Sam Y. Sung, Hui Xiong, Peter Ng
Efficient Declustering of Non-uniform Multidimensional Data Using Shifted Hilbert Curves

Data declustering speeds up large data set retrieval by partitioning the data across multiple disks or sites and performing retrievals in parallel. Performance is determined by how the data is broken into "buckets" and how the buckets are assigned to disks. While some work has been done on declustering uniformly distributed low-dimensional data, little work has been done on declustering non-uniform high-dimensional data. To decluster non-uniform data, a distribution-sensitive bucketing algorithm is crucial for achieving good performance. In this paper we propose a simple and efficient data-distribution-sensitive bucketing algorithm. Our approach uses shifted Hilbert curves to adapt to the underlying data distribution. Our proposed declustering algorithm gives good performance compared with previous work, which has mostly focused on bucket-to-disk allocation schemes. Our experimental results show that the proposed declustering algorithm achieves a performance improvement of up to 5 times relative to the two leading algorithms.
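
As background for the shifted-curve idea, the sketch below computes the standard 2-D Hilbert-curve index of a grid point, applies an optional coordinate shift before indexing, and deals the Hilbert-ordered points to disks round-robin. The distribution-sensitive bucketing of the paper is far more involved; `shift` here only demonstrates the shifted-curve mechanism.

```python
# Standard 2-D Hilbert curve distance (xy -> d), iterative form, plus a
# naive round-robin bucket-to-disk assignment. Illustrative only.

def hilbert_index(n, x, y):
    """Map grid point (x, y) to its distance along a Hilbert curve
    covering an n x n grid (n must be a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def assign_to_disks(points, n, num_disks, shift=0):
    """Sort points by (shifted) Hilbert order, then deal consecutive
    points to disks round-robin."""
    ordered = sorted(points,
                     key=lambda p: hilbert_index(n, (p[0] + shift) % n,
                                                 (p[1] + shift) % n))
    return {p: i % num_disks for i, p in enumerate(ordered)}

grid = [(x, y) for x in range(4) for y in range(4)]
placement = assign_to_disks(grid, n=4, num_disks=4)
```

Because Hilbert order preserves spatial locality, consecutive points land on different disks, which is what makes range queries parallelizable.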

Hak-Cheol Kim, Mario A. Lopez, Scott T. Leutenegger, Ki-Joune Li

Multimedia Databases I

Efficient and Flexible Bitmap Indexing for Complex Similarity Queries

In this paper, we propose a novel indexing method for complex similarity queries in high-dimensional image and video databases. To give the indexing method the flexibility to deal with multiple features and multiple query objects, we treat every dimension independently. The efficiency of our method comes from a specialized bitmap indexing that represents all objects in a database as a set of bitmaps. The percentage of data accessed by our indexing method is inversely proportional to the overall dimensionality, so performance does not deteriorate as dimensionality increases. To demonstrate the efficacy of our method, we conducted extensive experiments on real image and video datasets comparing its performance with the linear scan, and obtained a remarkable speed-up.
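
A minimal per-dimension bitmap index can be sketched as below; the binning scheme, class names, and query form are our own illustration, not the paper's design. Each dimension is binned independently, a bitmap per bin records which objects fall in it, and a query ANDs the per-dimension bitmaps of qualifying bins.

```python
# Hedged sketch of per-dimension bitmap indexing. Python integers serve
# as bitmaps; bit i of a bitmap means "object i is in this bin".

class BitmapIndex:
    def __init__(self, vectors, bins=4):
        self.vectors = vectors
        self.bins = bins
        self.dims = len(vectors[0])
        # bitmap[d][b]: mask of objects whose value in dimension d falls
        # in bin b (feature values assumed normalized into [0, 1))
        self.bitmap = [[0] * bins for _ in range(self.dims)]
        for i, v in enumerate(vectors):
            for d, val in enumerate(v):
                self.bitmap[d][min(int(val * bins), bins - 1)] |= 1 << i

    def range_query(self, low, high):
        """Candidate objects whose value may lie in [low, high] in every
        dimension; in general an exact refinement step would follow."""
        result = (1 << len(self.vectors)) - 1
        for d in range(self.dims):
            mask = 0
            for b in range(self.bins):
                lo, hi = b / self.bins, (b + 1) / self.bins
                if hi > low[d] and lo < high[d]:   # bin overlaps query range
                    mask |= self.bitmap[d][b]
            result &= mask
        return [i for i in range(len(self.vectors)) if result >> i & 1]

idx = BitmapIndex([(0.1, 0.9), (0.2, 0.2), (0.8, 0.3)])
hits = idx.range_query((0.0, 0.0), (0.5, 0.5))
```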

Guang-Ho Cha
Multimedia Data Integration and Navigation through MediaView: Implementation, Evolution and Utilization

To tackle the problems of multimedia data integration and navigation, MediaView, an extended object-oriented view mechanism, is devised to bridge the semantic gap between conventional databases and semantics-intensive multimedia applications. Such a view mechanism can further facilitate exploring internal, context-dependent features of multimedia representations. Due to the complex ingredients and dynamic application requirements of multimedia databases, however, it is difficult for users to define individual MediaViews by themselves in a top-down fashion. In this paper, we advocate a statistics-based approach to constructing MediaViews logically. In addition, a set of user-level operators is devised to accommodate the specialization and generalization relationships among MediaViews, so users may create their own MediaViews by deriving them from existing ones. We also show the effectiveness of using MediaView in the problem domain of multimedia data integration and navigation through experiments.

Dawei Ding, Qing Li, Jun Yang
Union and Intersection of Filtering Functions for Information Filtering

In our previous work, to establish a mathematical foundation for information filtering, we defined the notion of a filtering function, which represents filtering as a function, and clarified the characteristics of filtering. This mathematical foundation makes it possible to qualitatively evaluate various filtering methods, to optimize processing methods in filtering, and to design a declarative language for describing filtering policies. Moreover, since practical filtering approaches combine multiple methods, we previously revealed the properties of composite filtering functions. However, we had not considered operations other than composition. In this paper, we define filtering functions that carry out the union and intersection of filtering results, and clarify their properties. The results show that we can qualitatively represent filtering combined by more diverse strategies and reveal its characteristics.
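
The two operations can be stated very compactly if a filtering function is modeled as a map from a set of documents to the subset that passes; the formalism in the paper is richer than this sketch.

```python
# Union and intersection of filtering functions, modeled as functions
# from a document set to the passing subset. Example filters are ours.

def union(f, g):
    return lambda docs: f(docs) | g(docs)

def intersection(f, g):
    return lambda docs: f(docs) & g(docs)

short = lambda docs: {d for d in docs if len(d) <= 4}   # short titles
has_a = lambda docs: {d for d in docs if "a" in d}      # contains "a"

docs = {"ant", "whale", "banana", "bee"}
either = union(short, has_a)(docs)
both = intersection(short, has_a)(docs)
```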

Rie Sawai, Masahiko Tsukamoto, Tsutomu Terada, Shojiro Nishio

Mobile Databases II

Efficient Transaction Processing in Mobile Data Broadcast Environments

Data broadcasting in wireless information services is a very effective technique for disseminating information to a massive number of clients when the number of data items is small. When the number of items is large, however, it may be beneficial to integrate a client-to-server backchannel with the push-based data broadcast approach, resulting in hybrid data delivery. In this paper, the performance behavior of predeclaration-based transaction processing in mobile data broadcast environments is examined. Extensive simulation studies have been performed to evaluate the effectiveness of our methods, not only in pure push data delivery but also in hybrid data delivery. The analysis of the simulation results shows that predeclaration-based transaction processing can provide significant performance improvements in mobile data broadcast environments.
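
The predeclaration idea can be sketched as follows: a read-only transaction declares its full read set up front and then collects every needed item in a single pass over one broadcast cycle. This is our simplified rendering, with the broadcast modeled as a list of (item, value) pairs.

```python
# Hedged sketch of predeclaration-based read processing over a broadcast.

def run_predeclared(read_set, broadcast_cycle):
    """Collect all predeclared items in one pass over a broadcast cycle.
    Returns the snapshot, or None if some item was absent this cycle."""
    needed = set(read_set)
    result = {}
    for item, value in broadcast_cycle:      # single pass over the air
        if item in needed:
            result[item] = value
            needed.discard(item)
    return result if not needed else None

cycle = [("x", 1), ("y", 2), ("z", 3)]
snapshot = run_predeclared({"z", "x"}, cycle)
```

Because every read lands inside one cycle, the transaction sees a consistent snapshot without restarts, which is the intuition behind the performance gains reported above.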

SangKeun Lee, SungSuk Kim
GBL: Group-Based Location Updating in Mobile Environment

Conventionally, each mobile host frequently reports its current location to a location server, an approach we term the individual-based approach. With a large mobile host population, the demand on the wireless uplink channel becomes overwhelming. In this paper, we propose a group-based scheme to alleviate the demand on the uplink channel. With our scheme, nearby mobile hosts with similar mobility are clustered, and only one host in each cluster reports the members' locations to the server. This approach reduces the uplink traffic from mobile hosts to the server, since individual update messages are aggregated. We propose a dynamic group formation scheme, along with strategies for reporting host locations within a cluster and for reporting a cluster's location to the server. A simulation model is built to evaluate the effectiveness of our scheme. The results show that our scheme outperforms the individual-based approach, particularly under dense population scenarios.
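
A toy version of the grouping step might look like this; the greedy threshold rule below is our stand-in for the paper's dynamic group formation scheme, and all names are illustrative.

```python
# Hedged sketch: cluster hosts that are close and moving similarly, then
# send one aggregated uplink message per cluster instead of one per host.

def form_groups(hosts, dist_thr, vel_thr):
    """hosts: {name: (x, y, vx, vy)}. Greedy single-pass clustering that
    compares each host against the first member (leader) of each group."""
    groups = []
    for name, (x, y, vx, vy) in hosts.items():
        for g in groups:
            lx, ly, lvx, lvy = hosts[g[0]]
            if (abs(x - lx) + abs(y - ly) <= dist_thr
                    and abs(vx - lvx) + abs(vy - lvy) <= vel_thr):
                g.append(name)
                break
        else:
            groups.append([name])          # start a new group
    return groups

def uplink_messages(groups, hosts):
    """One aggregated location message per group."""
    return [{m: hosts[m][:2] for m in g} for g in groups]

hosts = {"a": (0, 0, 1, 0), "b": (1, 0, 1, 0), "c": (9, 9, -1, 0)}
groups = form_groups(hosts, dist_thr=3, vel_thr=1)
msgs = uplink_messages(groups, hosts)
```

Three hosts produce only two uplink messages here; the savings grow with cluster size, which matches the dense-population result claimed above.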

Gary Hoi Kit Lam, Hong Va Leong, Stephen Chi Fai Chan
A Moving Point Indexing Using Projection Operation for Location Based Services

Advances in wireless communication technology have increased interest in content services tied to the locations of mobile users. Objects in LBS move freely, changing their position and/or shape continuously over time, which makes it difficult to deliver moving-object locations to mobile users quickly. To process this information efficiently, we propose the MP-tree (short for Moving Point Tree), an indexing technique for moving point data that uses a projection operation. The projection operation extracts the common boundaries of child nodes and stores them in a projection storage for each axis, so that each projection storage holds the aligned boundaries of subordinate nodes. The ordered boundaries in each projection storage are effective not only for processing moving-object queries such as point and range queries, but also for reducing storage space and split cost, because they reduce unnecessary node searches caused by overlap and dead space, as well as the number of non-leaf nodes. Moreover, the MP-tree can process combined trajectory queries effectively using linked lists. The proposed MP-tree would be useful in car management using GPS and in navigation systems for LBS.

Eung Jae Lee, Young Jin Jung, Keun Ho Ryu

Data Mining and Knowledge Discovery in Web II

EGA: An Algorithm for Automatic Semi-structured Web Documents Extraction

With the fast expansion of the World Wide Web, more and more semi-structured web documents appear on the web. In this paper, we study how to extract information from semi-structured web documents using automatically generated wrappers. To automate wrapper generation and the data extraction process, we develop a novel algorithm, EGA (EPattern Generation Algorithm), that constructs extraction patterns based on the local structural context features of the web documents. These optimal or near-optimal extraction patterns are expressed in the XPath language. Experimental results on RISE and our own data sets confirm the feasibility of our approach.
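
This is not the EGA algorithm itself, only an illustration of the kind of artifact it derives: an XPath-style extraction pattern applied to a semi-structured page. Python's `xml.etree.ElementTree` supports a small XPath subset sufficient for the demonstration; the document and the pattern are invented.

```python
# Applying an XPath-subset extraction pattern to a semi-structured page.

import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body><table>"
    "<tr><td class='name'>Widget</td><td class='price'>9.99</td></tr>"
    "<tr><td class='name'>Gadget</td><td class='price'>19.99</td></tr>"
    "</table></body></html>")

# A learned extraction pattern might take this form:
names = [td.text for td in page.findall(".//td[@class='name']")]
```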

Liyu Li, Shiwei Tang, Dongqing Yang, Tengjiao Wang, Zhihua Su
An Automated Algorithm for Extracting Website Skeleton

The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most systems for generating wrappers focus on extracting data at page level, data extraction at site level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website skeleton, i.e., the underlying hyperlink structure used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages in the next level of the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing them using the same process. Our experiments on real-life websites show that the algorithm achieves high recall with moderate precision.
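
The recursive construction step can be sketched as below; the real algorithm's link-group analysis is replaced here by a caller-supplied stub that returns the navigation links of a page, and the toy site is invented.

```python
# Hedged sketch of recursive site-skeleton construction.

def build_skeleton(start, nav_links, seen=None):
    """Return the skeleton as a nested dict rooted at `start`.
    nav_links(page) -> navigation links identified on that page."""
    seen = set() if seen is None else seen
    seen.add(start)
    children = [l for l in nav_links(start) if l not in seen]
    return {start: [build_skeleton(c, nav_links, seen) for c in children]}

site = {
    "home": ["news", "sports"],
    "news": ["news/1", "news/2"],
    "sports": [],
    "news/1": [], "news/2": [],
}
skeleton = build_skeleton("home", lambda p: site.get(p, []))
```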

Zehua Liu, Wee Keong Ng, Ee-Peng Lim
Ontologies on the MOVE

The semantic web depends heavily on ontologies, as they provide information about the structures used. Ideally, these ontologies would be static, but in reality they are very much dynamic. This paper addresses the issues that arise from having large ontologies and automatically generated materialized ontology views, as well as how quality can be maintained in the materialized views. We use the notion of database views to extract ontologies from large base ontologies. A framework is required that accounts for variations between standards and differences in user requirements. MOVE, the Materialized Ontology View Extractor, is discussed as an architecture that offers a solution to these problems.

Carlo Wouters, Tharam Dillon, Wenny Rahayu, Elizabeth Chang, Robert Meersman
Incremental Maintenance of Discovered Mobile User Maximal Moving Sequential Patterns

In the context of mobile computing, a special kind of sequential pattern, the moving sequential pattern, which reflects the moving behavior of mobile users, has recently attracted researchers' interest. While a number of efficient moving sequential pattern mining algorithms have been reported, this paper concentrates on the maintenance of mined maximal moving sequential patterns. In particular, we develop an incremental approach in which maximal moving sequential patterns are stored in prefix trees, so that new moving sequences can easily be combined with the existing patterns. A performance study indicates that the proposed approach is significantly faster than straightforward approaches that mine the whole updated database from scratch.
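
The prefix-tree storage can be sketched with a plain nested-dict trie: stored patterns are matched against a newly arrived moving sequence by a simple walk, instead of rescanning the database. The incremental counting and pruning logic of the paper is omitted.

```python
# Hedged sketch: a trie of maximal moving sequential patterns.

def insert(trie, seq):
    node = trie
    for cell in seq:
        node = node.setdefault(cell, {})
    node["$"] = True                    # mark the end of a stored pattern

def has_prefix_pattern(trie, seq):
    """True if some stored pattern is a prefix of `seq`."""
    node = trie
    for cell in seq:
        if "$" in node:
            return True
        if cell not in node:
            return False
        node = node[cell]
    return "$" in node

patterns = {}
insert(patterns, ["A", "B", "C"])       # mined maximal patterns
insert(patterns, ["A", "D"])
```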

Shuai Ma, Shiwei Tang, Dongqing Yang, Tengjiao Wang, Chanjun Yang

Multimedia Databases II

Similarity Search and Dimensionality Reduction: Not All Dimensions Are Equally Useful

Indexing high-dimensional data is a well-known problem. Techniques for dimensionality reduction, which map D-dimensional objects onto a d-dimensional space (d ≪ D), are often used to speed up similarity queries. In this paper we show that one can further improve query performance by initially overestimating the reduction, i.e., reducing the dimensionality of the space to D′ dimensions, where d < D′ < D, and, at query time, automatically choosing only d′ dimensions to be used, where d′ < d; that is, using only a few good dimensions after the initial reduction. By incorporating this idea within a recently proposed technique, we can process range queries up to three times faster at the expense of limited storage overhead.
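
A toy rendering of the per-query dimension choice: keep D′ dimensions after the initial reduction, but answer each query using only the d′ dimensions along which that query is most selective. The selectivity criterion below (distance of the query from the data mean along each dimension) is a stand-in for the paper's actual criterion.

```python
# Hedged sketch: choosing d' "good" dimensions per query.

def pick_query_dims(data, query, d_prime):
    """Return the indices of the d' dimensions where the query is
    farthest from the data mean (our illustrative selectivity proxy)."""
    dims = len(query)
    means = [sum(v[i] for v in data) / len(data) for i in range(dims)]
    ranked = sorted(range(dims), key=lambda i: -abs(query[i] - means[i]))
    return sorted(ranked[:d_prime])

data = [(0.5, 0.5, 0.5), (0.4, 0.6, 0.5), (0.6, 0.4, 0.5)]
dims = pick_query_dims(data, query=(0.9, 0.5, 0.1), d_prime=2)
```

Dimension 1, where the query sits exactly at the data mean, prunes nothing for this query and is dropped.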

Christian Digout, Mario A. Nascimento, Alexandru Coman
Relative Queries and the Relative Cluster-Mapping Method

Most conventional database systems and information retrieval systems force users to specify the qualification conditions in queries in an "absolute" manner. That is, a user must specify a qualification condition to be satisfied by each retrieval result. To do this properly, users must have sufficient knowledge about the metadata and their structures. In the real world, however, people often specify a qualification in a relative manner, such as "I prefer this one among these." In this paper, we propose the notion of "relative queries" and their query processing method, called "relative cluster-mapping." A relative query specifies the user's selection criteria by selecting his/her favorite data from a sample data cluster. This notion is useful when users do not have sufficient knowledge about the metadata, or cannot formulate a complete qualification condition. The relative cluster-mapping method maps the relative position of the user-selected data in a sample data cluster onto a target data cluster and returns an answer from the target data cluster; the answer's relative position in the target data cluster is similar to that of the selected data in the sample data cluster. We believe it is more natural to express the desired data using relative qualifications. We have also developed a prototype system and evaluated its appropriateness.
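
A minimal sketch (our own simplification of the method described above): the selected item's offset from its sample cluster's centroid is applied to the target cluster's centroid, and the target item closest to that mapped position is returned.

```python
# Hedged sketch of relative cluster-mapping in 2-D.

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(2))

def relative_map(selected, sample, target):
    """Map the selected item's offset from the sample centroid onto the
    target cluster; return the nearest target item to that position."""
    sc, tc = centroid(sample), centroid(target)
    goal = (tc[0] + selected[0] - sc[0], tc[1] + selected[1] - sc[1])
    return min(target,
               key=lambda p: (p[0] - goal[0])**2 + (p[1] - goal[1])**2)

sample = [(0, 0), (2, 0), (1, 2)]            # user picks the "high" item
target = [(10, 10), (12, 10), (11, 12)]
answer = relative_map((1, 2), sample, target)
```

Picking the topmost item of the sample cluster yields the topmost item of the target cluster, mirroring the relative position rather than any absolute condition.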

Shinsuke Nakajima, Katsumi Tanaka
Improving Query Effectiveness for Large Image Databases with Multiple Visual Feature Combination

This paper describes CMVF, a new framework for indexing multimedia data using multiple data properties combined with a neural network. The goal of this system is to allow straightforward incorporation of multiple image feature vectors, based on properties such as colour, texture, and shape, into a single low-dimensional vector that is more effective for retrieval than the larger individual feature vectors. CMVF is not constrained to visual properties, and can also incorporate human classification criteria to further strengthen the image retrieval process. The analysis in this paper concentrates on CMVF's performance on images, examining how the incorporation of extra features into the index affects both efficiency and effectiveness, and demonstrating that CMVF's effectiveness is robust against various kinds of common image distortion as well as the initial (random) configuration of the neural network.

Jialie Shen, John Shepherd, Anne H. H. Ngu, Du Q. Huynh

Mobile Databases III

Dynamic Data Replication Using Aperiodic Updates in Mobile Adhoc Networks

In this paper, we present three extended dynamic data replication schemes for mobile ad-hoc networks. We improve upon existing replication algorithms by considering aperiodic updates and by integrating user profiles consisting of mobile users' mobility schedules, access behavior, and read/write patterns. Our schemes actively reconfigure the replicas to adjust to changes in user behavior and network status. We present the three replication algorithms and their performance in an environment where data items are updated aperiodically, and where the frequency of access to each data object from mobile hosts and the status of the network connection are also considered.

Takahiro Hara, Sanjay Kumar Madria
Stream Selection Policies for Transcoding Multimedia Presentations Composed of Multiple Streams to Play on Mobile Terminals

This paper proposes various policies, together with experimental results, for selecting which streams in a presentation should be transcoded for mobile terminals, after distinguishing the characteristics of the streams in the presentation. This applies to presentations composed of various data streams on a PC server. The rationale of the research is that stream selection policies for transcoding make playback on mobile terminals considerably faster than when every stream has to be transcoded. Thus, the paper suggests a stream selection policy for transcoding based on the EPOB (End Point of Over Bandwidth), which aims to lower the required bandwidth of the presentation to the network bandwidth as well as to minimize the initial delay before the presentation plays on mobile terminals.

Maria Hong, Joon-Sung Yoon, Yong Joon Lee, Jae Gak Hwang, Young Hwan Lim
Efficient Group Pattern Mining Using Data Summarization

In group pattern mining, we discover group patterns from a given user movement database based on their spatio-temporal distances. When both the number of users and the logging duration are large, group pattern mining algorithms become very inefficient. In this paper, we therefore propose a spherical location summarization method to reduce the overhead of mining valid 2-groups. In our experiments, we show that our group mining algorithm using summarized data may require much less execution time than that using non-summarized data.

Yida Wang, Ee-Peng Lim, San-Yih Hwang
A Cost Effective Cache Consistency Method for Mobile Clients in Wireless Environment

When a mobile client disconnects for a prolonged time that exceeds the window size of the server's invalidation reports, the mobile client's cache is discarded even if most of its data is still valid. In this paper, we propose an efficient cache consistency method for mobile clients, called CCI (Cost-based Cache Invalidation), which takes into account not only the disconnection time but also the update frequencies at the server.
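
A cost-based keep-or-drop decision of this general flavor can be sketched as follows; the cost model, the Poisson update assumption, and all names are our own illustration, not the CCI method as published.

```python
# Hedged sketch: keep the cache after a long disconnection iff checking
# items and refetching the expected invalid ones is cheaper than
# refetching everything. Assuming Poisson updates at per-item rate r, an
# item survives a disconnection of length t with probability exp(-r*t).

import math

def expected_valid_fraction(update_rates, t):
    return sum(math.exp(-r * t) for r in update_rates) / len(update_rates)

def keep_cache(update_rates, t, check_cost, fetch_cost):
    n = len(update_rates)
    valid = expected_valid_fraction(update_rates, t)
    cost_keep = n * check_cost + n * (1 - valid) * fetch_cost
    cost_drop = n * fetch_cost
    return cost_keep < cost_drop

rates = [0.001, 0.002, 0.01]        # updates per second, per item
decision_short = keep_cache(rates, t=60, check_cost=1, fetch_cost=20)
decision_long = keep_cache(rates, t=100000, check_cost=1, fetch_cost=20)
```

After a short disconnection most items are still valid and the cache is worth keeping; after a very long one nearly everything is stale and dropping it is cheaper, which is the trade-off a cost-based method weighs.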

Song-Yi Yi, Wonmin Song, Sungwon Jung, Sooyong Park
Supporting Benefit-Oriented Retrieval for Data on Air

Battery power is always a limitation when using mobile devices to access information in a wireless environment. One way to alleviate the problem is to let the device go into a doze (no-reaction or disconnection) mode periodically so that the battery lasts longer. Broadcast disk technology is becoming a popular method for data dissemination in wireless information systems. In this paper, we study, from a mobile host's viewpoint, how to achieve the highest “benefit” when it listens to the broadcast discontinuously. We propose three fast and near-optimal algorithms for the mobile host to retrieve the broadcast data.

Chao-Chun Chen, Lien-Fa Lin, Chiang Lee
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
YoonJoon Lee
Jianzhong Li
Kyu-Young Whang
Doheon Lee
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-24571-1
Print ISBN
978-3-540-21047-4
DOI
https://doi.org/10.1007/b95600