
2004 | Book

Advances in Web-Age Information Management

5th International Conference, WAIM 2004, Dalian, China, July 15-17, 2004

Edited by: Qing Li, Guoren Wang, Ling Feng

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Talks

A Framework for a Trusted Environment for Virtual Collaboration

Trusted computing has assumed major significance since the advent of internet-based commerce and peer-to-peer communications. Trust is the belief that a peer has in another peer in a given context. This paper addresses both identity trust, which establishes the identity of another party, and behaviour trust, which answers the question "Do you believe that the entity will behave as you expect it to?" It develops complete definitions of context- and time-dependent trust and reputation, and develops trust models and the trust relationships of a peer.

Tharam S. Dillon, Elizabeth Chang, Farookh Hussain
Future Interconnection Environment – Dream, Principle, Challenge and Practice

Networks and flows are everywhere in society, nature and the virtual world, organizing versatile resources and behaviors. Breaking boundaries, this keynote establishes a scenario of the future interconnection environment: a large-scale, autonomous, live, sustainable and intelligent network environment where society, nature, and the environment harmoniously co-exist, cooperatively work and evolve. It automatically senses and collects useful information from nature, transforms raw information into resources in the environment, and organizes resources in semantically rich normal forms that can be easily understood by both machines and humans. Geographically dispersed people and resources can work in cooperation to accomplish tasks and solve problems by actively participating in material flows, energy flows, technology flows, information flows, knowledge flows and service flows in the environment through roles and machines, improving the environment and nature. Parameters, characteristics and principles of the environment are proposed from the system methodology point of view. A case study on a live semantic link network under a reference model of the environment is presented. As a practice, the advanced architecture and technology used in the China Knowledge Grid e-science environment IMAGINE-1 is introduced.

Hai Zhuge
Recent Advances in Histogram Construction Algorithms

Obtaining fast and accurate approximations to data distributions is a problem of central interest in database management. A variety of database applications, including approximate querying, similarity searching and data mining, rely on such accurate approximations in most application domains. A very popular way in database theory and practice to approximate a data distribution is by means of a histogram. Histograms approximate a data distribution using a fixed amount of space and, under certain assumptions, strive to minimize the overall error incurred by the approximation. The problem of histogram construction is of profound importance and has attracted a lot of research attention. A fundamental requirement for any good histogram construction algorithm is to approximate the underlying data distribution in a provably good way and to be efficient in terms of running time. The typical assumption is that the data set to be approximated is finite and of known size, or that the size can easily be derived by performing a single pass over the finite data set. The problem of histogram construction on data sets of known size has received a lot of research attention, and both optimal and many heuristic solutions exist. Wavelet synopses are in fact very simple histograms, and this approach allows us to minimize some objective fit criterion rather than simply storing the k highest coefficients of a wavelet decomposition. In all these problems, given a sequence of data values x1,...,xn, the task is to construct a suitable summary of the data that can be stored in small space (e.g., a small fixed number, say B, of the n coefficients in the wavelet transform, or a histogram involving B buckets). In this talk, we first present optimal and approximate histogram construction algorithms that have been proposed recently. We next discuss algorithms that approximate a data stream in an online and incremental way using histograms.
We also describe wavelet synopsis algorithms that include deterministic and probabilistic thresholding schemes to select the wavelet coefficients. We finally point out the drawbacks of such existing schemes and introduce future research directions.

Kyuseok Shim
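The talk's core idea of a B-term wavelet synopsis can be illustrated with a small sketch (ours, not the speaker's): compute the Haar decomposition of a signal, keep only the B largest-magnitude coefficients, and reconstruct an approximation from them.

```python
def haar_transform(data):
    # Un-normalized Haar wavelet decomposition (length must be a power of 2).
    coeffs = list(data)
    n = len(coeffs)
    output = []
    while n > 1:
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        output = diffs + output        # finer detail coefficients go last
        coeffs = avgs
        n //= 2
    return coeffs + output             # overall average first

def inverse_haar(coeffs):
    # Invert the transform level by level.
    result = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        n = pos                        # number of averages at this level
        details = coeffs[pos:pos + n]
        result = [v for a, d in zip(result, details) for v in (a + d, a - d)]
        pos += n
    return result

def top_b_synopsis(data, b):
    # Keep only the b largest-magnitude coefficients; zero out the rest.
    c = haar_transform(data)
    keep = set(sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)[:b])
    sparse = [c[i] if i in keep else 0.0 for i in range(len(c))]
    return inverse_haar(sparse)
```

With b equal to the signal length the reconstruction is exact; smaller b trades accuracy for space, which is exactly the error/space trade-off the thresholding schemes in the talk optimize.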

Data Stream Processing

Dynamic Adjustment of Sliding Windows over Data Streams

Data stream systems provide sliding windows to preserve recently arrived streaming data in order to support continuous queries in real time. In this paper, we consider the problem of dynamically adjusting the buffer size of sliding windows when the rate of streaming data changes or when queries start or end. Based on the available memory resources and the memory requirements of queries, we propose corresponding adjustment algorithms using a greedy method and a dynamic programming method, which minimize the total error of queries or achieve low memory overhead. Analytical and experimental results show that our algorithms can be applied to data stream systems efficiently.

Dongdong Zhang, Jianzhong Li, Zhaogong Zhang, Weiping Wang, Longjiang Guo
Space Efficient Quantile Summary for Constrained Sliding Windows on a Data Stream

In many online applications, we need to maintain quantile statistics for a sliding window on a data stream. In their natural form, sliding windows are defined over the most recent N data items. In this paper, we study the problem of estimating quantiles over other types of sliding windows. We present a uniform framework to process quantile queries for time-constrained and filter-based sliding windows. Our algorithm makes one pass over the data stream and maintains an ε-approximate summary. It uses $O(\frac{1}{\epsilon^2}\log^2{\epsilon N})$ space, where N is the number of data items in the window. We extend this framework to process generalized constrained sliding window queries and prove that our technique is applicable to flexible window settings. Our performance study indicates that the space required in practice is much less than the theoretical bound and that the algorithm supports high-speed data streams.

Jian Xu, Xuemin Lin, Xiaofang Zhou
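As a point of reference for what an ε-approximate summary computes, here is the naive exact baseline over a count-based sliding window (a sketch of ours; the paper's contribution is answering such queries in sublinear space, which this baseline does not attempt):

```python
from collections import deque

def sliding_quantiles(stream, window, phi):
    """Exact phi-quantile (0 <= phi < 1) over the last `window` items.
    An eps-approximate summary returns an element whose rank differs from
    this one by at most eps * window, using far less memory."""
    buf = deque(maxlen=window)   # deque drops the oldest item automatically
    out = []
    for x in stream:
        buf.append(x)
        ranked = sorted(buf)
        out.append(ranked[min(len(ranked) - 1, int(phi * len(ranked)))])
    return out
```

The baseline costs O(window log window) per arrival, which is why summaries with guaranteed rank error are needed for high-speed streams.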

Time Series Data Processing

The Dimension-Reduced Multi-scale Histogram: A New Method for Fast and Efficient Retrieval of Similar Time Sequences and Subsequence

In this paper, we propose a dimensionality-reduced representation of multi-scale time series histograms. It is a fast and efficient way to pre-select time sequences in a database and reduces the number of time sequence comparisons needed when answering similarity queries. A new metric distance function MD() that consistently lower-bounds WED and also satisfies the triangular inequality is presented; based on it, we construct the Slim-tree index structure as the metric access method to answer similarity queries. We also extend the approach to subsequence matching and present an MSST index structure.

Jian-Wei Liu, Shou-Jian Yu, Jia-Jin Le
Time Series Prediction Based on Gene Expression Programming

This paper proposes two novel methods for time series prediction based on GEP (Gene Expression Programming). The main contributions include: (1) the GEP-Sliding Window Prediction Method (GEP-SWPM), which mines the relationship between future and historical data directly; (2) the GEP-Differential Equation Prediction Method (GEP-DEPM), which mines ordinary differential equations from training data and predicts future trends based on specified initial conditions; (3) a brand-new equation mining method, called Differential by Microscope Interpolation (DMI), that boosts the efficiency of our methods; (4) a new, simple and effective GEP constants-generation method called Meta-Constants (MC); (5) a proof that a minimum expression discovered by the GEP-MC method with error not exceeding δ/2 uses at most log3(2L/δ) operators, and that the problem of finding a δ-accurate expression with fewer operators is NP-hard. Extensive experiments on real data sets for sunspot prediction show that the performance of the new method is 20-900 times higher than that of existing algorithms.

Jie Zuo, Chang-jie Tang, Chuan Li, Chang-an Yuan, An-long Chen
A Grid-Based Index Method for Time Warping Distance

Recently DTW (dynamic time warping) has been recognized as the most robust distance function for measuring the similarity between two time series, and this fact has spawned a flurry of research on the topic. Most indexing methods proposed for DTW are based on the R-tree structure. Because of high dimensionality and loose lower bounds for the time warping distance, the pruning power of these tree structures is quite weak, resulting in inefficient search. In this paper, we propose a dimensionality reduction method motivated by observations about the inherent character of each time series. A very compact index file is constructed. By scanning the index file, we can obtain a very small candidate set, so that the number of page accesses is dramatically reduced. We demonstrate the effectiveness of our approach on real and synthetic datasets.

Jiyuan An, Yi-Ping Phoebe Chen, Eamonn Keogh
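For reference, the DTW distance that such index structures try to avoid computing for most candidates is the classic dynamic program (a minimal sketch; names are ours):

```python
def dtw(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the DTW distance.
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j].
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A step may advance in a, in b, or in both (the warping).
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because each full DTW evaluation is quadratic, an index whose lower bound is tight enough to prune most candidates saves far more than the index lookup costs.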

Security

Fail-Stop Authentication Protocol for Digital Rights Management in Adaptive Information Retrieval System

Digital Rights Management (DRM) is one of the most important issues in digital libraries; however, it is difficult to balance the security and efficiency of a DRM system in an open network environment, especially when the environment is untrusted. In this paper, a new Secure Authentication protocol for Digital Rights Management (SADIRM) is proposed to provide perfect usage control and fair transactions in an adaptive information retrieval system, in which a License Finite State Machine (LFSM) is adopted to manage digital rights under a third-party Certificate Authority (CA), making atomic authorization and dynamic forced revocation of licenses secure and fair. Throughout all transaction stages, a fail-stop digital signature is used for identity authentication, integrity verification, and non-repudiation. Analysis shows that the proposed protocol is feasible and secure, with high integrity, and provides a new and reliable approach to DRM.

Zhaofeng Ma, Boqin Feng
Effective Web-Related Resource Security Using Distributed Role Hierarchy

The recent trend in access control for Web-related resources is that the service provider differentiates services according to user attributes such as membership class or job position. For proper and simple representation of the relationships between user attributes and permissions, role-based access control is used on the Web. As the size of an organization increases, so does the number of membership classes or job positions. Naturally, the role hierarchy becomes too complicated to administer properly, and the lack of proper role administration makes the system vulnerable in access control. Moreover, web services are distributed and categorized by type of service, and it is difficult for a centralized server to control such a distributed environment properly. In this paper, we propose a distributed role hierarchy module that can manage the role hierarchy effectively and practically. The proposed system employs distributed local role hierarchies whose combination represents a logical global role structure; each distributed role hierarchy module manages the local role hierarchy related to it.

Gunhee Lee, Wonil Kim, Dong-kyoo Kim, Hongjin Yeh

Mobile Computing

A Low-Cost Checkpointing Scheme for Mobile Computing Systems

In distributed computing systems, processes on different hosts take checkpoints to survive failures. For mobile computing systems, conventional distributed checkpointing schemes need to be reconsidered due to certain new characteristics. In this paper, we propose a low-cost coordinated checkpointing algorithm. During normal message transmission, the checkpoint dependency information among mobile hosts is recorded in the corresponding mobile support stations. When a checkpointing procedure begins, the initiator informs the relevant mobile hosts concurrently, which minimizes the identification time. Moreover, compared with existing coordinated checkpointing schemes, our algorithm blocks the minimum number of mobile support stations during the identification procedure. Simulation shows that the proposed algorithm outperforms other coordinated checkpointing schemes and provides better system performance for mobile computing systems.

Guohui Li, Hongya Wang, Jixiong Chen
Semantic and Adaptive Middleware for Data Management in Smart Vehicle Space

With the increasing demand for adaptive middleware for real-time embedded systems in ubiquitous computing environments, the need for novel software architectures and programming infrastructure to manage data is widely recognized. In this paper, we present SAM (Semantic and Adaptive Middleware), an efficient middleware architecture that supports real-time embedded systems in the smart vehicle space. SAM's architecture is based on an embedded real-time OS and a CAN-based communication environment, and includes a tinyORB kernel, distributed components, distributed meta objects, services and tools. In SAM, we mainly utilize a context-sensitive repository and knowledge management technology to realize adaptive behaviour. The two contributions of this paper are the proposal of the SVA (Semantic Virtual Agent) and a component-level adaptive mechanism. SVAs use SIP (Semantic Interface Protocol) to communicate and auto-update themselves. To achieve component-level self-adaptation and reflection, a component hook mechanism is introduced. In addition, we further discuss the hardware infrastructure and one application of SAM to manage heterogeneous data.

Qing Wu, Zhaohui Wu, Bin Wu, Zhou Jiang
Dynamic Resource Reservation Using the Differentiated Handoff Estimation Model for Mobile Networks

In this paper, we propose a differentiated handoff estimation model and a dynamic resource reservation scheme. The differentiated handoff estimation model is built on a reducible Markov chain over inner regions in a cell. The dynamic resource reservation scheme can dynamically adjust already reserved resources as the mobility class of a mobile host changes, and can reserve resources for mobile hosts according to their mobility classes and handoff probabilities. The proposed model and scheme efficiently improve the connection blocking and connection dropping probabilities.

Si-Yong Park, Ki-Dong Chung
Adaptive Generic Communications for Integrated Mobile and Internet Web-Services

This paper presents a novel Generic Communication Model (GCM) for integrated mobile and Internet web services. GCM is an adaptive protocol selection mechanism that eases the balance between transmission performance and communication interoperability in wired networks and wireless mobile networks. With GCM, end users can intelligently select a suitable protocol among different protocols to adapt to different situations without any interruption or disconnection. In this way, GCM can provide seamless mobile communication by integrating both 3G telecom networks and IEEE 802.11 wireless networks, allowing various client devices to communicate at their best acceptable QoS level through consistent and simple APIs. We present a preliminary prototype of GCM to illustrate its functionality in communicating with heterogeneous mobile devices using various communication protocols.

Man-Ching Yuen, Leung Cheng, Pui-On Au, Weijia Jia
Spatio-temporal Database with Multi-granularities

Spatio-temporal information processing has grown rapidly in recent years. Although uncertainty and multiple granularities are common features of spatio-temporal data, these problems have not yet been well solved. We put forward a new spatio-temporal granularity representation method that supports multiple granularities and uncertain spatio-temporal objects, and apply it to a spatio-temporal database, ASTD. By supporting multiple granularities and approximate regions, ASTD can process data at multiple levels together, perform queries at varying precision, and handle uncertainty. Compared with similar systems, ASTD is more powerful in its support for multiple granularities, uncertainty, object types and granularity operations.

Sheng-sheng Wang, Da-you Liu

Cache Management

A Simple but Effective Dynamic Materialized View Caching

The dynamic materialized view management problem has recently received considerable attention. The problem is how to dynamically maintain some results of online analytical processing (OLAP) queries so as to maximize the possibility of answering subsequent ad-hoc OLAP queries while minimizing the total query processing cost. However, existing dynamic view management approaches are limited in their support for OLAP queries. In this paper, we propose a dynamic materialized view management approach with a new hit-prediction scheme, which can predict users' interests in an ongoing fashion and adapt to changes when necessary. Extensive performance studies show that our approach achieves a high cache hit ratio and improves the response time or throughput for different kinds of OLAP queries.

Chi-Hon Choi, Jeffrey Xu Yu, Hongjun Lu
A Cache Replacement Algorithm in Hierarchical Storage of Continuous Media Object

Although storage technologies have made great progress, hierarchical storage is still an efficient way to manage massive data, especially massive Continuous Media (CM) data. In some large web-based CM applications such as DL (Digital Library) and VOD (Video-On-Demand) systems, cache replacement algorithms are used to keep hot data on disk and migrate out data that is rarely accessed. Existing algorithms treat all files equally and use the cache miss rate to evaluate performance. However, in actual CM applications, most video objects are stored as several separate files that are played one by one, so the strong relationships among the access times of these files make prefetching practicable and efficient. If all files are treated the same, system performance degrades; meanwhile, the cache miss rate cannot evaluate performance comprehensively. In this paper, a new metric, the User Waiting Rate, is defined to evaluate performance, and a novel cache replacement algorithm, the Least Waiting Probability (LWP) algorithm, is proposed. Experimental results show that it improves performance considerably and is highly adaptive.

Yaoqiang Xu, Chunxiao Xing, Lizhu Zhou
CCAN: Cache-Based CAN Using the Small World Model

The small world model has recently been used to improve routing performance in P2P networks. In this paper, we propose CCAN, a cache-based CAN built on the small world model. CCAN caches long contacts and adopts a novel probabilistic cache replacement strategy; this probabilistic cache scheme proves to be a new approach to modeling the small world phenomenon. Experiments in both static and dynamic networks show that CCAN converges to a stable state with the cache scheme, and that routing performance is improved with almost zero overhead compared with CAN.

Futai Zou, Yin Li, Liang Zhang, Fanyuan Ma
Performance Evaluations of Replacement Algorithms in Hierarchical Web Caching

Web caching plays an important role in many network services. Utilization of the cache at each level (server, proxy, and client) of the network forms a web caching hierarchy. A major problem of hierarchical caching is poor performance due to the interference between upper-level and lower-level caches. This paper investigates the replacement algorithms applied in the network caching hierarchy through trace-driven simulation experiments to identify the performance bottleneck. Our investigation focuses on three fundamental replacement algorithms (LRU, LFU and SIZE), because many other replacement algorithms are mainly variations and/or combinations of these three. Through extensive experiments, we have acquired useful performance characteristics of these algorithms and their combinations at different levels. Our work may thus serve as a reference for the design and deployment of efficient replacement algorithms in the caching hierarchy for web engineering.

Haohuan Fu, Pui-On Au, Weijia Jia
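The kind of trace-driven simulation described above can be sketched as follows. This is a generic cache simulator of ours, not the paper's experimental setup, and it illustrates only two of the policies (LRU and SIZE):

```python
from collections import OrderedDict

def simulate(trace, sizes, capacity, policy):
    """Replay a request trace through a cache of at most `capacity` bytes.
    policy 'LRU' evicts the least recently used object, 'SIZE' the largest.
    Returns the hit rate over the whole trace."""
    cache = OrderedDict()          # object id -> size, oldest use first
    used, hits = 0, 0
    for obj in trace:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # refresh recency for LRU
            continue
        size = sizes[obj]
        while used + size > capacity and cache:
            victim = (next(iter(cache)) if policy == "LRU"
                      else max(cache, key=cache.get))
            used -= cache.pop(victim)
        if used + size <= capacity:  # skip objects larger than the cache
            cache[obj] = size
            used += size
    return hits / len(trace)
```

Running the same trace under different policies and capacities is exactly how per-level replacement behavior in a caching hierarchy can be compared.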
Component-Based WebGIS and Its Spatial Cache Framework

The component model is a primary approach to extending the functionality of WebGIS. A component-based WebGIS system, Geo-Union, is introduced. Geo-Union has a multi-level Client/Server architecture comprising four layers: an application layer, a component layer, a service layer and a storage layer, where the service layer has different units to provide both client and server services. The component partitioning and web mode of Geo-Union are also discussed. Finally, a spatial cache framework, including a spatial database cache, a network spatial cache and a spatial data proxy server, is designed in Geo-Union to improve spatial data access performance in different network situations.

Yingwei Luo, Xiaolin Wang, Zhuoqun Xu

Query Evaluation

Learning-Based Top-N Selection Query Evaluation over Relational Databases

A top-N selection query against a relation finds the N tuples that satisfy the query condition best, though not necessarily completely. In this paper, we propose a new method for evaluating top-N selection queries against relational databases. This method employs a learning-based strategy: initially, it finds and saves the optimal search spaces for a small number of random top-N queries, and the learned knowledge is then used to evaluate new queries. Extensive experiments measuring the performance of this strategy indicate that it is highly competitive with existing techniques for both low-dimensional and high-dimensional data. Furthermore, the knowledge base can be updated based on new user queries to reflect new query patterns, so that frequently submitted queries can be processed more efficiently.

Liang Zhu, Weiyi Meng
DHT Based Searching Improved by Sliding Window

Efficient full-text search is a big challenge in Peer-to-Peer (P2P) systems. Recently, the Distributed Hash Table (DHT) has become one of the reliable communication schemes for P2P. Some research efforts perform keyword searching and result intersection on a DHT substrate, where two or more search requests must be issued for a multi-keyword query. This article proposes a Sliding Window improved Multi-keyword Searching method (SWMS) to index and search full text for short queries on a DHT. The main assumptions behind SWMS are: (1) the query overhead of standard inverted list intersection is prohibitive in a distributed P2P system; (2) in most documents relevant to a multi-keyword query, the keywords appear near each other. The experimental results demonstrate that our method guarantees search quality while reducing communication cost.

Shen Huang, Gui-Rong Xue, Xing Zhu, Yan-Feng Ge, Yong Yu
A Scalable and I/O Optimal Skyline Processing Algorithm

In this paper, we present a novel progressive algorithm for computing skyline queries. The algorithm is based on the depth-first search paradigm and can be shown to be I/O optimal. Further, we show that the algorithm takes logarithmic memory space in 2D space in the worst case if there are not many intersections in the adopted R-tree. Our experiments demonstrate that the new algorithm is more scalable and efficient than existing techniques, especially in low-dimensional spaces, and that it is progressive in a way that is sensitive to a user's preference.

Yi Luo, Hai-Xin Lu, Xuemin Lin
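Independent of the paper's I/O-optimal R-tree algorithm, the skyline itself is defined by dominance; a brute-force O(n^2) reference implementation (ours) makes the semantics concrete:

```python
def dominates(p, q):
    # p dominates q if p is <= q in every dimension and strictly < in at
    # least one (assuming smaller values are preferred in every dimension).
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline(points):
    # A point is in the skyline iff no other point dominates it.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

Progressive algorithms such as the paper's return these same points incrementally, emitting each skyline point as soon as it is confirmed rather than after scanning the whole data set.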
Linearization Approach for Efficient KNN Search of High-Dimensional Data

K-Nearest Neighbors (KNN) search in high-dimensional feature spaces is an important paradigm for a variety of applications. Despite continuous efforts in past years, algorithms based on data partitioning methods (DPMs), such as the R*-tree, SR-tree and self-organizing maps (SOM), are outperformed by a linear scan when finding the exact KNN answer set at high dimensions. In this paper, we present a "plug & search" method to greatly speed up the exact KNN search of existing DPMs. The idea is to linearize the data partitions produced by a DPM, rather than the points themselves, based on the distances between the partitions and a selected reference point, resulting in a one-dimensional array-index: a simple, compact and yet fast index structure. Our KNN search algorithm sets conditions to skip irrelevant partitions and to stop the search early, as soon as the exact KNN answer points are retrieved.

Zaher Al Aghbari, Akifumi Makinouchi
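The linearization idea, ordering entries by their distance to a reference point and pruning by the triangle inequality, can be sketched at the level of individual points (a simplification of ours; the paper linearizes partitions, not points, and answers KNN rather than 1-NN):

```python
import bisect
import math

class LinearizedIndex:
    """One-dimensional array index: points sorted by their distance to a
    reference point. By the triangle inequality, p can be within r of a
    query q only if |d(p, ref) - d(q, ref)| <= r, so the scan outward from
    the query's position can stop early."""

    def __init__(self, points, ref):
        self.ref = ref
        self.entries = sorted((math.dist(p, ref), p) for p in points)
        self.keys = [k for k, _ in self.entries]

    def nn(self, q):
        dq = math.dist(q, self.ref)
        i = bisect.bisect_left(self.keys, dq)
        best, best_d = None, float("inf")
        lo, hi = i - 1, i
        while lo >= 0 or hi < len(self.entries):
            for idx in (lo, hi):
                if 0 <= idx < len(self.entries):
                    key, p = self.entries[idx]
                    if abs(key - dq) < best_d:       # pruning condition
                        d = math.dist(q, p)
                        if d < best_d:
                            best, best_d = p, d
            # Stop once both frontiers lie outside the pruning radius.
            if (lo < 0 or abs(self.keys[lo] - dq) >= best_d) and \
               (hi >= len(self.entries) or abs(self.keys[hi] - dq) >= best_d):
                break
            lo -= 1
            hi += 1
        return best
```

The same early-stop condition is what lets a partition-level version skip whole partitions whose distance interval cannot contain a closer answer.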
SQL/MDR: Query Language for Consistent Access of Metadata Registries Based on SQL3

This paper proposes a query language for consistent access to metadata registries. Many metadata registries have been built in various fields, but because there is no standard method for accessing them, their management systems have been developed with varied and incompatible access methods. In this paper, we propose a metadata registry query language that allows all metadata registries to be accessed consistently in a standardized manner. The query language, SQL/MDR, is an extension of SQL3, which is familiar to existing database administrators. Consequently, the query language reduces the development cost of a metadata registry management system and enables all metadata registries to be accessed in a consistent manner.

Dongwon Jeong, Young-Gab Kim, Doo-Kwon Baik
Solving Queries over Semantically Integrated Biological Data Sources

Mediators used to be developed as monolithic systems that envelop data source information such as semantics and location. This involves a high degree of coupling among the mediator's components, which prevents sharing services with other organizations and the dynamic integration of new data sources. We propose an architecture for conceptual mediation in which the sources' query capabilities are published as web data services. This architecture also provides a way to achieve interoperability between applications in the Semantic Web context.

José Francisco Aldana-Montes, Ismael Navas-Delgado, María del Mar Roldán-García
Fast Approximate Search in Text Databases

We present an index structure that, for a given keyword T and a tolerance value k, can find all documents in the database that contain the keyword T or other words that match T approximately (We say that two words match each other approximately if the edit distance between them does not exceed the tolerance value k). Let M denote the sum of the lengths of all words in the vocabulary and z the number of words in the vocabulary. The index takes O(M log z) time to construct and occupies O(M) storage. The locations of all the documents in the database that contain the keyword T or words that match T approximately, can be found in O(n(log z + on)) expected time where n is the length of the keyword T, z the number of words in the vocabulary, and o the number of words in the vocabulary that match T with an edit distance of at most k.

Fei Shi
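The approximate-match predicate used in the abstract (edit distance at most k) is standard; a linear-scan baseline (ours, without the paper's O(M log z) index) looks like this:

```python
def edit_distance(s, t):
    # Standard Levenshtein dynamic program, O(len(s) * len(t)) time,
    # O(len(t)) space: keep only the previous row of the DP table.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def approx_match(vocabulary, keyword, k):
    # All vocabulary words within edit distance k of the keyword.
    return [w for w in vocabulary if edit_distance(w, keyword) <= k]
```

The index in the paper avoids this full scan of the vocabulary, but the output for a given keyword and tolerance k is the same set of matching words.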

Web Search Engine

Apply Feedback Control Theory to Design Soft Real-Time Search Engine System

This paper proposes a design method for a soft real-time search engine system and provides proofs of its correctness and robustness, both in control theory and through practical experiments. An analyzable mathematical model is set up to approximately describe the nonlinear and time-varying search engine system. Feedback control theory is adopted to prove the system's stability, zero steady-state error and zero overshoot. The soft real-time guarantee is satisfied while the feedback system is in a stable state. The experimental results further demonstrate the effectiveness of our scheme.

Huayong Wang, Yiqi Dai
User Interest Detection on Web Pages for Building Personalized Information Agent

In this paper we propose a vision-based approach to detecting users' interests on web pages. The approach first splits a web page into several topic blocks according to visual cues and then analyzes the layout structure of the page based on these blocks. The user's interests are detected from his browsing behaviors on the blocks and the layout structure of the web page. A personalized information agent is also developed to automatically extract new content from those web pages according to the user's interests. Our experiments show that the proposed approach detects users' interests on web pages with satisfactory accuracy, and that the personalized information agent significantly reduces users' browsing time and improves their browsing experience.

Yin Liu, Wenyin Liu, Changjun Jiang
Improved Link-Based Algorithms for Ranking Web Pages

Several link-based algorithms, such as PageRank [7], HITS [4] and SALSA [5], have been developed to evaluate the popularity of web pages. These algorithms can be interpreted as computing the steady-state distribution of various Markov processes over web pages. The PageRank and HITS algorithms tend to over-rank tightly interlinked collections of pages, such as well-organized message boards. We show that this effect can be alleviated using a number of modifications to the underlying Markov process. Specifically, rather than weighting all outlinks from a given page equally, greater weight is given to links between pages that are, in other respects, far apart in the web, and less weight is given to links between pages that are nearby. We have experimented with a number of variants of this idea, using a number of different measures of "distance" in the Web and a number of different weighting schemes. We show that these revised algorithms often avoid the over-ranking problem and give better overall rankings.

Ziyang Wang
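The underlying computation that all of these variants modify is PageRank's power iteration; a weighted version (a sketch of ours, with hypothetical edge weights standing in for the paper's distance-based weighting) looks like this:

```python
def pagerank(links, weights=None, damping=0.85, iters=50):
    """Power iteration over a weighted link graph.
    links: {page: [outlinked pages]}; weights: {(src, dst): weight}.
    Down-weighting links between 'nearby' pages, as the paper suggests,
    amounts to choosing small weights for intra-community edges.
    Dangling pages (no outlinks) simply leak rank mass in this sketch."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - damping) / n for p in pages}
        for src, outs in links.items():
            total = sum(weights.get((src, d), 1.0) if weights else 1.0
                        for d in outs) or 1.0
            for dst in outs:
                w = weights.get((src, dst), 1.0) if weights else 1.0
                nxt[dst] += damping * rank[src] * w / total
        rank = nxt
    return rank
```

With uniform weights this is ordinary PageRank; replacing the weight lookup with a function of link "distance" yields the modified Markov processes the paper studies.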
Effectiveness Evaluation and Comparison of Web Search Engines and Meta-search Engines

In this paper we present experiments that evaluate the effectiveness of a group of Web search engines and meta-search engines. Experimental results show that, on average, the performance of the selected meta-search engines and search engines is very close; therefore, the claim that meta-search engines are more effective than their counterparts, Web search engines, cannot be supported. Another observation is that, very often, search engines and meta-search engines perform much better with short queries than with long queries, which is surprisingly the opposite of conventional information retrieval systems. This suggests that Web search services probably make special efforts to handle short queries with very few keywords, while the general search techniques they use for long queries are unsophisticated and further improvement is needed.

Shengli Wu, Jieyu Li
Semantic Search in the WWW Supported by a Cognitive Model

Most users of the WWW want their searches to be effective. Currently, a wide variety of efficient syntactic tools can be used for search on the WWW. With the continuous increase in the amount of information, effective search will not be possible in the future with syntactic tools alone. On the other hand, people have remarkable abilities when retrieving and acquiring information. For example, a librarian is capable of knowing, with great precision, what a client seeks by asking a small set of questions. Motivated by the efficiency of that process, we have created a prototype web search system based on ontologies that uses a cognitive model of the process of human information acquisition. Using this model, the prototype "understands" better what the user is looking for, and its output better meets the expectations of users compared to tools based only on syntax.

Katia Wechsler, Jorge Baier, Miguel Nussbaum, Ricardo Baeza-Yates
Semantic Board: A Bulletin Board System Supporting Dynamic Semantic Information

Although bulletin board systems have long been important virtual places for exchanging information within interest groups, their users have to browse each article to find target information, because current board systems do not provide conceptual data access. To solve this issue, we propose a semantic-based bulletin board system, called Semantic Board, that supports tagging and searching articles using semantic information. Without modification, it can be adopted in various application fields using different semantics through a proposed ontology named SBoard. The SBoard ontology is designed to describe RDF predicates representing semantic information in the Semantic Board; it specifies, for the target predicates, form generation in the frontend and RDF data access in the backend. Using this, the Semantic Board is able to dynamically generate HTML forms to ease the difficulty of writing semantic tags manually and to store users' input as a corresponding RDF document. SBoard is also used to generate RDQL statements for searching articles based on semantic information.

Eui-Hyun Jung
Efficient Evaluation of Sparse Data Cubes

Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input datasets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of queries that need to be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, interactive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suited for computing aggregates in cubes whose data sets are sparse. SST stores only the aggregations of non-empty cube cells instead of the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold, while sparse cubes are stored on disk. This allows fast, accurate approximations for queries; if users desire more refined answers, the related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and analysis of streaming data. Experimental results demonstrate the excellent performance and good scalability of our approach.

Lixin Fu
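The SST/CUPS structures above are tree-based; as a much simpler illustration of the iceberg idea they exploit, the sketch below aggregates a measure over every dimension combination and separates cells whose aggregate meets a threshold (kept "in memory") from sparse cells (relegated to "disk"). All names are illustrative, not the paper's:

```python
from itertools import combinations

def cube_aggregates(rows, dims, measure, threshold):
    """Aggregate `measure` over every combination of `dims` (the full cube),
    splitting groups into dense (aggregate >= threshold) and sparse."""
    dense, sparse = {}, {}
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            groups = {}
            for row in rows:
                key = (combo, tuple(row[d] for d in combo))
                groups[key] = groups.get(key, 0) + row[measure]
            for key, total in groups.items():
                (dense if total >= threshold else sparse)[key] = total
    return dense, sparse

rows = [{"city": "A", "prod": "x", "sales": 5},
        {"city": "A", "prod": "y", "sales": 2},
        {"city": "B", "prod": "x", "sales": 1}]
dense, sparse = cube_aggregates(rows, ["city", "prod"], "sales", threshold=3)
```

A real implementation shares prefixes in a tree rather than recomputing every group-by, which is the point of the SST structure.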

XML

Redundancy Free Mappings from Relations to XML

Given the fact that relational and object-relational databases are the most widely used technology for storing data and that XML is the standard format used in electronic data interchange, the process of converting relational data to XML documents is one that occurs frequently. The problem that we address in this paper is an important one related to this process. If we convert a relation to an XML document, under what circumstances is the XML document redundancy free? In some allied work we formally defined functional dependencies in XML (XFDs) and, based on this definition, formally defined redundancy in an XML document. We then introduced a normal form for an XML document (XNF) and showed that it is a necessary and sufficient condition for the elimination of redundancy. In this paper we address the problem of determining what class of mappings map a relation in BCNF to an XML document in XNF. The class of mappings we consider is very general and allows arbitrary nesting of the original flat relation. Our main result establishes a necessary and sufficient condition on the DTD induced by the mapping for it to be in XNF.

Millist W. Vincent, Jixue Liu, Chengfei Liu
Tree Multivalued Dependencies for XML Datasets

This paper introduces tree multivalued dependencies (TMVDs) for XML documents by extending multivalued dependencies in relational databases and recent research on tree functional dependencies (TFDs). We first show the existence of TMVDs in XML documents, which produce redundancies and update anomalies. Second, we propose a precise definition of TMVDs based on the tree structure and deep equality. Moreover, we introduce the concept of recomposition, a reconstruction of an XML database by moving or adding nodes. The right side of a TMVD may consist of a single attribute or multiple attributes, for which there are different recompositions. In addition, the relationship between TFDs and TMVDs is investigated, and we demonstrate that a TMVD is a generalized TFD. Furthermore, we present a theorem showing that recomposition by TMVDs saves space. Finally, we exhibit a recomposition that is not equivalent to any relational data.

Lawrence V. Saxton, Xiqun Tang
TROS: A Novel Model and Its Implementation for XML Data Management

In this paper we propose a novel model for XML data management, called TROS (Text Relational Object Semantic), to overcome the shortcomings of directly storing XML data in relational databases. TROS consists of three layers: the top is the object semantic layer, the middle the relational data management layer, and the bottom the text file management layer. We also consider the implementation issues of this model. By analyzing the semantics of queries posed by users, we propose a novel object-relational index structure, which can efficiently improve query performance by avoiding join operations between relations. In the model, relational query results support the corresponding queries over document content.

Wenwen Qi, Aoying Zhou
Optimized Query Translation Strategy for XML Stored in Relational Database

XML is an important standard for data exchange and representation, and XML databases are also becoming important as web databases. Using a mature relational database system to support XML data can bring advantages. In this paper, based on a relational storage structure that supports XPath queries effectively, an optimized query translation strategy is presented, along with optimization techniques. The method converts an XQuery to a variable forest (VF), on which optimization operations are applied. The optimization is based on the semantics of the XML document, such as the schema of the original XML document. The VF can be easily converted to SQL, so the method translates an XQuery into a set of SQL statements and generates the final result from their results. Our method adapts to XML databases with relational storage, supporting effective query processing for XML documents of various schemas and sizes.

Hongzhi Wang, Jianzhong Li, Zhenying He
QReduction: Synopsizing XPath Query Set Efficiently under Resource Constraint

Evaluating a massive XPath set over XML streams poses great challenges to database researchers. Current work chiefly focuses on efficiently evaluating a massive XPath set to obtain precise results; the size of the input query set has a great impact on the resource requirements and the efficiency of evaluation. In this paper, we propose a novel method, QReduction, to obtain a synopsized XPath query set that represents the original query set while minimizing the 'precision loss' caused by the synopsis. QReduction first discovers frequent patterns among the massive input XPath tree patterns, and then selects a query set synopsis from them based on a dynamic benefit model under resource constraints. Since frequent pattern discovery has high complexity, we propose optimizations that push the constraints of QReduction into the discovery process. We propose three criteria, namely recall, precision and intersection, to determine a better synopsis. The experimental results demonstrate that our method can produce a query set synopsis with high precision, recall and intersection under given resource constraints.

Jun Gao, Xiuli Ma, Dongqing Yang, Tengjiao Wang, Shiwei Tang
Extracting Key Value and Checking Structural Constraints for Validating XML Key Constraints

We propose an approach that can effectively validate key constraints over XML documents by extracting key values and checking structural constraints. First, we propose an XPath-based algorithm that extracts key values from the XML document and generates the corresponding key value document. Then we present how a key value document and its DTD can be designed to check whether predefined key constraints are satisfied. Lastly, we draw an interesting conclusion: as long as XML keys can be expressed in XPath, validation can be accomplished through XPath and structural constraint checking.

Yunfeng Liu, Dongqing Yang, Shiwei Tang, Tengjiao Wang, Jun Gao
Estimating the Selectivity of XML Path Expression with Predicates by Histograms

Selectivity estimation of path expressions in querying XML data plays an important role in query optimization. A path expression may contain multiple branches with predicates, each of which has an impact on the selectivity of the entire query. In this paper, we propose a novel method based on 2-dimensional value histograms to estimate the selectivity of path expressions embedded with predicates. The value histograms capture the correlation between the structures and the values in the XML data. We define a set of operations on the value histograms, as well as on the traditional histograms that capture nodes' positional distribution, and then construct a cost tree based on these operations. The selectivity of any node (or branch) in a path expression can be estimated by executing the cost tree. Compared with previous methods, which ignore value distribution, our method offers much better estimation accuracy.

Yu Wang, Haixun Wang, Xiaofeng Meng, Shan Wang
Wrapping Web Pages into XML Documents

XML is gaining popularity as an industrial standard for presenting and exchanging structured information on the web. Meanwhile, the majority of documents online are still marked up in HTML, which is designed specifically for display rather than for automatic access by applications. In order to make web information accessible to applications, so as to afford automation, interoperation and intelligent services, information extraction programs called "wrappers" have been developed to extract structured data from HTML pages. In this paper, we present a layout-based approach that separates the data layer from its presentation in HTML and extracts the pure data, together with its hierarchical structure, into XML. This approach aims to offer a general-purpose methodology that can automatically convert HTML to XML without any tuning for a particular domain.

Tao Fu

Web Services

Rapid Prototyping for Web Applications

This paper proposes a web-based toolkit for web page developers to produce functional versions of database-driven web systems. Many researchers recognize the significant difference between the development process of web systems and that of conventional software systems. A key difference in web development is that clients understand their requirements poorly at the beginning of the project. The objective of this toolkit is to achieve rapid software prototyping in building web applications, which shortens the timeframe of the actual development process and lets developers produce a functional working system with a minimum amount of effort. The functional system not only helps clients better understand the project requirements, but also helps articulate other possible solutions to their problems. This paper outlines the development of today's web applications and how the toolkit is designed to improve the development of web-based information systems.

Chee Chern Lim, Man Hing Yu, Jesse J. Jin
Capacity Planning for Composite Web Services Using Queueing Network-Based Models

In this paper, queueing network-based models are proposed to predict the performance of composite Web services and make them run optimally with limited resources. The models are necessary constituents of reliable Web service composition. They are flexible enough to evaluate the performance indices of composite Web services simply by solving some equations rather than running a lot of exhaustive experiments. Research shows that our models provide an effective way both to run the services optimally with given resources and to estimate growth needs accurately.

Dunlu Peng, Yang Yuan, Kun Yue, Xiaoling Wang, Aoying Zhou
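The abstract does not give the paper's equations; as a flavor of the kind of queueing formula involved in such capacity planning, the sketch below evaluates a single M/M/1 service node, a standard textbook result rather than the paper's network model:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Classic M/M/1 formulas: utilization rho = lambda/mu, mean number of
    jobs L = rho/(1-rho), mean response time R = 1/(mu - lambda).
    Only valid for a stable system, i.e. lambda < mu."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return {"utilization": rho,
            "mean_jobs": rho / (1.0 - rho),
            "mean_response_time": 1.0 / (service_rate - arrival_rate)}

# A service that completes 4 requests/s on average, receiving 2 requests/s:
m = mm1_metrics(2.0, 4.0)
```

Solving such equations for each node of a queueing network is what replaces running exhaustive load experiments.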
A Web-Based Coordination Infrastructure for Grid Collective Services

Virtual Organizations (VO) consisting of heterogeneous institutions and individuals that share resources dynamically and in a coordinated way to support collaborative problem-solving are emerging in many fields. Consequently, new types of distributed infrastructures known as Grids have been proposed to cope with these new sharing requirements. Grids offer sets of collective services to support collaborative tasks, which are distributed in nature and need asynchronous communication. Existing approaches to this problem lack flexibility and adaptability, and are tightly bound to the collective service that is provided. We present here a generic event model for building collective services that requires no programming, is extensible, and can accommodate different event vocabularies. An implementation using standard Web technologies such as Java Servlets, XML, XSLT, and XSQL is also described.

Javier Jaén, Jose H. Canós, Elena Navarro
Symbolic Agent Negotiation for Semantic Web Service Exploitation

This paper presents an architecture and a methodology for agent-based Web service discovery and composition. We assume that Web services are described with declarative specifications like DAML-S. Based on this declarative information about services, symbolic reasoning can be applied while searching for or automatically composing new services. We propose that symbolic agent negotiation be used for dynamic Web service discovery and composition. Symbolic negotiation, as we consider it here, is a mixture of distributed planning and information exchange; therefore, by using symbolic negotiation for automated service composition, we support information collection and integration during composition. The latter aspect has been largely neglected in automated service composition until now.

Peep Küngas, Jinghai Rao, Mihhail Matskin
Object-Oriented Realization of Workflow Views for Web Services – An Object Deputy Model Based Approach

Adapted from the concept of views in databases, workflow views are derived from workflows as a fundamental support for workflow interoperability and visibility by external parties in a web services environment. However, because of the difficulty of implementing views in OODB systems, it is tough to realize workflow views in object-oriented workflow management systems (OO WFMS). In this paper, we adopt the object deputy model to support the realization of workflow views in OO WFMS. Detailed definitions are given based on a reference model of an OO WFMS, namely ADOME-WFMS. Furthermore, we introduce E-ADOME, which enables the implementation of workflow views in the web services environment.

Zhe Shan, Zhiyi Long, Yi Luo, Zhiyong Peng
Modeling QoS for Semantic Equivalent Web Services

Quality of Service (QoS) is an important factor for selecting a better Web service from among numerous semantically equivalent Web services in Web service composition. In this paper, a QoS model (e_QoS) for evaluating semantically equivalent Web services is proposed, in which the factors used to evaluate Web services include accessed times, invocation success rate, average response time, stability variance, and customer estimation. The performance of Web services in the latest period of time, their performance in history, and the performance estimated by customers are considered together. Through experiments, we conclude that e_QoS is effective for estimating Web services and can greatly improve the quality of Web service composition.

Derong Shen, Ge Yu, Tiezheng Nie, Rui Li, Xiaochun Yang

Classification

Improved Email Classification through Enriched Feature Space

This paper presents a novel feature space enriching (FSE) technique to address the problem of sparse and noisy feature spaces in email classification. The FSE technique employs two semantic knowledge bases to enrich the original sparse feature space, resulting in semantically richer features from which classification algorithms can learn improved classifiers. Naive Bayes and support vector machines are selected as the classification algorithms. Experiments on an enterprise email dataset have shown that the FSE technique is effective in improving email classification performance.

Yunming Ye, Fanyuan Ma, Hongqiang Rong, Joshua Zhexue Huang
PLD: A Distillation Algorithm for Misclassified Documents

We observed that in interactive text classification, users tend to point out only the misclassified documents, not the correct ones. It is unlikely that a user would be diligent enough to identify all the misclassified documents, so a classifier is expected to deal with misclassified documents of which only a small proportion has been identified. We propose the Prediction-Learning-Distillation (PLD) framework for distilling the misclassified documents. Whenever a user points out an error, PLD learns from the mistake and identifies the same mistake in all other classified documents; it then enforces this learning for future classifications. Our experimental results demonstrate that the proposed algorithm can learn from user-identified misclassified documents and then successfully distill the rest.

Ding-Yi Chen, Xue Li
Sequential Classifiers Combination for Text Categorization: An Experimental Study

In this paper, we introduce Sequential Classifiers Combination (SCC) into text categorization to improve both the classification effectiveness and the classification efficiency of the combined individual classifiers. We apply two classifiers sequentially in our experimental study: the first (the filtering classifier) generates candidate categories for the test document, and the second (the deciding classifier) selects a final category for the test document from the candidates. Experimental results indicate that when combining boosting and kNN, the combined classifier outperforms the better of the two individual classifiers, and when combining Rocchio and kNN, the combined classifier performs as well as kNN while its efficiency is much better than kNN's and close to Rocchio's.

Zheng Zhang, Shuigeng Zhou, Aoying Zhou
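The two-stage idea in the abstract above can be sketched generically: a cheap filtering classifier narrows the label set, and a deciding classifier scores only the surviving candidates. The stand-in callables here are placeholders, not the paper's boosting, Rocchio, or kNN implementations:

```python
def sequential_classify(doc, filtering, deciding, top_k=3):
    """Stage 1: the filtering classifier scores all categories and keeps the
    top_k candidates. Stage 2: the deciding classifier scores only those
    candidates; the best-scoring candidate becomes the final category."""
    scores = filtering(doc)                           # {category: score}
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]
    final = deciding(doc, candidates)                 # {candidate: score}
    return max(candidates, key=lambda c: final[c])

# Stand-in classifiers for illustration only:
filtering = lambda doc: {"sports": 0.9, "tech": 0.6, "politics": 0.1}
deciding = lambda doc, cands: {c: (1.0 if c == "tech" else 0.2) for c in cands}
label = sequential_classify("some document text", filtering, deciding, top_k=2)
```

The efficiency gain comes from the deciding classifier (typically the expensive one, e.g. kNN) scoring only a handful of candidate categories instead of all of them.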
Combining Active Learning and Boosting for Naïve Bayes Text Classifiers

This paper presents a variant of the AdaBoost algorithm for boosting Naïve Bayes text classifiers, called AdaBUS, which combines active learning with the boosting algorithm. Boosting has been shown to effectively improve the accuracy of machine-learning based classifiers. However, the Naïve Bayes classifier, which is remarkably successful in practice for text classification problems, is known not to work well with boosting because its base classifiers are too stable. The proposed algorithm boosts Naïve Bayes classifiers by performing active learning at each iteration of the boosting process. The basic idea is to induce perturbation of the base classifiers by augmenting the training set with the most informative unlabeled documents.

Han-joon Kim, Je-uk Kim

Data Mining

Using Interval Association Rules to Identify Dubious Data Values

A hard-to-catch erroneous data value is one that looks perfectly legitimate on its own, yet appears questionable when examined in conjunction with other attribute values. Detecting such dubious values is a major problem in data cleaning. This paper presents a framework to automatically detect dubious values in datasets. Data is first pre-processed by smoothing and mapping. Next, interval association rules are generated; this involves partitioning the data via clustering before generating the rules with an Apriori algorithm. Finally, these rules are used to identify data values that fall outside the expected intervals. Experimental results show that the proposed framework is able to detect dubious values in large datasets accurately and efficiently.

Ren Lu, Mong Li Lee, Wynne Hsu
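The final step of the framework above, flagging values that fall outside expected intervals, can be sketched as below. The rule format and all names are illustrative assumptions; the paper derives the intervals via clustering and an Apriori-style algorithm, which this sketch takes as given:

```python
def find_dubious(records, rules):
    """Flag records whose target value falls outside the interval an
    applicable rule expects. Each rule reads: 'when attr == val, target
    is expected to lie in [low, high]'."""
    dubious = []
    for rec in records:
        for attr, val, target, (low, high) in rules:
            if rec.get(attr) == val and not (low <= rec[target] <= high):
                dubious.append((rec, target))
    return dubious

records = [{"job": "intern", "salary": 20},
           {"job": "intern", "salary": 200}]  # 200 looks legitimate alone
rules = [("job", "intern", "salary", (10, 50))]
flagged = find_dubious(records, rules)
```

Note that a salary of 200 is only dubious in conjunction with the job attribute, which is exactly the kind of error the abstract calls hard to catch.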
Mining Web Sequential Patterns Incrementally with Revised PLWAP Tree

Since point-and-click actions on web pages generate a continuous data stream flowing into web log data, old patterns may become stale and need to be updated. Algorithms for mining web sequential patterns from scratch include WAP, PLWAP and the apriori-based GSP. What is needed is an incremental technique, based on an efficient sequential mining technique like PLWAP, for updating already mined patterns when the database changes. This paper proposes an algorithm, Re-PL4UP, which uses the PLWAP tree structure to incrementally update web sequential patterns. Re-PL4UP scans only the new changes to the database, revising the old PLWAP tree to accommodate previously small items that have become large and previously large items that have become small in the updated database, without needing to scan the old database. The approach leads to improved performance.

Christie I. Ezeife, Min Chen
Mining Frequent Items in Spatio-temporal Databases

It is important to retrieve aggregate information in spatio-temporal applications. Recently, some applications, such as decision support systems, also require mining frequent items from a dataset within a query region during a query interval. Because of unbounded space requirements and slow response times, executing such queries over operational databases is impractical. In this paper, we define the problem formally and give a novel solution that overcomes these two disadvantages. Recently, several algorithms have been proposed to mine frequent items from a summarization (sketch) of a massive dataset; in our solution, one of the latest sketches is integrated with a spatio-temporal index to provide good performance.

Cheqing Jin, Fang Xiong, Joshua Zhexue Huang, Jeffrey Xu Yu, Aoying Zhou
Discovery of Frequent XML Query Patterns with DTD Cardinality Constraints

Common query patterns of multiple XML queries can be stored and shared to accelerate query execution. We present a new technique for efficiently mining frequent XML query patterns with DTD cardinality constraints. We first introduce a new definition of extended subtree for XML query patterns (EST), which provides a model supporting the special characteristics of "common" XML query patterns. Then we "push" the DTD cardinality constraints deep into the mining process to prune the search space while still ensuring the completeness of the answers. Finally, we propose an algorithm, FEST, for effectively mining frequent ESTs to improve query performance in XML data management systems.

Yunfeng Liu, Dongqing Yang, Shiwei Tang, Tengjiao Wang, Jun Gao
Mining Inheritance Rules from Genealogical Data

Data mining extracts implicit, previously unknown and potentially useful information from databases. Many approaches have been proposed to extract such information, and one of the most important is finding association rules. Although a large amount of research has been devoted to this subject, to the best of our knowledge no previous research finds association rules in genealogical data. In this paper, we use a DAG (directed acyclic graph) to represent the genealogical data of families, where a node can be viewed as a member of the family tree, the features associated with a node as the characteristics of the corresponding person, and the arcs as the parental relationships between members. On the DAG, two kinds of inheritance rules are defined, which indicate how the characteristics of ancestors are passed down to descendants, and a four-stage algorithm is proposed to discover the inheritance rules.

Yen-Liang Chen, Jing-Tin Lu
Mining DAG Patterns from DAG Databases

Data mining extracts implicit, previously unknown and potentially useful information from databases. Many approaches have been proposed to extract such information, and one of the most important is finding frequent patterns in databases. Although much work has been done on this problem, to the best of our knowledge no previous research studies how to find frequent DAG (directed acyclic graph) patterns in DAG data. Without such a mining method, knowledge cannot be discovered from databases storing DAG data such as family genealogy profiles, product structures, XML documents and course structures. Therefore, a four-stage solution method is proposed in this paper to discover frequent DAG patterns from DAG databases.

Yen-Liang Chen, Hung-Pin Kao, Ming-Tat Ko
Mining Class Outliers: Concepts, Algorithms and Applications

Detection of outliers is important in many applications and has recently attracted much attention in the data mining research community. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem, i.e., "given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels." By generalizing two pioneering contributions in this field, we propose the notion of class outliers and practical solutions that extend existing outlier detection algorithms to detect them. Furthermore, potential applications in CRM (customer relationship management) are discussed. Experiments on real datasets have shown that our method can find interesting outliers and can be used in practice.

Zengyou He, Joshua Zhexue Huang, Xiaofei Xu, Shengchun Deng
CD-Trees: An Efficient Index Structure for Outlier Detection

Outlier detection finds objects that do not comply with the general behavior of the data. Partitioning divides the data space into a set of non-overlapping rectangular cells. Real-life datasets exhibit very large data skew, so partitioning produces many empty cells. Cell-based algorithms for outlier detection do not pay enough attention to the existence of these many empty cells, which affects their efficiency. In this paper, we propose the concept of Skew of Data (SOD) to measure the degree of data skew, which approximates the percentage of empty cells under a partition of a dataset. An efficient index structure called the CD-Tree and related algorithms are designed, and the CD-Tree is applied to detect outliers. Compared with cell-based algorithms on real-life datasets, the CD-Tree-based algorithm is at least 4 times faster, and the number of dimensions it can process also increases markedly.

Huanliang Sun, Yubin Bao, Faxin Zhao, Ge Yu, Daling Wang
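The SOD measure described above, the fraction of empty cells under a grid partition, can be sketched directly. This is a simplified reading of the abstract's definition (equal-width cells, given bounds); the paper's exact formulation may differ:

```python
def skew_of_data(points, bins, bounds):
    """Approximate SOD as the fraction of empty cells when each dimension
    is split into `bins` equal-width intervals over `bounds`."""
    occupied = set()
    for p in points:
        cell = tuple(min(int((x - lo) / ((hi - lo) / bins)), bins - 1)
                     for x, (lo, hi) in zip(p, bounds))
        occupied.add(cell)
    return 1.0 - len(occupied) / bins ** len(bounds)

# Three points that fall into only two of the four cells of a 2x2 grid:
sod = skew_of_data([(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)],
                   bins=2, bounds=[(0.0, 1.0), (0.0, 1.0)])  # -> 0.5
```

A high SOD means most cells are empty, which is when enumerating cells (as cell-based outlier algorithms do) wastes work and an index over occupied cells pays off.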

Industrial/Short Papers

Mobile Data Management with Sound

This study introduces a sound-based mobile payment system. Paying by mobile phone is essential for the success of mobile commerce. The proposed system uses sound generated by an existing mobile phone; the sound can be recognized by the shop's existing credit card reader after installing an ordinary microphone and simple software to process the sound input. This system is preferable because it does not require customers to buy a new mobile phone, and the cost of the microphone and software is very low.

Ook Lee
A Mixture Model of Classical Fuzzy Classifiers

The classical fuzzy classifier consists of rules, each describing one of the classes. In this paper a new fuzzy classifier is proposed in which each rule can represent more than one class with different probabilities. A learning algorithm based on gradient descent is proposed to identify the parameters of the fuzzy sets in the antecedent of each fuzzy rule and the class conditional probabilities associated with each rule. The new approach is applied to the well-known Wisconsin breast cancer classification problem, and a compact, interpretable and accurate fuzzy classifier is achieved.

Yongchuan Tang, Shouqian Sun
An Empirical Study of Building Compact Ensembles

Ensemble methods can achieve excellent performance, relying on member classifiers' accuracy and diversity. We conduct an empirical study of the relationship of ensemble size with ensemble accuracy and diversity. Experiments with benchmark data sets show that it is feasible to keep a small ensemble while maintaining accuracy and diversity similar to those of a full ensemble. We propose a heuristic method that can effectively select member classifiers to form a compact ensemble. Compact ensembles are motivated by their use for effective active learning in tasks of classifying unlabeled data.

Huan Liu, Amit Mandvikar, Jigar Mody
Merging Individually Learned Optimal Results to Accelerate Coordination

By merging agents' individually learned optimal value functions, agents can learn their optimal policies in a multiagent system. Prior knowledge of the task is used to decompose it into several subtasks, and this decomposition greatly reduces the state and action spaces. The optimal value functions of each subtask are learned by the MAXQ-Q [1] algorithm. By defining lower and upper bounds on the value functions of the whole task, we propose a novel online multiagent learning algorithm, LU-Q, which accelerates the learning of coordination between multiple agents through task decomposition and action pruning.

Huaxiang Zhang, Shangteng Huang
The Compression of Massive Offline Relations

Database compression techniques play an important role in the management of massive data. Based on an important feature of offline relations and the features of operations on such data, we propose a compression method for massive offline relations that improves the processing performance of these relations. Experiments show that our method is efficient.

Jizhou Luo, Jianzhong Li, Hongzhi Wang, Yanqiu Zhang, Kai Zhao
Finding Plagiarism Based on Common Semantic Sequence Model

Finding document features is one of the key problems in text mining. The string matching model and the global word frequency model are two common models, but the former can hardly resist rewording noise, whereas the latter cannot capture document details. We present the Common Semantic Sequence Model (CSSM) and apply it to document copy detection. CSSM combines the ideas of the two models above, making a trade-off between a document's global and local features. CSSM computes a plagiarism score as the proportion of common words between two documents' semantic sequences, where a semantic sequence is a continuous word sequence remaining after low-density words are omitted. With the collections of two documents' semantic sequences, we can detect plagiarism at a fine granularity. We test CSSM on several common copy types. The results show that CSSM is excellent at detecting non-reworded plagiarism and remains valid even when documents are reworded to some extent.

Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, Xiao-Di Zhang
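The scoring idea above, the proportion of common words between two documents' semantic sequences, can be sketched as a best-match overlap averaged over one document's sequences. The abstract does not give the exact CSSM formula, so both the Jaccard-style overlap and the averaging are illustrative assumptions:

```python
def plagiarism_score(seqs_a, seqs_b):
    """For each semantic sequence of document A, find its best word-overlap
    (Jaccard) with any sequence of document B, then average these best scores.
    Inputs are lists of word lists (the documents' semantic sequences)."""
    if not seqs_a or not seqs_b:
        return 0.0
    best = []
    for sa in seqs_a:
        a = set(sa)
        best.append(max(len(a & set(sb)) / len(a | set(sb)) for sb in seqs_b))
    return sum(best) / len(best)

identical = plagiarism_score([["copy", "this", "text"]],
                             [["copy", "this", "text"]])  # -> 1.0
disjoint = plagiarism_score([["alpha"]], [["beta"]])      # -> 0.0
```

Extracting the semantic sequences themselves (dropping low-density words) is a separate step that this sketch assumes has already been done.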
Web Information Extraction Based on Similar Patterns

Information extraction is an important research topic in data mining. In this paper we introduce a web information extraction approach based on similar patterns, in which the construction of the pattern library is a knowledge acquisition bottleneck. We use a method based on similarity computation to automatically acquire patterns from a large-scale corpus: given seed patterns, relevant patterns can be learned from unlabeled training web pages, and the generated patterns can be put to use after minor manual correction. Compared to other algorithms, our approach requires much less human intervention and avoids the necessity of hand-tagging a training corpus. Experimental results show that the acquired patterns achieve an IE precision of 79.45% and recall of 66.51% in open tests.

Na Ye, Xuejun Wu, Jingbo Zhu, Wenliang Chen, Tianshun Yao
TS-Cache: A Novel Caching Strategy to Manage Data in MANET Database

With the development of computer networks and wireless communication technologies, increasing interest has focused on mobile ad hoc networks (MANETs), which are constructed solely of wireless hosts. Previous data-management research on traditional mobile databases emphasized only the mobility and power limitations of mobile clients; in a MANET database system, research must consider issues on both clients and servers. Wireless sensor networks serve as the data source in many MANET-based applications, where mobile clients query the data generated by the sensor networks through mobile servers. To date, few papers have discussed data-management methods for such a complicated system comprising mobile clients, mobile servers, and wireless sensor networks. In this paper, a novel system architecture is proposed, along with the TS-Cache strategy, which is based on time-series analysis and can significantly reduce the communication workload and improve system performance.

Shengfei Shi, Jianzhong Li, Chaokun Wang, Jinbao Li
Extended Chain-Conflicting Serializability for the Correct Schedule of Transactions in Hierarchical Multidatabase

This paper studies the correctness criterion for the scheduling of transactions in a hierarchical multidatabase system (MDBS). We first present the definition and architecture of a hierarchical MDBS, along with the structure of the transactions executed in it. We then propose a correctness criterion for the execution of transactions in a hierarchical MDBS and give an example of its application. Finally, we analyze the criterion and discuss its future development, showing its potential applications.

Guoning Chen, Taoshen Li
An Efficient Hierarchical Failure Recovery Algorithm Ensuring Semantic Atomicity for Workflow Applications

Failure recovery is essential for transactional workflow management systems. When a failure occurs, compensation is usually used to roll back the affected processes and ensure semantic atomicity. Many mechanisms have been proposed previously, but performance is rarely considered; this is a serious limitation, since compensating activities are rather expensive. In this paper, we propose an efficient hierarchical failure recovery algorithm, HFR. Instead of treating the compensation sphere as a flat flow graph, it generates the compensation graph hierarchically, from lower layers to higher ones, according to the nested structure of the workflow execution history. The end compensation point is determined dynamically so that the compensation sphere is confined to as low a layer as possible and the number of compensations is reduced. HFR guarantees the semantic atomicity of a workflow instance in case of failure; its correctness is proved, and theoretical performance analysis shows that it is efficient.

Yi Ren, Quanyuan Wu, Yan Jia, Jianbo Guan
A Macrocommittees Method of Combining Multistrategy Classifiers for Heterogeneous Ontology Matching

To resolve the problem of ontology heterogeneity, we apply multiple classification methods to learn the matching between ontologies. We use a general statistical classification method to discover category features in data instances, and the first-order learning algorithm FOIL to exploit the semantic relations among them. When using a multistrategy learning approach, a central problem is how to combine the multiple match results. We find that the goal and conditions of using multistrategy classifiers for ontology matching differ from those of general text classification. We therefore propose a macrocommittee combination method that applies the multistrategy approach in the matching phase rather than the classification phase. We describe a combination rule called Best Outstanding Champion, which is suitable for heterogeneous ontology mapping: given the prediction results of the individual methods, it effectively accumulates the correct matches of each classifier.

Leyun Pan, Hui Song, Fanyuan Ma
Hypertext Classification Algorithm Based on Co-weighting Multi-information

Beyond plain text, hypertext features such as hyperlinks and metadata provide rich information for classifying hypertext documents. After analyzing different ways of using this information, we present a new hypertext classification algorithm based on co-weighting multiple kinds of information: the different hypertext features are extracted and then combined through co-weighting. Experimental results on two different data sets show that the new algorithm performs better than using any single kind of hypertext information individually.
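A minimal sketch of the combination step, assuming a simple linear co-weighting of per-source classification scores; the paper's actual weighting scheme may differ, and the source names below are hypothetical:

```python
def co_weighted_class(scores, weights):
    # scores: dict mapping an information source (e.g. "text", "anchor",
    # "meta") to a per-class score dict; weights: dict source -> weight.
    # Returns the class with the highest weighted combined score.
    classes = set()
    for per_class in scores.values():
        classes |= set(per_class)
    combined = {c: sum(weights[src] * scores[src].get(c, 0.0)
                       for src in scores)
                for c in classes}
    return max(combined, key=combined.get)
```

For example, with text scores favoring "sports" and anchor-text scores favoring "tech", the weights decide which evidence dominates.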

Ya Peng, Ya-ping Lin, Zhi-ping Chen
Verifying a MAC-Based Service Bundle Authentication Mechanism for the OSGi Service Platform

In this paper, we propose a service bundle authentication mechanism that considers the characteristics of the home gateway environment. We also design a key exchange mechanism for exchanging keys in the bootstrapping step. Furthermore, we verify the safety of both the key exchange mechanism and the service bundle authentication mechanism with BAN logic. Through this verification, we show that the proposed service bundle authentication mechanism is not compromised.

Young-Gab Kim, Dongwon Jeong, Doo-Kwon Baik
Optimal Data Dispatching Methods in Near-Line Tertiary Storage System

Many applications, such as digital libraries, use near-line tertiary storage systems (TSS) to store their massive data. A TSS consists of main memory, disks, and tertiary devices; typically, highly referenced or recent data are stored on disks, while historical data are stored on tertiary storage devices. We call the determination of which data should be stored on disks and which on tertiary devices data dispatching. Traditionally, the database administrator dispatched data by hand, but the DBMS must take over this responsibility if tertiary storage devices are to be brought under its control. In this paper we prove that data dispatching is an optimization problem that can be reduced to the well-known binary knapsack problem. Experimental results show that the average response time of a TSS can be decreased by using the optimal data dispatching method.
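Since the paper reduces data dispatching to the binary knapsack problem, the standard 0/1 knapsack dynamic program sketched below illustrates that reduction. The (size, benefit) modeling of relations, where benefit might be access frequency times the tertiary-versus-disk latency gap, is our assumption:

```python
def dispatch_to_disk(items, disk_capacity):
    # 0/1 knapsack: choose which relations to keep on disk.
    # items: list of (size, benefit) pairs (integer sizes).
    # Returns (max_total_benefit, sorted indices of chosen relations).
    n = len(items)
    dp = [0] * (disk_capacity + 1)          # dp[c] = best benefit at capacity c
    keep = [[False] * (disk_capacity + 1) for _ in range(n)]
    for i, (size, benefit) in enumerate(items):
        for c in range(disk_capacity, size - 1, -1):  # descending: 0/1, not unbounded
            if dp[c - size] + benefit > dp[c]:
                dp[c] = dp[c - size] + benefit
                keep[i][c] = True
    # Walk the keep table backwards to recover the chosen set.
    chosen, c = [], disk_capacity
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= items[i][0]
    return dp[disk_capacity], sorted(chosen)
```

Everything not chosen is dispatched to tertiary storage; maximizing the retained benefit under the disk-capacity constraint is exactly the binary knapsack formulation the abstract invokes.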

Baoliang Liu, Jianzhong Li, Yanqiu Zhang
A Real-Time Information Gathering Agent Based on Ontology

This paper presents an ontology-based agent that helps users gather real-time information. We present an approach to ontology representation based on concept graphs, introduce algorithms for building and revising the knowledge base, and then discuss the application of the ontology to information source representation. Experimental results show that the agent helps users accomplish their tasks effectively.

Jianqing Li, Shengping Liu, ZuoQuan Lin, Cen Wu
Dynamical Schedule Algorithms Based on Markov Model in Tertiary Storage

Tertiary storage devices, such as tape libraries and optical disc libraries, are becoming important storage devices for massive data applications. Despite their massive storage capacity, they have very low access efficiency, so designing good scheduling algorithms is an important way to improve data access in tertiary storage. We use a stochastic Markov model to predict the expected number of accesses to the data on tertiary storage, and present two new scheduling algorithms, MarkovMR and MarkovSRF. A tape library generally contains a robot arm, several tape drives, and a magazine holding many tapes; tapes loaded in the drives are called online tapes, while tapes in the magazine are called offline tapes. Weight factors are applied to the above scheduling algorithms to favor online tapes, so that requests for online tapes are served before those tapes are ejected; the weighted algorithms are named wMarkovMR and wMarkovSRF. Compared with the round-robin policy, experimental results show that the four Markov-model-based scheduling algorithms achieve higher data access efficiency in tertiary storage, with wMarkovMR the most efficient of all. Furthermore, the optimal weight factors can be derived from experiments.
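The prediction step can be illustrated with a first-order Markov chain over tapes: the sketch below computes the expected number of visits to each tape over a horizon of future requests. This is our simplified reading of the prediction component only, not the MarkovMR/MarkovSRF schedulers themselves:

```python
def expected_accesses(transition, start, steps):
    # transition: row-stochastic matrix (list of lists), where
    # transition[i][j] is the probability the next request hits tape j
    # given the current request hit tape i.
    # start: initial probability distribution over tapes.
    # Returns expected visit counts per tape over `steps` requests.
    n = len(start)
    dist = list(start)
    visits = [0.0] * n
    for _ in range(steps):
        visits = [v + d for v, d in zip(visits, dist)]
        # Advance the distribution one step: dist <- dist * transition
        dist = [sum(dist[i] * transition[i][j] for i in range(n))
                for j in range(n)]
    return visits
```

A scheduler can then serve requests for the tapes with the highest expected access counts first, which is the intuition behind favoring hot (and online) tapes.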

Yanqiu Zhang, Jianzhong Li, Zhaogong Zhang, Baoliang Liu
Image Retrieval Using Structured Logical Shape Feature

Most visual information retrieval systems on the web treat an image as an object annotated with keywords. However, an image contains much semantic content that cannot be expressed with keywords, which is why most commercial and research systems use shape as the most cognitive information to represent an image's content and to differentiate it from others. In this paper, we propose a method for extracting a logical shape feature that reflects the structure of a shape, together with a coarse-to-fine matching method that uses it. For similarity retrieval, we generate a pattern segment matrix composed of curve types in order to find the most similar curve sequence. We use it for coarse-to-fine matching because our shape features have both a global characteristic, as a structural feature, and a local characteristic, as an adaptive feature of shape. A pattern segment matrix has the advantage of enabling search with only a small number of structural features. Our experiments show that structural-adaptive features derived through logical analysis effectively classify shapes according to their cognitive characteristics, and that our approach reduces computational complexity and retrieval cost.

Nanhyo Bang, Kyhyun Um
Automatic HTML to XML Conversion

We present a new approach to automatically converting HTML documents into XML documents. It first captures the inter-block nested structure and then the intra-block nested structure, where blocks include headings, lists, paragraphs, and tables in HTML documents, by exploiting both the formatting information and the structural information implied by HTML tags.
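To illustrate how nested structure can be recovered from flat HTML blocks, the sketch below nests heading blocks into XML sections by heading level. It handles only headings and emits a hypothetical `<section>` element of our own choosing; the paper's approach also covers lists, paragraphs, and tables:

```python
def headings_to_xml(blocks):
    # blocks: list of (level, text) pairs extracted from <h1>-<h6> tags,
    # in document order. A deeper heading opens a nested section; a
    # heading at the same or shallower level closes sections first.
    out, stack = [], []
    for level, text in blocks:
        while stack and stack[-1] >= level:
            out.append("</section>")
            stack.pop()
        out.append(f'<section title="{text}">')
        stack.append(level)
    while stack:                       # close any sections still open
        out.append("</section>")
        stack.pop()
    return "".join(out)
```

The same close-then-open discipline generalizes to the other block types once their relative nesting depths are inferred from tags and formatting.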

Shijun Li, Mengchi Liu, Tok Wang Ling, Zhiyong Peng
Quality Improvement Modeling and Practice in Baosteel Based on KIV-KOV Analysis

Conventional manufacturing process improvement is a lengthy process based on personal experience, and its cost-effectiveness remains questionable; a fast, low-cost model for manufacturing process design and improvement is genuinely needed. This article argues that the core of quality improvement is to define and optimize the KIV-KOV relationship. Using data warehouse and data mining techniques, it presents a new KIV-KOV-based quality improvement model, its functional structure and process, and practical experience with its application.

Xinyu Liu, Haidong Tang, Zhiping Fan, Bing Deng
A Frequent Pattern Discovery Method for Outlier Detection

An outlier in a dataset is an observation or point that is considerably dissimilar to or inconsistent with the remainder of the data. Outlier detection is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method that detects outliers by discovering frequent patterns (frequent itemsets) in the data set: outliers are defined as the transactions that contain few frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to identify outlier transactions and propose the FindFPOF algorithm to discover them. Experimental results show that our approach outperforms existing methods at identifying interesting outliers.
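The idea can be sketched as follows: mine frequent itemsets, then score each transaction by the supports of the frequent patterns it contains, with low scores flagging outliers. The naive exhaustive miner and the exact averaging formula below are our assumptions for illustration, not the paper's definitions:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Naive frequent-itemset mining by enumeration; fine for a sketch,
    # though a real system would use a scalable miner like FP-growth.
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            s = set(cand)
            support = sum(1 for t in transactions if s <= t) / n
            if support >= min_support:
                freq[cand] = support
                found = True
        if not found:     # anti-monotonicity: no larger itemset can be frequent
            break
    return freq

def fpof(transaction, freq):
    # A frequent-pattern outlier factor: average the supports of the
    # frequent patterns contained in the transaction. Low value = outlier.
    if not freq:
        return 0.0
    contained = [sup for pat, sup in freq.items() if set(pat) <= transaction]
    return sum(contained) / len(freq)
```

Transactions sharing common patterns score high, while a transaction like `{'x'}` that matches no frequent pattern scores zero and surfaces as an outlier.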

Zengyou He, Xiaofei Xu, Joshua Zhexue Huang, Shengchun Deng
Adaptive Load Balancing and Fault Tolerance in Push-Based Information Delivery Middleware Service

Push-based personalized information delivery is a special kind of event-driven application that must monitor and react to changes in information sources or users' interests to provide user-centered real-time services. In this paper, an enhanced event-driven middleware framework with quality of service (QoS) is proposed to support highly reliable, available, and timely information push delivery. An adaptive load-balancing strategy combines static and dynamic load-distribution algorithms to keep the system at a reasonable workload level. For fault tolerance, a super-daemon with an adaptive time slice is adopted for failure detection, node replication, and recovery. Extensive simulations show that the proposed adaptive QoS strategies and algorithms are effective and efficient for event-driven distributed systems.

Zhaofeng Ma, Boqin Feng
Performance Optimization of Fractal Dimension Based Feature Selection Algorithm

Feature selection is a key issue in advanced application fields such as data mining, multi-dimensional statistical analysis, multimedia indexing, and document classification. Exploiting the fractal dimension to reduce the dimensionality of feature spaces is a novel approach, the best-known instance of which is the fractal-dimension-based feature selection algorithm FDR proposed by Traina Jr. et al. This paper proposes an optimized algorithm, OptFDR, which scans the dataset only once and thus avoids the inefficiency of FDR's multiple scans over large datasets. Performance experiments evaluate OptFDR on a real-world image-feature dataset and a synthetic dataset with fractal characteristics; the results show that OptFDR outperforms FDR.

Yubin Bao, Ge Yu, Huanliang Sun, Daling Wang
Graph-Based Cyclic Multi-join Query Optimization Algorithm

Although many multi-join query optimization algorithms have been proposed, cyclic multi-join query optimization algorithms are lacking. In this paper, a graph-based cyclic multi-join query optimization algorithm, Improved Minimum Result Relation-Minimum Weight First (Improved-MR-MWF), is proposed, with time complexity O(mn). Considering the interactions among join operations that involve the same relation, the algorithm chooses the joins whose result relations are smallest as the top-priority operations, and then selects the join with the minimum weight among them. Experimental results indicate that both the cost and the query response time of the new algorithm are lower than those of other algorithms.
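Our simplified reading of the "minimum result relation first" heuristic can be sketched as a greedy loop that repeatedly picks the pending join with the smallest estimated result relation. The cardinality-times-selectivity cost model and the merge rule below are assumptions for illustration, not the paper's exact algorithm:

```python
def greedy_join_order(relations, joins, selectivity):
    # relations: dict name -> cardinality
    # joins: set of frozenset({r1, r2}) edges; cycles are allowed,
    #   since the query graph of a cyclic multi-join is not a tree.
    # selectivity: dict edge -> join selectivity in (0, 1].
    sizes = dict(relations)
    pending = set(joins)
    order = []
    while pending:
        # Estimated result size = |R1| * |R2| * selectivity(R1, R2).
        best = min(pending,
                   key=lambda e: sizes[min(e)] * sizes[max(e)] * selectivity[e])
        r1, r2 = sorted(best)
        result = sizes[r1] * sizes[r2] * selectivity[best]
        order.append((r1, r2))
        # Both relations now stand for the intermediate join result.
        sizes[r1] = sizes[r2] = result
        pending.discard(best)
    return order
```

On a triangle query A-B-C, the cheap A-B join is executed first, shrinking the inputs to the remaining edges of the cycle, which is the effect the heuristic aims for.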

Hong Yu, Xiu-Kun Wang, Jian-Ying Zhang
Backmatter
Metadata
Title
Advances in Web-Age Information Management
Edited by
Qing Li
Guoren Wang
Ling Feng
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-27772-9
Print ISBN
978-3-540-22418-1
DOI
https://doi.org/10.1007/b98703