2006 | Book

Web Information Systems – WISE 2006

7th International Conference on Web Information Systems Engineering, Wuhan, China, October 23-26, 2006. Proceedings

Editors: Karl Aberer, Zhiyong Peng, Elke A. Rundensteiner, Yanchun Zhang, Xuhui Li

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Keynote Papers

Internet-Scale Data Distribution: Some Research Problems

The increasing growth of the Internet, the Web, and mobile environments has had a push-pull effect. On the one hand, these developments have increased the variety and scale of distributed applications, whose data management requirements differ considerably from those of the applications for which most of the existing distributed data management technology was developed. On the other hand, they represent significant changes in the underlying technologies on which these systems are built. This issue has received considerable attention in the last decade, and important progress has been made on a number of fronts. However, there are remaining issues that require attention. In this talk, I will review some of the developments, note areas in which there has been progress, and highlight some of the issues that still require attention.

M. Tamer Özsu
Towards Next-Generation Search Engines and Browsers – Search Beyond Media Types and Places

In this keynote talk, the author will describe concepts and technologies for next-generation search engines and browsers that search and browse contents beyond media types and places. Currently, digital content is represented by different media such as text, images, and video. Digital content is also created, stored, and used in a variety of places (devices), such as independent digital archives, the World Wide Web, TV HDD/DVD recorders, personal PCs, digital appliances, and mobile devices. The viewing styles of these contents differ: WWW pages are accessed and viewed in an active manner, as in a conventional Web browser (a reading, scrolling, and clicking interface), whereas TV content is accessed and viewed in a passive manner. As for searching these "ambient multimedia contents", most commercial search engines currently cover only WWW content and personal PC content, the latter called "desktop search".

First, the author describes research issues necessary for searching “ambient multimedia contents”. The main research issues are (1) cross-media search, (2) ranking methods for contents without hyperlinks, and (3) integration of search results. As for cross-media search, the author describes query-free search, complementary-information retrieval, and cross-media meta-search.

Second, the author describes ways of browsing "ambient multimedia content". The main topics of this part are new browsers based on media conversion of digital content, and concurrent and comparative browsers for multiple contents. For example, the proposed browsers can automatically convert Web content into TV content, and vice versa.

The last part of the talk is concerned with mining the metadata owned by search engines and using it to compute the "trustness" of search results.

Katsumi Tanaka
Building a Domain Independent Platform for Collecting Domain Specific Data from the Web

The World Wide Web has become a major information resource for both individuals and institutions. The freedom of presenting data on the Web in HTML means that information from the same domain, such as book sales or scientific publications, appears on many Web sites in diverse formats. Collecting the data for a particular domain from the Web is thus not a trivial task, and how to solve this problem is becoming an active research area. This talk first gives an overview of this new area by categorizing the information on the Web and indicating the difficulties in collecting domain-specific data from the Web. As a solution, the talk then presents a stepwise methodology for collecting domain-specific data from the Web, and introduces its supporting system SESQ, a domain-independent tool for building topic-specific search engines for applications. The talk shows the full features of SESQ through two application examples. In conclusion, the talk outlines further research directions in this new Web data processing area.

Lizhu Zhou

Session 1: Web Search

A Web Search Method Based on the Temporal Relation of Query Keywords

As use of the Web has become more popular, searching for particular content has been refined by allowing users to enter multiple keywords as queries. However, simple combinations of multiple query keywords may not generate satisfactory search results. We therefore propose a search method which automatically combines query keywords to generate queries by extracting the relations among query keywords. This method consists of two Web search processes: one to determine the temporal relations between query keywords, and one to generate queries based on the obtained temporal relations. We discuss these two processes along with experimental results and implementation issues regarding a prototype system.

Tomoyo Kage, Kazutoshi Sumiya
Meta-search Based Web Resource Discovery for Object-Level Vertical Search

Object-level vertical search engines have recently become a research focus, and the problem of collecting resources for them remains open. It is difficult to adapt a traditional link-based web crawler to this task because the relevant resources are sparsely linked and their webpages are data-centered. In this paper, we propose a meta-search based method, enhanced with auxiliary crawling, to address the sparse linkage of the relevant resources. To retrieve the data-centered webpages efficiently, a domain schema is defined to describe the target resource, and representative data instances are selected for composing meta-search queries. Moreover, evaluation criteria for the domain resource survey are proposed as guidelines for query composition and auxiliary crawling, which enable resource discovery to be performed automatically. Experimental results on real-world data show that our method is effective and efficient.

Ling Lin, Gang Li, Lizhu Zhou
PreCN: Preprocessing Candidate Networks for Efficient Keyword Search over Databases

Keyword Search Over Relational Databases (KSORD) has attracted much research interest, since it lets casual or Web users easily access databases through free-form keyword queries, just as they search the Web. However, improving the performance of KSORD systems is a critical issue. In this paper, we focus on the performance of schema-graph-based online KSORD systems and propose a novel Preprocessing Candidate Networks (PreCN) approach to support efficient keyword search over relational databases. Based on a given database schema, PreCN reduces CN generation time by preprocessing the maximum Tuple Sets Graph (G_ts) to generate CNs in advance and store them in the database. When a user query arrives, its CNs are quickly retrieved from the database instead of being generated on the fly through a breadth-first traversal of its G_ts. Extensive experiments show that PreCN is efficient and effective.

Jun Zhang, Zhaohui Peng, Shan Wang, Huijing Nie
Searching Coordinate Terms with Their Context from the Web

We propose a method for searching coordinate terms using a traditional Web search engine. "Coordinate terms" are terms which have the same hypernym. There are several research methods that acquire coordinate terms, but they need parsed corpora or a lot of computation time. Our system does not need any preprocessing and can rapidly acquire coordinate terms for any query term. It uses a conventional Web search engine to do two searches, whose queries are generated by connecting the user's query term with the conjunction "OR". It also obtains the background context shared by the query term and each returned coordinate term.

Hiroaki Ohshima, Satoshi Oyama, Katsumi Tanaka
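The two-search idea lends itself to a compact illustration. Below is a minimal sketch that mines "X or Y" patterns from result snippets for the two "OR" queries; the `search(query)` helper is a hypothetical stand-in for a search-engine API, not part of the paper.

```python
import re
from collections import Counter

def coordinate_terms(term, search, top_k=10):
    """Mine coordinate-term candidates for `term` from search snippets.

    `search(query)` is a hypothetical helper returning snippet strings
    from a conventional search-engine API."""
    candidates = Counter()
    # Two searches, as in the abstract: the query term joined with the
    # conjunction "OR" on either side, so snippets of the form
    # "term or X" and "X or term" both surface.
    for query in (f'"{term} OR"', f'"OR {term}"'):
        for snippet in search(query):
            for left, right in re.findall(r"(\w+) or (\w+)", snippet, re.I):
                if left.lower() == term.lower():
                    candidates[right.lower()] += 1
                elif right.lower() == term.lower():
                    candidates[left.lower()] += 1
    return [t for t, _ in candidates.most_common(top_k)]
```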

Session 2: Web Retrieval

A Semantic Matching of Information Segments for Tolerating Error Chinese Words

New words and erroneous words appear in the Chinese information on web pages. In this paper, we introduce our definition of semantic similarity between sememes, along with its theorems. After proving the theorems, the influence of the parameter is analyzed. Moreover, this paper presents a novel definition of word similarity based on sememe similarity, which can be used to match new Chinese words with existing Chinese words and to match erroneous Chinese words with correct ones. Based on this word similarity, a matching method for information segments is presented to recognize the category of Chinese web information segments in which new words and erroneous words occur. Experiments with the matching method are also presented. Both the theory and the experimental results show that the novel matching method is efficient.

Maoyuan Zhang, Chunyan Zou, Zhengding Lu, Zhigang Wang
Block-Based Similarity Search on the Web Using Manifold-Ranking

Similarity search on the web aims to find web pages similar to a query page and return a ranked list of similar web pages. The popular approach to web page similarity search is to calculate the pairwise similarity between web pages using the Cosine measure and then rank the web pages by their similarity values with the query page. In this paper, we propose a novel similarity search approach based on manifold-ranking of page blocks to re-rank the initially retrieved web pages. First, web pages are segmented into semantic blocks with the VIPS algorithm. Second, the blocks get their ranking scores from the manifold-ranking algorithm. Finally, web pages are re-ranked according to the overall retrieval scores obtained by fusing the ranking scores of the corresponding blocks. The proposed approach evaluates web page similarity at the finer granularity of page blocks instead of the traditionally coarse granularity of the whole web page. Moreover, it can make full use of the intrinsic global manifold structure of the blocks to rank them more appropriately. Experimental results on ODP data demonstrate that the proposed approach significantly outperforms the popular Cosine measure, and validate the semantic block as a better unit than the whole web page in the manifold-ranking process.

Xiaojun Wan, Jianwu Yang, Jianguo Xiao
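The re-ranking step builds on the standard manifold-ranking iteration of Zhou et al.; a minimal sketch of that core computation follows, where the block affinity matrix `W` and the query-block indicator `y` are assumed inputs (the VIPS segmentation and score fusion steps are not reproduced here).

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99, iters=100):
    """One manifold-ranking pass (Zhou et al.): scores spread from the
    query indicator y over the affinity graph W until convergence.

    Here W would hold pairwise block affinities and y would mark the
    query page's blocks; both are assumed inputs of this sketch."""
    W = W * (1.0 - np.eye(len(W)))               # remove self-loops
    d = W.sum(axis=1)
    S = W / (np.sqrt(np.outer(d, d)) + 1e-12)    # D^{-1/2} W D^{-1/2}
    f = y.astype(float)
    for _ in range(iters):
        f = alpha * (S @ f) + (1.0 - alpha) * y  # propagate + anchor to y
    return f                                     # per-block ranking scores
```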
Design and Implementation of Preference-Based Search

Preference-based search is the problem of finding the item that best matches a user's preferences. User studies show that example-based tools for preference-based search can achieve significantly higher accuracy when they are complemented with suggestions chosen to inform users about the available choices. We present FlatFinder, an implementation of an example-based tool, and discuss how such a tool, as well as suggestions, can be efficiently implemented even for large product databases.

Paolo Viappiani, Boi Faltings
Topic-Based Website Feature Analysis for Enterprise Search from the Web

Efficient and accurate enterprise search is a challenging and important problem for specific resources available on the web. Domain-specific enterprise websites are similar in topic structure and textual content. Considering the semantic information of website content terms, a novel website feature vector modelling method representing the website topic is proposed on the basis of the vector space model. The feature vector elements integrate textual semantic information about the topic content with structure information, through different semantic terms and weighting schemes respectively. Comparative recognition results demonstrate that this feature analysis approach to website topics shows strong potential for specific enterprise web search.

Baoli Dong, Huimei Liu, Zhaoyong Hou, Xizhe Liu

Session 3: Web Workflows

Fault-Tolerant Orchestration of Transactional Web Services

As composite services are often long-running and loosely coupled, and cross application and administrative boundaries, they are susceptible to a wide variety of failures. This paper presents a solution for fault-tolerant web services orchestration using relaxed atomic execution and exception handling. To achieve atomic execution, a scalable commit protocol is proposed, which allows heterogeneous transactional web services to participate in a composition. A recovery algorithm is given to ensure reliable service orchestration in the presence of failures.

An Liu, Liusheng Huang, Qing Li, Mingjun Xiao
Supporting Effective Operation of E-Governmental Services Through Workflow and Knowledge Management

Improving the efficiency of governmental administrative processes is key to the successful application of e-government. Workflow technology offers a means to automate business processes. However, it is inappropriate to employ workflow technology alone for automating governmental processes, which are typically knowledge-intensive. In this paper, we study an approach to automating or semi-automating e-governmental processes by combining workflow technology with knowledge management, especially semantic web technologies such as OWL and SWRL. The proposed approach has several parts: (1) government processes are modeled using UML activity diagrams with extended UML profiles; (2) Problem-Solving-Method (PSM) tasks are represented in OWL-S; (3) an application ontology for applying for social security cards is developed in OWL; and (4) based on the proposed approach, the architecture of a knowledge-driven e-governmental process management system is presented and a prototype is implemented using workflow engines, Jena, and JESS.

Dong Yang, Lixin Tong, Yan Ye, Hongwei Wu
DOPA: A Data-Driven and Ontology-Based Method for Ad Hoc Process Awareness in Web Information Systems

The knowledge embedded in Web Information Systems (WISs) means that business processes cannot be described by a fixed model. Because it does not integrate seamlessly with domain knowledge, traditional workflow management technology cannot manage such ad hoc processes well. To solve this problem, we propose a data-driven and ontology-based method to support process awareness. Using an ontology-based model, we integrate domain knowledge and WIS structure into the process. A set of reasoning rules is designed to regulate task transfer, and learning from past cases helps users make decisions. A smart tool for ontology building and WIS generation has been developed, and we demonstrate the feasibility of this method in a real Web-based application.

Meimei Li, Hongyan Li, Lv-an Tang, Baojun Qiu
A Transaction-Aware Coordination Protocol for Web Services Composition

With the rapid development of the WWW, Web Services are becoming a new application model for decentralized computing over the Internet. However, the tradeoff between consistency and resource utilization is the primary obstacle to building a transactional environment for Web Services compositions, since a resource may not be lockable exclusively by an unknown Internet user. We propose a Transaction-Aware Tentative Hold Protocol (taTHP) that perceives transaction context information and plays a more active role in transaction coordination. By forecasting which transactions will succeed within a fairly small set of candidates, taTHP achieves higher resource utilization with fewer complaints about transaction coordination. Finally, a comprehensive comparison demonstrates the improvement of the proposed protocol; the results show that taTHP provides better efficiency and satisfactory quality of service.

Wei Xu, Wenqing Cheng, Wei Liu

Session 4: Web Services

Unstoppable Stateful PHP Web Services

This paper presents the architecture and implementation of the EOS2 failure-masking framework for composite Web Services. EOS2 is based on the recently proposed notion of interaction contracts (IC) and provides exactly-once execution semantics for general, arbitrarily distributed Web Services in the presence of message losses and component crashes, without requiring explicit coding effort by the application programmer. The EOS2 implementation masks failures by adding a recovery layer to popular Web technology products: (i) the server-side scripting language PHP running on the Apache Web server, and (ii) Internet browsers such as IE, to deliver recovery guarantees to the end user.

German Shegalov, Gerhard Weikum, Klaus Berberich
Quantified Matchmaking of Heterogeneous Services

As the service-oriented computing paradigm and its related technologies mature, electronic services are expected to continue to grow in number. In such a setting, service discovery may yield many alternative yet heterogeneous services which may be of different types and are further distinguished by their quality characteristics. To cope with such situations and ease the task of service selection, service search engines need to be powered by an efficient matchmaking mechanism that abstracts requesters from service heterogeneity and provides them with the means for choosing, among a wide set of services with similar functionality, the service that best fits their requirements. In this paper, we present an efficient service matchmaking algorithm which facilitates the selection of heterogeneous services by combining and exploiting the syntactic, semantic, and Quality-of-Service (QoS) properties contained in service advertisements.

Michael Pantazoglou, Aphrodite Tsalgatidou, George Athanasopoulos
Pattern Based Property Specification and Verification for Service Composition

Service composition is becoming the dominant paradigm for developing Web service applications. It is important to ensure that a service composition complies with the requirements for the application. A rigorous compliance checking approach usually needs the requirements being specified in property specification formalisms such as temporal logics, which are difficult for ordinary software practitioners to comprehend. In this paper, we propose a property pattern based specification language, named PROPOLS, and use it to verify BPEL service composition schemas. PROPOLS is easy to understand and use, yet is formally based. It builds on Dwyer et al.’s property pattern system and extends it with the logical composition of patterns to accommodate the specification of complex requirements. PROPOLS is encoded in an ontology language, OWL, to facilitate the sharing and reuse of domain knowledge. A Finite State Automata based framework for verifying BPEL schemas against PROPOLS properties is also discussed.

Jian Yu, Tan Phan Manh, Jun Han, Yan Jin, Yanbo Han, Jianwu Wang
Detecting the Web Services Feature Interactions

Feature interaction was first identified as a problem in telecommunications, but it is not limited to that domain; it can occur in any software system that is subject to change. With the introduction of service composition techniques in the Web services area, a variety of message interactions arise among the composed services. As a result, the Web Services Feature Interaction problem becomes inevitable. This paper investigates the detection of this problem and proposes a novel multilayer detection system (WSFIDS) inspired by the immune system. By applying immune principles, the system provides an effective method to detect both known and previously unknown feature interactions.

Jianyin Zhang, Fangchun Yang, Sen Su

Session 5: Web Mining

Exploiting Rating Behaviors for Effective Collaborative Filtering

Collaborative Filtering (CF) is important in the e-business era, as it can help companies predict customer preferences. However, sparsity is still a major problem preventing it from achieving better effectiveness: many ratings in the training matrix are unknown, and few current CF methods try to fill in those blanks before predicting the ratings of an active user. In this work, we validate the effectiveness of matrix filling methods for collaborative filtering. We try three different matrix filling methods, applied to the whole training dataset and to its clustered subsets with different weights, to show their different effects. By comparison, we analyze the characteristics of these methods and find that the mainstream method, personality diagnosis (PD), works better with most matrix filling methods. Its MAE reaches 0.935 on a 2%-density EachMovie training dataset with item-based matrix filling, a 10.1% improvement. Similar improvements are found on both the EachMovie and MovieLens datasets. Our experiments also show that there is no need to do cluster-based matrix filling, but the filled values should be assigned a lower weight during the prediction process.

Dingyi Han, Yong Yu, Gui-Rong Xue
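As a concrete (and deliberately simple) instance of the matrix-filling step, the sketch below fills unknown ratings with per-item means and computes the MAE metric the abstract reports; the paper's three filling methods and cluster weighting are not reproduced here.

```python
import numpy as np

def item_mean_fill(R):
    """Fill unknown ratings (NaN) with each item's mean rating -- one
    simple stand-in for the matrix-filling step (an assumption of this
    sketch, not the paper's exact methods). R is users x items."""
    filled = R.copy()
    item_means = np.nanmean(R, axis=0)      # per-item average rating
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = item_means[cols]
    return filled

def mae(pred, truth):
    """Mean absolute error over the known entries of `truth`."""
    mask = ~np.isnan(truth)
    return float(np.abs(pred[mask] - truth[mask]).mean())
```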
Exploiting Link Analysis with a Three-Layer Web Structure Model

Link analysis is one of the most effective methods of web structure mining. Traditional link analysis methods consider only the flat structure of the web given by hyperlinks, which inevitably affects the precision of the analysis. We observe that the web can be treated as an entity with a three-layer structure: a host layer, a page layer, and a block layer. Taking this three-layer structure into account is expected to improve precision significantly when performing link analysis. A novel algorithm, three-layer based ranking, is proposed for this task. In this algorithm, important hosts and blocks are found through adaptations of traditional link analysis methods. Based on these hosts and blocks, the web pages that both belong to important hosts and contain important blocks can be retrieved; these are precisely the authoritative pages that link analysis looks for. We experimentally evaluate the precision of our three-layer based ranking algorithm, and extensive experiments show that it significantly outperforms traditional algorithms.

Qiang Wang, Yan Liu, JunYong Luo
Concept Hierarchy Construction by Combining Spectral Clustering and Subsumption Estimation

With the rapid development of the Web, how to add structural guidance (in the form of concept hierarchies) for Web document navigation becomes a hot research topic. In this paper, we present a method for the automatic acquisition of concept hierarchies. Given a set of concepts, each concept is regarded as a vertex in an undirected, weighted graph. The problem of concept hierarchy construction is then transformed into a modified graph partitioning problem and solved by spectral methods. As the undirected graph cannot accurately depict the hyponymy information regarding the concepts, subsumption estimation is introduced to guide the spectral clustering algorithm. Experiments on real data show very encouraging results.

Jing Chen, Qing Li
Automatic Hierarchical Classification of Structured Deep Web Databases

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that there are some manually classified databases, i.e., training databases, in every node of the topic hierarchy. Each training database is probed using queries constructed from the node titles of the topic hierarchy and the query result counts reported by the database are used to represent the content of the database. Hence, when adding a new database it can be probed by the same set of queries and classified to a node whose training databases are most similar to the new one. Specifically, a support vector machine classifier is trained on each internal node of the topic hierarchy with these training databases and the new database can be classified into the hierarchy top-down level by level. A feature extension method is proposed to create discriminant features. Experiments run on real structured Web databases collected from the Internet show that this classification method is quite accurate.

Weifeng Su, Jiying Wang, Frederick Lochovsky

Session 6: Performant Web Systems

A Robust Web-Based Approach for Broadcasting Downward Messages in a Large-Scaled Company

Downward communication is a popular push-based scheme for forwarding messages from headquarters to front-line staff in a large-scale company. With maturing intranet and web technologies, and with both pull-based and push-based broadcasting algorithms available, it is feasible to deliver downward messages through a web-based design that sends packets over a network. To avoid the message loss of the traditional push-based method, companies adopt a pull-based algorithm to build their broadcasting systems. However, although the pull-based method can ensure that a message is received, it has a critical problem: the network is always congested. The push-based method avoids congesting the network, but it needs a specifically robust design to ensure that messages reach their destinations. Hence, adopting only a pull-based or only a push-based broadcasting algorithm is not feasible, especially for a large-scale company with a complex network architecture. To ensure that every receiver reads downward messages while reducing the consumption of network bandwidth, this work proposes a robust web-based broadcasting system for downward messages that combines push and pull. The proposed system was successfully deployed in a large-scale company over a one-year period.

Chih-Chin Liang, Chia-Hung Wang, Hsing Luh, Ping-Yu Hsu
Buffer-Preposed QoS Adaptation Framework and Load Shedding Techniques over Streams

Maintaining the quality of queries over streaming data is a tremendous challenge, since the data arrival rate and the average per-tuple CPU processing cost are highly unpredictable. In this paper, we propose a novel buffer-preposed QoS adaptation framework based on control theory, and present several load shedding techniques and scheduling strategies to guarantee the QoS of streaming data processing. The most significant part of our framework, the buffer manager, consisting of a scheduler, an adaptor, and a cleaner, is introduced and analyzed in detail. Experiments on both synthetic and real-life data show that our system, built by adding several concrete strategies to the framework, outperforms existing work in both resource utilization and QoS assurance.

Rui Zhou, Guoren Wang, Donghong Han, Pizhen Gong, Chuan Xiao, Hongru Li
Cardinality Computing: A New Step Towards Fully Representing Multi-sets by Bloom Filters

Bloom Filters are space- and time-efficient randomized data structures for representing (multi-)sets with certain allowable errors, and are widely used in many applications. Previous work on Bloom Filters considered how to support insertions, deletions, membership queries, and multiplicity queries over (multi-)sets. In this paper, we introduce two novel algorithms for computing the cardinalities of multi-sets represented by Bloom Filters, which extend the functionality of the Bloom Filter and thus make it usable in a variety of new applications. The Bloom structure presented in previous work is used without any modification, and our algorithms have no influence on the previous functionality. Since Bloom Filters can then support cardinality computing in addition to insertions, deletions, membership queries, and multiplicity queries, our work is a new step towards fully representing multi-sets by Bloom Filters. Performance analysis and experimental results show the difference between the two algorithms and show that they perform well in most cases.

Jiakui Zhao, Dongqing Yang, Lijun Chen, Jun Gao, Tengjiao Wang
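The paper's two algorithms are not spelled out in the abstract. As background, the sketch below pairs a plain Bloom filter with the classic fill-ratio cardinality estimate n ≈ -(m/k)·ln(1 - X/m) (Swamidass and Baldi), which illustrates what computing a cardinality from the bit array alone can look like.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        for i in range(self.k):      # k salted hashes -> k bit positions
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def cardinality(self):
        """Estimate distinct insertions from the fill ratio:
        n ~ -(m/k) * ln(1 - X/m), where X = number of set bits.
        This is the classic estimate, shown for illustration; the
        paper's own algorithms differ in detail."""
        X = sum(self.bits)
        if X == self.m:
            return float("inf")      # saturated filter: estimate diverges
        return -(self.m / self.k) * math.log(1.0 - X / self.m)
```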
An Efficient Scheme to Completely Avoid Re-labeling in XML Updates

A number of labeling schemes have been designed to label element nodes such that the relationships between nodes can easily be determined by comparing their labels. However, the label update cost is high in existing labeling schemes: they must re-label existing nodes or re-calculate certain values when inserting an order-sensitive element. In this paper, a labeling scheme is proposed that supports order-sensitive updates without re-labeling or re-calculation. Experimental results show that the proposed scheme efficiently processes order-sensitive queries and updates.

Hye-Kyeong Ko, SangKeun Lee
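A generic way to see how order-sensitive insertion can avoid re-labeling altogether (an illustration of the idea, not the paper's actual scheme) is to draw labels from a dense order, so a fresh label always fits between two neighbours:

```python
from fractions import Fraction

def label_between(left, right):
    """Return a label strictly between two existing sibling labels, so an
    order-sensitive insert never forces neighbours to be re-labeled."""
    return (left + right) / 2

a, b = Fraction(1), Fraction(2)
c = label_between(a, b)   # 3/2 sits between a and b
d = label_between(a, c)   # 5/4: repeated inserts still need no re-labeling
assert a < d < c < b      # document order is read directly off the labels
```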

Session 7: Web Information Systems

Modeling Portlet Aggregation Through Statecharts

A portal is a key component of an enterprise integration strategy. It provides integration at the user interface level, whereas other integration technologies support business process, functional, or data integration. To this end, portlet syndication is the next wave following the successful use of content syndication in current portals. A portlet is a front-end application which is rendered within the portal framework. From this perspective, portlets can be regarded as Web components, and the portal as the component container where portlets are aggregated to provide higher-order applications. Unlike back-end integration approaches (e.g. workflow systems), portlet aggregation demands front-end solutions that let users navigate freely among portlets in a hypertext way. To this end, the Hypermedia Model Based on Statecharts is used. This model uses the structure and execution semantics of statecharts to specify both the structural organization and the browsing semantics of portlet aggregation. Besides familiarity, statecharts bring formal validation to portal design, helping portal designers develop structured portals. As a proof of concept, this model has been realized on the eXo portal platform.

Oscar Díaz, Arantza Irastorza, Maider Azanza, Felipe M. Villoria
Calculation of Target Locations for Web Resources

A location-based search engine must be able to find and assign proper locations to Web resources. Host, content, and metadata location information are not sufficient to describe the location of resources, as they are ambiguous or unavailable for many documents. We introduce the target location, the location of the users of a Web resource. Target location is content-independent and can be applied to all types of Web resources. A novel method is introduced which uses log files and IP addresses to track the visitors of websites. The experiments show that a target location can be calculated for almost all documents on the Web at the country level, and for the majority of them at the state and city levels. It can be assigned to Web resources as a new definition and dimension of location, used separately or together with other relevant locations to define the geography of Web resources. This compensates for the insufficient geographical information on Web resources and facilitates the design and development of location-based search engines.

Saeid Asadi, Jiajie Xu, Yuan Shi, Joachim Diederich, Xiaofang Zhou
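The visitor-tracking step can be sketched as a small log-aggregation routine; `ip_to_country` is a hypothetical geolocation lookup standing in for a real GeoIP database, and the log format is an assumption of this sketch.

```python
from collections import Counter

def target_locations(log_lines, ip_to_country):
    """Aggregate visitor countries for a Web resource from access-log
    lines; assumes the client IP is the first whitespace-separated
    field. `ip_to_country` is a hypothetical lookup helper."""
    counts = Counter()
    for line in log_lines:
        country = ip_to_country(line.split()[0])
        if country:
            counts[country] += 1
    if not counts:
        return {}
    total = sum(counts.values())
    # The dominant countries in this distribution approximate the
    # resource's target location at country level.
    return {c: n / total for c, n in counts.most_common()}
```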
Efficient Bid Pricing Based on Costing Methods for Internet Bid Systems

Internet bid systems have recently come into wide use. In these systems, the seller sets the bid price. When the bid price is set too high compared with the normal price, the chances of a successful bid decrease; when it is set too low, based on inaccurate information, the bid may succeed but yield no profit at all. To resolve this problem, an agent is proposed that automatically generates bid prices for sellers, based on the similarity of bidding parameters in past bidding information as well as on various costing methods such as the high-low point method, the scatter diagram method, and the learning curve method. Performance experiments have shown that the number of successful bids with appropriate profits can be increased using the bid pricing agent; among the costing methods, the learning curve method shows the best performance. The design and implementation of the bid pricing agent are also discussed.

Sung Eun Park, Yong Kyu Lee
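Of the costing methods named above, the high-low point method is the simplest to show: fit a linear cost model from the records with the highest and lowest activity. A minimal sketch with made-up numbers:

```python
def high_low_cost_model(records):
    """High-low point costing: fit cost = fixed + variable * activity
    using only the records with the lowest and highest activity.
    `records` is a list of (activity, total_cost) pairs."""
    lo = min(records)    # record with the lowest activity level
    hi = max(records)    # record with the highest activity level
    variable = (hi[1] - lo[1]) / (hi[0] - lo[0])
    fixed = hi[1] - variable * hi[0]
    return fixed, variable

# Made-up example: 100 units cost 5000, 400 units cost 11000.
fixed, variable = high_low_cost_model([(100, 5000), (400, 11000)])
estimate = fixed + variable * 250    # cost basis for a candidate bid price
assert (fixed, variable) == (3000.0, 20.0) and estimate == 8000.0
```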
An Enhanced Super-Peer Model for Digital Library Construction

Peer-to-Peer (P2P) overlay networks have emerged as a major infrastructure for constructing future digital libraries. Among the various P2P infrastructures, super-peer based P2P networks receive extensive attention because the super-peer paradigm allows a node to act not just as a client but also to serve a set of clients. Unlike the conventional file-sharing paradigm, digital library applications have more advanced requirements on system independence/autonomy, robustness, and flexible communication. This paper is devoted to constructing digital library systems upon such a super-peer based network, namely the JXTA framework. Evaluation results are presented concerning network initialization, load balancing, and self-organization.

Hao Ding, Ingeborg Sølvberg, Yun Lin
Offline Web Client: Approach, Design and Implementation Based on Web System

An offline web client is a new type of Rich Internet Application that supports a user's web operations whether or not a network connection is available. It is very useful in many situations, since many offline scenarios involve users who work while explicitly disconnected from the network. This paper proposes approaches for designing offline web clients, as guidance for developing offline web client applications, and describes the details of how to design and implement them. Finally, a prototype system is presented to verify the proposed approach. The experiments show that an offline web client can increase user productivity and satisfaction, and enhance the usage of web-based systems.

Jie Song, Ge Yu, Daling Wang, Tiezheng Nie

Session 8: Web Document Analysis

A Latent Image Semantic Indexing Scheme for Image Retrieval on the Web

In this paper, we present a novel latent image semantic indexing scheme for efficient retrieval of WWW images. We present a hierarchical image semantic structure called HIST, which captures image semantics in an ontology tree and visual features in a set of specific semantic domains. The query algorithm works in two phases. First, the ontology is used to quickly locate the relevant semantic domains. Second, within each semantic domain, the visual features are extracted and similarity techniques are exploited to break the "dimensionality curse". The target images can then be efficiently retrieved with high precision. The experimental results show that HIST achieves good query performance; our method is therefore promising for diverse Web image retrieval.

Xiaoyan Li, Lidan Shou, Gang Chen, Lujiang Ou
Hybrid Method for Automated News Content Extraction from the Web

Web news content extraction is vital to improving news indexing and searching in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid one, taking advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. By implementing a prototype system for Web news content extraction, we conduct an empirical evaluation; the results show that our method is highly effective and efficient.

Yu Li, Xiaofeng Meng, Qing Li, Liping Wang
A Hybrid Sentence Ordering Strategy in Multi-document Summarization

In extractive summarization, a proper arrangement of the extracted sentences must be found if we want to generate a logical, coherent, and readable summary. This issue is particularly acute in multi-document summarization. In this paper, several existing methods, each of which generates a reference relation, are combined through a linear combination of the resulting relations. We use four types of relationships between sentences (chronological, positional, topical, and dependent) to build a graph model in which the vertices are sentences and the edges are weighted relationships of the four types, and then apply a variant of PageRank to obtain the ordering of sentences for multi-document summaries. We tested our hybrid model with two automatic methods: distance to a manual ordering, and ROUGE score. Evaluation results show a significant improvement of the ordering over strategies that drop some of the relations. The results also indicate that this hybrid model is robust for articles of different genres, as used in DUC2004 and DUC2005.

Yanxiang He, Dexi Liu, Hua Yang, Donghong Ji, Chong Teng, Wenqing Qi
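The graph step can be sketched with a PageRank-style iteration over the fused relation matrix; the combined non-negative matrix `A` and the direct use of scores for ordering are assumptions of this sketch, since the abstract does not give the exact variant.

```python
import numpy as np

def order_by_rank(A, d=0.85, iters=100):
    """PageRank-style scoring over a fused sentence-relation graph.

    A is the combined non-negative relation matrix (chronological,
    positional, topical, and dependent edges already weighted and
    summed) -- an assumed input of this sketch."""
    n = len(A)
    rowsum = A.sum(axis=1, keepdims=True)
    # Row-normalize; rows with no outgoing weight jump uniformly.
    P = np.divide(A, rowsum, out=np.full_like(A, 1.0 / n), where=rowsum > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1.0 - d) / n + d * (P.T @ r)
    return np.argsort(-r)    # sentence indices, highest score first
```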
Document Fragmentation for XML Streams Based on Query Statistics

Recently, XML fragment processing has become prevalent on the Web due to its flexibility and manageability. In this paper, we propose two techniques for document fragmentation that take into account the query statistics over the XML data: path frequency trees (PFT) and Markov tables. Both techniques work by merging nodes of low query frequency to enhance fragment utilization, or by merging nodes of high query frequency to enhance fragment cohesion. A performance study shows that our algorithms perform well on query cost and other metrics.

Huan Huo, Guoren Wang, Xiaoyun Hui, Chuan Xiao, Rui Zhou
A Heuristic Approach for Topical Information Extraction from News Pages

Topical information extraction from news pages can facilitate news searching and retrieval. A web page can be partitioned into multiple blocks, and the importance of the blocks varies; estimating block importance can be cast as a classification problem. First, an adaptive vision-based page segmentation algorithm is used to partition a web page into semantic blocks. Then spatial features and content features are used to represent each block, with Shannon's information entropy adopted to measure each feature's discriminating ability. A weighted Naïve Bayes classifier is used to estimate whether a block is important. Finally, a variant of TF-IDF is utilized to weight each keyword, and similar blocks are united into the topical region. The approach is tested on several important English and Chinese news sites; both recall and precision are greater than 96%.

Yan Liu, Qiang Wang, QingXian Wang
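A minimal sketch of the classifier type named above: a Naïve Bayes whose per-feature log-likelihoods are scaled by entropy-derived weights. The feature set and the exact weight formula here are illustrative assumptions, not the paper's design.

```python
import math

def entropy_weight(class_dist):
    """Weight a feature by its discriminating ability: a feature whose
    values split unevenly across classes (low entropy) gets more weight."""
    H = -sum(p * math.log2(p) for p in class_dist if p > 0)
    H_max = math.log2(len(class_dist))
    return 1.0 - H / H_max if H_max > 0 else 1.0

def weighted_nb_score(features, likelihoods, weights, prior):
    """Weighted Naive Bayes: per-feature log-likelihoods are scaled by
    their weights before summing; higher score = more likely important."""
    score = math.log(prior)
    for name, value in features.items():
        score += weights[name] * math.log(likelihoods[name][value])
    return score
```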

Session 9: Quality, Security and Trust

Defining a Data Quality Model for Web Portals

Advances in technology and the use of the Internet have favoured the appearance of a great variety of Web applications, among them Web portals. These applications are important information sources and/or means of accessing information. Many people obtain information through these applications and need to ensure that it is suitable for their intended use; in other words, they need to assess the quality of the data.

In recent years, several research projects have been conducted on the topic of Web data quality. However, there is still a lack of specific proposals for the data quality of portals. In this paper we introduce a model for data quality in Web portals (PDQM). PDQM is built upon three key aspects: a set of Web data quality attributes identified in the literature in this area, the data quality expectations of data consumers on the Internet, and the functionalities that a Web portal may offer its users.

Angélica Caro, Coral Calero, Ismael Caballero, Mario Piattini
Finding Reliable Recommendations for Trust Model

This paper presents a novel context-based approach to finding reliable recommendations for a trust model in ubiquitous environments. Context is used in our approach to analyze the user's activity, state, and intention. An incremental-learning neural network is used to process the context in order to detect doubtful recommendations. This approach has distinct advantages when dealing with randomly given irresponsible recommendations, individually unfair recommendations, and floods of unfair recommendations, whether they come from recommenders who always give malicious recommendations or from an "inside job" (recommenders who previously acted honestly and suddenly give unfair recommendations), cases which previous works have not adequately considered. The incremental-learning neural network also makes it possible to filter out unfair recommendations with only limited information about the recommenders. Our simulation results show that our approach can effectively find reliable recommendations in different scenarios; a comparison between previous works and our method is also given.

Weiwei Yuan, Donghai Guan, Sungyoung Lee, Youngkoo Lee, Andrey Gavrilov
Self-Updating Hash Chains and Their Implementations

Hash chains are widely used in cryptographic applications such as one-time passwords, server-supported signatures, and micropayments. However, the finite length ("limited-link") of hash chains limits their applications, and the methods for re-initializing hash chains or building infinite hash chains introduced in the literature are inefficient and un-smooth. In this paper, a novel scheme (a new kind of hash chain) is proposed which re-initializes or updates by itself, named the Self-Updating Hash Chain (SUHC). The highlights of SUHC are self-updating, fine-authentication, and proactive updating. The updating process of SUHC is smooth, secure, and efficient; it does not need additional protocols or an independent re-initialization process, and it can be continued indefinitely to give rise to an infinite-length hash chain. An improved Server-Supported Signature with SUHC is also presented to show an application of SUHC.

Haojun Zhang, Yuefei Zhu
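SUHC itself cannot be reconstructed from the abstract, but the plain finite hash chain it improves on is easy to show. A minimal sketch of the classic one-time-password chain (Lamport style):

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def make_chain(seed: bytes, n: int):
    """Classic finite hash chain: commit to h^n(seed); later reveal
    h^(n-1)(seed), h^(n-2)(seed), ... as one-time passwords. SUHC
    extends this idea with self-updating; only the plain chain is
    shown here."""
    chain = [seed]
    for _ in range(n):
        chain.append(h(chain[-1]))
    return chain

chain = make_chain(b"secret-seed", 1000)
anchor = chain[-1]        # public commitment held by the verifier
otp = chain[-2]           # first one-time password released by the user
assert h(otp) == anchor   # verification costs one hash application
```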
Modeling Web-Based Applications Quality: A Probabilistic Approach

Quality assurance of Web-based applications is a major concern. Many factors can affect their quality, and modeling and measuring these factors are by nature uncertain and subjective tasks. In addition, representing the relationships between these factors is complex. In this paper, we propose an approach for modeling and supporting the assessment of Web-based application quality. Our proposal is based on Bayesian networks.

Ghazwa Malak, Houari Sahraoui, Linda Badri, Mourad Badri
Monitoring Interactivity in a Web-Based Community

Over the years, Web-based communities (WBCs) have evolved from social phenomena with no business dimension into business enablers in the virtual marketplace. A WBC becomes economically interesting as the community grows and its members actively participate in sharing ideas and knowledge. There are, however, inherent problems associated with the increasing size of a WBC, especially determining the contributions of members to sustaining the community's interactivity. Interactivity relates to a member's level of participation in a given community and the usefulness of their contributions to the community's needs. In this paper, we present an interactivity model that captures the contributions of community members and determines the community's interactivity level. We use simulation results to validate the correctness of our model.

Chima Adiele, Wesley Penner

Session 10: Semantic Web and Integration

A Metamodel-Based Approach for Extracting Ontological Semantics from UML Models

UML has been a standard language for domain modeling and application system design for about a decade. Since UML models include domain knowledge that has been verified by domain experts, it is helpful to use these models as a starting point for constructing domain ontologies. In this paper, we propose a method for extracting ontological concepts, properties, restrictions, and instances from UML models. We compare the UML metamodel with the OWL metamodel and define transformation rules for constructing an OWL-encoded ontology. We expect the generated ontology to serve as an early-stage model for ontology development.

Hong-Seok Na, O-Hoon Choi, Jung-Eun Lim
Deeper Semantics Goes a Long Way: Fuzzified Representation and Matching of Color Descriptions for Online Clothing Search

Indexing and retrieval by color descriptions are very important for finding certain web resources, a typical example being online clothing search. However, both keyword matching and semantic mediation by current ontologies may suffer from the semantic gap between similarity evaluation and human perception, which calls for exploiting the "deeper" semantics of color descriptions to reduce this gap. Considering the inherent variability and imprecision characterizing color naming, this paper proposes a novel approach that defines (1) the fuzzy semantics of color names in the HSL color space, together with their knowledge representation in fuzzy conceptual graphs, and (2) the associated measures for evaluating the similarity between fuzzified color descriptions. The experimental results from the prototype clothing search system show, preliminarily, that the deeper semantics surpasses both keywords and a concept hierarchy in handling the matching of color descriptions in the targeted web resource search.

Haiping Zhu, Huajie Zhang, Yong Yu
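To make the fuzzy-semantics idea tangible, here is a toy sketch using triangular membership functions on the hue axis and a max-min overlap similarity; the hue ranges are invented for illustration, and the paper's definitions cover the full HSL space rather than hue alone.

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Invented fuzzy hue ranges (degrees) for a few color names.
FUZZY_HUE = {"red": (-30, 0, 30), "orange": (10, 30, 50), "yellow": (40, 60, 80)}

def color_similarity(name1, name2):
    """Max-min overlap of two fuzzy sets: the highest degree to which
    some hue belongs to both color names at once."""
    return max(min(tri(h, *FUZZY_HUE[name1]), tri(h, *FUZZY_HUE[name2]))
               for h in range(360))

# "orange" overlaps "red" on the hue circle; "yellow" does not.
assert color_similarity("red", "orange") > color_similarity("red", "yellow")
```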
Semantically Integrating Portlets in Portals Through Annotation

Portlets are currently supported by most portal frameworks. However, there is not yet a definitive answer to portlet interoperation, whereby data flows smoothly from one portlet to a neighboring one. One of the approaches is to use deep annotation: by providing additional markup about the background services, deep annotation strives to interact with these underlying services rather than with the HTML surface that conveys the markup. In this way, the portlet can extend its markup with meta-data about the processes this markup conveys. Then, the portlet consumer (e.g. a portal) can use this meta-data to guide the mapping from the available data found in the markup of portlet A to the data required in the markup of portlet B. This mapping is visualised as portlet B having its input form (or other "input" widget) filled in. However, annotating is a cumbersome process that forces one to keep the meta-data and the annotated resources (i.e. the markup) in sync. This paper presents an automatic process whereby annotations are generated from portlet markup without user intervention. We detail our prototype, which uses Lixto Visual Wrappers to extract semantic data from the markup.

Iñaki Paz, Oscar Díaz, Robert Baumgartner, Sergio F. Anzuola
A Self-organized Semantic Clustering Approach for Super-Peer Networks

Partitioning a P2P network into distinct semantic clusters can markedly increase the efficiency of searching and enhance the scalability of the network. In this paper, two semantic-based self-organizing algorithms for a taxonomy-hierarchy semantic space are proposed. They dynamically partition the network into distinct semantic clusters according to network load, while maintaining both the semantic relationships among data within each cluster and the load balance among clusters. Experiments indicate good performance and scalability for these two clustering algorithms.

Baiyou Qiao, Guoren Wang, Kexin Xie
Using Categorial Context-SHOIQ(D+) DL to Migrate Between the Context-Aware Scenes

An important issue in semantic web ontology applications is how to improve ontology evolution to fit the semantics of an unceasingly changing context. This paper presents a context-based formalism, Context-SHOIQ(D+) DL, framed within the SHOIQ(D+) description logic and viewed from the standpoint of category theory. The core of the proposed formalism is a categorial context based on SHOIQ(D+) DL that captures and explicitly represents information about contexts. Additionally, this paper presents some meta-languages for reasoning and knowledge representation, and finally discusses context-aware migration between different scenes with the categorial Context-SHOIQ(D+) DL.

Ruliang Xiao, Shengqun Tang, Ling Li, Lina Fang, Youwei Xu, Yang Xu

Session 11: XML Query Processing

SCEND: An Efficient Semantic Cache to Adequately Explore Answerability of Views

Maintaining a semantic cache of materialized XPath views, inside or outside the database, is a novel, feasible, and efficient approach to accelerating XML query processing. However, existing approaches either cannot exploit enough of the potentially useful cached views to answer an issued query, or need too much time for cache lookup. In this paper, we propose SCEND, an efficient Semantic Cache based on dEcompositioN and Divisibility, which adequately explores the answerability of views and speeds up cache lookup dramatically. We decompose complex XPath queries into much simpler and more tractable ones to improve the cache hit rate, and we introduce a notion of divisibility between two positive integers to accelerate cache lookup. In addition, we present a new replacement technique for SCEND to improve caching performance. We experimentally demonstrate the efficiency of our caching techniques and the performance gains obtained by employing such a cache.

Guoliang Li, Jianhua Feng, Na Ta, Yong Zhang, Lizhu Zhou
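The abstract does not define the divisibility encoding. One classic way such a test can work (a hypothetical illustration only, not necessarily SCEND's encoding) is to map each distinct query step to a prime and a step set to the product of its primes, so that step-set containment becomes an integer divisibility check:

```python
from itertools import count

def primes():
    """Yield primes by trial division (fine for a small tag alphabet)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

prime_of, gen = {}, primes()

def encode(steps):
    """Map a set of XPath steps to a product of distinct primes, so
    'every step of Q also occurs in V' becomes encode(V) % encode(Q) == 0."""
    product = 1
    for step in set(steps):
        if step not in prime_of:
            prime_of[step] = next(gen)
        product *= prime_of[step]
    return product

view = encode(["a", "b", "c"])     # cached view uses steps a, b, c
query = encode(["a", "c"])         # query uses a subset of those steps
assert view % query == 0           # divisibility: the view may answer it
```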
Clustered Chain Path Index for XML Document: Efficiently Processing Branch Queries

Branch query processing is a core operation of XML query processing. In recent years, a number of stack-based twig join algorithms have been proposed to process twig queries using tag stream indexes. However, in a tag stream index each element is labeled separately, so the similarity of identically structured elements is ignored; moreover, algorithms based on tag stream indexes perform poorly on large documents. In this paper, we propose a novel index, the Clustered Chain Path Index (CCPI for short), based on a novel labeling scheme: Clustered Chain Path labeling. The index has good properties for efficiently processing branch queries, and it has the same cardinality as the 1-index for tree-structured XML documents. Based on CCPI, we design two efficient algorithms: KMP-Match-Path for queries without branches and Related-Path-Segment-Join for queries with branches. Experimental results show that the proposed query processing algorithms based on CCPI outperform other algorithms and have good scalability.

Hongqiang Wang, Jianzhong Li, Hongzhi Wang
Region-Based Coding for Queries over Streamed XML Fragments

The recently proposed Hole-Filler model is promising for transmitting and evaluating streamed XML fragments. However, simply matching filler IDs with hole IDs and associating all the correlated fragments to complete the query path results in blocking. Taking advantage of a region-based coding scheme, this paper models the query expression as a query tree and proposes a set of techniques to optimize the query plan. It then proposes XFPR (XML Fragment Processor with Region code), which speeds up query processing by skipping the correlation of adjacent fragments. We illustrate the effectiveness of the developed techniques with a detailed set of experiments.

Xiaoyun Hui, Guoren Wang, Huan Huo, Chuan Xiao, Rui Zhou
PrefixTreeESpan: A Pattern Growth Algorithm for Mining Embedded Subtrees

Frequent embedded subtree pattern mining is an important data mining problem with broad applications. In this paper, we propose a novel embedded subtree mining algorithm, called PrefixTreeESpan (Prefix-Tree-projected Embedded-Subtree pattern), which finds a subtree pattern by growing a frequent prefix-tree. Thus, using divide and conquer, recursively mining the local length-1 frequent subtree patterns in the Prefix-Tree-Projected database yields the complete set of frequent patterns. Unlike Chopper and XSpanner [4], PrefixTreeESpan does not need a checking process. Our performance study shows that PrefixTreeESpan outperforms the Apriori-like algorithm TreeMiner [6] and the pattern-growth algorithms Chopper and XSpanner.

Lei Zou, Yansheng Lu, Huaming Zhang, Rong Hu
Evaluating Interconnection Relationship for Path-Based XML Retrieval

As one of the popular solutions for determining meaningful results, the interconnection relationship has been widely accepted by search engines. However, it suffers from heavy storage overhead and high time complexity in index construction and query evaluation. We extend the retrieval syntax to support path queries over partial structural information, propose a variant of the interconnection relationship, and present an efficient evaluation algorithm based on extended structural summaries. The experiments show that, while still promising meaningful answers, our approach offers great performance benefits when evaluating the interconnection relationship at run time.

Xiaoguang Li, Ge Yu, Daling Wang, Baoyan Song

Session 12: Multimedia and User Interface

User Models: A Contribution to Pragmatics of Web Information Systems Design

On a high level of abstraction, a Web Information System (WIS) can be described by a storyboard, which in an abstract way specifies who will be using the system, in which way, and for which goals. While the syntax and semantics of storyboarding have been well explored, its pragmatics has not. A first step towards the pragmatics of storyboards is the observation and documentation of life cases in reality. These, however, have to be complemented by user models. This paper presents an approach to capturing user models and to specifying the various facets of the actor profiles needed for them. We analyse actor profiles and present a semi-formal way to document them. We outline how these profiles can be used to specify actors, which are an important component of a storyboard.

Klaus-Dieter Schewe, Bernhard Thalheim
XUPClient – A Thin Client for Rich Internet Applications

With the help of rich web client technologies, developers are creating rich internet applications in response to end users' growing demand for richer web experiences. However, most of these technologies are fat-client based: to enable rich user interfaces, application code, whether binary or script, must be downloaded and executed on the client side. In this paper, we propose a thin-client based approach, XUPClient, a rich web client aimed at closing the gap between web and desktop user interfaces. It allows end users to interact with rich UI components otherwise found only in desktop environments, while remaining thin in terms of application logic; i.e., all application code resides on the server side, and the client only renders declarative UI definitions. XUPClient is built on top of the Extensible User Interface Protocol (XUP), a SOAP-based protocol for communicating events and incremental user interface updates on the web.

Jin Yu, Boualem Benatallah, Fabio Casati, Regis Saint-Paul
2D/3D Web Visualization on Mobile Devices

Visualization can present rich Web information intuitively and make Web search and mining more productive. Mobile computing provides the flexibility of working anytime and anywhere. It is therefore natural to combine the two techniques in intriguing applications. However, the technical limitations of mobile devices make it difficult to port well-designed visualization methods from desktop computers to mobile devices. In this paper, we present what we learned from engineering 3D Web visualization on both high-end and low-end mobile devices as the MWeb3D framework, which forms a distributed pipeline that moves intensive computation from the mobile devices to server systems. Important issues in this strategy include: (1) separating visualization from graphics rendering, (2) encoding visual presentations for transmission over bandwidth-limited wireless connections, (3) user interaction on mobile devices, and (4) highly efficient graphics rendering on the mobile devices. We show fruitful experiments on both a PDA and a mobile phone, with photos taken from both simulators and real mobile devices.

Yi Wang, Li-Zhu Zhou, Jian-Hua Feng, Lei Xie, Chun Yuan
Web Driving: An Image-Based Opportunistic Web Browser That Visualizes a Peripheral Information Space

An image-based opportunistic Web browser called "WebDriving" is described that automatically and continuously visualizes the "peripheral information space". It extracts the images from the current Web page, its link-destination pages, and other related pages and uses them to dynamically construct a 3D space. This enables the user to concurrently browse not only the images on the current Web page but also the images in the peripheral information space. The user browses these images by "driving a car through the constructed 3D world". The user thus becomes aware of other relevant Web pages while browsing the current page, which is a form of "opportunistic browsing". An experiment in which elementary school pupils used the WebDriving browser demonstrated that they could use it to effectively obtain information because it enabled them to intuitively understand a Web space, the functions of a Web browser, and Web search.

Mika Nakaoka, Taro Tezuka, Katsumi Tanaka
Blogouse: Turning the Mouse into a Copy&Blog Device

Blogs are tools that put web publication into the layman's hands. Despite its simplicity, the publication process is a cumbersome task when the content to be published already exists in desktop documents. To ease this process, we have created Blogouse, a user-friendly, editor-independent, and blog-independent publication tool which applies annotation techniques to the publication system. To attain this aim, we have extended the mouse device functionality to be ontology-aware.

Felipe M. Villoria, Sergio F. Anzuola, Oscar Díaz
Backmatter
Metadata
Title
Web Information Systems – WISE 2006
Editors
Karl Aberer
Zhiyong Peng
Elke A. Rundensteiner
Yanchun Zhang
Xuhui Li
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-48107-2
Print ISBN
978-3-540-48105-8
DOI
https://doi.org/10.1007/11912873
