
2005 | Book

Parallel and Distributed Processing and Applications

Second International Symposium, ISPA 2004, Hong Kong, China, December 13-15, 2004. Proceedings

Edited by: Jiannong Cao, Laurence T. Yang, Minyi Guo, Francis Lau

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

Welcome to the proceedings of the 2nd International Symposium on Parallel and Distributed Processing and Applications (ISPA 2004), which was held in Hong Kong, China, 13–15 December, 2004. With the advance of computer networks and hardware technology, parallel and distributed processing has become a key technology which plays an important part in determining future research and development activities in many academic and industrial branches. It provides a means to solve computationally intensive problems by improving processing speed. It is also the only viable approach to building highly reliable and inherently distributed applications. ISPA 2004 provided a forum for scientists and engineers in academia and industry to exchange and discuss their experiences, new ideas, research results, and applications about all aspects of parallel and distributed computing. There was a very large number of paper submissions (361) from 26 countries and regions, including not only Asia and the Pacific, but also Europe and North America. All submissions were reviewed by at least three program or technical committee members or external reviewers. It was extremely difficult to select the presentations for the conference because there were so many excellent and interesting submissions. In order to allocate as many papers as possible and keep the high quality of the conference, we finally decided to accept 78 regular papers and 38 short papers for oral technical presentations. We believe that all of these papers and topics not only provide novel ideas, new results, work in progress and state-of-the-art techniques in this field, but also stimulate the future research activities in the area of parallel and distributed computing with applications.

Table of Contents

Frontmatter

Keynote Speech

Present and Future Supercomputer Architectures

In the last 25 years, the field of scientific computing has undergone rapid change – we have experienced a remarkable turnover of technologies, architectures, vendors, and the usage of systems. Despite all these changes, the long-term evolution of performance seems to be steady and continuous. The acceptance of parallel systems not only for engineering applications but also for new commercial applications, especially database applications, has emphasized different criteria for market success, such as stability of the system, continuity of the manufacturer, and price/performance. Due to these factors and the consolidation in the number of vendors in the market, hierarchical systems built with components designed for the broader commercial market are currently replacing homogeneous systems at the very high end of performance. Clusters built with off-the-shelf components are also gaining more and more attention and today hold a dominant position in the Top500. In this talk we will look at some of the existing and planned high-performance computer architectures and at the interconnection schemes they use.

Jack Dongarra
Challenges in P2P Computing

Peer-to-peer (P2P) is an emerging model aiming to further utilize Internet information and resources, complementing the available client-server services. P2P has emerged as a promising paradigm for developing large-scale distributed systems due to its many unique features and its potential in future applications. P2P systems are popular because of their adaptation, self-organization, load-balancing, and high availability. However, P2P systems also present many challenges that are currently obstacles to their widespread acceptance and usage, such as efficiency, security, and performance guarantees. For example, studies based on measurements of some popular P2P systems, such as FastTrack (including KaZaA and Grokster), Gnutella, and Direct Connect, have shown that P2P traffic contributes the largest portion of Internet traffic. Even given that 95% of any two nodes are less than 7 hops away and a message time-to-live of TTL=7 is preponderantly used, the flooding-based routing algorithm generates 330 TB/month in a Gnutella network with only 50,000 nodes. In reality, there are millions of active P2P users at any given time. Our study has shown that the mechanism of a peer randomly choosing logical neighbors without any knowledge about the underlying physical topology causes a topology mismatch between the P2P logical overlay network and the physical underlying network. A large portion of the heavy P2P traffic is caused by inefficient overlay topology and blind flooding. Security and anonymity are other concerns in P2P systems. This talk will address the above issues as well as other potential applications of P2P computing and mobile P2P systems.

Lionel M. Ni
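To get a feel for why blind flooding scales so badly, the back-of-the-envelope sketch below counts the messages a single TTL-limited query can generate; the average node degree used here is an illustrative assumption, not a figure from the talk, and the count ignores duplicate suppression, so it is only a rough upper-bound style estimate.

```python
# Rough estimate of messages generated by one TTL-limited flood.
# The average degree is an illustrative assumption, not a measurement.

def flood_messages(avg_degree: int, ttl: int) -> int:
    """Count messages forwarded when every node relays a query to all
    neighbors except the one it came from, up to `ttl` hops.
    Duplicate suppression and cycles are ignored, so this overcounts."""
    total = avg_degree          # hop 1: the source contacts its neighbors
    frontier = avg_degree
    for _ in range(2, ttl + 1):
        frontier *= (avg_degree - 1)   # each relay forwards to degree-1 peers
        total += frontier
    return total

if __name__ == "__main__":
    msgs = flood_messages(avg_degree=4, ttl=7)
    print(f"One query with TTL=7 and average degree 4 -> ~{msgs} messages")
```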
Multihop Wireless Ad Hoc Networking: Current Challenges and Future Opportunities

An ad hoc network is a collection of wireless mobile nodes that form a network without existing infrastructure or centralized administration. Nodes in the network cooperate to forward packets for each other, to allow mobile nodes not within direct wireless transmission range of each other to communicate. Ad hoc networks were initially studied for military applications more than 20 years ago, and they are currently a very active area of research within academia, government, and industry. However, few real applications of ad hoc networking have yet been deployed in common usage, and the commercial potential of this technology has yet to be realized. In this talk, I will describe some of the current research challenges in ad hoc networking, and I will present what I believe are some of the future real-world applications for this promising technology.

David B. Johnson

Session 1A: Parallel Algorithms and Systems I

An Inspector-Executor Algorithm for Irregular Assignment Parallelization

A loop with irregular assignment computations contains loop-carried output data dependences that can only be detected at run-time. In this paper, a load-balanced method based on the inspector-executor model is proposed to parallelize this loop pattern. The basic idea lies in splitting the iteration space of the sequential loop into sets of conflict-free iterations that can be executed concurrently on different processors. As will be demonstrated, this method outperforms existing techniques. Irregular access patterns with different load-balancing and reusability properties are considered in the experiments.

Manuel Arenaz, Juan Touriño, Ramón Doallo
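A minimal sketch of the inspector-executor idea for an irregular assignment of the form A[f[i]] = g(i) is given below; the partitioning policy and all names are illustrative assumptions, not the authors' exact load-balanced method. The inspector groups iterations into sets in which no two iterations write the same element, and the executor may then run each set in parallel.

```python
def inspector(f, num_iters):
    """Split the iteration space into sets of conflict-free iterations:
    within a set, no two iterations write the same A[f[i]]."""
    sets, written_sets = [], []
    for i in range(num_iters):
        target = f[i]
        for s, written in zip(sets, written_sets):
            if target not in written:       # no output dependence in this set
                s.append(i)
                written.add(target)
                break
        else:                               # conflicts everywhere: open a new set
            sets.append([i])
            written_sets.append({target})
    return sets

def executor(sets, A, f, g):
    """Each conflict-free set could run concurrently; the inner loop
    stands in for a parallel region."""
    for s in sets:
        for i in s:                         # parallelizable: disjoint targets
            A[f[i]] = g(i)

# Example: irregular write pattern with repeated targets.
f = [0, 2, 0, 1, 2, 3]
A = [0] * 4
executor(inspector(f, len(f)), A, f, lambda i: i * i)
print(A)   # same result as the sequential loop: [4, 9, 16, 25]
```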
Multi-grain Parallel Processing of Data-Clustering on Programmable Graphics Hardware

This paper presents an effective scheme for clustering a huge data set using a commodity programmable graphics processing unit (GPU). Due to the GPU's application-specific architecture, one of the current research issues is how to bind the rendering pipeline with the data-clustering process. By taking advantage of the GPU's parallel processing capability, our implementation scheme is devised to exploit the multi-grain single-instruction multiple-data (SIMD) parallelism of the nearest neighbor search, which is the most computationally intensive part of the data-clustering process. The performance of our scheme is discussed in comparison with that of an implementation running entirely on the CPU. Experimental results clearly show that the parallelism of the nearest neighbor search allows our scheme to efficiently execute the data-clustering process. Although data transfer from GPU to CPU is generally costly, acceleration by the GPU significantly reduces the total execution time of data-clustering.

Hiroyuki Takizawa, Hiroaki Kobayashi
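The nearest-neighbor search that dominates the clustering workload is data-parallel by nature; the CPU-side NumPy sketch below expresses it in the same single-instruction-multiple-data style, purely to illustrate the parallel formulation rather than the authors' GPU rendering-pipeline implementation.

```python
import numpy as np

def assign_clusters(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """For each point, return the index of the nearest centroid.
    The pairwise distance computation is expressed as one dense array
    operation, the same data-parallel form a GPU fragment program exploits."""
    # (n, 1, d) - (1, k, d) -> (n, k, d); squared Euclidean distances (n, k)
    diff = points[:, None, :] - centroids[None, :, :]
    dist2 = np.einsum("nkd,nkd->nk", diff, diff)
    return dist2.argmin(axis=1)

rng = np.random.default_rng(0)
pts = rng.random((10000, 3))     # illustrative data set
cents = rng.random((16, 3))      # illustrative cluster centers
labels = assign_clusters(pts, cents)
print(labels[:10])
```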
A Parallel Reed-Solomon Decoder on the Imagine Stream Processor

The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine stream architecture can overcome the bandwidth bottleneck with its particular memory hierarchy and stream processing for computationally intensive applications. Good performance has been demonstrated on media processing and parts of the scientific computing domain. Reed-Solomon (RS) codes are powerful block codes widely used as an error correction method. RS decoding demands high memory bandwidth and intensive ALU use because of its complex and specialized processing (Galois field arithmetic) and real-time requirements. Specialized processors or DSPs are usually used to solve it, which gains high performance but lacks flexibility. This paper presents a software implementation of a parallel Reed-Solomon decoder on the Imagine platform. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that Imagine has performance comparable to the TI C64x. This work is an ongoing effort to validate that the stream architecture is efficient and to help extend its application domain.

Mei Wen, Chunyuan Zhang, Nan Wu, Haiyan Li, Li Li
Effective Nonblocking MPI-I/O in Remote I/O Operations Using a Multithreaded Mechanism

A flexible intermediate library named Stampi realizes seamless MPI operations on interconnected parallel computers. Dynamic process creation and MPI-I/O operations both inside a computer and among computers are available with it. MPI-I/O operations to a remote computer are realized by MPI-I/O processes of the Stampi library which are invoked on a remote computer using a vendor-supplied MPI-I/O library. If the vendor-supplied one is not available, a single MPI-I/O process is invoked on a remote computer, and it uses UNIX I/O functions instead of the vendor-supplied one. In nonblocking MPI-I/O functions with multiple user processes, the single MPI-I/O process carries out I/O operations required by the processes sequentially. This results in small overlap of computation by the user processes with I/O operations by the MPI-I/O process. Therefore performance of the nonblocking functions is poor with multiple user processes. To realize effective I/O operations, a Pthreads library has been implemented in the MPI-I/O mechanism, and multithreaded I/O operations have been realized. The newly implemented MPI-I/O mechanism has been evaluated on inter-connected PC clusters, and higher overlap of the computation with the I/O operations has been achieved.

Yuichi Tsujita

Session 1B: Data Mining and Management

Asynchronous Document Dissemination in Dynamic Ad Hoc Networks

This paper presents a document-oriented model for information dissemination in dynamic ad hoc networks, such as those composed of highly mobile and volatile communicating devices (e.g. laptops and PDAs). This model relies on an asynchronous, peer-to-peer propagation scheme where documents can be cached on intermediate devices and later be sent again in the network, either spontaneously or on demand.

Frédéric Guidec, Hervé Roussain
Location-Dependent Query Results Retrieval in a Multi-cell Wireless Environment

The demand for information services has grown in recent years. However, returning correct answers in a mobile environment needs more attention, because the scope of a query depends on the user's location. In this paper, we propose an extension approach to handle the situation where a mobile user misses a query result at the current time and expects to receive the next query result in the next time interval. The aim of this extension approach is to avoid redundant processing when obtaining a new query result. We show the efficiency of our proposed algorithm through several examples and evaluations.

James Jayaputera, David Taniar
An Efficient Mobile Data Mining Model

Mobile data mining involves the generation of interesting patterns from datasets collected from mobile devices. Previous work includes frequency patterns [3], group patterns [9] and parallel patterns [5]. As mobile application usage increases, the volume of data increases dramatically, leading to long processing times. This paper presents an efficient model that uses the principle of attacking the problem early in the process. The proposed model performs minor data analysis and summarization before the source data arrives at the data mining machine. By the time the source data arrives at the data mining machine, it is in the form of summary transactions, which reduces the amount of further processing required to perform data mining. Performance evaluation shows that this proposed model is significantly more efficient than the traditional model for mobile data mining.

Jen Ye Goh, David Taniar
An Integration Approach of Data Mining with Web Cache Pre-fetching

Web caching plays a very important role in improving the performance of many Web-based systems. As web cache capacity is limited, most web cache systems use a replacement algorithm to evict outdated data. Our web cache prediction method is based on the fact that many clients usually follow some kind of regular procedure to access web files, such that this regular-procedure knowledge can be mined or learned by the web cache system and files can be pre-fetched accordingly.

Yingjie Fu, Haohuan Fu, Puion Au

Session 1C: Distributed Algorithms and Systems

Towards Correct Distributed Simulation of High-Level Petri Nets with Fine-Grained Partitioning

Powerful grid and cluster computers allow efficient distributed simulation. Optimistic simulation techniques have been developed which allow for more parallelism in the local simulations than conservative methods. However, they may require costly rollbacks in simulation time due to dependencies between model parts that cause violations of global causality. Different notions of time have been proposed to detect and remedy these situations. Logical time (or Lamport time) is used in many present-day distributed simulation algorithms. However, high-level colored Petri nets may contain global activity priorities, vanishing states, and global state dependencies. Thus virtual time is not sufficient to maintain the global chronological order of events for the optimistic simulation of this model class. The paper presents a new approach that guarantees a correct ordering of global states in a distributed Petri net simulation. A priority-enhanced vector time algorithm is used to detect causal dependencies.

Michael Knoke, Felix Kühling, Armin Zimmermann, Günter Hommel
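For background, plain vector time (without the paper's priority enhancement) can be implemented in a few lines; the sketch below shows the standard update rules and a happens-before test, while the priority-based ordering of otherwise concurrent events is specific to the paper and only hinted at in the comments.

```python
class VectorClock:
    """Minimal vector time for `n` logical processes (a background sketch;
    the paper's priority-enhanced ordering adds a tie-break on global
    activity priorities for events that vector time leaves unordered)."""

    def __init__(self, n: int, pid: int):
        self.clock = [0] * n
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        self.local_event()
        return list(self.clock)        # timestamp piggybacked on the message

    def receive(self, other):
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.local_event()

def happened_before(a, b) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return all(x <= y for x, y in zip(a, b)) and a != b

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
t = p0.send()                          # p0 sends a message
p1.receive(t)                          # p1 receives it
print(happened_before(t, p1.clock))    # True: the send precedes the receive
```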
M-Guard: A New Distributed Deadlock Detection Algorithm Based on Mobile Agent Technology

Deadlock detection and resolution are fundamental issues in distributed systems. Although many algorithms have been proposed, these traditional message-passing-based solutions can hardly meet the challenges of the prevailing Internet computing and mobile computing. In this paper, we present a novel algorithm, namely the M-Guard, for deadlock detection and resolution in distributed systems based on mobile agent technology. The proposed algorithm lies at the intersection of the centralized and the distributed types of algorithms. An agent is employed in our algorithm as a guard with a dual role: while roaming the system according to a specified itinerary algorithm, the agent collects resource request/allocation information for detecting deadlock cycles, and also propagates the collected network and resource information among the nodes. Consequently, accurate and timely detection of deadlocks can be made without any network node becoming a performance bottleneck. Preliminary simulation results show that, compared with several other algorithms, the M-Guard algorithm achieves both shorter deadlock persistence time and a smaller phantom deadlock ratio. Moreover, the overall network communication overhead is decreased as well.

Jingyang Zhou, Xiaolin Chen, Han Dai, Jiannong Cao, Daoxu Chen
Meta-based Distributed Computing Framework

The explosive growth of distributed technologies requires frameworks to be adaptable. This paper uses design patterns as building blocks to develop an adaptive pattern-oriented framework for distributed computing applications. We describe our novel approach of combining a meta-architecture with a pattern-oriented framework, resulting in an adaptable framework which provides a mechanism to facilitate system evolution. We show how the meta-based framework can be used effectively to enable component integration and to separate system functionality from application functionality. The framework is based on modelling layers of the architecture to meet the challenges of customization, reusability, and extensibility in distributed computing technology. We also highlight how the meta-based framework provides significant adaptability for system evolution through a simple example using an HTTP server in conjunction with a thread pool.

Andy S. Y. Lai, A. J. Beaumont
Locality Optimizations for Jacobi Iteration on Distributed Parallel Systems

In this paper, we propose an inter-nest cache reuse optimization method for Jacobi codes. This method is easy to apply, yet effective, in that it enhances the cache locality of Jacobi codes while preserving their coarse-grain parallelism. We compare our method to two previous locality enhancement techniques that can be used for Jacobi codes: time skewing and new tiling. We quantitatively calculate the main factors contributing to the runtime of different Jacobi codes. We also perform experiments on a PC cluster to verify our analysis. The results show that our method performs worse than time skewing and new tiling on a uniprocessor, but performs better on distributed parallel systems.

Yonggang Che, Zhenghua Wang, Xiaomei Li, Laurence T. Yang
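For reference, the untransformed 2-D Jacobi sweep that time skewing, new tiling, and the proposed inter-nest reuse all try to make cache-friendly looks like the sketch below; this is only the baseline stencil, not the paper's optimized code.

```python
import numpy as np

def jacobi(u: np.ndarray, iters: int) -> np.ndarray:
    """Reference 2-D Jacobi iteration: each sweep averages the four
    neighbors into a fresh array. Locality optimizations such as time
    skewing or inter-nest reuse reorder these sweeps so that grid data
    is reused while it is still resident in cache."""
    for _ in range(iters):
        new = u.copy()                          # keeps boundary values fixed
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        u = new
    return u

grid = np.zeros((64, 64))
grid[0, :] = 1.0                                # boundary condition on one edge
print(jacobi(grid, 100)[32, 32])
```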

Session 2A: Fault Tolerance Protocols and Systems

Fault-Tolerant Cycle Embedding in the WK-Recursive Network

Recently, the WK-recursive network has received much attention due to its many favorable properties such as a high degree of scalability. By K(d, t), we denote the WK-recursive network of level t, each of whose basic modules is a d-node complete graph, where d > 1 and t ≥ 1. In this paper, we construct fault-free Hamiltonian cycles in K(d, t) with at most d – 3 faulty nodes, where d ≥ 4. Since the connectivity of K(d, t) is d – 1, the result is optimal.

Jung-Sheng Fu
RAIDb: Redundant Array of Inexpensive Databases

In this paper, we introduce the concept of Redundant Array of Inexpensive Databases (RAIDb). RAIDb is to databases what RAID is to disks. RAIDb aims at providing better performance and fault tolerance than a single database, at low cost, by combining multiple database instances into an array of databases. Like RAID, we define and compare different RAIDb levels that provide various cost/performance/fault tolerance tradeoffs.

We present a Java implementation of RAIDb called Clustered JDBC (C-JDBC). C-JDBC achieves both database performance scalability and high availability at the middleware level without changing existing applications or database engines.

Emmanuel Cecchet
A Fault-Tolerant Multi-agent Development Framework

FATMAD is a fault-tolerant multi-agent development framework that is built on top of a mobile agent platform (Jade). FATMAD aims to satisfy the needs of two communities of users: Jade application developers and fault-tolerant protocol developers. Application-level fault tolerance incurs significant development-time cost. FATMAD is based on a generic fault-tolerant protocol whose refinements lead to a broad range of checkpoint and recovery protocols to be used in supporting user applications, thus significantly reducing the development time of fault-tolerant agent applications. This paper introduces the design of FATMAD and explains how fault-tolerant protocol developers can extend FATMAD with additional checkpoint and recovery protocols. The key concepts and features are illustrated through the staggered checkpoint protocol.

Lin Wang, Hon F. Li, Dhrubajyoti Goswami, Zunce Wei
A Fault Tolerance Protocol for Uploads: Design and Evaluation

This paper investigates fault tolerance issues in Bistro, a wide area upload architecture. In Bistro, clients first upload their data to intermediaries, known as bistros. A destination server then pulls data from bistros as needed. However, during the server pull process, bistros can be unavailable due to failures, or they can be malicious, i.e., they might intentionally corrupt data. This degrades system performance since the destination server may need to ask for retransmissions. As a result, a fault tolerance protocol is needed within the Bistro architecture. Thus, in this paper, we develop such a protocol which employs erasure codes in order to improve the reliability of the data uploading process. We develop analytical models to study reliability and performance characteristics of this protocol, and we derive a cost function to study the tradeoff between reliability and performance in this context. We also present numerical results to illustrate this tradeoff.

L. Cheung, C. -F. Chou, L. Golubchik, Y. Yang
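The protocol's reliability comes from erasure coding; the simplest member of that family is single-parity XOR coding, sketched below purely as an illustration (the paper's code parameters and recovery procedure are more general).

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks: list) -> bytes:
    """Produce one parity block; any single lost data block can be rebuilt."""
    parity = bytes(len(data_blocks[0]))
    for blk in data_blocks:
        parity = xor_blocks(parity, blk)
    return parity

def recover(surviving: list, parity: bytes) -> bytes:
    """Rebuild the single missing block from the survivors and the parity."""
    missing = parity
    for blk in surviving:
        missing = xor_blocks(missing, blk)
    return missing

# Illustrative upload split into three equal-size blocks plus one parity block.
blocks = [b"bistro01", b"bistro02", b"bistro03"]
p = encode(blocks)
print(recover([blocks[0], blocks[2]], p))   # reconstructs b"bistro02"
```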
Topological Adaptability for the Distributed Token Circulation Paradigm in Faulty Environment

In this paper, we combine random walks and self-stabilization to design a single token circulation algorithm. Random walks have proved their efficiency in dynamic networks and are perfectly adapted to frequent network topological changes. Self-stabilization is the most general technique to design an algorithm that tolerates transient failures. Taking into account that the token circulates continually according to a random walk scheme, designing a self-stabilizing algorithm implies solving two situations: (1) no token in the system and (2) several tokens in the system. The former is generally solved by a time-out mechanism: upon timeout, a new token is created. In this paper, we focus on this problem. Simply choosing a sufficiently long time-out period is not possible in our case: the system could never stabilize. Indeed, a random-walk-based token eventually covers the network, but only the expected time to cover the network can be captured. Therefore, we introduce a mechanism, the "reloaded wave propagation", to prevent unnecessary token creation and preserve self-stabilization properties.

Thibault Bernard, Alain Bui, Olivier Flauzac

Session 2B: Sensor Networks and Protocols

Adaptive Data Dissemination in Wireless Sensor Networks

Recent years have witnessed growing research interest in wireless sensor networks. In the literature, three data storage strategies, i.e., local, data-centric, and index-centric, have been proposed to answer one-shot queries originated from inside the network. Through careful analysis, we show in this paper that each of the three strategies respectively achieves the best performance in a certain range of query-rate to update-rate (Q2U) ratios. We therefore propose several adaptive schemes which periodically adjust the data storage and dissemination strategy in accordance with the Q2U ratio observed in the network. Experimental results demonstrate that in dynamic environments the proposed adaptive schemes substantially outperform the three basic strategies.

Jian Xu, Jianliang Xu, Shanping Li, Qing Gao, Gang Peng
Continuous Residual Energy Monitoring in Wireless Sensor Networks

A crucial issue in the management of sensor networks is the continuous monitoring of the residual energy level of the sensors in the network. In this paper, we propose a hierarchical approach to construct a continuous energy map of a sensor network. Our method consists of a topology discovery and clustering phase, followed by an aggregation phase in which the collected energy information is abstracted and merged into energy contours, where nodes with similar energy levels are grouped into the same region. The monitoring tree is restructured periodically to distribute the energy cost fairly among all nodes, which increases the lifetime of the sensor network. Simulation results indicate that our method is able to generate accurate energy maps with low energy cost.

Song Han, Edward Chan
Design and Analysis of a k-Connected Topology Control Algorithm for Ad Hoc Networks

Topology control is an effective approach to reduce the energy consumption in wireless ad hoc networks. In this paper, we propose a k-connected (KC) energy-saving topology control algorithm, which is fully distributed, asynchronous and scalable with little overhead. We prove that the KC algorithm has several valuable properties. First, it is optimal for local energy-saving topology control; second, it preserves the network connectivity (even k-connectivity if the neighbor topology is k-connected) and dramatically reduces the energy consumption; third, each node degree obtained by the KC algorithm cannot exceed 6k. Performance simulation shows the effectiveness of our proposed algorithm.

Lei Zhang, Xuehui Wang, Wenhua Dou
On Using Temporal Consistency for Parallel Execution of Real-Time Queries in Wireless Sensor Systems

In this paper, we study the important issues for execution of real-time queries in a wireless sensor system in which sensor nodes are distributed to monitor the events that have occurred in the environment. Three important objectives in processing the real-time queries are: (1) to minimize the number of missed deadlines, (2) to minimize the processing costs, especially in data communication; and (3) to provide temporally consistent sensor data values for query execution. To reduce the data transmission cost and delay time in gathering the right versions of data items for a query, we propose the Parallel Data Shipping with Priority Transmission (PAST) scheme to determine where and how to execute a real-time query. To meet the deadlines of the queries, a deadline-driven priority policy is adopted to schedule the transmission of sensor data versions to the coordinator node.

Kam-Yiu Lam, Henry C. W. Pang, Sang H. Son, BiYu Liang

Session 2C: Cluster Systems and Applications

Cluster-Based Parallel Simulation for Large Scale Molecular Dynamics in Microscale Thermophysics

A cluster-based spatial decomposition algorithm for solving large-scale Molecular Dynamics simulations in thermophysics is proposed. Firstly, three kinds of domain division strategies are provided and their efficiency and scalability are analyzed. Secondly, a method called FLNP (Fast Location of Neighboring Particles) to accelerate the location of neighboring particles is proposed, which greatly reduces the cost of calculating and communicating interactions. Additionally, a new memory management technique called AMM (Adaptive Memory Management) is applied to meet the large memory requirement. The parallel algorithm based on the above techniques was implemented on a cluster of SMPs, tested on a system of 6,912,000 particles, and achieved an efficiency of 77.0%.

Jiwu Shu, Bing Wang, Weimin Zheng
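Fast location of neighboring particles is classically done with cell lists, which is the general idea FLNP builds on; the serial sketch below illustrates that general technique under assumed parameters, not the paper's FLNP data structure or its parallel decomposition.

```python
from collections import defaultdict
from itertools import product

def build_cells(positions, box, cutoff):
    """Bin particles into cubic cells with edge length >= cutoff, so that
    every interaction partner of a particle lies in its own or an adjacent cell."""
    ncell = max(1, int(box // cutoff))
    size = box / ncell
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x // size), int(y // size), int(z // size))].append(idx)
    return cells, size

def neighbors(i, positions, cells, size, cutoff):
    """Indices of particles within `cutoff` of particle i, scanning only the
    27 surrounding cells instead of every particle in the system."""
    cx, cy, cz = (int(c // size) for c in positions[i])
    found = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
            if j != i:
                d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
                if d2 <= cutoff ** 2:
                    found.append(j)
    return found

# Illustrative three-particle system in a 5.0-wide box with cutoff 1.0.
pos = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (4.0, 4.0, 4.0)]
cells, size = build_cells(pos, box=5.0, cutoff=1.0)
print(neighbors(0, pos, cells, size, cutoff=1.0))   # [1]
```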
Parallel Checkpoint/Recovery on Cluster of IA-64 Computers

We design and implement a high-availability parallel runtime system, ChaRM64, a Checkpoint-based Rollback Recovery and Migration system for parallel programs running on a cluster of IA-64 computers. First, we discuss our solution of a user-level, single-process checkpoint/recovery library running on IA-64 systems. Based on this library, ChaRM64 is realized, which implements a user-transparent, coordinated checkpointing and rollback recovery (CRR) mechanism, quasi-asynchronous migration and a dynamic reconfiguration function. Owing to the above techniques and efficient error detection, ChaRM64 can handle cluster node crashes and hardware transient faults in an IA-64 cluster. Currently, ChaRM64 for PVM has been implemented in Linux and the MPI version is under construction. As far as we know, there are few similar projects for the IA-64 architecture.

Youhui Zhang, Dongsheng Wang, Weimin Zheng
Highly Reliable Linux HPC Clusters: Self-Awareness Approach

Current solutions for fault tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by transient failures and require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one such effort making inroads here. This paper discusses detailed solutions for the high-availability and serviceability enhancement of clusters by HA-OSCAR via multi-head-node failover and a service-level fault tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An HA-OSCAR availability improvement analysis was also conducted with various sensitivity factors. Finally, the paper details the system layering strategy, dependability modeling, and analysis of an actual experimental system using a Petri-net-based model, the Stochastic Reward Net (SRN).

Chokchai Leangsuksun, Tong Liu, Yudan Liu, Stephen L. Scott, Richard Libby, Ibrahim Haddad
An Enhanced Message Exchange Mechanism in Cluster-Based Mobile Ad Hoc Networks

In mobile ad hoc networks (MANETs), networks can be partitioned into clusters. Clustering algorithms are localized algorithms that have the property of creating an upper-bounded number of clusterheads in any network, even in the worst case. Generally, clusterheads and selected gateways form a connected dominating set (CDS) of the network. This CDS forms the backbone of the network and can be used for routing (broadcasting/multicasting/unicasting). As clusterheads need to determine the selected gateways to connect their adjacent clusterheads within the coverage set, they periodically collect 2-hop and 3-hop clusterhead information. This information can be gathered when they hear their non-clusterhead neighbors sending out messages that contain neighboring clusterhead information. This extra maintenance cost can be reduced when an enhanced mechanism is applied. In this paper, an enhanced mechanism is proposed to reduce the total length of the messages when non-clusterhead nodes exchange their 1-hop and 2-hop neighboring clusterhead information. Simulation shows that over 50% of the message overhead can be saved for dense networks.

Wei Lou, Jie Wu

Session 3A: Parallel Algorithms and Systems II

Algorithmic-Parameter Optimization of a Parallelized Split-Step Fourier Transform Using a Modified BSP Cost Model

Adaptive algorithms are increasingly acknowledged in leading parallel and distributed research. In the past, algorithms were manually tuned to execute efficiently on a particular architecture. However, interest has shifted towards algorithms that can adapt themselves to the computational resources. A cost model representing the behavior of the system (i.e. system parameters) and the algorithm (i.e. algorithm parameters) plays an important role in adaptive parallel algorithms. In this paper, we contribute a computational model based on Bulk Synchronous Parallel processing that predicts the performance of a parallelized split-step Fourier transform. We extracted the system parameters of a cluster (upon which our algorithm was executed) and showed the use of an algorithmic parameter in the model that exhibits optimal behavior. Our model can thus be used for the purpose of self-adaptation.

Elankovan Sundararajan, Malin Premaratne, Shanika Karunasekera, Aaron Harwood
Parallel Volume Rendering with Early Ray Termination for Visualizing Large-Scale Datasets

This paper presents an efficient parallel algorithm for volume rendering of large-scale datasets. Our algorithm focuses on an optimization technique, namely early ray termination (ERT), which aims to reduce the amount of computation by avoiding enumeration of invisible voxels in the visualized volume. The novelty of the algorithm is that it incorporates this technique into a distributed volume rendering system with global reduction of the computational amount. The algorithm is also capable of statically balancing the processor workloads. The experimental results show that our algorithm with global ERT achieves a maximum further reduction of 33% compared to an earlier algorithm with local ERT. As a result, our load-balanced algorithm reduces the execution time to at least 66%, not only for dense objects but also for transparent objects.

Manabu Matsui, Fumihiko Ino, Kenichi Hagihara
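Early ray termination itself is a one-line test inside the front-to-back compositing loop, as the serial sketch below shows; the paper's contribution, the globally coordinated parallel version, is not reproduced here.

```python
import numpy as np

def render_ray(samples, opacities, threshold=0.99):
    """Front-to-back compositing along one ray with early ray termination:
    stop as soon as accumulated opacity makes the remaining voxels invisible."""
    color, alpha = 0.0, 0.0
    for c, a in zip(samples, opacities):
        color += (1.0 - alpha) * a * c
        alpha += (1.0 - alpha) * a
        if alpha >= threshold:          # ERT: further voxels cannot be seen
            break
    return color, alpha

# Illustrative random voxel values along a single ray.
rng = np.random.default_rng(1)
c, a = render_ray(rng.random(512), rng.random(512) * 0.3)
print(round(c, 3), round(a, 3))
```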
A Scalable Low Discrepancy Point Generator for Parallel Computing

The Monte Carlo (MC) method is a simple but effective way to perform simulations involving complicated or multivariate functions. The Quasi-Monte Carlo (QMC) method is similar but replaces independent and identically distributed (i.i.d.) random points by low discrepancy points. Low discrepancy points are regularly distributed points that may be deterministic or randomized. The digital net is a kind of low discrepancy point set that is generated by number theoretical methods. A software library for low discrepancy point generation has been developed. It is thread-safe and supports MPI for parallel computation. A numerical example from physics is shown.

Kwong-Ip Liu, Fred J. Hickernell
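A one-dimensional low-discrepancy sequence can be generated with the classical radical-inverse (van der Corput) construction, sketched below as a representative of the digital-method family the abstract mentions; the thread-safety and MPI support of the actual library are not shown.

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """Radical inverse of n in the given base: the n-th point of a
    one-dimensional low-discrepancy sequence in [0, 1)."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        x += digit / denom
    return x

# A 2-D Halton point set uses coprime bases, one per dimension.
points = [(van_der_corput(i, 2), van_der_corput(i, 3)) for i in range(1, 9)]
for p in points:
    print(f"({p[0]:.3f}, {p[1]:.3f})")
```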
Generalized Trellis Stereo Matching with Systolic Array

We present here a real-time stereo matching chip which is based on a general trellis form with vergent optical axes. The architecture can deal with a general axis angle of the cameras, with better resolution in a given space. For a pair of images with M × N pixels, only O(MN) time is required. The design is highly scalable and fully exploits the concurrent and configurable nature of the algorithm. We implement the stereo chip on a Xilinx FPGA with 208 PEs (processing elements) that can obtain a disparity range of 208 levels. It can provide real-time stereo matching for mega-pixel images.

Hong Jeong, Sungchan Park
Optimal Processor Mapping Scheme for Efficient Communication of Data Realignment

In this paper, we present an Optimal Processor Mapping (OPM) scheme to minimize data transmission cost for general BLOCK-CYCLIC data realignment. We examine a size-oriented greedy matching method and the maximum bipartite matching theory to explore logical processor sequences. Based on these matching policies, the realigned sequences are used to perform data realignment in the destination phase. A significant improvement of our approach is that OPM achieves a high ratio of data remaining in local space, leading to minimum inter-processor communication. The OPM scheme can handle array realignment with arbitrary BLOCK-CYCLIC type and multidimensional arrays. Theoretical analysis and experimental results show that our technique provides considerable improvement for dynamic data realignment.

Ching-Hsien Hsu, Kun-Ming Yu, Chi-Hsiu Chen, Chang Wu Yu, Chiu Kuo Lian

Session 3B: Grid Applications and Systems

MCCF: A Distributed Grid Job Workflow Execution Framework

With the explosion of scientific data, distributed scientific applications present great challenges to existing job workflow execution models over the Grid. Based on the idea of having executable codes as part of Grid resources, a Mobile Code Collaboration Framework (MCCF) utilizing light-weight mobile agents and dynamic services for distributed job workflow execution is proposed in this paper. Instead of existing light-weight mobile agents, an agent core (AC), which is an XML file specifying the functional descriptions of sub-jobs, is used to adapt job workflow execution to the dynamic characteristics of the Grid and to reduce security risks. In addition, a dynamic service mechanism is also introduced to facilitate multidisciplinary scientific cooperation and application integration over the Grid. As a proof of concept, a prototype of MCCF has been implemented.

Yuhong Feng, Wentong Cai
Gamelet: A Mobile Service Component for Building Multi-server Distributed Virtual Environment on Grid

A DVE system provides a computer-generated virtual world where individuals located at different places could interact with each other. In this paper, we present the design of a grid-enabled service oriented framework for facilitating the building of DVE systems on Grid. A service component named “gamelet” is proposed. Each gamelet is characterized by its load awareness, high mobility, and embedded synchronization. Based on gamelet, we show how to re-design the existing monopolistic model of a DVE system into an open and service-oriented system that can fit into current Grid/OGSA framework. We also demonstrate an adaptive gamelet load-balancing (AGL) algorithm that helps the DVE system achieve better performance. We evaluate the performance through a multiplayer online game prototype implemented on Globus Toolkit. Results show that our approach can achieve faster response time and higher throughput.

Tianqi Wang, Cho-Li Wang, Francis Lau
The Application of Grid Computing to Real-Time Functional MRI Analysis

The analysis of brain imaging data such as functional MRI (fMRI) data often requires considerable computing resources, which in most cases are not readily available in many medical imaging facilities. This lack of computing power makes it difficult for researchers and medical practitioners alike to perform on-site analysis of the generated data. This paper proposes and demonstrates the use of Grid computing technology to provide medical imaging facilities with the capability of analyzing functional MRI data in real time with results available within seconds after data acquisition. Using PC clusters as analysis servers, and a software package that includes fMRI analysis tools, data transfer routines, and an easy-to-use graphical user interface, we are able to achieve fully real-time performance with a total processing time of 1.089 s per image volume (64 x 64 x 30 in size), much less than the per volume acquisition time set to 3.0 s. We also study the feasibility of using XML-based computational web services, and show how such web services can improve accessibility and interoperability while still making real-time analysis possible.

E. Bagarinao, L. Sarmenta, Y. Tanaka, K. Matsuo, T. Nakai
Building and Accessing Grid Services

A computation grid, which hosts services shared by users, is formed. The grid system is responsible for deciding which services need to be replicated for efficiency reasons. A dynamic XML document is an XML document with embedded Web service calls. This paper proposes an approach in which dynamic XML documents are used to specify the Grid services that users want to access.

Xinfeng Ye
DRPS: A Simple Model for Locating the Tightest Link

The tightest link of a network path is the link where the end-to-end available bandwidth is limited. We propose a new and simple probe model, called Dual Rate Periodic Streams (DRPS), for finding the location of the tightest link. A DRPS probe is a periodic stream with two rates. Initially, it goes through the path at a comparatively high rate. When it arrives at a particular link, the probe shifts its rate to a lower level and keeps that rate. If proper rates are set for the probe, we can control whether the probe is congested or not by adjusting the shift time. When the point of rate shift is in front of the tightest link, the probe can go through the path without congestion; otherwise congestion occurs. Thus, we can find the location of the tightest link by congestion detection at the receiver.

Dalu Zhang, Weili Huang, Chen Lin

Session 3C: Peer-to-Peer and Ad-Hoc Networking

A Congestion-Aware Search Protocol for Unstructured Peer-to-Peer Networks

Peer-to-Peer (P2P) file sharing is the hottest, fastest growing application on the Internet. When designing Gnutella-like applications, the most important consideration is the scalability problem. Recently, different search protocols have been proposed to remedy the problems of Gnutella's flooding. However, congestion due to large query loads from users and peer heterogeneity definitely impacts the performance of search protocols, and this consideration has received little attention from the research community. In this paper, we propose a congestion-aware search protocol for unstructured P2P networks. The aim of our protocol is to integrate congestion control and object discovery functionality so that it can achieve good performance under congested networks and flash crowds. The simulation results show that our protocol can largely reduce hit delay while maintaining a high hit rate, and that congestion problems such as query loss and system overloading can be effectively alleviated.

Kin Wah Kwong, Danny H. K. Tsang
Honeycomb: A Peer-to-Peer Substrate for On-Demand Media Streaming Service

Peer-to-peer media streaming services have gained tremendous momentum in recent years. However, a number of challenges in peer-to-peer media streaming have not been addressed. In this paper, we propose a peer-to-peer substrate for supplying on-demand media streaming service, called Honeycomb, which mainly addresses the key technical issue of constructing a locality-aware P2P overlay network with high scalability and manageability. Honeycomb can fully utilize the locality of the underlying physical network so that the bandwidth consumed for overlay maintenance can be effectively saved and the QoS requirements for delivering media content can be easily satisfied.

Dafu Deng, Hai Jin, Chao Zhang, Hao Chen, Xiaofei Liao
An Improved Distributed Algorithm for Connected Dominating Sets in Wireless Ad Hoc Networks

The idea of virtual backbone routing has been proposed for efficient routing among a set of mobile nodes in wireless ad hoc networks. Virtual backbone routing can reduce communication overhead and speed up the routing process compared with many existing on-demand routing protocols for route detection. In many studies, a Minimum Connected Dominating Set (MCDS) is used to approximate virtual backbones in a unit-disk graph. However, finding an MCDS is an NP-hard problem. We propose a distributed, 3-phase protocol for calculating the CDS in this paper. Our new protocol largely reduces the number of nodes in the CDS compared with Wu and Li's method, while the message and time complexities of our approach remain almost the same as those of Wu and Li's method. We conduct extensive simulations and show that our protocol consistently outperforms Wu and Li's method. The correctness of our protocol is proved through theoretical analysis.

Hui Liu, Yi Pan, Jiannong Cao
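For context, Wu and Li's marking rule, the baseline this protocol improves on, is simple to state: a node joins the CDS if it has two neighbors that are not directly connected. The sketch below is a centralized rendition of that rule for illustration; both the original and the proposed protocols are distributed, and the pruning rules that shrink the set further are omitted.

```python
from itertools import combinations

def wu_li_cds(adj):
    """Wu and Li's marking process: mark v if two of its neighbors are not
    adjacent to each other. On a connected graph the marked nodes form a
    connected dominating set (before any pruning)."""
    cds = set()
    for v, nbrs in adj.items():
        if any(u not in adj[w] for u, w in combinations(nbrs, 2)):
            cds.add(v)
    return cds

# A small unit-disk-like graph given as an adjacency map (illustrative).
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
print(wu_li_cds(adj))   # {'b', 'c'}: the backbone connecting both ends
```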
A New Distributed Approximation Algorithm for Constructing Minimum Connected Dominating Set in Wireless Ad Hoc Networks

In this paper, we present a new distributed approximation algorithm that constructs a minimum connected dominating set (MCDS) for wireless ad hoc networks based on a maximal independent set (MIS). Our algorithm, which is fully localized, has a constant approximation ratio, and O(n) time and O(n) message complexity. In this algorithm each node only requires knowledge of its one-hop neighbors, and there is only one shortest path connecting two dominators that are at most three hops away. Compared with other MCDS approximation algorithms, our algorithm shows better efficiency and performance.

Bo Gao, Huiye Ma, Yuhang Yang
An Adaptive Routing Strategy Based on Dynamic Cache in Mobile Ad Hoc Networks

The dynamic changes of topology caused by the movement of nodes make routing one of the key problems in mobile ad hoc networks (MANETs). How to optimize routing has thus become a hot and difficult topic, and optimizing the routing cache is one of the key techniques. In this paper, we propose an adaptive dynamic cache routing (DCR) strategy based on DSR (Dynamic Source Routing), which can evaluate the link expiration time rapidly. The experimental results show that DCR achieves considerable improvement in control packets, packet delivery ratio, packet drops and end-to-end average delay in MANETs.

YueQuan Chen, XiaoFeng Guo, QingKai Zeng, Guihai Chen

Session 4A: Grid Scheduling and Algorithms I

On the Job Distribution in Random Brokering for Computational Grids

This paper analyses the way jobs are distributed in a computational grid environment where the brokering is done in such a way that each Computing Element has a probability of being chosen proportional to its number of CPUs. We give the asymptotic behaviour for several metrics (queue sizes, slowdown, etc.), or, in some cases, an approximation of this behaviour. We study both the unsaturated case and the saturated case, under several stochastic distributions.

Vandy Berten, Joël Goossens
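The brokering policy analyzed in the paper is simply random selection with probability proportional to CPU count; a small sketch with invented Computing Element names:

```python
import random

def pick_computing_element(cpu_counts: dict) -> str:
    """Choose a Computing Element with probability proportional to its
    number of CPUs, the random brokering policy whose resulting job
    distribution the paper analyzes."""
    names = list(cpu_counts)
    weights = [cpu_counts[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

# Illustrative grid with three Computing Elements of different sizes.
ces = {"ce-small": 8, "ce-medium": 32, "ce-large": 128}
tally = {name: 0 for name in ces}
for _ in range(10000):
    tally[pick_computing_element(ces)] += 1
print(tally)   # roughly proportional to 8 : 32 : 128
```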
Dividing Grid Service Discovery into 2-Stage Matchmaking

In the wide-area Grid environment, which consists of a huge number of stateful Grid services, service discovery is a key and challenging issue. The user's requirements on Grid services include not only function-related requirements but also QoS- or service-state-related requirements. Based on an analysis of the requirements of service discovery, this paper divides the service matching process into two stages, service type matching and instance matching, and proposes a Grid Service Discovery Model based on 2-stage matching which enables more effective service discovery. In the model, the VO is utilized as the managerial unit for grid services and a two-level publication architecture is adopted. Initial simulation results show that the model can effectively aggregate service information and avoid the workload caused by frequent dynamic updating.

Ye Zhu, Junzhou Luo, Teng Ma
Performance Evaluation of a Grid Computing Architecture Using Realtime Network Monitoring

This paper integrates the concepts of realtime network monitoring and visualization into a grid computing architecture on the Internet. We develop a Realtime Network Monitor (RNM) that performs realtime network monitoring in order to improve the performance of the grid computing framework. Through network monitoring, it is also found that network traffic affects the processing performance of large-scale applications.

Young-Sik Jeong, Cheng-Zhong Xu
Quartet-Based Phylogenetic Inference: A Grid Approach

The accuracy of the quartet-puzzling method, which is widely used in molecular phylogenetic analysis, depends heavily on the number of intermediate trees searched and the order of the molecular sequences selected to generate these trees. Because the quality of intermediate trees cannot be guaranteed, the consensus results can easily be trapped in local optima. In this paper we present a new approach to guide intermediate tree selection. Our experimental results show that the accuracy of reconstructed trees can be improved significantly. Using our method, the task can easily be partitioned into independent subtasks of different sizes. Therefore, it can be effectively implemented and run in the heterogeneous and dynamic environment of a computational grid.

Chen Wang, Bing Bing Zhou, Albert Y. Zomaya
Scheduling BoT Applications in Grids Using a Slave Oriented Adaptive Algorithm

Efficient scheduling of Bag-of-Tasks (BoT) applications in a computational grid environment reveals several challenges due to its high heterogeneity, dynamic behavior, and space-shared utilization. Currently, most of the scheduling algorithms proposed in the literature use a master-oriented approach, in which the master alone is responsible for choosing the best task size to send to each slave. We present in this paper a different approach whose main originality is to be slave-oriented, i.e., each slave locally determines, from a set of initial runs, which workload size is best adapted to its capacities and notifies the master of it. Finally, we show some measurements comparing our algorithm with three other well-known scheduling algorithms using the SimGrid toolkit.

Tiago Ferreto, César De Rose, Caio Northfleet

Session 4B: Data Replication and Caching

A Clustering-Based Data Replication Algorithm in Mobile Ad Hoc Networks for Improving Data Availability

In Mobile Ad Hoc Networks (MANETs), network partitioning occurs when nodes move freely and cause frequent disconnections. Network partitioning is a wide-scale topology change that can cause sudden and severe disruptions to ongoing data access, and consequently data availability is decreased. A new distributed clustering algorithm is presented in this paper for dynamically organizing mobile nodes into clusters in which the probability of path availability can be bounded. Based on this clustering algorithm, a data replication policy is proposed to improve data availability during network partitioning. Theoretical analysis shows that our algorithm has acceptable complexity. Simulation results show that the clusters created by our clustering algorithm have desirable properties and that our algorithms improve data availability effectively in MANET environments.

Jing Zheng, Jinshu Su, Xicheng Lu
CACHERP: A Novel Dynamic Cache Size Tuning Model Working with Relative Object Popularity for Fast Web Information Retrieval

The novel CACHERP model for dynamic cache size tuning leverages relative object popularity as its sole parameter. It is capable of maintaining the given hit ratio on the fly by deriving the popularity ratio from currently collected statistics. Being statistical, the accuracy of the CACHERP operation is independent of changes in the Internet traffic pattern, which can switch suddenly. By adaptively maintaining the given hit ratio through dynamic cache size auto-tuning, the model effectively reduces the end-to-end information retrieval roundtrip time (RTT).

Richard S. L. Wu, Allan K. Y. Wong, Tharam S. Dillon
Implementation of a New Cache and Schedule Scheme for Distributed VOD Servers

A VOD server's caching and scheduling performance determines its service efficiency. This paper describes a new cache model and content replacement strategy, based on the Zipf-like law and the characteristics of media stream service, which can reduce the disk I/O ratio by 6.22%. A performance analytical model for disk load scheduling was constructed, based on stochastic processes and queuing theory, and a new disk load strategy suitable for VOD systems was also formulated. This strategy reduces the disk block time by 3.71% on average. This paper also describes a content schedule scheme which was designed by constructing, analyzing and simplifying the SPN model deduced from MSMQ theory. This scheme can guarantee quality of service (QoS) and distribute program content automatically. An experiment was conducted, and the results showed that VOD servers embedded with the new cache and using the new schedule strategy could reduce the average response time to user requests by 7% to 19%.

Han Luo, Ji-wu Shu
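The Zipf-like law mentioned above says the i-th most popular program is requested with probability roughly proportional to 1/i^alpha; the sketch below shows how such a popularity profile translates into the hit ratio of a cache holding the most popular items (the parameters are illustrative, not the paper's measured workload).

```python
def zipf_hit_ratio(num_items: int, cache_size: int, alpha: float = 0.8) -> float:
    """Fraction of requests served from a cache holding the `cache_size`
    most popular items, when request probabilities follow a Zipf-like
    law p(i) proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, num_items + 1)]
    total = sum(weights)
    return sum(weights[:cache_size]) / total

# Illustrative catalog of 1000 programs and three candidate cache sizes.
for c in (50, 100, 200):
    print(f"cache of {c} items -> hit ratio {zipf_hit_ratio(1000, c):.2f}")
```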

Session 4C: Software Engineering and Testing

UML Based Statistical Testing Acceleration of Distributed Safety-Critical Software

It is necessary to assess the reliability of distributed safety-critical systems to a high degree of confidence before they are deployed in the field. However, distributed safety-critical software systems often include some rarely executed critical functions that are often inadequately tested in statistical-testing-based reliability estimation. This paper presents a method that can accelerate statistical testing of distributed safety-critical software. The method starts with the derivation of a scenario usage diagram model (SUD) from UML diagrams annotated with usage-related attributes and reliability attributes. Then a statistical testing acceleration method based on importance sampling is presented. When both the critical scenarios and the entire software are adequately tested, the method can still compute an unbiased software reliability estimate from the test results with far fewer test cases. Thus, the statistical testing cost of distributed safety-critical software can be reduced effectively.

Jiong Yan, Ji Wang, Huo-wang Chen
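Importance sampling accelerates statistical testing by drawing test cases from a biased profile that exercises critical scenarios more often and then reweighting the outcomes so the reliability estimate remains unbiased; the numeric sketch below uses an invented two-scenario profile and failure model purely for illustration.

```python
import random

# Operational profile: the critical scenario is rare but matters most.
p_true = {"normal": 0.99, "critical": 0.01}      # assumed real usage probabilities
p_bias = {"normal": 0.50, "critical": 0.50}      # biased testing profile
fail_prob = {"normal": 0.0, "critical": 0.2}     # hypothetical failure rates

def run_test(scenario: str) -> bool:
    """Simulated test execution: True means the test case failed."""
    return random.random() < fail_prob[scenario]

def estimate_failure_rate(n: int) -> float:
    """Unbiased estimate of the operational failure probability from tests
    drawn under the biased profile, using likelihood-ratio weights."""
    acc = 0.0
    scenarios = list(p_bias)
    weights = [p_bias[s] for s in scenarios]
    for _ in range(n):
        s = random.choices(scenarios, weights=weights, k=1)[0]
        if run_test(s):
            acc += p_true[s] / p_bias[s]          # importance weight
    return acc / n

random.seed(42)
print(estimate_failure_rate(20000))   # close to 0.01 * 0.2 = 0.002
```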
A Metamodel for the CMM Software Process

With the increasing complexity of software systems, geographically distributed development has become mainstream. Managing a software process in which team members are physically distributed is challenging. How to use the Capability Maturity Model (CMM) in geographically distributed development is an area with a number of open research issues. We define a CMM Software Process (CSP) by a set of generic process elements in accordance with the requirements of the CMM. Using the Model Driven Architecture (MDA), the CSP model can be transformed into distributed CMM implementation process models. This paper presents a metamodel for the CSP model, named MM-CSP, and provides the abstract syntax and semantics of the MM-CSP as well as a UML profile for it. Based on the MM-CSP, a prototype tool for CSP modeling has been developed.

Juan Li, Mingshu Li, Zhanchun Wu, Qing Wang
Performance Tuning for Application Server OnceAS

The J2EE application server provides a primary solution for developing enterprise-wide applications, using containers to hold application components. The container framework relieves developers' burden greatly because it encapsulates all the system-level services, and developers are able to use these services directly without knowing the underlying details. The processing capacity of application servers is becoming more and more important with the requirements for higher performance and higher scalability. This paper uses ECperf, a performance benchmark tool for application servers, to study the performance issues of the application server OnceAS, which is developed by the Institute of Software, Chinese Academy of Sciences, and presents optimization approaches including bean instance pools and a high-speed naming service. These optimizations are implemented in OnceAS and proved to be effective through ECperf benchmark evaluation.

Wenbo Zhang, Bo Yang, Beihong Jin, Ningjing Chen, Tao Huang
Systematic Robustness-Testing RI-Pro of BGP

Robustness testing is a very active research area in protocol testing. This paper starts with an analysis of the RI-Pro of BGP-4, and then builds a scenario model to describe the process of route update. The new model studies the RI-Pro from the relationship of available sources instead of the function of the RI-Pro. Based on this model, a novel generation method for robustness-testing suites is presented. Together these compose the systematic robustness testing approach. This approach essentially eliminates the test scaffolding of ISO 9646. Robustness testing experiments on a Cisco 7200 indicate that, compared with the positive test suite, the error-detecting capability of the negative test suite generated by this approach is improved 1.3 times.

Lechun Wang, Peidong Zhu, Zhenghu Gong

Session 5A: Grid Protocols

MPICH-GP: A Private-IP-Enabled MPI Over Grid Environments

This paper presents an overview of MPICH-GP, which extends MPICH-G2 to support private IP clusters. To support communication among private IP clusters, MPICH-GP uses a communication relay scheme combining NAT and a user-level proxy. The benchmarking results show that MPICH-GP outperforms the user-level two-proxy scheme used in PACX-MPI and firewall-enabled MPICH-G, and also shows application performance comparable to that of the original MPICH-G2, especially for large message (or problem) sizes.

Kumrye Park, Sungyong Park, Ohyoung Kwon, Hyoungwoo Park
Paradigm of Multiparty Joint Authentication: Evolving Towards Trust Aware Grid Computing

This paper introduces the semantics of Multiparty Joint Authentication (MJA) into the authentication service, which is to find simplified or optimal authentication solutions for all involved principals through their transitive trust, instead of authenticating each pair of principals in turn. MJA is designed to support multiparty security contexts in grids with a specified, understood level of confidence and to reduce the time cost of mutual authentications. A graph theory model is employed to define MJA and analyze its mathematical properties. Two algorithms to find an n-principal, n-order MJA solution are also presented. MJA is indeed a trust-aware mechanism and will eventually promote trust-aware grid computing.

Hui Liu, Minglu Li
Design and Implementation of a 3A Accessing Paradigm Supported Grid Application and Programming Environment

Mobile grid users accessing grid services has become a common paradigm for obtaining grid resources. To improve their working productivity in a dynamic and open grid environment, mobile grid users should be given an access-point-decoupled and access-time-decoupled way to access grid services. In the development of the VEGA Grid, corresponding to the user accessing module in the Service-Oriented Architecture, we developed what we call a "3A (anytime, anywhere, and on any device) accessing paradigm" supported Grid Application and Programming Environment (GAPE). By the 3A accessing paradigm, we mean that a mobile grid user not only can access grid services at anytime, anywhere and on any device, but also can continuously manage the execution state of his grid applications even if he changes his access point or access time. This article analyzes the 3A accessing paradigm and introduces the implementation of the VEGA GAPE.

He Ge, Liu Donghua, Sun Yuzhong, Xu Zhiwei
VAST: A Service Based Resource Integration System for Grid Society

Grid and P2P are two different ways of organizing heterogeneous resources. However, the similarity in their goal of resource sharing suggests that they could and should be treated together. Earlier work proposed the concept and system model of the ‘Grid Society’, which describes the mixed Grid-P2P environment and demonstrates the similarity between the Grid Society and human society. Following this idea, we continue with an in-depth analysis of the Grid Society environment from a systems perspective and design a service-based architecture for VAST, a resource integration system. In VAST, the Grid and P2P Resource Access Entries assist the Portal in co-scheduling resources by providing a set of well-defined service interfaces; DCI, a P2P software system, was also developed to meet the special requirements of the Portal; and the Portal, the coordinator of the whole system, links these two Access Entries and provides a friendly interface for end users to control and monitor submitted jobs. VAST has been implemented using Java and Web Service technologies, which helps to achieve good portability. Furthermore, an experiment with a parameter sweep application has been conducted to test the VAST prototype, and it shows promising potential for the collaboration of different kinds of resources.

Jiulong Shan, Huaping Chen, Guangzhong Sun, Xin Chen
Petri-Net-Based Coordination Algorithms for Grid Transactions

Transaction processing in the Grid ensures the reliable execution of inherently distributed Grid applications. This paper proposes coordination algorithms for handling short-lived and long-lived Grid transactions, and models and analyzes these algorithms with Petri nets. The cohesion transaction can coordinate long-lived business Grid applications by automatically generating and executing compensation transactions to semantically undo committed sub-transactions. From an analysis of the reachability tree, we show that the Petri net models of the above algorithms are bounded and L1-live. This demonstrates that transactional Grid applications can be realized effectively by the proposed algorithms.

Feilong Tang, Minglu Li, Joshua Zhexue Huang, Cho-Li Wang, Zongwei Luo

Session 5B: Context-Aware and Mobile Computing

Building Infrastructure Support for Ubiquitous Context-Aware Systems

Many context-aware systems have been demonstrated in lab environments; however, due to difficulties such as scalability and privacy issues, they are not yet practical for deployment on a large scale. This paper addresses these two issues with particular interest in users’ privacy protection and spontaneous system association. A person-centric service infrastructure is proposed, together with a context-aware call forwarding system constructed as a proof-of-concept prototype based on the Session Initiation Protocol (SIP).

Wei Li, Martin Jonsson, Fredrik Kilander, Carl Gustaf Jansson
Context-Awareness in Mobile Web Services

Context-aware computing is a computing paradigm in which applications can take advantage of contextual information. The quality of the network connection is a very important factor for mobile web services; however, the conditions of mobile networks may change frequently and dynamically. Thus, providing support for context-aware applications is especially important in mobile web services. Recently, a number of architectures supporting context-aware applications have been developed, but little attention has been paid to the special requirements of mobile devices, which have many constraints. This paper discusses a client-proxy-server architecture that supports context-awareness by considering the type of device, network, and application characteristics. The contribution of this paper lies mainly in the division of labor between proxy and server. An application-specific proxy is used to tailor the original resource based on the mobile user’s context information. To prove the feasibility, a context-aware image management system is designed and realized.

Bo Han, Weijia Jia, Ji Shen, Man-Ching Yuen
CRL: A Context-Aware Request Language for Mobile Computing

This paper introduces an XML-based generic Context Request Language (CRL), whose construction is part of a web services framework in the domain of mobile context sensing. The paper describes an implementation of the technique that is in accordance with a formal mathematical representational model using a first-order temporal language [6]. The language is an attempt to introduce intelligence into context-aware computing by defining context-sensing elements as logical entities. The use of first-order calculus in this language definition, built on web service technology, allows users to utilize context aggregation and to embed user control in contextual information. By carrying out on-the-fly context inference at the middleware level, we achieve a complete separation of concerns between the user application and context sensing. Moreover, the declaration of contextual knowledge based on situations and events within the predicate domain allows users to express changes in contextual information and to quantify these elements over times and durations.

Alvin T. S. Chan, Peter Y. H. Wong, Siu-Nam Chuang
A Resource Reservation Protocol for Mobile Cellular Networks

This paper proposes a protocol named RSVP-C, which aims at reserving resources in mobile cellular networks. In RSVP-C, both active and passive resource reservation routes can be established. We describe the whole architecture and all the management principles, and the message format and reservation mechanism are explained in detail. Simulation results based on a discrete-event simulation model are given as well. Compared with MRSVP, RSVP-C shows better performance in most mobile cellular networks.

Ming Xu, Zhijiao Zhang, Yingwen Chen

Session 5C: Distributed Routing and Switching Protocols I

Using the Linking Model to Understand the Performance of DHT Routing Algorithms

The various proposed DHT routing algorithms embody different underlying routing geometries such as the ring, the hypercube, etc., and routing performance depends on the geometry. In this paper, we dissect the construction of the geometry and propose the linking model for the geometry to understand the performance of DHT routing algorithms. The effect of the different types of links on performance is analyzed. In particular, randomized long links are considered as a new approach to optimizing performance. Our experiments show that routing performance is greatly affected by the links, and that the performance of CAN is improved with additional links.
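
As a rough, hedged illustration of how randomized long links can shorten routes (a toy ring overlay with greedy routing, not the paper's linking model or CAN):

    import random

    def greedy_hops(n, links, src, dst):
        """Greedy clockwise routing on an n-node ring with optional extra long links."""
        def dist(a, b):
            return (b - a) % n
        hops, cur = 0, src
        while cur != dst:
            nbrs = [(cur + 1) % n] + links.get(cur, [])
            cur = min(nbrs, key=lambda v: dist(v, dst))   # neighbour closest to dst
            hops += 1
        return hops

    n = 1024
    long_links = {u: [random.randrange(n)] for u in range(n)}  # one random long link each
    pairs = [(random.randrange(n), random.randrange(n)) for _ in range(200)]
    base = sum(greedy_hops(n, {}, s, d) for s, d in pairs) / len(pairs)
    opt  = sum(greedy_hops(n, long_links, s, d) for s, d in pairs) / len(pairs)
    print(f"avg hops: ring only {base:.1f}, with random long links {opt:.1f}")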

Futai Zou, Shudong Cheng, Fanyuan Ma, Liang Zhang, Junjun Tang
Packet-Mode Priority Scheduling for Terabit Core Routers

Existing packet-mode schedulers result in long waiting times for short control packets in IP networks. To overcome this problem, this paper proposes a practical packet-mode scheduling algorithm called short-packets-first (SPF). Under a uniform Poisson arrival process and low to medium offered load, we prove that SPF can reduce the average waiting time over all packets by greatly lowering the average waiting time for short packets. Moreover, simulations under a real traffic model demonstrate that SPF performs very well compared with other packet-mode and cell-mode scheduling algorithms, especially for short packets under heavy offered load.
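
A minimal sketch of the short-packets-first idea, assuming a single output port and using packet length as the priority key (illustrative only, not the SPF algorithm as specified in the paper):

    import heapq

    class ShortPacketFirstScheduler:
        """Among queued packets, always transmit the shortest one next."""
        def __init__(self):
            self._heap = []
            self._seq = 0          # tie-breaker keeps FIFO order among equal lengths

        def enqueue(self, packet, length):
            heapq.heappush(self._heap, (length, self._seq, packet))
            self._seq += 1

        def dequeue(self):
            _, _, packet = heapq.heappop(self._heap)
            return packet

    s = ShortPacketFirstScheduler()
    s.enqueue("ACK", 40); s.enqueue("DATA", 1500); s.enqueue("SYN", 60)
    print(s.dequeue(), s.dequeue(), s.dequeue())   # ACK SYN DATA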

Wenjie Li, Bin Liu
Node-to-Set Disjoint Paths Problem in Bi-rotator Graphs

A rotator graph was proposed as a topology for interconnection networks of parallel computers, and it is promising because of its small diameter and small degree. However, a rotator graph is a directed graph, which can be problematic when it is applied to actual problems. A bi-rotator graph is obtained by making each edge of a rotator graph bi-directional; it has a Hamiltonian cycle and is also pancyclic. In this paper, we give an algorithm for the node-to-set disjoint paths problem in bi-rotator graphs together with its evaluation results. The solution provides a degree of fault tolerance, for example for file distribution based on the information dispersal technique. The algorithm is of polynomial order in n for an n-bi-rotator graph. It is based on recursion and divided into three cases according to the distribution of destination nodes among the classes into which all the nodes in a bi-rotator graph are categorized. The sum of the lengths of the paths obtained and the time complexity of the algorithm are estimated, and the average performance of the algorithm is also evaluated by computer experiments.

Keiichi Kaneko
QoSRHMM: A QoS-Aware Ring-Based Hierarchical Multi-path Multicast Routing Protocol

We propose a QoS-aware multicast routing protocol called QoSRHMM based on a ring-based hierarchical network structure, which establishes a Ring-Tree by the principle of “ring in the same tier of the hierarchy and tree across different tiers”. It forms one shortest-delay path and multiple delay-constrained minimal-cost paths between the sender and each receiver. When establishing a connection on demand, a node joins a logical ring while satisfying certain QoS constraints. Mobile hosts are served with multicast services in a timely manner, even in the case of handoff.

Guojun Wang, Jun Luo, Jiannong Cao, Keith C. C. Chan

Session 6A: Grid Scheduling and Algorithms II

A Dynamic Task Scheduling Algorithm for Grid Computing System

In this paper, we propose a dynamic task scheduling algorithm that assigns tasks with precedence constraints to processors in a Grid computing system. The proposed scheduling algorithm is based on a modified static scheduling algorithm and takes into account the heterogeneous and dynamic nature of resources in the Grid.
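
As a hedged sketch of what a list-style task scheduler with precedence constraints looks like (a generic earliest-finish-time heuristic with illustrative task and processor names, not the authors' modified static algorithm):

    def list_schedule(tasks, deps, cost):
        """Greedy earliest-finish-time list scheduling.
        tasks: iterable of task ids; deps: {task: set of predecessor tasks};
        cost: {(task, processor): execution time on that processor}."""
        procs = sorted({p for (_, p) in cost})
        free_at = {p: 0.0 for p in procs}            # when each processor becomes free
        finish = {}                                  # task -> finish time
        done, schedule = set(), []
        while len(done) < len(tasks):
            ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
            for t in ready:
                earliest = max((finish[d] for d in deps.get(t, set())), default=0.0)
                proc = min(procs, key=lambda p: max(earliest, free_at[p]) + cost[(t, p)])
                start = max(earliest, free_at[proc])
                finish[t] = start + cost[(t, proc)]
                free_at[proc] = finish[t]
                done.add(t)
                schedule.append((t, proc, start, finish[t]))
        return schedule

    # Example: t2 and t3 depend on t1; two processors with different speeds.
    tasks = ["t1", "t2", "t3"]
    deps = {"t2": {"t1"}, "t3": {"t1"}}
    cost = {(t, p): c for t in tasks for p, c in [("p0", 2.0), ("p1", 3.0)]}
    print(list_schedule(tasks, deps, cost))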

Yuanyuan Zhang, Yasushi Inoguchi, Hong Shen
Replica Selection on Co-allocation Data Grids

The Data Grid supports data-intensive applications in a large-scale grid environment. It makes use of storage systems as distributed data stores by replicating contents. In the co-allocation architecture, a client can divide a file into k blocks of equal size and download the blocks dynamically from multiple servers in parallel via GridFTP. The drawback is that faster servers must wait for the slowest server to deliver the final block. Therefore, designing efficient strategies for accessing a file from multiple copies is very important. In this paper, we propose two replica retrieval approaches, abort-and-retransfer and one-by-one co-allocation, to improve the performance of data grids. Our schemes decrease the completion time of data transfers and reduce the workload of slower servers. Experimental results are presented to demonstrate their performance.
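
A hedged sketch of block-level co-allocation, assuming a caller-supplied fetch(server, block_id) function: blocks are handed out from a shared queue, so faster servers naturally retrieve more of them. This illustrates the general co-allocation idea, not the paper's abort-and-retransfer or one-by-one schemes.

    import queue, threading

    def coallocate(fetch, servers, num_blocks):
        """fetch(server, block_id) -> bytes.  Blocks are pulled from a shared
        work queue, so faster servers simply download more of them."""
        todo = queue.Queue()
        for b in range(num_blocks):
            todo.put(b)
        blocks, lock = {}, threading.Lock()

        def worker(server):
            while True:
                try:
                    b = todo.get_nowait()
                except queue.Empty:
                    return
                data = fetch(server, b)
                with lock:
                    blocks[b] = data

        threads = [threading.Thread(target=worker, args=(s,)) for s in servers]
        for t in threads: t.start()
        for t in threads: t.join()
        return b"".join(blocks[b] for b in range(num_blocks))

    # Usage sketch: coallocate(my_gridftp_get, ["serverA", "serverB"], num_blocks=16)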

Ruay-Shiung Chang, Chih-Min Wang, Po-Hung Chen
A Novel Checkpoint Mechanism Based on Job Progress Description for Computational Grid

In this paper, we argue that application-level uncoordinated checkpointing with user-defined checkpoint data is favorable in grid environments, where heterogeneity is prevalent. We present a novel application-level uncoordinated checkpoint protocol based on a Job Progress Description (JPD), which is composed of a Job Progress Record Object and a group of Job Progress State Objects; these two kinds of objects act as checkpoint data for the job, and their methods can be used as checkpoint APIs. By extending this protocol with sender-based message logging, it can be used by message-passing applications in a computational grid. Emulation with a class of master-worker message-passing applications shows that this checkpointing protocol can dramatically reduce the wall time of the application when failures occur.

Chunjiang Li, Xuejun Yang, Nong Xiao
A Peer-to-Peer Mechanism for Resource Location and Allocation Over the Grid

Recent advances in P2P lookup overlays provide an appealing solution for distributed search without relying on a single database server. In addition to performing resource discovery, these P2P substrates also offer membership management for dynamic peers. In this paper, we propose a publicly shared architecture called VC²A that takes advantage of a P2P lookup substrate for computational applications. VC²A targets computational master-slave applications. An application running in VC²A dynamically allocates resources from the system on the fly; these allocated resources then self-manage and self-heal. We have implemented an architecture based on our previous efforts, which include an enhanced P2P lookup overlay and a mobile agent system on top of this overlay. We show that VC²A is not only scalable but also robust, and takes advantage of the heterogeneity of the resources.

Hung-Chang Hsiao, Mark Baker, Chung-Ta King
The Model, Architecture and Mechanism Behind Realcourse

Realcourse is a video stream service supported by a collection of physical servers distributed all over China. This article provides a comprehensive description of the model, architecture, and operation mechanism behind realcourse. In particular, the high-availability feature and dynamic re-configurability are emphasized.

Jinyu Zhang, Xiaoming Li

Session 6B: Cluster Resource Scheduling and Algorithms

Managing Irregular Workloads of Cooperatively Shared Computing Clusters

Cooperative resource sharing enables distinct organizations to form a federation of computing resources. A functional broker is deployed to facilitate remote resource access within the community grid. A major issue is correlation in job arrivals caused by seasonal usage and/or coincident resource demand patterns: high levels of burstiness in job arrivals can cause the broker’s job queue to grow to the point where its performance becomes severely impaired. Since job arrivals cannot be controlled, management strategies must be employed to admit jobs so as to sustain the resource allocation performance of the broker. In this paper, we present a theoretical analysis of the effect of job traffic burstiness on resource allocation performance in order to elicit the general job management strategies to be employed. Based on the analysis, we define and justify a job management framework for the resource broker to cope with overload conditions caused by job arrival correlations.

Percival Xavier, Wentong Cai, Bu-Sung Lee
Performance-Aware Load Balancing for Multiclusters

In a multicluster architecture, where jobs can be submitted through each constituent cluster, the job arrival rates in individual clusters may be uneven and the load therefore needs to be balanced among clusters. In this paper we investigate load balancing for two types of jobs, namely non-QoS and QoS-demanding jobs, and as a result two performance-specific load balancing strategies (called ORT and OMR) are developed. The ORT strategy is used to obtain the optimised mean response time for non-QoS jobs, and the OMR strategy is used to achieve the optimised mean miss rate for QoS-demanding jobs. The ORT and OMR strategies are mathematically modelled using queuing network theory to establish sets of optimisation equations. Numerical solutions are developed to solve these optimisation equations, and a so-called fair workload level is determined for each cluster; a sketch of how this level can drive job transfers follows below. When the current workload in a cluster reaches this pre-calculated fair workload level, jobs subsequently submitted to the cluster are transferred to other clusters for execution. The effectiveness of both strategies is demonstrated through theoretical analysis and experimental verification. The results show that the proposed load balancing mechanisms bring about considerable performance gains for both job types, while the job transfer frequency among clusters is considerably reduced. This has a number of advantages, in particular where scheduling jobs to remote resources involves the transfer of large executable and data files.
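
A minimal sketch of how a pre-calculated fair workload level might drive job transfers (illustrative data structures, not the ORT/OMR optimisation models themselves):

    def route_job(job, local, workload, fair_level):
        """workload and fair_level are dicts keyed by cluster name.  A job
        submitted to `local` stays there while the local workload is under its
        pre-calculated fair level; otherwise it is transferred to the cluster
        with the most headroom relative to its own fair level."""
        if workload[local] < fair_level[local]:
            target = local
        else:
            target = max(fair_level, key=lambda c: fair_level[c] - workload[c])
        workload[target] += job["load"]
        return target

    workload   = {"A": 12.0, "B": 3.0, "C": 7.0}
    fair_level = {"A": 10.0, "B": 8.0, "C": 9.0}
    print(route_job({"load": 1.0}, "A", workload, fair_level))   # transferred to "B"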

Ligang He, Stephen A. Jarvis, David Bacigalupo, Daniel P. Spooner, Graham R. Nudd
Scheduling of a Parallel Computation-Bound Application and Sequential Applications Executing Concurrently on a Cluster – A Case Study

Studies have shown that most of the computers in a non-dedicated cluster are often idle or lightly loaded. These underutilized computers can be employed to execute parallel applications. The aim of this study is to learn how the concurrent execution of a computation-bound parallel application and sequential applications influences their execution performance and cluster utilization. The results demonstrate that the computation-bound parallel application benefits from load balancing while the sequential applications suffer only an insignificant slowdown, and the overall utilization of the non-dedicated cluster is improved.

Adam K. L. Wong, Andrzej M. Goscinski
Sequential and Parallel Ant Colony Strategies for Cluster Scheduling in Spatial Databases

In spatial join processing, a common method to minimize the I/O cost is to partition the spatial objects into clusters and then to schedule the processing of the clusters such that the number of times the same objects are fetched into memory is minimized. The key issue in cluster scheduling is how to produce a better sequence of clusters to guide the scheduling. This paper describes strategies that apply the ant colony optimization (ACO) algorithm to produce the cluster scheduling sequence. Since the structure of ACO is highly suitable for parallelization, parallel algorithms are also developed to improve performance. We evaluate the approach and show that the scheduling sequence produced by the new method is much better than those of existing approaches.

Jitian Xiao, Huaizhong Li

Session 6C: Distributed Routing and Switching Protocols II

Cost-Effective Buffered Wormhole Routing

The cost-effectiveness of wormhole-routed torus networks is systematically evaluated, with emphasis on new buffered-wormhole routing algorithms. These algorithms use hardwired datelines and escape channels to alleviate the bottleneck and bubble problems caused by previous algorithms, and thus improve buffer efficiency with little hardware overhead. A two-part evaluation environment is developed, consisting of a cycle-driven simulator for high-level measurements such as network capacity and an ASIC design package for low-level measurements such as operating frequency and chip area. This environment is used to demonstrate that the new routers are more cost-effective than their counterparts under various workload parameters.

Jinming Ge
Efficient Routing and Broadcasting Algorithms in de Bruijn Networks

Routing on the de Bruijn graph (dBG) has recently been investigated for shortest-path and fault-tolerant routing, but shortest-path routing on the dBG in the presence of failures has not been studied. Furthermore, dBG-based broadcasting has been studied in the forms of local broadcasting and broadcasting based on arc-disjoint spanning trees; however, those broadcasting algorithms only work in dBG(2,k). In this paper, we investigate shortest-path routing algorithms in the presence of failures, based on the bidirectional de Bruijn graph (BdBG), and we also investigate broadcasting in the BdBG for degrees greater than or equal to two.

Ngoc Chi Nguyen, Nhat Minh Dinh Vo, Sungyoung Lee
Fault-Tolerant Wormhole Routing Algorithm in 2D Meshes Without Virtual Channels

In wormhole-routed meshes, many routing algorithms prevent deadlocks by enclosing faulty nodes within fault blocks, but none of them can tolerate the convex fault model without virtual channels. We propose a deterministic fault-tolerant wormhole routing algorithm for mesh networks that can handle disjoint convex faulty regions containing no nonfaulty nodes. The proposed algorithm does not use any virtual channels and routes messages using an extended X-Y routing algorithm in the fault-free regions. The algorithm is deadlock- and livelock-free.

Jipeng Zhou, Francis C. M. Lau
Fault Tolerant Routing Algorithm in Hypercube Networks with Load Balancing Support

We introduce load balancing support into our original fault-tolerant routing algorithm for hypercube networks and design an improved routing algorithm based on load balancing. Our routing algorithm is simple and powerful and preserves the advantages of the original algorithm. First, it is applicable whether or not the given hypercube network satisfies the required conditions. Second, it is distributed and based on local information. Finally, it is effective and efficient. The contribution is that we incorporate load balancing into fault-tolerant routing in hypercube networks in a way that makes the algorithm perform better.

Xiaolin Xiao, Guojun Wang, Jianer Chen

Session 7A: Security I

Proxy Structured Multisignature Scheme from Bilinear Pairings

In the past few years, proxy signatures have become an important research area and many excellent schemes have been proposed. Proxy signatures can be combined with other special signatures to obtain new types of proxy signatures. A multisignature scheme is said to be structured if the group of signers is structured. Due to the various applications of bilinear pairings in cryptography, many pairing-based signature schemes have been proposed. In this paper, we propose a proxy structured multisignature scheme from bilinear pairings. This scheme provides a possible solution to the problem that an ordinary structured multisignature relies on the presence and cooperation of all the entities in the signing group. We support the proposed scheme with security and efficiency analyses.

Xiangxue Li, Kefei Chen, Longjun Zhang, Shiqun Li
A Threshold Proxy Signature Scheme Using Self-Certified Public Keys

So far, all proposed threshold proxy signature schemes have been based on public key systems with certificates, and most of them use Shamir’s threshold secret sharing scheme. The self-certified public key system has attracted more and more attention because of its advantages. Based on Hsu et al.’s self-certified public key system and Li et al.’s proxy signature scheme, a threshold proxy signature scheme is proposed. The new scheme provides the properties of proxy protection, verifiability, strong identifiability, strong unforgeability, strong repudiability, distinguishability, known signers, and prevention of misuse. In the proxy signature verification phase, the authentication of the original and proxy signers’ public keys and the verification of the threshold proxy signature are executed together. In addition, the computation overhead and communication cost of the scheme are lower than those of previous works based on Shamir’s secret sharing protocol and certificate-based public key systems.
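
For reference, the Shamir (t, n) threshold sharing that most earlier schemes rely on can be sketched as follows (a self-contained toy with a fixed prime; it is not part of the proposed self-certified scheme):

    import random

    P = 2**127 - 1   # a Mersenne prime large enough for a demo secret

    def make_shares(secret, t, n):
        """Shamir (t, n) sharing: a random degree t-1 polynomial with
        f(0) = secret, evaluated at x = 1..n over GF(P)."""
        coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
        def f(x):
            return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
        return [(x, f(x)) for x in range(1, n + 1)]

    def reconstruct(shares):
        """Lagrange interpolation at x = 0 over GF(P); needs any t shares."""
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, P - 2, P)) % P
        return secret

    shares = make_shares(123456789, t=3, n=5)
    assert reconstruct(shares[:3]) == 123456789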

Qingshui Xue, Zhenfu Cao
The Authentication and Processing Performance of Session Initiation Protocol (SIP) Based Multi-party Secure Closed Conference System

This paper presents a new SIP-based multi-party secure closed conference system. In a traditional conference system, no participant except the captain has the privilege, as a UA (User Agent), of accepting or declining new participants; a closed conference system supports this function. We also propose an authentication procedure for the closed conference system. By means of a real implementation, we provide an experimental performance analysis of the SIP security mechanisms.

Jongkyung Kim, Hyuncheol Kim, Seongjin Ahn, Jinwook Chung

Session 7B: High Performance Processing and Applications

A Method for Authenticating Based on ZKp in Distributed Environment

A new ZKp (zero-knowledge proof) identity protocol is proposed in this paper. It is more appropriate than traditional identity protocols in distributed environments without a single trusted third party. The security of this protocol relies on the discrete logarithm problem on conics over finite fields, and it can be designed and implemented more easily than protocols on elliptic curves. A simple solution is also proposed to prevent a potential leak in our protocol.

Dalu Zhang, Min Liu, Zhe Yang
A Load-Balanced Parallel Algorithm for 2D Image Warping

2D image warping is a computation-intensive task and plays an important role in many fields. Existing parallel image warping algorithms are only suitable for limited classes of geometric transformations. This paper proposes a distributed parallel image warping algorithm, PIWA-LIC, which partitions the valid area of the output image in a balanced way so that each computing node resamples one sub-image of the output. To guarantee data locality, a line segment approximation (LSA) method is exploited to obtain the corresponding input area for each output sub-image. For any 2D geometric transformation with a one-to-one mapping, PIWA-LIC achieves high performance because of its good load balance and data locality.

Yan-huang Jiang, Zhi-ming Chang, Xue-jun Yang
A Parallel Algorithm for Helix Mapping Between 3D and 1D Protein Structure Using the Length Constraints

Determining the 3-dimensional (3D) structures of proteins is still a challenging problem. Certain experimental techniques can produce partial information about protein structures, yet not enough to solve the structure. In this paper, we investigate the problem of relating such partial information to the protein sequence. We developed an algorithm that builds a library mapping helices in a 3D structure to the 1-dimensional (1D) structure, using helix length constraints obtained from such partial information. We present a parallel algorithm for building the mapping tree that uses dynamic distributed scheduling for load balancing; it shows near-linear speedup for up to the 20 processors tested. If the protein secondary structure prediction is good, the library contains a mapping that correctly assigns the majority of the helices in the protein.

Jing He, Yonggang Lu, Enrico Pontelli
A New Scalable Parallel Method for Molecular Dynamics Based on Cell-Block Data Structure

A scalable parallel algorithm is presented, especially for large-scale three-dimensional simulations with seriously non-uniform particle distributions. Based on cell-block data structures, this algorithm uses a Hilbert space-filling curve to convert the three-dimensional domain decomposition for load distribution across processors into a one-dimensional load balancing problem, to which a measurement-based multilevel averaging weights (MAW) method can be applied successfully. In contrast to inverse space-filling partitioning (ISP), MAW redistributes blocks by monitoring the change of total load in each processor. Numerical experiments have shown that MAW is superior to ISP in producing balanced load for large-scale multi-medium MD simulations in high-temperature and high-pressure physics. Excellent scalability was demonstrated, with a speedup larger than 200 on 240 processors of one MPP. The largest run, with 1.1 × 10⁹ particles on 500 processors, took 80 seconds per time step.
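
As a simplified illustration of space-filling-curve load partitioning, the sketch below orders cells by a Morton (Z-order) key, used here as a stand-in for the Hilbert curve of the paper, and cuts the ordered list into contiguous chunks of roughly equal weight:

    def morton3d(x, y, z, bits=10):
        """Interleave the bits of (x, y, z) to get a Z-order (Morton) index."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (3 * i)
            key |= ((y >> i) & 1) << (3 * i + 1)
            key |= ((z >> i) & 1) << (3 * i + 2)
        return key

    def partition(cells, weights, nprocs):
        """cells: list of (x, y, z) block coordinates; weights: per-cell load.
        Order cells along the curve, then cut into nprocs contiguous chunks
        of approximately equal total weight."""
        order = sorted(range(len(cells)), key=lambda i: morton3d(*cells[i]))
        target = sum(weights) / nprocs
        parts, current, acc = [[] for _ in range(nprocs)], 0, 0.0
        for i in order:
            if acc >= target and current < nprocs - 1:
                current, acc = current + 1, 0.0
            parts[current].append(cells[i])
            acc += weights[i]
        return parts

Because the curve preserves spatial locality, each contiguous chunk corresponds to a compact region of the simulation box, which keeps communication between processors local.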

Xiaolin Cao, Zeyao Mo
Parallel Transient Stability Simulation for National Power Grid of China

With the development of modern power systems, real-time simulation and on-line control are becoming more and more critical. Transient stability analysis, where the most intensive computation lies, is the bottleneck of real-time simulation for large-scale power systems. Thus, the key to achieving real-time simulation of large-scale power systems is to find new transient stability algorithms and parallel software for high-performance computers. This paper presents a new spatially parallel transient stability algorithm, including an improved parallel network algorithm and an optimal convergence checking method. Simulation software based on the spatially parallel algorithm is designed and implemented on an SMP-cluster system. Test cases on the national power grid of China show that the optimal computation time of the parallel software is only 38% of the duration of the actual dynamic process. This suggests that the algorithms described in this paper can achieve super-real-time simulation of very large-scale power systems and make complex on-line power applications, especially on-line supervision and control of large-scale power systems, feasible.

Wei Xue, Jiwu Shu, Weimin Zheng
HPL Performance Prevision to Intending System Improvement

HPL is a parallel Linpack benchmark package widely adopted in performance testing of massive cluster systems. For the HPL data layout among processors, a rule for determining the block size NB theoretically, which removes the dependence on trial-and-error experiments, is derived from an in-depth analysis of the blocked parallel algorithm for solving linear algebraic equations and its implementation in HPL. Based on that rule, an emulation model for roughly estimating HPL execution time is constructed. Verified against a real system, the model is used to predict the benefit to the Linpack test of intended system improvements, such as increases in memory size, communication bandwidth, and so on. It is expected to help direct future system improvements aimed at optimizing the HPL test.

Wenli Zhang, Mingyu Chen, Jianping Fan

Session 7C: Networking and Protocols I

A Novel Fuzzy-PID Dynamic Buffer Tuning Model to Eliminate Overflow and Shorten the End-to-End Roundtrip Time for TCP Channels

The novel Fuzzy-PID dynamic buffer controller/tuner eliminates overflow at the user/server/application level adaptively. As a result, it shortens the end-to-end roundtrip time (RTT) of a client/server TCP interaction through improved fault tolerance. The Fuzzy-PID, which is independent of what occurs at the system/router level, is formed by combining fuzzy logic with the extant algorithmic model, the pure PID (P²ID) tuner; it eliminates the shortcomings of the P²ID component while preserving its power. Its operation is independent of the traffic pattern, which makes it suitable for the Internet, where the traffic pattern can switch suddenly, for example from LRD (long-range dependence) to SRD (short-range dependence) or multi-fractal.
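
For orientation, the algorithmic PID component alone can be sketched as a discrete controller that nudges the buffer size toward a target occupancy (illustrative gains and target; the fuzzy logic layer of the paper is omitted):

    class PIDBufferTuner:
        """Discrete PID controller: adjusts the buffer length so that the
        measured queue occupancy tracks a target ratio of the buffer."""
        def __init__(self, kp=0.8, ki=0.1, kd=0.3, target=0.8):
            self.kp, self.ki, self.kd, self.target = kp, ki, kd, target
            self.integral, self.prev_error = 0.0, 0.0

        def next_buffer_size(self, queue_len, buffer_len):
            error = queue_len / buffer_len - self.target     # positive => nearly full
            self.integral += error
            derivative = error - self.prev_error
            self.prev_error = error
            adjust = self.kp * error + self.ki * self.integral + self.kd * derivative
            # grow (or shrink) the buffer, but never below the current backlog
            return max(queue_len, int(buffer_len * (1.0 + adjust)))

    tuner = PIDBufferTuner()
    print(tuner.next_buffer_size(queue_len=90, buffer_len=100))   # buffer is enlarged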

Wilfred W. K. Lin, Allan K. Y. Wong, Tharam S. Dillon
Communication Using a Reconfigurable and Reliable Transport Layer Protocol

Although TCP is known to be inefficient over networks such as wireless, satellite, and long-fat-pipe networks, it is still the most widely used transport layer protocol even on these networks. In this paper, we explore an alternative strategy for designing a reliable transport layer protocol that is much more suitable for today’s mobile and other non-conventional networks. The objective is to have a single protocol that is compatible with today’s communication software and can easily be made to perform better over all types of networks. The outcome of the research is a reconfigurable, user-level, reliable transport layer protocol, called RRTP (Reliable and Reconfigurable Transport Protocol), that is TCP-friendly, i.e., it asymptotically converges to fairness as LIMD (Linear Increase Multiplicative Decrease) algorithms do. The protocol is implemented on top of UDP, but it can also easily be incorporated into OS kernels. The paper presents the RRTP algorithm and the key parameters necessary for its reconfiguration. We evaluate our protocol using the standard network simulation tool ns-2. Several representative network configurations are used to benchmark the performance of our protocol against TCP in terms of network throughput and congestion loss rate. It is observed that under normal operating conditions, our protocol has a performance advantage of 30% to 700% over TCP in lossy wireless environments as well as in high-bandwidth, high-latency networks.
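
The LIMD behaviour that RRTP is said to converge to can be sketched as a generic additive-increase, multiplicative-decrease window update (illustrative parameters, not RRTP's actual ones):

    def limd_update(cwnd, loss_detected, alpha=1.0, beta=0.5, min_cwnd=1.0):
        """Generic LIMD/AIMD congestion-window update, in packets:
        add alpha per loss-free round trip, multiply by beta on loss."""
        if loss_detected:
            return max(min_cwnd, cwnd * beta)
        return cwnd + alpha

    cwnd = 10.0
    for loss in [False, False, True, False]:
        cwnd = limd_update(cwnd, loss)
        print(cwnd)        # 11.0, 12.0, 6.0, 7.0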

Tan Wang, Ajit Singh
Minicast: A Multicast-Anycast Protocol for Message Delivery

Anycast and multicast are two important Internet services, and combining the two protocols can provide new and practical services. In this paper we propose a new Internet service, Minicast: in a scenario of n replicated or similar servers, deliver a message to at least m members, 1 ≤ m ≤ n. Such a service has potential applications in information retrieval, parallel computing, cache queries, etc., and can provide the same Internet service at an optimal cost, reducing bandwidth consumption, network delay, and so on. We design a multi-core tree based architecture for the Minicast service and present the criteria for calculating the sub-cores among a subset of Minicast members. Simulation shows that the proposed architecture can even out the Minicast traffic and that a Minicast application can reduce the consumption of network resources.

Shui Yu, Wanlei Zhou, Justin Rough
Dependable WDM Networks with Edge-Disjoint P-Cycles

In this paper, we propose a fault-tolerant mechanism for optical network design based on edge-disjoint P-cycles (EDPC). EDPC is a special type of traditional P-cycles in which no common edges are allowed between any two P-cycles. Previously published schemes for computing P-cycles are time consuming and do not survive multiple link failures when P-cycles have common edges. Instead of using complex ILP, a heuristic method based on link-state routing, compatible with the traditional open shortest path first (OSPF) network protocol, is proposed to speed up the construction of EDPC. The results show that the EDPC method can tolerate more link failures and improve the restoration efficiency of traditional P-cycles, at the cost of a decrease of two working units for every two P-cycles.

Chuan-Ching Sue, Yung-Chiao Chen, Min-Shao Shieh, Sy-Yen Kuo
An Efficient Fault-Tolerant Approach for MPLS Network Systems

Multiprotocol label switching (MPLS) has become an attractive technology for next-generation backbone networks. To provide high-quality services, fault tolerance should be taken into account in the design of a backbone network. In an MPLS-based backbone network, the fault-tolerance issue concerns how to protect the traffic carried on a label switched path (LSP) against node and link failures. This paper presents a new, efficient fault-tolerant approach for MPLS. When a node or link failure occurs in a working LSP, the traffic of the faulty LSP (the affected traffic) is redistributed to be carried by other failure-free working LSPs. To minimize the impact on the failure-free LSPs, the affected traffic distribution is formulated as a minimum-cost flow problem. Finally, extensive simulations are performed to quantify the effectiveness of the proposed approach over previous approaches.
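
A hedged sketch of the minimum-cost-flow formulation, using networkx as an off-the-shelf solver and purely illustrative node names and numbers: the affected traffic is a supply at the failure point, failure-free LSPs are capacitated weighted edges, and the solver returns how much traffic each LSP should absorb.

    import networkx as nx

    # Affected traffic of the faulty LSP: 10 units from "src" to "dst".
    G = nx.DiGraph()
    G.add_node("src", demand=-10)          # supply (traffic to redistribute)
    G.add_node("dst", demand=10)           # demand
    # Each failure-free working LSP is modelled by edges with spare capacity
    # and a cost (e.g. hop count or load penalty).
    G.add_edge("src", "dst", capacity=6, weight=1)   # LSP-1
    G.add_edge("src", "mid", capacity=8, weight=1)   # LSP-2, first segment
    G.add_edge("mid", "dst", capacity=8, weight=2)   # LSP-2, second segment

    flow = nx.min_cost_flow(G)
    print(flow["src"])     # e.g. {'dst': 6, 'mid': 4}: how the affected traffic splits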

Jenn-Wei Lin, Hung-Yu Liu

Session 8A: Security II

A Novel Technique for Detecting DDoS Attacks at Its Early Stage

Spoofing of source IP addresses is commonly utilized to perform Distributed Denial-of-Service (DDoS) attacks. Most current detection and prevention methods against DDoS ignore the innocent side, whose IP addresses are utilized as the spoofed IPs by the attacker. In this paper, a novel method is proposed against direct DDoS attacks, consisting of two components: a client detector and a server detector. The cooperation of these two components and their interactive behavior lead to detection of a DDoS attack at an early stage. The experimental results show that the approach presented in this paper yields accurate DDoS alarms at an early stage. Furthermore, with the adopted evaluation functions, the approach is insensitive to false suspect alarms.

Bin Xiao, Wei Chen, Yanxiang He
Probabilistic Inference Strategy in Distributed Intrusion Detection Systems

The level of seriousness and sophistication of recent cyber-attacks has risen dramatically over the past decade. This brings great challenges for network protection and automatic security management. Quick and exact localization of intruders by an efficient intrusion detection system (IDS) is of great help to network managers. In this paper, Bayesian networks (BNs) are proposed to model distributed intrusion detection based on the characteristics of intruders’ behaviors. An inference strategy based on BNs is developed, which can be used to track the strongest causes (attack sources) and trace the strongest dependency routes among the behavior sequences of intruders. The proposed algorithm can be the foundation for further intelligent decision making in distributed intrusion detection.

Jianguo Ding, Shihao Xu, Bernd Krämer, Yingcai Bai, Hansheng Chen, Jun Zhang
An Authorization Framework Based on Constrained Delegation

In this paper, we distinguish between authorization problems at the management level and the request level in open decentralized systems, using delegation for flexible and scalable authorization management. The delegation models in existing approaches are limited to one level or only provide basic delegation schemes, and have no effective control over the propagation scope of delegated privileges. We propose REAL, a Role-based Extensible Authorization Language framework for open decentralized systems. REAL covers delegation models at both levels and provides more flexible and scalable authorization and delegation policies while being capable of restricting the propagation scope of delegations. We formally define the semantics of credentials in REAL by presenting a translation algorithm from credentials to Datalog rules (with negation-as-failure). This translation also shows that the semantics can be computed in polynomial time.

Gang Yin, Meng Teng, Huai-min Wang, Yan Jia, Dian-xi Shi
A Novel Hierarchical Key Management Scheme Based on Quadratic Residues

In 1997, Lin [1] proposed a dynamic key management scheme using a user hierarchy. Later, Lee [2] raised two comments on Lin’s method. In 2002, Lin [3] proposed a more efficient hierarchical key management scheme based on elliptic curves, which resolves the weaknesses identified in the scheme of [1]. In this paper, we further use the Quadratic Residue (Q.R.) theorem to reduce the computational complexity of Lin’s method.

Jue-Sam Chou, Chu-Hsing Lin, Ting-Ying Lee

Session 8B: Artificial Intelligence Systems and Applications

Soft-Computing-Based Intelligent Multi-constrained Wavelength Assignment Algorithms in IP/DWDM Optical Internet

Wavelength assignment is one of the most important research areas in the IP/DWDM optical Internet. Taking multiple constraints into account, including cost, power, and network performance, makes wavelength assignment fit actual network configurations much better; however, the problem complexity increases correspondingly, leading to the adoption of a layered solution framework. As each layer sub-problem is NP-complete, soft-computing algorithms, including a Simulated Annealing Algorithm (SAA) and a Simulated-annealing Genetic Algorithm (SGA), are used jointly with heuristic algorithms to design the intelligent multi-constrained wavelength assignment algorithms. Simulation results show that the proposed algorithms are feasible and effective.
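
A generic simulated-annealing skeleton of the kind used for such layered sub-problems (the move and cost functions are supplied by the caller; this is not the paper's SAA or SGA):

    import math, random

    def simulated_annealing(initial, cost, neighbour, t0=1.0, alpha=0.95, iters=5000):
        """Generic SA loop: accept worse candidates with probability
        exp(-delta / T) and cool the temperature geometrically."""
        current = best = initial
        t = t0
        for _ in range(iters):
            cand = neighbour(current)
            delta = cost(cand) - cost(current)
            if delta < 0 or random.random() < math.exp(-delta / t):
                current = cand
                if cost(current) < cost(best):
                    best = current
            t = max(t * alpha, 1e-6)       # keep the temperature strictly positive
        return best

    # Usage sketch for wavelength assignment: `initial` is an assignment of
    # wavelengths to lightpaths, `neighbour` re-colours one lightpath, and
    # `cost` counts constraint violations plus the weighted cost/power terms.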

Xingwei Wang, Cong Liu, Min Huang
Data Transmission Rate Control in Computer Networks Using Neural Predictive Networks

The main difficulty in designing an efficient congestion control scheme lies in the large propagation delay in data transfer, which usually leads to a mismatch between network resources and the amount of admitted traffic. To attack this problem, this paper describes a novel congestion control scheme based on a Back Propagation (BP) neural network technique. We consider a general computer communication model with multiple sources and one destination node. The dynamic buffer occupancy of the bottleneck node is predicted and controlled using a BP neural network. The controlled best-effort traffic of the sources uses the bandwidth left over by the guaranteed traffic. This control mechanism is shown to avoid network congestion efficiently and to optimize transfer performance, both by theoretical analysis and by simulation studies.
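
A minimal one-hidden-layer back-propagation predictor in the spirit of the description (toy sizes, synthetic data, and learning rate; not the paper's model), trained to predict buffer occupancy from recent samples:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.5, (4, 8)), np.zeros(8)   # 4 past samples -> 8 hidden units
    W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)   # hidden -> predicted occupancy

    def forward(x):
        h = np.tanh(x @ W1 + b1)
        return h, h @ W2 + b2

    def train_step(x, y, lr=0.01):
        global W1, b1, W2, b2
        h, yhat = forward(x)
        err = yhat - y                            # gradient of squared loss w.r.t. output
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)          # back-propagate through tanh
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

    # x: sliding windows of past buffer occupancy, y: the next value (synthetic here)
    x = rng.random((64, 4)); y = x.mean(axis=1, keepdims=True)
    for _ in range(2000):
        train_step(x, y)
    print(float(np.mean((forward(x)[1] - y) ** 2)))   # mean squared training error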

Yanxiang He, Naixue Xiong, Yan Yang
Optimal Genetic Query Algorithm for Information Retrieval

An efficient immune query optimization algorithm for information retrieval is proposed in this paper. Its main characteristics are as follows: each genetic individual is a query and each gene corresponds to a weighted term; an immune operator is used to avoid degeneracy; and a local search procedure based on the concept of neighborhood is used to speed up the search for better query vectors. Experimental results show that the proposed algorithm can efficiently improve query search performance.

Ziqiang Wang, Boqin Feng
A Genetic Algorithm for Dynamic Routing and Wavelength Assignment in WDM Networks

In this paper, we study the challenging problem of Dynamic Routing and Wavelength Assignment (DRWA) in WDM (Wavelength-Division Multiplexing) networks under the wavelength continuity constraint, and propose an improved genetic algorithm (GA) for it. By adopting a new fitness function that simultaneously considers the path length and the number of free wavelengths in the cost estimation of a route, the new genetic RWA algorithm achieves a good load balance among route candidates and results in a lower blocking probability than both the fixed alternate routing algorithm and the previous GA-based algorithm for DRWA, as verified by an extensive simulation study using the ns-2 network simulator and several typical network topologies.
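
The fitness idea, rewarding short routes that still have many free wavelengths, can be sketched as follows (illustrative weights and interface; not the paper's exact fitness function):

    def fitness(route, free_wavelengths, w_len=1.0, w_free=2.0):
        """Favour short routes with many wavelengths free on *every* link.
        `route` is a list of link ids; `free_wavelengths(link)` returns the set
        of wavelengths currently free on that link."""
        common = set.intersection(*(free_wavelengths(l) for l in route))
        if not common:                       # wavelength continuity cannot be met
            return 0.0
        return w_free * len(common) / (w_len * len(route))

    lam = {"l1": {1, 2, 3}, "l2": {2, 3}, "l3": {3, 5}}
    print(fitness(["l1", "l2"], lam.get))          # short route, 2 common wavelengths
    print(fitness(["l1", "l2", "l3"], lam.get))    # longer route, only 1 common wavelength

Selection in the GA then prefers high-fitness route candidates, which is what spreads load across alternative routes and lowers the blocking probability.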

Vinh Trong Le, Son Hong Ngo, Xiaohong Jiang, Susumu Horiguchi, Minyi Guo

Session 8C: Networking and Protocols II

Ensuring E-Transaction Through a Lightweight Protocol for Centralized Back-End Database

A reasonable end-to-end reliability guarantee for three-tier systems, called e-Transaction (exactly-once Transaction), has recently been proposed. This work presents a lightweight e-Transaction protocol for a centralized back-end database. Our protocol does not require coordination among the replicas of the application server and does not rely on any assumption about the processing order of messages exchanged among processes, as is instead required by some existing solutions.

Paolo Romano, Francesco Quaglia, Bruno Ciciani
Cayley DHTs — A Group-Theoretic Framework for Analyzing DHTs Based on Cayley Graphs

Static DHT topologies influence important features of DHTs such as scalability, communication load balancing, routing efficiency and fault tolerance. While dynamic DHT algorithms, which have to approximate these topologies for dynamically changing sets of peers, obviously play a very important role in DHT networks, important insights can be gained by focussing clearly on the static DHT topology as well. In this paper we analyze and classify current DHTs in terms of their static topologies based on the Cayley graph group-theoretic model, and show that most DHT proposals use Cayley graphs as static DHT topologies, thus taking advantage of several important Cayley graph properties such as vertex/edge symmetry, decomposability and optimal fault tolerance. Using these insights, Cayley DHT design can directly leverage algebraic design methods to generate high-performance DHTs that adopt Cayley-graph-based static DHT topologies, extended with suitable dynamic DHT algorithms.
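
To make the Cayley-graph view concrete, the sketch below builds the Cayley graph of a group from a generator set; with (Z_2)^n under XOR and the unit vectors as generators it yields the n-dimensional hypercube, one of the static topologies in this classification:

    def cayley_graph(elements, generators, op):
        """Vertices are group elements; u is joined to op(u, g) for each generator g."""
        return {u: sorted(op(u, g) for g in generators) for u in elements}

    # (Z_2)^3 under bitwise XOR with unit-vector generators gives the 3-cube.
    n = 3
    elements = list(range(2 ** n))
    generators = [1 << i for i in range(n)]
    graph = cayley_graph(elements, generators, lambda u, g: u ^ g)
    print(graph[0])     # [1, 2, 4]: the neighbours of vertex 000

Because every generator set is symmetric under the group action, graphs built this way are vertex-transitive, which is the property the paper exploits for load balancing and fault tolerance.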

Changtao Qu, Wolfgang Nejdl, Matthias Kriesell
BR-WRR Scheduling Algorithm in PFTS

The novel concept of Physical Frame Time-slot Switching (PFTS) over DWDM, proposed by the Sichuan Network and Communication Technology Key Laboratory (SC-Netcom Lab), has been around for some time. It differs from existing switching techniques over DWDM by the superior QoS mechanisms embedded in it and by its capability to simplify the Internet into a Single physical-layer User-data transfer Platform Architecture (SUPA).

This paper proposes a Borrow & Return Weighted Round Robin (BR-WRR) algorithm for output scheduling and dispatching in a multiple-priority-queue environment in PFTS nodes. In such nodes, there are multiple ports in a DWDM-based PFTS node and each port contains multiple lambdas. Furthermore, multiple queues with different priorities, plus a burst queue with the highest priority, are devised for each output lambda. A Borrow-and-Return mechanism is introduced to improve on orthodox WRR in PFTS, which cannot continuously transmit the privileged data of a burst to maintain its integrity. A comparison between simulations of BR-WRR and WRR is provided in this paper and shows that BR-WRR has better performance with regard to fairness and important QoS parameters such as transit delay, jitter, and non-disordering.
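
A rough, hedged interpretation of weighted round robin with borrowing (hypothetical interface, not the BR-WRR specification): a queue still in the middle of a burst may keep transmitting on borrowed slots, which it repays as reduced credit in later rounds.

    class BRWRRScheduler:
        """Illustrative WRR with borrowing: a queue whose burst is not yet
        finished may go on transmitting by letting its credit go negative,
        and it returns the borrowed slots when credit is replenished."""
        def __init__(self, weights):
            self.weights = dict(weights)            # queue -> slots per round
            self.credit = {q: 0 for q in weights}

        def schedule_round(self, backlog, in_burst):
            """backlog: queue -> packets waiting; in_burst(queue) -> bool."""
            order = []
            for q, w in self.weights.items():
                self.credit[q] += w                 # grant this round's slots
                while backlog[q] > 0 and (self.credit[q] > 0 or in_burst(q)):
                    order.append(q)                 # credit <= 0 means a borrowed slot
                    self.credit[q] -= 1
                    backlog[q] -= 1
            return order

    s = BRWRRScheduler({"q_hi": 3, "q_lo": 1})
    print(s.schedule_round({"q_hi": 5, "q_lo": 2}, lambda q: q == "q_hi"))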

Dengyuan Xu, Huaxin Zeng, Chao Xu
VIOLIN: Virtual Internetworking on Overlay Infrastructure

We propose a novel application-level virtual network architecture called VIOLIN (Virtual Internetworking on OverLay INfrastructure). VIOLINs are isolated virtual networks created on top of an overlay infrastructure (e.g., PlanetLab). Entities in a VIOLIN include virtual end-hosts, routers, and switches implemented by software and hosted by physical overlay hosts. Novel features of VIOLIN include: (1) a VIOLIN is a “virtual world” with its own IP address space. All its computation and communications are strictly confined within the VIOLIN. (2) VIOLIN entities can be created, deleted, or migrated on-demand. (3) Value-added network services not widely deployed in the real Internet can be provided in a VIOLIN. We have designed and implemented a prototype of VIOLIN in PlanetLab.

Xuxian Jiang, Dongyan Xu

Session 9A: Hardware Architectures and Implementations

Increasing Software-Pipelined Loops in the Itanium-Like Architecture

The Itanium architecture (IPF) contains features such as register rotation to support efficient software pipelining. One of the drawbacks of software pipelining is its high register requirement, which may lead to failure when the registers provided by the architecture are insufficient. This paper evaluates the register requirements of software-pipelined loops on the Itanium architecture and presents two new methods that try to reduce the static general register requirements of software-pipelined loops, either by reducing the number of instructions in the loop body or by allocating unused rotating registers to variables that would otherwise use static registers. We have implemented our methods in the Open Research Compiler (ORC) targeted at Itanium processors, and experiments show that the number of software-pipelined loops in the NAS benchmarks increased. For some benchmarks, performance is improved by more than 18%.

Wenlong Li, Haibo Lin, Yu Chen, Zhizhong Tang
A Space-Efficient On-Chip Compressed Cache Organization for High Performance Computing

In order to alleviate the ever-increasing processor-memory performance gap of high-end parallel computers, on-chip compressed caches have been developed that can reduce the cache miss count and off-chip memory traffic by storing and transferring cache lines in a compressed form. However, we observed that their performance gain is often limited due to their use of the coarse-grained compressed cache line management which incurs internally fragmented space. In this paper, we present the fine-grained compressed cache line management which addresses the fragmentation problem, while avoiding an increase in the metadata size such as tag field and VM page table. Based on the SimpleScalar simulator with the SPEC benchmark suite, we show that over an existing compressed cache system the proposed cache organization can reduce the memory traffic by 15%, as it delivers compressed cache lines in a fine-grained way, and the cache miss count by 23%, as it stores up to three compressed cache lines in a physical cache line.

Keun Soo Yim, Jang-Soo Lee, Jihong Kim, Shin-Dug Kim, Kern Koh
A Real Time MPEG-4 Parallel Encoder on Software Distributed Shared Memory Systems

This paper is dedicated to developing a real-time MPEG-4 parallel encoder on software distributed shared memory systems. The performance of an MPEG-4 parallel encoder implemented on distributed systems is mainly determined by the latency of data synchronization and disk I/O, and by the cost of data computation. To reduce the impact of data synchronization latency, we devise a pipeline algorithm that minimizes the number of data synchronization points necessary for video encoding. In addition, we employ a master-slave node structure to overlap computation and I/O in order to alleviate the impact of I/O latency. We also propose a two-level partitioning method to minimize the cost of data computation and to overlap the encoding times of two different GOVs. We have implemented the proposed MPEG-4 encoder on a test bed called Teamster. The experimental results show that the proposed encoder successfully meets the real-time requirement with the support of these techniques on 32 SMP machines, each equipped with dual 1.5 GHz Itanium II processors and connected by Gigabit Ethernet.

Yung-Chang Chiu, Ce-Kuen Shieh, Jing-Xin Wang, Alvin Wen-Yu Su, Tyng-Yeu Liang
A Case of SCMP with TLS

As an alternative approach to chip design, the Single Chip Multi-Processor (SCMP) has long been a hot topic in microprocessor architecture research. It achieves higher performance by extracting thread-level parallelism (TLP), and thread-level speculation (TLS) is an important way to simplify TLP extraction. This paper presents a new SCMP architecture called Griffon, which aims at general-purpose applications. It implements thread partitioning at the assembly-language level, supports thread-level speculation with simple logic, and maintains data dependences using a dual-ring structure. Simulation and synthesis results show that Griffon can achieve ideal speedup with lower design complexity and modest additional hardware cost.

Jianzhuang Lu, Chunyuan Zhang, Zhiying Wang, Yun Cheng, Dan Wu

Session 9B: High Performance Computing and Architecture

SuperPAS: A Parallel Architectural Skeleton Model Supporting Extensibility and Skeleton Composition

The application of pattern-based approaches to parallel programming is an active area of research today. The main objective of pattern-based approaches is to facilitate the reuse of frequently occurring structures for parallelism, whereby a user supplies mostly the application-specific code components and the programming environment generates most of the code for parallelization. The Parallel Architectural Skeleton (PAS) is such a pattern-based parallel programming model and environment. The PAS model provides a generic way of describing the architectural/structural aspects of patterns in message-passing parallel computing. Application development using PAS is hierarchical, similar to conventional parallel programming using MPI, but with the added benefit of reusability and high-level patterns. Like most other pattern-based parallel programming models, the benefits of PAS were offset by some of its drawbacks, such as the difficulty of (1) extending PAS and (2) composing skeletons. SuperPAS is an extension of PAS that addresses these issues. SuperPAS provides a skeleton description language for generic PAS. Using SuperPAS, a skeleton developer can extend PAS by adding new skeletons to the repository (extensibility). SuperPAS also makes the PAS system more flexible by defining the composition of skeletons. In this paper, we describe SuperPAS and elaborate on its use through examples.

Mohammad Mursalin Akon, Dhrubajyoti Goswami, Hon Fung Li
Optimizing I/O Server Placement for Parallel I/O on Switch-Based Irregular Networks

In this paper, we study I/O server placement for optimizing parallel I/O performance on switch-based clusters, which typically adopt irregular network topologies to allow the construction of scalable systems with incremental expansion capability. Finding the optimal solution to this problem is computationally intractable. We quantify the number of messages travelling through each network link by a workload function, and develop three heuristic algorithms to find good solutions based on the values of the workload function. Our simulation results demonstrate the performance advantage of our algorithms over a number of algorithms commonly used in existing parallel systems. In particular, the load-balance-based algorithm is superior to the other algorithms in most cases, with improvement ratios of 10% to 95% in terms of parallel I/O throughput.

Yih-Fang Lin, Chien-Min Wang, Jan-Jan Wu
Designing a High Performance and Fault Tolerant Multistage Interconnection Network with Easy Dynamic Rerouting

Designing a reliable and high-performance multistage interconnection network (MIN) requires careful consideration of the following issues: (1) guaranteed fault tolerance; (2) simple schemes and hardware design for rerouting switches; (3) limited rerouting, resulting in a low collision ratio. In this paper, we present the High Performance Chained Multistage Interconnection Network (HPCMIN), which has one-fault tolerance, destination-tag routing for easy rerouting, and one rerouting hop, resulting in a low collision ratio. Our simulation results show that the HPCMIN yields a lower collision ratio than other dynamic rerouting networks. The HPCMIN embeds the indirect binary n-cube network (the ICube network), which is equivalent to many important MINs. Thus, the design methods used in the HPCMIN can be applied to these MINs so that they acquire the characteristics of the HPCMIN.

Ching-Wen Chen, Phui-Si Gan, Chih-Hung Chang
Evaluating Performance of BLAST on Intel Xeon and Itanium2 Processors

High-performance computing (HPC) has increasingly adopted clustered Intel architecture-based servers. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers based on three different Intel processor technologies: the PE1750, an IA-32-based Xeon system; the PE1850, which uses the newer 90 nm Xeon processor at higher frequencies; and the PE3250, an Itanium2-based system. BLAST (Basic Local Alignment Search Tool), a high-performance computing application used in biological research, is used as the workload for this study. The aim is to understand the performance benefits of the different features associated with each processor/platform technology for BLAST and to explain the observations using other standard micro-benchmarks such as STREAM and LMBench.

Ramesh Radhakrishnan, Rizwan Ali, Garima Kochhar, Kalyana Chadalavada, Ramesh Rajagopalan, Jenwei Hsieh, Onur Celebioglu

Session 9C: Distributed Processing and Architecture

PEZW-ID: An Algorithm for Distributed Parallel Embedded Zerotree Wavelet Encoder

The EZW-based image compression algorithm can achieve any specified compression ratio and is widely used in the field of image processing. However, the massive computation during the wavelet transform and the multiple scans of the transformed wavelet coefficient matrix during encoding are time consuming. Therefore, parallelizing both the wavelet transform and the zerotree encoding of EZW is necessary, and we present PEZW-ID, an algorithm for a distributed parallel embedded zerotree wavelet encoder. This paper describes the flow of the algorithm, proves its validity, and presents a performance analysis. Finally, experimental results on an MPP verify the parallelism and scalability of the algorithm.

Zhi-ming Chang, Yan-huang Jiang, Xue-jun Yang, Xiang-li Qu
Enhanced-Star: A New Topology Based on the Star Graph

The star graph, though an attractive alternative to the hypercube, has poor network bandwidth due to a lower number of channels compared to an equivalent hypercube. In order to alleviate this drawback, this paper presents a generalization of the star graph topology with richer connectivity, called the enhanced-star. We also examine some topological properties of the proposed network. Some useful operations, such as multi-node broadcasting, scattering, total exchange, and group communication, in the enhanced-star are also addressed, and we show that these operations can be completed faster in the enhanced-star than in the star.

Hamid Reza Tajozzakerin, Hamid Sarbazi-Azad
An RFID-Based Distributed Control System for Mass Customization Manufacturing

Mass customization production means producing customized products to meet individual customers’ needs with the efficiency of mass production. It introduces challenges such as a drastic increase in variety, very small batch sizes, and the random arrival of orders, and thus imposes requirements of flexibility and responsiveness on the manufacturing control system, which makes mass customization production a fertile ground for intelligent agents. With RFID (Radio Frequency Identification) integration, an applicable agent-based heterogeneous coordination mechanism is developed to fulfill these requirements. In this paper, we propose a distributed system framework in which a number of intelligent agents collaborate in a virtual market-like environment. The proposed price mechanism has the advantage of improving the production efficiency and the total profit that the manufacturer receives from a given set of jobs, by utilizing both resource competition and job competition. Based on the simulation results, we compare the total profit, average delay, and average waiting time of our agent-based price mechanism with those based on the widely exploited FIFO (First In First Out) and EDD (Earliest Due Date) scheduling mechanisms, and show that when resources have large queue lengths, the agent-based price mechanism significantly outperforms the other two.

Michael R. Liu, Q. L. Zhang, Lionel M. Ni, Mitchell M. Tseng
Event Chain Clocks for Performance Debugging in Parallel and Distributed Systems

In this paper, the Event Chain Clock synchronization algorithm is presented. This algorithm maintains a global physical clock that reflects both the partial order and the elapsed time of all events that have occurred. The algorithm, which repeats a few basic operations, has good convergence and is suitable for performance debugging of parallel programs.
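
A hedged sketch of a clock that carries both physical time and event ordering, in the spirit of this description (a simplified hybrid clock, not the authors' Event Chain algorithm):

    import time

    class ChainClock:
        """Per-process timestamp that (1) never runs behind the local physical
        clock and (2) is advanced past the timestamp of any message received,
        so causally related events stay ordered while elapsed time is preserved."""
        def __init__(self):
            self.ts = 0.0

        def local_event(self):
            self.ts = max(self.ts + 1e-6, time.time())
            return self.ts

        def send_event(self):
            return self.local_event()          # the timestamp travels with the message

        def recv_event(self, msg_ts):
            self.ts = max(self.ts, msg_ts) + 1e-6   # jump past the sender's timestamp
            self.ts = max(self.ts, time.time())     # but never behind physical time
            return self.ts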

Hongliang Yu, Jian Liu, Weimin Zheng, Meiming Shen
Backmatter
Metadata
Title
Parallel and Distributed Processing and Applications
edited by
Jiannong Cao
Laurence T. Yang
Minyi Guo
Francis Lau
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-30566-8
Print ISBN
978-3-540-24128-7
DOI
https://doi.org/10.1007/b104574