
2006 | Book

Parallel and Distributed Processing and Applications

4th International Symposium, ISPA 2006, Sorrento, Italy, December 4-6, 2006. Proceedings

Edited by: Minyi Guo, Laurence T. Yang, Beniamino Di Martino, Hans P. Zima, Jack Dongarra, Feilong Tang

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

Welcome to the proceedings of the 4th International Symposium on Parallel and Distributed Processing and Applications (ISPA 2006), which was held in Sorrento, Italy, December 4–6, 2006. Parallel computing has become a mainstream research area in computer science, and the ISPA conference has become one of the premier forums for the presentation of new and exciting research on all aspects of parallel and distributed computing. We are pleased to present the proceedings of ISPA 2006, which comprise a collection of excellent technical papers and keynote speeches. The accepted papers cover a wide range of exciting topics including architectures, languages, algorithms, software, networking and applications.

The conference continues to grow, and this year a record total of 277 manuscripts were submitted for consideration by the Program Committee. From these submissions the Program Committee selected only 79 regular papers for the program, which reflects an acceptance rate of 28%. An additional 10 workshops complemented the outstanding paper sessions.

The submission and review process worked as follows. Each submission was assigned to at least three Program Committee members for review. Each Program Committee member prepared a single review for each assigned paper or assigned a paper to an outside reviewer for review. In addition, the Program Chairs and Program Vice-Chairs read the papers when a conflicting review result occurred. Finally, after much discussion among the Program Chairs and Program Vice-Chairs, based on the review scores, the Program Chairs made the final decision. Given the large number of submissions, each Program Committee member was assigned roughly 7–12 papers.

Table of Contents

Frontmatter

Keynote Speech

The Walls of Computer Design

Most of the work of the computer architect today has to do with bridging the well-known gigantic gap between processor speed and memory access time. However, other significant challenges are looming on the horizon. For one thing, higher device density and smaller design rules are increasing the sensitivity of circuits to outside radiation events. Also, power issues are creating significant challenges, not only in terms of power consumption for embedded devices, but also in terms of cooling and power dissipation. Hence, while Moore’s law will continue to hold for the foreseeable future, processor speeds are not expected to follow suit, thereby requiring new architectural concepts.

Jean-Luc Gaudiot
The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results

Recent versions of microprocessors exhibit performance characteristics for 32-bit floating point arithmetic (single precision) that are substantially higher than for 64-bit floating point arithmetic (double precision). Examples include the Intel Pentium IV and M processors, AMD Opteron architectures, the IBM Cell processor and various GPUs. When working in single precision, floating point operations can be performed up to two times faster on the Pentium, and up to ten times faster on the Cell, than in double precision. The motivation for this work is to exploit single precision operations whenever possible and resort to double precision at critical stages, while attempting to provide the full double precision results. The results described here are fairly general and can be applied to various problems in linear algebra, such as solving large sparse systems using direct or iterative methods, and some eigenvalue problems. There are limitations to the success of this process, such as when the conditioning of the problem exceeds the reciprocal of the accuracy of the single precision computations; in that case the double precision algorithm should be used.
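The approach the abstract alludes to is commonly realized as mixed-precision iterative refinement: solve in single precision, then correct the solution using double-precision residuals. A minimal dense-matrix sketch (illustrative only; the function name is an assumption, and production code would compute one single-precision LU factorization and reuse it rather than calling `numpy.linalg.solve` each iteration):

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b using fast single-precision solves refined in double."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in single precision (the fast path).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                        # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction solved in single precision; real code would reuse
        # the LU factors instead of re-factoring each iteration.
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x
```

When the condition number of A approaches the reciprocal of single-precision accuracy (about 10^7), the refinement stops converging and a full double-precision solve is needed, matching the limitation stated in the abstract.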

Jack Dongarra
Implementing Tomorrow’s Programming Languages

Compilers are the critical translators that convert a human-readable program into the code understood by the machine. While this transformation is already sophisticated today, tomorrow's compilers face a tremendous challenge. There is a demand to provide languages that are much higher level than today's C, Fortran, or Java. On the other hand, tomorrow's machines are more complex than today's; they involve multiple cores and may span the planet via compute Grids. How can we expect compilers to provide efficient implementations? I will describe a number of related research efforts that try to tackle this problem. Composition builds a way towards higher-level programming languages. Automatic translation of shared-address-space models to distributed-memory architectures may lead to higher productivity than current message passing paradigms. Advanced symbolic analysis techniques equip compilers with capabilities to reason about programs in abstract terms. Last but not least, through auto-tuning, compilers make effective decisions, even though there may be insufficient information at compile time.

Rudolf Eigenmann
Trends in High Performance Computing for Industry and Research

Today, trends in High Performance Computing can be separated into two basic categories: one covers directions towards smaller, faster and hopefully cooler components; the other includes more innovative directions, which might lead to real paradigm shifts.

Frank Baetke
Emerging Technologies and Large Scale Facilities in HPC

Recent developments in microprocessor technology, high performance interconnects, storage subsystems, and grid infrastructure are bringing petaflop and petabyte capabilities to large scale architectures, which will characterize large HPC facilities in the near future.

Marco Briscolini

Track 1. Architectures and Networks

Session 1. Architectures

Architecture and Performance of Dynamic Offloader for Cluster Network

This paper presents the architecture of a dynamic offloading mechanism, called the Maestro Dynamic Offloading Mechanism (MDO), for the intelligent cluster network Maestro2. Using MDO, programmers can dynamically offload software modules to the network interface and the switch. MDO provides programmers with functional APIs with which they can develop offload modules efficiently. MDO also includes a wrapper library that enables the offload modules to be executed on the host processors as well as on the network devices. Performance evaluation results show that the performance of collective communications can be improved by offloading communication procedures to the network devices using MDO. The overhead of MDO and the traffic on the PCI bus are also discussed.

Keiichi Aoki, Hiroki Maruoka, Koichi Wada, Masaaki Ono
Session Key Forwarding Scheme Based on AAA Architecture in Wireless Networks

Due to the increasing number of portable devices, support for quality of service (QoS) and security have become significant issues in wireless networks. For this reason, the Authentication, Authorization, and Accounting (AAA) architecture has been proposed. However, the AAA architecture suffers from inefficient authentication procedures when a mobile node (MN) roams to a foreign network, because it assumes that the only reliable source of authentication is an AAA server in the home network. In this paper, we propose a session key forwarding scheme that can reduce authentication time and the authentication failure rate while maintaining the security level. The performance results show that the proposed scheme reduces authentication latency by up to 6.16% and the authentication failure rate by up to 23.9% compared to the standard AAA architecture.

Jong-Hyouk Lee, Tai-Myoung Chung
Dynamic Anchor Based Mobility Management Scheme for Mobile IP Networks

In this paper, we introduce a dynamic anchor based mobility management scheme for mobile networks, in which different hierarchies are dynamically set up for different users, and multiple tunnels are converged into a single tunnel upon lifetime refresh without excessive packet loss. To justify the effectiveness of our proposed scheme, we develop an analytic model to evaluate its signaling cost and handoff delay compared with legacy schemes. Our performance results show that the proposed dynamic anchor based mobility management scheme reduces system signaling cost and has shorter handoff delay under various scenarios.

Jae-Pil Yoo, Kee-Cheon Kim
Forest Search: A Paradigm for Faster Exploration of Scale-Free Networks

Scale-free networks naturally model wide area networks like the WWW. We consider the problem of fast exploration of huge scale-free networks using small memory space. Although there are many search algorithms for exploring an unknown graph, they require much space or time. For example, depth first search requires memory for all the nodes in the worst case, and the average number of steps of a random walk is O(n^3), where n is the size of the graph. Under assumptions reflecting WWW applications, we propose a new exploration paradigm called forest search, particularly designed for scale-free networks, and theoretically evaluate its space complexity. We also demonstrate its superiority over random walk based search algorithms by conducting simulations.

Yuichi Kurumida, Hirotaka Ono, Kunihiko Sadakane, Masafumi Yamashita
Choosing a Load Balancing Scheme for Agent-Based Digital Libraries

Digital Libraries (DLs) provide an important application area for deploying mobile agents, and with the increase in content being made available within such libraries, performance concerns are becoming significant. Current DLs often involve content servers which are geographically distributed, often necessitating information from these servers to be aggregated to answer a single query. Encoding a query as a mobile agent provides a useful means to implement such queries. However, a significant concern when this approach is adopted is the need to load balance multiple sets of such agents across the available servers. This paper focuses on an attempt to answer which load balancing scheme should be applied to an agent-based DL. A demonstration of a load balancing scheme based on the guidelines proposed in this paper with reference to Synthetic Aperture Radar Atlas (SARA) DL, consisting of multi-spectral images of the Earth, is presented. This particular DL involves processing image content, but also involves aggregation of data acquired from the images with text-based content available elsewhere.

Georgousopoulos Christos, Omer F. Rana
A Novel Software Solution for Localized Thermal Problems

In this paper, we propose a temperature-aware DFS (Dynamic Frequency Scaling) technique using the performance counters that are already embedded in commercial microprocessors. Using performance counters and simple regression analysis, we can predict the localized temperature and schedule tasks efficiently with temperature in mind. The proposed technique is especially beneficial for potential localized thermal problems, which are inevitable due to the limited number of costly CMOS thermal sensors. When localized thermal problems that were not detected by thermal sensors are found after fabrication, they can be avoided by the proposed software solution without re-fabrication costs. The evaluation results show that the proposed technique is comparable to a DFS technique using CMOS thermal sensors.
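The abstract does not specify the regression model; as a hedged illustration, a least-squares linear model mapping counter readings to temperature, combined with a simple frequency-selection policy, might look like the sketch below (all names, thresholds and the one-level-per-5-degrees policy are hypothetical, not the paper's):

```python
import numpy as np

def fit_thermal_model(counter_samples, temps):
    # counter_samples: (k, m) array of m performance-counter readings per
    # sample; temps: (k,) measured temperatures. Least-squares linear fit.
    X = np.column_stack([counter_samples, np.ones(len(temps))])  # bias term
    coef, *_ = np.linalg.lstsq(X, temps, rcond=None)
    return coef

def predict_temp(coef, counters):
    # Apply the fitted linear model to a new counter reading.
    return float(np.dot(np.append(counters, 1.0), coef))

def choose_frequency(pred_temp, levels=(2.0, 1.6, 1.2), threshold=80.0):
    # Hypothetical policy: drop one frequency level for every 5 degrees C
    # the predicted temperature exceeds the threshold.
    if pred_temp <= threshold:
        return levels[0]
    idx = min(len(levels) - 1, int((pred_temp - threshold) // 5) + 1)
    return levels[idx]
```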

Sung Woo Chung, Kevin Skadron
The Optimum Network on Chip Architectures for Video Object Plane Decoder Design

On-Chip Networks (OCN) have been proposed as a new methodology for addressing the design challenges of future massively integrated systems at the nanoscale. In this paper, three heterogeneous Tree-based network topologies are proposed as OCN architectures for a Video Object Plane Decoder (VOPD). The topologies are designed to maximize system throughput. This paper also evaluates the proposed topologies by comparing them to conventional topologies such as the 2-D Mesh and Fat-Tree with respect to throughput, power consumption and size. We use the Orion power modelling tool to calculate the static power, area, and dynamic energy of the three topologies. The experimental results show that our Tree-based topologies offer throughputs similar to the Fat-Tree and much higher than the 2-D Mesh, while using less chip area and power.

Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi
A Hardware NIC Scheduler to Guarantee QoS on High Performance Servers

In this paper we present the architecture and implementation of a hardware NIC scheduler to guarantee QoS on servers for high speed LAN/SAN. Our proposal employs a programmable logic device based on an FPGA in order to store and update connection states, and to decide which data stream is to be sent next. The network architecture is connection-oriented and reliable, based on credit flow control. The architecture scales from 4 to 32 streams using a Xilinx Virtex 2000E. It supports links with speeds in the order of Gbps while maintaining the delay and jitter constraints for the QoS streams.

J. M. Claver, M. Canseco, P. Agustí, G. León
Studying Several Proposals for the Adaptation of the DTable Scheduler to Advanced Switching

Advanced Switching (AS) is a new fabric-interconnect technology that further enhances the capabilities of PCI Express. On the other hand, the provision of Quality of Service (QoS) in computing and communication environments is currently the focus of much discussion and research in industry and academia.

A key component for networks with QoS support is the egress link scheduling algorithm. AS defines a table-based scheduler that is simple to implement and can offer good latency bounds with a fixed packet size. However, it does not work properly with variable packet sizes and faces the problem of bounding the bandwidth and latency assignments.

In this paper we propose several possible modifications to the original AS table scheduler in order to implement the Deficit Table (DTable) scheduler. This scheduler works properly with variable packet sizes and makes it possible to partially decouple the bandwidth and latency assignments.

Raúl Martínez, Francisco J. Alfaro, José L. Sánchez
Asymmetrical SSL Tunnel Based VPN

An Asymmetric SSL Tunnel (AST) based Virtual Private Network is presented as an inexpensive solution for large scale SSL VPNs. In this solution, a portion of the SSL/TLS computational load is transferred to idle internal application servers, so that the VPN server is no longer the bottleneck of the VPN system. This paper analyzes the performance advantage of the asymmetric SSL tunnel over the traditional SSL tunnel, and discusses a secret management scheme for AST that meets enhanced security requirements and synchronizes the cipher specs of multiple endpoints. Finally, a kernel optimization algorithm is introduced. AST is implemented in OpenVPN, originally a stable traditional SSL VPN solution. Experiments show that the overall throughput of OpenVPN can be greatly improved after AST is adopted.

Jingli Zhou, Hongtao Xia, Jifeng Yu, Xiaofeng Wang

Session 2. Networks

Multihomed Routing in Multicast Service Overlay Network

Most existing overlay multicast approaches assume a single-homed model, where each overlay node has a unique access link in the physical network. This assumption is usually not applicable to the routing problem in multicast service overlay networks, which may comprise multihomed multicast service nodes (MSNs). We address this issue by proposing both a multihomed proxy model and a two-phase solution to the routing problem under this model. First we choose a proper physical path for each overlay link based on the current network status and the routing policy. Then the routing process calculates the multicast tree according to the previous selection. Through simulations, we show that our approach makes overlay multicast routing more efficient in a multihomed environment.

Xiao Chen, Weinong Wang
Region-Based Multicast Routing Protocol with Dynamic Address Allocation Scheme in Mobile Ad-Hoc Networks

A Mobile Ad hoc Network (MANET) is a network in which mobile nodes can freely and dynamically communicate with each other without the support of any infrastructure. Due to node movement, the network topology can change randomly and unpredictably. Previously, the address used as an identifier also served to establish a delivery route. But ensuring that each mobile node has a unique address is difficult due to the lack of a centralized address protocol such as DHCP. In this paper, we propose a region-based dynamic address allocation scheme that can simplify route establishment and maintenance. Routing is done by a short distance vector algorithm based on a proactive, flat topology routing protocol. The basic idea of our address allocation scheme is to separate node identity and node address. The node address indicates the node's current location and is allotted by its parent node with a cascading address scheme. In the evaluation, we examined our proposed scheme from the point of view of the number of control packets needed to establish delivery trees and the connection probability. From simulation results, we confirm that the proposed scheme offers substantially better performance.

Backhyun Kim, Iksoo Kim
A Novel Approach for Topology-Aware Overlay Multicasting

Most existing overlay multicast approaches avoid considering any network layer support, whether or not it is available. This design principle greatly increases the complexity of the routing algorithms and makes the overlay topologies incompatible with the underlying network. To address these issues, we propose TOMIMN, a novel overlay multicast approach that exploits cooperation between end-hosts and IP multicast routers to construct a topology-aware overlay tree. Through a small modification to PIM-SM, a multicast router is able to receive registrations from nearby group members and redirect passing-by join requests to them. Thanks to the multicast router's support, TOMIMN efficiently organizes its group members into an overlay multicast tree that matches the physical network topology well.

Xiao Chen, Huagang Shao, Weinong Wang
Limitations of Fixed Timers for Wireless Links

Traditional reliable link layer protocols set their fixed retransmission timers under the assumption that they operate in isolation over the link. Emerging wireless networks however allow multiple link layer sessions to dynamically share the link. To assess the impact of this development, we examine the performance of Web Browsing over a Selective Repeat protocol with fixed retransmission timers, showing that the optimal retransmission timer values depend on the level of contention. We therefore propose an adaptive Selective Repeat protocol that modifies its retransmission timers based on prevailing conditions. Our measurements show that this adaptive scheme provides excellent Web Browsing performance regardless of the level of contention, under two very different wireless error models.

George Xylomenos
Performance Analysis of IEEE 802.11e WLANs with Hidden Stations and Heterogeneous Traffic

The IEEE 802.11e Medium Access Control (MAC) mechanism has recently been proposed for supporting differentiated Quality-of-Service (QoS) in Wireless Local Area Networks (WLANs). Heterogeneous traffic generated by wireless multimedia applications, and hidden stations arising from wireless transmission power constraints, have a significant impact on the performance of MAC protocols. This study performs extensive simulation experiments and conducts a comprehensive performance evaluation of the IEEE 802.11e Enhanced Distributed Channel Access (EDCA) protocol in WLANs with hidden stations and heterogeneous traffic. For this purpose, non-bursty Poisson, bursty ON/OFF, and fractal-like self-similar processes with high variability are used to model and generate heterogeneous network traffic. The performance results show that the protocol is able to achieve differentiated throughput, access delay and medium utilization. However, hidden stations can degrade the throughput and medium utilization, and greatly increase the medium access delay, in the presence of heterogeneous traffic.

Mamun I. Abu-Tair, Geyong Min
Study on High Performance IPv4/IPv6 Transition and Access Service

Transition to IPv6 has been recognized as the trend of the future Internet. Since a huge amount of resources has been invested in the current IPv4-based Internet, how to smoothly transition the Internet from IPv4 to IPv6 is a very important research topic. This paper surveys the state of the art in IPv6 transition, summarizes and compares different IPv6 transition mechanisms and scenarios, discusses the security issues in IPv6 transition, and presents possible directions for future research.

Xiaoxiang Leng, Jun Bi, Miao Zhang
Limiting the Effects of Deafness and Hidden Terminal Problems in Directional Communications

It is known that wireless ad hoc networks employing omnidirectional communications suffer from poor network throughput due to inefficient spatial reuse. Although the use of directional communications is expected to provide significant improvements, the lack of efficient mechanisms to deal with deafness and hidden terminal problems makes it difficult to fully explore its benefits. The main contribution of this work is to propose a Medium Access Control (MAC) scheme which aims to lessen the effects of deafness and hidden terminal problems in directional communications without precluding spatial reuse. Unlike other proposals that focus on exploring the characteristics of the physical layer, the proposed MAC protocol relies on simple mechanisms that can be easily coupled with a directional antenna without requiring major modifications to the current MAC standard.

J. L. Bordim, T. Hunziker, K. Nakano
Enforcing Dimension-Order Routing in On-Chip Torus Networks Without Virtual Channels

For simple tile-based architectures, such as small reconfigurable processor arrays, a virtual-channel mechanism, which requires additional logic and pipeline stages, is one of the crucial cost factors in the implementation of their on-chip routers. To guarantee deadlock-free packet transfer with no virtual channels on tori, we propose a non-minimal strategy consistent with the rule of the dimension-order routing (DOR) algorithm. Since embedded streaming applications usually generate predictable data traffic, the path set can be customized to the traffic from alternative DOR paths. Although the proposed strategy does not use any virtual channels, it achieves almost the same performance as virtual-channel routers on tori in 11 of 18 application traces.

Hiroki Matsutani, Michihiro Koibuchi, Hideharu Amano
A Distributed Backbone Formation Algorithm for Mobile Ad Hoc Networks

Construction of a backbone architecture is an important issue in mobile ad hoc networks (MANETs) to ease routing and resource management. We propose a new fully distributed algorithm for backbone formation in MANETs that constructs a directed ring architecture. We describe the operation of the algorithm, analyze its message complexity and provide results from the ns2 simulation environment. Our results confirm that the algorithm is scalable in terms of its running time and round-trip delay against mobility, surface area, number of nodes and number of clusterheads.

Orhan Dagdeviren, Kayhan Erciyes
A Service Differentiation Mechanism for Improving the Performance of IEEE 802.15.4 Sensor Networks

In sensor networks, devices in home networking and industrial applications generate data of different sizes according to their roles in the network. In this paper, we propose a mechanism that provides differentiated services for IEEE 802.15.4 sensor networks to improve the total throughput and the fairness of the channel. To provide differentiated services for each device, our mechanism adds backoff periods of different sizes according to the size of the packet generated by the device. A mathematical model based on a discrete-time Markov chain is presented and analyzed to measure the performance of the proposed mechanism. Simulation results are also given to verify the accuracy of the analytical model. Finally, the analytical results show an improvement in the throughput and fairness of a network that applies our mechanism.

Tae-Yoon Kim, Sungkwan Youm, Eui-Jik Kim, Chul-Hee Kang
Randomized Leader Election Protocols in Noisy Radio Networks with a Single Transceiver

In this work, we present leader election protocols for single-hop, single-channel noisy radio networks that do not have collision detection (CD) capabilities. Most leader election protocols presented so far assume that every station can transmit and monitor the channel at the same time, which requires every station to be equipped with two transceivers. This assumption, however, is unrealistic for most mobile stations due to constraints in cost, size, and energy dissipation. Our main contribution is to show that it is possible to elect a leader in an anonymous radio network where each station is equipped with a single transceiver. We first present a leader election protocol for the case where the number n of stations is known beforehand. The protocol runs in O(log f) time slots with probability at least 1 - 1/f for any f > 1. We then present a leader election protocol for the case where n is not known beforehand but an upper bound u of n is known. This protocol runs in O(log f log u) time slots with probability at least 1 - 1/f for any f > 1. We also prove that these protocols are optimal. More precisely, we show that any leader election protocol that elects a leader with probability at least 1 - 1/f must run in Ω(log f) time slots if n is known, and in Ω(log f log u) time slots if only an upper bound u of n is known.
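The authors' protocol is more refined than this, but the core randomized idea can be simulated simply: when n is known, each station transmits with probability 1/n per slot, and a slot in which exactly one station transmits elects that station. A simplified sketch (ignoring noise, collisions at the receiver, and energy accounting; names are illustrative):

```python
import random

def elect_leader(n, rng, max_slots=10_000):
    # Each of the n stations independently transmits with probability 1/n
    # in each slot; a slot with exactly one transmitter elects that station.
    for slot in range(1, max_slots + 1):
        transmitters = [s for s in range(n) if rng.random() < 1.0 / n]
        if len(transmitters) == 1:
            return transmitters[0], slot     # leader id, slots used
    return None, max_slots                   # gave up (vanishingly unlikely)
```

A slot succeeds with probability n * (1/n) * (1 - 1/n)^(n-1), which is roughly 1/e, so on average only about three slots are needed regardless of n.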

Jacir Luiz Bordim, Yasuaki Ito, Koji Nakano
A Hybrid Intelligent Preventive Fault-Tolerant QoS Unicast Routing Scheme in IP over DWDM Optical Internet

IP over DWDM (Dense Wavelength Division Multiplexing) optical Internet is one of the mainstream networking techniques for the NGI (Next Generation Internet) backbone, and how to improve its fault-tolerance capability and QoS provision has become critical. Fault-tolerant QoS routing is one of the effective solutions. In this paper, a preventive fault-tolerant QoS unicast routing scheme is proposed based on a hybrid intelligent algorithm, taking advantage of both the ant-colony algorithm and the genetic algorithm. Simulation results show that the proposed scheme is both feasible and effective, with better performance.

Xingwei Wang, Shuxiang Cai, Nan Gao, Min Huang

Track 2. Languages and Algorithms

Session 3. Languages

From XML Specifications to Parallel Programs

Skeleton-based libraries are considered one of the alternatives for reducing the distance between end users and parallel architectures. We propose a general development methodology that allows for the automatic derivation of parallel programs, assuming the existence of general structures such as skeletons. We propose the introduction of a new, high-level abstraction layer that allows the user to abstract problem specifications away from particular skeleton languages or libraries. The result is a tool that allows for the generation of parallel codes through successive transformations of this high-level specification without any loss of efficiency. We apply the technique to the automatic generation of parallel programs for Dynamic Programming problems.

Ignacio Peláez, Francisco Almeida, Daniel González
Managing Grid Computations: An ORC-Based Approach

In this paper a model of grid computation that supports both heterogeneity and dynamicity is presented. The model presupposes that user sites contain software components awaiting execution on the grid. User sites and grid sites interact by means of managers which control dynamic behaviour. The orchestration language ORC [9,10] offers an abstract means of specifying operations for resource acquisition and execution monitoring while allowing for the possibility of non-responsive hardware. It is demonstrated that ORC is sufficiently expressive to model typical kinds of grid interactions.

A. Stewart, J. Gabarró, M. Clint, T. Harmer, P. Kilpatrick, R. Perrott
Adaptive Technique for Automatic Communication Access Pattern Discovery Applied to Data Prefetching in Distributed Applications Using Neural Networks and Stochastic Models

Distributed computing performance is usually limited by the data transfer rate and access latency. Techniques such as data caching and prefetching were developed to overcome this limitation. However, such techniques require knowledge of application behavior in order to be effective. In this sense, we propose new application communication behavior discovery techniques that, by classifying and analyzing application access patterns, are able to predict future application data accesses. The proposed techniques use stochastic methods for application state change prediction and neural networks for access pattern discovery based on execution history, and are evaluated using the NAS Parallel Benchmark suite.

Evgueni Dodonov, Rodrigo Fernandes de Mello, Laurence Tianruo Yang
Process Scheduling Using Ant Colony Optimization Techniques

The growing availability of low-cost microprocessors and the evolution of computing networks have enabled the construction of sophisticated distributed systems. The computing capacity of these systems motivated the adoption of clusters to build high performance solutions. Improving process scheduling over clusters has originated several proposals of scheduling and load balancing algorithms. These proposals have motivated this work, which defines, evaluates and implements a new load balancing algorithm for heterogeneous-capacity clusters. This algorithm, named Ant Scheduler, uses concepts from ant colonies to develop optimization solutions. Experimental results comparing Ant Scheduler with other approaches from the literature show its ability to minimize process mean response times, improving performance.
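Ant Scheduler's exact rules are not given in the abstract; a generic ant-colony-style dispatcher, where assignment probabilities combine pheromone trails with a free-capacity heuristic and pheromone is evaporated and reinforced after each placement, could be sketched as follows (all names and parameters are illustrative, not the paper's algorithm):

```python
import random

def ant_schedule(tasks, capacities, rng, evaporation=0.1):
    # tasks: list of task costs; capacities: per-node processing capacity.
    n = len(capacities)
    pheromone = [1.0] * n
    load = [0.0] * n          # normalized load per node
    placement = []
    for cost in tasks:
        # Desirability of each node: pheromone times a free-capacity heuristic.
        scores = [pheromone[i] * capacities[i] / (1.0 + load[i])
                  for i in range(n)]
        # Roulette-wheel selection proportional to the scores.
        r, acc, chosen = rng.random() * sum(scores), 0.0, n - 1
        for i, s in enumerate(scores):
            acc += s
            if r <= acc:
                chosen = i
                break
        load[chosen] += cost / capacities[chosen]
        placement.append(chosen)
        # Evaporate all trails, then reinforce the chosen node more strongly
        # the less loaded it remains.
        for i in range(n):
            pheromone[i] *= (1 - evaporation)
        pheromone[chosen] += 1.0 / (1.0 + load[chosen])
    return placement, load
```

Faster nodes accumulate both better heuristic scores and stronger trails, so they attract proportionally more tasks, which is the load-balancing effect the abstract describes.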

Bruno Rodrigues Nery, Rodrigo Fernandes de Mello, André Carlos Ponce de Leon Ferreira de Carvalho, Laurence Tianruo Yang
Interlocking Control by Distributed Signal Boxes: Design and Verification with the SPIN Model Checker

Control systems are required to comply with certain safety and liveness correctness properties. In most cases, such systems have an intrinsic degree of complexity and it is not easy to formally analyze them, due to the resulting large state space. Also, exhaustive simulation and testing can easily miss system errors, whether they are life-critical or not. In this work, we introduce an interlocking control approach that is based on the use of so-called Distributed Signal Boxes (DSBs). The proposed control design is applied to a railway interlocking problem, and more precisely to the Athens underground metro system. Signal boxes correspond to the network's interlocking points and communicate only with their neighbor signal boxes. Communication takes place by the use of rendezvous communication channels. This design results in a simple interlocking control approach that, compared to other centralized solutions, produces a smaller and easier to analyze state space. Formal analysis and verification is performed with the SPIN model checker.

Stylianos Basagiannis, Panagiotis Katsaros, Andrew Pombortsis
A Delay Time-Based Peak Load Control for Stable Performance

Integration applications, such as EAI (Enterprise Application Integration) and B2BI (Business-To-Business Integration), require a reliable B2B integration framework that can stably process massive I/O transactions under overload. This paper proposes a delay time-based peak load control for stable performance and describes, as an example, how the suggested pattern can be applied to B2BI. The pattern uses a delay time algorithm to control the heavy peak load caused by many requests arriving within a short period of time. According to our experimental results, the proposed delay time algorithm can stably process heavy load beyond the saturation point and is effective in controlling peak loads.
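The paper's exact delay time algorithm is not specified in the abstract; the sketch below only illustrates the general pattern of delaying admission once the number of in-flight requests passes a saturation point, so that throughput stays stable instead of collapsing under a burst. All names and constants are assumptions for illustration.

```python
def admission_delay(active_requests, saturation_point,
                    base_delay_ms=10, factor=1.5):
    """Return a delay (in ms) to impose before admitting a new request.
    Below the saturation point requests are admitted immediately;
    above it the delay grows geometrically with the excess load.
    Illustrative sketch; constants are assumptions, not from the paper."""
    if active_requests < saturation_point:
        return 0
    excess = active_requests - saturation_point + 1
    return base_delay_ms * (factor ** excess)
```

Spreading excess arrivals out in time this way keeps the server operating near its saturation throughput rather than thrashing past it.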

Yonghwan Lee, Eunmi Choi, Dugki Min
Interference Aware Dynamic Subchannel Allocation in a Multi-cellular OFDMA System Based on Traffic Situation

This paper presents the development and evaluation of a dynamic subchannel allocation scheme for downlink multi-cellular Orthogonal Frequency Division Multiple Access (OFDMA) systems. In the considered system, each Access Point (AP) and its associated Mobile Terminals (MTs) do not operate on a frequency channel with a fixed bandwidth; rather, the channel bandwidth for each AP is dynamically adapted according to the traffic load. The subchannel assignment procedure is based on quality estimates derived from interference measurements and the current traffic load. The developed dynamic subchannel allocation ensures Quality of Service (QoS), better traffic adaptability and higher spectrum efficiency with less computational complexity.

Banani Roy, Chanchal Kumer Roy, Michael Einhaus
Sized-Based Replacement-k Replacement Policy in Data Grid Environments

Data grid computing provides geographically distributed storage resources for solving computational problems with large-scale data. Unlike cache replacement in virtual memory or web caching, finding an optimal file replacement policy for data grids is an important problem in its own right because file sizes are very large. Traditional file replacement policies such as LRU (Least Recently Used), LCB-K (Least Cost Beneficial based on K), EBR (Economic-based cache replacement) and LVCT (Least Value-based on Caching Time) must predict the near future or require additional resources for file replacement. In this paper, we propose the SBR-k (Sized-based Replacement-k) policy to overcome these problems. SBR-k is a replacement policy that bases eviction decisions on the size of the incoming file. Simulation results show that the proposed policy performs better than the traditional policies.
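As a rough illustration of size-aware replacement (not the authors' exact SBR-k rule, which the abstract does not detail), the sketch below frees space for an incoming file by preferring a victim whose size most closely matches the shortfall, so that many small files are not evicted for one large request, nor a huge file for a small one:

```python
def evict_for(cache, capacity, new_size):
    """cache: dict mapping file name -> size. Evict files until new_size
    fits within capacity; return the list of evicted names, or None if
    the file can never fit. Illustrative size-aware heuristic only."""
    if new_size > capacity:
        return None
    free = capacity - sum(cache.values())
    evicted = []
    while free < new_size:
        need = new_size - free
        # Prefer a file at least as large as the shortfall and closest
        # to it; otherwise fall back to the largest remaining file.
        victim = min(cache, key=lambda n: (cache[n] < need,
                                           abs(cache[n] - need)))
        free += cache[victim]
        evicted.append(victim)
        del cache[victim]
    return evicted
```

This keeps the number of evictions, and the amount of needlessly discarded cached data, small, which matters when each grid file is gigabytes in size.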

HongJin Park, ChangHoon Lee
Barrier Elimination Based on Access Dependency Analysis for OpenMP

In this paper, we propose a new compiler technique for eliminating barrier synchronizations. In our approach, the compiler collects information about array accesses and analyzes data dependencies. If there is no dependency, the barrier synchronization can be eliminated. Moreover, even if a dependency is detected, there are cases in which the barrier synchronization can be replaced with send-receive pairs of communications. For evaluation, we executed two application programs, Jacobi Method and Gaussian Elimination, on a PC cluster with barrier elimination applied. For comparison, we also executed the programs before elimination of barrier synchronizations. With barrier elimination, 1) the execution time is always reduced, and 2) as the number of processors increases, the reduction ratio of the execution time also increases. For 16 processors, we obtained reduction ratios of 19.00% and 50.36% for Jacobi Method and Gaussian Elimination, respectively.

Naoki Yonezawa, Koichi Wada, Takahiro Aida

Session 4. Algorithms

Parallelization Algorithms for Three-Body Interactions in Molecular Dynamics Simulation

Two force decomposition algorithms are proposed for parallelizing three-body interactions in Molecular Dynamics (MD) simulations. The first algorithm divides the entire 3D force matrix into equal-sized force cubes that are assigned to parallel processors. In the second strategy, the force matrix is decomposed into slices of two-dimensional force matrices, and these slices are distributed among processors cyclically. The proposed decomposition algorithms are implemented using MPI and tested in computational experiments. The performance of the proposed decomposition methods is studied and compared with a theoretical analysis of the computational load. Both the theoretical prediction and the computational experiments demonstrate that load balance is a key factor in the parallel performance of the examined system, and that the cyclic force decomposition algorithm produces reasonably good overall performance.

Jianhui Li, Zhongwu Zhou, Richard J. Sadus
An Efficient Distributed Algorithm for Mining Association Rules

Association Rule Mining (ARM) is an active data mining research area. However, most ARM algorithms cater to a centralized environment where no external communication is required. Distributed Association Rule Mining (DARM) algorithms aim to generate rules from different datasets spread over various geographical sites; hence, they require external communication throughout the entire process. A direct application of sequential algorithms to distributed databases is not effective, because it incurs a large communication overhead. DARM algorithms must therefore reduce communication costs. In this paper, a new solution is proposed to reduce the size of message exchanges. Our solution also reduces the average size of transactions and datasets, which shortens scan time and is very effective in increasing the performance of the proposed algorithm. Our performance study shows that this solution performs better than the direct application of a typical sequential algorithm.

Zahra Farzanyar, Mohammadreza Kangavari, Sattar Hashemi
Adaptive Algorithms Using Bounded Memory Are Inherently Non-uniform

Distributed protocols that run in dynamic environments such as the Internet are often unable to use an upper bound on the number of potentially participating processes. In these settings, adaptive and uniform algorithms are desirable, where the step complexity of all operations is a function of the number of concurrently participating processes (adaptive) and the algorithm does not need to know an upper bound on the number of participating processes (uniform). Adaptive algorithms, however, are generally not adaptive with respect to their memory consumption: if no upper bound on the number of participating processes is known in advance, they require unbounded MWMR registers and an unbounded number of such registers (even if only finitely many distinct processes appear), making them impractical for real systems. In this paper we ask whether this must be the case: can adaptive algorithms where no upper bound on the number of participating processes is known in advance be uniformly implemented with finite memory (if only finitely many distinct processes keep reappearing)? We show that in the dynamic setting it is impossible to implement long-lived adaptive splitters, collect, and renaming with infinitely many bounded MWMR registers, making such adaptive algorithms impractical in dynamic settings. On the positive side, we provide algorithms that implement a long-lived uniform adaptive splitter if unbounded registers are available, and that implement a non-uniform adaptive splitter with finitely many bounded registers if an upper bound on the number of participating processes is known in advance.

Burkhard Englert
On the Implementation of Parallel Shortest Path Algorithms on a Supercomputer

We investigate the practical merits of a parallel priority queue through its use in the development of a fast and work-efficient parallel shortest path algorithm, originally designed for an EREW PRAM. Our study reveals that an efficient implementation on a real supercomputer requires considerable effort to reduce the communication overhead (which in theory is assumed to take constant time). It turns out that the most crucial part of the implementation is the mapping of the logical processors to the physical processing nodes of the supercomputer. We achieve the required efficient mapping through a new graph-theoretic result of independent interest: computing a Hamiltonian cycle on a directed hypertorus. No such algorithm was previously known for directed hypertori. Our Hamiltonian cycle algorithm allows us to considerably improve the communication cost and thus the overall performance of our implementation.

Gabriele Di Stefano, Alberto Petricola, Christos Zaroliagis
An Efficient K-Coverage Eligibility Algorithm on Sensor Networks

Wireless sensor networks are employed in many critical applications. The K-coverage configuration is usually adopted to guarantee the quality of surveillance. A sensor node can be determined to be ineligible to stay active when its sensing range is K-covered. Although many algorithms have been proposed to reduce the complexity of the K-coverage configuration, accuracy cannot be preserved when the number of deployed sensor nodes increases. In this paper, we propose an efficient K-coverage eligibility (EKE) algorithm to accurately and cheaply determine the eligibility of each sensor node. The algorithm focuses on the regions having a lower degree of coverage for each sensor node. Therefore, the complexity of the EKE algorithm is reduced substantially while retaining accuracy. Experimental studies indicated that the computational cost of the EKE algorithm could be reduced by up to 89% and that the percentage of correct decisions was larger than 90%.

Meng-Chun Wueng, Shyh-In Hwang
A Distributed Algorithm for a b-Coloring of a Graph

A b-coloring of a graph is a proper coloring in which each color admits at least one node (called a dominating node) adjacent to every other used color. Such a coloring yields a partitioning of the graph into clusters, in which every cluster has a clusterhead (the dominating node) adjacent to every other cluster. Such a decomposition is very interesting for large distributed systems, networks, etc. In this paper we present a distributed algorithm to compute a b-coloring of a graph, and we propose an application to routing in networks to illustrate our algorithm.
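The b-coloring property described above can be checked directly from its definition; the small sketch below (an illustration of the definition, not the paper's distributed algorithm) verifies that a coloring is proper and that every used color class contains a dominating node:

```python
def is_b_coloring(adj, color):
    """adj: dict node -> set of neighbours; color: dict node -> int.
    True iff `color` is a proper coloring and every used color has a
    node adjacent to all other used colors (a dominating node)."""
    # Proper coloring: no edge joins two nodes of the same color.
    for u, nbrs in adj.items():
        if any(color[u] == color[v] for v in nbrs):
            return False
    used = set(color.values())
    for c in used:
        has_dominating = any(
            color[u] == c and (used - {c}) <= {color[v] for v in adj[u]}
            for u in adj
        )
        if not has_dominating:
            return False
    return True
```

In the clustering reading of the abstract, the dominating node of each color class is exactly the clusterhead that can reach every other cluster in one hop.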

Brice Effantin, Hamamache Kheddouci
Topology-Sensitive Epidemic Algorithm for Information Spreading in Large-Scale Systems

Epidemic algorithms are an emerging technique that has recently gained popularity as a potentially effective solution for disseminating information in large-scale network systems. For some application scenarios, efficient and reliable data dissemination to all or a group of nodes in the network is necessary to provide the communication services within the system. These studies may have a large impact on communication networks, where epidemic-like protocols are becoming common practice for message delivery, collaborative peer-to-peer applications, distributed database systems, routing in Mobile Ad Hoc networks, etc. In this paper we show, through various simulations, that an epidemic spreading process can be highly influenced by the network topology. We also provide a comparative performance analysis with respect to global parameters such as network diameter and degree of connectivity. Based on this analysis, we propose a new epidemic strategy that takes the topological structure of the network into account. The results show that the proposed epidemic algorithm outperforms a classical timestamped anti-entropy epidemic algorithm in terms of the number of sessions required to reach a consistent state in the network system.

J. Acosta-Elías, J. M. Luna-Rivera, M. Recio-Lara, O. Gutiérrez-Navarro, B. Pineda-Reyes
Performance Modeling and Optimal Block Size Selection for the Small-Bulge Multishift QR Algorithm

The small-bulge multishift QR algorithm proposed by Braman, Byers and Mathias is one of the most efficient algorithms for computing the eigenvalues of nonsymmetric matrices on processors with hierarchical memory. However, to fully extract its potential performance, it is crucial to choose the block size m properly according to the target architecture and the matrix size n. In this paper, we construct a performance model for this algorithm. The model has a hierarchical structure that reflects the structure of the original algorithm and, given n, m, and the performance data of the basic components of the algorithm, such as the level-3 BLAS routines and the double implicit shift QR routine, predicts the total execution time. Experiments on SMP machines with PowerPC G5 and Opteron processors show that the variation of the execution time as a function of m predicted by the model agrees well with the measurements. Thus our model can be used to automatically select the optimal value of m for a given matrix size on a given architecture.

Yusaku Yamamoto
A Parallel Mutual Information Based Image Registration Algorithm for Applications in Remote Sensing

Image registration is the classical problem of finding a geometric transformation that best aligns two images. Since the amount of multisensor remote sensing imagery is growing tremendously, the search for a matching transformation with mutual information is very time-consuming and tedious, and fast, automatic registration of images from different sensors has become critical in the remote sensing framework. The implementation of automatic mutual information based image registration methods on high performance machines therefore needs to be investigated. This paper first presents a parallel implementation of a mutual information based image registration algorithm, which takes advantage of cluster machines by partitioning the data according to the structure of the algorithm. The parallel registration method is then evaluated, both theoretically and experimentally, showing that the parallel algorithm has good parallel performance and scalability.
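The registration criterion mentioned above, mutual information, measures how well the intensity distribution of one image predicts the other; it is maximized when the images are aligned. A minimal histogram-based computation of the criterion (illustrative only, not the paper's parallel implementation):

```python
from math import log2
from collections import Counter

def mutual_information(img_a, img_b):
    """Mutual information (in bits) between two equal-length intensity
    sequences, estimated from their joint histogram."""
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))  # joint intensity histogram
    pa = Counter(img_a)                 # marginal histograms
    pb = Counter(img_b)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab * log2( p_ab / (p_a * p_b) ), with counts rearranged
        mi += p_ab * log2(p_ab * n * n / (pa[a] * pb[b]))
    return mi
```

A registration search evaluates this quantity for many candidate transformations of one image against the other, which is exactly the expensive step that motivates the parallelization.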

Yunfei Du, Haifang Zhou, Panfeng Wang, Xuejun Yang, Hengzhu Liu
A Novel Broadcasting Algorithm for Wireless Sensor Networks

Broadcasting is one of the typical operations in wireless sensor networks, where the nodes are energy-constrained. Considering that nodes in wireless sensor networks are mostly static and position-aware, we propose a novel broadcasting algorithm (NBA) for utilizing energy efficiently. The energy status of nodes is regarded as an important factor, along with the local network topology, in clustering and in deciding retransmission nodes. We prove the correctness of NBA. Simulations show that NBA reduces the number of redundant retransmissions and balances traffic efficiently, so energy is conserved and energy dissipation is distributed evenly throughout the network, thereby greatly prolonging the lifetime of the WSN.

Yong Tang, Mingtian Zhou

Track 3. Middleware and Cooperative Computing

Session 5. Middleware

Middleware Services for Pervasive Grids

Grid computing and pervasive computing have rapidly emerged as, respectively, the paradigm for high-performance computing and the paradigm for user-friendly computing. These two worlds, however, can no longer remain separate islands. As a matter of fact, the pervasive and grid computing communities can both benefit from joining the two paradigms, and this conjunction is already taking place. This paper presents a middleware for pervasive grid applications. It consists of a set of basic services that aim to enhance classic grid environments with mechanisms for: i) integrating mobile devices in a pervasive way; ii) providing context-awareness; and iii) handling mobile users' sessions. These services are OGSA compliant and have been deployed as grid services so that they can easily be integrated into classic grid environments.

Mario Ciampi, Antonio Coronato, Giuseppe De Pietro
Allocating QoS-Constrained Workflow-Based Jobs in a Multi-cluster Grid Through Queueing Theory Approach

Clusters are increasingly interconnected to form multi-cluster systems, which are becoming popular for scientific computation. End-users often submit their applications in the form of workflows with certain Quality of Service (QoS) requirements imposed on them. These workflows describe the execution of a complex application built from individual application components, which form the workflow tasks. This paper addresses workload allocation techniques for Grid workflows. We model individual clusters as M/M/k queues and obtain a numerical solution for missed deadlines (failures) of tasks of Grid workflows. The approach is evaluated through an experimental simulation, and the results confirm that the proposed workload allocation strategy, combined with traditional scheduling algorithms, satisfies the QoS requirements of Grid workflows considerably better than scheduling algorithms that do not employ such workload allocation techniques.
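For reference, the waiting behaviour of the M/M/k queues used above is governed by the standard Erlang C formula: the probability that an arriving task must queue is C(k, λ/μ), and its waiting time beyond that is exponential with rate kμ − λ. The sketch below computes the queueing probability and the probability of exceeding a deadline d (generic queueing-theory code under the stability assumption λ < kμ, not the paper's numerical solution):

```python
from math import factorial, exp

def erlang_c(k, lam, mu):
    """Erlang C: probability that an arriving job must queue in an
    M/M/k system with arrival rate lam and per-server rate mu."""
    a = lam / mu          # offered load in Erlangs
    rho = a / k           # server utilization, must be < 1
    s = sum(a ** i / factorial(i) for i in range(k))
    top = a ** k / (factorial(k) * (1 - rho))
    return top / (s + top)

def prob_deadline_miss(k, lam, mu, d):
    """P(waiting time > d) = C(k, a) * exp(-(k*mu - lam) * d)."""
    return erlang_c(k, lam, mu) * exp(-(k * mu - lam) * d)
```

A workload allocator in the spirit of the abstract could evaluate `prob_deadline_miss` for each candidate cluster and route a workflow task to the cluster with the lowest predicted failure probability.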

Yash Patel, John Darlington
Managing Multiple Isolation Levels in Middleware Database Replication Protocols

Many database replication protocols have been designed to guarantee a serialisable isolation level, since it is appropriate for almost all applications. However, serialisability also requires tight coordination among replicas and might generate high abort rates with some workloads. Other isolation levels have therefore also been considered, such as snapshot isolation and cursor stability, but none of the previous works has proposed overall support for more than one isolation level at the same time. This paper explores such a research line.

Josep M. Bernabé-Gisbert, Raúl Salinas-Monteagudo, Luis Irún-Briz, Francesc D. Muñoz-Escoí
Proof and Evaluation of a 1CS Middleware Data Replication Protocol Based on O2PL

Middleware data replication techniques are a way to increase performance and fault tolerance without modifying the internals of a DBMS. However, they introduce overheads that may lead to poor response times. In this paper a modification of the O2PL protocol is introduced. It orders conflicting transactions by using their priority, instead of the total order obtained by an atomic multicast. Priorities are also used to avoid deadlocks. For improving its performance, it does not use the strict 2PC rule as O2PL does. We provide a formal correctness proof of its 1-Copy Serializability (1CS). This protocol has been implemented, and a comparison with other already implemented protocols is also given.

J. E. Armendáriz-Iñigo, F. D. Muñoz-Escoí, J. R. Garitagoitia, J. R. Juárez, J. R. González de Mendívil
Reconfiguration of Information Management Framework Based on Adaptive Grid Computing

In this paper, GridIMF provides users with consistency while adapting to the variability of grid information. GridIMF adopts a hierarchical 3-tier information management model structured according to the participation intention, roles and operating strategies of grid information. To manage grid information effectively, GridIMF divides its grid services into GVMS and GRMS. GVMS provides an optimal virtual organization selection mechanism for improving the performance of specific applications, together with an LRM auto-recovery strategy that handles faults of a virtual organization. GRMS supports an adaptive performance-based task allocation method for load balancing and fault tolerance. State monitoring and a visualization view support adaptive management of grid information, and an application proxy removes interdependencies between GridIMF service objects and application objects. To analyze the applicability of GridIMF, it is applied to two fractal image processing applications with different characteristics.

Eun-Ha Song, Sung-Kook Han, Laurence T. Yang, Young-Sik Jeong
Framework for Enabling Highly Available Distributed Applications for Utility Computing

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing, this IT service has to honor Service Level Agreements (SLAs) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize the utilization of resources and to decrease the Total Cost of Ownership (TCO). This reliability cannot come at the cost of resource duplication, since duplication increases the TCO of the data center and hence the cost per compute unit. In this paper, we look into projecting the impact of hardware failures on SLAs and into the techniques required to take proactive recovery steps in case of a predicted failure. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources at runtime, based on observed hardware errors and failure events. This, in turn, drives an availability-aware middleware to take proactive action (even before the application is affected, in case the system and the application have low recoverability).

The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicates no more than 10% false positives. To the best of our knowledge, this work is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs in a live system.

J. Lakshmi, S. K. Nandy, Ranjani Narayan, Keshavan Varadarajan
A Bluetooth MPI Framework for Collaborative Computer Graphics

This paper introduces a collaborative framework that resides on top of the Bluetooth API. It is designed for Bluetooth enabled mobile devices to allow for the collaborative generation of computer graphics and message passing. This new library is called the Mobile Message Passing Interface (MMPI). Mobile devices generally have limited processing abilities. By combining the processing power of several devices, complex computer graphics can be rendered in a fraction of the time a single device would take.

Daniel C. Doolan, Sabin Tabirca, Laurence T. Yang
Study on Application Server Aging Prediction Based on Wavelet Network with Hybrid Genetic Algorithm

Software aging is an important factor that affects software reliability. Based on the characteristics of the performance parameters of application server middleware, a new model for software aging prediction based on wavelet networks is proposed. The structure and parameters of the wavelet network are optimized by a hybrid of a genetic algorithm and simulated annealing. The objective is to observe and model the resource usage time series of the application server middleware in order to accurately predict future, unknown resource usage values. Using the model, we can obtain the aging threshold before the application server fails and rejuvenate it before the system parameter value reaches this threshold. Experiments carried out to validate the proposed model show that the aging prediction model based on a wavelet network with a hybrid genetic algorithm is superior to the neural network model and the plain wavelet network model in both convergence rate and prediction precision.

Meng Hai Ning, Qi Yong, Hou Di, Liu Liang, He Hui
Autonomic Service Reconfiguration in a Ubiquitous Computing Environment

Middleware in a ubiquitous computing environment (UbiComp) is required to support seamless on-demand services over diverse resource situations in order to meet various user requirements [1], and UbiComp applications therefore need situation-aware middleware services. In this paper, we propose a semantic middleware architecture that supports dynamic, ontology-based software component reconfiguration to provide fault tolerance in a ubiquitous computing environment. Our middleware includes autonomic [9] management to detect faults, analyze their causes, and plan semantically meaningful strategies to recover from failures using pre-defined fault and service ontology trees. We implemented a reference prototype, the Web-service based Application Execution Environment (Wapee), as a proof of concept, and show its efficiency in runtime recovery.

Yoonhee Kim, Eun-kyung Kim
Remote Class Prefetching: Improving Performance of Java Applications on Grid Platforms

In this paper we introduce and evaluate two prefetching techniques to improve the performance of Java applications executed on the grid. These techniques are experimentally evaluated in two grid environments, by running test applications on two different grid deployment configurations. Our testbed is suma/g, a grid platform specifically targeted at executing Java bytecode on Globus grids. The experimental results show that these techniques can be effective in improving the performance of applications run on the grid, especially for compute-intensive scientific applications.

Yudith Cardinale, Jesús De Oliveira, Carlos Figueira

Session 6. Cooperative Computing

Using Data Consistency Protocol in Distributed Computing Environments

Large-scale, data-intensive computing requires a sophisticated technology to be integrated with distributed file systems to provide clients with reliable and scalable high-performance accesses to the stored data. The clients physically share storage devices connected via a network like GigaEthernet or Fibre Channel and, on those clients, distributed file systems take responsibility for providing coordinated accesses and consistent views of shared data. In such a distributed computing environment, one of the major issues in achieving substantial I/O performance and scalability is to build an efficient locking protocol. In this paper, we present a distributed locking protocol that enables multiple nodes to simultaneously write their data to distinct data portions of a file, while providing the consistent view of client cached data. We conclude with an evaluation of the performance of our locking protocol.

Jaechun No, Chang Won Park, Sung Soon Park
Design and Evaluation of a Parallel Data Redistribution Component for TGrid

Data redistribution of parallel data representations has become an important factor of grid frameworks for scientific computing. Providing the developers with generalized interfaces for flexible parallel data redistribution is a major goal of this research. In this article we present the architecture and the implementation of the redistribution module of TGrid. TGrid is a grid-enabled runtime system for applications consisting of cooperating multiprocessor tasks (M-tasks). The data redistribution module enables TGrid components to transfer data structures to other components which may be located on the same local subnet or may be executed remotely. We show how the parallel data redistribution is designed to be flexible, extendible, scalable, and particularly easy-to-use. The article includes a detailed experimental analysis of the redistribution module by providing a comparison of throughputs which were measured for a large range of processors and for different interconnection networks.

Sascha Hunold, Thomas Rauber, Gudula Rünger
MetaLoRaS: A Predictable MetaScheduler for Non-dedicated Multiclusters

The main aim of this work is to take advantage of the computing resources of a single organization, or multicluster system, to execute distributed applications efficiently without excessively penalizing the local users. To this end, we propose a Metascheduler environment named MetaLoRaS, with a 2-level hierarchical architecture for a non-dedicated multicluster and with job prediction capabilities.

Compared with other Metaschedulers, the non-dedicated feature and an efficient prediction system are the most distinctive characteristics of MetaLoRaS. Another important contribution is the effective cluster selection mechanism, based on the prediction system.

In this article, we show how the hierarchical architecture and simulation mechanism are the best choices if we want to obtain an accurate prediction system and, at the same time, the best turnaround times for distributed jobs without damaging local user performance.

J. Ll Lérida, F. Solsona, F. Giné, M. Hanzich, P. Hernández, E. Luque
The Cost of Transparency: Grid-Based File Access on the Avaki Data Grid

Grid computing has been a hot topic for a decade. Several systems have been developed. Despite almost a decade of research and tens of millions of dollars spent, uptake of grid technology has been slow. Most deployed grids are based on a toolkit approach that requires significant software modification or development. An operating system technique used for over 30 years has been to reduce application complexity by providing transparency, e.g. file systems mask details of devices, virtual machines mask finite memory, etc. It has been argued that providing transparency in a grid environment is too costly in terms of performance. This paper examines that question in the context of data grids by measuring the performance of a commercially available data grid product – the Avaki Data Grid (ADG). We present the architecture of the ADG, describe our experimental setup, and provide performance results, comparing the ADG to a native NFS V3 implementation for both local and wide area access cases. The results were mixed, though encouraging. For single client local file operations, native NFS outperformed the ADG by 15% to 45% for smaller files, though for files larger than 32 MB ADG outperformed native NFS.

H. Howie Huang, Andrew S. Grimshaw
A Topology Self-adaptation Mechanism for Efficient Resource Location

This paper introduces a novel unstructured P2P system able to adapt its overlay network topology to the load conditions. The adaptation is performed by a mechanism run by the nodes in the network in an autonomous manner, using only local information, so no global coordinator is needed. The aim of this adaptation is to build an efficient topology for the resource discovery mechanism performed via random walks. We present the basis of the adaptation mechanism, along with simulation results obtained under different conditions. These results show that the system is efficient and robust, even in the face of directed attacks.

Luis Rodero-Merino, Luis López, Antonio Fernández, Vicent Cholvi
A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid

We describe a replication-based protocol that uses group communication for fault tolerance in the Computational Grid. The Grid is partitioned into a number of clusters and each cluster has a designated coordinator that manages the states of the replicas within its cluster. The coordinators belong to a process group and the proposed protocol ensures the correct sequence of message deliveries to the replicas by the coordinators. Any failing node of the Grid is replaced by an active replica to provide correct continuation of the operation of the application. We show the theoretical framework along with illustrations of the replication protocol and its implementation results and analyze its performance and scalability.

Kayhan Erciyes
Increasing Availability in a Replicated Partitionable Distributed Object System

Replicating objects in distributed object systems provides fault-tolerance and increases availability. We have designed a replication protocol for distributed object systems that provides increased availability by relaxing consistency temporarily. The protocol allows all partitions in a partitioned system to continue operating. The states of certain replicas are allowed to diverge. The application programmer can specify the required consistency using integrity constraints.

We present an analytical model of the new protocol and evaluate it against the primary partition model, where only a majority partition is allowed to continue. Furthermore, we identify the type of application for which our protocol provides increased availability.

Stefan Beyer, Mari-Carmen Bañuls, Pablo Galdámez, Johannes Osrael, Francesc D. Muñoz-Escoí
Exploiting Multidomain Non Routable Networks

Exploiting computing resources available within “departmental” organizations to run large-scale applications can be considered a difficult task since such resources are usually represented by computing nodes that belong to non-routable, private networks and are connected to the Internet through publicly addressable IP front-end nodes. This paper presents a Java middleware that can support the execution of large-scale, object-based applications over heterogeneous multidomain, non-routable networks. Furthermore, the middleware can be exploited to relieve programmers of the classic burden tied to the deployment of PVM runtime libraries and program executables among computational resources belonging to distinct administrative domains.

Franco Frattolillo, Salvatore D’Onofrio
Recovery Strategies for Linear Replication

Replicated systems are commonly used to provide highly available applications. In recent years, these systems have mostly been based on atomic broadcast protocols, and a wide range of solutions have been published. These atomic broadcast-based protocols have also aided the development of recovery protocols that provide fault tolerance to replicated systems. However, this research has traditionally been oriented to replication systems based on constant interaction for ensuring 1-copy serializability. This paper presents a general strategy for recovery protocols based on linear interaction that also provides other isolation levels, such as snapshot isolation. Moreover, some conclusions of this work can be used to review recovery protocols based on constant interaction.

Rubén de Juan-Marín, Luis Irún-Briz, Francesc D. Muñoz-Escoí
Mobile Agents Based Collective Communication: An Application to a Parallel Plasma Simulation

Collective communication libraries are widely developed and used in the scientific community to support parallel and Grid programming. In contrast, they are often lacking in Mobile Agent systems, even though message passing is always supported to grant communication ability to the agents. Collective communication primitives can help to develop agent-based parallel applications. They can also benefit the social ability and interactions of collaborative agents. Here we present a collective communication service implemented in the Jade agent platform. Furthermore, we propose its exploitation to transparently interface heterogeneous execution instances of a scientific parallel application running in a distributed environment.

Salvatore Venticinque, Beniamino Di Martino, Rocco Aversa, Gregorio Vlad, Sergio Briguglio

Track 4. Software and Applications

Session 7. Software

A Static Parallel Multifrontal Solver for Finite Element Meshes

We present a static parallel implementation of the multifrontal method to solve unsymmetric sparse linear systems on distributed-memory architectures. We target Finite Element (FE) applications where numerical pivoting can be avoided, since an implicit minimum-degree ordering based on the FE mesh topology suffices to achieve numerical stability. Our strategy is static in the sense that work distribution and communication patterns are determined in a preprocessing phase preceding the actual numerical computation. To balance the load among the processors, we devise a simple model-driven partitioning strategy to precompute a high-quality balancing for a large family of structured meshes. The resulting approach is proved to be considerably more efficient than the strategies implemented by MUMPS and SuperLU_DIST, two state-of-the-art parallel multifrontal solvers.

Alberto Bertoldo, Mauro Bianco, Geppino Pucci
A Software Architecture Framework for On-Line Option Pricing

The growing computational complexity of the finance industry demands manageable, high-speed, real-time solutions to complex mathematical problems such as option pricing. In option trading, determining a fair price for options "any time" and "anywhere" has become a vital yet difficult computational problem. In this study, we have designed, implemented, and deployed an architecture for pricing options on-line using a J2ME-based, mobile-computing-enabled hand-held device assisted by web mining tools. In our architecture, the client is a MIDP user interface, and the back-end servlet runs on a standalone server bound to a known port address. In addition, the server uses table-mining techniques to mine real-time data from reliable web sources upon the mobile trader's directive. The server performs all computations required for pricing options, since mobile devices have limited battery power, low bandwidth, and little memory. To the best of our knowledge, this is one of the first studies that enables the mobile trader to compute the price of an option in a ubiquitous fashion. This architecture aims to provide the trader with a range of computational techniques, from approximate to accurate, while on the go, supporting important and effective trading decisions that will ensure higher returns on investments in option trading.
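For context, one standard "accurate" pricing technique such a server might offer is the closed-form Black-Scholes formula for a European call. This is a generic illustration of option pricing, not code or a model taken from the paper:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, maturity, rate, sigma):
    """Black-Scholes price of a European call option.
    spot/strike in currency units, maturity in years, rate and sigma annualized."""
    d1 = (log(spot / strike) + (rate + sigma ** 2 / 2.0) * maturity) \
         / (sigma * sqrt(maturity))
    d2 = d1 - sigma * sqrt(maturity)
    return spot * norm_cdf(d1) - strike * exp(-rate * maturity) * norm_cdf(d2)

# At-the-money call, 1 year, 5% rate, 20% volatility: price ~ 10.45.
print(round(bs_call(100, 100, 1.0, 0.05, 0.20), 2))
```

A server-side design like the one described could expose such routines alongside slower but more flexible approximate methods (e.g., binomial trees or Monte Carlo), letting the mobile client trade accuracy for response time.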

Kiran Kola, Amit Chhabra, Ruppa K. Thulasiram, Parimala Thulasiraman
An Open Source Web Service Based Platform for Heterogeneous Clusters

We present our joint effort to develop a web-based interface for the GNU Scientific Library and its parallelization. The interface has been developed using standard web services technology to enable the use of non-local resources to execute parallel programs. The final result is a computing service that supplies sequential and parallel routines demanding high-performance computing. The design allows new servers and platforms to be incorporated with a small number of software requirements. We also introduce an open source development environment that allows developers to cooperate in the parallelization of the GNU Scientific Library codes. These codes will also be available through the web-based interface to end users. Performance results are shown for some GSL codes on two heterogeneous cluster systems using the interface enabled with web services technology.

F. Almeida, S. Barrachina, V. Blanco, E. Quintana, A. Santos
Cost Models and Performance Evaluation of Similarity Search in the HON P2P System

Similarity searching is particularly important in fully distributed networks such as P2P systems in which various routing schemes are used to submit queries to a group of relevant nodes. This paper focuses on the maintenance cost models and performances of similarity search in the HON P2P system, where data and peers are organized in a high dimensional feature space. We show through extensive simulations that HON has a low maintenance cost and is resilient to peers’ failures.

Mouna Kacimi, Kokou Yétongnon
Matrix-Based Programming Optimization for Improving Memory Hierarchy Performance on Imagine

Although Imagine presents an efficient memory hierarchy, the straightforward programming of scientific applications does not match the available memory hierarchy and thereby constrains the performance of stream applications. In this paper, we explore a novel matrix-based programming optimization for improving memory hierarchy performance to sustain the operands needed for highly parallel computation. Our specific contributions are that we formulate the problem on the Data&Computation Matrix (D&C Matrix), which is proposed to abstract the relationship between streams and kernels, and present the key techniques for improving multilevel bandwidth utilization based on this matrix. The experimental evaluation on five representative scientific applications shows that the new stream programs yielded by our optimization can effectively enhance the locality in the LRF and SRF, improve the capacity utilization of the LRF and SRF, make the best use of SPs and SBs, and avoid index stream overhead.

Xuejun Yang, Jing Du, Xiaobo Yan, Yu Deng
On Design and Implementation of Adaptive Data Classification Scheme for DSM Systems

A Distributed Shared Memory (DSM) environment is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines resources to speed up their execution. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. Therefore, a number of strategies exist to overcome this overhead and improve overall performance. Prefetching strategies have been shown to perform well in DSM systems, since they can reduce the latency of data accesses to remote nodes. However, these strategies can also transfer unnecessarily prefetched pages to remote nodes. In this paper, we focus on the analysis of data access patterns during the execution of parallel applications, and propose an Adaptive Data Classification scheme to improve prefetching and thus overall performance. The scheme classifies data according to the access behavior of pages, so that the home node can use the past access history of remote nodes to decide whether related pages need to be transferred to them. Experimental results show that our method can improve the performance of prefetching strategies in DSM systems.
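The home-node decision the abstract describes — prefetch a page to a remote node only if that node's access history warrants it — can be sketched minimally. The class name, history window, and eviction policy are assumptions for illustration, not the paper's actual scheme:

```python
from collections import defaultdict

class HomeNode:
    """Home node that records which remote nodes recently accessed each page
    and prefetches a page to a node only if that node appears in the page's
    recent access history (a sliding window of the last few accessors)."""
    def __init__(self, history_window=2):
        self.window = history_window
        self.history = defaultdict(list)   # page -> recent accessor node ids

    def record_access(self, page, node):
        self.history[page].append(node)
        # keep only the most recent accessors
        self.history[page] = self.history[page][-self.window:]

    def should_prefetch(self, page, node):
        """True if past history suggests the node will touch the page again."""
        return node in self.history[page]

home = HomeNode(history_window=2)
home.record_access("p1", "nodeA")
print(home.should_prefetch("p1", "nodeA"))  # True
print(home.should_prefetch("p1", "nodeB"))  # False
```

A real DSM would key this on page-fault traffic and piggyback the decision on consistency messages; the sketch only captures the classify-then-decide structure.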

Chun-Chieh Yang, Ssu-Hsuan Lu, Hsiao-Hsi Wang, Kuan-Ching Li
TraceDo: An On-Chip Trace System for Real-Time Debug and Optimization in Multiprocessor SoC

Traditional debug techniques using breakpoints and single stepping can hardly meet the requirements of debugging and optimization problems related to the temporal behavior of real-time programs on multiprocessors. In this paper, an on-chip trace system, TraceDo (Trace for Debug and Optimization), for a multiprocessor SoC (YHFT-QDSP) is introduced to overcome this debug challenge. Several novel methods, including an LS encoder, branch configuration bits, and configuration instructions, are presented in TraceDo to efficiently trace the program paths, data accesses, and events with timestamps from the four Digital Signal Processor (DSP) cores of YHFT-QDSP. Benchmark results show that TraceDo with the LS encoder improves the compression ratio of trace information by 27% on average over the best reference result. When branch configuration bits are used, this value rises to 64%.

Xiao Hu, Pengyong Ma, Shuming Chen, Yang Guo, Xing Fang
Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers

This paper introduces a formal framework for automatically generating performance-optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system Spiral. DFT algorithms are represented as mathematical formulas in Spiral's internal language SPL. Using a tagging mechanism and formula rewriting, we extend Spiral to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. It is a novel feature of these methods that the redistribution steps are merged with the communication steps of the algorithm to avoid additional communication overhead. Among the possible alternative algorithms, Spiral's search mechanism now determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by Spiral show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.

Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber

Session 8. Applications

Debugging Distributed Shared Memory Applications

A debugger is a crucial part of any programming system, and is especially crucial for those supporting a parallel programming paradigm, like OpenMP. A parallel, relaxed-consistency, distributed shared memory (DSM) system presents unique challenges to a debugger for several reasons: 1) the local copies of a given variable are not always consistent between distributed machines, so directly accessing the variable in the local memory by the debugger won’t always work as expected; 2) if the DSM and debugger both modify page protections, they will likely interfere with each other; and 3) since a large number of SIGSEGVs occur as part of the normal operation of a DSM program, a program error producing a SIGSEGV may be missed. In this paper, we discuss these problems and propose solutions.

Jeffrey Olivier, Chih-Ping Chen, Jay Hoeflinger
Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications

This paper examines the memory performance of vector-parallel and scalar-parallel computing platforms across five applications from three scientific areas: electromagnetic analysis, CFD/heat analysis, and seismology. Our evaluation results show that the vector platforms can achieve high computational efficiency and hence significantly outperform the scalar platforms in these application areas. We carried out exhaustive experiments and quantitatively evaluated representative scalar and vector platforms using real applications from the viewpoint of system designers and developers. These results demonstrate that the ratio of memory bandwidth to floating-point operation rate needs to reach 4 bytes/flop to preserve computational performance while hiding memory access latencies with pipelined vector operations on the vector platforms. We also confirm that a sufficient number of memory banks to handle strided memory accesses leads to an increase in execution efficiency. On the scalar platforms, the cache hit rate needs to be almost 100% to achieve high computational efficiency.
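The 4-bytes/flop threshold the authors report is simply a bandwidth-to-compute ratio. The platform numbers below are made-up illustrations of how the ratio is read, not figures from the paper:

```python
def bytes_per_flop(mem_bandwidth_gb_s, peak_gflop_s):
    """Ratio of sustainable memory bandwidth (GB/s) to peak
    floating-point rate (Gflop/s): the B/F figure of merit."""
    return mem_bandwidth_gb_s / peak_gflop_s

# Hypothetical vector node: 64 GB/s feeding 16 Gflop/s meets the 4 B/F target,
# so pipelined vector loads can keep the FP units saturated.
print(bytes_per_flop(64, 16))    # 4.0

# Hypothetical scalar node: 6.4 GB/s feeding 6.4 Gflop/s gives only 1 B/F,
# so it must rely on near-perfect cache hit rates instead of raw bandwidth.
print(bytes_per_flop(6.4, 6.4))  # 1.0
```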

Akihiro Musa, Hiroyuki Takizawa, Koki Okabe, Takashi Soga, Hiroaki Kobayashi
Optimisation of the Parallel Performance of a 3D Device Simulator for HEMTs

The resolution of the linear systems generated by the discretisation of partial differential equations is the most time-consuming part of semiconductor device simulation. In this paper we present an optimisation of the linear system resolution procedure used in the PSPARSLIB library. The linear systems employed in this work arise from a three-dimensional parallel simulator of semiconductor devices, specifically HEMTs, based on the drift-diffusion model. This optimisation increases the parallel efficiency of the simulation process and improves its scalability.

N. Seoane, A. J. García-Loureiro
Video Shot Extraction on Parallel Architectures

One of the main objectives of content-based multimedia retrieval systems is the automation of the information extraction process from raw data. When dealing with video data, the first step is to perform a temporal video segmentation in order to decompose the video content into shots. From a computational point of view, this is a very demanding task, and algorithm optimization must be sought. This paper presents a comparison between two parallel programming paradigms: shared-memory communication and distributed-memory processing using message passing. For these software solutions, experimental results are collected on two alternative parallel architectures: a shared-memory symmetric multiprocessor and a cluster. The paper analyzes the performance achieved from the viewpoints of speed and scalability.
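One common sequential kernel behind shot decomposition — and a natural candidate for the data-parallel decomposition the paper benchmarks — is histogram differencing between consecutive frames. The bin count, threshold, and toy 8-pixel frames below are illustrative assumptions, not the paper's algorithm:

```python
def shot_cuts(frames, bins=4, threshold=0.5):
    """Detect shot boundaries by comparing intensity histograms of consecutive
    frames; a cut is declared at frame i when the normalized L1 histogram
    difference exceeds the threshold. Each frame is a list of 0-255 pixels."""
    def hist(frame):
        h = [0] * bins
        for px in frame:
            h[min(px * bins // 256, bins - 1)] += 1
        return [c / len(frame) for c in h]

    cuts = []
    prev = hist(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = hist(frame)
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / 2  # in [0, 1]
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Three dark frames followed by three bright frames: one cut, at frame 3.
frames = [[10] * 8] * 3 + [[250] * 8] * 3
print(shot_cuts(frames))  # [3]
```

Since each frame-pair comparison is independent, the frame sequence can be block-partitioned across threads or MPI ranks, which is the kind of decomposition the two paradigms in the paper would exploit.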

Pablo Toharia, Oscar D. Robles, José L. Bosque, Angel Rodríguez
Performance of Real-Time Data Scheduling Heuristics Under Data Replacement Policies and Access Patterns in Data Grids

A variety of real-time data scheduling heuristics have been proposed for distributed, data-intensive real-time applications running on a distributed computing system. These heuristics produce real-time data dissemination schedules for the applications' requests for data stored on the machines in the system. However, how these heuristics perform under different data replacement policies and data access patterns is a question left unanswered. Motivated by this, this study evaluates the performance of two real-time data scheduling heuristics, namely the Full Path Heuristic and the Extended Partial Path Heuristic, under different data replacement policies and data access patterns. A detailed set of simulation studies is presented to reveal how these algorithms are affected by changes in the data replacement policy and data access pattern, as well as the other system parameters of interest.

Atakan Doğan
Multi-site Scheduling with Multiple Job Reservations and Forecasting Methods

Most previous research on job scheduling for multi-site distributed systems does not take into consideration behavioral trends when applying a scheduling method. In this paper, we address the scheduling of parallel jobs in a multi-site environment, where each site has a homogeneous cluster of non-dedicated processors where users submit jobs to be executed locally, while at the same time, external parallel jobs are submitted to a meta-scheduler. We use collected load data to model the performance trends that each site exhibits in order to predict load values via time-series analysis and then perform scheduling based on the predicted values.
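The predict-then-schedule loop described above can be sketched with simple exponential smoothing as the time-series forecaster. The smoothing method, parameter, and load histories below are assumptions for illustration; the paper's actual forecasting models may differ:

```python
def exp_smooth_forecast(series, alpha=0.5):
    """One-step-ahead forecast via simple exponential smoothing:
    level <- alpha * observation + (1 - alpha) * level."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def pick_site(load_histories, alpha=0.5):
    """Schedule the next external job on the site with the
    lowest predicted load."""
    forecasts = {site: exp_smooth_forecast(hist, alpha)
                 for site, hist in load_histories.items()}
    return min(forecasts, key=forecasts.get)

# Site A is persistently busy with local jobs, site B is lightly loaded:
histories = {"A": [0.9, 0.8, 0.9], "B": [0.2, 0.3, 0.2]}
print(pick_site(histories))  # B
```

A meta-scheduler would refresh these histories continuously and combine the forecast with job reservations, which this sketch omits.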

Maria A. Ioannidou, Helen D. Karatza
Evaluating the Dynamic Behaviour of PROSA P2P Network

In this paper we present and simulate a new self-organising algorithm for unstructured P2P networks inspired by human relationships. This algorithm, called PROSA, tries to simulate the evolution of human relationships from simple acquaintance to friendship and partnership. Our target is to obtain a self-reconfiguring P2P system that possesses some of the desirable features of social communities. The most useful property of many natural social communities is that of being "small worlds", since in a small-world network queries usually require a small number of "hops" to travel from a source to a destination peer. We show that PROSA naturally evolves into a small-world network, with a high clustering coefficient and a really short average path length.

Vincenza Carchiolo, Michele Malgeri, Giuseppe Mangioni, Vincenzo Nicosia
Towards Fully Adaptive Pipeline Parallelism for Heterogeneous Distributed Environments

This work describes an adaptive parallel pipeline skeleton which maps pipeline stages to the best processors available in the system and clears dynamically emerging performance bottlenecks at run-time by re-mapping affected stages to other processors. It is implemented in C and MPI and evaluated on a non-dedicated heterogeneous Linux cluster. We report upon the skeleton’s ability to respond to an artificially generated variation in the background load across the cluster.

Horacio González-Vélez, Murray Cole
Compiler-Directed Energy-Time Tradeoff in MPI Programs on DVS-Enabled Parallel Systems

Although parallel systems with high peak performance are exciting, high peak performance often means high power consumption. In this paper, power-aware parallel systems are investigated, in which each node is capable of dynamic voltage scaling (DVS). Based on the characteristics of communication and memory access in MPI programs, a compiler is used to automatically form communication and computation regions, and to optimally assign a frequency and voltage to each region. The frequency and voltage of each node are dynamically adjusted, and energy consumption is minimized within a limit on performance loss. Results from simulations and experiments show that compiler-directed energy-time tradeoff can save 20-40% of energy consumption with less than 5% performance loss.
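The intuition behind slowing nodes down during communication regions can be seen in a toy dynamic-power model. The voltage/frequency pairs below are invented for illustration and the model (P = C·V²·f with C normalised to 1) is a textbook simplification, not the paper's cost model:

```python
def region_energy(voltage, freq_ghz, cycles):
    """Toy dynamic-energy model: P = C * V^2 * f and t = cycles / f,
    so E = C * V^2 * cycles (C normalised to 1). Lowering the voltage,
    which a lower frequency permits, cuts energy quadratically."""
    time_s = cycles / (freq_ghz * 1e9)
    power_w = voltage ** 2 * (freq_ghz * 1e9)
    return power_w * time_s, time_s

# A communication region of 1e9 CPU cycles, run at full speed vs scaled down.
e_hi, t_hi = region_energy(1.0, 1.0, 1e9)   # 1.0 V at 1.0 GHz
e_lo, t_lo = region_energy(0.7, 0.5, 1e9)   # 0.7 V at 0.5 GHz
print(f"energy ratio: {e_lo / e_hi:.2f}")   # ~0.49: over half the energy saved
```

Because communication regions are dominated by network latency rather than CPU cycles, the wall-clock penalty of the lower frequency is small there, which is what lets the compiler trade little time for substantial energy.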

Huizhan Yi, Juan Chen, Xunjun Yang
A GPGPU Approach for Accelerating 2-D/3-D Rigid Registration of Medical Images

This paper presents a fast 2-D/3-D rigid registration method using a GPGPU approach, which stands for general-purpose computation on the graphics processing unit (GPU). Our method is based on an intensity-based registration algorithm using biplane images. To accelerate this algorithm, we execute three key procedures of 2-D/3-D registration on the GPU: digitally reconstructed radiograph (DRR) generation, gradient image generation, and normalized cross correlation (NCC) computation. We investigate the usability of our method in terms of registration time and robustness. The experimental results show that our GPU-based method successfully completes a registration task in about 10 seconds, demonstrating shorter registration time than a previous method based on a cluster computing approach.
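Of the three GPU-accelerated procedures named, normalized cross correlation is the easiest to state compactly. This is the standard NCC definition on flattened intensity lists, shown for reference rather than as the authors' GPU implementation:

```python
from math import sqrt

def ncc(a, b):
    """Normalized cross correlation between two equal-length intensity
    sequences (e.g., a DRR and a flattened X-ray image): +1 for a perfect
    linear match, -1 for a perfect inverse match."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) *
               sum((y - mean_b) ** 2 for y in b))
    return num / den

print(ncc([1, 2, 3], [2, 4, 6]))  # 1.0: identical up to intensity scaling
print(ncc([1, 2, 3], [3, 2, 1]))  # -1.0: inverted contrast
```

In the registration loop, the optimizer perturbs the rigid pose, regenerates the DRR, and maximizes this score; the GPU pays off because the sums run over every pixel at every iteration.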

Fumihiko Ino, Jun Gomita, Yasuhiro Kawasaki, Kenichi Hagihara
Backmatter
Metadata
Title
Parallel and Distributed Processing and Applications
Edited by
Minyi Guo
Laurence T. Yang
Beniamino Di Martino
Hans P. Zima
Jack Dongarra
Feilong Tang
Copyright year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-68070-3
Print ISBN
978-3-540-68067-3
DOI
https://doi.org/10.1007/11946441
