
2005 | Book

High Performance Computing – HiPC 2005

12th International Conference, Goa, India, December 18-21, 2005. Proceedings

Editors: David A. Bader, Manish Parashar, Varadarajan Sridhar, Viktor K. Prasanna

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Keynote Addresses

Data Confidentiality in Collaborative Computing

Even though collaborative computing can yield substantial economic, social, and scientific benefits, a serious impediment to fully achieving that potential is a reluctance to share data, for fear of losing control over its subsequent dissemination and usage. An organization’s most valuable and useful data is often proprietary/confidential, or the law may forbid its disclosure or regulate the form of that disclosure. We survey security technologies that mitigate this problem, and discuss research directions towards enforcing the data owner’s approved purposes on the data used in grid computing. These include techniques for cooperatively computing answers without revealing any private data, even though the computed answers depend on all the participants’ private data. They also include computational outsourcing, where computationally weak entities use computationally powerful entities to carry out intensive computing tasks without revealing to them either their inputs or the computed outputs.

Mikhail Atallah
Productivity in High Performance Computing

Productivity gains in the development of software systems have been modest over the 45-50 years that I have been involved in writing programs. Productivity gains in high performance computing in particular have been even more modest than across the industry in general. This talk will rationalize the relative failure in HPC and sketch an approach which could enable orders-of-magnitude improvement in productivity for development of HPC software systems for all types of execution environments. The technical challenges and opportunities implicit in the approach will be discussed. Barriers to productivity gains in HPC include: (i) the dramatic increase in complexity of HPC algorithms and applications, coupled with a dramatic increase in complexity, diversity and scale of HPC execution environments; (ii) little CS/CE research on productivity directly addresses HPC-specific problems such as parallelism, macro-locality and distributed state. The conceptual basis for dramatically increasing productivity in HPC already exists and includes: (i) attention to HPC-specific design, evaluation and execution issues; (ii) component-based development coupled with automation and tools; (iii) coherent unification of design-time, compile-time and runtime language mechanisms. Implementation of these concepts in HPC raises major technical challenges: (i) development environments must be extended to address HPC-specific issues including diverse and scalable parallelism, macro-locality and distributed state; (ii) compilation processes must become more semantically complex; and (iii) compilation and runtime systems must become more knowledge-based and be more effectively unified.

James C. Browne
A New Approach to Programming and Prototyping Parallel Systems

A central challenge for emerging shared-memory parallel systems is writing parallel programs to run on them. Transactional Coherence and Consistency (TCC) is a new way to implement cache coherency in shared-memory parallel systems by using programmer-defined transactions as the fundamental unit of parallel work, communication, coherence, consistency and failure atomicity. TCC has the potential to simplify parallel programming by providing a smooth transition from sequential programs to parallel programs. In this talk, I will describe TCC and explain how to develop parallel applications using the TCC programming model. I will also briefly describe the architecture of the Stanford Flexible Architecture Research Machine (FARM). FARM is a scalable system based on commercial high-density blade server technology. FARM is designed to improve the capability of architects and applications developers to experiment with large-scale parallel systems.

Kunle Olukotun
The Changing Challenges of Collaborative Algorithmics

It seems to be a truism that one can gain computational efficiency by enlisting more computers in the solution of a single computational problem. (We refer to such use of multiple computers as “collaborative computing.”) In order to realize the promise of collaborative computing, however, one must know how to exploit the strengths of the technology used to build the computing platform – the multiple computers and the networks that interconnect them – and how to avoid the weaknesses of the technology. Changes in technology – even apparently modest ones – often call for dramatic changes in algorithmic strategy. In this talk, I describe some of the challenges that the algorithm designer has faced as the dominant collaborative computing platforms have changed.

Arnold L. Rosenberg
Quantum Physics and the Nature of Computation

Quantum physics is a fascinating area from a computational viewpoint. The features that make quantum systems prohibitively hard to simulate classically are precisely the aspects exploited by quantum computation to obtain exponential speedups over classical computers. In this talk I will survey our current understanding of the power of quantum computers and prospects for experimentally realizing them in the near future. I will also touch upon insights from quantum computation that have resulted in new classical algorithms for efficient simulation of certain important quantum systems.

Umesh Vazirani

Plenary Session - Best Papers

Preemption Adaptivity in Time-Published Queue-Based Spin Locks

The proliferation of multiprocessor servers and multithreaded applications has increased the demand for high-performance synchronization. Traditional scheduler-based locks incur the overhead of a full context switch between threads and are thus unacceptably slow for many applications. Spin locks offer low overhead, but they either scale poorly (test-and-set style locks) or handle preemption badly (queue-based locks). Previous work has shown how to build preemption-tolerant locks using an extended kernel interface, but such locks are not portable to, or even compatible with, most operating systems.

In this work, we propose a time-publishing heuristic in which each thread periodically records its current timestamp to a shared memory location. Given the high-resolution, roughly synchronized clocks of modern processors, this convention allows threads to guess accurately which peers are active based on the currency of their timestamps. We implement two queue-based locks, MCS-TP and CLH-TP, and evaluate their performance relative to traditional spin locks on a 32-processor IBM p690 multiprocessor. These results show that time-published locks make it feasible, for the first time, to use queue-based spin locks on multiprogrammed systems with a standard kernel interface.

Bijun He, William N. Scherer III, Michael L. Scott
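
As a rough illustration of the time-publishing idea in the abstract above (a hypothetical sketch, not the paper's MCS-TP or CLH-TP code; the staleness threshold and the x86 cycle counter are our assumptions), each active thread keeps stamping a shared slot, and a waiter treats a peer whose stamp has gone stale as probably preempted:

```c
/* Hypothetical sketch of a time-publishing heuristic (not the MCS-TP/CLH-TP sources). */
#include <stdint.h>
#include <stdatomic.h>

#define STALE_CYCLES 100000ULL        /* assumed staleness threshold; tuning is platform-specific */

typedef struct {
    _Atomic uint64_t last_published;  /* timestamp the owning thread last wrote */
} tp_slot;

static inline uint64_t now_cycles(void)
{
    uint32_t lo, hi;                  /* x86 time-stamp counter; other ISAs need another clock */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Called periodically by an active thread, e.g. on every iteration of its spin loop. */
static inline void tp_publish(tp_slot *mine)
{
    atomic_store_explicit(&mine->last_published, now_cycles(), memory_order_relaxed);
}

/* A waiter guesses whether a peer still holds a processor: a fresh timestamp suggests the
 * peer is running; a stale one suggests it has been preempted and can be skipped or removed. */
static inline int tp_peer_seems_active(tp_slot *peer)
{
    uint64_t seen = atomic_load_explicit(&peer->last_published, memory_order_relaxed);
    return (now_cycles() - seen) < STALE_CYCLES;
}
```
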
Criticality Driven Energy Aware Speculation for Speculative Multithreaded Processors

Speculative multithreaded architecture (SpMT) philosophy relies on aggressive speculative execution for improved performance. Aggressive speculative execution results in a significant wastage of dynamic energy due to useless computation in the event of mis-speculation. As energy consumption is becoming an important constraint in microprocessor design, it is extremely important to reduce such wastage of dynamic energy in SpMT processors in order to achieve a better performance-to-power ratio. Dynamic instruction criticality information can be effectively applied to control aggressive speculation in SpMT processors. In this paper, we present a model of micro-execution for SpMT processors to determine dynamic instruction criticality. We also present two novel techniques utilizing criticality information, namely delaying non-critical loads and criticality-based thread prediction, for reducing useless computation and energy consumption. Our experiments show 17.71% and 11.63% reductions in dynamic energy for the criticality-based thread prediction and criticality-based delayed load schemes respectively, while the corresponding improvements in dynamic energy-delay product are 13.93% and 5.54%.

Rahul Nagpal, Anasua Bhowmik

Session I - Algorithms

Search-Optimized Suffix-Tree Storage for Biological Applications

Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences. Further, Stellar supports both the standard root-to-leaf lookup queries as well as sophisticated sequence-search algorithms that exploit the suffix-links of suffix-trees. Our results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines.

Srikanta J. Bedathur, Jayant R. Haritsa
Cost-Optimal Job Allocation Schemes for Bandwidth-Constrained Distributed Computing Systems

This paper formulates the job allocation problem in distributed systems with bandwidth-constrained nodes. The bandwidth limitations of the nodes play an important role in the design of cost-optimal job allocation schemes. In this paper, we present a pricing strategy for generalized distributed systems by formulating an incomplete-information bargaining game on two variables (price and percentage of bandwidth allocated for distributed computing jobs at each node). Next, we present a cost-optimal job allocation scheme for single-class jobs that accounts for communication delay and hence link bandwidth. We show that our algorithms are comparable to existing job allocation algorithms in minimizing the expected system response time.

Preetam Ghosh, Kalyan Basu, Sajal K. Das
A Fault Recovery Scheme for P2P Metacomputers

Despite the leaps and bounds made by the P2P research field in the last few years, the benefit of this innovation has been confined to a few areas: search, file-sharing, and storage, to name a few. In particular, this innovation has had little significant impact in the field of distributed computing.

There are several obstacles to be overcome in the development of any distributed computer, most notably: scalability, fault tolerance, security and load balancing. The difficulty of these is compounded in the dynamic, decentralized environment which characterizes the P2P arena. This paper presents a method of recovering from faults which exploits the distributed hash table functionality provided by modern overlay networks. Its effectiveness is evaluated experimentally using a proof of concept P2P distributed computer.

It is hoped that by providing a solution to one of the obstacles, global, decentralized, dependable distributed computers will be one step closer to reality.

Keith Power, John P. Morrison
A Distributed Location Identification Algorithm for Ad hoc Networks Using Computational Geometric Methods

We present here a novel approach where we identify a region within which a node is guaranteed to be found, in contrast to the existing approaches where no such confining region for a node can be guaranteed, but only the location could be estimated, either with no definitive error bound or only with some probabilistic error. The location identification algorithm presented here minimizes the size of this region, using computational geometric methods. The proposed technique iteratively improves the region of residence of all the nodes in the network through the exchange of region information among neighbors in O(nD) time, where n and D are the number of nodes and the diameter of the network respectively. Simulations also show encouraging results with this approach.

Koushik Sinha, Atish DattaChowdhury
A Symmetric Localization Algorithm for MANETs Based on Collapsing Coordinate Systems

Localization in mobile ad hoc networks (MANETs) is the process of fixing the position of a node according to some real or virtual coordinate system. In many cases, solutions like Global Positioning System (GPS) are not feasible. As a result, several algorithms have been developed for localization based purely on local communication. However, many of these suffer from one of the following: flooding of the network, requirement for global knowledge, or the requirement of “beacon” nodes, which know their absolute position according to GPS. At the very least, localization algorithms require parts of the system to be either static or relatively stable. In this paper, we propose a symmetric localization algorithm that performs fairly accurate localization. No special elements like beacons and other static elements are required; however, they are not excluded.

Srinath Srinivasa, Sanket Patil

Session II - Applications

Performance Study of LU Decomposition on the Programmable GPU

With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To achieve this, we have developed and evaluated some methods with different implementation approaches in terms of (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results give four important points: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be efficiently handled by the CPU rather than the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those by CPUs, mainly due to the floating-point division error that increases the numerical error with the progress of decomposition.

Fumihiko Ino, Manabu Matsui, Keigo Goda, Kenichi Hagihara
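
For orientation only, the computation being mapped to the GPU above is ordinary dense LU decomposition; a generic in-place CPU version (without pivoting, and not the authors' code) of the loop nest that the GPU implementations stream through render textures looks like this:

```c
/* Generic in-place LU decomposition without pivoting: A = L*U with L's unit diagonal
 * implied. Reference sketch only; the paper maps this loop nest to GPU fragment
 * programs and render textures. A is n x n, stored row-major. */
void lu_decompose(float *A, int n)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            A[i*n + k] /= A[k*n + k];           /* multiplier, stored in place of L */
            for (int j = k + 1; j < n; j++)     /* rank-1 update of the trailing block */
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
}
```
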
PENCAPS: A Parallel Application for Electrode Encased Grounding Systems Project

The design of concrete encased electrode grounding systems by conventional computational procedures is a time-consuming task, because the electromagnetic representation of the physical system requires the calculation of large full matrices. Recently, the possibility of parallelizing the procedures involved in such calculations led the authors to implement a C language parallel application, based on MPI (Message Passing Interface). This article presents the engineering problem associated with this development and the fundamental aspects of the application, including the evaluation of its efficiency for the solution of large grounding systems.

Marco Aurélio S. Birchal, Maria Helena M. Vale, Silvério Visacro
Application of Reduce Order Modeling to Time Parallelization

We recently proposed a new approach to parallelization, by decomposing the time domain, instead of the conventional space domain. This improves latency tolerance, and we demonstrated its effectiveness in a practical application, where it scaled to much larger numbers of processors than conventional parallelization. This approach is fundamentally based on dynamically predicting the state of a system from data of related simulations. In earlier work, we used knowledge of the science of the problem to perform the prediction. In complicated simulations, this is not feasible. In this work, we show how reduced order modeling can be used for prediction, without requiring much knowledge of the science. We demonstrate its effectiveness in an important nano-materials application. The significance of this work lies in proposing a novel approach, based on established mathematical theory, that permits effective parallelization of time. This has important applications in multi-scale simulations, especially in dealing with long time-scales.

Ashok Srinivasan, Yanan Yu, Namas Chandra
Orthogonal Decision Trees for Resource-Constrained Physiological Data Stream Monitoring Using Mobile Devices

This paper considers the problem of monitoring physiological data streams obtained from resource-constrained wearable sensing devices for pervasive health-care management. It considers Orthogonal decision trees (ODTs) that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as Bagging, Boosting, Random Forests and many distributed and data stream mining algorithms. ODTs are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers experimental results to document the performance of ODTs on grounds of accuracy, model complexity, and resource consumption.

Haimonti Dutta, Hillol Kargupta, Anupam Joshi
Throughput Computing with Chip MultiThreading and Clusters

Chip MultiThreading (CMT) based systems are being introduced in the market by several computer platform vendors. At the same time, cluster computing platforms are becoming prevalent in market segments which tend to be highly price/performance driven. This paper analyzes the architectural space of these two prominent computing paradigms in technical computing markets. This analysis is carried out in terms of the application turnaround time, throughput, and scalability across multiple threads of execution. Additionally, we introduce various subscription models to optimize application throughput and turnaround time.

Mukund Buddhikot, Sanjay Goil

Session III - Architecture

Supporting MPI-2 One Sided Communication on Multi-rail InfiniBand Clusters: Design Challenges and Performance Benefits

In cluster computing, InfiniBand has emerged as a popular high performance interconnect with MPI as the de facto programming model. However, even with InfiniBand, bandwidth can become a bottleneck for clusters executing communication intensive applications. Multi-rail cluster configurations with MPI-1 are being proposed to alleviate this problem. Recently, MPI-2 with support for one-sided communication is gaining significance. In this paper, we take up the challenge of designing high performance MPI-2 one-sided communication on multi-rail InfiniBand clusters. We propose a unified MPI-2 design for different configurations of multi-rail networks (multiple ports, multiple HCAs, and combinations). We present various issues associated with one-sided communication, such as multiple synchronization messages, scheduling of RDMA (Read, Write) operations, and ordering relaxation, and discuss their implications on our design. Our performance results show that multi-rail networks can significantly improve MPI-2 one-sided communication performance. Using PCI-Express with two ports, we can achieve a peak MPI_Put bidirectional bandwidth of 2620 MB/s, compared to 1910 MB/s for the single-rail implementation. For PCI-X with two HCAs, we can almost double the throughput and reduce the latency to half for large messages.

Abhinav Vishnu, Gopal Santhanaraman, Wei Huang, Hyun-Wook Jin, Dhabaleswar K. Panda
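
For readers unfamiliar with the MPI-2 one-sided interface evaluated above, a minimal MPI_Put epoch bounded by fences looks like the following; this is a generic usage example of the standard API, independent of the multi-rail design in the paper:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1024;
    double *buf = malloc(N * sizeof(double));

    /* Expose a window of local memory for one-sided access by other ranks. */
    MPI_Win win;
    MPI_Win_create(buf, (MPI_Aint)(N * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open the access epoch */
    if (rank == 0 && nprocs > 1) {
        double payload[16] = {0};
        /* Write 16 doubles into rank 1's window at displacement 0. */
        MPI_Put(payload, 16, MPI_DOUBLE, 1, 0, 16, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);               /* close the epoch; the put is now complete */

    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
```
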
High Performance RDMA Based All-to-All Broadcast for InfiniBand Clusters

The All-to-all broadcast collective operation is essential for many parallel scientific applications. This collective operation is called MPI_Allgather in the context of MPI. Contemporary MPI software stacks implement this collective on top of MPI point-to-point calls, leading to several performance overheads. In this paper, we propose a design of All-to-All broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high performance interconnect. Our RDMA based design eliminates the overheads associated with existing designs. Our results indicate that the latency of the All-to-all broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design can improve the latency by a factor of 4.75 under no buffer reuse conditions for the same process count and message size. Further, our design can improve the performance of a parallel matrix multiplication algorithm by 37% on eight processes, while multiplying a 256x256 matrix.

S. Sur, U. K. R. Bondhugula, A. Mamidala, H. -W. Jin, D. K. Panda
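
The collective discussed above is invoked through the standard MPI interface; a minimal usage example is shown below for orientation only, since the paper's contribution is the RDMA-based implementation underneath:

```c
#include <mpi.h>

/* Each rank contributes `count` ints; afterwards every rank holds the
 * concatenation of all contributions, ordered by rank. */
void gather_everywhere(int *mine, int count, int *all, MPI_Comm comm)
{
    MPI_Allgather(mine, count, MPI_INT,   /* send buffer of this rank            */
                  all,  count, MPI_INT,   /* receive `count` ints from each rank */
                  comm);
}
```
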
Providing Full QoS Support in Clusters Using Only Two VCs at the Switches

Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. In this paper, we show that this number of VCs can be significantly reduced. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. Specifically, we show that it is enough to use two VCs for QoS purposes at each switch port, thereby simplifying the design and reducing its cost.

A. Martínez, F. J. Alfaro, J. L. Sánchez, J. Duato
Offloading Bloom Filter Operations to Network Processor for Parallel Query Processing in Cluster of Workstations

Workstation clusters have high performance interconnects with programmable network processors, which facilitate interesting opportunities to offload certain application specific computation on them and hence enhance the performance of the parallel application. Our earlier work in this direction achieves enhanced performance and balanced utilization of resources by exploiting the programmable features of the network interface in parallel database query execution. In this paper, we extend our earlier work for studying parallel query execution with Bloom filters. We propose and evaluate a scheme to offload the Bloom filter operations to the network processor. Further we explore offloading certain tuple processing activities on to the network processor by adopting a network interface attached disk scheme. The above schemes yield a speedup of up to 1.13 over the base scheme with Bloom filter where all processing is done by the host processor and achieve balanced utilization of resources. In the presence of a disk buffer cache, which reduces both the disk and I/O traffic, offloading schemes improve the speedup to 1.24.

V. Santhosh Kumar, M. J. Thazhuthaveetil, R. Govindarajan
A High-Speed VLSI Array Architecture for Euclidean Metric-Based Hausdorff Distance Measures Between Images

A new parallel algorithm to compute Euclidean metric-based Hausdorff distance measures between binary images (typically edge maps) is proposed in this paper. The algorithm has a running time of O(n) for images of size n × n. Further, the algorithm has the following features: (i) simple arithmetic, (ii) identical computations at each pixel, and (iii) computations using a small neighborhood for each pixel. An efficient cellular architecture for implementing the proposed algorithm is presented. Results of implementation using field-programmable gate arrays show that the measures can be computed for approximately 88000 image pairs of size 128×128 in a second. This result is valuable for real-time applications like object tracking and video surveillance.

N. Sudha, E. P. Vivek

Session IV - Applications

Sensor Selection Heuristic in Sensor Networks

We consider the problem of estimating the location of a moving target in a 2-D plane. In this paper, we focus attention on selecting an appropriate third sensor, given two sensors, with a view to minimizing the estimation error. Only the selected sensors need to measure distance to the target and communicate the same to the central “tracker”. This minimizes the bandwidth and energy consumed in measurement and communication while achieving near-minimum estimation error. In this paper, we propose that the third sensor be selected based on three measures, viz. (a) collinearity, (b) deviation from the ideal direction in which the sensor should be selected, and (c) proximity of the sensor to the target. We assume that the measurements are subject to multiplicative error. Further, we use a least-square-error estimation technique to estimate the target location. Simulation results show that using the proposed algorithm it is possible to achieve near-minimum error in target location.

Vaishali P. Sadaphal, Bijendra N. Jain
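
One way such a three-measure ranking could be coded is sketched below; the formulas, weights, and function names are hypothetical illustrations of the abstract's measures, not the scoring rule defined in the paper:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double x, y; } point;

/* Hypothetical score for a candidate third sensor c, given two already-selected
 * sensors a, b and the current target estimate t. Lower is better. The three
 * terms mirror the abstract's measures: (a) collinearity of a, b, c,
 * (b) deviation from an ideal bearing, and (c) distance from the target.
 * Weights w1..w3 are illustrative, not taken from the paper. */
double candidate_score(point a, point b, point c, point t,
                       double ideal_bearing, double w1, double w2, double w3)
{
    /* (a) collinearity: twice the triangle area; near zero means nearly collinear (bad). */
    double area2 = fabs((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x));
    double collinearity_penalty = 1.0 / (area2 + 1e-9);

    /* (b) angular deviation of the candidate's bearing (seen from the target). */
    double bearing = atan2(c.y - t.y, c.x - t.x);
    double dev = fabs(bearing - ideal_bearing);
    if (dev > M_PI) dev = 2.0 * M_PI - dev;

    /* (c) proximity: with multiplicative range error, closer sensors measure more accurately. */
    double dist = hypot(c.x - t.x, c.y - t.y);

    return w1 * collinearity_penalty + w2 * dev + w3 * dist;
}
```
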
Mobile Pipelines: Parallelizing Left-Looking Algorithms Using Navigational Programming

We consider the class of “left-looking” sequential matrix algorithms: consumer-driven algorithms that are characterized by “lazy” propagation of data. Left-looking algorithms are difficult to parallelize using the message-passing or distributed shared memory models because they only possess pipeline parallelism. We show that these algorithms can be directly parallelized using mobile pipelines provided by the Navigational Programming methodology. We present performance data demonstrating the effectiveness of our approach.

Lei Pan, Ming Kin Lai, Michael B. Dillencourt, Lubomir F. Bic
Distributed Point Rendering

Traditionally, graphics clusters have been employed in real-time visualization of large geometric models (many millions of 3D points). Data parallel approaches have been the obvious choices when it comes to breaking up the computations over multiple processors. In recent years, programmable graphics hardware has gained widespread acceptance. Today, every processing node in a graphics cluster has two powerful and fully programmable processors – a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). This enables distribution of graphics computations targeting an application's needs in more flexible ways. In this paper we discuss and analyze our implementation of a functionality-distributed point-based rendering pipeline with impressive performance improvements. To the best of our knowledge, it is the first attempt to devise a functionality distribution scheme for a large data- and compute-intensive application. We discuss the merits and limitations of such a distribution scheme by comparing it against traditional data parallel and single node schemes.

Ramgopal Rajagopalan, Sushil Bhakar, Dhrubajyoti Goswami, Sudhir P. Mudur
An Intra-task DVS Algorithm Exploiting Program Path Locality for Real-Time Embedded Systems

In this paper, we present a novel intra-task Dynamic Voltage Scheduling (DVS) algorithm based on the knowledge of frequently executed paths in the control flow graph for real-time embedded systems. The basic idea is to construct a common path composing all the frequently executed paths (hot-paths) and perform DVS scheduling based on this common path, rather than the most probable path. We compare the performance (energy consumption) of our algorithm with a recently proposed algorithm. Our simulation results show that the proposed algorithm performs better than the existing algorithm for most of the simulated conditions. We also identify interesting research problems in this context.

G. Sudha Anil Kumar, G. Manimaran
Advanced Resource Management and Scheduling of Workflow Applications in JavaSymphony

JavaSymphony is a high-level programming model for performance-oriented distributed and parallel Java applications, which allows the programmer to control parallelism, load balancing, and locality at a high level of abstraction. In this paper, we describe an extension of JavaSymphony that deals with distributed workflow applications as graphs of software components, which can be executed on a distributed set of computers. Workflows are not limited to DAGs, but also cover complex control flow including loops. Furthermore, we introduce a novel approach for workflow scheduling based on the HEFT algorithm and resource brokerage for a heterogeneous set of computers. We demonstrate the effectiveness of our approach with two real-world applications and compare our techniques against the widely known DAGMan Condor scheduler.

Alexandru Jugravu, Thomas Fahringer

Session V - Systems Software

Using Clustering to Address Heterogeneity and Dynamism in Parallel Scientific Applications

The dynamism and space-time heterogeneity exhibited by structured adaptive mesh refinement (SAMR) applications makes their scalable parallel implementation a significant challenge. This paper investigates an adaptive hierarchical multi-partitioner (AHMP) framework that dynamically applies multiple partitioners to different regions of the domain, in a hierarchical manner, to match the local requirements of these regions. Key components of the AHMP framework include a segmentation-based clustering algorithm (SBC) for identifying regions in the domain with relatively homogeneous partitioning requirements, mechanisms for characterizing the partitioning requirements, and a runtime system for selecting, configuring and applying the most appropriate partitioner to each region. The AHMP framework has been implemented and experimentally evaluated on up to 1280 processors of the IBM SP4 cluster at San Diego Supercomputer Center.

Xiaolin Li, Manish Parashar
Data and Computation Abstractions for Dynamic and Irregular Computations

Effective data distribution and parallelization of computations involving irregular data structures is a challenging task. We address these twin problems in the context of computations involving block-sparse matrices. The programming model provides a global view of a distributed block-sparse matrix. Abstractions are provided for the user to express the parallel tasks in the computation. The tasks are mapped onto processors to ensure load balance and locality. The abstractions are based on the Aggregate Remote Memory Copy Interface, and are interoperable with the Global Arrays programming suite and MPI. Results are presented that demonstrate the utility of the approach.

Sriram Krishnamoorthy, Jarek Nieplocha, P. Sadayappan
XCAT-C++: Design and Performance of a Distributed CCA Framework

In this paper we describe the design and implementation of a C++ based Common Component Architecture (CCA) framework, XCAT-C++. It can efficiently marshal and unmarshal large data sets, and provides the necessary modules and hooks in the framework to meet the requirements of distributed scientific applications. XCAT-C++ uses a high-performance multi-protocol library so that the appropriate communication protocol is employed for each pair of interacting components. Scientific applications can dynamically switch to a suitable communication protocol to maximize effective throughput. XCAT-C++ component layering imposes minimal overhead and application components can achieve highly efficient throughput for large data sets commonly used in scientific computing. It has a suite of tools to aid application developers including a flexible code generation toolkit and a python scripting interface. XCAT-C++ provides the means for application developers to leverage the efficacy of the CCA component model to manage the complexity of their distributed scientific simulations.

Madhusudhan Govindaraju, Michael R. Head, Kenneth Chiu
The Impact of Noise on the Scaling of Collectives: A Theoretical Approach

The performance of parallel applications running on large clusters is known to degrade due to the interference of kernel and daemon activities on individual nodes, often referred to as noise. In this paper, we focus on an important class of parallel applications, which repeatedly perform computation, followed by a collective operation such as a barrier. We model this theoretically and demonstrate, in a rigorous way, the effect of noise on the scalability of such applications. We study three natural and important classes of noise distributions: the exponential distribution, the heavy-tailed distribution, and the Bernoulli distribution. We show that the systems scale well in the presence of exponential noise, but the performance goes down drastically in the presence of heavy-tailed or Bernoulli noise.

Saurabh Agarwal, Rahul Garg, Nisheeth K. Vishnoi
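
A back-of-the-envelope way to see the dichotomy described above (our own simplified sketch under an assumption of i.i.d. per-node noise, not the authors' derivation): each compute-barrier interval is stretched by the slowest of the n participants, i.e. by the maximum of n noise samples.

```latex
% Simplified model: per-iteration overhead = E[max of n i.i.d. noise samples X_1,...,X_n].
\begin{align*}
\text{Exponential noise, rate } \lambda:\quad
  \mathbb{E}\Big[\max_{1\le i\le n} X_i\Big]
    &= \frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k} \approx \frac{\ln n}{\lambda}
    && \text{(logarithmic growth: scales well)} \\
\text{Heavy-tailed noise, } \Pr[X > x] \sim x^{-\alpha}:\quad
  \mathbb{E}\Big[\max_{1\le i\le n} X_i\Big]
    &= \Theta\big(n^{1/\alpha}\big)
    && \text{(polynomial growth: scales poorly)}
\end{align*}
```
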
Extensible Parallel Architectural Skeletons

Complexity of parallel application development has been one of the major obstacles towards the mainstream adoption of parallel programming. In order to hide some of these complexities, researchers have been actively investigating the pattern-based approaches to parallel programming. As reusable components, patterns are intended to ease the design and development phases of parallel applications. Parallel Architectural Skeleton (PAS) is one such pattern-based parallel programming model which describes the architectural aspects of parallel patterns. Like many other pattern-based parallel programming models and tools, the benefits of PAS were offset by the difficulties in extending PAS.

EPAS is an extension of PAS that addresses this issue. Using EPAS, a skeleton designer can design new skeletons and add them to the skeleton repository (i.e., extensibility). EPAS also makes the PAS model more flexible by defining composition of skeletons. In this paper, we describe the model of EPAS and also discuss some of the recent usability and performance studies. The studies demonstrate that EPAS is a practical and usable parallel programming model and tool.

Mohammad Mursalin Akon, Ajit Singh, Dhrubajyoti Goswami, Hon Fung Li

Session VI - Communication Networks

An Efficient Distributed Algorithm for Finding Virtual Backbones in Wireless Ad-Hoc Networks

A minimum connected dominating set is an efficient approach to forming a virtual backbone for ad-hoc networks. We propose a tree-based, distributed, time/message-efficient approximation algorithm to compute a small connected dominating set without using geographic positions. The algorithm has O(n) time and O(n log n) message complexity, and has an approximation factor of eight. The algorithm is implemented using a dominating set simulation program, which shows that our method gives a smaller connected dominating set than the existing methods.

B. Paul, S. V. Rao, S. Nandi
A Novel Battery Aware MAC Protocol for Minimizing Energy × Latency in Wireless Sensor Networks

Wireless Sensor Networks (WSNs) possess highly constrained energy resources. The existing Medium Access Control (MAC) protocols for WSNs try either to minimize the energy consumption or the latency, which are conflicting objectives, or to find a trade-off between them. They fail to achieve the minimum energy × latency, which ensures that transmission occurs such that both the energy consumption and latency are minimized. We propose a novel Battery-aware Energy-efficient MAC protocol to minimize the Latency (BEL-MAC) that exploits the chemical properties of the batteries of the sensor nodes in order to increase their lifetime. Our protocol reduces the latency of the packets in an efficient manner without compromising the lifetime of the network. We compare our protocol with SMAC, DSMAC, TMAC, and IEEE 802.11 MAC in terms of throughput and latency, and show that our protocol outperforms these existing protocols in terms of energy × latency.

M. Dhanaraj, S. Jayashree, C. Siva Ram Murthy
On the Power Optimization and Throughput Performance of Multihop Wireless Network Architectures

With the emergence of powerful processors and complex applications, wireless communication devices are increasingly power hungry. While there exist several solutions to provide transmission power management in cellular wireless networks and ad hoc wireless networks, it remains an open problem in recently proposed hybrid wireless networks. The Multihop Cellular Network (MCN) and Multi Power Architecture for Cellular network (MuPAC) are instances of hybrid wireless networks, which are proposed to increase the system throughput and spectrum reuse by infusing a multihop radio relaying mechanism into infrastructure-based wireless networks. This paper proposes a novel variable power optimization scheme for hybrid wireless network architectures such as MCN and MuPAC in order to optimize the power consumption at a mobile node without losing the throughput advantage gained by the multihop scheme. Extensive simulation results show a 10% to 15% improvement in power consumption and system throughput, which is significant for power-constrained mobile nodes.

G. Bhaya, B. S. Manoj, C. Siva Ram Murthy
A Novel Solution for Time Synchronization in Wireless Ad Hoc and Sensor Networks

Time synchronization is an important aspect of distributed computer systems and networks. Nodes must be synchronized to a common clock to determine slot durations for a TDMA based transmission scheme. Most efficient slot-assignment algorithms apportion the TDMA slots with the underlying assumption of a reasonably accurate global synchronization of the network. In this paper, we propose a novel synchronization protocol for ad hoc, sensor, and other dense multi-hop infrastructure-less wireless networks. The protocol performs a random leader election to achieve global network synchronization. We have analyzed the variation of synchronization time and error with different node densities and mobility speeds, by simulating the protocol. Expressions have been derived reflecting the worst case synchronization error, and the maximum synchronization time, for a network with uniform distribution of nodes. Simulation results show that out-of-band and piggybacked signaling have good accuracy of synchronization, and that a considerable bandwidth saving occurs with piggybacking on data or acknowledgment packets.

Archana Sekhar, B. S. Manoj, C. Siva Ram Murthy
An Algorithm for Boundary Discovery in Wireless Sensor Networks

Wireless Sensor Networks (WSNs) consist of a large number of nodes networked via wireless links. In many WSN settings, sensor nodes are deployed in an ad hoc manner. One important issue in this context is to detect the boundary of the deployed network to ensure that the sensor nodes cover the target area. In this paper, we propose a new algorithm that can be used to discover the boundary of a randomly deployed WSN. The algorithm does not require the sensor nodes to be equipped with positioning devices and is scalable to large numbers of nodes. Simulation experiments are developed to evaluate the performance of the proposed algorithm for different network topologies. The simulation results show that the algorithm detects the boundary nodes of the network with high accuracy.

Jitender S. Deogun, Saket Das, Haitham S. Hamza, Steve Goddard

Session VII - Architecture

A Low-Complexity Issue Queue Design with Speculative Pre-execution

Current superscalar architectures inherently depend on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue is implemented as a centralized structure and mainly causes globally broadcasting operations to wake up and select the instructions. Therefore, a large issue queue ultimately results in a low clock rate along with a high circuit complexity. This paper proposes Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage the small issue queue more efficiently without increasing the queue size. To this end, we have first recognized that the long memory latency is one of the factors which demand a large queue, and we aim at achieving early execution of the miss-causing load instructions using another hierarchy of an issue queue. We speculatively pre-execute those miss-causing instructions as an additional prefetching thread.

Won W. Ro, Jean-Luc Gaudiot
Performance and Power Evaluation of an Intelligently Adaptive Data Cache

We describe the analysis of an on-line pattern-recognition algorithm to dynamically control the configuration of the L1 data cache of a high-performance processor. The microarchitecture achieves higher performance and energy savings by adapting operating frequency, capacity, set-associativity, line size, hit latency, energy per access, and chip area to the program workload and ILP. We show that for an operating frequency of 4.5 GHz, the execution time is always reduced, with an average measure of 12.1%, when compared to a non-adaptive high-performance processor. Additionally, the energy saving is 2.7% on average, and the time-energy product is reduced on average by 14.9%. We also consider a profile-based reconfiguration of the data cache, which allows picking different cache configurations but only one can be chosen for each program. Experimental results indicate that this approach yields a high percentage of the performance improvement and energy saving achieved by the on-line algorithm.

Domingo Benítez, Juan Carlos Moure, Dolores Isabel Rexachs, Emilio Luque
Neural Confidence Estimation for More Accurate Value Prediction

Data dependencies between instructions have traditionally limited the ability of processors to execute instructions in parallel. Data value predictors are used to overcome these dependencies by guessing the outcomes of instructions. Because mispredictions can result in a significant performance decrease, most data value predictors include a confidence estimator that indicates whether a prediction should be used.

This paper presents a global approach to confidence estimation in which the prediction accuracy of previous instructions is used to estimate the confidence of the current prediction. Perceptrons are used to identify which past instructions affect the accuracy of a prediction and to decide whether the prediction is likely to be correct.

Simulation studies compare this global confidence estimator to the more conventional local confidence estimator. Results show that predictors using this global confidence estimator tend to predict significantly more instructions and incur fewer mispredictions than predictors using existing local confidence estimation approaches.

Michael Black, Manoj Franklin
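
A minimal sketch of a perceptron-style confidence estimator over a global history of prediction outcomes is given below; the table size, history length, threshold, and training rule are illustrative assumptions, not the configuration studied in the paper:

```c
#include <stdint.h>
#include <stdlib.h>

#define HIST_LEN   32                 /* bits of global correct/incorrect history (assumed) */
#define NUM_PERCEP 1024               /* perceptron table size (assumed)                    */
#define THRESHOLD  14                 /* training threshold (assumed)                       */

static int8_t weights[NUM_PERCEP][HIST_LEN + 1];   /* +1 for the bias weight */
static int    history[HIST_LEN];                   /* +1 = recent prediction correct, -1 = wrong */

/* Confidence of a value prediction made at `pc`: a positive output means
 * "likely correct, use the prediction"; the magnitude is the confidence. */
int confidence(uint64_t pc)
{
    int8_t *w = weights[pc % NUM_PERCEP];
    int y = w[0];                                    /* bias */
    for (int i = 0; i < HIST_LEN; i++)
        y += w[i + 1] * history[i];
    return y;
}

/* After the prediction resolves, train the perceptron and shift the history. */
void train(uint64_t pc, int y, int was_correct)
{
    int correct = was_correct ? 1 : 0;
    int t = correct ? 1 : -1;
    int8_t *w = weights[pc % NUM_PERCEP];

    /* Train only on a wrong confidence decision or a low-magnitude output. */
    if ((y >= 0) != correct || abs(y) <= THRESHOLD) {
        if (w[0] + t <= 127 && w[0] + t >= -128) w[0] += t;
        for (int i = 0; i < HIST_LEN; i++) {
            int nw = w[i + 1] + t * history[i];      /* reinforce agreeing history bits */
            if (nw <= 127 && nw >= -128) w[i + 1] = (int8_t)nw;
        }
    }
    for (int i = HIST_LEN - 1; i > 0; i--) history[i] = history[i - 1];
    history[0] = t;
}
```
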
The Potential of On-Chip Multiprocessing for QCD Machines

We explore the opportunities offered by current and forthcoming VLSI technologies for on-chip multiprocessing for Quantum Chromo Dynamics (QCD), a computational grand challenge for which over half a dozen specialized machines have been developed over the last two decades. Based on a careful study of the information exchange requirements of QCD, both across the network and within the memory system, we derive the optimal partition of die area between storage and functional units. We show that a scalable chip organization holds the promise of delivering from hundreds to thousands of floating-point operations per cycle as the VLSI feature size scales down from 90 nm to 20 nm over the next dozen years.

Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Fabio Schifano, Raffaele Tripiccione
Low-Power 32bit×32bit Multiplier Design with Pipelined Block-Wise Shutdown

This paper proposes a novel low-power 32bit×32bit multiplier with a pipelined block-wise shutdown scheme. When it idles, it turns off the supply voltage to reduce both dynamic and static power. It shuts down and wakes up sequentially along the pipeline stages to avoid power line noise. In idle mode, the proposed multiplier consumes 0.013mW and 0.006mW in 0.13 μm and 0.09 μm technologies, respectively, and it reduces power consumption to 0.07%~0.08% of active mode. As fabrication technology shrinks, power efficiency degrades in the conventional clock gating scheme, but not in the proposed multiplier. The low-power design methodology in this paper can easily be adopted in most functional blocks with a pipeline architecture.

Yong-Ju Jang, Yoan Shin, Min-Cheol Hong, Jae-Kyung Wee, Seongsoo Lee

Session VIII - Communication Networks

Performance Analysis of User-Level PIM Communication in the Data IntensiVe Architecture (DIVA) System

The performance of user-level messaging in PIM (Processing-In-Memory) to PIM communication is modeled and analyzed for the DIVA (Data IntensiVe Architecture) system. Six benchmarks have been used for this purpose, two from each category, namely single message transfer, parallel transfer, and collective communication, as described for the PMB (Pallas MPI Benchmarks). The benchmarks used are PingPong, PingPing, SendReceive, Exchange, Barrier synchronization, and AllToAll personalized exchange. The main significance of this work lies in the evaluation of an implementation of system-wide support for memory-to-memory and memory-to-host communication via a parcel buffer (used as a network interface). Another remarkable feature of this evaluation lies in presenting an optimal algorithm for Barrier synchronization and an optimal algorithm, with full channel utilization, for AllToAll personalized exchange for the bi-directional ring configuration of up to 8 DIVA PIMs in the memory system of a Hewlett-Packard zx6000 server. The algorithms presented can be scaled to a higher number of PIM chips with a little degradation in performance over the optimal solution. Our analysis shows that the currently employed communication mechanism can be used very efficiently for collective communication operations, and it also exposes the bottlenecks in the current design for future improvements.

Sumit Dharampal Mediratta, Jeffrey Draper
Improved Point-to-Point and Collective Communication Performance with Output-Queued High-Radix Routers

We present an output-queued switch architecture with cross-point buffering that has improved performance for both point-to-point communication and hardware-accelerated collective communication. In the past, output queuing architectures have been less popular as they require more internal speedup and buffering. However, with current technology it is possible to build output-queued switches with a relatively large number of ports. We demonstrate that our output-queued architecture performs well for point-to-point messages, especially in a fat-tree topology. We also show that output-queued architectures facilitate efficient implementations of multicasts and reductions. We present performance of multicasts and reductions on individual switches and a network of switches interconnected in a fat-tree topology. We also present simulation results based on synthetic workloads that emulate a molecular dynamics application.

Sameer Kumar, Craig Stunkel, Laxmikant V. Kalé
A Clustering and Traffic-Redistribution Scheme for High-Performance IPsec VPNs

CPE-based IPsec VPNs have been widely used to provide secure private communication across the Internet. As the bandwidth of WAN links keeps growing, the bottleneck in a typical deployment of CPE-based IPsec VPNs has moved from the last-mile connections to the customer-edge security gateways. In this paper, we propose a clustering scheme to scale the throughput as required by CPE-based IPsec VPNs. The proposed scheme groups multiple security gateways into a cluster using a transparent self-dispatching technique and allows as many gateways to be added as necessary until the resulting throughput is again limited by the bandwidth of the last-mile connections. It also includes a flow-migration mechanism to keep the load of the gateways balanced. The results of the performance evaluation confirm that the clustering technique and the traffic-redistribution mechanism together create a transparent, adaptive, and highly scalable solution for building high-performance IPsec VPNs.

Pan-Lung Tsai, Chun-Ying Huang, Yun-Yin Huang, Chia-Chang Hsu, Chin-Laung Lei
WDM Multistage Interconnection Networks Architectures for Enhancing Supernetworks Switching Infrastructure

Multistage Interconnection Networks (MINs) provide the required switching infrastructure for many shared-memory multiprocessor systems and telecommunication networks. The concept of Supernetworks is evolving in response to emerging computation- and communication-intensive applications. Supernetworks exploit parallelism in both computing resources and communication infrastructures by interconnecting several computing clusters via high-bandwidth communication links. Wavelength Division Multiplexing (WDM) technology provides the communication infrastructure for Supernetworks by dividing the bandwidth of a single fiber into numerous channels that can be used independently. In this paper, we investigate several architectures for WDM MINs that enhance the Supernetworks switching infrastructure. Our objective is to propose a new architecture and to evaluate its hardware complexity by comparing it to other WDM MIN architectures.

Haitham S. Hamza, Jitender S. Deogun
Learning-TCP: A Novel Learning Automata Based Congestion Window Updating Mechanism for Ad hoc Wireless Networks

The use of traditional TCP, in its present form, for reliable transport over Ad hoc Wireless Networks (AWNs) leads to a significant degradation in network performance. This is primarily due to the congestion window (cwnd) update and congestion control mechanisms employed by TCP and its inability to distinguish congestion losses from wireless losses. In order to provide efficient reliable transport over AWNs, we propose Learning-TCP, a novel learning automata based reliable transport protocol, which efficiently adjusts the cwnd size and thus reduces packet losses. The key idea behind Learning-TCP is that it dynamically adapts to changing network conditions and appropriately updates the cwnd size by observing the arrival of acknowledgment (ACK) and duplicate ACK (DUPACK) packets. Learning-TCP, unlike other existing proposals for reliable transport over AWNs, does not require any explicit feedback, such as congestion and link failure notifications, from the network. We provide extensive simulation studies of Learning-TCP under varying network conditions, which show increased throughput (9-18%) and reduced packet loss (42-55%) compared to that of TCP.

B. Venkata Ramana, C. Siva Ram Murthy
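
A common learning-automata update consistent with the abstract's description is the linear reward scheme sketched below; the action set, learning rate, and update rule are our illustrative assumptions, not Learning-TCP's actual action probabilities or update functions:

```c
/* Illustrative linear reward-penalty automaton for choosing how aggressively to
 * grow cwnd. Actions and learning rate are hypothetical, not Learning-TCP's. */
#define NUM_ACTIONS 3                  /* e.g. increase, hold, or decrease cwnd */

static double prob[NUM_ACTIONS] = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };

/* Reinforce the chosen action on a favorable event (ACK arrival) and weaken it
 * on an unfavorable one (DUPACK); the probabilities always remain normalized. */
void automaton_update(int chosen, int favorable, double a /* learning rate, e.g. 0.05 */)
{
    for (int i = 0; i < NUM_ACTIONS; i++) {
        if (favorable)
            prob[i] = (i == chosen) ? prob[i] + a * (1.0 - prob[i])
                                    : prob[i] * (1.0 - a);
        else
            prob[i] = (i == chosen) ? prob[i] * (1.0 - a)
                                    : prob[i] * (1.0 - a) + a / (NUM_ACTIONS - 1);
    }
}
```
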

Session IX - Algorithms

Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors

Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.

David A. Bader, Kamesh Madduri
Scheduling Multiple Flows on Parallel Disks

We examine the problem of scheduling concurrent independent flows on multiple-disk I/O storage systems. Two models are considered: in the shared buffer model the memory buffer is shared among all the flows, while in the partitioned buffer model each flow has a private buffer. For the parallel disk model with d > 1 disks, it is shown that the problem of minimizing the schedule length of n > 2 concurrent flows is NP-complete for both buffer models. A randomized scheduling algorithm for the partitioned buffer model is analyzed and probabilistic bounds on the schedule length are presented. Finally, a heuristic based on static buffer allocation for the shared buffer model is discussed.

Ajay Gulati, Peter Varman
Snap-Stabilizing Detection of Cutsets

A snap-stabilizing protocol, starting from any configuration, always behaves according to its specification. Here, we present the first snap-stabilizing protocol for arbitrary rooted networks that detects whether a set of nodes is a cutset. This protocol is based on the depth-first search (DFS) traversal and its properties. One of the most interesting properties of our protocol is that, regardless of the initial configuration, as soon as the protocol is initiated by the root, the result obtained from the computations will be correct. So, after the first execution of the protocol, the root is able to take a decision: “the input set is a cutset or not”, and this decision is correct.

Alain Cournier, Stéphane Devismes, Vincent Villain
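
For intuition about the property the protocol decides (a centralized, sequential restatement only, not the snap-stabilizing distributed protocol itself): a set S is a cutset exactly when removing S leaves the remaining nodes disconnected, which a single DFS can check.

```c
#include <string.h>

#define MAXN 1024

/* adj[u][v] != 0 iff there is an edge u-v; in_set[u] != 0 iff u belongs to S. */
static int adj[MAXN][MAXN], in_set[MAXN], visited[MAXN];

static void dfs(int u, int n)
{
    visited[u] = 1;
    for (int v = 0; v < n; v++)
        if (adj[u][v] && !visited[v] && !in_set[v])
            dfs(v, n);
}

/* Returns 1 if S is a cutset of the n-node graph, i.e. the nodes outside S
 * are no longer all mutually reachable once S is removed. */
int is_cutset(int n)
{
    memset(visited, 0, sizeof visited);
    int start = -1;
    for (int u = 0; u < n; u++)
        if (!in_set[u]) { start = u; break; }
    if (start < 0) return 0;           /* S contains every node: nothing left to disconnect */
    dfs(start, n);
    for (int u = 0; u < n; u++)
        if (!in_set[u] && !visited[u])
            return 1;                  /* some surviving node is unreachable: S is a cutset */
    return 0;
}
```
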
Scheduling Divisible Loads with Return Messages on Heterogeneous Master-Worker Platforms

In this paper, we consider the problem of scheduling divisible loads onto a heterogeneous star platform, with both heterogeneous computing and communication resources. We consider the case where the workers, after processing the tasks, send back some results to the master processor. This corresponds to a more general framework than the one used in many divisible load papers, where only forward communications are taken into account. To the best of our knowledge, this paper constitutes the first attempt to derive optimality results under this general framework (forward and backward communications, heterogeneous processing and communication resources). We prove that it is possible to derive the optimal solution for both LIFO and FIFO distribution schemes. Nevertheless, the complexity of the general problem remains open: we also show in the paper that the optimal distribution scheme may be neither LIFO nor FIFO.

Olivier Beaumont, Loris Marchal, Yves Robert

Session X - Systems and Networks

A Grid Authentication System with Revocation Guarantees

Credential revocation is a critical problem in grid environments and remains unaddressed in existing grid security solutions. We present a novel grid authentication system that solves the revocation problem. It guarantees instantaneous revocation of both long-term digital identities of hosts/users and short-lived identities of user proxies. With our approach, revocation information is guaranteed to be fresh with high time-granularity. Our system employs mediated RSA (mRSA), adapts Boneh’s notion of semi-trusted mediators to suit security in virtual organizations, and propagates proxy revocation information as in Micali’s NOVOMODO system. Our approach’s added benefits include a configuration-free security model for end-users of the grid and fine-grained management of users’ delegation capabilities.

Babu Sundaram, Barbara M. Chapman
Integrating a New Cluster Assignment and Scheduling Algorithm into an Experimental Retargetable Code Generation Framework

This paper presents a new unified algorithm for cluster assignment and region scheduling, and its integration into an experimental retargetable code generation framework. The components of the framework are an instruction selector generator based on a recent technique, the IMPACT front end, a machine description module which uses a modification of the HMDES machine description language to include cluster information, a combined cluster allocator and an acyclic region scheduler, and a register allocator. Experiments have been carried out on the targeting of the tool to the Texas Instruments TMS320c62x architecture. We report preliminary results on a set of TI benchmarks.

K. Vasanta Lakshmi, Deepak Sreedhar, Easwaran Raman, Priti Shankar
Cooperative Instruction Scheduling with Linear Scan Register Allocation

Linear scan register allocation is an attractive register allocation algorithm because of its simplicity and fast running time. However, it is generally felt that linear scan register allocation yields poorer code than allocation schemes based on graph coloring. In this paper, we propose a pre-pass instruction scheduling algorithm that improves on the code quality of linear scan allocators. Our implementation in the Trimaran compiler-simulator infrastructure shows that our scheduler can reduce the number of active live ranges that the linear scan allocator has to deal with. As a result, fewer spills are needed and the quality of the generated code is improved. Furthermore, compared to the default scheduling and graph-coloring allocator schemes found in the IMPACT and Elcor components of Trimaran, our implementation with our pre-pass scheduler and linear scan register allocator significantly reduced compilation times.

Khaing Khaing Kyi Win, Weng-Fai Wong
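
For readers who have not seen it, the core of linear scan allocation fits in a few lines; this generic textbook-style sketch (assuming at most 64 registers and live intervals pre-sorted by start point) is independent of the Trimaran-specific pre-pass scheduler described above:

```c
typedef struct { int start, end; int reg; int spilled; } interval;

/* Classic linear scan over live intervals sorted by increasing start point.
 * `active` holds intervals currently in registers, sorted by end point.
 * Generic sketch only; not the Trimaran/IMPACT implementation. */
void linear_scan(interval *iv, int n, int num_regs)
{
    interval *active[64];                       /* assumes num_regs <= 64 */
    int free_pool[64], nfree = 0, nactive = 0;

    for (int r = num_regs - 1; r >= 0; r--) free_pool[nfree++] = r;

    for (int i = 0; i < n; i++) {
        /* Expire intervals ending before this one starts; return their registers. */
        int k = 0;
        for (int j = 0; j < nactive; j++) {
            if (active[j]->end >= iv[i].start) active[k++] = active[j];
            else free_pool[nfree++] = active[j]->reg;
        }
        nactive = k;

        if (nfree > 0) {
            iv[i].reg = free_pool[--nfree];     /* any free register will do here */
            iv[i].spilled = 0;
        } else {
            /* No register free: spill whichever of the new interval and the
             * furthest-ending active interval lives longest. */
            interval *last = active[nactive - 1];
            if (last->end > iv[i].end) {
                iv[i].reg = last->reg;  iv[i].spilled = 0;
                last->spilled = 1;      nactive--;
            } else {
                iv[i].spilled = 1;
                continue;                       /* spilled interval never enters `active` */
            }
        }
        /* Insert the new interval into `active`, keeping it sorted by end point. */
        int p = nactive++;
        while (p > 0 && active[p - 1]->end > iv[i].end) { active[p] = active[p - 1]; p--; }
        active[p] = &iv[i];
    }
}
```
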
iSCSI Analysis System and Performance Improvement of iSCSI Sequential Access in High Latency Networks

IP-SAN and iSCSI are expected to remedy the problems of FC-based SAN. iSCSI has a structure of multilayer protocols. A typical configuration of the protocols to realize this system is as follows: SCSI over iSCSI over TCP/IP over Ethernet. Thus, in order to improve the performance of the system, it is necessary to precisely analyze the complicated behavior of each layer. In this paper, we present an IP-SAN analysis tool that monitors each of these layers from different viewpoints. By using this analysis tool, we experimentally demonstrate that the performance of iSCSI storage access can be significantly improved by more than 60 times.

Saneyasu Yamaguchi, Masato Oguchi, Masaru Kitsuregawa
Backmatter
Metadata
Title
High Performance Computing – HiPC 2005
Editors
David A. Bader
Manish Parashar
Varadarajan Sridhar
Viktor K. Prasanna
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-32427-0
Print ISBN
978-3-540-30936-9
DOI
https://doi.org/10.1007/11602569