2002 | Book

High Performance Computing — HiPC 2002

9th International Conference Bangalore, India, December 18–21, 2002 Proceedings

Edited by: Sartaj Sahni, Viktor K. Prasanna, Uday Shukla

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Keynote Address

Info-Bio-Nano Interface: High-Performance Computing & Visualization

Follow the advances in computing technologies from teraflop to petaflop (hardware, software, and algorithms) to: perform realistic simulations of nanosystems and devices; demonstrate the feasibility of simulating systems not yet attempted; and incorporate simulation, parallel computing, and visualization in physical sciences and engineering education.

Priya Vashishta, Rajiv K. Kalia, Aiichiro Nakano

Algorithms I

2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD Parallelism Exploitation

This paper addresses the implementation of a 2-D Discrete Wavelet Transform on general-purpose microprocessors, focusing on both memory hierarchy and SIMD parallelization issues. The two topics are closely related, since SIMD extensions are only useful if the memory hierarchy is efficiently exploited. In this work, locality has been significantly improved by means of a novel approach called pipelined computation, which complements previous techniques based on loop tiling and non-linear layouts. As experimental platforms we have employed a Pentium-III (P-III) and a Pentium-4 (P-4) microprocessor. Our SIMD-oriented tuning has been performed exclusively at the source code level: basically, we have reordered some loops and introduced some modifications that allow automatic vectorization. Taking into account the abstraction level at which the optimizations are carried out, the speedups obtained on the investigated platforms are quite satisfactory, though further improvement could be obtained by dropping the level of abstraction (compiler intrinsics or assembly code).

Daniel Chaver, Christian Tenllado, Luis Piñuel, Manuel Prieto, Francisco Tirado
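
As background for the transform being tuned, the separable row/column structure of a one-level 2-D wavelet transform can be sketched in plain Python. This is an illustrative Haar-filter sketch, not the paper's optimized implementation; the filter choice and function names are assumptions:

```python
def haar_1d(v):
    # One Haar filtering step: pairwise averages (low band)
    # followed by pairwise differences (high band).
    avg = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    dif = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return avg + dif

def haar_2d(img):
    # Separable 2-D transform: filter every row, then every column.
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]   # transpose, filter
    return [list(r) for r in zip(*cols)]            # transpose back
```

The row pass followed by the column pass is exactly the loop nest whose ordering and tiling determine cache behavior, which is why the locality optimizations discussed in the paper matter here.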
A General Data Layout for Distributed Consistency in Data Parallel Applications

This paper presents a general layout for partitioning and mapping data across processors in data parallel applications. Our scheme generalizes the existing schemes (block, cyclic) and enables nontraditional ones (e.g. graph partitioning [7],[17]). A distributed algorithm uses the data layout and the read/write access patterns to ensure consistency for data parallel applications. We show examples of the applicability of our data layout and consistency schemes for different classes of scientific applications. We present experimental results on the effectiveness of our approach for loosely synchronous, data parallel applications.

Roxana Diaconescu
A Parallel DFA Minimization Algorithm

In this paper, we have considered the state minimization problem for Deterministic Finite Automata (DFA). An efficient parallel algorithm for solving the problem on an arbitrary CRCW PRAM has been proposed. For n states and k input symbols in Σ of the DFA to be minimized, the algorithm runs in O(kn log n) time and uses O(n/log n) processors.

Ambuj Tewari, Utkarsh Srivastava, P. Gupta
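
For orientation, the sequential baseline for this problem (Moore-style partition refinement) can be sketched in a few lines of Python. This only illustrates the problem being parallelized, not the paper's CRCW PRAM algorithm; the example automaton in the usage note is hypothetical:

```python
def count_min_states(states, alphabet, delta, accepting):
    # Partition refinement: start from {accepting, non-accepting} and
    # repeatedly split blocks whose members disagree on some successor's
    # block, until the partition stabilizes. Returns the number of
    # states of the minimal DFA (i.e., of equivalence classes).
    part = {s: (s in accepting) for s in states}
    while True:
        # Signature of a state: its own block plus the block of each
        # successor under every input symbol.
        sig = {s: (part[s],) + tuple(part[delta[s, a]] for a in alphabet)
               for s in states}
        blocks = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        if len(blocks) == len(set(part.values())):
            return len(blocks)
        part = {s: blocks[sig[s]] for s in states}
```

For example, in a 4-state DFA over {a} with transitions 0→1, 1→3, 2→3, 3→3 and accepting state 3, states 1 and 2 are equivalent, so the minimal DFA has 3 states.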
Accelerating the CKY Parsing Using FPGAs

The main contribution of this paper is to present an FPGA-based implementation of an instance-specific hardware which accelerates the CKY (Cook-Kasami-Younger) parsing for context-free grammars. Given a context-free grammar G and a string x, the CKY parsing determines if G derives x. We have developed a hardware generator that creates a Verilog HDL source to perform the CKY parsing for any given context-free grammar G. The created source is embedded in an FPGA using the design software provided by the FPGA vendor. We evaluated the instance-specific hardware, generated by our hardware generator, using a timing analyzer and tested it using the Altera FPGAs. The generated hardware attains a speed-up factor of approximately 750 over the software CKY parsing algorithm. Hence, we believe that our approach is a promising solution for the CKY parsing.

Jacir L. Bordim, Yasuaki Ito, Koji Nakano
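
The software side of the comparison, the classic dynamic-programming CKY parser for a grammar in Chomsky normal form, can be sketched as follows. This is a generic Python illustration of the algorithm the hardware accelerates; the rule encoding and the toy grammar in the test are assumptions, not the paper's design:

```python
def cky_parse(binary_rules, terminal_rules, start, x):
    # binary_rules:   set of (A, B, C) for productions A -> B C
    # terminal_rules: set of (A, a) for productions A -> a
    # Returns True iff the grammar derives the string x.
    n = len(x)
    if n == 0:
        return False  # this sketch assumes no S -> epsilon rule
    # table[i][j] holds the nonterminals deriving x[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(x):
        table[i][i] = {A for (A, a) in terminal_rules if a == ch}
    for span in range(2, n + 1):                  # substring length
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # split point
                for (A, B, C) in binary_rules:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]
```

The O(n³) triple loop over (span, i, k) is the regular, grammar-dependent computation that instance-specific hardware can parallelize, which is where the reported speed-up comes from.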
Duplication-Based Scheduling Algorithm for Interconnection-Constrained Distributed Memory Machines

Duplication-based scheduling techniques are more appropriate for fine-grain task graphs and for networks with high communication latencies. However, most such algorithms assume a fully connected processor network and have prohibitively high O(v⁴) time complexity. An insertion-based duplication algorithm is proposed for precedence-constrained task graphs on processors with limited interconnection. It duplicates only the most important immediate parents of a task, and only when they are critical. Results are presented for benchmark random task graphs, with widely varying shape and cost parameters, on the clique, hypercube, and an extensible and fault-tolerant binary de Bruijn (undirected) multiprocessor network. The average performance degradation due to interconnection constraints is about 21% compared to a fully connected processor network. Further, the schedules generated on the fixed-degree binary de Bruijn network are within 5% of the schedules on the hypercube network, whose degree keeps increasing with size.

Savina Bansal, Padam Kumar, Kuldip Singh
Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)
Extended Abstract

The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.

David A. Bader, Sukanya Sreshta, Nina R. Weisse-Bernstein

Architecture I

Dead-Block Elimination in Cache: A Mechanism to Reduce I-cache Power Consumption in High Performance Microprocessors

Both power and performance are important design parameters of present day processors. This paper explores an integrated software and circuit level technique to reduce leakage power in L1 instruction caches of high performance microprocessors, by eliminating basic blocks from the cache as soon as they are dead. The effect of this dead-block elimination on both the power consumption of the I-cache and the performance of the processor is studied. Identification of basic blocks is done by the compiler from the control flow graph of the program. This information is conveyed to the processor by annotating the first instruction of selected basic blocks. During execution, the blocks that are no longer needed are traced and invalidated, and the lines occupied by them are turned off. This mechanism yields an average reduction of about 5% to 16% in the energy consumed, for different sizes of I-cache, for a set of the SPEC CPU 2000 benchmarks [16], without any performance degradation.

Mohan G. Kabadi, Natarajan Kannan, Palanidaran Chidambaram, Suriya Narayanan, M. Subramanian, Ranjani Parthasarathi
Exploiting Web Document Structure to Improve Storage Management in Proxy Caches

Proxy caches are essential to improve the performance of the World Wide Web and to reduce user-perceived latency. In this paper, we propose a new Web object based policy to manage the storage system of a proxy cache. We propose two techniques to improve the storage system performance. The first technique is concerned with prefetching the related files belonging to a Web object from the disk to main memory. This prefetching improves performance as most of the files can be provided from main memory instead of the proxy disk. The second technique stores the Web object members in contiguous disk blocks in order to reduce the disk access time. This in turn reduces the disk response time. We have used trace-driven simulations to study the performance improvements one can obtain with these two techniques.

Abdolreza Abhari, Sivarama P. Dandamudi, Shikharesh Majumdar
High Performance Multiprocessor Architecture Design Methodology for Application-Specific Embedded Systems

This paper presents a methodology for the design of application-specific multiprocessor systems. The methodology, named AIM, guides a designer to select the right heterogeneous architecture starting from a set of target applications. The methodology distinguishes between applications and architectures, which are modeled as Kahn Process Networks and cycle-accurate bit-true models, respectively. A gradual mapping of applications onto architectures is defined to analyze the performance of the resulting system by simulation. As a consequence, a functionally complete system is always available to the designer, without the overwhelming issues related to multiprocessor architectures. The described methodology is illustrated through the design of a multiprocessor architecture for an MPEG-1 audio decoder application.

Syed Saif Abrar
LLM: A Low Latency Messaging Infrastructure for Linux Clusters

In this paper, we develop a messaging infrastructure, called LLM, to arrive at a robust and efficient low latency message passing infrastructure for kernel-to-kernel communication. The main focus is to overcome the high latencies associated with the conventional communication protocol stack management of TCP/IP. LLM provides a transport protocol that offers high reliability at the fragment level while keeping the acknowledgment overhead low, given the high reliability levels of the LAN. The system utilizes some of the architectural facilities provided by the Linux kernel that are specially designed for optimization in the respective areas. Reliability against fragment losses is ensured by using a low overhead negative acknowledgment scheme. The implementation is in the form of loadable modules extending the Linux OS. A typical implementation on a cluster of two nodes, each a uniprocessor 400 MHz Intel Pentium on a 10/100 Mbps LAN, achieved an average round trip latency of 0.169 ms, as compared to the 0.531 ms obtained by the ICMP (ping) protocol. A relative comparison of LLM with other systems is also provided.

R. K. Shyamasundar, Basant Rajan, Manish Prasad, Amit Jain
Low-Power High-Performance Adaptive Computing Architectures for Multimedia Processing

The demand for higher computing power, and thus more on-chip computing resources, is ever increasing. The size of on-chip cache memory has also been consistently increasing. To efficiently utilize silicon real estate on the chip, a part of the L1 data cache is designed as a Reconfigurable Functional Cache (RFC) that can be configured to perform a selected core function of a media application whenever higher computing capability is required. The idea of the Adaptive Balanced Computing (ABC) architecture was developed, where the RFC module is used as a coprocessor controlled by the main processor. Initial results have shown that the ABC architecture provides speedups ranging from 1.04x to 5.0x for various media applications. In this paper, we address the impact of the RFC on cache access time and energy dissipation. We show that the reduced number of cache accesses and lower utilization of other on-chip resources result in energy savings of up to 60% for MPEG decoding, and in the range of 10% to 20% for various other multimedia applications.

Rama Sangireddy, Huesung Kim, Arun K. Somani

Keynote Address

Field Programmable Systems

The emergence of Platform FPGAs™ marks a significant change in the capabilities of FPGAs. Unlike their predecessors, Platform FPGAs are characterized by heterogeneous architectures. In addition to programmable logic, routing, and inputs/outputs, they feature soft and hard microprocessors, bus and memory hierarchies, digital signal processing cores, and gigabit communications channels. These devices are targeted at high performance computing applications in embedded systems. The complexity and flexibility of Platform FPGAs is unparalleled. We explore some of the potential implications for current and future design of programmable digital systems.

Patrick Lysaght

Systems Software I

CORBA-as-Needed: A Technique to Construct High Performance CORBA Applications

This paper proposes a new optimization technique called CORBA-as-needed to improve the performance of distributed CORBA applications. This technique is based on the observation that in many cases the client and the server of a distributed application run on compatible computing platforms, and do not need the interoperability functionality of CORBA. CORBA-as-needed dynamically determines if the interoperability functionality is needed for a specific application invocation, and bypasses this functionality if it is not needed. Performance measurements from a prototype implementation in omniORB show that CORBA-as-needed achieves a very significant performance improvement.

Hui Dai, Shivakant Mishra, Matti A. Hiltunen
Automatic Search for Performance Problems in Parallel and Distributed Programs by Using Multi-experiment Analysis

We introduce Aksum, a novel system for performance analysis that helps programmers to locate and to understand performance problems in message passing, shared memory and mixed parallel programs. The user must provide the set of problem and machine sizes for which performance analysis should be conducted. The search for performance problems (properties) is user-controllable by restricting the performance analysis to specific code regions, by creating new or customizing existing property specifications and property hierarchies, by indicating the maximum search time and maximum time a single experiment may take, by providing thresholds that define whether or not a property is critical, and by indicating conditions under which the search for properties stops. Aksum automatically selects and instruments code regions for collecting raw performance data based on which performance properties are computed. Heuristics are incorporated to prune the search for performance properties. We have implemented Aksum as a portable Java-based distributed system which displays all properties detected during the search process together with the code regions that cause them. A filtering mechanism allows the examination of properties at various levels of detail. We present an experiment with a financial modeling application to demonstrate the usefulness and effectiveness of our approach.

Thomas Fahringer, Clovis Seragiotto Jr
An Adaptive Value-Based Scheduler and Its RT-Linux Implementation

In dynamic real-time systems, value-based scheduling aims to achieve graceful degradation during overloads, in addition to maintaining a high schedulability during normal and underloads. The objective of this paper is twofold: (1) to propose an adaptive value-based scheduler for multiprocessor real-time systems aimed at maintaining a high system value with less deadline misses, and (2) to present the implementation of the proposed scheduler in a Linux based real-time operating system, RT-Linux, which in its current form does not employ a notion of task value. We evaluate the performance of the proposed scheduler in terms of two performance metrics, namely, “value ratio” and “success ratio” through both simulation and implementation.

S. Swaminathan, G. Manimaran
Effective Selection of Partition Sizes for Moldable Scheduling of Parallel Jobs

Although the current practice in parallel job scheduling requires jobs to specify a particular number of requested processors, most parallel jobs are moldable, i.e. the required number of processors is flexible. This paper addresses the issue of effective selection of processor partition size for moldable jobs. The proposed scheduling strategy is shown to provide significant benefits over a rigid scheduling model and is also considerably better than a previously proposed approach to moldable job scheduling.

Srividya Srinivasan, Vijay Subramani, Rajkumar Kettimuthu, Praveen Holenarsipur, P. Sadayappan
Runtime Support for Multigrain and Multiparadigm Parallelism

This paper presents a general methodology for implementing on clusters the runtime support for a two-level dependence-driven thread model initially targeted to shared-memory multiprocessors. The general idea is to exploit existing programming solutions for these architectures, like Software DSM (SWDSM) and the Message Passing Interface (MPI). The management of the internal runtime system structures and of the dependence-driven multilevel parallelism is performed with explicit messages, exploiting however the shared-memory hardware of the available SMP nodes whenever possible. The underlying programming models and hybrid programming solutions are not excluded, using threads for the intra-node parallelism. The use of shared virtual memory for thread stacks and a translator for allocating Fortran77 common blocks in shared memory enable the execution of unmodified OpenMP codes on clusters of SMPs. Initial performance results demonstrate efficient support for fork-join and multilevel parallelism on top of SWDSM and MPI, and confirm the benefits of explicit, though transparent, message passing.

Panagiotis E. Hadjidoukas, Eleftherios D. Polychronopoulos, Theodore S. Papatheodorou
A Fully Compliant OpenMP Implementation on Software Distributed Shared Memory

OpenMP is a relatively new industry standard for programming parallel computers with a shared memory programming model. Given that clusters of workstations are a cost-effective solution for building parallel platforms, it would of course be highly interesting if the OpenMP model could be extended to these systems as well as to the standard shared memory architectures for which it was originally intended. We present in this paper a fully compliant implementation of the OpenMP specification 1.0 for C targeting networks of workstations. We have used an experimental software distributed shared memory system called Coherent Virtual Machine to implement a run-time library which is the target of a source-to-source OpenMP translator also developed in this project. The system has been evaluated using an OpenMP micro-benchmark suite to assess the effect of some memory coherence protocol improvements. We have also used OpenMP versions of three Splash-2 applications, obtaining reasonable speedups on an IBM SP2 machine. This is also the first study to investigate the subtle mechanisms of consistency in OpenMP on software distributed shared memory systems.

Sven Karlsson, Sung-Woo Lee, Mats Brorsson

Networks

A Fast Connection-Time Redirection Mechanism for Internet Application Scalability

Applications that are distributed, fault tolerant, or perform dynamic load balancing rely on redirection techniques, such as network address translation (NAT), DNS request routing, or middleware to handle Internet scale loads. In this paper, we describe a new connection redirection mechanism that allows applications to change end-points of communication channels. The mechanism supports redirections across LANs and WANs and is application-independent. Further, it does not introduce any central bottlenecks. We have implemented the redirection mechanism using a novel end-point control session layer. The performance results show that the overhead of the mechanism is minimal. Further, Internet applications built using this mechanism scale better than those built using HTTP redirection.

Michael Haungs, Raju Pandey, Earl Barr, J. Fritz Barnes
Algorithms for Switch-Scheduling in the Multimedia Router for LANs

The primary objective of the Multimedia Router (MMR) [1] project is to design and implement a single chip router targeted for use in cluster and LAN interconnection networks. The goal can be concisely captured in the phrase ‘QoS routing at link speeds’. This paper studies a set of algorithms for switch scheduling based on a highly concurrent implementation for capturing output port requests. Two switch-scheduling algorithms, called Row-Column Ordering and Diagonal Ordering, are proposed and implemented in a switch-scheduling framework built around a matrix data structure, which enables concurrent and parallel operations at high speed. Their performance has been evaluated with Constant Bit Rate (CBR), Variable Bit Rate (VBR), and mixed CBR/VBR traffic. At high offered loads, both ordering functions have been shown to deliver superior Quality of Service (QoS) to connections at a high scheduling rate and high utilization.

Indrani Paul, Sudhakar Yalamanchili, Jose Duato
An Efficient Resource Sharing Scheme for Dependable Real-Time Communication in Multihop Networks

Many real-time applications require communication services with guaranteed timeliness and fault-tolerance capabilities. Hence we have the concept of a dependable connection (D-connection), a fault-tolerant real-time connection. A popular method for D-connection establishment is the primary-backup channel approach, wherein a D-connection is set up by establishing a primary and a backup path. We note, however, that such a 100% dependable network is very inefficient. We introduce an efficient scheme called primary-backup multiplexing, where D-connections may share their backup paths with other connections. We also describe a method by which a D-connection can be given an intermediate fault-tolerance level between 0% dependability (no backup channel) and 100% dependability (dedicated backup). We present an efficient, connection-oriented, fault-tolerant D-connection establishment and resource-sharing scheme for real-time communications. Our simulation studies show that, at light load, our scheme yields a considerable improvement in network resource utilization, with a relatively low drop in fault-tolerance capabilities.

Ranjith G, C. Siva Ram Murthy
Improving Web Server Performance by Network Aware Data Buffering and Caching

In this paper we propose a new method of data handling for web servers, called Network Aware Buffering and Caching (NABC for short). NABC reduces data copies in a web server's data sending path by doing three things: (1) laying out the data in main memory so that protocol processing can be done without data copies, (2) keeping a unified cache of data in the kernel and ensuring safe access to it by various processes and the kernel, and (3) passing only the necessary metadata between processes so that the bulk data handling time spent during IPC is reduced. We realize NABC by implementing a set of system calls and a user library. The end product of the implementation is a set of APIs specifically designed for use by web servers. We port an in-house web server called SWEET to the NABC APIs and evaluate performance using a range of workloads, both simulated and real. The results show an impressive gain of 12% to 21% in throughput for static file serving, and a 1.6x to 4x gain in throughput for lightweight dynamic content serving, for a server using the NABC APIs over one using the UNIX APIs.

Sourav Sen, Y. Narahari
WRAPS Scheduling and Its Efficient Implementation on Network Processors

Network devices in high-speed networks need to support a large number of concurrent streams with different quality of service (QoS) requirements. This paper introduces a new packet scheduling algorithm and its efficient implementation on a novel programmable network processor, Intel's IXP1200. The WRAPS algorithm is based on observing packet rates within a dynamic window. The window can move continuously or discretely as the system transmits packets. Cumulative packet rates in the window are used to predict the next incoming packet of the same flow and to reserve resources for the transmission of later packets. The implementation on the network processor considers both accuracy and efficiency. To expedite the calculation and avoid the high cost of maintaining an ordered list, we designed a time-slotted circular queue that achieves O(1) insertion and selection time. Our experiments on the real system show good performance in terms of scalability and flow interference avoidance.

Xiaotong Zhuang, Jian Liu
Performance Comparison of Pipelined Hash Joins on Workstation Clusters

The traditional hash join algorithm uses a single hash table built on one of the relations participating in the join operation. A variation called double hash join was proposed to remedy some of the performance problems with the single join. In this paper, we compare the performance of single- and double-pipelined hash joins in a cluster environment. In this environment, nodes are heterogeneous; furthermore, nodes experience dynamic, non-query local background load that can impact the pipelined query execution performance. Previous studies have shown that double-pipelined hash join performs substantially better than the single-pipelined hash join when dealing with data from remote sources. However, their relative performance has not been studied in cluster environments. Our study indicates that, in the type of cluster environments we consider here, single pipelined hash join performs as well as or better than the double pipelined hash join in most cases. We present experimental results on a Pentium cluster and identify these cases.

Kenji Imasaki, Hong Nguyen, Sivarama P. Dandamudi
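
For readers unfamiliar with the operation being compared, a plain in-memory hash join (the non-pipelined building block of both variants) can be sketched in Python. This is an illustrative sketch; the tuple layout and key positions are assumptions, not the paper's cluster implementation:

```python
def hash_join(build_rel, probe_rel, build_key, probe_key):
    # Build phase: hash the (usually smaller) relation on its join key.
    table = {}
    for row in build_rel:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: stream the other relation and emit concatenated
    # matching tuples. In the pipelined variants this phase overlaps
    # with the arrival of tuples from upstream operators.
    out = []
    for row in probe_rel:
        for match in table.get(row[probe_key], []):
            out.append(match + row)
    return out
```

A double-pipelined join instead builds hash tables on both inputs and probes each arriving tuple against the other side's table, so it can produce results before either input is complete; the paper examines when that extra machinery actually pays off on a cluster.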

Keynote Address

Computational Science and Engineering — Past, Present, and Future

Physics-based computational modeling has become more practical in the past few years due to the availability of high performance computers (HPC), high-speed communications, and more robust visualization and data mining tools. Applications range from understanding fundamental science to solving a diverse set of important practical, real-world problems. The old paradigm of designing, testing, building, and then modeling is being replaced in some cases with a new paradigm of investigating and modeling before designing and building. People are now realizing that computational modeling and simulation is an essential ingredient in reducing acquisition time for large systems. Single-discipline problems are giving way more and more to the solution of multi-disciplinary problems. HPC, coupled with multi-disciplinary applications, is the driving force behind the challenges we face today in research and engineering.

N. Radhakrishnan

Algorithms II

Iterative Algorithms on Heterogeneous Network Computing: Parallel Polynomial Root Extracting

This article describes various algorithms on heterogeneous architectures applied to parallel polynomial root extraction. Using a set of clusters to run large parallel applications is an ongoing challenge. Asynchronous algorithms are well suited to this type of architecture, because no synchronization barriers are needed. Achieving scalability on such an architecture is a hard task. In this paper, we compare local computation with computation over the Internet in order to assess the efficiency of such methods.

Raphaël Couturier, Philippe Canalda, François Spies
Efficient Tree-Based Multicast in Wormhole-Routed Networks

This paper presents two tree-based multicast algorithms for wormhole-routed nD torus and 2D mesh networks, respectively. Both algorithms use multiple spanning trees with an up-down routing strategy to provide deadlock-free routing. In contrast with single spanning tree algorithms that use non-tree links only as shortcuts, the presented algorithms use all links (for nD tori, all but n links) as tree links and therefore better balance network traffic and resource utilization. For the algorithm in 2D mesh networks, shortcuts can still be used to shorten the routing path for both unicast and multicast messages. Simulation results show that both of the proposed algorithms clearly outperform the best single spanning tree approach.

Jianping Song, Zifeng Hou, Yadong Qu
Parallel Algorithms for Identification of Basis Polygons in an Image

Given a set of n straight line segments each described by its two end points, we propose two novel algorithms for detecting all basis polygons formed by them. The algorithms, based on traversals along the sides of the basis polygons, detect the polygons in O(n) time using n2 processors. The first algorithm handles the simple scenes consisting of convex basis polygons only, while the second one deals with the general situation. These algorithms have been simulated and tested for a number of input sets of intersecting line segments.

Arijit Laha, Amitava Sen, Bhabani P. Sinha
Range Image Segmentation on a Cluster

We report on the implementation of a range image segmentation approach on a cluster using the Message Passing Interface (MPI). The approach combines and integrates different strategies to find the best fitting planes for a set of three-dimensional points. There are basically three distinct modules for plane recovery; each module has a distinct method for generating a candidate plane and a distinct objective function for evaluating the candidates and selecting the "best" plane among them. Parallelism can be exploited in two different ways. First, all three modules can be executed concurrently and asynchronously by distinct processes. The scheduling of the modules in the parallel implementation differs significantly from that of the sequential implementation. Thus, different output images can be obtained for the two implementations. However, experiments conducted on several range images show that on average the quality of results is similar in comparison with ground truth images. Second, the computation within each module can be performed in parallel. A module chooses the best plane among a large set of randomly selected candidate planes; this is a highly parallel task that can be efficiently partitioned among a group of processors. The approach proposed in this paper for a multiprocessor environment has been implemented on a cluster of workstations using MPI. Preliminary results are presented.

Mary Ellen Bock, Concettina Guerra
Detection of Orthogonal Interval Relations

The complete set ℜ of orthogonal temporal interactions between pairs of intervals, formulated by Kshemkalyani, allows the detailed specification of the manner in which intervals can be related to one another in a distributed execution. This paper presents a distributed algorithm to detect whether pre-specified interaction types between intervals at different processes hold. Specifically, for each pair of processes i and j, given a relation ri,j from the set of orthogonal relations ℜ, this paper presents a distributed (on-line) algorithm to determine the intervals, if they exist, one from each process, such that each relation ri,j is satisfied for that (i, j) process pair. The algorithm uses O(n·min(np, 4mn)) messages of size O(n) each, where n is the number of processes, m is the maximum number of messages sent by any process, and p is the maximum number of intervals at any process. The average time complexity per process is O(min(np, 4mn)), and the total space complexity across all the processes is min(4pn² + 2np, 10mn²).

Punit Chandra, Ajay D. Kshemkalyani
An Efficient Parallel Algorithm for Computing Bicompatible Elimination Ordering (BCO) of Proper Interval Graphs

In this paper, we first show how a certain ordering of vertices, called a bicompatible elimination ordering (BCO), of a proper interval graph (PIG) can be used to optimally solve the following problems: finding a Hamiltonian cycle in a Hamiltonian PIG, finding the set of articulation points and bridges, and computing the single-source or all-pairs shortest paths. We then propose an NC parallel algorithm (i.e., one running in polylogarithmic time employing a polynomial number of processors) to compute a BCO of a proper interval graph.

B. S. Panda, Sajal K. Das

Mobile Computing and Databases

Router Handoff: An Approach for Preemptive Route Repair in Mobile Ad Hoc Networks

Mobile ad hoc networks (MANETs) are distributed, mobile, wireless, multihop networks that operate without pre-existing communication infrastructure. Several routing protocols, both reactive and proactive, have been proposed to provide the self-starting behavior needed for ad hoc networks. Unreliable wireless links and node mobility may result in a large number of route breakages. The standard approach followed in reactive routing protocols in the case of broken routes is to flag an error and re-initiate route discovery either at the source or at an intermediate node. Repairing these broken links increases routing overhead and the delay in delivering packets. In this paper, we propose an approach called Router Handoff, which repairs routes preemptively, before they break, by using other nodes in the vicinity of a weak link. We have incorporated Router Handoff into the AODV routing protocol and have validated the idea through analysis and simulations. The results of the simulations indicate an increase in throughput under certain conditions. This improvement is a result of smaller overhead and delay.
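A minimal sketch of the preemptive-repair decision, under assumed names and threshold values (illustrative only, not taken from the Router Handoff protocol itself): a node whose downstream link weakens looks for a stronger neighbor to hand the route off to before the link actually breaks:

```python
HANDOFF_THRESHOLD = -85.0  # dBm; an assumed "weak link" cutoff

def choose_handoff(current_link_dbm, neighbor_links_dbm, margin=5.0):
    """Return the neighbor to hand the route off to, or None.

    neighbor_links_dbm maps neighbor id -> signal strength of that
    neighbor's link toward the same downstream node."""
    if current_link_dbm >= HANDOFF_THRESHOLD:
        return None  # link still healthy, no preemptive repair needed
    # Only neighbors offering a meaningfully stronger link are viable.
    viable = {n: s for n, s in neighbor_links_dbm.items()
              if s >= current_link_dbm + margin}
    if not viable:
        return None  # fall back to the standard route-error path
    return max(viable, key=viable.get)
```

In the real protocol the trigger and handoff signaling are integrated with AODV; this sketch only captures the threshold-and-select idea.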

P. Abhilash, Srinath Perur, Sridhar Iyer
A 2-D Random Walk Based Mobility Model for Location Tracking

Performance analysis of different location tracking schemes needs a model that reflects the mobility of the mobile terminals in a realistic way. We propose a two-dimensional random walk based model with inertia for the performance analysis of different location management schemes with specific update and paging mechanisms. We use our mobility model to analyze the cost of the movement-based location update technique with selective paging. We extend the model to allow different values for the inertia of motion and the inertia of rest.
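The two kinds of inertia can be rendered as a toy walk (parameter names and values are our own, not the authors' formulation): one probability governs whether the terminal moves at all (inertia of rest), another whether it keeps its previous direction (inertia of motion):

```python
import random

DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # 4-neighbor lattice moves

def walk(steps, p_move=0.7, p_keep=0.6, seed=None):
    """2-D random walk with inertia.

    p_move: probability of moving at all (inertia of rest is 1 - p_move);
    p_keep: probability of keeping the previous direction (inertia of motion)."""
    rng = random.Random(seed)
    x = y = 0
    prev = rng.choice(DIRS)
    path = [(0, 0)]
    for _ in range(steps):
        if rng.random() < p_move:
            if rng.random() >= p_keep:
                prev = rng.choice(DIRS)  # change direction
            x, y = x + prev[0], y + prev[1]
        path.append((x, y))
    return path
```

Feeding such traces into a movement-based update scheme (update after k cell crossings, then page selectively) is the kind of cost analysis the paper performs.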

Srabani Mukhopadhyaya, Krishnendu Mukhopadhyaya
Data Placement in Intermittently Available Environments

In this paper, we address the problem of data placement in a grid-based multimedia environment where the resource providers, i.e. servers, are intermittently available. The goal is to optimize system performance by admitting the maximum number of users into the system while ensuring user Quality of Service (QoS). We define and formulate various placement strategies that determine the degree of replication necessary for video objects by using a cost-based optimization procedure based on predictions of expected requests under various time-map scenarios and QoS demands. We also devise methods for dereplication of videos based on changes in popularity and server usage patterns. Our performance results indicate the benefits obtained from the judicious use of dynamic placement strategies.

Yun Huang, Nalini Venkatasubramanian
RT-MuPAC: Multi-power Architecture for Voice Cellular Networks

We consider the problem of providing greater throughput in cellular networks. We propose a novel cellular architecture, RT-MuPAC, that supports greater throughput compared to conventional cellular architectures. RT-MuPAC (Real-time Multi-Power Architecture for Cellular Networks) is based on two fundamental features not present in today's cellular networks: the usage of multiple hops and power control (power control is used only in a limited fashion to reduce interference in today's networks). These features, we believe, will become increasingly important in next-generation cellular systems as heterogeneous networks operate in synergy. We show using detailed simulations that RT-MuPAC is indeed a significant improvement over conventional networks. RT-MuPAC can evolve from the existing infrastructure and offers advantages to both the service provider and the users. RT-MuPAC also serves as a proof of concept for the use of multi-hop architectures in cellular networks.

K. Jayanth Kumar, B. S. Manoj, C. Siva Ram Murthy
Asynchronous Transaction Processing for Updates by Client: With Elimination of Wait-for State

In a distributed database system, time-critical transactions need to complete their processing within a time limit. This is especially true in the case of real-time database systems, and also in the case of mobile database systems. This study considers server-level enhancements. By adopting transaction classification, a few changes that enable database updates by time-critical clients can be accommodated within two-phase locking at a low cost. The use of an instant priority-based execution scheme, based on transaction classification, can reduce delays caused by resource conflicts with ordinary transactions. We further investigate a procedure that performs critical functions asynchronously.

Subhash Bhalla
Active File Systems for Data Mining and Multimedia

Data mining and multimedia applications require huge amounts of storage. These applications are also compute-intensive. Active disks make use of the computational power available in the disk to reduce storage traffic. Many of the file system proposals for active disks work at the block level. In this paper we argue for the necessity of filtering at application level. We propose two file systems for active disks: active file system (ACFS) which binds files and filters at the file system level and active network file system (ANFS) which extends ACFS over networks. These file systems preserve the familiar Unix file system semantics to a large extent. We present an implementation of the file systems which makes minimal changes to the existing file system code in Linux.

S. H. Srinivasan, Pranay Singh

Applications

Simulating DNA Computing

Although DNA (deoxyribonucleic acid) can perform 10²² computations per second, it is time-intensive and complex to set up the input and output of data to and from a biological computer and to filter the final result. This paper discusses how to simulate DNA computing on a digital computer to solve the Hamiltonian path problem using Adleman's model. The simulation serves as an educational tool to teach DNA computing without the elaborate bio-experiments. As an aside, it also digitally verifies Adleman's notion of DNA computing to solve the Hamiltonian path problem. Future work will involve a parallel implementation of the algorithm and an investigation of the possibility of constructing simple regular VLSI structures to implement the basics of the model for fixed-size problems.
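A digital simulation in the spirit of Adleman's experiment can be sketched as follows (an illustrative toy, not the paper's implementation): randomly grown paths stand in for annealed DNA strands, and the subsequent checks mimic the wet-lab filtering steps that keep only Hamiltonian start-to-end paths:

```python
import random

def simulate_adleman(adj, start, end, trials=20000, seed=0):
    """Digital analogue of Adleman's experiment for the Hamiltonian
    path problem. adj maps each vertex to its list of successors."""
    rng = random.Random(seed)
    n = len(adj)
    found = set()
    for _ in range(trials):
        # Phase 1: randomly "anneal" a path, like random strand ligation.
        path = [start]
        while len(path) < n and adj[path[-1]]:
            path.append(rng.choice(adj[path[-1]]))
        # Phase 2: filtering -- right length, right endpoints,
        # every vertex visited exactly once.
        if len(path) == n and path[-1] == end and len(set(path)) == n:
            found.add(tuple(path))
    return found
```

As in the biological setting, the massive redundancy of random trials compensates for the fact that most generated "strands" are discarded by the filters.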

Sanjeev Baskiyar
Parallel Syntenic Alignments

Given two genomic DNA sequences, the syntenic alignment problem is to compute an ordered list of subsequences for each sequence such that the corresponding subsequence pairs exhibit a high degree of similarity. Syntenic alignments are useful in comparing genomic DNA from related species and in identifying conserved genes. In this paper, we present a parallel algorithm for computing syntenic alignments that runs in O(mn/p) time and O(m + n/p) memory per processor, where m and n are the respective lengths of the two genomic sequences. Our algorithm is time optimal with respect to the corresponding sequential algorithm and can use O(n/log n) processors, where n is the length of the larger sequence. Using an implementation of this parallel algorithm, we report the alignment of human chromosome 12p13 and its syntenic region in mouse chromosome 6 (both over 220,000 base pairs in length) in under 24 minutes on a 64-processor IBM xSeries cluster.
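For context, the sequential kernel underlying such aligners is a dynamic-programming recurrence; a minimal linear-space scoring sketch (standard Needleman-Wunsch global alignment, not the authors' syntenic algorithm) looks like this:

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment score kept in O(len(t)) memory per row --
    the kind of recurrence a parallel aligner distributes across processors."""
    # prev[j] holds the score of aligning s[:i-1] against t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * gap]  # aligning s[:i] against the empty prefix of t
        for j, b in enumerate(t, 1):
            cur.append(max(prev[j] + gap,            # gap in t
                           cur[j - 1] + gap,          # gap in s
                           prev[j - 1] + (match if a == b else mismatch)))
        prev = cur
    return prev[-1]
```

The O(mn/p)-time, O(m + n/p)-memory result in the paper comes from partitioning exactly this kind of table computation (extended to handle syntenic blocks) among p processors.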

Natsuhiko Futamura, Srinivas Aluru, Xiaoqiu Huang
XS-systems: eXtended S-Systems and Algebraic Differential Automata for Modeling Cellular Behavior

Several biological and biochemical mechanisms can be modeled with relatively simple sets of differential algebraic equations (DAE). The numerical solution to these differential equations provides the main investigative tool for biologists and biochemists. However, the sets of numerical traces of very complex systems become unwieldy to wade through when several variables are involved. To address this problem, we propose a novel way to query large sets of numerical traces by combining in a new way well-known tools from numerical analysis, temporal logic and verification, and visualization. In this paper we describe XS-systems: computational models whose aim is to provide the users of S-systems with the extra tool of an automaton modeling the temporal evolution of complex biochemical reactions. The automaton construction is described starting from both numerical and analytic solutions of the differential equations involved, and parameter determination and tuning are also considered. A temporal logic language for expressing and verifying properties of XS-systems is introduced and a prototype implementation is presented.

Marco Antoniotti, Alberto Policriti, Nadia Ugel, Bud Mishra
A High Performance Scheme for EEG Compression Using a Multichannel Model

The amount of data contained in electroencephalogram (EEG) recordings is quite massive and this places constraints on bandwidth and storage. The requirement of online transmission of data needs a scheme that allows higher performance with lower computation. Single channel algorithms, when applied on multichannel EEG data fail to meet this requirement. While there have been many methods proposed for multichannel ECG compression, not much work appears to have been done in the area of multichannel EEG compression. In this paper, we present an EEG compression algorithm based on a multichannel model, which gives higher performance compared to other algorithms. Simulations have been performed on both normal and pathological EEG data and it is observed that a high compression ratio with very large SNR is obtained in both cases. The reconstructed signals are found to match the original signals very closely, thus confirming that diagnostic information is being preserved during transmission.

D. Gopikrishna, Anamitra Makur
Scalability and Performance of Multi-threaded Algorithms for International Fare Construction on High-Performance Machines

We describe the design, implementation and performance of a project for constructing international fares at United Airlines. An efficient fare construction engine allows an airline to simplify and automate the process of pricing fares in lucrative international markets. The impact of a fare engine on the revenues of a large airline has been estimated at between $20M and $60M per year. The goal is to design an efficient software system for handling 10 Gb of data pertaining to base fares from all airlines, and to generate over 250 million memory-resident records (fares). The software architecture uses a 64-bit, object-oriented, and multi-threaded approach, and the hardware platforms used for benchmarking include a 24-CPU IBM S-80 and a 32-CPU Hewlett-Packard Superdome. Two competing software designs using (i) dynamic memory and (ii) static memory are compared. A critical part of the design is a scheduler that uses a load-balancing heuristic which is provably within a constant factor of optimality. Performance results are presented and discussed.

Chandra N. Sekharan, Krishnan Saranathan, Raj Sivakumar, Zia Taherbhai

Systems Software II

A Resource Brokering Infrastructure for Computational Grids

With the advances in networking infrastructure in general, and the Internet in particular, we can build grid environments that allow users to utilize a diverse set of distributed and heterogeneous resources. Since the focus of such environments is the efficient usage of the underlying resources, a critical component is the brokering module that mediates the discovery, access and usage of these resources. Given the consumer's constraints, the provider's rules, distributed heterogeneous resources and the large number of scheduling choices, the brokering module needs to decide where to place a user's jobs and when to start their execution in a way that yields the best performance to the user and the best utilization to the resource provider. In this paper we present the design and implementation of a flexible, extensible and generic policy-based resource brokering infrastructure for computational grids following a layered façade design pattern and using XML as the underlying specification language. We also describe a testbed environment and our efforts at integrating it with several grid systems.

Ahmed Al-Theneyan, Piyush Mehrotra, Mohammad Zubair
On Improving Thread Migration: Safety and Performance

Application-level migration schemes have received increased attention recently because of their great potential for heterogeneous migration, but they face an obstacle: a few migration-unsafe features in certain programming languages prevent some programs from migrating. Most application-level migration schemes declare or assume that they deal only with "safe" programs, which confuses users, since no explanation is given. This paper proposes an application-level thread migration package, MigThread, that identifies "unsafe" features in C/C++ and migrates such programs with correct results. Therefore, users no longer need to worry about whether their programs qualify for migration. Besides existing characteristics such as scalability and flexibility, MigThread improves transparency and reliability. Complexity analysis and performance evaluation illustrate the migration efficiency.

Hai Jiang, Vipin Chaudhary
Improved Preprocessing Methods for Modulo Scheduling Algorithms

Instruction scheduling with an automaton-based resource conflict model is well established for normal scheduling. Such models have been generalized to software pipelining in the modulo-scheduling framework. One weakness of existing methods is that a distinct automaton must be constructed for each combination of a reservation table and initiation interval. In this work, we present a different approach to modeling conflicts. We construct one automaton for each reservation table which acts as a compact encoding of all the conflict automata for this table, and from which they can be recovered for use in modulo scheduling. The basic premise of the construction is to move away from the Proebsting-Fraser model of a conflict automaton to the Müller model of an automaton modeling issue sequences. The latter turns out to be useful and efficient in this situation. Having constructed this automaton, we show how to improve the estimate of the resource-constrained initiation interval. Such a bound is always better than the average-use estimate. We show that our bound is safe: it is always lower than the true initiation interval. This use of the automaton is orthogonal to its use in modulo scheduling. Once we generate the required information during pre-processing, we can compute the lower bound for a program without any further reference to the automaton.

D. V. Ravindra, Y. N. Srikant
Dynamic Path Profile Aided Recompilation in a JAVA Just-In-Time Compiler

Just-in-Time (JIT) compilers for Java can be augmented by making use of runtime profile information to produce better quality code and hence achieve higher performance. In a JIT compilation environment, the profile information obtained can be readily exploited in the same run to aid recompilation and optimization of frequently executed (hot) methods. This paper discusses a low overhead path profiling scheme for dynamically profiling JIT produced native code. The profile information is used in recompilation during a subsequent invocation of the hot method. During recompilation tree regions along the hot paths are enlarged and instruction scheduling at the superblock level is performed. We have used the open source LaTTe JIT compiler framework for our implementation. Our results on a SPARC platform for SPEC JVM98 benchmarks indicate that (i) there is a significant reduction in the number of tree regions along the hot paths, and (ii) profile aided recompilation in LaTTe achieves performance comparable to that of adaptive LaTTe in spite of retranslation and profiling overheads.

R. Vinodh Kumar, B. Lakshmi Narayanan, R. Govindarajan
Exploiting Data Value Prediction in Compiler Based Thread Formation

Speculative multithreading (SpMT) is an effective execution model for parallelizing non-numeric programs, which tend to use irregular and pointer-intensive data structures, and have complex flows of control. An SpMT compiler performs program partitioning by carefully considering the data dependencies present in the program. However, at run-time, the data dependency picture changes dramatically if the SpMT hardware performs data value prediction. Many of the data dependencies, which guided the compiler’s partitioning algorithm in taking decisions, may lose their relevance due to successful data value prediction. This paper presents a compiler framework that uses profile-based value predictability information when making program partitioning decisions. We have developed a Value Predictability Profiler (VPP) that generates the value prediction statistics for the source variables in a program. Our SpMT compiler utilizes this information by ignoring the data dependencies due to variables with high prediction accuracies. The compiler can thus perform more efficient thread formation. This SpMT compiler framework is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of SPEC programs shows that the speedup with 6 processing elements increases up to 21% when utilizing value predictability information during program partitioning.

Anasua Bhowmik, Manoj Franklin

Scientific Computation

High Performance Computing of Fluid-Structure Interactions in Hydrodynamics Applications Using Unstructured Meshes with More than One Billion Elements

A parallel finite element fluid-structure interaction solver is developed for numerical simulation of water waves interacting with floating objects. In our approach, the governing equations are the Navier-Stokes equations written for two incompressible fluids. An interface function with two distinct values serves as a marker identifying the location of the interface. The numerical method is based on writing stabilized finite element formulations in an arbitrary Lagrangian-Eulerian frame. This allows us to handle the motion of the floating objects by moving the computational nodes. In the mesh-moving schemes, we assume that the computational domain is made of elastic materials. The linear elasticity equations are solved to obtain the displacements. In order to update the position of the floating object, the nonlinear rigid body dynamics equations are coupled with the governing equations of the fluids and are solved simultaneously. The mooring forces are modeled using nonlinear cable and linear spring models.

S. Aliabadi, A. Johnson, J. Abedi, B. Zellars
An Efficient and Exponentially Accurate Parallel h-p Spectral Element Method for Elliptic Problems on Polygonal Domains - The Dirichlet Case

For smooth problems, spectral element methods (SEM) exhibit exponential convergence and have been used very successfully in practical problems. However, in many engineering and scientific applications we frequently encounter the numerical solution of elliptic boundary value problems in non-smooth domains, which gives rise to singularities in the solution. In such cases the accuracy of the solution obtained by SEM deteriorates, and they offer no advantages over low-order methods. A new parallel h-p spectral element method is presented which resolves this form of singularity by employing a geometric mesh in the neighborhood of the corners and gives exponential convergence with asymptotically faster results than conventional methods. The normal equations are solved by the Preconditioned Conjugate Gradient (PCG) method. Except for the assembly of the resulting solution vector, all computations are done at the element level, and we do not need to compute and store mass- and stiffness-like matrices. The technique to compute the preconditioner is quite simple and very easy to implement. The method is implemented on a parallel computer with distributed memory, and the library used for message passing is MPI. Load balancing issues are discussed, and the communication involved among the processors is shown to be quite small.

S. K. Tomar, P. Dutt, B. V. Rathish Kumar
Fast Stable Solver for Sequentially Semi-separable Linear Systems of Equations

In this paper we will present a fast backward stable algorithm for the solution of certain structured matrices which can be either sparse or dense. It essentially combines the fast solution techniques for banded plus semi-separable linear systems of equations of Chandrasekaran and Gu [4] with similar techniques of Dewilde and van der Veen for time-varying systems [12].

S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, A. J. van der Veen
Dynamic Network Information Collection for Distributed Scientific Application Adaptation

An application for collecting local dynamic network information during the course of distributed scientific application execution is presented. The technique used is lightweight and provides adaptive capabilities to the application.

Devdatta Kulkarni, Masha Sosonkina
Adaptive Runtime Management of SAMR Applications

This paper presents the design, prototype implementation, and evaluation of a runtime management framework for structured adaptive mesh refinement applications. The framework is capable of reactively and proactively managing and optimizing application execution using current system and application state, predictive models for system behavior and application performance, and an agent based control network. The overall goal of this research is to enable large-scale dynamically adaptive scientific and engineering simulations on distributed, heterogeneous and dynamic execution environments such as the computational “grid”.

Sumir Chandra, Shweta Sinha, Manish Parashar, Yeliang Zhang, Jingmei Yang, Salim Hariri
Mobile Agents — The Right Vehicle for Distributed Sequential Computing

Distributed sequential computing (DSC) is computing with distributed data using a single locus of computation. In this paper we argue that computation mobility (the ability for the locus of computation to migrate across distributed memories and continue the computation as it meets the required data), facilitated by mobile agents with strong mobility, is essential for scalable distributed sequential programs that preserve the integrity of the original algorithm.

Lei Pan, Lubomir F. Bic, Michael B. Dillencourt, Ming Kin Lai

Architecture II

Using Dataflow Based Context for Accurate Branch Prediction

Contexts formed only from the outcomes of the last several instances of a static branch instruction, or from those of the last several dynamic branches, do not always encapsulate all of the information required for correct prediction of the branch. Complex interactions between data flow and control flow change the context in ways that result in predictability loss for a significant number of dynamic branches. To improve the prediction accuracy, we use contexts derived from the predictable portions of the data flow graph. That is, the predictability of hard-to-predict branches can be improved by taking advantage of the predictability of the easy-to-predict instructions that precede them in the data flow graph. We propose and investigate a run-time scheme for producing such an improved context from the predicted values of preceding instructions. We also propose a novel branch predictor that uses dynamic dataflow-inherited speculative context (DDISC) for prediction. Simulation results verify that the use of dataflow-based contexts yields significant reduction in branch mispredictions, ranging up to 40%. This translates to an overall branch prediction accuracy of 89% to 99.5%.

Renju Thomas, Manoj Franklin
Rehashable BTB: An Adaptive Branch Target Buffer to Improve the Target Predictability of Java Code

Java programs are increasing in popularity and prevalence on numerous platforms, including high-performance general-purpose processors. The dynamic characteristics of the Java runtime system present unique performance challenges for several aspects of microarchitecture design. In this work, we focus on the effects of indirect branches on branch target address prediction performance. Runtime bytecode translation, just-in-time compilation, frequent calls to the native interface libraries, and dependence on virtual methods increase the frequency of polymorphic indirect branches. Therefore, accurate target address prediction for indirect branches is very important for Java code. This paper characterizes the indirect branch behavior in Java processing and proposes an adaptive branch target buffer (BTB) design to enhance the predictability of the targets. Our characterization shows that a traditional BTB will frequently mispredict polymorphic indirect branches, significantly deteriorating predictor accuracy in Java processing. Therefore, we propose a Rehashable branch target buffer (R-BTB), which dynamically identifies polymorphic indirect branches and adapts branch target storage to accommodate multiple targets for a branch. The R-BTB improves the target predictability of indirect branches without sacrificing overall target prediction accuracy. Simulations show that the R-BTB eliminates 61% of the indirect branch mispredictions suffered with a traditional BTB for Java programs running in interpreter mode (46% in JIT mode), which leads to a 57% decrease in overall target address misprediction rate (29% in JIT mode). With an equivalent number of entries, the R-BTB also outperforms the previously proposed target cache scheme for a majority of Java programs by adapting to a greater variety of indirect branch behaviors.

Tao Li, Ravi Bhargava, Lizy Kurian John
Return-Address Prediction in Speculative Multithreaded Environments

There is a growing interest in the use of speculative multithreading to speed up the execution of sequential programs. In this execution model, threads are extracted from sequential code and are speculatively executed in parallel. This makes it possible to use parallel processing to speed up ordinary applications, which are typically written as sequential programs. This paper has two objectives. The first is to highlight the problems involved in performing accurate return address prediction in speculative multithreaded processors, where many of the subroutine call instructions and return instructions are fetched out of program order. A straightforward application of the return address stack, a popular scheme for predicting return addresses in single-threaded environments, does not work well in such a situation. With out-of-order fetching of call instructions as well as return instructions, pushing and popping of return addresses onto and from the return address stack happen in a somewhat random fashion. This phenomenon corrupts the return address stack, resulting in poor prediction accuracy for return addresses. The second objective of the paper is to propose a fixup technique for using the return address stack in speculative multithreaded processors. Our technique involves the use of a distributed return address stack, with facilities for repair when out-of-order pushes and pops happen. Detailed simulation results of the proposed schemes show significant improvements in the predictability of return addresses in a speculative multithreaded environment.
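The repair idea can be illustrated with a toy checkpointed stack (a deliberate simplification of the paper's distributed design; the class and method names are ours): a snapshot taken before speculative execution lets the predictor undo pushes and pops that happened out of program order:

```python
class RAS:
    """Toy return-address stack with checkpoint/restore, standing in for
    the repairable distributed stack proposed in the paper."""

    def __init__(self):
        self.stack = []

    def push(self, ret_addr):
        self.stack.append(ret_addr)   # on predicted call

    def pop(self):
        return self.stack.pop() if self.stack else None  # on predicted return

    def checkpoint(self):
        return list(self.stack)       # snapshot before spawning a speculative thread

    def restore(self, snap):
        self.stack = list(snap)       # repair after misordered pushes/pops
```

Without the restore step, the misordered speculative push would make the next return prediction wrong, which is exactly the corruption the paper sets out to fix.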

Mohamed Zahran, Manoj Franklin
HLSpower: Hybrid Statistical Modeling of the Superscalar Power-Performance Design Space

As power densities increase and mobile applications become pervasive, power-aware microprocessor design has become a critical issue. We present HLSpower, a unique tool for power-aware design space exploration of superscalar processors. HLSpower is based upon HLS [OCF00], a tool which used a novel blend of statistical modeling and symbolic execution to accelerate performance modeling by 100-1000X over conventional cycle-based simulators. In this paper, we extend the HLS methodology to model the energy efficiency of superscalars. We validate our results against the Wattch [BTM00] cycle-based power simulator. While minor second-order power effects continue to require detailed cycle-by-cycle simulation, HLSpower is useful for large-scale exploration of the significant power-performance design space. For example, we can show that the instruction cache hit rate and pipeline depth interact with power efficiency in a non-trivial way as they are varied over significant ranges. In particular, we note that, while the IPC of a superscalar increases monotonically with both optimizations, the energy efficiency does not. We highlight the design capabilities by focusing on these non-monotonic contour graphs to demonstrate how HLSpower can help build intuition in power-aware design.

Ravishankar Rao, Mark H. Oskin, Frederic T. Chong
Efficient Decomposition Techniques for FPGAs

In this paper, we propose AND/XOR-based decomposition methods to implement parity prediction circuits efficiently in field programmable gate arrays (FPGAs). Due to the fixed size of the programmable blocks in an FPGA, decomposing a circuit into sub-circuits with an appropriate number of inputs can achieve excellent implementation efficiency. Typical EDA tools deal mainly with AND/OR expressions and are therefore quite inefficient for parity prediction functions, since a parity prediction function is inherently AND/XOR based in nature. The Davio expansion theorem is applied here to the technology mapping method for FPGAs. We design three different approaches: (1) Direct Approach, (2) AND/XOR Direct, and (3) Proposed Davio Approach, and conduct experiments using MCNC benchmark circuits to demonstrate the effectiveness of the Proposed Davio Approach. We formulate the parity prediction circuits for the benchmark circuits. The Proposed Davio Approach is superior to the typical methods for parity prediction circuits in terms of the number of CLBs. The proposed Davio expansion approach, which is based on AND/XOR expressions, is superior to the other common techniques in achieving realization efficiency. The proposed Davio approach needs only 21 CLBs for eight benchmark circuits; it takes on average only 2.75 CLBs, or 20% of the original area.
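The positive Davio expansion at the heart of such AND/XOR decompositions, f = f₀ ⊕ x·(f₀ ⊕ f₁), can be sketched on truth tables (an illustrative rendering, not the paper's mapping algorithm):

```python
def davio_expand(tt):
    """Positive Davio expansion w.r.t. the top variable of a truth table
    (a list of 0/1 of length 2^n): f = f0 XOR x * (f0 XOR f1)."""
    half = len(tt) // 2
    f0 = tt[:half]                         # cofactor with x = 0
    f1 = tt[half:]                         # cofactor with x = 1
    f2 = [a ^ b for a, b in zip(f0, f1)]   # Boolean difference f0 XOR f1
    return f0, f2

def davio_eval(f0, f2, x, idx):
    # Reconstruct f(x, rest) from the two sub-functions of the expansion.
    return f0[idx] ^ (x & f2[idx])
```

Because the expansion is built from XOR and AND only, recursively applying it yields exactly the kind of sub-circuit structure that maps well onto fixed-size CLBs for parity-style functions.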

Seok-Bum Ko, Jien-Chung Lo

Keynote Address

Protocols for Bandwidth Management in Third Generation Optical Networks

During the height of the telecommunications boom, all-optical networking was seen as the next horizon that would soon revolutionize the data communications world. Networking technology and service companies invested heavily in startups that promised to bring all-optical, on-demand, any-bandwidth-granularity solutions to the market in eighteen months or less. By early 2001, there were several hundred companies focused on optical equipment development, including optical switches, optical cross-connects and the like, all coming with accompanying management software for provisioning and control that would bring the final, all-optical solution to all. A number of other service companies were focused on the delivery of next-generation optically driven services, as exemplified by the emerging IXCs and CLECs in North America. The declared effect was to be revolutionized networking, resulting in a complete shift to data-based networking infrastructures. The rationale for all of the optical enthusiasm was the promise of significantly more data-carrying bandwidth, new optically based revenue-generating service opportunities for providers, and simplified network infrastructures that greatly reduced service provider Capex and Opex costs. New optical equipment was to reduce or eliminate the costly electronic regeneration of optical signals, and new optical switching and mux equipment promised to ease the burden of path provisioning, path monitoring, and path restoration of optical data-carrying paths. While certain optical technologies have indeed gained customer traction by proving themselves both economically and technologically, including 2.5 Gb/s and 10 Gb/s DWDM transport systems together with certain associated EDFAs, optical switching, including fiber switching, wavelength switching, wavelength bundle switching, etc., has not.
In some cases, technological barriers were more difficult to overcome than developers had originally thought, in others more time was needed to refine these technologies, and more time was needed to refine the distinguishable different needs in equipment cost, switching, speed, and amplifier needs, etc., of the access, metro core, regional core, and core (including long haul) networking infrastructures.Definitely, more time was needed to develop a uniform (standards based) approach to the management of optical network infrastructures, for the wealth of product developments had also allowed a number of product specific, and incompatible network management systems to be created.During 2002, we have come to believe that, with some exceptions, the optical revolution, as it was envisioned during the years 2000 and 2001, will not occur. The revolution called for the rapid replacement of electronically oriented networks by all optical networks, with the retention of electronics only at points of user ingress and egress. Instead some form of optical evolution will occur. This evolution to optical networks will occur at a slower, more rational, pace than was expected for the revolution.We argue that carrier networks are currently changing through an evolutionary, and not a revolutionary, migration process. As a result, end-to-end networks with SONET, optical, and packet sub-networks will exist for some time to come. However, the gradual penetration of optical WDM systems, and the rapid growth of the Internet as a medium for global multimedia communications has resulted in a need for scalable network infrastructures which are capable of providing large amounts of bandwidth, on demand, with quality guarantees, at varying granularities, to support a diverse range of emerging applications. 
This means the DWDM network bandwidth in the emerging wavelength-routed systems, needs to be, managed, and provisioned effectively, and intelligently, and support the emerging services in heterogeneous networking environments, with no prevalent standard, hence presenting one of the bigger challenges for optical networking in the years to come.

Imrich Chlamtac

Embedded Systems

Memory Architectures for Embedded Systems-On-Chip

Embedded systems are typically designed for one or a few target applications, allowing for customization of the system architecture for the desired system goals such as performance, power and cost. The memory subsystem will continue to present significant bottlenecks in the design of future embedded systems-on-chip. Using advance knowledge of the application’s instruction and data behavior, it is possible to customize the memory architecture to meet varying system goals. On one hand, different applications exhibit varying memory behavior. On the other hand, a large variety of memory modules allow design implementations with a wide range of cost, performance and power profiles. The embedded system architect can thus explore and select custom memory architectures to fit the constraints of target applications and design goals. In this paper we present an overview of recent research in the area of memory architecture customization for embedded systems.

Preeti Ranjan Panda, Nikil D. Dutt
Structured Component Composition Frameworks for Embedded System Design

The increasing integration of system-chips is leading to a widening gap between the size and complexity of chip-level designs and available design capabilities. A number of advances in high-level modeling and validation have been proposed over the past decade in an attempt to bridge the gap in design productivity. Prominent among these are advances in abstraction and reuse, and structured design methods such as Component-Based Design and Platform-Based Design. In this paper, we present an overview of the recent advances in reuse, abstraction, and component frameworks. We describe a compositional approach to high-level modeling as implemented in the BALBOA project.

Sandeep K. Shukla, Frederic Doucet, Rajesh K. Gupta
Low Power Distributed Embedded Systems: Dynamic Voltage Scaling and Synthesis

In this paper, we survey multi-objective system synthesis algorithms for low-power real-time systems-on-a-chip (SOCs), distributed and wireless client-server embedded systems, distributed embedded systems with reconfigurable field-programmable gate arrays (FPGAs), as well as distributed systems of SOCs. Many of these synthesis algorithms target simultaneous optimization of different cost objectives, including system price, area, and power consumption. Dynamic voltage scaling has proved to be a powerful technique for reducing power consumption. We also survey several dynamic voltage scaling techniques for distributed embedded systems containing voltage-scalable processors. The dynamic voltage scaling algorithms can be embedded in the inner loop of a system synthesis framework and provide feedback for system-level design space exploration. Besides voltage-scalable processors, dynamically voltage-scalable links have also been proposed for implementing high-performance and low-power interconnection networks for distributed systems. We survey relevant techniques in this area as well.
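The energy leverage behind dynamic voltage scaling can be illustrated with the standard first-order CMOS model, where dynamic energy grows quadratically with supply voltage while clock frequency scales only roughly linearly with it. The sketch below is illustrative only and not drawn from the surveyed algorithms; the capacitance and voltage figures are invented for the example.

```python
# Illustrative sketch (not from the paper): dynamic voltage scaling trades
# speed for energy. First-order dynamic CMOS energy per task is
# E = C * V^2 * N, where C is switched capacitance, V the supply voltage,
# and N the cycle count; clock frequency scales roughly linearly with V,
# so halving the voltage halves the speed but quarters the energy.

def task_energy(c_eff: float, voltage: float, cycles: int) -> float:
    """Approximate dynamic energy (J) for a task: E = C * V^2 * N."""
    return c_eff * voltage ** 2 * cycles

def task_time(cycles: int, freq_hz: float) -> float:
    """Execution time (s) of a task at a given clock frequency."""
    return cycles / freq_hz

# A task of 1e9 cycles on a hypothetical core with 1 nF effective capacitance.
full = task_energy(1e-9, 1.8, 10**9)   # full voltage, run at 1 GHz
half = task_energy(1e-9, 0.9, 10**9)   # half voltage, run at ~500 MHz

assert abs(half / full - 0.25) < 1e-9                        # 4x energy saving
assert task_time(10**9, 0.5e9) == 2 * task_time(10**9, 1e9)  # but 2x slower
```

This quadratic-versus-linear trade is why the surveyed algorithms exploit slack in real-time schedules: whenever a task can afford to run slower without missing its deadline, lowering the voltage saves disproportionately more energy than it costs in time.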

Jiong Luo, Niraj K. Jha
The Customization Landscape for Embedded Systems

The explosive growth of embedded systems in existing and emerging application domains is accompanied by unique constraints and performance requirements along multiple dimensions such as speed, power, and real-time behavior. Typically, these requirements are encapsulated in a few important components, or kernels, for example, for audio and video encoding/decoding, data compression, and encryption. Software implementations of these kernels executing on embedded microprocessors are inadequate to meet the requirements. Customization plays a central role in meeting these unique and specialized needs and has historically been realized by the development of custom hardware solutions such as fully custom application-specific integrated circuits (ASICs). However, each new generation of semiconductor technology is increasing the non-recurring engineering (NRE) costs associated with ASIC development, thereby making it feasible only for high-volume applications and those with product lifetimes amenable to a long time-to-market design cycle. The increasing pervasiveness of embedded systems encompasses applications with smaller product lifetimes, shorter development cycles, and increasing cost containment pressures. Sustaining the continued growth demanded by the market will require technology that delivers the performance characteristics of custom solutions while overcoming the twin hurdles of high NRE costs and the long time to market of current ASIC solutions. Further, the embedded market comprises application segments covering a range of costs, volumes, and time-to-market needs. For example, components of 3G phones, set-top boxes, color laser printers, color copiers, and network-attached storage devices span prices on the order of a few hundred to a few thousand dollars and volumes from thousands to tens of millions of units.
Consequently, the approaches to customization are diverse. As new products and services are envisioned, these evolving workload requirements will place more pressure on the need to innovate in overcoming the high NRE costs and time-to-market considerations of custom hardware solutions. The challenge for future embedded systems is to reconcile the demands of hardware customization, cost, and time-to-market constraints. This talk presents the emerging landscape of solutions, identifies key trends, and describes some representative approaches.

Sudhakar Yalamanchili

Keynote Address

Parallel Computations of Electron-Molecule Collisions in Processing Plasmas

In the plasmas used in semiconductor fabrication, collisions between electrons and polyatomic molecules produce reactive fragments that drive etching and other processes at the wafer surface. Extensive and reliable data on electron-molecule collisions are therefore essential to simulations of plasma reactors. For the low electron energies and polyatomic gases of interest, both measurements and calculations are difficult, and many needed cross sections are lacking. However, rapid advances in computer speeds now make such calculations feasible. Because the fastest computers are highly parallel, both a formulation that accounts well for the physics of low-energy collisions and an implementation that is efficient on parallel architectures are required. We will give an overview of our formulation of the electron-molecule collision problem, of its implementation and performance on parallel machines, and of some results of its application.

B. Vincent McKoy, Carl Winstead

Biocomputation

Computing Challenges and Systems Biology

Over the past two decades the manner in which science is conducted in biology and related disciplines has been changing dramatically. Automation has led to a vast amount of data being generated by new experimental techniques, such as genomic sequencing and DNA microarrays. To extract the scientific insights buried in this high volume of data, life science researchers are employing increasingly sophisticated information technology approaches to create data analysis and simulation tools that run on high performance computing (HPC) platforms. The domain is one rich in compute-intensive applications: determining gene sequence, predicting macromolecular structure, understanding the temporal dynamics of protein folding, modeling molecular interactions, simulating the behavior of cell-signaling cascades and genetic regulatory networks, and the real-time manipulation of 3D rendered volumes of organs derived from structural magnetic resonance imaging scans. Future trends in the life sciences are expected to expand upon this set to include computation performed using hybrid bio-nano devices, personalized medicine, mining of heterogeneous data sets stored at disparate sites, and knowledge discovery in patient medical records. The talk will review high performance computing needs in emerging systems biology applications, the potential impact of bio-nano research on future high performance computing designs, and work on related topics at DARPA.

Srikanta P. Kumar, Jordan C. Feidler, Henrietta Kulaga
Visual Programming for Modeling and Simulation of Biomolecular Regulatory Networks

In this paper we introduce our new tool BIOSKETCHPAD, which allows visual programming and modeling of biological regulatory networks. The tool allows biologists to create dynamic models of networks using a menu of icons, arrows, and pop-up menus, and translates the input model into CHARON, a modeling language for modular design of interacting multi-agent hybrid systems. Hybrid systems are systems that are characterized by continuous as well as discrete dynamics. Once a CHARON model of the underlying system is generated, we are able to exploit the various analysis capabilities of the CHARON toolkit, including simulation and reachability analysis. We illustrate the advantages of this approach using a case study concerning the regulation of bioluminescence in a marine bacterium.
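The defining structure of a hybrid system, continuous flows within each mode plus guard-triggered discrete mode switches, can be sketched in a few lines. The example below is illustrative only (a thermostat, not a gene-regulation model, and not CHARON syntax); the dynamics, thresholds, and names are invented for the sketch.

```python
# Illustrative sketch (not the CHARON language): a hybrid system pairs
# continuous dynamics (an ODE per mode, here Euler-integrated) with
# discrete mode switches fired by guard conditions on the state.

def simulate(temp: float, steps: int, dt: float = 0.1):
    """Thermostat hybrid automaton: modes 'heat' and 'cool' over one state."""
    mode = "heat" if temp < 20.0 else "cool"
    trace = []
    for _ in range(steps):
        # Continuous flow: each mode has its own dynamics.
        dtemp = 2.0 if mode == "heat" else -2.0
        temp += dtemp * dt
        # Discrete jumps: guard conditions trigger mode switches.
        if mode == "heat" and temp >= 22.0:
            mode = "cool"
        elif mode == "cool" and temp <= 18.0:
            mode = "heat"
        trace.append((mode, temp))
    return trace

trace = simulate(19.0, 200)
assert all(17.5 <= t <= 22.5 for _, t in trace)    # state stays in the band
assert {m for m, _ in trace} == {"heat", "cool"}   # both modes are visited
```

Regulatory networks fit the same mold: protein concentrations evolve continuously within a mode, while threshold crossings (a gene switching on or off) are the discrete jumps, which is why a hybrid-systems toolkit applies.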

Rajeev Alur, Calin Belta, Franjo Ivančić, Vijay Kumar, Harvey Rubin, Jonathan Schug, Oleg Sokolsky, Jonathan Webb
Framework for Open Source Software Development for Organ Simulation in the Digital Human

The current state of the field of medical simulation is characterized by scattered research projects using a variety of models that are neither interoperable nor independently verifiable. Individual simulators are frequently built from scratch by individual research groups without input and validation from a larger community. The challenge of developing useful medical simulations is often too great for any individual group, since expertise is required from different fields: from molecular and cell biology, from anatomy and physiology, and from computer science and systems theory.

M. Cenk Cavusoglu, Tolga Goktekin, Frank Tendick, S. Shankar Sastry
Reachability Analysis of Delta-Notch Lateral Inhibition Using Predicate Abstraction

This paper examines the feasibility of predicate abstraction as a method for the reachability analysis of hybrid systems. A hybrid system can be abstracted into a purely discrete system by mapping the continuous state space into an equivalent finite discrete state space using a set of Boolean predicates and a decision procedure in the theory of real closed fields. It is then possible to find the feasible transitions between these states. In this paper, we propose new conditions for predicate abstraction which greatly reduce the number of transitions in the abstract discrete system. We also develop a computational technique for reachability analysis and apply it to a biological system of interest (the Delta-Notch lateral inhibition problem).
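The core mapping behind predicate abstraction, evaluating a fixed set of Boolean predicates to collapse an infinite continuous state space into at most 2^k abstract states, can be shown concretely. The sketch below is illustrative only: the predicates and dynamics are invented, and the reachable set is computed by sampling one trajectory rather than by the decision procedure over real closed fields that the paper uses.

```python
# Hypothetical sketch (predicates and dynamics invented): predicate
# abstraction maps a continuous state x onto the Boolean vector of its
# predicate valuations, so k predicates yield at most 2**k abstract states
# over which discrete reachability analysis can be run.

predicates = [
    lambda x: x > 0.0,   # p0: the state is positive
    lambda x: x < 1.0,   # p1: the state is below the threshold 1.0
]

def abstract_state(x: float) -> tuple:
    """Map a continuous state to its tuple of predicate truth values."""
    return tuple(p(x) for p in predicates)

def step(x: float) -> float:
    """A toy contractive dynamic with fixed point x = -0.2."""
    return x / 2 - 0.1

def reachable(x0: float, n: int) -> set:
    """Abstract states visited along one sampled trajectory of length n."""
    seen, x = set(), x0
    for _ in range(n):
        seen.add(abstract_state(x))
        x = step(x)
    return seen

states = reachable(2.0, 50)
assert (True, False) in states        # start x=2.0: positive, not below 1.0
assert (False, True) in states        # near the fixed point -0.2
assert len(states) <= 2 ** len(predicates)
```

A sound analysis would instead over-approximate: for each abstract state, a decision procedure checks which successor valuations are feasible under the dynamics, so that every concrete trajectory is covered; the paper's contribution lies in conditions that prune infeasible transitions from that abstract system.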

Inseok Hwang, Hamsa Balakrishnan, Ronojoy Ghosh, Claire Tomlin
A Symbolic Approach to Modeling Cellular Behavior

The author examines the connection between the classical differential algebra of Ritt and Kolchin and differential algebraic models of biochemical systems, in particular the models generated by Savageau's S-systems. Several open problems of both biological and mathematical significance are proposed.

Bhubaneswar Mishra
Backmatter
Metadata
Title
High Performance Computing — HiPC 2002
edited by
Sartaj Sahni
Viktor K. Prasanna
Uday Shukla
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-36265-4
Print ISBN
978-3-540-00303-8
DOI
https://doi.org/10.1007/3-540-36265-7