
2002 | Book

High Performance Computing

4th International Symposium, ISHPC 2002 Kansai Science City, Japan, May 15–17, 2002 Proceedings

Edited by: Hans P. Zima, Kazuki Joe, Mitsuhisa Sato, Yoshiki Seo, Masaaki Shimasaki

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

I wish to welcome all of you to the International Symposium on High Performance Computing 2002 (ISHPC2002) and to Kansai Science City, which is not far from the ancient capitals of Japan: Nara and Kyoto. ISHPC2002 is the fourth in the ISHPC series, which consists, to date, of ISHPC '97 (Fukuoka, November 1997), ISHPC '99 (Kyoto, May 1999), and ISHPC2000 (Tokyo, October 2000). The success of these symposia indicates the importance of this area and the strong interest of the research community. With all of the recent drastic changes in HPC technology trends, HPC has had and will continue to have a significant impact on computer science and technology. I am pleased to serve as General Chair at a time when HPC plays a crucial role in the era of the IT (Information Technology) revolution. The objective of this symposium is to exchange the latest research results in software, architecture, and applications in HPC in a more informal and friendly atmosphere. I am delighted that the symposium is, like past successful ISHPCs, composed of excellent invited talks, panels, and workshops, as well as high-quality technical papers on various aspects of HPC. We hope that the symposium will provide an excellent opportunity for lively exchange and discussion about directions in HPC technologies, and that all the participants will enjoy not only the symposium but also their stay in Kansai Science City.

Table of Contents

Frontmatter

Invited Papers

The Gilgamesh MIND Processor-in-Memory Architecture for Petaflops-Scale Computing
An Extended Abstract

Implicit in the evolution of current technology and high-end systems is the anticipated implementation of computers capable of a peak performance of 1 Petaflops by the year 2010. This is consistent with both the semiconductor industry's roadmap of basic device technology development and an extrapolation of the TOP-500 list of the world's fastest computers according to the Linpack benchmark. But if contemporary experiences with today's largest systems hold true for their descendants at the end of the decade, then those descendants will be very expensive (> $100M), consume too much power (> 3 MWatts), take up too much floor space (> 10,000 square feet), deliver very low efficiency (< 10%), and be too difficult to program. Also important is the likely degradation of reliability due to the multiplicative factors of MTBF and scale of components. Even if these systems do manage to drag the community to the edge of Petaflops, there is no basis of confidence to assume that they will provide the foundation for systems across the next decade that will transition through the trans-Petaflops performance regime. It has become increasingly clear that an alternative model of system architecture may be required for future generation high-end computers.

Thomas Sterling
The UK e-Science Program and the Grid

The talk describes the £120M UK 'e-Science' initiative and begins by defining what is meant by the term e-Science. The majority of the £120M, some £85M, is for the support of large-scale e-Science projects in many areas of science and engineering. The infrastructure needed to support such projects must allow sharing of distributed and heterogeneous computational and data resources and support effective collaboration between groups of scientists. Such an infrastructure is commonly referred to as the Grid. The remaining funds, some £35M, constitute the e-Science 'Core Program' and are intended to encourage the development of robust and generic Grid middleware in collaboration with industry. The key elements of the Core Program will be described, including the construction of a UK e-Science Grid. In addition, the talk will include a description of the pilot projects that have so far received funding. These span a range of disciplines from particle physics and astronomy to engineering and healthcare. We conclude with some remarks about the need to develop a data architecture for the Grid that will allow federated access to relational databases as well as flat files.

Tony Hey
SPEC HPC2002: The Next High-Performance Computer Benchmark
Extended Abstract

The High-Performance Group of the Standard Performance Evaluation Corporation (SPEC/HPG) [1] is developing the next release of its high-performance computer benchmark suite. The current suite, SPEChpc96 [2], is expected to be replaced by the SPEC HPC2002 suite in the second quarter of the year 2002. HPC2002 represents a minor update to SPEC's current high-performance computing benchmarks. In the last subsection of this document we describe SPEC's longer-term plans.

Rudolf Eigenmann, Greg Gaertner, Wesley Jones, Hideki Saito, Brian Whitney, SPEC High-Performance Group

Award Papers

Language and Compiler Support for Hybrid-Parallel Programming on SMP Clusters

In this paper we present HPF extensions for clusters of SMPs and their implementation within the VFC compiler. The main goal of these extensions is to optimize HPF for clusters of SMPs by enhancing the functionality of the mapping mechanisms and by providing the user with high-level means for controlling key aspects of distributed-memory and shared-memory parallelization. Based on the proposed language extensions, the VFC compiler adopts a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by exploiting shared-memory parallelism based on OpenMP within nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the language extensions, outline the hybrid parallelization strategy of VFC and present experimental results which show the effectiveness of these techniques.

Siegfried Benkner, Viera Sipkova
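
A minimal C sketch of the hybrid strategy described above, with MPI distributing a block of the iteration space to each SMP node and OpenMP splitting the local block among threads (the array name, sizes, and the reduction are illustrative, not VFC-generated code):

```c
/* Hybrid parallelization sketch: MPI across nodes, OpenMP within a
 * node.  Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double a[N];
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Distributed-memory level: block-distribute the iteration space. */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    /* Shared-memory level: OpenMP threads split the local block. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = lo; i < hi; i++) {
        a[i] = (double)i;
        local += a[i];
    }

    /* Combine node-local partial results across nodes with MPI. */
    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```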
Parallelizing Merge Sort onto Distributed Memory Parallel Computers

Merge sort is useful for sorting a great deal of data progressively, especially when the data can be partitioned and easily collected to a few processors. Merge sort can be parallelized; however, conventional algorithms using distributed memory computers have poor performance due to the successive halving of the number of participating processors, down to one in the last merging stage. This paper presents a load-balanced parallel merge sort where all processors do the merging throughout the computation. Data are evenly distributed to all processors, and every processor is forced to work in all merging phases. An analysis shows the upper bound of the speedup of the merge time as (P-1)/log P, where P is the number of processors. We have reached a speedup of 8.2 (upper bound is 10.5) on a 32-processor Cray T3E in sorting 4M 32-bit integers.

Minsoo Jeon, Dongseung Kim

Networks

Avoiding Network Congestion with Local Information

Congestion leads to severe performance degradation in multiprocessor interconnection networks. Therefore, the use of techniques that prevent network saturation is of crucial importance. Some recent proposals use global network information, thus requiring that nodes exchange control information, which consumes a far from negligible bandwidth. As a consequence, the behavior of these techniques in practice is not as good as expected. In this paper, we propose a mechanism that uses only local information to avoid network saturation. Each node estimates traffic locally by using the percentage of free virtual output channels that can be used to forward a message towards its destination. When this number is below a threshold value, network congestion is assumed to exist and message throttling is applied. The main contributions of the proposed mechanism are twofold: i) it is more selective than previous approaches, as it only prevents the injection of messages when they are destined to congested areas; and ii) it outperforms recent proposals that rely on global information.

E. Baydal, P. López, J. Duato
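
The injection throttle described above can be pictured with a small C sketch; the threshold value and the function names are illustrative, not taken from the paper:

```c
/* Sketch of a local-information injection throttle.  A message may be
 * injected only if enough of the virtual output channels usable
 * toward its destination are currently free. */
#include <stdbool.h>

#define THRESHOLD 0.25  /* assumed fraction of free useful channels */

/* free_useful: virtual output channels toward the destination that
 * are currently free; total_useful: all such channels. */
bool may_inject(int free_useful, int total_useful)
{
    if (total_useful == 0)
        return false;               /* nothing usable: hold the message */
    double free_fraction = (double)free_useful / (double)total_useful;
    /* Below the threshold, congestion is assumed on the path to this
     * destination, and only messages headed there are throttled. */
    return free_fraction >= THRESHOLD;
}
```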
Improving InfiniBand Routing through Multiple Virtual Networks

InfiniBand is very likely to become the de facto standard for communication between nodes and I/O devices (SANs) as well as for interprocessor communication (NOWs). The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology is arbitrarily established by the customer. Often, the interconnection pattern is irregular. Up*/down* is the most popular routing scheme currently used in NOWs with irregular topologies. However, the main drawbacks of up*/down* routing are the unbalanced channel utilization and the difficulty of routing most packets through minimal paths, which negatively affects network performance. Using additional virtual lanes can improve up*/down* routing performance by reducing the head-of-line blocking effect, but it is not aimed at removing these main drawbacks. In this paper, we propose a new methodology that uses a reduced number of virtual lanes in an efficient way to achieve a better traffic balance and a higher number of minimal paths. This methodology is based on routing packets simultaneously through several properly selected up*/down* trees. To guarantee deadlock freedom, each up*/down* tree is built over a different virtual network. Simulation results show that the proposed methodology increases throughput by an average factor ranging from 1.18 to 2.18 for 8-, 16-, and 32-switch networks using only two virtual lanes. For larger networks with an additional virtual lane, network throughput is tripled, on average.

J. Flich, P. López, J. C. Sancho, A. Robles, J. Duato

Architectures I

Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance

We present a new cache protocol, Minerva, which allows the effective cache block size to vary dynamically. Minerva works using sector caches (also known as block/subblock caches). Cache consistency attributes (from the MESI set of states) are associated with each 4-byte word in the cache. Each block can itself have one of the attributes invalid, exclusive, or shared. Each block also has a current subblock size of 2^k words and a confidence value for hysteresis. The subblock size is reevaluated every time there is an external access (read or invalidate) to the block. When a fetch miss occurs within a block, a subblock equal to the current subblock size is fetched. Note that the fetch may involve a gather operation, with various words coming from different sources; some of the words may already be present. Depending on the assumed cache sizes, block sizes, bus width, and bus timings, we find that Minerva reduces execution times by 19-40%, averaged over 12 parallel test programs. For a 64-bit wide bus, we find a consistent execution time reduction of around 30%. Our evaluation considers the utility of various other optimizations and the extra state bits required.

Jeffrey B. Rothman, Alan Jay Smith
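
An illustrative C rendering of the per-block state the abstract describes; the field widths, confidence maximum, and update policy here are assumptions, not the paper's exact design:

```c
#include <stdint.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

#define WORDS_PER_BLOCK 16     /* 4-byte words per block (assumed) */
#define CONF_MAX 3             /* saturating confidence (assumed) */

struct minerva_block {
    uint64_t  tag;
    enum mesi word_state[WORDS_PER_BLOCK]; /* MESI state per 4-byte word */
    enum mesi block_state;     /* invalid / exclusive / shared */
    uint8_t   log2_subblock;   /* current subblock size = 2^k words */
    uint8_t   confidence;      /* hysteresis counter for resizing */
};

/* Re-evaluate the subblock size on an external read or invalidate:
 * move toward `wanted` only when the confidence counter has drained,
 * so a single access cannot flip the size back and forth. */
void reevaluate(struct minerva_block *b, uint8_t wanted)
{
    if (wanted == b->log2_subblock) {
        if (b->confidence < CONF_MAX) b->confidence++;
    } else if (b->confidence > 0) {
        b->confidence--;
    } else {
        b->log2_subblock = wanted;
    }
}
```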
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters

We show how novel active memory system research and system networking trends can be combined to realize hardware distributed shared memory on clusters of industry-standard workstations. Our active memory controller extends the cache coherence protocol to support transparent use of address re-mapping techniques that dramatically improve single-node performance, and also contains the necessary functionality for building a hardware DSM machine. Simultaneously, commodity network technology is becoming more tightly integrated with the memory controller. We call our design, active memory commodity nodes interconnected by a next-generation network, active memory clusters (AMC). We present a detailed design of the AMC architecture, focusing on the active memory controller and the network characteristics necessary to support AMC. We show simulation results for a range of parallel applications, showing that AMC performance is comparable to that of custom hardware DSM systems and far exceeds that of the fastest software DSM solutions.

Mark Heinrich, Evan Speight, Mainak Chaudhuri
The Impact of Alias Analysis on VLIW Scheduling

This experiment studies the speed-up that alias analysis (AA) produces on code for very long instruction word (VLIW) machines. AA is done on demand when requested by the scheduler, in order to eliminate critical arcs of the data dependence graph. Different heuristic criteria are investigated for deciding when to compute alias information, and they show that only a fraction of the alias relation really contributes to the program speed-up. A qualitative study shows that the quality of the initial code affects the speedup alias analysis can give. The results should help compiler designers for VLIW machines in making cost-effective AA decisions.

Marco Garatti, Roberto Costa, Stefano Crespi Reghizzi, Erven Rohou
Low-Cost Value Predictors Using Frequent Value Locality

Speculation in resolving data dependences has recently been studied as a means of extracting more instruction-level parallelism (ILP). Each instruction's outcome is predicted by value predictors. The instruction and its dependent instructions can then be executed in parallel, thereby exploiting ILP aggressively. One of the serious hurdles to realizing data speculation is the huge hardware budget required by the predictors. In this paper, we propose techniques that exploit frequent value locality, resulting in a significant budget reduction. Based on these proposals, we evaluate two value predictors, named the zero-value predictor and the 0/1-value predictor. The zero-value predictor generates only the value 0. Similarly, the 0/1-value predictor generates only the values 0 and 1. Simulation results show that the proposed predictors deliver greater performance than the last-value predictor, which requires a hardware budget twice as large as that of the proposed predictors. Therefore, the zero- and 0/1-value predictors are promising candidates for cost-effective and practical value predictors that can be implemented in real microprocessors.

Toshinori Sato, Itsujiro Arita
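
A minimal sketch of a 0/1-value predictor in C: because only the values 0 and 1 can ever be produced, each entry needs just one value bit plus a confidence counter, which is where the hardware budget reduction comes from. The table size and counter widths are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 1024

struct entry {
    uint8_t conf;   /* 2-bit saturating confidence counter */
    uint8_t value;  /* predicted value: 0 or 1 */
};

static struct entry table[ENTRIES];

/* Return true and fill *pred only when confident enough to speculate. */
static bool predict(uint64_t pc, uint64_t *pred)
{
    struct entry *e = &table[pc % ENTRIES];
    if (e->conf < 2)
        return false;           /* not confident: no speculation */
    *pred = e->value;
    return true;
}

/* Train the entry with the instruction's actual outcome. */
static void update(uint64_t pc, uint64_t outcome)
{
    struct entry *e = &table[pc % ENTRIES];
    if (outcome > 1) {          /* outcome not representable: decay */
        if (e->conf > 0) e->conf--;
        return;
    }
    if (e->value == outcome) {
        if (e->conf < 3) e->conf++;
    } else {
        if (e->conf > 0) e->conf--;
        else e->value = (uint8_t)outcome;
    }
}
```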

Architectures II

Integrated I-cache Way Predictor and Branch Target Buffer to Reduce Energy Consumption

In this paper, we present a Branch Target Buffer (BTB) design for energy savings in set-associative instruction caches. We extend the functionality of a BTB by caching way predictions in addition to branch target addresses. Way prediction and branch target prediction are done in parallel. Instruction cache energy savings are achieved by accessing one cache way if the way prediction for a fetch is available. To increase the number of way predictions for higher energy savings, we modify the BTB management policy to allocate entries for non-branch instructions. Furthermore, we propose to partition a BTB into ways for branch instructions and ways for non-branch instructions to reduce the BTB energy as well. We evaluate the effectiveness of our BTB design and management policies with SPEC95 benchmarks. The best BTB configuration shows a 74% energy savings on average in a 4-way set-associative instruction cache, and the performance degradation is only 0.1%. When the instruction cache energy and the BTB energy are considered together, the average energy-delay product reduction is 65%.

Weiyu Tang, Alexander Veidenbaum, Alexandru Nicolau, Rajesh Gupta
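
A sketch of a BTB entry extended with a way prediction, as the abstract describes; the layout, sizes, and field names are assumptions made for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_SETS 512

struct btb_entry {
    uint64_t tag;        /* fetch address tag */
    uint64_t target;     /* predicted target (branches only) */
    uint8_t  way_pred;   /* predicted I-cache way for this fetch */
    bool     way_valid;  /* is a way prediction cached here? */
    bool     is_branch;  /* branch entry, or a non-branch entry
                            allocated under the modified policy */
};

static struct btb_entry btb[BTB_SETS];

/* On a fetch: if the BTB hits and carries a way prediction, enable
 * only that way of the instruction cache; otherwise fall back to a
 * normal all-ways set-associative read. */
bool predict_way(uint64_t pc, uint8_t *way)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SETS];
    if (e->tag == pc && e->way_valid) {
        *way = e->way_pred;
        return true;
    }
    return false;
}
```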
A Comprehensive Analysis of Indirect Branch Prediction

Indirect branch prediction is a performance-limiting factor for current computer systems, preventing superscalar processors from exploiting the available ILP. Indirect branches are responsible for 55.7% of mispredictions in our benchmark set, although they account for only 15.5% of dynamic branches. Moreover, a 10.8% average IPC speedup is achievable by perfectly predicting all indirect branches. The Multi-Stage Cascaded Predictor (MSCP) is a mechanism proposed for improving indirect branch prediction. In this paper, we show that an MSCP can replace a BTB and accurately predict the target address of both indirect and non-indirect branches. We present a detailed analysis of MSCP behavior and evaluate it in a realistic setup, showing that a 5.7% average IPC speedup is achievable.

Oliverio J. Santana, Ayose Falcón, Enrique Fernández, Pedro Medina, Alex Ramírez, Mateo Valero
High Performance and Energy Efficient Serial Prefetch Architecture

Energy-efficient architecture research has flourished recently, in an attempt to address the packaging and cooling concerns of current microprocessor designs, as well as battery life for mobile computers. Moreover, architects have become increasingly concerned with the complexity of their designs in the face of scalability, verification, and manufacturing concerns. In this paper, we propose and evaluate a high-performance, energy- and complexity-efficient front-end prefetch architecture. This design, called Serial Prefetching, combines a high fetch bandwidth branch prediction and efficient instruction prefetching architecture with a low-energy instruction cache. Serial Prefetching explores the benefit of decoupling the tag component of the cache from the data component. Cache blocks are first verified by the tag component of the cache, and then the accesses are put into a queue to be consumed by the data component of the instruction cache. Energy is saved by accessing only the correct way of the data component, specified by the tag lookup in a previous cycle. The tag component does not stall on an I-cache miss, only the data component. Accesses that miss in the tag component are speculatively brought in from lower levels of the memory hierarchy. This in effect performs a prefetch while the access migrates through the queue to be consumed by the data component.

Glenn Reinman, Brad Calder, Todd Austin
A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and the pointer-chasing problem. In this paper, we use a programmable processor, a prefetch engine (PFE), at each level of the memory hierarchy to cooperatively execute instructions that traverse a linked data structure. Cache blocks accessed by the processors at the L2 and memory levels are proactively pushed up to the CPU. We look at several design issues to support this programmable memory hierarchy. We establish a general interaction scheme among the three PFEs and design a mechanism to synchronize PFE execution with the CPU. Our simulation results show that the proposed prefetching scheme can reduce up to 100% of memory stall time on a suite of pointer-intensive applications, reducing overall execution time by an average of 19%.

Chia-Lin Yang, Alvin Lebeck

HPC Systems

Block Red-Black Ordering Method for Parallel Processing of ICCG Solver

The present paper proposes a new parallel ordering, "block red-black ordering," for a parallelized ICCG solver with fewer synchronization points and a high convergence rate. In the new method, the nodes of the analyzed grid are divided into several or many blocks, and red-black ordering is applied to the blocks. Several blocks are assigned to each processor, and the substitutions are carried out in parallel. Only one synchronization point exists in each parallelized substitution. We performed an analytical investigation using ordering graph theory and computational tests on a scalar parallel computer. The results show that the convergence rate is improved by an increase in the number of nodes per block and that high parallel performance is attained by using an appropriate block size.

Takeshi Iwashita, Masaaki Shimasaki
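
The parallel structure of the method can be sketched with OpenMP: blocks of one color have no mutual dependences, so each color is swept in parallel with a single synchronization between the two colors. The block count and the stand-in computation below are illustrative only, not the ICCG kernel itself:

```c
#include <stdio.h>

#define NBLOCKS 64
#define BLOCK_N 1024

static double x[NBLOCKS][BLOCK_N];

/* Stands in for the per-block forward/backward substitution. */
static void process_block(int b)
{
    for (int i = 1; i < BLOCK_N; i++)
        x[b][i] += 0.5 * x[b][i - 1];
}

static void sweep(void)
{
    /* Red phase: blocks of one color are mutually independent. */
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < NBLOCKS; b += 2)
        process_block(b);
    /* The implicit barrier here is the single synchronization
     * point of the parallelized substitution. */

    /* Black phase. */
    #pragma omp parallel for schedule(static)
    for (int b = 1; b < NBLOCKS; b += 2)
        process_block(b);
}

int main(void)
{
    x[0][0] = 1.0;
    sweep();
    printf("%f\n", x[0][BLOCK_N - 1]);
    return 0;
}
```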
Integrating Performance Analysis in the Uintah Software Development Cycle

Technology for empirical performance evaluation of parallel programs is driven by the increasing complexity of high performance computing environments and programming methodologies. This paper describes the integration of the TAU and XPARE tools in the Uintah computational framework. Performance mapping techniques in TAU relate low-level performance data to higher levels of abstraction. XPARE is used for specifying regression testing benchmarks that are evaluated with each periodically scheduled testing trial. This provides a historical panorama of the evolution of application performance. The paper concludes with a scalability study that shows the benefits of integrating performance technology in the development of large-scale parallel applications.

J. Davison de St. Germain, Alan Morris, Steven G. Parker, Allen D. Malony, Sameer Shende
Performance of Adaptive Mesh Refinement Scheme for Hydrodynamics on Simulations of Expanding Supernova Envelope

We present results of performance measurements of our astrophysical fluid dynamics code with an Adaptive Mesh Refinement (AMR) scheme. As an example of its application to astrophysical phenomena, we show the results of a 3D simulation of the Rayleigh-Taylor instability in an expanding supernova envelope and the possibility of reconciling the present model with observations. The efficiency of memory and CPU-time savings is discussed and measured with this simulation. The results show that our code has the ability to simulate phenomena with wide dynamic ranges even with limited computing resources.

Ayato Noro, Tomoya Ogawa, Takuma Ohta, Kazuyuki Yamashita, Shigeki Miyaji, Mitue Den

Earth Simulator

An MPI Benchmark Program Library and Its Application to the Earth Simulator

Parallel programming is essential for large-scale scientific simulations, and MPI is intended to be a de facto standard API for this kind of programming. Since MPI has several functions that exhibit similar behaviors, programmers often have difficulty in choosing the appropriate function for their programs. An MPI benchmark program library named MBL has been developed for gathering performance data for various MPI functions. It measures the performance of MPI-1 and MPI-2 functions under several communication patterns. MBL has been applied to the measurement of MPI performance in the Earth Simulator. It is confirmed that a maximum throughput of 11.7GB/s is obtained in inter-node communications in the Earth Simulator.

Hitoshi Uehara, Masanori Tamura, Mitsuo Yokokawa
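
A generic ping-pong kernel of the kind such a benchmark library contains, measuring point-to-point bandwidth between two ranks. This is a sketch of the methodology, not MBL code; run it with two MPI processes:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BYTES (1 << 20)   /* message size */
#define REPS  100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)   /* two messages per repetition */
        printf("bandwidth: %.2f MB/s\n",
               2.0 * REPS * BYTES / (t1 - t0) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```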
Parallel Simulation of Seismic Wave Propagation

A dense seismic network of strong ground motion instruments in Japan allows a detailed characterization of regional wave propagation during destructive earthquakes. Steadily improving computer power associated with sophisticated parallel algorithms enables large-scale computer simulation of seismic waves in 3D heterogeneous structure. This is illustrated by the 2000 Tottori-ken Seibu earthquake, southwestern Japan. To assess the potential impact on urban cities expected for future earthquake scenarios (e.g. the Nankai earthquake), a close link between strong motion seismology and high-performance computing technology becomes indispensable. The newly developed high performance parallel computer, the Earth Simulator, would provide a key to realistic simulations of strong ground motion.

Takashi Furumura
Large-Scale Parallel Computing of Cloud Resolving Storm Simulator

A severe thunderstorm is composed of strong convective clouds. In order to simulate this type of storm, a very fine grid system is necessary to resolve individual convective clouds within a large domain. Since convective clouds are highly complicated systems of cloud dynamics and microphysics, detailed cloud physical processes as well as the fluid dynamics must be formulated. A huge memory and large-scale parallel computing are necessary for the computation. For this type of computation, we have developed a cloud-resolving numerical model named the Cloud Resolving Storm Simulator (CReSS). In this paper, we describe the basic formulations and characteristics of CReSS in detail. We also show some results of numerical experiments of storms obtained by large-scale parallel computation using CReSS.

Kazuhisa Tsuboki, Atsushi Sakakibara

Short Papers

Routing Mechanism for Static Load Balancing in a Partitioned Computer System with a Fully Connected Network

System partitioning provides the users of high-performance parallel servers with flexibility in resource allocation and dynamic reconfiguration as well as fault isolation. However, the bandwidth of links that connect different domains can be wasted while links within the same domains are congested. In this paper, we present a routing mechanism that can utilize the bandwidth of otherwise unused links to balance the message traffic and lower message latencies for latency-sensitive transactions. The performance of the proposed routing mechanism was studied using an analytical model with on-line transaction processing type workload parameters. The results indicate that the proposed routing mechanism reduces the congestion on the direct paths significantly and lowers the queuing delay for the links. For example, when a 4-cluster system with a fully connected network with a bandwidth of 3.2 GB/s per link is partitioned into two 2-cluster domains, the queuing delay is reduced from 53 ns to 37 ns, improving CPI by 2%.

Hitoshi Oi, Bing-rung Tsai
Studying New Ways for Improving Adaptive History Length Branch Predictors

Pipeline stalls due to branches limit processor performance significantly. This paper provides an in-depth evaluation of Dynamic History Length Fitting (DHLF), a technique that changes the history length of a two-level branch predictor during execution, trying to adapt to its different phases. We analyse the behaviour of DHLF compared with fixed history length gshare predictors, and identify two factors that explain DHLF behaviour: opportunity cost and warm-up cost. Additionally, we evaluate the use of profiling for detecting future improvements. Using this information, we show that new heuristics that minimise both opportunity cost and warm-up cost could significantly outperform current variable history length techniques. Especially at program start-up, where the algorithm tries to learn the behaviour of the program to better predict future branches, the use of profiling considerably reduces the cost produced by continuous history length changes.

Ayose Falcón, Oliverio J. Santana, Pedro Medina, Enrique Fernández, Alex Ramírez, Mateo Valero
Speculative Clustered Caches for Clustered Processors

Clustering is a technique for partitioning a superscalar processor's execution resources to simultaneously allow for more in-flight instructions, wider issue width, and more aggressive clock speeds. As either the size of individual clusters or the total number of clusters increases, the distance to the first-level data cache increases as well. Although clustering may expose more parallelism by allowing a greater number of instructions to be simultaneously analyzed and issued, the gains may be obliterated if the latencies to memory grow too large. We propose to augment each cluster with a small, fast, simple Level Zero (L0) data cache that is accessed in parallel with a traditional L1 data cache. The difference between our solution and other proposed caching techniques for clustered processors is that we do not support versioning or coherence. This may occasionally result in a load instruction reading a stale value from the L0 cache, but the common case is a low latency hit in the L0 cache. Our simulation studies show that 4KB, 2-way set-associative L0 caches provide a 6.5-12.3% IPC improvement over a wide range of processor configurations.

Dana S. Henry, Gabriel H. Loh, Rahul Sami
The Effects of Timing Dependence and Recursion on Parallel Program Schemata

We are interested in the effects of timing dependence and recursion in the class of parallel program schemata. We compare the expressive power of some classes of dataflow schemata: UDF, DFπ, RDF, etc. DFπ includes a π-gate, which introduces timing dependence; RDF has a facility for recursion; and UDF has both devices. In conclusion, we show some inclusion relations between the classes, including the existence of a class standing between EF and EFd. These relations reveal the role of timing dependence and recursion.

Yasuo Matsubara, Takahiro Shakushi
Cache Line Impact on 3D PDE Solvers

Because the performance disparity between processor and main memory is serious, it is necessary to reduce off-chip memory accesses by exploiting temporal locality. Loop tiling is a well-known optimization which enhances data locality. In this paper, we present a new cost model for selecting the best tile size in 3D partial differential equation solvers. Our cost model carefully takes account of the effect of cache lines. We present a performance evaluation of our cost model. The evaluation results reveal the superiority of our cost model over other cost models proposed so far.

Masaaki Kondo, Mitsugu Iwamoto, Hiroshi Nakamura
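
For reference, a tiled 3D stencil sweep of the kind whose tile size such a cost model would select; the tile sizes used here are arbitrary placeholders, not the model's output:

```c
/* Loop tiling of a 7-point 3D stencil: the y and z loops are blocked
 * so the working set of a tile fits in cache, while the innermost x
 * loop streams along contiguous cache lines. */
#define N  128
#define TY 16
#define TZ 16

static double u[N][N][N], v[N][N][N];

static void sweep_tiled(void)
{
    for (int zz = 1; zz < N - 1; zz += TZ)
    for (int yy = 1; yy < N - 1; yy += TY)
        for (int z = zz; z < (zz + TZ < N - 1 ? zz + TZ : N - 1); z++)
        for (int y = yy; y < (yy + TY < N - 1 ? yy + TY : N - 1); y++)
        for (int x = 1; x < N - 1; x++)   /* unit stride: cache lines */
            v[z][y][x] = (u[z][y][x-1] + u[z][y][x+1] +
                          u[z][y-1][x] + u[z][y+1][x] +
                          u[z-1][y][x] + u[z+1][y][x]) / 6.0;
}
```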
An EPIC Processor with Pending Functional Units

The Itanium processor, an implementation of an Explicitly Parallel Instruction Computing (EPIC) architecture, is an in-order processor that fetches, executes, and forwards results to functional units in order. The architecture relies heavily on the compiler to expose Instruction Level Parallelism (ILP) to avoid stalls created by in-order processing. The goal of this paper is to examine, in small steps, changing the in-order Itanium processor model to allow execution to be performed out of order. The purpose is to overcome memory and functional unit latencies. To accomplish this, we consider an architecture with Pending Functional Units (PFU). The PFU architecture assigns/schedules instructions to functional units in order. Instructions sit at the pending functional units until their operands become ready and then execute out of order. While an instruction is pending at a functional unit, no other instruction can be scheduled to that functional unit. We examine several PFU architecture designs. The minimal design does not perform renaming and only supports bypassing of non-speculative result values. We then make PFU more aggressive by supporting speculative register state, and finally by adding register renaming. We show that the minimal PFU architecture provides on average an 18% speedup over an in-order EPIC processor and produces up to half of the speedup that would be gained using a full out-of-order architecture.

Lori Carter, Weihaw Chuang, Brad Calder
Software Energy Optimization of Real Time Preemptive Tasks by Minimizing Cache-Related Preemption Costs

An attempt has been made to optimize the software energy of real-time preemptive tasks by minimizing the cache-related preemption costs, which are primarily incurred due to inter-task interference in the cache. We present an algorithm that outputs an "optimum" task layout which reduces the overall inter-task interference in the cache and thus reduces the preemption costs of each task of a given task set. We compare the result of our estimated layout with that of randomly generated layouts for benchmark examples to demonstrate the performance of our algorithm.

Rakesh Kumar, Tusar Kanti Patra, Anupam Basu
Distributed Genetic Algorithm with Multiple Populations Using Multi-agent

This paper designs a distributed genetic algorithm using a master-slave multi-agent model for the TSP, in order to reduce the execution time and obtain solutions closer to the optimum. Distributed genetic algorithms with multiple populations are difficult to configure because they are controlled by many parameters that affect their efficiency and accuracy. Among other things, one must decide the number and the size of the populations (demes), the rate of migration, the frequency of migrations, and the destination of the migrants. In this paper, I develop two dynamic migration window methods, the increasing dynamic window and the random dynamic window, that control the size and the frequency of migrations. In addition, I design a new genetic migration policy that selects the destination of the migrants among the slave agents.

Jung-Sook Kim
Numerical Weather Prediction on the Supercomputer Toolkit

The Supercomputer Toolkit constructs parallel computation networks by connecting processor modules. These connections are set by the user prior to a run and are static during the run. The Technion's Toolkit prototype was used to run a simplified version of the PSU/NCAR MM5 mesoscale model [9]. Each processor is assigned columns of the grid points of a square in (x, y) space. When n × n columns are assigned to each processor, its computation time is proportional to n² and its communication time to n. Since the Toolkit's network computes in parallel and communicates in parallel, then, for a given n, the total time is independent of the size of the two-dimensional array or the area over which the weather prediction takes place. A mesoscale forecast over the eastern Mediterranean was run and measured; it suggests that, were the Toolkit constructed from ALPHA processors, 10 processors would complete a 36-hour prediction in only about 13 minutes. A 36-hour prediction with full physics for the whole earth would require 2 hours on 80 ALPHA processors.

Pinhas Alpert, Alexander Goikhman, Jacob Katzenelson, Marina Tsidulko
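
The scaling argument can be written compactly (the constants and notation below are introduced for illustration and do not appear in the abstract):

```latex
% Per-processor time when each processor holds an n x n patch of
% columns; c_1 and c_2 are machine-dependent constants.
\[
  T_{\text{proc}}(n) = c_1\, n^{2} + c_2\, n
\]
% All processors compute and communicate in parallel, so for a fixed
% patch size n the wall-clock time is independent of the total
% forecast area.
```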
OpenTella: A Peer-to-Peer Protocol for the Load Balancing in a System Formed by a Cluster from Clusters

The search for high performance in traditional computing systems has been tied to the development of specific applications executed over parallel machines. However, the cost of this kind of system, which is high due to the hardware involved, has limited continuing development in these areas, as just a small part of the community has access to such systems. With the aim of using low-cost hardware to achieve high performance, this paper presents OpenTella, a protocol based on peer-to-peer models to update the information related to the occupation of resources and to analyse this occupation for a subsequent migration of processes among computers of a cluster.

Rodrigo F. de Mello, Maria Stela V. Paiva, Luís Carlos Trevelin, Adilson Gonzaga
Power Estimation of a C Algorithm Based on the Functional-Level Power Analysis of a Digital Signal Processor

A complete methodology to estimate power consumption at the C level for off-the-shelf processors is introduced. It relies on Functional-Level Power Analysis, which results in a power model of the processor that describes the consumption variations relative to algorithmic and configuration parameters. Some parameters can be predicted directly from the C algorithm with simple assumptions about the compilation. Maximum and minimum bounds for power consumption are obtained, together with a very accurate estimation; for the TI C6x, a maximum error of 6% against measurements is obtained for classical digital signal processing algorithms. Estimation results are summarized on a consumption map; the designer can compare the algorithm's consumption, and its variations, with the application constraints.

Nathalie Julien, Johann Laurent, Eric Senn, Eric Martin
Irregular Assignment Computations on cc-NUMA Multiprocessors

This paper addresses the parallelization of loops with irregular assignment computations on cc-NUMA multiprocessors. This loop pattern is distinguished by the existence of loop-carried output data dependences that can only be detected at run time. A parallelization technique based on the inspector-executor model is proposed in this paper. In the inspector, loop iterations are reordered so that they can be executed in a conflict-free manner during the executor stage. The design of the inspector ensures load balancing and the exploitation of data write locality on each processor. Experimental results show the scalability of this technique, which is presented as a clear alternative to other existing methods.

Manuel Arenaz, Juan Touriño, Ramón Doallo
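
A C sketch of the inspector-executor idea under a block-ownership assumption: the inspector buckets iterations by the owner of the written element, so the executor performs each write from exactly one thread, conflict-free and with write locality. This illustrates the model in general, not the paper's implementation:

```c
/* Parallelize the irregular assignment
 *     for (i = 0; i < N; i++) a[f[i]] = compute(i);
 * despite run-time output dependences (f[i] == f[j] for i != j). */
#include <omp.h>
#include <stdlib.h>

#define N     (1 << 20)
#define ASIZE (1 << 18)

static double a[ASIZE];
static int f[N];                          /* index array, set at run time */

static double compute(int i) { return (double)i; }

void irregular_assign(int nthreads)
{
    /* Inspector: count, then bucket, iterations per owning thread. */
    int block = (ASIZE + nthreads - 1) / nthreads;
    int *count = calloc(nthreads, sizeof *count);
    for (int i = 0; i < N; i++)
        count[f[i] / block]++;
    int **list = malloc(nthreads * sizeof *list);
    for (int t = 0; t < nthreads; t++) {
        list[t] = malloc(count[t] * sizeof **list);
        count[t] = 0;
    }
    for (int i = 0; i < N; i++) {         /* original order preserved */
        int t = f[i] / block;
        list[t][count[t]++] = i;
    }

    /* Executor: each thread performs only the writes it owns, so the
     * last write to each element happens in serial order. */
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (int k = 0; k < count[t]; k++) {
            int i = list[t][k];
            a[f[i]] = compute(i);   /* hits a contiguous block of a[] */
        }
    }

    for (int t = 0; t < nthreads; t++) free(list[t]);
    free(list); free(count);
}
```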

International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2002)

Large System Performance of SPEC OMP2001 Benchmarks

Performance characteristics of application programs on large-scale systems are often significantly different from those on smaller systems. SPEC OMP2001 is a benchmark suite intended for measuring the performance of modern shared memory parallel systems. The first component of the suite, SPEC OMPM2001, was developed for medium-scale (4- to 16-way) systems. We present our experiences from benchmark development in achieving good scalability using the OpenMP API. This paper then analyzes the published results of SPEC OMPM2001 on large systems (32-way and larger), based on application program behavior and the systems' architectural features. The ongoing development of the SPEC OMP2001 benchmark suites is also discussed. Its main feature is an increased data set for large-scale systems. We refer to this suite as SPEC OMPL2001, in contrast to the current SPEC OMPM2001 (medium data set) suite.

Hideki Saito, Greg Gaertner, Wesley Jones, Rudolf Eigenmann, Hidetoshi Iwashita, Ron Lieberman, Matthijs van Waveren, Brian Whitney, SPEC High-Performance Group
A Shared Memory Benchmark in OpenMP

The efficient use of the memory system is a key issue for the performance of many applications. A benchmark written in OpenMP is presented that measures several aspects of a shared memory system, such as bandwidth, memory latency, and inter-thread latency. Special focus is on revealing and identifying bottlenecks and possible hierarchies in the main memory system.

Matthias S. Müller
Performance Evaluation of the Hitachi SR8000 Using OpenMP Benchmarks

This paper reports the performance of a single node of the Hitachi SR8000 using OpenMP benchmarks. Each processing node of the SR8000 is a shared-memory parallel computer composed of eight scalar processors with a pseudo-vector processing feature. We have run all of the SPEC OMP2001 benchmarks and three benchmarks (LU, SP, and BT) of the NAS Parallel Benchmarks on the SR8000. According to the results of this performance measurement, we found that the SR8000 has good scalability up to 8 processors except for a few benchmark programs. The performance results demonstrate that the SR8000 achieves high performance, especially for memory-intensive applications.

Daisuke Takahashi, Mitsuhisa Sato, Taisuke Boku
Communication Bandwidth of Parallel Programming Models on Hybrid Architectures

Most HPC systems are clusters of shared memory nodes. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. This paper introduces several programming models for hybrid systems. It focuses on programming methods that can achieve optimal inter-node communication bandwidth, and on the hybrid MPI+OpenMP approach and its programming rules. The communication behavior is compared with the pure MPI programming paradigm and with RDMA- and NUMA-based programming models.

Rolf Rabenseifner
Performance Comparisons of Basic OpenMP Constructs

OpenMP has become the de facto standard for shared memory parallel programming. The directive-based nature of OpenMP allows incremental and portable development of parallel applications for a wide range of platforms. The fact that OpenMP is easy to use implies that a lot of details are hidden from the end user. Therefore, basic factors like the runtime system, compiler optimizations, and other implementation-specific issues can have a significant impact on the performance of an OpenMP application. Frequently, OpenMP constructs can have widely varying performance on different operating platforms and even with different compilers on the same machine. This makes it very important to have a comparative study of the low-level performance of individual OpenMP constructs. In this paper, we present an enhanced set of microbenchmarks for OpenMP derived from the EPCC benchmarks and based on the SKaMPI benchmarking framework. We describe the methodology of evaluation, followed by details of some of the constructs and their performance measurement. Results from experiments conducted on the IBM SP3 and SUN SunFire systems are presented for each construct.

Achal Prabhakar, Vladimir Getov, Barbara Chapman
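
The EPCC-style methodology referred to above can be sketched in a few lines of C: time a fixed delay loop serially, time it again wrapped in the construct under test, and report the difference per invocation. The delay length and repetition count below are arbitrary:

```c
#include <omp.h>
#include <stdio.h>

#define OUTER 1000

static volatile double sink;        /* keeps the delay from being optimized away */

static void delay(int n)            /* fixed amount of busy work */
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += i * 0.5;
    sink = s;
}

int main(void)
{
    /* Reference: the delay executed serially. */
    double t0 = omp_get_wtime();
    for (int k = 0; k < OUTER; k++)
        delay(1000);
    double serial = omp_get_wtime() - t0;

    /* Same work inside the construct under test; every thread runs
     * the whole delay, so extra time is fork/join overhead. */
    t0 = omp_get_wtime();
    for (int k = 0; k < OUTER; k++) {
        #pragma omp parallel
        delay(1000);
    }
    double par = omp_get_wtime() - t0;

    printf("parallel overhead: %.3f us\n",
           (par - serial) / OUTER * 1e6);
    return 0;
}
```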
SPMD OpenMP versus MPI on an IBM SMP for 3 Kernels of the NAS Benchmarks

Shared memory multiprocessors are becoming more popular since they are used to deploy large parallel computers. The current trend is to enlarge the number of processors inside such multiprocessor nodes. However, a lot of existing applications use the message passing paradigm even when running on shared memory machines. This is due to three main factors: 1) the legacy of previous versions written for distributed memory computers, 2) the difficulty of obtaining high performance with OpenMP when using loop-level parallelization, and 3) the complexity of writing multithreaded programs using a low-level thread library. In this paper we demonstrate that OpenMP can provide better performance than MPI on SMP machines. We use a coarse grain parallelization approach, also known as the SPMD programming style with OpenMP. The performance evaluation considers the IBM SP3 NH2 and three kernels of the NAS benchmarks: FT, CG, and MG. We compare three implementations of them: the NAS 2.3 MPI version, a fine grain (loop-level) OpenMP version, and our SPMD OpenMP version. A breakdown of the execution times provides an explanation of the performance results.

Géraud Krawezik, Guillaume Alléon, Franck Cappello
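
A toy example of the SPMD style with OpenMP: a single parallel region spans the whole computation, each thread owns a fixed slice of the data, and explicit barriers replace per-loop fork/join, much as a message-passing code would be structured. This is illustrative only and unrelated to the NAS kernels themselves:

```c
#include <omp.h>
#include <stdio.h>

#define N     1000000
#define STEPS 100

static double u[N], unew[N];

int main(void)
{
    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int lo = (int)((long)N * t / nt);       /* thread's own slice */
        int hi = (int)((long)N * (t + 1) / nt);
        if (lo == 0) lo = 1;                    /* skip boundaries */
        if (hi == N) hi = N - 1;

        for (int step = 0; step < STEPS; step++) {
            for (int i = lo; i < hi; i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            #pragma omp barrier   /* all reads of u must finish first */
            for (int i = lo; i < hi; i++)
                u[i] = unew[i];
            #pragma omp barrier   /* all writes visible before next step */
        }
    }
    printf("%f\n", u[N / 2]);
    return 0;
}
```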
Parallel Iterative Solvers for Unstructured Grids Using an OpenMP/MPI Hybrid Programming Model for the GeoFEM Platform on SMP Cluster Architectures

An efficient parallel iterative method for unstructured grids, developed by the authors for SMP cluster architectures on the GeoFEM platform, is presented. The method is based on a 3-level hybrid parallel programming model, including message passing for inter-SMP-node communication, loop directives with OpenMP for intra-SMP-node parallelization, and vectorization for each processing element (PE). Simple 3D elastic linear problems with more than 8×10^8 DOF have been solved by 3×3 block ICCG(0) with additive Schwarz domain decomposition and PDJDS/CM-RCM reordering on 128 SMP nodes of a Hitachi SR8000/MPP parallel computer, achieving a performance of 335.2 GFLOPS. The PDJDS/CM-RCM reordering method provides excellent vector and parallel performance in SMP nodes.

Kengo Nakajima, Hiroshi Okuda
A Parallel Computing Model for the Acceleration of a Finite Element Software

This paper presents the OpenMP parallelization of the finite element code LAGAMINE, a non-linear large-deformation code for solid mechanics. The parallelization approach uses coarse grain parallelism: in the element loop, to build the stiffness matrix, and in the direct solver, to compute the LU factorisation. The approach is applied to a stamping simulation.

Pierre de Montleau, Jose Maria Cela, Serge Moto Mpong, André Godinass
Towards OpenMP Execution on Software Distributed Shared Memory Systems

In this paper, we examine some of the challenges present in providing support for OpenMP applications on a cluster system based on Software Distributed Shared Memory (DSM). We present detailed measurements of the performance characteristics of realistic OpenMP applications from the SPEC OMP2001 benchmarks. Based on these measurements, we discuss application and system characteristics that impede the efficient execution of these programs on a Software DSM system. We point out pitfalls of a naive translation approach from OpenMP into the API provided by a Software DSM system, and we discuss a set of possible program optimization techniques.

Ayon Basumallik, Seung-Jai Min, Rudolf Eigenmann
Dual-Level Parallelism Exploitation with OpenMP in Coastal Ocean Circulation Modeling

Two alternative dual-level parallel implementations of the Multiblock Grid Princeton Ocean Model (MGPOM) are compared in this paper. The first combines the use of two programming paradigms: message passing with the Message Passing Interface (MPI) and shared memory with OpenMP (a version called MPI-OpenMP); the second uses only OpenMP (a version called OpenMP-Only). MGPOM is a multiblock grid code that enables the exploitation of two levels of parallelism. The MPI-OpenMP implementation uses MPI to parallelize computations by assigning each grid block to a unique MPI process. Since not all grid blocks are of the same size, the workload between processes varies. OpenMP is used within each MPI process to improve load balance. The alternative OpenMP-Only implementation uses some proposed extensions to OpenMP that define thread groups in order to efficiently exploit the available two levels of parallelism. These extensions are supported by a research OpenMP compiler named NanosCompiler. Performance results of the two implementations of the MGPOM code on a 20-block grid for the Arabian Gulf simulation demonstrate the efficacy of the OpenMP-Only version of the code. The simplicity of the OpenMP implementation, as well as the possibility of defining and using policies to dynamically change the allocation of OpenMP threads to the two levels of parallelism, is the main result of this study and suggests considering this alternative for the parallelization of future applications.

Marc González, Eduard Ayguadé, Xavier Martorell, Jesús Labarta, Phu V. Luong
Static Coarse Grain Task Scheduling with Cache Optimization Using OpenMP

Effective use of cache memory is becoming more important with the increasing gap between processor speed and memory access speed. Also, the use of multigrain parallelism is becoming more important to improve effective performance beyond the limitation of loop-iteration-level parallelism. Considering these factors, this paper proposes a static coarse grain task scheduling scheme with cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache, after task and data decomposition that takes the cache size into account at compile time. It is implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from SPEC fp 95. As a result, the proposed scheme gives a 4.56 times speedup for Swim and a 2.37 times speedup for Tomcatv on 4 processors against the Sun Forte HPC 6 loop parallelizing compiler.

Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

HPF International Workshop: Experiences and Progress (HiWEP 2002)

High Performance Fortran - History, Status and Future

High Performance Fortran (HPF) is a data-parallel language that was designed to provide users with a high-level interface for programming scientific applications, while delegating to the compiler the task of generating an explicitly parallel message-passing program. This lecture will outline the original motivation for the development of HPF and provide an overview of various versions of the language. The focus will be on a study of the expressivity of the HPF approach in the context of the requirements posed by “real” applications, and the need to achieve target code performance close to that of hand-written MPI programs. We will identify a set of important language elements for the efficient solution of a range of advanced applications in science and engineering. Dealing with such problems does not only need the flexible data and work distribution features included in the HPF-2 Standard, but also requires additional capabilities such as the explicit control of some aspects of communication. The lecture concludes with an outlook to future research and development in HPF-related languages and compilers.

Hans P. Zima
Performance Evaluation for Japanese HPF Compilers with Special Benchmark Suite

The lack of performance portability has discouraged scientific application users from developing portable programs written in HPF. As users would like to run the same source code on different parallel machines as fast as possible, we have investigated the performance portability of Japanese HPF compilers (NEC and Fujitsu) with a special benchmark suite. We got good performance in most cases with DISTRIBUTE and INDEPENDENT directives on the NEC SX-5, but the Fujitsu VPP800 required us to explicitly force no communication inside parallel loops with additional LOCAL directives. It was also found that manual optimizations for communication with the HPF/JA extensions were very useful for tuning parallel performance.

Hitoshi Sakagami, Shingo Furubayashi
Evaluation of the HPF/JA Extensions on Fujitsu VPP Using the NAS Parallel Benchmarks

We have ported 5 codes of APR's HPF implementation of the NAS Parallel Benchmarks so that they conform to the HPF 2.0 and HPF/JA specifications. The porting makes full use of the new HPF 2.0 and HPF/JA features, while the base Fortran codes remain almost unmodified. We then measured the performance of these benchmark codes on a Fujitsu VPP800 with the HPF/VPP compiler. We achieved a fairly good acceleration ratio for all codes except MG, and better absolute performance than the NPB 2.3 parallel implementation with MPI for EP and BT.

Kae Asaoka, Akio Hirano, Yasuo Okabe, Masanori Kanazawa
Three-Dimensional Electromagnetic Particle-in-Cell Code Using High Performance Fortran on PC Cluster

A three-dimensional full electromagnetic particle-in-cell (PIC) code, the TRISTAN (Tridimensional Stanford) code, has been parallelized using High Performance Fortran (HPF) on an RPM (Real Parallel Machine). In the simulation, the simulation domain is decomposed in one dimension, and both the particle and field data located in each domain, which we call a sub-domain, are distributed to each processor. Both the particle and field data of a sub-domain are needed by the neighboring sub-domains, and thus communications between the sub-domains are inevitable. Our simulation results using HPF exhibit the promising applicability of HPF communications to large-scale scientific computing such as 3D particle simulations.

DongSheng Cai, Yaoting Li, Ken-ichi Nishikawa, Chiejie Xiao, Xiaoyan Yan
Towards a Lightweight HPF Compiler

The UXP/V HPF compiler, which has been developed for the VPP series of vector-parallel supercomputers, extracts the highest performance from the hardware. However, it is getting difficult for developers to concentrate on specific hardware. This paper describes a method of developing an HPF compiler for multiple platforms without losing performance, taking advantage of existing technology. The code generator and runtime system of VPP Fortran are reused for high-end computers; MPI is employed for general distributed environments, such as a PC cluster. Following a performance estimation on different systems, we discuss the effectiveness of the method and open issues.

Hidetoshi Iwashita, Kohichiro Hotta, Sachio Kamiya, Matthijs van Waveren
Parallel I/O Support for HPF on Computational Grids

Recently, several projects have started to implement large-scale high-performance computing on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual supercomputer". One of the great challenges for this environment is to provide appropriate high-level programming models. High Performance Fortran (HPF) is a language of choice for the development of data parallel components of Grid applications. Another challenge is to provide efficient access to data that is distributed across local and remote Grid resources. In this paper, constructs to specify parallel input and output (I/O) operations on multidimensional arrays on the Grid in the context of HPF are proposed. The paper also presents implementation concepts based on the HPF compiler VFC, the parallel I/O runtime system Panda, Internet, and Grid technologies. Preliminary experimental performance results are discussed in the context of a real application example.

Peter Brezany, Jonghyun Lee, Marianne Winslett
Optimization of HPF Programs with Dynamic Recompilation Technique

Optimizing compilers perform various optimizations in order to exploit the best performance from computer systems. However, some kinds of optimizations cannot be applied if values of variables or system parameters are not known at compilation time. To solve this problem, we designed and implemented a system which collects such information at run time, and dynamically recompiles part of the program based on it. In our system, recompilation and management of runtime information are carried out on processors other than those which execute user programs. Therefore, recompilation cost does not affect the program execution time, unlike other similar systems. The evaluation result shows that quite high speedup can be attained with this method.

Takuya Araki, Hitoshi Murai, Tsunehiko Kamachi, Yoshiki Seo
Backmatter
Metadata
Title
High Performance Computing
Edited by
Hans P. Zima
Kazuki Joe
Mitsuhisa Sato
Yoshiki Seo
Masaaki Shimasaki
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-47847-8
Print ISBN
978-3-540-43674-4
DOI
https://doi.org/10.1007/3-540-47847-7