2011 | Book

High Performance Computing for Computational Science – VECPAR 2010

9th International Conference, Berkeley, CA, USA, June 22-25, 2010, Revised Selected Papers

Editors: José M. Laginha M. Palma, Michel Daydé, Osni Marques, João Correia Lopes

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 9th International Conference on High Performance Computing for Computational Science, VECPAR 2010, held in Berkeley, CA, USA, in June 2010. The 34 revised full papers presented together with five invited contributions were carefully selected during two rounds of reviewing and revision. The papers are organized in topical sections on linear algebra and solvers on emerging architectures, large-scale simulations, parallel and distributed computing, and numerical algorithms.

Table of Contents

Frontmatter

Invited Talks

Exascale Computing Technology Challenges

High Performance Computing architectures are expected to change dramatically in the next decade as power and cooling constraints limit increases in microprocessor clock speeds. Consequently computer companies are dramatically increasing on-chip parallelism to improve performance. The traditional doubling of clock speeds every 18-24 months is being replaced by a doubling of cores or other parallelism mechanisms. During the next decade the amount of parallelism on a single microprocessor will rival the number of nodes in early massively parallel supercomputers that were built in the 1980s. Applications and algorithms will need to change and adapt as node architectures evolve. In particular, they will need to manage locality to achieve performance. A key element of the strategy as we move forward is the co-design of applications, architectures and programming environments. There is an unprecedented opportunity for application and algorithm developers to influence the direction of future architectures so that they meet DOE mission needs. This article will describe the technology challenges on the road to exascale, their underlying causes, and their effect on the future of HPC system design.

John Shalf, Sudip Dosanjh, John Morrison
The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?
An Overview of Research at the Berkeley Parallel Computing Laboratory

The Par Lab started in 2008, based on an earlier technical report “The Berkeley View” on the parallel computing challenge. (K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 18 2006.) This talk gives an update on where we are two years into the Par Lab. We picked five applications to drive our research, and believe they collectively capture many of the important features of future client applications even if they themselves do not become the actual future “killer app”. The Personalized Medicine application focuses on detailed modeling of individuals’ responses to treatments, representing the important health market. The Music application emphasizes real-time responsiveness to rich human input, with high-performance many-channel audio synthesis. The Speech application focuses on making speech input work well in the real-world noisy environments where mobile devices will be operated. The Content-Based Image Recognition (CBIR) application represents the growing practical use of machine vision. Finally, the Parallel Web Browser is currently perhaps the most important single application on client devices, as well as representative of many other interactive rich-document processing tasks.

Our first step in attacking the parallel programming challenge was to analyze a wide range of applications, including workloads from embedded computing, desktop computing, games, databases, machine learning, and scientific computing, as well as our five driving applications. We discovered a surprisingly compact set of recurring computational patterns, which we termed “motifs”. We have greatly expanded on this work, and now believe that any successful software architecture, parallel or serial, can be described as a hierarchy of patterns. We divide patterns into either computational patterns, which describe a computation to be performed, or structural patterns, which describe how computations are composed. The patterns have proven central to our research effort, serving both as a common human vocabulary for multidisciplinary discussions spanning application developers to hardware architects and as an organizing structure for software development. Another organizing principle in our original proposal was to divide the software development stack into two layers: efficiency and productivity. Programmers working in the efficiency layer are generally experts in achieving high performance from the underlying hardware, but are not necessarily knowledgeable of any given application domain. Programmers working in the productivity layer are generally knowledgeable about an application domain, but are less concerned with hardware details. The patterns bridge these two layers. Efficiency programmers develop libraries and frameworks that efficiently implement the standard patterns, and productivity programmers can decompose an application into patterns and use high-level languages to compose corresponding libraries and frameworks to form applications.

To improve the quality and portability of efficiency-level libraries, we proposed to leverage our earlier work on autotuning. Autotuning is an automatic search-based optimization process whereby multiple variants of a routine are generated and empirically evaluated on the hardware platform. We have also included a major effort on parallel program correctness to help programmers test, verify, and debug their code. Different correctness techniques apply at the efficiency layer, where low-level data races and deadlocks are of concern, and at the productivity layer, where we wish to ensure semantic determinism and atomicity. Our whole pattern-based component approach to the software stack hinges on the ability to efficiently and flexibly compose software modules. We developed a low-level user-level scheduling substrate called “Lithe” to support efficient sharing of processing resources between arbitrary modules, even those written in different languages and to different programming models.
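
The search loop at the heart of autotuning is easy to state. The sketch below is our own minimal Python illustration (not the Par Lab autotuners, which generate and compile low-level code variants): it times several block sizes of a toy blocked matrix multiply on the current machine and keeps the fastest.

```python
import time
import numpy as np

def blocked_matmul(A, B, bs):
    """Blocked matrix multiply; the block size bs is the tuning knob."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

def autotune(n=256, block_sizes=(16, 32, 64, 128)):
    """Empirically evaluate each variant on this machine and keep the fastest."""
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    timings = {}
    for bs in block_sizes:
        t0 = time.perf_counter()
        blocked_matmul(A, B, bs)
        timings[bs] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

best_bs, timings = autotune()
print("fastest block size on this platform:", best_bs)
```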

Our operating system and architecture research is devoted to supporting the software stack. The OS is based on space-time partitioning, which exports stable partitions of the machine resources with quality-of-service guarantees to an application, and two-level scheduling, which allows a user-level scheduler, such as Lithe, to perform detailed application-specific scheduling within a partition. Our architecture research focuses on techniques to support OS resource partitioning, performance counters to support application adaptivity, software-managed memory hierarchies to increase memory efficiency, and scalable coherence and synchronization mechanisms to lower parallel system overheads. To experiment with the behavior of our new software stack on our new OS and hardware mechanisms, we have developed an FPGA-based simulation environment, “RAMP Gold”. By running our full application and OS software environment on our fast architectural simulator, we can quickly iterate across levels in our system stack.

David Patterson
HPC Techniques for a Heart Simulator

In the post-genome era, the integration of molecular and cellular findings into studies of the functions of organs and individuals is recognized as an important field of medical science and physiology. Computational modeling plays a central role in this field, which is referred to as Physiome. However, despite advancements in computational science, this task remains difficult. In addition to coupling multiple disciplines, including electricity, physical chemistry, solid mechanics and fluid dynamics, the integration of events over a wide range of scales must also be accomplished. Our group, including clinical practitioners, has been tackling this problem for several years, with a focus on the human heart.

The morphology of our heart model is reconstructed from human multi-detector computed tomography data and discretized using the finite element method. For the electrophysiology simulation, a composite voxel mesh with fine voxels in and around the heart and coarse voxels covering the torso is adopted to solve the bidomain equation. Following the excitation of a sinus node, the simulator reproduces the excitation propagation and depolarization of the membrane potentials of virtual cells sequentially in the atria, conduction system, and ventricles. The mechanical simulation for the interaction between the heart wall and intracavitary blood flow is performed on a tetrahedral mesh. The Ca²⁺ concentration data obtained from the electrophysiology model are applied to the molecular model of sarcomere dynamics to compute the contraction force of every element. This results in the synchronous contraction of the heart and blood flow.

Thus far, in our simulation studies we have been able to retrieve and present data in the same way as clinical diagnostic tools such as the ECG, UCG, and magneto-cardiogram. These data are in good agreement with the clinical data for both normal and diseased heart models, thus suggesting their potential for diagnostic support.

However, a more important aspect of the simulation involves modeling the underlying mechanism driving the myocardium, i.e., the origin of the pulsation of the heart, which includes electrophysiological regulation and cross-bridge kinetics in the cardiac cells. To integrate such microscopic phenomena with the macroscopic function of the organ in a seamless manner, the cardiac cells are also modeled using the finite element method, based on the cell physiology for every finite element in the heart model. The mathematical linkage is realized using the so-called homogenization method. All the cell models and the heart model are then solved simultaneously, because the instantaneous states of the macroscopic model, such as the strains and strain rates over the heart wall, also regulate each cell response. It is apparent that the total number of degrees of freedom of all the cell models becomes prohibitively large.

We will introduce basic algorithms and parallel computational techniques applied to the above-mentioned multi-physics and multi-scale simulations.

Takumi Washio, Jun-ichi Okada, Seiryo Sugiura, Toshiaki Hisada
Game Changing Computational Engineering Technology

During the last two decades, giant strides have been achieved in many aspects of Computational Engineering. Higher-fidelity mathematical models, better approximation methods, and faster algorithms have been developed for many time-dependent applications. SIMD, SPMD, MIMD, coarse-grain, and fine-grain parallel processors have come and gone. Linux clusters are now ubiquitous, cores have replaced CEs, and GPUs have shattered computing speed barriers. Most importantly, the potential of high-fidelity physics-based simulations for providing deeper understanding of complex engineering systems and enhancing system performance has been recognized in almost every field of engineering. Yet, in many engineering applications, high-fidelity time-dependent numerical simulations are not performed as often as needed, or are more often performed in special circumstances than routinely. The reason is very simple: these simulations remain too computationally intensive for time-critical operations such as design, design optimization, and active control. Consequently, the impact of computational sciences on such operations has yet to materialize. Petascale or exascale computing alone is unlikely to make this happen. Achieving this objective demands instead a game-changing computational technology that bridges both ends of the computing spectrum. This talk will attempt to make the case for this pressing need and outline a candidate computational technology for filling it, based on model reduction, machine learning concepts, trained databases, and rigorous interpolation methods. It will also illustrate it with preliminary results obtained from its application to the support of the flutter flight testing of a fighter aircraft and the aerodynamic optimization of a Formula 1 car.

Charbel Farhat
HPC in Phase Change: Towards a New Execution Model

HPC is entering a new phase in system structure and operation, driven by a combination of technology and architecture trends. Perhaps foremost are the constraints of power and complexity that, as a result of the flat-lining of clock rates, make multicore the primary means by which performance gain is achieved under Moore’s Law. Indeed, for all intents and purposes, “multicore” is the new “Moore’s Law”, with steady increases in the number of cores per socket. Added to this are the highly multithreaded GPU components moving HPC into the heterogeneous modality for additional performance gain. These dramatic changes in system architecture are forcing new methods of use, including programming and system management. Historically, HPC has experienced five previous phase changes involving technology, architecture, and programming models. The current phase, now two decades old, is exemplified by the communicating sequential processes model of computation, which replaced previous vector and SIMD models. HPC is now faced with the need for new effective means of sustaining performance growth with technology through the rapid expansion of multicore, with anticipated structures of hundreds of millions of cores by the end of this decade delivering Exaflops performance. This presentation will discuss the driving trends and issues of the new phase change in HPC, and will discuss the ParalleX execution model that is serving as a pathfinding framework for exploring an innovative synthesis of semantic constructs and mechanisms that may serve as a foundation for computational systems and techniques in the Exascale era. This talk is being given just as DARPA is initiating its UHPC program and DOE is launching additional programs, such as their X-stack, all aimed at catalyzing research into this challenging area.

Thomas Sterling

Linear Algebra and Solvers on Emerging Architectures

Factors Impacting Performance of Multithreaded Sparse Triangular Solve

As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the parallel efficiency of the algorithm. However, we need efficient threaded numerical kernels to run on the multi-core nodes in order to achieve good parallel efficiency. In this paper, we focus on improving the performance of a multithreaded triangular solver, an important kernel for preconditioning. We analyze three factors that affect the parallel performance of this threaded kernel and obtain good scalability on the multi-core nodes for a range of matrix sizes.
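
The abstract does not spell out the kernel, so as background: a standard way to expose thread-level parallelism in a sparse triangular solve is level scheduling, where rows are grouped into levels that can be processed concurrently. The Python sketch below is our own serial illustration of that idea (with an assumed stored diagonal), not the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_schedule(L):
    """Group rows of a sparse lower-triangular matrix into levels: every row
    in a level depends only on rows from earlier levels, so all rows within
    one level can be solved concurrently."""
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = L.indices[L.indptr[i]:L.indptr[i+1]]
        deps = deps[deps < i]                      # strictly-lower entries
        level[i] = (level[deps].max() + 1) if len(deps) else 0
    return [np.where(level == l)[0] for l in range(level.max() + 1)]

def solve_by_levels(L, b):
    """Forward substitution, one level at a time (rows within a level are
    independent, so the inner loop is the one a threaded kernel parallelizes)."""
    x = np.zeros_like(b, dtype=float)
    for rows in level_schedule(L):
        for i in rows:
            cols = L.indices[L.indptr[i]:L.indptr[i+1]]
            vals = L.data[L.indptr[i]:L.indptr[i+1]]
            off = cols < i
            x[i] = (b[i] - vals[off] @ x[cols[off]]) / vals[~off][0]
    return x

# Demo on a small random sparse lower-triangular system:
A = np.tril(np.random.rand(8, 8) * (np.random.rand(8, 8) < 0.3)) + 8 * np.eye(8)
L = csr_matrix(A)
print("levels:", [list(r) for r in level_schedule(L)])
print("solve ok:", np.allclose(L @ solve_by_levels(L, np.ones(8)), np.ones(8)))
```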

Michael M. Wolf, Michael A. Heroux, Erik G. Boman
Performance and Numerical Accuracy Evaluation of Heterogeneous Multicore Systems for Krylov Orthogonal Basis Computation

We study the numerical behavior of heterogeneous systems such as CPU with GPU or IBM Cell processors for some orthogonalization processes. We focus on the influence of the different floating-point arithmetic handling of these accelerators with Gram-Schmidt orthogonalization using single and double precision. We observe for dense matrices a loss of at worst 1 digit for CUDA-enabled GPUs as well as a speed-up of 20x, and 2 digits for the Cell processor for a 7x speed-up. For sparse matrices, the result between CPU and GPU is very close and the speed-up is 10x. We conclude that the Cell processor is a good accelerator for double precision because of its full IEEE compliance, but not sufficient for single precision applications. The GPU speed-up is better than Cell and the decent IEEE support delivers results close to the CPU ones for both precisions.
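
As background to the digit-loss measurements, the sketch below reproduces the kind of experiment involved: it runs modified Gram-Schmidt in single and double precision and reports the loss of orthogonality, max |I - QᵀQ|. It is a CPU-only NumPy illustration of the metric, not the paper's GPU/Cell code.

```python
import numpy as np

def mgs(A):
    """Modified Gram-Schmidt QR: subtract each projection immediately, which
    keeps the loss of orthogonality far smaller than the classical variant."""
    A = A.copy()
    m, n = A.shape
    Q = np.zeros_like(A)
    R = np.zeros((n, n), dtype=A.dtype)
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        Q[:, k] = A[:, k] / R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ A[:, j]
            A[:, j] -= R[k, j] * Q[:, k]
    return Q, R

# Orthogonality loss in single vs double precision:
A = np.random.rand(200, 50)
for dtype in (np.float32, np.float64):
    Q, _ = mgs(A.astype(dtype))
    print(dtype.__name__, np.abs(np.eye(50) - Q.T @ Q).max())
```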

Jérôme Dubois, Christophe Calvin, Serge Petiton
An Error Correction Solver for Linear Systems: Evaluation of Mixed Precision Implementations

This paper proposes an error correction method for solving linear systems of equations and the evaluation of an implementation using mixed precision techniques.

While different technologies are available, graphic processing units (GPUs) have been established as particularly powerful coprocessors in recent years. For this reason, our error correction approach is focused on a CUDA implementation executing the error correction solver on the GPU.

Benchmarks are performed both for artificially created matrices with preset characteristics as well as matrices obtained from finite element discretizations of fluid flow problems.
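
The error correction idea evaluated here is classical iterative refinement: solve cheaply in low precision, then correct using a residual computed in high precision. A minimal CPU-only NumPy sketch of the principle (the paper runs the inner solve in CUDA on the GPU, and a real solver would factor the low-precision matrix once rather than re-solving):

```python
import numpy as np

def mixed_precision_refine(A, b, iters=10):
    """Iterative refinement: corrections come from a float32 solve (the fast,
    'accelerator' precision); residuals and the accumulated solution stay
    in float64 so the final accuracy is that of double precision."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                                   # high-precision residual
        c = np.linalg.solve(A32, r.astype(np.float32))  # low-precision correction
        x = x + c.astype(np.float64)
        if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
            break
    return x

A = np.random.rand(500, 500) + 500 * np.eye(500)  # well-conditioned test matrix
b = np.random.rand(500)
x = mixed_precision_refine(A, b)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```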

Hartwig Anzt, Vincent Heuveline, Björn Rocker
Multifrontal Computations on GPUs and Their Multi-core Hosts

The use of GPUs to accelerate the factoring of large sparse symmetric matrices shows the potential of yielding important benefits to a large group of widely used applications. This paper examines how a multifrontal sparse solver performs when exploiting both the GPU and its multi-core host. It demonstrates that the GPU can dramatically accelerate the solver relative to one host CPU. Furthermore, the solver can profitably exploit both the GPU to factor its larger frontal matrices and multiple threads on the host to handle the smaller frontal matrices.

Robert F. Lucas, Gene Wagenbreth, Dan M. Davis, Roger Grimes
Accelerating GPU Kernels for Dense Linear Algebra

Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.

Rajib Nath, Stanimire Tomov, Jack Dongarra
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators

We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs’ compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. This results in a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA’s Tesla S1070 (4 C1060 GPUs, each with 30 cores @1.44 GHz) connected to two dual-core AMD Opteron @1.8GHz processors, we reach up to 1.163 TFlop/s in single and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of that rate for single and double precision arithmetic, respectively.
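
The blocked structure that such hybrid Cholesky codes distribute between CPU cores and GPUs can be shown serially. The NumPy sketch below is our own illustration of the right-looking blocked algorithm, with comments marking which kernels a hybrid scheme would typically offload; it is not the authors' implementation.

```python
import numpy as np

def blocked_cholesky(A, bs=64):
    """Right-looking blocked Cholesky: factor a diagonal tile, solve the
    panel below it, then update the trailing submatrix. In hybrid schemes
    the small tile factorization stays on the CPU while the flop-heavy
    panel solve and trailing update are offloaded to GPUs."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])           # POTRF (CPU)
        if e < n:
            L_kk = A[k:e, k:e]
            A[e:, k:e] = np.linalg.solve(L_kk, A[e:, k:e].T).T  # TRSM (GPU)
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T              # SYRK (GPU)
    return np.tril(A)

A = np.random.rand(300, 300)
A = A @ A.T + 300 * np.eye(300)   # symmetric positive definite test matrix
L = blocked_cholesky(A)
print("factorization error:", np.abs(A - L @ L.T).max())
```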

Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, Jack Dongarra
On the Performance of an Algebraic Multigrid Solver on Multicore Clusters

Algebraic multigrid (AMG) solvers have proven to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore cluster architectures, we face new challenges that can significantly harm AMG’s performance. We discuss our experiences on such an architecture and present a set of techniques that help users to overcome the associated problems, including thread and process pinning and correct memory associations. We have implemented most of the techniques in a MultiCore SUPport library (MCSup), which helps to map OpenMP applications to multicore machines. We present results using both an MPI-only and a hybrid MPI/OpenMP model.

Allison H. Baker, Martin Schulz, Ulrike M. Yang
An Hybrid Approach for the Parallelization of a Block Iterative Algorithm

The Cimmino method is a row projection method in which the original linear system is divided into subsystems. At every iteration, it computes one projection per subsystem and uses these projections to construct an approximation to the solution of the linear system.

The usual parallelization strategy in block algorithms is to distribute the different blocks over the available processors. In this paper, we follow another approach: we do not perform this block distribution to processors explicitly within the code, but let the multifrontal sparse solver MUMPS handle the data distribution and parallelism. The data coming from the subsystems defined by the block partition in the Block Cimmino method are gathered in a single block-diagonal sparse matrix which is analysed, distributed, and factorized in parallel by MUMPS. Our goal is to define a methodology for parallelism based only on the functionalities provided by general sparse solver libraries, and to assess how efficient this approach can be.
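
For readers unfamiliar with the method, a toy dense version of the Block Cimmino iteration looks as follows. This is our own sketch: in the paper the per-block projections are computed through MUMPS on the factorized block-diagonal matrix, and in practice the basic iteration is accelerated (e.g., by conjugate gradients) rather than run plainly as here.

```python
import numpy as np

def block_cimmino(A, b, blocks, iters=500, omega=1.0):
    """Block Cimmino iteration: at every step, project the current iterate
    onto the solution set of each row-block subsystem independently (the
    parallel part), then combine the projections to update the iterate."""
    n = A.shape[1]
    x = np.zeros(n)
    for _ in range(iters):
        delta = np.zeros(n)
        for rows in blocks:                      # independent -> parallelizable
            Ai, bi = A[rows], b[rows]
            ri = bi - Ai @ x
            # projection onto {y : Ai y = bi} via the pseudo-inverse
            delta += Ai.T @ np.linalg.solve(Ai @ Ai.T, ri)
        x = x + (omega / len(blocks)) * delta
    return x

A = np.random.rand(8, 8) + 8 * np.eye(8)
b = np.random.rand(8)
x = block_cimmino(A, b, blocks=[range(0, 4), range(4, 8)])
print("residual:", np.linalg.norm(b - A @ x))
```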

Carlos Balsa, Ronan Guivarch, Daniel Ruiz, Mohamed Zenadi
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, a QR factorization, and computing the inverse of a symmetric positive definite matrix. In this extended abstract, we revisit the computation of the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However, we demonstrate that, for some variants, non-trivial compiler techniques (array renaming, loop reversal and pipelining) then need to be applied to further increase the parallelism of the application. We present preliminary experimental results.

Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, Lee Rosenberg
A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm

In this paper, we propose a process-grid-free algorithm for a massively parallel dense symmetric eigensolver with a communication splitting multicasting algorithm. In this algorithm, a tradeoff exists between speed and the memory space needed to keep the Householder vectors. In a performance evaluation on the T2K Open Supercomputer (U. Tokyo) and the RX200S5, we obtain 0.86x and 0.95x slowdowns with 1/2 the memory space compared to the conventional algorithm for a square process grid. We also show a new algorithm for small-sized matrices in massively parallel processing that takes an appropriately small value of p for the process grid p × q. In this case, the execution time of the inverse transformation is negligible.

Takahiro Katagiri, Shoji Itoh

Large Scale Simulations in CS&E

Global Memory Access Modelling for Efficient Implementation of the Lattice Boltzmann Method on Graphics Processing Units

In this work, we investigate the global memory access mechanism on recent GPUs. For the purpose of this study, we created specific benchmark programs, which allowed us to explore the scheduling of global memory transactions. Thus, we formulate a model capable of estimating the execution time for a large class of applications. Our main goal is to facilitate optimisation of regular data-parallel applications on GPUs. As an example, we finally describe our CUDA implementations of LBM flow solvers on which our model was able to estimate performance with less than 5% relative error.
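
A zeroth-order version of such a model treats an LBM kernel as purely bandwidth-bound: execution time is total global-memory traffic divided by sustained throughput. The paper's model is much finer grained (it accounts for the scheduling of memory transactions), but the sketch below shows the shape of such an estimate; all numbers are illustrative assumptions, not measurements from the paper.

```python
def lbm_time_estimate(nodes, bandwidth_gbs, q=19, bytes_per_value=4):
    """First-order performance model for a memory-bound LBM kernel: each
    lattice node reads and writes one particle distribution per direction
    (D3Q19 -> q = 19), so execution time is roughly total traffic divided
    by the sustained global-memory throughput."""
    traffic = nodes * q * bytes_per_value * 2   # one read + one write per value
    return traffic / (bandwidth_gbs * 1e9)      # seconds

# e.g. a 256^3 lattice on a GPU sustaining ~100 GB/s (assumed figures):
t = lbm_time_estimate(256**3, bandwidth_gbs=100)
print(f"estimated time per LBM step: {t * 1e3:.2f} ms")
```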

Christian Obrecht, Frédéric Kuznik, Bernard Tourancheau, Jean-Jacques Roux
Data Structures and Transformations for Physically Based Simulation on a GPU

As general purpose computing on Graphics Processing Units (GPGPU) matures, more complicated scientific applications are being targeted to utilize the data-level parallelism available on a GPU. Implementing physically-based simulation on data-parallel hardware requires preprocessing overhead which affects application performance. We discuss our implementation of physics-based data structures that provide significant performance improvements when used on data-parallel hardware. These data structures allow us to maintain a physics-based abstraction of the underlying data, reduce programmer effort and obtain 6x-8x speedup over previously implemented GPU kernels.

Perhaad Mistry, Dana Schaa, Byunghyun Jang, David Kaeli, Albert Dvornik, Dwight Meglan
Scalability Studies of an Implicit Shallow Water Solver for the Rossby-Haurwitz Problem

The scalability of a fully implicit global shallow water solver is studied in this paper. In the solver a conservative second-order finite volume scheme is used to discretize the shallow water equations on a cubed-sphere mesh which is free of pole-singularities. Instead of using the popular explicit or semi-implicit methods in climate modeling, we employ a fully implicit method so that the restrictions on the time step size can be greatly relaxed. A Newton-Krylov-Schwarz method is then used to solve the nonlinear system of equations at each time step. Within each Newton iteration, the linear Jacobian system is solved by using a Krylov subspace method preconditioned with a Schwarz method. To further improve the scalability of the algorithm, we use a multilevel hybrid Schwarz preconditioner to suppress the increase of the iteration number as the mesh is refined or more processors are used. We show by numerical experiments on the Rossby-Haurwitz problem that the fully implicit solver scales well to thousands of processors on an IBM BlueGene/L supercomputer.
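
The outer/inner structure (Newton outside, preconditioned Krylov inside) can be tried at small scale with SciPy's matrix-free Newton-Krylov driver. The sketch below substitutes a toy 1-D nonlinear residual for the shallow water equations and omits the Schwarz preconditioner (SciPy's inner_M hook is where one would attach it); it illustrates only the solver nesting, not the paper's parallel implementation.

```python
import numpy as np
from scipy.optimize import newton_krylov

# Toy nonlinear system F(u) = 0: a 1-D reaction-diffusion balance standing
# in for the discretized residual at one implicit time step.
def residual(u):
    F = np.empty_like(u)
    F[0], F[-1] = u[0], u[-1]                      # boundary conditions
    F[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2] - 0.01 * np.exp(u[1:-1])
    return F

# Outer Newton iteration with an inner, matrix-free GMRES solve; a Schwarz
# preconditioner would be passed via the inner_M keyword.
u = newton_krylov(residual, np.zeros(100), method="gmres")
print("nonlinear residual norm:", np.linalg.norm(residual(u)))
```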

Chao Yang, Xiao-Chuan Cai
Parallel Multigrid Solvers Using OpenMP/MPI Hybrid Programming Models on Multi-Core/Multi-Socket Clusters

OpenMP/MPI hybrid parallel programming models were implemented in a 3D finite-volume based simulation code for groundwater flow problems through heterogeneous porous media using parallel iterative solvers with multigrid preconditioning. Performance and robustness of the developed code have been evaluated on the “T2K Open Supercomputer (Tokyo)” and “Cray-XT4” using up to 1,024 cores through both weak and strong scaling computations. The OpenMP/MPI hybrid parallel programming model demonstrated better performance and robustness than flat MPI with a large number of cores for ill-conditioned problems, given appropriate command lines for NUMA control, first-touch data placement, and reordering of the data for contiguous “sequential” access to memory.

Kengo Nakajima
A Parallel Strategy for a Level Set Simulation of Droplets Moving in a Liquid Medium

The simulation of two-phase flow problems involving two time-dependent spatial regions with different physical properties is computationally hard. The numerical solution of such problems is complicated by the need to represent the movement of the interface. The level set approach is a front-capturing method representing the position of the interface implicitly by the root of a suitably defined function. We describe a parallel adaptive finite element simulation based on the level set approach. For freely sedimenting n-butanol droplets in water, we quantify the parallel performance on a Xeon-based cluster using up to 256 processes.

Oliver Fortmeier, H. Martin Bücker
Optimization of Aircraft Wake Alleviation Schemes through an Evolution Strategy

We investigate schemes to accelerate the decay of aircraft trailing vortices. These structures are susceptible to several instabilities that lead to their eventual destruction. We employ an Evolution Strategy to design a lift distribution and a lift perturbation scheme that minimize the wake hazard as proposed in [6]. The performance of a scheme is measured as the reduction of the mean rolling moment that would be induced on a following aircraft; it is computed by means of a Direct Numerical Simulation using a parallel vortex particle code. We find a configuration and a perturbation scheme characterized by an intermediate wavelength λ ~ 4.64, necessary to trigger medium-wavelength instabilities between tail and flap vortices and subsequently amplify long-wavelength modes.

Philippe Chatelain, Mattia Gazzola, Stefan Kern, Petros Koumoutsakos

Parallel and Distributed Computing

On-Line Multi-Threaded Processing of Web User-Clicks on Multi-Core Processors

Real-time search — a setting in which Web search engines are able to include among their query results documents published on the Web in the very recent past — is clear evidence that many of the off-line computations performed so far on conventional search engines need to be moved to the on-line arena. This is a demanding case for parallel computing, since it is necessary to cope efficiently with thousands of concurrent read and write operations per unit time, all requiring latency times within a fraction of a second. To our knowledge, the computations related to capturing user preferences through their clicks on query result webpages and including this feature in the document ranking process are currently performed in an off-line manner. This is effected by pre-processing very large logs containing millions of queries submitted by actual users over a time scale of days, weeks or even months. The outcome is score data for the set of documents indexed by the search engine which were selected by users in the past. This paper studies the efficiency of this process in the on-line setting by evaluating a set of strategies for concurrent read/write operations executed on a multi-threaded multi-core architecture. The benefit of efficient on-line processing of user clicks is that it becomes feasible to include user preferences in document ranking in real time as well.

Carolina Bonacic, Carlos Garcia, Mauricio Marin, Manuel Prieto, Francisco Tirado
Performance Evaluation of Improved Web Search Algorithms

In this paper we propose an evaluation method for parallel algorithms that can be used independently of the parallel programming library and architecture employed. We propose to predict the execution costs using a simple but efficient framework that consists of modeling the strategies on a BSP architecture and estimating the real costs using as input real query traces over real or stochastically generated data. In particular, we apply this method to a 2D inverted file index used to resolve web search queries. We present results for OR queries, for which we compare different ranking and caching strategies and show how our framework works. In addition, we present and evaluate intelligent ranking and caching algorithms for AND queries.

Esteban Feuerstein, Veronica Gil-Costa, Michel Mizrahi, Mauricio Marin
Text Classification on a Grid Environment

The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Text mining is the process of extracting interesting information and knowledge from unstructured text. One key difficulty with text classification learning algorithms is that they require many hand-labeled documents to learn accurately. In the text mining pattern discovery phase, the text classification step aims at automatically attributing one or more pre-defined classes to text documents. In this research, we propose to use an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naïve Bayes classifier on a grid environment; this combination is based on a mixture of multinomials, which is commonly used in text classification. Naïve Bayes is a probabilistic approach to inductive learning. It estimates the a posteriori probability that a document belongs to a class given the observed feature values of the document, assuming independence of the features; the class with the maximum a posteriori probability is assigned to the document. Expectation-Maximization (EM) is a class of iterative algorithms for maximum likelihood or maximum a posteriori estimation in problems with unlabeled data. The grid environment is a geographically distributed computation infrastructure composed of a set of heterogeneous resources. The semi-supervised learning classifier is available in the grid as a grid service, expanding the functionality of the Aîuri Portal, a framework for a cooperative academic environment for education and research. Text classification methods are time-consuming, and using the grid infrastructure can bring significant benefits to the learning and classification process.
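
A compact way to see the EM/naïve-Bayes combination is the loop below: a hard-EM sketch with scikit-learn on toy documents. The paper's EM uses soft posterior weights and distributes the work as a grid service rather than running in-process, so this is an illustration of the statistical idea only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unl, iters=5):
    """Hard-EM around a multinomial naive Bayes classifier: fit on the
    labeled documents, then alternate between assigning classes to the
    unlabeled documents (E-step) and refitting on the union (M-step).
    The E-step scores each document independently, which is what makes a
    grid/task-parallel decomposition natural."""
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(iters):
        y_unl = clf.predict(X_unl)                 # E-step
        clf = MultinomialNB().fit(np.vstack([X_lab, X_unl]),
                                  np.concatenate([y_lab, y_unl]))  # M-step
    return clf

docs_lab = ["cheap pills online", "meeting agenda attached"]
docs_unl = ["buy pills cheap now", "agenda for monday meeting"]
vec = CountVectorizer().fit(docs_lab + docs_unl)
X_lab = vec.transform(docs_lab).toarray()
X_unl = vec.transform(docs_unl).toarray()
clf = em_naive_bayes(X_lab, np.array([1, 0]), X_unl)
print(clf.predict(vec.transform(["cheap meeting pills"]).toarray()))
```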

Valeriana G. Roncero, Myrian C. A. Costa, Nelson F. F. Ebecken
On the Vectorization of Engineering Codes Using Multimedia Instructions

After years dominating high performance computing, expensive vector computers were gradually replaced by more affordable solutions, and the use of vectorization techniques once applied to many scientific codes also faded. This paper addresses the vectorization of engineering codes using Streaming SIMD Extensions (SSE), also known as multimedia instructions. This particular kind of vectorization differs from the old vectorization techniques in the sense that it relies on hardware features and instruction sets only present on modern processors. Evolving from Intel MMX, SSE is the fifth generation of an established technology highly suited to compute-intensive tasks like encryption/decryption, audio and video compression, and digital signal processing, but not so well explored for scientific computing, especially among engineering programmers. To demonstrate the benefits of vector/SIMD computing and its implementation on existing scalar algorithms, the authors present this technique applied to an engineering program for the solution of two-dimensional elastostatic problems with the boundary element method. Taking an application from a well-known reference in the area, the paper focuses on the programming techniques and addresses common tasks used in many other codes, like Gauss integration and equation system assembly. Thus, the vectorization guidelines provided here may also be extended to solve many other types of problems, using other numerical methods as well as other multimedia instruction set extensions. Numerical experiments show the effectiveness of the proposed approach.

Manoel Cunha, Alvaro Coutinho, J. C. F. Telles
Numerical Library Reuse in Parallel and Distributed Platforms

In the context of parallel and distributed computation, the currently existing numerical libraries do not allow code reuse. Besides, they are not able to exploit the multi-level parallelism offered by many numerical methods. A few linear algebra numerical libraries make use of an object-oriented approach allowing modularity and extensibility. Nevertheless, those which offer modularity together with sequential and parallel code reuse are almost non-existent. We analyze the shortcomings of existing libraries and propose a design model based on a component approach and the strict separation between computation operations, data definition, and communication control of applications. We then present an implementation of this design using the YML scientific workflow environment jointly with the object-oriented LAKe (Linear Algebra Kernel) library. Numerical experiments on the GRID5000 platform validate our approach and show its efficiency.

Nahid Emad, Olivier Delannoy, Makarem Dandouna
Improving Memory Affinity of Geophysics Applications on NUMA Platforms Using Minas

In numerical scientific High Performance Computing (HPC), Non-Uniform Memory Access (NUMA) platforms are now commonplace. On such platforms, memory affinity management remains an important concern for overcoming the memory wall problem. Prior solutions have presented drawbacks such as machine dependency and a limited set of memory policies. This paper introduces Minas, a framework which provides either explicit or automatic memory affinity management with architecture abstraction for ccNUMAs. We evaluate our solution on two ccNUMA platforms using two geophysics parallel applications. The results show some performance improvements in comparison with other solutions available for Linux.

Christiane Pousa Ribeiro, Márcio Castro, Jean-François Méhaut, Alexandre Carissimi
HPC Environment Management: New Challenges in the Petaflop Era

High Performance Computing (HPC) is becoming much more popular nowadays. Currently, the biggest supercomputers in the world have hundreds of thousands of processors and consequently may have more software and hardware failures. HPC center managers also have to deal with multiple clusters from different vendors, each with its particular architecture. However, since there are not enough HPC experts to manage all the new supercomputers, it is expected that non-experts will be managing those large clusters. In this paper we study the new challenges of managing HPC environments containing different clusters with different sizes and architectures. We review available tools and present LEMMing [1], an easy-to-use open source tool developed to support high performance computing centers. LEMMing integrates machine resources and the available management and monitoring tools on a single point of management.

Jonas Dias, Albino Aveleda
Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems

This work presents a performance evaluation of single-node and subdomain communication schemes available in EdgeCFD, an implicit edge-based coupled fluid flow and transport code for solving large scale problems on modern clusters. A natural convection flow problem is considered to assess performance metrics. Tests, focused on single-node multi-core performance, show that older Intel Xeon processors dramatically suffer when large workloads are imposed on a single node. However, the problem seems to be mitigated in the newest Intel Xeon processor. We also observe that MPI non-blocking point-to-point interface sub-domain communications, although more difficult to implement, are more effective than collective interface sub-domain communications.

Renato N. Elias, Jose J. Camata, Albino Aveleda, Alvaro L. G. A. Coutinho
Applying Process Migration on a BSP-Based LU Decomposition Application

Process migration is a useful mechanism for offering load balancing. In this context, we developed a model called MigBSP that controls process rescheduling in BSP applications. MigBSP is especially pertinent for obtaining performance on this type of application, since such applications are composed of supersteps that always wait for the slowest process. In this paper, we focus on the BSP-based modeling of the widely used LU decomposition algorithm as well as its execution with MigBSP. The use of multiple metrics to decide migrations and adaptations of the rescheduling frequency makes possible gains of up to 19% on our cluster-of-clusters architecture. Finally, our aim is to show that performance can be obtained in LU effortlessly by using novel migration algorithms.

Rodrigo da Rosa Righi, Laércio Lima Pilla, Alexandre Carissimi, Philippe Olivier Alexandre Navaux, Hans-Ulrich Heiss
A P2P Approach to Many Tasks Computing for Scientific Workflows

Scientific Workflow Management Systems (SWfMS) are being used intensively to support large scale in silico experiments. In order to reduce execution time, current SWfMS have exploited workflow parallelization under the arising Many Tasks Computing (MTC) paradigm in homogeneous computing environments, such as multiprocessors, clusters and grids with centralized control. Although successful, this solution no longer applies to heterogeneous computing environments, such as hybrid clouds, which may combine users’ own computing resources with multiple edge clouds. A promising approach to address this challenge is Peer-to-Peer (P2P), which relies on decentralized control to deal with scalability and the dynamic behavior of resources. In this paper, we propose a new P2P approach to apply MTC in scientific workflows. Through the results of simulation experiments, we show that our approach is promising.

Eduardo Ogasawara, Jonas Dias, Daniel Oliveira, Carla Rodrigues, Carlos Pivotto, Rafael Antas, Vanessa Braganholo, Patrick Valduriez, Marta Mattoso
Intelligent Service Trading and Brokering for Distributed Network Services in GridSolve

One of the great benefits of computational grids is to provide access to a wide range of scientific software and a variety of different computational resources. It is then possible to choose from this large variety of available resources the one that solves a given problem, and even to combine these resources in order to obtain the best solution.

Grid service trading (searching for the best combination of software and execution platform according to the user requirements) is thus a crucial issue. Trading relies on the description of available services and computers, on the current state of the grid, and on the user requirements. Given the large number of services that may be deployed over a Grid, this description cannot be reduced to a simple service name.

In this paper, a sophisticated service specification approach similar to algebraic data types is combined with a grid middleware. This leads to a transparent solution for users: they give a mathematical expression to be solved, and the appropriate grid services will be transparently located, composed and executed on their behalf.

Aurélie Hurault, Asim YarKhan
Load Balancing in Dynamic Networks by Bounded Delays Asynchronous Diffusion

Load balancing is a well known problem which has been extensively addressed in parallel algorithmics. However, there remain some contexts in which the existing algorithms cannot be used. One of these contexts is the case of dynamic networks, where the links between the different elements are intermittent. We propose in this paper an efficient algorithm, based on asynchronous diffusion, to perform load balancing in such a context. A convergence theorem is proposed and proved. Finally, experimental results performed in the SimGrid environment confirm the efficiency of our algorithm.
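
The underlying diffusion rule is simple: each node repeatedly exchanges a fraction of its load difference with each neighbor. The sketch below is a synchronous toy version; the paper's contribution is precisely the asynchronous, bounded-delay variant for intermittent links, which this sketch does not capture.

```python
import numpy as np

def diffusion_balance(load, links, alpha=0.25, steps=50):
    """First-order diffusion load balancing: at each step every node sends
    a fraction alpha of the load difference across each of its links, so
    load flows from overloaded to underloaded neighbors until it levels."""
    load = np.asarray(load, dtype=float)
    for _ in range(steps):
        flow = np.zeros_like(load)
        for i, j in links:                 # links currently alive
            d = alpha * (load[i] - load[j])
            flow[i] -= d
            flow[j] += d
        load += flow
    return load

# 4 nodes on a ring, all work initially on node 0 -> converges to 25 each:
print(diffusion_balance([100, 0, 0, 0], [(0, 1), (1, 2), (2, 3), (3, 0)]))
```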

Jacques M. Bahi, Sylvain Contassot-Vivier, Arnaud Giersch
A Computing Resource Discovery Mechanism over a P2P Tree Topology

Peer-to-Peer (P2P) computing, the harnessing of idle compute cycles across the Internet, offers new research challenges in the domain of distributed computing. In this paper, we propose an efficient computing resource discovery mechanism based on a balanced multi-way tree structure capable of supporting both exact and range queries efficiently. Likewise, a rebalancing algorithm is proposed. By means of simulation, we evaluate our proposal in relation to other approaches from the literature. Our results reveal the good performance of our proposals.

Damia Castellà, Hector Blanco, Francesc Giné, Francesc Solsona

Numerical Algorithms

A Parallel Implementation of the Jacobi-Davidson Eigensolver for Unsymmetric Matrices

This paper describes a parallel implementation of the Jacobi-Davidson method to compute eigenpairs of large unsymmetric matrices. Taking advantage of the capabilities of the PETSc library —Portable Extensible Toolkit for Scientific Computation—, we build an efficient and robust code adapted either for traditional serial computation or parallel computing environments. Particular emphasis is given to the description of some implementation details of the so-called correction equation, responsible for the subspace expansion, and crucial in the Jacobi-Davidson algorithm. Numerical results are given and the performance of the code is analyzed in terms of serial and parallel efficiency. The developments achieved in the context of this work will be incorporated in future releases of SLEPc —Scalable Library for Eigenvalue Problem Computations—, thus serving the scientific community and guaranteeing dissemination.

Eloy Romero, Manuel B. Cruz, Jose E. Roman, Paulo B. Vasconcelos
The Impact of Data Distribution in Accuracy and Performance of Parallel Linear Algebra Subroutines

In parallel computing, the data distribution may have a significant impact on application performance and accuracy. These effects can be observed using the parallel matrix-vector multiplication routine from PBLAS with different grid configurations for the data distribution. Matrix-vector multiplication is an especially important operation, since it is widely used in numerical simulation (e.g., iterative solvers for linear systems of equations).

This paper presents a mathematical background of error propagation in elementary operations and proposes benchmarks to show how different grid configurations based on the two-dimensional cyclic block distribution impact the accuracy and performance of parallel matrix-vector operations. The experimental results validate the theoretical findings.

Björn Rocker, Mariana Kolberg, Vincent Heuveline
On a Strategy for Spectral Clustering with Parallel Computation

Spectral Clustering is one of the most important methods based on space dimension reduction used in Pattern Recognition. This method consists of selecting the dominant eigenvectors of a matrix called the affinity matrix in order to define a low-dimensional data space in which data points are easy to cluster. By exploiting properties of Spectral Clustering, we propose a method where we apply the algorithm independently on particular subdomains and gather the results to determine a global partition. Additionally, with a criterion for determining the number of clusters, the domain decomposition strategy for parallel spectral clustering is robust and efficient.
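
The single-domain building block of this strategy is classical normalized spectral clustering, sketched below with NumPy/SciPy/scikit-learn on toy 2-D points. The paper's method runs this per subdomain and merges the partitions, which the sketch does not show.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(points, k, sigma=1.0):
    """Classical spectral clustering: build a Gaussian affinity matrix,
    take the k dominant eigenvectors of the normalized affinity, and
    cluster the rows of the resulting spectral embedding with k-means."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0)
    D = A.sum(axis=1)
    L = A / np.sqrt(D)[:, None] / np.sqrt(D)[None, :]       # normalized affinity
    _, V = eigh(L, subset_by_index=[len(A) - k, len(A) - 1])  # top-k eigenvectors
    V /= np.linalg.norm(V, axis=1, keepdims=True)           # row-normalized embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)

# Two well-separated Gaussian blobs should come back as two clusters:
pts = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6])
print(spectral_clustering(pts, k=2))
```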

Sandrine Mouysset, Joseph Noailles, Daniel Ruiz, Ronan Guivarch
On Techniques to Improve Robustness and Scalability of a Parallel Hybrid Linear Solver

A hybrid linear solver based on the Schur complement method has great potential to be a general purpose solver scalable to tens of thousands of processors. For this, it is imperative to exploit two levels of parallelism; namely, solving independent subdomains in parallel and using multiple processors per subdomain. This hierarchical parallelism can lead to a scalable implementation which maintains numerical stability at the same time. In this framework, load imbalance and excessive communication, which can lead to performance bottlenecks, occur at two levels: within an intra-processor group assigned to the same subdomain and among inter-processor groups assigned to different subdomains. We developed several techniques to address these issues, such as taking advantage of the sparsity of right-hand sides during the triangular solutions with interfaces, load balancing sparse matrix-matrix multiplication to form update matrices, and designing an effective asynchronous point-to-point communication of the update matrices. We present numerical results to demonstrate that, with the help of these techniques, our hybrid solver can efficiently solve large-scale highly-indefinite linear systems on thousands of processors.

Ichitaro Yamazaki, Xiaoye S. Li
Solving Dense Interval Linear Systems with Verified Computing on Multicore Architectures

Automatic result verification is an important tool for reducing the impact of floating-point errors in numerical computation and guaranteeing the mathematical rigor of results. One fundamental problem in Verified Computing is to find an enclosure that surely contains the exact result of a linear system. Many works have optimized Verified Computing algorithms using parallel programming techniques and the message passing paradigm on clusters of computers. However, the High Performance Computing scenario has changed considerably with the emergence of multicore architectures in the past few years. This paper presents an ongoing research project with the purpose of developing a self-verified solver for dense interval linear systems optimized for parallel execution on these new architectures. The current version has obtained up to an 85% reduction in execution time and a speedup of 6.70 when solving a 15,000 × 15,000 interval linear system on an eight-core computer.

Cleber Roberto Milani, Mariana Kolberg, Luiz Gustavo Fernandes
TRACEMIN-Fiedler: A Parallel Algorithm for Computing the Fiedler Vector

The eigenvector corresponding to the second smallest eigenvalue of the Laplacian of a graph, known as the Fiedler vector, has a number of applications in areas that include matrix reordering, graph partitioning, protein analysis, data mining, machine learning, and web search. The computation of the Fiedler vector has been regarded as an expensive process as it involves solving a large eigenvalue problem. We present a novel and efficient parallel algorithm for computing the Fiedler vector of large graphs based on the Trace Minimization algorithm. We compare the parallel performance of our method with a multilevel scheme, designed specifically for computing the Fiedler vector, which is implemented in routine MC73_FIEDLER of the Harwell Subroutine Library (HSL).
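
At small scale the Fiedler vector can be computed directly with a sparse eigensolver, as in the sketch below; TRACEMIN-Fiedler exists precisely because this eigenvalue problem becomes expensive for large graphs. The demo uses a path graph, whose Fiedler vector changes sign at the natural bisection (up to a global sign flip).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def fiedler_vector(adjacency):
    """Fiedler vector: the eigenvector of the graph Laplacian associated
    with its second-smallest eigenvalue. A direct small-end eigensolve
    works at this scale; TRACEMIN-Fiedler replaces it with a trace
    minimization scheme that scales to much larger graphs."""
    L = laplacian(csr_matrix(adjacency)).astype(float)
    vals, vecs = eigsh(L, k=2, which="SM")   # two smallest eigenpairs
    return vecs[:, np.argsort(vals)[1]]

# Path graph on 6 vertices: the sign of the Fiedler vector splits it in half.
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
print(np.sign(fiedler_vector(A)))
```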

Murat Manguoglu, Eric Cox, Faisal Saied, Ahmed Sameh
Applying Parallel Design Techniques to Template Matching with GPUs

Designing algorithms for data parallelism can create significant gains in performance on SIMD architectures. The performance of General Purpose GPUs can also benefit from careful analysis of memory usage and data flow due to their large throughput and system memory bottlenecks. In this paper we present an algorithm for template matching that is designed from the beginning for the GPU architecture and achieves greater than an order of magnitude speedup over traditional algorithms designed for the CPU and reimplemented on the GPU. This shows that it is not only desirable to adapt existing algorithms to run on GPUs, but also that future algorithms should be designed with the GPU architecture in mind.

Robert Finis Anderson, J. Steven Kirtzic, Ovidiu Daescu
Backmatter
Metadata
Title: High Performance Computing for Computational Science – VECPAR 2010
Editors: José M. Laginha M. Palma, Michel Daydé, Osni Marques, João Correia Lopes
Copyright Year: 2011
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-19328-6
Print ISBN: 978-3-642-19327-9
DOI: https://doi.org/10.1007/978-3-642-19328-6
