
2010 | Book

Facing the Multicore-Challenge

Aspects of New Paradigms and Technologies in Parallel Computing

Edited by: Rainer Keller, David Kramer, Jan-Philipp Weiss

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Talks

Analyzing Massive Social Networks Using Multicore and Multithreaded Architectures
Abstract
Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. Unlike traditional applications in computational science and engineering, solving these problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for further research on scalable algorithms and frameworks for solving these problems on high-performance computers, and the need for improved models that also capture the noise and bias inherent in the torrential data streams. The explosion of real-world graph data poses a substantial challenge: How can we analyze constantly changing graphs with billions of vertices? Our approach leverages the Cray XMT’s fine-grained parallelism and flat memory model to scale to massive graphs. On the Cray XMT, our static graph characterization package GraphCT summarizes such massive graphs, and our ongoing STINGER streaming work updates clustering coefficients on massive graphs at a rate of tens of thousands of updates per second.
David Bader
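As a back-of-the-envelope illustration of the streaming idea mentioned above (not the actual GraphCT or STINGER code, whose data structures are far more elaborate), the following C++ sketch updates per-vertex triangle counts incrementally when an edge is inserted; the local clustering coefficient then follows directly from the triangle count and the vertex degree. All type and function names are made up for the example.

    #include <cstdio>
    #include <unordered_map>
    #include <unordered_set>

    // Hypothetical undirected graph with per-vertex triangle counters.
    struct Graph {
        std::unordered_map<int, std::unordered_set<int>> adj;   // adjacency sets
        std::unordered_map<int, long>                    tri;   // triangles touching each vertex

        // Insert edge (u, v) and update triangle counts incrementally:
        // every common neighbor w of u and v closes one new triangle {u, v, w}.
        void insert_edge(int u, int v) {
            if (u == v || adj[u].count(v)) return;           // ignore loops / duplicates
            auto& nu = adj[u];
            auto& nv = adj[v];                               // ensure both sets exist up front
            for (int w : nu)
                if (nv.count(w)) { ++tri[u]; ++tri[v]; ++tri[w]; }
            nu.insert(v);
            nv.insert(u);
        }

        // Local clustering coefficient: closed wedges / possible wedges at v.
        double clustering(int v) const {
            auto it = adj.find(v);
            if (it == adj.end() || it->second.size() < 2) return 0.0;
            double deg = static_cast<double>(it->second.size());
            auto t = tri.find(v);
            double triangles = (t == tri.end()) ? 0.0 : static_cast<double>(t->second);
            return 2.0 * triangles / (deg * (deg - 1.0));
        }
    };

    int main() {
        Graph g;
        g.insert_edge(0, 1);
        g.insert_edge(1, 2);
        g.insert_edge(0, 2);                                 // closes the triangle {0, 1, 2}
        std::printf("C(0) = %.2f\n", g.clustering(0));       // prints 1.00
        return 0;
    }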
MareIncognito: A Perspective towards Exascale
Abstract
MareIncognito is a cooperative project between IBM and the Barcelona Supercomputing Center (BSC) targeting the design of relevant technologies on the way towards exascale. The initial challenge of the project was to study the potential design of a system based on a next generation of Cell processors. Even so, the approaches pursued are general purpose, applicable to a wide range of accelerators and homogeneous multicores, and holistically address a large number of components relevant in the design of such systems.
The programming model is probably the most important issue when facing the multicore era. We need to offer support for asynchronous dataflow execution and decouple the way the source code looks from the way the program is executed and its operations (tasks) are scheduled. In order to ensure a reasonable migration path for programmers, the execution model should be exposed to them through a syntactic and semantic structure that is not very far from current practice. We are developing the StarSs programming model, which we think addresses some of the challenges of targeting future heterogeneous / hierarchical multicore systems at the node level. It also integrates nicely into coarser-level programming models such as MPI and, more importantly, does so in ways that propagate the asynchronous dataflow execution to the whole application. We are also investigating how some of the features of StarSs can be integrated into OpenMP.
At the architecture level, the interconnect and the memory subsystem are two key components. We are studying in detail the behavior of current interconnect systems and in particular contention issues. The question is how to make better use of the raw bandwidth that we already have in our systems and can expect to grow in the future. A better understanding of the interactions between the raw transport mechanisms, the communication protocols and the synchronization behavior of applications should help avoid the exploding need for bandwidth that is often claimed. The asynchronous execution model that StarSs offers can help in this direction, as a very high overlap between communication and computation should be possible. A similar effect of reducing the sensitivity to latency, as well as the actual off-chip bandwidth required, should be supported by the StarSs model.
The talk will present how we target the above issues, with special detail on the StarSs programming model and the project’s underlying idea that tight cooperation between architecture, runtime, programming model, resource management and application is needed in order to achieve exascale performance in the future.
Jesus Labarta
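StarSs itself is expressed through compiler pragmas and is not reproduced here; purely to illustrate the asynchronous dataflow idea the abstract describes – tasks execute when their inputs become available, decoupled from the textual order of the program – the following C++ sketch chains hypothetical tasks through futures.

    #include <future>
    #include <iostream>
    #include <vector>

    // Illustrative "tasks": each consumes the results of its predecessors.
    static std::vector<double> produce(std::size_t n)       { return std::vector<double>(n, 1.0); }
    static std::vector<double> scale(std::vector<double> v) { for (double& x : v) x *= 2.0; return v; }
    static double              reduce(const std::vector<double>& v) {
        double s = 0.0; for (double x : v) s += x; return s;
    }

    int main() {
        // The dependencies are expressed through the futures the tasks wait on,
        // not through the order in which the calls appear in the source.
        std::future<std::vector<double>> a = std::async(std::launch::async, produce, 1000);
        std::future<std::vector<double>> b = std::async(std::launch::async,
            [&a] { return scale(a.get()); });         // runs as soon as 'a' is ready
        std::future<double> c = std::async(std::launch::async,
            [&b] { return reduce(b.get()); });        // runs as soon as 'b' is ready

        std::cout << "result = " << c.get() << std::endl;   // prints 2000
        return 0;
    }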
The Natural Parallelism
Abstract
With the advent of multi-core processors, a new, unwanted way of parallel programming is required, which is seen as a major challenge. This talk will argue in exactly the opposite direction: that our accustomed programming paradigm has been unwanted for years and that parallel processing is the natural scheduling and execution model on all levels of hardware.
Sequential processing is a long-outdated, illusionary software concept, and we will expose its artificiality and absurdity with appropriate analogies from everyday life. Multi-core appears as a challenge only when looking at it from the crooked illusion of sequential processing. There are other important aspects such as specialization or data movement, and admittedly large-scale parallelism also has some issues, which we will discuss. But the main problem is changing our mindset and helping others to do so with better education, so that parallelism comes to us as a friend and not as an enemy.
Robert Strzodka

Computer Architecture and Parallel Programming

RapidMind: Portability across Architectures and Its Limitations
Abstract
Recently, hybrid architectures using accelerators like GPGPUs or the Cell processor have gained much interest in the HPC community. The “RapidMind Multi-Core Development Platform” is a programming environment that allows generating code which is able to run seamlessly on hardware accelerators like GPUs or the Cell processor and on multi-core CPUs from both AMD and Intel. This paper describes the ports of three mathematical kernels to RapidMind, chosen as synthetic benchmarks and representatives of scientific codes. The performance of these kernels has been measured on various RapidMind backends (cuda, cell and x86) and compared to other hardware-specific implementations (using CUDA, the Cell SDK and Intel MKL). The results give an insight into the degree of portability of RapidMind code and its performance across different architectures.
Iris Christadler, Volker Weinberg
A Majority-Based Control Scheme for Way-Adaptable Caches
Abstract
Considering the trade-off between performance and power consumption has become increasingly important in microprocessor design. One promising approach for this purpose is to employ way-adaptable caches, which adjust the number of cache ways available to a running application based on an assessment of its working set size. However, the working set size estimated from cache access locality over a very short period may differ from the overall long-term trend. Such an assessment result can cause excessive adaptation, allocating too many cache ways to a thread and, as a result, deteriorating the energy efficiency of way-adaptable caches. To avoid this excessive adaptation, this paper proposes a majority-based control scheme in which the number of activated ways is adjusted based on majority voting over the locality assessment results of several short sampling periods. By using majority voting, the proposed scheme makes way-adaptable caches less sensitive to the results of periods that include exceptional behavior. The experimental results indicate that the proposed scheme can reduce the number of activated ways by up to 37% and on average by 9.4% while maintaining performance compared with a conventional scheme, resulting in reduced power consumption.
Masayuki Sato, Ryusuke Egawa, Hiroyuki Takizawa, Hiroaki Kobayashi
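A minimal software sketch of the voting idea, assuming each short sampling period yields a suggested way count (the paper describes a hardware mechanism, not this code): the controller follows whichever value the majority of recent samples agrees on, so a single atypical period cannot trigger excessive adaptation.

    #include <cstdio>
    #include <deque>
    #include <map>

    // Hypothetical controller: keeps the last N per-period suggestions and
    // activates the way count that most of them agree on.
    class MajorityWayController {
    public:
        explicit MajorityWayController(std::size_t window) : window_(window) {}

        int update(int suggested_ways) {
            samples_.push_back(suggested_ways);
            if (samples_.size() > window_) samples_.pop_front();

            std::map<int, int> votes;
            for (int w : samples_) ++votes[w];

            int best = suggested_ways, best_votes = 0;
            for (const auto& kv : votes)
                if (kv.second > best_votes) { best = kv.first; best_votes = kv.second; }
            return best;                               // way count to activate
        }

    private:
        std::size_t window_;
        std::deque<int> samples_;
    };

    int main() {
        MajorityWayController ctrl(5);
        // Four "normal" periods suggest 2 ways, one outlier suggests 8:
        for (int s : {2, 2, 8, 2, 2}) std::printf("activate %d ways\n", ctrl.update(s));
        return 0;                                      // the outlier never wins the vote
    }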
Improved Scalability by Using Hardware-Aware Thread Affinities
Abstract
The complexity of efficient thread management rises steadily with the number of processor cores and with heterogeneities in the design of system architectures, e.g., the topologies of execution units and the memory architecture. In this paper, we show that combining information about the system topology with hardware-aware thread management is worthwhile. We present such a hardware-aware approach that uses thread affinity to automatically steer the mapping of threads to cores and experimentally analyze its performance. Our experiments show that we can achieve significantly better scalability and runtime stability than the ordinary dispatching of threads provided by the operating system.
Sven Mallach, Carsten Gutwenger
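As a hedged illustration of explicit thread pinning (the paper’s thread management is topology-aware and more elaborate), the following Linux-specific C++ sketch pins worker threads to cores with pthread_setaffinity_np; the naive round-robin placement is an assumption made only for the example.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                  // for pthread_setaffinity_np / CPU_SET on glibc
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Pin the given thread to one logical core (Linux/glibc only).
    static void pin_to_core(std::thread& t, unsigned core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
    }

    int main() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([i] {
                std::printf("worker %u running on core %d\n", i, sched_getcpu());
            });
            // Naive placement: thread i on core i. A hardware-aware scheme would
            // consult the cache/NUMA topology instead of this simple mapping.
            pin_to_core(workers.back(), i % n);
        }
        for (auto& t : workers) t.join();
        return 0;
    }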
Thread Creation for Self-aware Parallel Systems
Abstract
Upcoming computer architectures will be built out of hundreds of heterogeneous components, posing an obstacle to efficient central management of system resources. Thus, distributed management schemes, such as Self-aware Memory (SaM), gain importance. The goal of this work is to establish a POSIX-like thread model in a distributed system, to enable a smooth upgrade path for legacy software. The paper outlines the design and implementation of enhancements to the SaM protocol, studies the overhead of thread creation and presents first performance numbers.
Martin Schindewolf, Oliver Mattes, Wolfgang Karl
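The SaM protocol itself is not sketched here; as a generic illustration of how thread-creation overhead can be measured on a conventional system, the following C++ snippet times the creation and joining of short-lived threads.

    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        constexpr int kThreads = 1000;                // sample size (arbitrary)
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kThreads; ++i) {
            std::thread t([] { /* empty body: we only measure creation/join cost */ });
            t.join();
        }
        auto end = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(end - start).count();
        std::printf("average create+join overhead: %.2f us per thread\n", us / kThreads);
        return 0;
    }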

Applications on Multicore

G-Means Improved for Cell BE Environment
Abstract
The performance gain obtained by adapting the G-means algorithm to a Cell BE environment using the CellSs framework is described. G-means is a clustering algorithm based on k-means, used to find the number of Gaussian distributions and their centers within a multi-dimensional dataset. It is normally used for data mining applications, and its execution can be divided into six steps. This paper analyzes each step to determine which of them can be improved. In the implementation, the algorithm was modified to use the specific SIMD instructions of the Cell processor and to introduce parallel computing using the CellSs framework to handle the SPU tasks. The hardware used was an IBM BladeCenter QS22 containing two PowerXCell processors. The results show that the improved algorithm executes 60% faster than the original code.
Aislan G. Foina, Rosa M. Badia, Javier Ramirez-Fernandez
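G-means repeatedly runs k-means and tests whether each resulting cluster looks Gaussian; as a plain scalar reference for the assignment step that such codes vectorize with SIMD instructions (this is not the authors’ Cell implementation), consider the following C++ sketch.

    #include <cstdio>
    #include <limits>
    #include <vector>

    // One k-means assignment step: map every point to its nearest center.
    // Points and centers are stored as flat arrays of `dim` coordinates each.
    static std::vector<int> assign(const std::vector<float>& pts,
                                   const std::vector<float>& centers,
                                   std::size_t dim) {
        const std::size_t n = pts.size() / dim;
        const std::size_t k = centers.size() / dim;
        std::vector<int> label(n, 0);
        for (std::size_t p = 0; p < n; ++p) {
            float best = std::numeric_limits<float>::max();
            for (std::size_t c = 0; c < k; ++c) {
                float d = 0.0f;
                for (std::size_t j = 0; j < dim; ++j) {                   // this inner loop is what
                    float diff = pts[p * dim + j] - centers[c * dim + j]; // SIMD units accelerate
                    d += diff * diff;
                }
                if (d < best) { best = d; label[p] = static_cast<int>(c); }
            }
        }
        return label;
    }

    int main() {
        std::vector<float> pts     = {0.f, 0.f,  10.f, 10.f,  0.5f, 0.f};
        std::vector<float> centers = {0.f, 0.f,  10.f, 10.f};
        for (int l : assign(pts, centers, 2)) std::printf("%d ", l);      // prints: 0 1 0
        std::printf("\n");
        return 0;
    }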
Parallel 3D Multigrid Methods on the STI Cell BE Architecture
Abstract
The STI Cell Broadband Engine (BE) is a highly capable heterogeneous multicore processor with large bandwidth and computing power, perfectly suited for numerical simulation. However, all performance benefits come at the price of productivity, since more responsibility is put on the programmer. In particular, programming with the IBM Cell SDK requires not only a parallel decomposition of the problem but also managing all data transfers and organizing all computations in a performance-beneficial manner. While raising the complexity of program development, this approach enables efficient utilization of the available resources.
In the present work we investigate the potential and the performance behavior of the Cell’s parallel cores for a resource-demanding and bandwidth-bound multigrid solver for a three-dimensional Poisson problem. The chosen multigrid method, based on parallel Gauß-Seidel and Jacobi smoothers, combines mathematical optimality with a high degree of inherent parallelism. We investigate dedicated code optimization strategies on the Cell platform and evaluate the associated performance benefits in a comprehensive analysis. Our results show that the Cell BE platform can give tremendous benefits for numerical simulation based on well-structured data. However, it is inescapable that isolated, vendor-specific, but performance-optimal programming approaches need to be replaced by portable and generic concepts like OpenCL – maybe at the price of some performance loss.
Fabian Oboril, Jan-Philipp Weiss, Vincent Heuveline
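Not the authors’ Cell-optimized kernel, but for orientation: a Jacobi smoother for the 3D Poisson problem on a uniform grid, as used inside such a multigrid cycle, looks roughly like the following C++ sketch (7-point stencil, Dirichlet boundaries and a damping factor are assumed).

    #include <cstddef>
    #include <vector>

    // One weighted-Jacobi sweep for -Laplace(u) = f on an n^3 grid with
    // spacing h; boundary values are kept fixed (illustrative sketch only).
    static void jacobi_sweep(std::vector<double>& u, const std::vector<double>& f,
                             std::size_t n, double h, double omega = 0.8) {
        std::vector<double> u_new(u);
        auto idx = [n](std::size_t i, std::size_t j, std::size_t k) {
            return (i * n + j) * n + k;
        };
        for (std::size_t i = 1; i + 1 < n; ++i)
            for (std::size_t j = 1; j + 1 < n; ++j)
                for (std::size_t k = 1; k + 1 < n; ++k) {
                    double nb = u[idx(i-1,j,k)] + u[idx(i+1,j,k)]
                              + u[idx(i,j-1,k)] + u[idx(i,j+1,k)]
                              + u[idx(i,j,k-1)] + u[idx(i,j,k+1)];
                    double jac = (nb + h * h * f[idx(i,j,k)]) / 6.0;
                    u_new[idx(i,j,k)] = (1.0 - omega) * u[idx(i,j,k)] + omega * jac;
                }
        u.swap(u_new);
    }

    int main() {
        const std::size_t n = 16;
        std::vector<double> u(n * n * n, 0.0), f(n * n * n, 1.0);
        for (int it = 0; it < 10; ++it) jacobi_sweep(u, f, n, 1.0 / (n - 1));
        return 0;   // in a multigrid cycle, coarse-grid correction would follow
    }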
Applying Classic Feedback Control for Enhancing the Fault-Tolerance of Parallel Pipeline Workflows on Multi-core Systems
Abstract
Nuclear disaster early warning systems are based on simulations of the atmospheric dispersion of the radioactive pollutants that may have been released into the atmosphere as a result of an accident at a nuclear power plant. Currently, the calculation is performed by a chain of nine sequential FORTRAN and C/C++ simulation codes. The new requirements for the example early warning system we focus on in this paper include a maximum response time of 120 seconds, whereas computing a single simulation step currently exceeds this limit. To improve performance, we propose a pipeline parallelization of the simulation workflow on a multi-core system. This leads to a 4.5x speedup with respect to the sequential execution time on a dual quad-core machine. The scheduling problem which arises is that of maximizing the number of iterations of the dispersion calculation algorithm while not exceeding the maximum response time limit. In the context of our example application, a static scheduling strategy (e.g., a fixed rate of firing iterations) proves to be inappropriate because it is not able to tolerate faults that may occur during regular use (e.g., CPU failures, software errors, heavy load bursts). In this paper we show how a simple PI controller is able to keep the realized response time of the workflow around a desired value in different failure and heavy-load scenarios by automatically reducing the throughput of the system when necessary, thus improving the system’s fault tolerance.
Tudor B. Ionescu, Eckart Laurien, Walter Scheuermann
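A minimal sketch of a discrete PI controller of the kind described, with placeholder gains, limits and a toy plant model that are not taken from the paper: the controller adjusts the iteration budget so that the measured response time tracks the 120-second target.

    #include <algorithm>
    #include <cstdio>

    // Discrete PI controller: drives the measured response time towards the
    // set-point by adjusting the number of simulation iterations per cycle.
    class PIController {
    public:
        PIController(double kp, double ki, double setpoint)
            : kp_(kp), ki_(ki), setpoint_(setpoint) {}

        // measured: last observed response time [s]; returns new iteration budget.
        int update(double measured, double dt) {
            double error = setpoint_ - measured;        // positive => we have slack
            integral_ += error * dt;
            double u = kp_ * error + ki_ * integral_;
            return std::clamp(static_cast<int>(base_ + u), 1, 100);
        }

    private:
        double kp_, ki_, setpoint_;
        double integral_ = 0.0;
        double base_ = 50.0;                            // nominal iteration count
    };

    int main() {
        PIController pi(0.5, 0.1, 120.0);               // 120 s response-time target
        double response = 140.0;                        // pretend a load burst slowed us down
        for (int step = 0; step < 5; ++step) {
            int iters = pi.update(response, 1.0);
            std::printf("step %d: response %.1f s -> run %d iterations\n",
                        step, response, iters);
            response = 100.0 + 0.5 * iters;             // toy plant: more iterations, longer response
        }
        return 0;
    }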
Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors
Abstract
We present an efficient method for the simulation of laminar fluid flows with free surfaces, including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We show that our code scales well on an MPI-based cluster; that an eightfold speedup can be achieved using modern GPUs compared to multithreaded CPU code; and, finally, that it is possible to solve fluid-structure interaction scenarios with high resolution at interactive rates.
Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Stefan Turek
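For orientation only (the chapter’s solver handles the shallow-water equilibrium, boundaries and moving bodies), a Lattice-Boltzmann code alternates a local collision step with a streaming step like the one sketched below, in which every distribution value moves one cell along its lattice direction; a D2Q9 lattice and periodic boundaries are assumed.

    #include <vector>

    // D2Q9 lattice directions.
    static const int cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static const int cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    // Streaming step: every distribution f_i moves one cell along (cx[i], cy[i]).
    // f is indexed as f[(y * nx + x) * 9 + i]; periodic boundaries for simplicity.
    static void stream(const std::vector<double>& f, std::vector<double>& f_new,
                       int nx, int ny) {
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                for (int i = 0; i < 9; ++i) {
                    int xs = (x - cx[i] + nx) % nx;      // pull from the upstream cell
                    int ys = (y - cy[i] + ny) % ny;
                    f_new[(y * nx + x) * 9 + i] = f[(ys * nx + xs) * 9 + i];
                }
    }

    int main() {
        const int nx = 64, ny = 64;
        std::vector<double> f(nx * ny * 9, 1.0), f_new(f);
        stream(f, f_new, nx, ny);    // collision, boundary and source terms omitted
        return 0;
    }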
FPGA vs. Multi-core CPUs vs. GPUs: Hands-On Experience with a Sorting Application
Abstract
Currently there are several interesting alternatives for low-cost high-performance computing. We report here our experiences with an N-gram extraction and sorting problem which originated in the design of a real-time network intrusion detection system. We have considered FPGAs, multi-core CPUs in symmetric multi-CPU machines, and GPUs, and have created implementations for each of these platforms. After carefully comparing the advantages and disadvantages of each, we have decided to go forward with the implementation written for multi-core CPUs. Arguments for and against each platform, based on our hands-on experience, are presented; we intend them to be useful in selecting hardware acceleration solutions for new projects.
Cristian Grozea, Zorana Bankovic, Pavel Laskov
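As a sequential baseline for the problem described (the chapter’s multi-core implementation is parallel and heavily tuned), byte-level N-gram extraction followed by sorting can be written in a few lines of C++.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Extract all overlapping n-grams of a byte string and sort them.
    static std::vector<std::string> sorted_ngrams(const std::string& data, std::size_t n) {
        std::vector<std::string> grams;
        if (data.size() >= n)
            for (std::size_t i = 0; i + n <= data.size(); ++i)
                grams.push_back(data.substr(i, n));
        std::sort(grams.begin(), grams.end());   // the sorting step the paper accelerates
        return grams;
    }

    int main() {
        for (const std::string& g : sorted_ngrams("GATTACA", 3))
            std::printf("%s\n", g.c_str());      // prints the 3-grams in sorted order
        return 0;
    }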

GPGPU Computing

Considering GPGPU for HPC Centers: Is It Worth the Effort?
Abstract
In contrast to just a few years ago, the answer to the question “What system should we buy next to best assist our users?” has become a lot more complicated for the operators of an HPC center today. In addition to multicore architectures, powerful accelerator systems have emerged, and the future looks heterogeneous. In this paper, we apply this question to a specific accelerator and its programming environment that have become increasingly popular: systems using graphics processors from NVidia, programmed with CUDA. Using three benchmarks that cover the main computational needs of scientific codes, we compare performance results with those obtained on systems with modern x86 multicore processors. Taking the experience from optimizing and running the codes into account, we discuss whether the presented performance numbers really apply to computing center users running codes in their everyday tasks.
Hans Hacker, Carsten Trinitis, Josef Weidendorfer, Matthias Brehm
Real-Time Image Segmentation on a GPU
Abstract
Efficient segmentation of color images is important for many applications in computer vision. Non-parametric solutions are required in situations where little or no prior knowledge about the data is available. In this paper, we present a novel parallel image segmentation algorithm which segments images in real time in a non-parametric way. The algorithm finds the equilibrium states of a Potts model in the superparamagnetic phase of the system. Our method maps perfectly onto the Graphics Processing Unit (GPU) architecture and has been implemented using the NVIDIA Compute Unified Device Architecture (CUDA) framework. For images of 256 × 320 pixels we obtained a frame rate of 30 Hz, which demonstrates the applicability of the algorithm to real-time video-processing tasks.
Alexey Abramov, Tomas Kulvicius, Florentin Wörgötter, Babette Dellen
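A rough CPU-side sketch of the underlying model (the paper’s GPU method finds equilibrium states in parallel and derives couplings from color similarity, which is simplified away here): one Metropolis sweep of a q-state Potts model on an image grid with a uniform coupling J.

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // One Metropolis sweep of a q-state Potts model on an nx*ny grid.
    // E = -J * sum over neighbor pairs of delta(s_i, s_j); uniform J assumed here,
    // whereas a segmentation algorithm derives couplings from pixel similarity.
    static void metropolis_sweep(std::vector<int>& s, int nx, int ny,
                                 int q, double J, double T) {
        auto at = [&](int x, int y) -> int& {
            return s[((y + ny) % ny) * nx + (x + nx) % nx];   // periodic boundaries
        };
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                int old_state = at(x, y);
                int new_state = std::rand() % q;
                int same_old = 0, same_new = 0;
                const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
                for (int k = 0; k < 4; ++k) {
                    int nb = at(x + dx[k], y + dy[k]);
                    same_old += (nb == old_state);
                    same_new += (nb == new_state);
                }
                double dE = -J * (same_new - same_old);       // energy change of the flip
                if (dE <= 0.0 || std::rand() / (double)RAND_MAX < std::exp(-dE / T))
                    at(x, y) = new_state;
            }
    }

    int main() {
        const int nx = 32, ny = 32, q = 8;
        std::vector<int> spins(nx * ny);
        for (int i = 0; i < nx * ny; ++i) spins[i] = std::rand() % q;
        for (int sweep = 0; sweep < 100; ++sweep)
            metropolis_sweep(spins, nx, ny, q, 1.0, 0.7);     // temperature is a placeholder
        return 0;
    }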
Parallel Volume Rendering Implementation on Graphics Cards Using CUDA
Abstract
The ever-increasing amounts of volume data require high-end parallel visualization methods to process this data interactively. To meet these demands, programming on graphics cards offers an effective and fast approach to volume rendering due to the parallel architecture of today’s graphics cards.
In this paper, we introduce a parallel volume ray casting method that provides interactive visualization. Since the data can be processed independently, we managed to speed up the computation on the GPU by a peak factor of more than 400 compared to our sequential CPU version. The parallelization is realized using the CUDA application programming interface.
Jens Fangerau, Susanne Krömker
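Independent of CUDA, the core of volume ray casting is marching along a ray and compositing samples front to back; a minimal scalar C++ version of that inner loop is sketched below, with a placeholder transfer function and nearest-neighbor sampling (on the GPU, one thread would process one pixel’s ray).

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // March a single ray through a scalar n^3 volume, compositing samples
    // front to back; stops early once the ray is nearly opaque.
    static double cast_ray(const std::vector<float>& vol, int n,
                           double ox, double oy, double oz,          // ray origin (voxel units)
                           double dx, double dy, double dz,          // normalized direction
                           double step = 0.5) {
        double color = 0.0, alpha = 0.0;
        for (double t = 0.0; alpha < 0.99; t += step) {
            int x = (int)(ox + t * dx), y = (int)(oy + t * dy), z = (int)(oz + t * dz);
            if (x < 0 || y < 0 || z < 0 || x >= n || y >= n || z >= n) break;  // left the volume
            float density = vol[(z * n + y) * n + x];                // nearest-neighbor sample
            double a = std::min(1.0, density * 0.05);                // placeholder transfer function
            double c = density;                                      // grey-scale "color"
            color += (1.0 - alpha) * a * c;                          // front-to-back compositing
            alpha += (1.0 - alpha) * a;
        }
        return color;
    }

    int main() {
        const int n = 64;
        std::vector<float> vol(n * n * n, 0.1f);
        double c = cast_ray(vol, n, 0.0, 32.0, 32.0, 1.0, 0.0, 0.0);
        std::printf("composited intensity: %.3f\n", c);
        return 0;
    }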
Backmatter
Metadata
Title
Facing the Multicore-Challenge
Edited by
Rainer Keller
David Kramer
Jan-Philipp Weiss
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-16233-6
Print ISBN
978-3-642-16232-9
DOI
https://doi.org/10.1007/978-3-642-16233-6