
2005 | Book

Languages and Compilers for Parallel Computing

15th Workshop, LCPC 2002, College Park, MD, USA, July 25-27, 2002. Revised Papers

Edited by: Bill Pugh, Chau-Wen Tseng

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this Book

The 15th Workshop on Languages and Compilers for Parallel Computing was held in July 2002 at the University of Maryland, College Park. It was jointly sponsored by the Department of Computer Science at the University of Maryland and the University of Maryland Institute for Advanced Computer Studies (UMIACS). LCPC 2002 brought together over 60 researchers from academia and research institutions from many countries.

The program of 26 papers was selected from 32 submissions. Each paper was reviewed by at least three Program Committee members and sometimes by additional reviewers. Prior to the workshop, revised versions of accepted papers were informally published on the workshop's website and in a paper proceedings that was distributed at the meeting. This year, the workshop was organized into sessions of papers on related topics, and each session consisted of two to three 30-minute presentations. Based on feedback from the workshop, the papers were revised and submitted for inclusion in the formal proceedings published in this volume. Two papers were presented at the workshop but later withdrawn from the final proceedings by their authors.

We were very lucky to have Bill Carlson from the Department of Defense give the LCPC 2002 keynote speech on "UPC: A C Language for Shared Memory Parallel Programming." Bill gave an excellent overview of the features and programming model of the UPC parallel programming language.

Table of Contents

Frontmatter
Memory-Constrained Communication Minimization for a Class of Array Computations
Abstract
The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions involving large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
Daniel Cociorva, Gerald Baumgartner, Chi-Chung Lam, P. Sadayappan, J. Ramanujam
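The abstract above turns on the interplay between loop fusion and intermediate-array memory. As a minimal illustration (not the paper's algorithm, with a made-up producer function), fusing a producer loop into its consumer collapses a large intermediate array into a scalar:

```c
#include <stddef.h>

#define N 4
#define M 3

/* Unfused: the intermediate B must be fully materialized (N*M doubles). */
double reduce_unfused(double A[N][M], double B[N][M]) {
    double total = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            B[i][j] = 2.0 * A[i][j];   /* producer: builds the intermediate */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            total += B[i][j];          /* consumer: reduces it away */
    return total;
}

/* Fused: each intermediate element is consumed as soon as it is produced,
   so the per-processor memory footprint stays constant no matter how
   large N and M grow. */
double reduce_fused(double A[N][M]) {
    double total = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            total += 2.0 * A[i][j];
    return total;
}
```

In the paper's setting the intermediates are tensors produced by contractions, and fusion choices interact with how the arrays are partitioned across processors.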
Forward Communication Only Placements and Their Use for Parallel Program Construction
Abstract
The context of this paper is automatic parallelization by the space-time mapping method. One key issue in that approach is to adjust the granularity of the derived parallelism. For that purpose, we use tiling in the space and time dimensions. While space tiling is always legal, there are constraints on the possibility of time tiling, unless the placement is such that communications always go in the same direction (forward communications only). We derive an algorithm that automatically constructs an FCO placement – if it exists. We show that the method is applicable to many familiar kernels and that it gives satisfactory speedups.
Martin Griebl, Paul Feautrier, Armin Größlinger
Hierarchical Parallelism Control for Multigrain Parallel Processing
Abstract
To improve the effective performance and usability of shared memory multiprocessor systems, a multigrain compilation scheme is important: one that hierarchically exploits coarse grain parallelism among loops, subroutines, and basic blocks; conventional loop parallelism; and near fine grain parallelism among statements inside a basic block. In order to efficiently use the hierarchical parallelism of each nest level, or layer, in multigrain parallel processing, it is required to determine how many processors or groups of processors should be assigned to each layer, according to the parallelism of the layer. This paper proposes an automatic hierarchical parallelism control scheme to assign a suitable number of processors to each layer so that the parallelism of each hierarchy can be used efficiently. Performance of the proposed scheme is evaluated on an IBM RS6000 SMP server with 8 processors using 8 programs from SPEC95FP.
Motoki Obata, Jun Shirako, Hiroki Kaminaga, Kazuhisa Ishizaka, Hironori Kasahara
Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors
Abstract
Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts also indicate that architecture, compiler, and software participation can help reduce switching activities (also known as dynamic power) on microprocessors. This raises interest in employing architecture and compiler efforts to reduce leakage power (also known as static power) on microprocessors. In this paper, we investigate compiler analysis techniques for reducing leakage power. The architecture model in our design is a system with an instruction set that supports the control of power gating at the component level. Our compiler provides an analysis framework for utilizing these instructions to reduce leakage power. We present a data flow analysis framework to estimate the component activities at fixed points of programs, taking the pipelines of the architecture into consideration. We also give the equation with which the compiler decides whether employing the power gating instructions on given program blocks will benefit total energy reduction. As the duration of power gating on components in given program routines depends on program branches, we propose a set of scheduling policies, including the Basic_Blk_Sched, MIN_Path_Sched, and AVG_Path_Sched mechanisms, and evaluate their effectiveness. Our experiments are done by incorporating our compiler analysis and scheduling policies into the SUIF compiler tools [32] and by simulating the energy consumption with the Wattch toolkit [6]. Experimental results show our mechanisms are effective in reducing leakage power on microprocessors.
Yi-Ping You, Chingren Lee, Jenq Kuen Lee
Automatic Detection of Saturation and Clipping Idioms
Abstract
The MMX™ technology and SSE/SSE2 (Streaming SIMD Extensions) introduced a variety of SIMD instructions that can exploit data parallelism in numerical and multimedia applications. In particular, new saturation and clipping instructions can boost the performance of applications that make extensive use of such operations. Unfortunately, due to the lack of support for saturation and clipping operators in, e.g., C/C++ or Fortran, these operations must be explicitly coded with conditional constructs that test the value of operands before actual wrap-around arithmetic is performed. As a result, inline assembly or language extensions are most commonly used to exploit the new instructions. In this paper, we explore an alternative approach, where the compiler automatically maps high-level saturation and clipping idioms onto efficient low-level instructions. The effectiveness of this approach is demonstrated with some experiments.
Aart J. C. Bik, Milind Girkar, Paul M. Grey, Xinmin Tian
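For concreteness, these are the kinds of conditional idioms the abstract refers to, written in portable C (the function names are ours, not the paper's); a recognizing compiler can map the first onto a single saturating SIMD instruction such as PADDUSB:

```c
#include <stdint.h>

/* Unsigned 8-bit saturating add: the result clamps at 255 instead of
   wrapping around, coded with the explicit conditional test a compiler
   must recognize. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Clipping a signed value into [lo, hi]: another idiom that maps onto
   min/max-style SIMD instructions. */
int clip(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```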
Compiler Optimizations with DSP-Specific Semantic Descriptions
Abstract
Due to their specialized architectures and stream-based instruction sets, traditional DSP compilers usually yield poor-quality object code. Lacking insight into the DSP architecture and the specific semantics of DSP applications, a compiler has trouble selecting appropriate special instructions to exploit advanced hardware features. In order to extract optimal performance from DSPs, we propose a set of user-specified directives called the Digital Signal Processing Interface (DSPI), which can facilitate code generation by relaying DSP-specific semantics to compilers. We have implemented a prototype compiler based on the SPAM and SUIF compiler toolkits and integrated the DSPI into it. The compiler currently targets TI's TMS320C6X DSP and will be extended to a retargetable compiler toolkit for embedded systems and System-on-a-Chip (SoC) platforms. Preliminary experimental results show that incorporating DSPI directives achieves significant performance improvements in several DSP applications.
Yung-Chia Lin, Yuan-Shin Hwang, Jenq Kuen Lee
Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling
Abstract
Finite Element problems are often solved using multigrid techniques. The most time consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multi-coloring and owner-computes based techniques, exploit parallelism and possibly intra-iteration data reuse but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared memory machines, specifically compared to owner-computes based Gauss-Seidel methods. The latter employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration, and inter-iteration data locality) are combined.
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, Barbara Kreaseck
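As background for the abstract above: a Gauss-Seidel sweep updates unknowns in place, using values already updated in the current sweep, and it is exactly these in-place updates that create the dependences which multi-coloring, owner-computes, and sparse tiling handle differently. A minimal sketch for a regular 1D Poisson system (the paper's irregular-grid case generalizes this):

```c
/* One Gauss-Seidel sweep for the 1D Poisson system
   -x[i-1] + 2*x[i] - x[i+1] = b[i] with fixed boundary values.
   Each update reads x[i-1] from the *current* sweep, which is the
   dependence that constrains both parallelization across unknowns
   and tiling across sweeps. */
void gs_sweep(double *x, const double *b, int n) {
    for (int i = 1; i < n - 1; i++)
        x[i] = 0.5 * (b[i] + x[i - 1] + x[i + 1]);
}
```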
A Hybrid Strategy Based on Data Distribution and Migration for Optimizing Memory Locality
Abstract
The performance of a NUMA architecture depends on the efficient use of local memory. Therefore, software-level techniques that improve memory locality (in addition to parallelism) are extremely important to extract the best performance from these architectures. The proposed solutions so far include OS-based automatic data migrations and compiler-based static/dynamic data distributions.
This paper proposes and evaluates a hybrid strategy for optimizing memory locality in NUMA architectures. In this strategy, we employ both compiler-directed data distribution and OS-directed dynamic page migration. More specifically, a given program code is first divided into segments, and then each segment is optimized either using compiler-based data distributions (at compile-time) or using dynamic migration (at runtime). In selecting the optimization strategy to use for a program segment, we use a criterion based on the number of compile-time analyzable references in loops.
To test the effectiveness of our strategy in optimizing memory locality of applications, we implemented it and compared its performance with that of several other techniques such as compiler-directed data distribution and OS-directed dynamic page migration. Our experimental results obtained through simulation indicate that our hybrid strategy outperforms other strategies and achieves the best performance for a set of codes with regular, irregular, and mixed (regular + irregular) access patterns.
I. Kadayif, M. Kandemir, A. Choudhary
Compiler Optimizations Using Data Compression to Decrease Address Reference Entropy
Abstract
In modern computers, a single “random” access to main memory often takes as much time as executing hundreds of instructions. Rather than using traditional compiler approaches to enhance locality by interchanging loops, reordering data structures, etc., this paper proposes the radical concept of using aggressive data compression technology to improve hierarchical memory performance by reducing memory address reference entropy.
In some cases, conventional compression technology can be adapted. However, where variable access patterns must be permitted, other compression techniques must be used. For the special case of random access to elements of sparse matrices, data structures and compiler technology already exist. Our approach is much more general, using compressive hash functions to implement random access lookup tables. Techniques that can be used to improve the effectiveness of any compression method in reducing memory access entropy also are discussed.
H. G. Dietz, T. I. Mattox
Towards Compiler Optimization of Codes Based on Arrays of Pointers
Abstract
To successfully exploit all the possibilities of current computer/multicomputer architectures, optimizing compiler techniques are a must. However, for codes based on pointers and dynamic data structures, these optimization techniques must be carried out after identifying the characteristics and properties of the data structures used in the code. In this paper we present a method able to automatically identify complex dynamic data structures used in a code, even in the presence of arrays of pointers. This method has been implemented in an analyzer which symbolically executes the input code to generate a set of graphs, called an RSRSG (Reduced Set of Reference Shape Graphs), for each statement. Each RSRSG accurately describes the data structure configuration at each program point. In order to deal with arrays of pointers we have introduced two main concepts: the multireference class and instances. Our analyzer has been validated with several codes based on complex data structures containing arrays of pointers, which were successfully identified.
F. Corbera, R. Asenjo, E. L. Zapata
An Empirical Study on the Granularity of Pointer Analysis in C Programs
Abstract
Pointer analysis plays a critical role in modern C compilers because of the frequent appearance of pointer expressions. It is even more important for data dependence analysis, which is essential in exploiting parallelism, because complex data structures such as arrays are often accessed through pointers in C. One of the important aspects of pointer analysis methods is their granularity: the way in which memory objects are named for analysis. The naming scheme used in a pointer analysis affects its effectiveness, especially for pointers pointing to heap memory blocks. In this paper, we present a new approach that applies compiler analysis and profiling techniques together to study the impact of granularity in pointer analyses. An instrumentation tool, based on Intel's Open Research Compiler (ORC), is devised to simulate different naming schemes and collect precise target sets for indirect references at runtime. The collected target sets are then fed back to the ORC compiler to evaluate the effectiveness of different granularities in pointer analyses. The change in the alias queries in the compiler analyses and the change in performance of the output code at different granularity levels are observed. From experiments on the SPEC CPU2000 integer benchmarks, we found that 1) finer granularity of pointer analysis shows great potential in optimizations, and may bring up to 15% performance improvement; 2) the common naming scheme, which names heap memory blocks according to the line number of the system memory allocation call, is not powerful enough for some benchmarks, and the wrapper functions for allocation or the user-defined memory management functions have to be recognized to produce better pointer analysis results; 3) pointer analysis of fine granularity requires inter-procedural analysis; and 4) it is also quite important that a naming scheme distinguish the fields of a structure in the targets.
Tong Chen, Jin Lin, Wei-Chung Hsu, Pen-Chung Yew
Automatic Implementation of Programming Language Consistency Models
Abstract
Concurrent threads executing on a shared memory system can access the same memory locations. A consistency model defines constraints on the order of these shared memory accesses. For good run-time performance, these constraints must be as few as possible. Programmers who write explicitly parallel programs must take into account the consistency model when reasoning about the behavior of their programs. Also, the consistency model constrains compiler transformations that reorder code. It is not known what consistency models best suit the needs of the programmer, the compiler, and the hardware simultaneously. We are building a compiler infrastructure to study the effect of consistency models on code optimization and run-time performance. The consistency model presented to the user will be a programmable feature independent of the hardware consistency model. The compiler will be used to mask the hardware consistency model from the user by mapping the software consistency model onto the hardware consistency model. When completed, our compiler will be used to prototype consistency models and to measure the relative performance of different consistency models. We present preliminary experimental data for performance of a software implementation of sequential consistency using manual inter-thread analysis.
Zehra Sura, Chi-Leung Wong, Xing Fang, Jaejin Lee, Samuel P. Midkiff, David Padua
Parallel Reductions: An Application of Adaptive Algorithm Selection
Abstract
Irregular and dynamic memory reference patterns can cause significant performance variations for low level algorithms in general and especially for parallel algorithms. We have previously shown that parallel reduction algorithms are quite input sensitive and thus can benefit from an adaptive, reference pattern directed selection. In this paper we extend our previous work by detailing a systematic approach to dynamically select the best parallel algorithm. First we model the characteristics of the input, i.e., the memory reference pattern, with a descriptor vector. Then we measure the performance of several reduction algorithms for various values of the pattern descriptor. Finally we establish a (many-to-one) mapping (function) between a finite set of descriptor values and a set of algorithms. We thus obtain a performance ranking of the available algorithms with respect to a limited set of descriptor values. The actual dynamic selection code is generated using statistical regression methods or a decision tree. Finally we present experimental results to validate our modeling and prediction techniques.
Hao Yu, Francis Dang, Lawrence Rauchwerger
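To make the abstract's setting concrete, the sketch below shows a sequential irregular reduction together with one toy pattern descriptor (the fraction of bins touched). The descriptor is our illustration only, not the paper's actual descriptor vector:

```c
/* Sequential baseline for an irregular reduction: which parallel variant
   wins (e.g., replicated private histograms vs. synchronized updates)
   depends on how the indices in idx cluster across bins. */
void reduce_irregular(double *hist, const int *idx, const double *val, int n) {
    for (int i = 0; i < n; i++)
        hist[idx[i]] += val[i];
}

/* A toy reference-pattern descriptor: the fraction of the nbins
   reduction bins that the index array actually touches. A sparse,
   clustered pattern and a dense, uniform one favor different
   parallel reduction algorithms. */
double touched_fraction(const int *idx, int n, int nbins) {
    char seen[256] = {0};   /* sketch assumes nbins <= 256 */
    int touched = 0;
    for (int i = 0; i < n; i++)
        if (!seen[idx[i]]) { seen[idx[i]] = 1; touched++; }
    return (double)touched / (double)nbins;
}
```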
Adaptively Increasing Performance and Scalability of Automatically Parallelized Programs
Abstract
This paper presents adaptive execution techniques that determine whether automatically parallelized loops are executed in parallel or sequentially in order to maximize performance and scalability. The adaptation and performance estimation algorithms are implemented in a compiler preprocessor. The preprocessor inserts code that automatically determines, at compile-time or at run-time, the way the parallelized loops are executed. Using a set of standard numerical applications written in Fortran77 and running them with our techniques on a distributed shared memory multiprocessor (SGI Origin2000), our techniques run, on average, 26%, 20%, 16%, and 10% faster than the original parallel programs on 32, 16, 8, and 4 processors, respectively. One of the applications even runs more than twice as fast as its original parallel version on 32 processors.
Jaejin Lee, H. D. K. Moonesinghe
Selector: A Language Construct for Developing Dynamic Applications
Abstract
Fitting algorithms to meet input data characteristics and/or a changing computing environment is a tedious and error prone task. Programmers need to deal with code instrumentation details and implement the selection of which algorithm best suits a given data set. In this paper we describe a set of simple programming constructs for C that allows programmers to specify and generate applications that can select at run-time the best of several possible implementations based on measured run-time performance and/or algorithmic input values. We describe the application of this approach to a realistic linear solver for an engineering crash analysis code. The preliminary experimental results reveal that this approach provides an effective mechanism for creating sophisticated dynamic application behavior with minimal effort.
Pedro C. Diniz, Bing Liu
Optimizing the Java Piped I/O Stream Library for Performance
Abstract
The overall performance of Java programs has been significantly improved since Java emerged as a mainstream programming language. However, these improvements have revealed a second tier of performance bottlenecks. In this paper, we address one of these issues: the performance of the Java piped I/O stream library. We analyze commonly used data transfer patterns in which one reader thread and one writer thread communicate via Java piped I/O streams. We consider data buffering and synchronization between these two threads, as well as the thread scheduling policy used in the Java virtual machine. Based on our observations, we propose several optimization techniques that can significantly improve Java piped I/O stream performance. We use these techniques to modify the Java piped I/O stream library. We present performance results for seven example programs from the literature that use the Java piped I/O stream library. Our methods improve the performance of the programs by over a factor of 4 on average, and by a factor of 27 in the best case.
Ji Zhang, Jaejin Lee, Philip K. McKinley
A Comparative Study of Stampede Garbage Collection Algorithms
Abstract
Stampede is a parallel programming system to support interactive multimedia applications. The system maintains temporal causality in such streaming real-time applications via channels that contain timestamped items. A Stampede application is a coarse-grain dataflow pipeline of timestamped items. Not all timestamps are relevant for the application output due to the differential processing rates of the pipeline stages. Therefore, garbage collection (GC) is crucial for Stampede runtime performance. Three GC algorithms are currently available in Stampede. In this paper, we ask how far these algorithms are from an ideal garbage collector, one in which memory usage is exactly that required for buffering only the relevant timestamped items in the channels. This oracle, while unimplementable, serves as an empirical lower bound for memory usage. We then propose optimizations that help us get closer to this lower bound. Using an elaborate measurement and post-mortem analysis infrastructure in Stampede, we evaluate the performance potential of these optimizations. A color-based people tracking application is used for the performance evaluation. Our results show that these optimizations reduce memory usage by over 60% for this application compared to the best GC algorithm available in Stampede.
Hasnain A. Mandviwala, Nissim Harel, Kathleen Knobe, Umakishore Ramachandran
Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms
Abstract
Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets.
In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using apriori association mining and k-means clustering.
Xiaogang Li, Ruoming Jin, Gagan Agrawal
Performance Analysis of Symbolic Analysis Techniques for Parallelizing Compilers
Abstract
Understanding symbolic expressions is an important capability of advanced program analysis techniques. Many current compiler techniques assume that coefficients of program expressions, such as array subscripts and loop bounds, are integer constants. Advanced symbolic handling capabilities could make these techniques amenable to real application programs. Symbolic analysis is also likely to play an important role in supporting higher-level programming languages and optimizations. For example, entire algorithms may be recognized and replaced by better variants. In pursuit of this goal, we have measured the degree to which symbolic analysis techniques affect the behavior of current parallelizing compilers. We have chosen the Polaris parallelizing compiler and studied techniques such as range analysis (the core symbolic analysis in the compiler), expression propagation, and symbolic expression manipulation. To measure the effect of a technique, we disabled it individually and compared the performance of the resulting program with the original, fully-optimized program. We found that symbolic expression manipulation is important for most programs. Expression propagation and range analysis are important in only a few programs; however, they can affect these programs significantly. We also found that in all but one program, a simpler form of range analysis, control range analysis, is sufficient.
Hansang Bae, Rudolf Eigenmann
Efficient Manipulation of Disequalities During Dependence Analysis
Abstract
Constraint-based frameworks can provide a foundation for efficient algorithms for analysis and transformation of regular scientific programs. For example, we recently demonstrated that constraint-based analysis of both memory- and value-based array dependences can often be performed in polynomial time. Many of the cases that could not be processed with our polynomial-time algorithm involved negated equality constraints (also known as disequalities).
In this report, we review the sources of disequality constraints in array dependence analysis and give an efficient algorithm for manipulating certain disequality constraints. Our approach differs from previous work in that it performs efficient satisfiability tests in the presence of disequalities, rather than deferring satisfiability tests until more constraints are available, performing a potentially exponential transformation, or approximating. We do not (yet) have an implementation of our algorithms, or empirical verification that our test is either fast or useful, but we do provide a polynomial time bound and give our reasons for optimism regarding its applicability.
Robert Seater, David Wonnacott
Removing Impediments to Loop Fusion Through Code Transformations
Abstract
Loop fusion is a common optimization technique that takes several loops and combines them into a single large loop. Most of the existing work on loop fusion concentrates on the heuristics required to optimize an objective function, such as data reuse or creation of instruction level parallelism opportunities. Often, however, the code provided to a compiler has only small sets of loops that are control flow equivalent, normalized, have the same iteration count, are adjacent, and have no fusion-preventing dependences. This paper focuses on code transformations that create more opportunities for loop fusion in the IBM® XL compiler suite that generates code for the IBM family of PowerPC® processors. In this compiler an objective function is used at the loop distributor to decide which portions of a loop should remain in the same loop nest and which portions should be redistributed. Our algorithm focuses on eliminating conditions that prevent loop fusion. By generating maximal fusion our algorithm increases the scope of later transformations. We tested our improved code generator on an IBM pSeries™ 690 machine equipped with a POWER4™ processor using the SPEC CPU2000 benchmark suite. Our improvements to loop fusion resulted in three times as many loops fused in a subset of the CFP2000 benchmarks, and four times as many for a subset of the CINT2000 benchmarks.
Bob Blainey, Christopher Barton, José Nelson Amaral
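A minimal example of the transformation the abstract targets: two adjacent, control-flow-equivalent loops with the same iteration count and no fusion-preventing dependences, before and after fusion (illustrative code, not taken from the paper):

```c
/* Before fusion: a[] is written in full, then re-read -- poor reuse. */
void unfused(double *a, double *b, const double *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * c[i];
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
}

/* After fusion: a[i] is still hot in a register when b[i] needs it. */
void fused(double *a, double *b, const double *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = 2.0 * c[i];
        b[i] = a[i] + 1.0;
    }
}
```

The paper's contribution is precisely about rewriting code so that more loop pairs satisfy the preconditions this example takes for granted.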
Near-Optimal Padding for Removing Conflict Misses
Abstract
The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as padding, a code transformation targeted at reducing conflict misses. This paper presents a novel approach to perform near-optimal padding for multi-level caches. It analyzes programs, detecting conflict misses by means of the Cache Miss Equations. A genetic algorithm is used to compute the parameter values that enhance the program. Our results show that it can remove practically all conflicts among variables in the SPECfp95 benchmarks, targeting all the different cache levels simultaneously.
Xavier Vera, Josep Llosa, Antonio González
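To illustrate the kind of transformation the abstract describes (a hand-written sketch; the paper derives pad sizes with a genetic algorithm over Cache Miss Equations): if two arrays are laid out so that a[i] and b[i] always map to the same set of a direct-mapped cache, inserting a small inter-variable pad shifts one mapping and removes the conflict misses.

```c
#include <stddef.h>

#define N 1024   /* assume N*sizeof(double) is a multiple of the cache size */
#define PAD 8    /* the pad size is the tuning parameter a search would pick */

/* Without the pad, a[i] and b[i] sit exactly N doubles apart, so a loop
   touching both can conflict in every iteration; the pad shifts b by
   PAD doubles and breaks the alignment. */
struct padded {
    double a[N];
    double pad[PAD];
    double b[N];
};
```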
Fine-Grain Stacked Register Allocation for the Itanium Architecture
Abstract
The introduction of a hardware managed register stack in the Itanium Architecture creates an opportunity to optimize both the frequency in which a compiler requests allocation of registers from this stack and the number of registers requested. The Itanium Architecture specifies the implementation of a Register Stack Engine (RSE) that automatically performs register spills and fills. However, if the compiler requests too many registers, through the alloc instruction, the RSE will be forced to execute unnecessary spill and fill operations. In this paper we introduce the formulation of the fine-grain register stack frame sizing problem. The normal interaction between the compiler and the RSE suggested by the Itanium Architecture designers is for the compiler to request the maximum number of registers required by a procedure at the procedure invocation. Our new problem formulation allows for more conservative stack register allocation because it acknowledges that the number of registers required in different control flow paths varies significantly. We introduce a basic algorithm to solve the stack register allocation problem, and present our preliminary performance results from the implementation of our algorithm in the Open64 compiler.
Alban Douillet, José Nelson Amaral, Guang R. Gao
Evaluating Iterative Compilation
Abstract
This paper describes a platform independent optimisation approach based on feedback-directed program restructuring. We have developed two strategies that search the optimisation space by means of profiling to find the best possible program variant. These strategies have no a priori knowledge of the target machine and can be run on any platform. In this paper our approach is evaluated on three full SPEC benchmarks, rather than the kernels evaluated in earlier studies where the optimisation space is relatively small. This approach was evaluated on six different platforms, where it is shown that we obtain on average a 20.5% reduction in execution time compared to the native compiler with full optimisation. By using training data instead of reference data for the search procedure, we can reduce compilation time and still give on average a 16.5% reduction in time when running on reference data. We show that our approach is able to give similar significant reductions in execution time over a state of the art high level restructurer based on static analysis and a platform specific profile feedback directed compiler that employs the same transformations as our iterative system.
G. G. Fursin, M. F. P. O’Boyle, P. M. W. Knijnenburg
Backmatter
Metadata
Title
Languages and Compilers for Parallel Computing
Edited by
Bill Pugh
Chau-Wen Tseng
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-31612-1
Print ISBN
978-3-540-30781-5
DOI
https://doi.org/10.1007/11596110