2022 | Book

Languages and Compilers for Parallel Computing

33rd International Workshop, LCPC 2020, Virtual Event, October 14-16, 2020, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 33rd International Workshop on Languages and Compilers for Parallel Computing, LCPC 2020, organized by Stony Brook, NY, USA, in October 2020. Due to the COVID-19 pandemic, the workshop was held virtually. The 15 revised full papers were carefully reviewed and selected from 19 submissions. The contributions are organized in the following topical sections: Code and Data Transformations; OpenMP and Fortran; Domain Specific Compilation; Machine Learning and Quantum Computing; Performance Analysis; Code Generation.

Table of Contents

Frontmatter

Code and Data Transformations

Frontmatter
An Affine Scheduling Framework for Integrating Data Layout and Loop Transformations
Abstract
Code transformations in optimizing compilers can often be classified as loop transformations, which change the execution order of statement instances, and data layout transformations, which change the memory layouts of variables. The two are mutually dependent: the best statement execution order can depend on the underlying data layout and vice versa. Existing approaches have typically addressed this inter-dependence by picking a specific phase order, and can thereby miss opportunities to co-optimize loop and data layout transformations. In this paper, we propose a cost-based integration of loop and data layout transformations, aiming to cover a broader optimization space than phase-ordered strategies and thereby find better solutions. Our approach builds on the polyhedral model and shows how both loop and data layout transformations can be represented as affine scheduling in a unified manner. To explore the broader optimization space efficiently, we build analytical memory and computational cost models that are parameterized with a range of machine features, including hardware parallelism, cache and TLB locality, and vectorization. Experimental results obtained on 12-core Intel Xeon and 24-core IBM POWER8 platforms demonstrate that, for a set of 22 PolyBench benchmarks, our proposed cost-based integration approach delivers 1.3× and 1.6× geometric-mean improvements, respectively, over the state-of-the-art polyhedral optimizer PLuTo, and a 1.2× geometric-mean improvement on both platforms over a phase-ordered approach in which loop transformations are followed by the best data layout transformations.
Jun Shirako, Vivek Sarkar
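
To make the co-optimization idea concrete, here is a minimal C sketch of our own (not the paper's framework): a computation whose inner loop strides badly through a row-major array, and the same computation after a data layout transformation (transpose) gives it unit-stride access. A loop interchange alone could also fix this example, but may conflict with the preferred order of other statements, which is why a joint search over both transformation spaces can win.

    /* Baseline: column sums over a row-major array; the inner loop
     * walks with stride n, which is cache- and TLB-hostile.
     * 'out' is assumed zero-initialized by the caller. */
    void sum_cols_baseline(int n, const double A[n][n], double out[n]) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                out[j] += A[i][j];      /* stride-n access */
    }

    /* After a layout transformation: At holds the transpose of A, so the
     * same logical computation now reads memory with unit stride. */
    void sum_cols_transposed(int n, const double At[n][n], double out[n]) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                out[j] += At[j][i];     /* unit-stride access */
    }
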
Guiding Code Optimizations with Deep Learning-Based Code Matching
Abstract
Performance models can be very useful for understanding the behavior of applications and for guiding design and optimization decisions. Unfortunately, performance modeling of nontrivial computations typically requires significant expertise and human effort. Moreover, even when performed by experts, it is necessarily limited in scope, accuracy, or both. In this paper, we present the Meliora framework for machine learning-based performance model generation of arbitrary codes based on static analysis of intermediate language representations. We demonstrate good accuracy in matching known codes and show how Meliora can be used to optimize new codes through reusing optimization knowledge, either manually or in conjunction with an autotuner. When autotuning, Meliora eliminates or dramatically reduces the empirical search space, while generally achieving competitive performance.
Kewen Meng, Boyana Norris
Expanding Opportunities for Array Privatization in Sparse Computations
Abstract
Sparse computations, in which sparse formats are used to compress the nonzero values of big data, are common in real-world applications. However, the compiler-based data dependence analysis of sparse computations needed for automatic parallelization is difficult due to the indirect memory accesses through index arrays (e.g., col in val[col[j]]) that these computations use. One use of such data dependence analysis is to find opportunities for array privatization, an approach that increases the available parallelism in a loop by providing each parallel thread its own copy of an array whose reads in each iteration are dominated by writes in the same iteration. In this paper, we expand opportunities for compile-time array privatization in sparse computations by using newly formulated index array properties and a novel concept we call content-based privatization. Furthermore, we discuss existing opportunities to apply our approach to detecting private arrays in existing library implementations of sparse computations.
Mahdi Soltan Mohammadi, Mary Hall, Michelle Mills Strout
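
The following C/OpenMP sketch (ours, not from the paper) shows the basic pattern the abstract builds on: a workspace array that is fully written before it is read in every iteration, so each thread can safely receive a private copy even though it is indexed indirectly through an index array. Compile with -fopenmp; the CSR-like arrays rowptr, col, and val are a generic sparse-format illustration.

    /* Each outer iteration zero-fills 'w' before reading it, so reads are
     * dominated by same-iteration writes: 'w' is privatizable. Declaring
     * it inside the parallel loop body makes it private per iteration. */
    void rowwise(int n, const int *rowptr, const int *col,
                 const double *val, const double *x, double *y) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double w[64];                                /* private workspace */
            for (int k = 0; k < 64; k++) w[k] = 0.0;     /* writes first */
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                w[col[j] % 64] += val[j] * x[col[j]];    /* indirect access */
            double s = 0.0;
            for (int k = 0; k < 64; k++) s += w[k];      /* reads after */
            y[i] = s;
        }
    }
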

OpenMP and Fortran

Frontmatter
Concurrent Execution of Deferred OpenMP Target Tasks with Hidden Helper Threads
Abstract
In this paper, we introduce a novel approach to supporting concurrent offloading for OpenMP tasks based on hidden helper threads. We contrast our design with alternative implementations and explain why the chosen approach provides the most consistent performance across a wide range of use cases. In addition to a theoretical discussion of the trade-offs, we detail our implementation in the LLVM compiler infrastructure. Finally, we provide evaluation results for four extreme offloading situations on the Summit supercomputer, showing that we achieve speedups of up to 6.7× over synchronous offloading and speedups comparable to the commercial IBM XL C/C++ compiler.
Shilei Tian, Johannes Doerfert, Barbara Chapman
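
For readers unfamiliar with deferred target tasks, this is the standard OpenMP pattern whose concurrent execution the paper's hidden-helper-thread runtime enables (the example itself is plain OpenMP 5.x, not the paper's implementation):

    /* Two independent offload regions expressed as deferred target tasks.
     * With 'nowait', each target region becomes a task that may execute
     * asynchronously on the device; 'taskwait' synchronizes both. */
    void offload(int n, double *a, double *b) {
        #pragma omp target teams distribute parallel for \
                map(tofrom: a[0:n]) nowait
        for (int i = 0; i < n; i++) a[i] += 1.0;

        #pragma omp target teams distribute parallel for \
                map(tofrom: b[0:n]) nowait
        for (int i = 0; i < n; i++) b[i] *= 2.0;

        #pragma omp taskwait
    }
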
Using Hardware Transactional Memory to Implement Speculative Privatization in OpenMP
Abstract
Loop Thread-Level Speculation (TLS) on Hardware Transactional Memories is a promising strategy for improving application performance in the multicore era. However, the reuse of shared scalar or array variables introduces constraints (false dependences or false sharing) that obstruct efficient speculative parallelization. Speculative privatization relieves these constraints by creating speculatively private data copies for each transaction, thus enabling scalable parallelization. To support it, this paper proposes two new OpenMP clauses for parallel for that enable speculative privatization of scalars or arrays in may-DOACROSS loops: spec_private and spec_reduction. We also present an evaluation revealing that, for certain loops, speed-ups of up to 3.24× can be obtained by applying speculative privatization in TLS.
Juan Salamanca, Alexandro Baldassin
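
A sketch of how the proposed clauses might be used, based only on our reading of the abstract: this syntax is a proposal from the paper, not standard OpenMP, so the exact spelling may differ from the authors' and it will not compile with a stock compiler.

    /* HYPOTHETICAL syntax: 'tmp' is speculatively privatized per
     * transaction and 'sum' speculatively reduced, removing the false
     * dependences that would otherwise serialize this may-DOACROSS loop. */
    double tmp[256];
    double sum = 0.0;
    #pragma omp parallel for spec_private(tmp) spec_reduction(+: sum)
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < 256; k++) tmp[k] = a[i * 256 + k] * 2.0;
        for (int k = 0; k < 256; k++) sum += tmp[k];
    }
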
Improving Fortran Performance Portability
Abstract
We present a new Fortran source-to-source tool that helps bridge the gap in the current Fortran tooling ecosystem. The goal of this tool is to translate standard Fortran code to various parallel programming languages in Fortran and C/C++, enabling it to run on a wide variety of GPUs and CPUs. The translation is performed using the full syntax parsing capabilities of the LFortran compiler, a research compiler currently in development. Using the compiler's Abstract Semantic Representation intermediate output in this new work has made the translation simpler to accomplish. We also develop a map of the parallel constructs needed for a complete parallel language and begin to identify possible extensions to the existing Fortran language.
Jacob Marks, Eric Medwedeff, Ondřej Čertík, Robert Bird, Robert W. Robey

Domain Specific Compilation

Frontmatter
COMET: A Domain-Specific Compilation of High-Performance Computational Chemistry
Abstract
Increases in computational power over the past decades have greatly enhanced our ability to simulate chemical reactions and understand ever more complex transformations. Tensor contractions are the fundamental computational building block of these simulations. Such simulations have often been tied to one platform and restricted in generality by the interface provided to the user. The expanding prevalence of accelerators and growing researcher demands necessitate a more general approach that is neither tied to specific hardware nor requires contorting algorithms to particular hardware platforms. In this paper, we present COMET, a domain-specific programming language and compiler infrastructure for tensor contractions targeting heterogeneous accelerators. We present a system of progressive lowering through multiple layers of abstraction and optimization that achieves up to 1.98× speedup for 30 tensor contractions commonly used in computational chemistry and beyond.
Erdal Mutlu, Ruiqin Tian, Bin Ren, Sriram Krishnamoorthy, Roberto Gioiosa, Jacques Pienaar, Gokcen Kestor
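
For context, this is the shape of computation a tensor-contraction DSL expresses declaratively; the naive C reference below is our generic illustration, not COMET code or its generated output.

    /* C[i][j] = sum over k,l of A[i][k][l] * B[k][l][j], a 4-index
     * contraction typical of computational chemistry kernels. Arrays are
     * flattened row-major. */
    void contract(int I, int J, int K, int L,
                  const double *A, const double *B, double *C) {
        for (int i = 0; i < I; i++)
            for (int j = 0; j < J; j++) {
                double s = 0.0;
                for (int k = 0; k < K; k++)
                    for (int l = 0; l < L; l++)
                        s += A[(i*K + k)*L + l] * B[(k*L + l)*J + j];
                C[i*J + j] = s;
            }
    }
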
G-Code Re-compilation and Optimization for Faster 3D Printing
Abstract
3D printing technology has seen increasingly wide application in industrial manufacturing and the general public domain. The normal workflow of 3D printing, i.e., from Computer-Aided Design (CAD), to a 3D model description, and finally to 3D printers, uses languages such as STL (Standard Tessellation Language, or STereoLithography) and G-code to pass information between the work phases from design to manufacturing. However, these languages are produced and consumed literally (much as XML is often used purely for data representation), and there has been little discussion of how these de facto programming languages can be compiled or optimized. In this paper, we present our preliminary work on improving 3D printing's efficiency at the back end of the workflow. We re-compile G-code into a higher-level IR, design a number of physics- and graphics-driven optimizations, and re-generate G-code from the optimized IR. We test our G-code compiler on several popular 3D models and show up to 10.4% speedup, saving more than 16 hours when printing complex models.
Xiaoming Li
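
As a flavor of what re-compiling G-code can enable, here is a hypothetical peephole pass of our own invention (the IR struct and pass are illustrative, not the paper's design): merging consecutive linear moves that are collinear and share a feed rate shrinks the program the printer firmware must parse and execute.

    #include <math.h>

    typedef struct { double x, y, f; } Move;   /* a G1 X.. Y.. F.. move */

    /* Merge runs of collinear G1 moves in place; returns new count. */
    int merge_collinear(Move *m, int n) {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out >= 2) {
                Move *a = &m[out - 2], *b = &m[out - 1];
                double cross = (b->x - a->x) * (m[i].y - b->y)
                             - (b->y - a->y) * (m[i].x - b->x);
                if (fabs(cross) < 1e-9 && b->f == m[i].f) {
                    *b = m[i];           /* extend the previous segment */
                    continue;
                }
            }
            m[out++] = m[i];
        }
        return out;
    }
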

Machine Learning and Quantum Computing

Frontmatter
Optimized Code Generation for Deep Neural Networks
Abstract
As Deep Neural Networks (DNNs) become more widely used in a variety of applications, the need for performance and portability on many different architectures, including CPUs, becomes increasingly important. Compiler-based methods offer opportunities for performance gains over statically tuned libraries by exploiting data reuse and parallelism, efficient memory access, and vectorization for specific backends through the use of abstraction. The Batch Normalization (BN) operator can accelerate training and increase the robustness of DNNs, making it a widely used operator in many DNNs. LATTE is a domain-specific language for DNNs, and SWIRL is a compiler that can be used with LATTE. We extend the applicability of LATTE/SWIRL by incorporating the BN operator into the LATTE framework and by expanding the optimizations of SWIRL to this operator. The optimized BN operator in LATTE/SWIRL is compared to existing frameworks such as TensorFlow, TensorFlow with Intel MKL-DNN, TensorFlow with XLA, PyTorch with MKL-DNN, and MXNet with MKL-DNN. The results show that a compiler-based approach for the BN operator can increase performance on CPU architectures.
Janaan Lake, Tharindu R. Patabandi, Mary Hall
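
For reference, the batch normalization computation itself is simple; this plain C version (ours, assuming the standard inference-time formulation y = gamma * (x - mean) / sqrt(var + eps) + beta) is exactly the kind of loop a compiler like SWIRL can fuse with neighboring operators and vectorize.

    #include <math.h>

    /* Inference-time batch normalization over one feature channel.
     * 'mean' and 'var' are the channel's learned running statistics. */
    void batch_norm(int n, const float *x, float *y,
                    float mean, float var, float gamma, float beta) {
        const float eps = 1e-5f;
        const float scale = gamma / sqrtf(var + eps);  /* hoisted invariant */
        for (int i = 0; i < n; i++)
            y[i] = scale * (x[i] - mean) + beta;
    }
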
Thermal-Aware Compilation of Spiking Neural Networks to Neuromorphic Hardware
Abstract
Hardware implementations of neuromorphic computing can significantly improve the performance and energy efficiency of machine learning tasks implemented with spiking neural networks (SNNs), making these hardware platforms particularly suitable for embedded systems and other energy-constrained environments. We observe that the long bitlines and wordlines in a crossbar of such hardware create significant current variations when propagating spikes through its synaptic elements, which are typically designed with non-volatile memory (NVM). These current variations create a thermal gradient within each crossbar, depending on the machine learning workload and on the mapping of the workload's neurons and synapses to the crossbars. The thermal gradient becomes significant at scaled technology nodes and increases the leakage power of the hardware, leading to higher energy consumption. We propose a novel technique to map the neurons and synapses of SNN-based machine learning workloads to neuromorphic hardware, making two contributions. First, we formulate a detailed, workload-dependent thermal model for a crossbar in neuromorphic hardware, in which the temperature of each NVM-based synaptic cell is computed considering the thermal contributions of its neighboring cells. Second, we incorporate this thermal model into the mapping of neurons and synapses of SNN-based workloads using a hill-climbing heuristic, with the objective of reducing the thermal gradient in crossbars. We evaluate our mapping technique using 10 machine learning workloads on state-of-the-art neuromorphic hardware and demonstrate an average 11.4 K reduction in the average temperature of each crossbar, leading to a 52% reduction in leakage power consumption (11% lower total energy consumption) compared to a performance-oriented SNN mapping technique.
Twisha Titirsha, Anup Das
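
A rough skeleton of the search the abstract describes, in C; the hill-climbing structure is generic and thermal_cost is a placeholder standing in for the paper's neighbor-aware crossbar thermal model, not an implementation of it.

    #include <stdlib.h>

    extern double thermal_cost(const int *map, int n);  /* assumed model */

    /* Hill climbing over synapse-to-crossbar assignments: try a random
     * swap, keep it if the thermal cost drops, otherwise undo it. */
    void hill_climb(int *map, int n, int iters) {
        double best = thermal_cost(map, n);
        for (int t = 0; t < iters; t++) {
            int i = rand() % n, j = rand() % n;
            int tmp = map[i]; map[i] = map[j]; map[j] = tmp;    /* swap */
            double c = thermal_cost(map, n);
            if (c < best) best = c;                             /* keep */
            else { tmp = map[i]; map[i] = map[j]; map[j] = tmp; } /* undo */
        }
    }
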
A Quantum-Inspired Model for Bit-Serial SIMD-Parallel Computation
Abstract
Bit-serial SIMD-parallel execution was once commonly used in supercomputers but fell out of favor as it became practical to implement word-level operations directly in MIMD hardware. Word-level primitive operations simplify programming and significantly speed up sequential code. However, aggressive gate-level compiler optimization can dramatically reduce the power consumed by massively parallel bit-serial execution without a performance penalty. The model described here, Parallel Bit Pattern Computing, not only leverages gate-level just-in-time optimization of bit-serial code but also uses a quantum-inspired type of symbolic execution based on regular expressions to obtain a potentially exponential reduction in computational complexity while using entirely conventional computer hardware.
Henry Dietz, Aury Shafran, Gregory Austin Murphy
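
To illustrate the execution style (a minimal sketch of generic bit-slice arithmetic, not the paper's system): store bit i of 64 values in one 64-bit word, and a word-wide full adder per bit position then adds all 64 lanes at once using only plain logic gates, which is what makes gate-level optimization of bit-serial code so profitable.

    #include <stdint.h>

    /* Bit-serial SIMD addition across 64 lanes. a[i], b[i], sum[i] are
     * bit-planes: bit k of word i is bit i of lane k's value. */
    void bitserial_add(int bits, const uint64_t *a, const uint64_t *b,
                       uint64_t *sum) {
        uint64_t carry = 0;
        for (int i = 0; i < bits; i++) {          /* LSB first */
            uint64_t s = a[i] ^ b[i] ^ carry;     /* full-adder sum */
            carry = (a[i] & b[i]) | (carry & (a[i] ^ b[i]));
            sum[i] = s;
        }
    }
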

Performance Analysis

Frontmatter
Enhancing the Top-Down Microarchitectural Analysis Method Using Purchasing Power Parity Theory
Abstract
The Top-Down method makes it possible to identify bottlenecks as instructions traverse the CPU's pipeline. Once bottlenecks are identified, incremental changes to the code can be made to mitigate their negative effects on performance. This is an iterative process that can result in more efficient use of CPU resources. It can be difficult to compare bottleneck metrics of the same program generated by different compilers running on the same system: different compilers may generate different instructions, arrange the instructions in a different order, and require a different number of cycles to execute the program. Ratios with relatively similar values can hide valuable information that could be used to identify differences in the magnitude and influence of bottlenecks. To amplify magnitude differences in bottleneck metrics, we use the cycles required to complete the program as a reference point. We can then quantify the relative difference in the effect a bottleneck has when compared with the bottleneck of the reference compiler. This study's proposed approach is based on Purchasing Power Parity (PPP) theory, which economists use to compare the purchasing power of different currencies by comparing similar products. We show that this approach can give us more information on how effectively each compiler uses the CPU's architectural features by comparing their respective bottlenecks. For example, using conventional methods, our measurements show that for the 363.swim benchmark, the Back-End Bound rate was 0.949 for GCC4 and 0.956 for both GCC6 and GCC7. However, using the PPP normalization approach, we show differences of 55.3% for GCC6 and 54.9% for GCC7 over GCC4.
Yectli A. Huerta, Brent Swartz, David J. Lilja
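
The arithmetic below is our reading of the abstract's normalization, not the paper's exact formula: scale each compiler's bottleneck ratio by its total cycle count, then compare against the reference compiler. The cycle counts are entirely hypothetical, chosen only to show how nearly identical ratios (0.949 vs. 0.956) can still yield a large normalized difference.

    #include <stdio.h>

    int main(void) {
        double ref_ratio = 0.949, ref_cycles = 1.00e12; /* GCC4; cycles hypothetical */
        double new_ratio = 0.956, new_cycles = 1.54e12; /* GCC6; cycles hypothetical */
        double ref_norm = ref_ratio * ref_cycles;       /* cycle-weighted bound */
        double new_norm = new_ratio * new_cycles;
        printf("relative difference: %.1f%%\n",
               100.0 * (new_norm - ref_norm) / ref_norm);  /* ~55.1% here */
        return 0;
    }
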

Code Generation

Frontmatter
Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors
Abstract
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation, making these devices suitable for edge computation. Unfortunately, their limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain, an open-source compiler targeting SCAMP-5, a general-purpose FPSP, that generates code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit-recognition neural network, Cain produces code that is half as long as that produced by the other available compilers for SCAMP-5.
Edward Stow, Riku Murai, Sajad Saeedi, Paul H. J. Kelly
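
For readers unfamiliar with the input domain, "multiple convolutional kernels" means coefficient grids like the pair below (shown as plain C data for illustration; this is not Cain's input syntax). Compiling several such kernels together lets common subexpressions be shared across them on the FPSP's registers.

    /* Sobel edge-detection kernels: two 3x3 convolutions that share
     * much of their arithmetic, a natural target for joint compilation. */
    static const int sobel_x[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int sobel_y[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
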
Reordering Under the ECMAScript Memory Consistency Model
Abstract
Relaxed memory accesses are used to gain substantial improvements in the performance of concurrent programs. A relaxed memory consistency model describes the semantics of such memory accesses for a particular programming language. Historically, such semantics are often ill-defined or misunderstood and have been shown to conflict with common compiler optimizations essential for overall program performance. In this paper, we give a formal description of the ECMAScript relaxed memory consistency model. We then analyze the impact of this model on one of the most common compiler optimizations, viz. instruction reordering. We give a conservative proof of the conditions under which such reordering is allowed for relaxed memory accesses. Finally, we discuss the advantages of our conservative approach and the gaps that need to be filled in order to incorporate our results when performing such optimizations at the program level.
Akshay Gopalakrishnan, Clark Verbrugge
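
The kind of reordering at stake can be shown with C11 relaxed atomics as an analogue (the paper studies ECMAScript's model, which differs in detail, so this is only an illustration of the general question, not its semantics):

    #include <stdatomic.h>

    atomic_int x, y;   /* zero-initialized at file scope */

    /* With relaxed ordering, the two independent stores below may be
     * reordered by the compiler or hardware, so another thread can
     * observe y == 1 while x == 0. Whether a compiler may perform such
     * a reordering soundly is exactly the question the paper proves
     * conditions for, under ECMAScript's model. */
    void writer(void) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_store_explicit(&y, 1, memory_order_relaxed);
    }
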
Verification of Vectorization of Signal Transforms
Abstract
This paper proves properties of the vectorized code generated by the SPIRAL system for implementing, optimizing, and tuning fast signal transforms. In particular, it is shown that the generated code is correct and fully vectorized. The SPIRAL system uses multiple rewrite systems at varying levels of abstraction to generate, optimize, parallelize, and implement code. The proofs proceed by showing that the rules preserve semantics and lead to code with guaranteed performance and correctness properties. Unlike more general approaches, much of the work is done at a higher level, incorporating the underlying mathematics. This shifts much of the verification from proving equivalence of programs to proving equivalence of mathematical expressions. Furthermore, the proofs incorporate domain-specific knowledge, leading to stronger guarantees than could be obtained for a more general vectorizing compiler.
Patrick Brinich, Jeremy Johnson
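
For context, the best-known rewrite rule in this style is the Cooley-Tukey factorization of the discrete Fourier transform (standard in the SPIRAL literature; shown here for orientation, not as the paper's contribution). Verification in this setting amounts to proving that rewrites like this one preserve the transform's semantics:

    \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m

where \otimes is the Kronecker product, T^{mn}_n is the diagonal matrix of twiddle factors, and L^{mn}_m is the stride permutation. Because both sides are matrices, equivalence is a mathematical identity rather than a program-equivalence question.
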
Backmatter
Metadata
Title
Languages and Compilers for Parallel Computing
Editors
Prof. Barbara Chapman
José Moreira
Copyright Year
2022
Electronic ISBN
978-3-030-95953-1
Print ISBN
978-3-030-95952-4
DOI
https://doi.org/10.1007/978-3-030-95953-1
