
2004 | Book

Languages and Compilers for Parallel Computing

16th International Workshop, LCPC 2003, College Station, TX, USA, October 2-4, 2003. Revised Papers

Edited by: Lawrence Rauchwerger

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the thoroughly refereed post-proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing, LCPC 2003, held in College Station, Texas, USA, in October 2003.

The 35 revised full papers presented were selected from 48 submissions during two rounds of reviewing and improvement following their presentation at the workshop. The papers are organized in topical sections on adaptive optimization, data locality, parallel languages, high-level transformations, embedded systems, distributed systems software, low-level transformations, compiling for novel architectures, and optimization infrastructure.

Table of Contents

Frontmatter
Search Space Properties for Mapping Coarse-Grain Pipelined FPGA Applications

This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. In previous work, we described a compiler algorithm that optimizes individual loop nests, expressed in C, to derive an efficient FPGA implementation. In this paper, we describe a global optimization strategy that maps multiple loop nests to a coarse-grain pipelined FPGA implementation. The global optimization algorithm automatically transforms the computation to incorporate explicit communication and data reorganization between pipeline stages, and uses metrics to guide design space exploration to consider the impact of communication and to achieve balance between producer and consumer data rates across pipeline stages. We illustrate the components of the algorithm with a case study, a machine vision kernel.

Heidi Ziegler, Mary Hall, Byoungro So
Adapting Convergent Scheduling Using Machine-Learning

Convergent scheduling is a general framework for instruction scheduling and cluster assignment for parallel, clustered architectures. A convergent scheduler is composed of many independent passes, each of which implements a specific compiler heuristic. Each of the passes shares a common interface, which allows them to be run multiple times, and in any order. Because of this, a convergent scheduler is presented with a vast number of legal pass orderings. In this work, we use machine-learning techniques to automatically search for good orderings. We do so by evolving, through genetic programming, s-expressions that describe a particular pass sequence. Our system has the flexibility to create dynamic sequences where the ordering of the passes is predicated upon characteristics of the program being compiled. In particular, we implemented a few tests on the present state of the code being compiled. We are able to find improved sequences for a range of clustered architectures. These sequences were tested with cross-validation, and generally outperform Desoli’s PCC and UAS.

Diego Puppin, Mark Stephenson, Saman Amarasinghe, Martin Martin, Una-May O’Reilly
TFP: Time-Sensitive, Flow-Specific Profiling at Runtime

Program profiling can help performance prediction and compiler optimization. This paper describes the initial work behind TFP, a new profiling strategy that can gather and verify a range of flow-specific information at runtime. While TFP can collect more refined information than block, edge or path profiling, it is only 5.75% slower than a very fast runtime path-profiling technique. Statistics collected using TFP over the SPEC2000 benchmarks reveal possibilities for further flow-specific runtime optimizations. We also show how TFP can improve the overall performance of a real application.

Sagnik Nandy, Xiaofeng Gao, Jeanne Ferrante
A Hierarchical Model of Reference Affinity

To improve performance, data reorganization needs locality models to identify groups of data that have reference affinity. Much past work is based on access frequency and does not consider accessing time directly. In this paper, we propose a new model of reference affinity. This model considers the distance between data accesses in addition to the frequency. Affinity groups defined by this model are consistent and have a hierarchical structure. The former property ensures the profitability of data packing, while the latter supports data packing for storage units of different sizes. We then present a statistical clustering method that identifies affinity groups among structure fields and data arrays by analyzing training runs of a program. When used by structure splitting and array regrouping, the new method improves the performance of two test programs by up to 31%. The new data layout is significantly better than that produced by the programmer or by static compiler analysis.

Yutao Zhong, Xipeng Shen, Chen Ding
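As an illustrative sketch (not the paper's algorithm), the distance-based notion of affinity the abstract describes can be approximated by grouping data elements whose accesses always fall within a window of k accesses of each other; varying k yields the hierarchical structure. Names below are hypothetical, and grouping by union is a simplification, since affinity need not be transitive in general.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import combinations

def affinity_groups(trace, k):
    """Group elements of an access trace whose accesses are always
    within k positions of each other (simplified reference affinity)."""
    pos = defaultdict(list)
    for i, x in enumerate(trace):
        pos[x].append(i)

    def close(xs, ys):
        # Every access in xs must have an access in ys within distance k.
        for i in xs:
            j = bisect_left(ys, i)
            near = min((abs(ys[j2] - i) for j2 in (j - 1, j)
                        if 0 <= j2 < len(ys)), default=k + 1)
            if near > k:
                return False
        return True

    # Union-find to merge mutually close elements into groups.
    parent = {x: x for x in pos}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for x, y in combinations(pos, 2):
        if close(pos[x], pos[y]) and close(pos[y], pos[x]):
            parent[find(x)] = find(y)
    groups = defaultdict(set)
    for x in pos:
        groups[find(x)].add(x)
    return sorted(map(sorted, groups.values()))
```

With k small, only fields accessed together land in the same group; as k grows, groups merge, which is the hierarchy exploited for different storage-unit sizes.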
Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding

The wide use of multiprocessor systems has made automatic parallelizing compilers increasingly important. To further improve multiprocessor performance through compilation, multigrain parallelization is important. In multigrain parallelization, coarse-grain task parallelism among loops and subroutines and near-fine-grain parallelism among statements are exploited in addition to traditional loop parallelism. Locality optimization that uses the cache effectively is also important for performance improvement. This paper describes inter-array padding to minimize cache conflict misses among macro-tasks, combined with a data localization scheme that decomposes loops sharing the same arrays to fit the cache size and executes the decomposed loops consecutively on the same processor. In a performance evaluation on a Sun Ultra 80 (4pe), the OSCAR compiler, in which the proposed scheme is implemented, achieved a 2.5 times speedup over the maximum performance of the Sun Forte compiler's automatic loop parallelization, on average over the SPEC CFP95 tomcatv, swim, hydro2d, and turb3d programs. The OSCAR compiler also showed a 2.1 times speedup on an IBM RS/6000 44p-270 (4pe) over the XLF compiler.

Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara
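The conflict-miss problem that inter-array padding attacks can be demonstrated with a toy direct-mapped cache model (parameters below are illustrative, not the paper's): when two arrays swept in lockstep sit a multiple of the cache size apart, every access evicts the other array's line, and padding one base address by a single cache line removes the thrashing.

```python
def conflict_misses(array_bases, n_elems, cache_lines=256, line_bytes=32, elem=8):
    """Count misses in a direct-mapped cache for a loop that accesses
    a[i], b[i], ... from each array in turn on every iteration."""
    lines = [None] * cache_lines      # tag stored per cache set
    misses = 0
    for i in range(n_elems):
        for base in array_bases:
            line = (base + i * elem) // line_bytes
            idx = line % cache_lines
            if lines[idx] != line:    # tag mismatch: miss and fill
                lines[idx] = line
                misses += 1
    return misses
```

With a 256-line, 32-byte-line cache (8KB), two 8KB arrays at bases 0 and 8192 conflict on every access (2048 misses for 1024 iterations), while padding the second base by one line leaves only the 512 cold misses.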
Compiler-Assisted Cache Replacement: Problem Formulation and Performance Evaluation

Recent research results show that conventional hardware-only cache solutions result in unsatisfactory cache utilization for both regular and irregular applications. To overcome this problem, a number of architectures introduce instruction hints to assist cache replacement. For example, the Intel Itanium architecture augments memory-accessing instructions with cache hints to distinguish data that will be referenced in the near future from the rest. With the availability of such methods, the performance of the underlying cache architecture critically depends on the ability of the compiler to generate code with appropriate cache hints. In this paper we formulate this problem – giving cache hints to memory instructions such that the cache miss rate is minimized – as a 0/1 knapsack problem, which can be efficiently solved using a dynamic programming algorithm. The proposed approach has been implemented in our compiler testbed and evaluated on a set of scientific computing benchmarks. Initial results show that our approach is effective in reducing the cache miss rate and improving program performance.

Hongbo Yang, R. Govindarajan, Guang R. Gao, Ziang Hu
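For reference, the 0/1 knapsack problem the authors reduce hint assignment to has a textbook dynamic-programming solution; the sketch below is the generic DP (the mapping of memory instructions to items and of cache capacity to knapsack capacity is the paper's, the code is not).

```python
def knapsack(capacity, items):
    """0/1 knapsack by dynamic programming.

    items: list of (value, weight) pairs.  Returns the best total
    value achievable without exceeding capacity.
    """
    best = [0] * (capacity + 1)
    for value, weight in items:
        # Traverse capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]
```

The DP runs in O(n * capacity) time, which is what makes an exact solution practical at compile time for modest cache sizes.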
Memory-Constrained Data Locality Optimization for Tensor Contractions

The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions over large multi-dimensional arrays. Efficient computation of these contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, requiring their storage on disk. However, the intermediates can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations a combination of loop fusion and loop tiling is required, so that the cost of disk I/O is minimized. In this paper, we address the memory-constrained data-locality optimization problem in the context of this class of computations. We develop an optimization framework to search among a space of fusion and tiling choices to minimize the data movement overhead. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.

Alina Bibireata, Sandhya Krishnan, Gerald Baumgartner, Daniel Cociorva, Chi-Chung Lam, P. Sadayappan, J. Ramanujam, David E. Bernholdt, Venkatesh Choppella
Compositional Development of Parallel Programs

This paper presents a programming model, an interface definition language (P-COM2), and a compiler that composes parallel and distributed programs from independently written components. P-COM2 specifications incorporate information on the behaviors and implementations of components to enable qualification of components for effectiveness in specific application instances and execution environments. The programming model targets the development of families of related programs. One objective is to be able to compose programs that are near-optimal for given application instances and execution environments. Component-oriented development is motivated for parallel and distributed computations. The programming model is defined, described, and illustrated with a simple example, and the compilation process is briefly described. Experience with a more complex application, a generalized fast multipole solver, is sketched, including performance data, some of which was surprising.

Nasim Mahmood, Guosheng Deng, James C. Browne
Supporting High-Level Abstractions through XML Technology

Development of applications that process large scientific datasets is often complicated by complex and specialized data storage formats. In this paper, we describe the use of XML technologies for supporting high-level programming methodologies for processing scientific datasets. We show how XML Schemas can be used to give a high-level abstraction of a dataset to an application developer. A corresponding low-level Schema describes the actual layout of data and is used by the compiler for code generation. The compiler needs a systematic way for translating the high-level code to a low-level code. Then, it needs to transform the generated low-level code to achieve high locality and efficient execution. This paper describes our approach to these two problems. By using Active Data Repository as the underlying runtime system, we offer an XML based front-end for storing, retrieving, and processing flat-file based scientific datasets in a cluster environment.

Xiaogang Li, Gagan Agrawal
Applications of HPJava

We describe two applications of our HPJava language for parallel computing. The first is a multigrid solver for a Poisson equation, and the second is a CFD application that solves the Euler equations for inviscid flow. We illustrate how the features of the HPJava language allow these algorithms to be expressed in a straightforward and convenient way. Performance results on an IBM SP3 are presented.

Bryan Carpenter, Geoffrey Fox, Han-Ku Lee, Sang Boem Lim
Programming for Locality and Parallelism with Hierarchically Tiled Arrays

This paper introduces a new primitive data type, hierarchically tiled arrays (HTAs), which could be incorporated into conventional languages to facilitate parallel programming and programming for locality. It is argued that HTAs enable a natural representation for many algorithms with a high degree of locality. The paper also shows that, with HTAs, parallel computations and the associated communication operations can be expressed as array operations within single-threaded programs. This, it is argued, facilitates reasoning about the resulting programs and stimulates the development of code that is highly readable and easy to modify. The new data type is illustrated using examples written in an extended version of MATLAB.

Gheorghe Almási, Luiz De Rose, Basilio B. Fraguela, José Moreira, David Padua
Co-array Fortran Performance and Potential: An NPB Experimental Study

Co-array Fortran (CAF) is an emerging model for scalable, global address space parallel programming that consists of a small set of extensions to the Fortran 90 programming language. Compared to MPI, the widely-used message-passing programming model, CAF’s global address space programming model simplifies the development of single-program-multiple-data parallel programs by shifting the burden for choreographing and optimizing communication from developers to compilers. This paper describes an open-source, portable, and retargetable CAF compiler under development at Rice University that is well-suited for today’s high-performance clusters. Our compiler translates CAF into Fortran 90 plus calls to one-sided communication primitives. Preliminary experiments comparing CAF and MPI versions of several of the NAS parallel benchmarks on an Itanium 2 cluster with a Myrinet 2000 interconnect show that our CAF compiler delivers performance that is roughly equal to or, in many cases, better than that of programs parallelized using MPI, even though support for global optimization of communication has not yet been implemented in our compiler.

Cristian Coarfa, Yuri Dotsenko, Jason Eckhardt, John Mellor-Crummey
Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures

We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes. We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC) on both shared and distributed-memory architectures. We find that language features can make parallel programs easier to write, but cannot hide the underlying communication costs for the target parallel architecture. Powerful compiler analysis and optimization can help reduce software overhead, but features such as fine-grain remote accesses are inherently expensive on clusters. To avoid large reductions in performance, language features must avoid degrading the performance of local computations.

Konstantin Berlin, Jun Huan, Mary Jacob, Garima Kochhar, Jan Prins, Bill Pugh, P. Sadayappan, Jaime Spacco, Chau-Wen Tseng
Putting Polyhedral Loop Transformations to Work

We seek to extend the scope and efficiency of iterative compilation techniques by searching not only for program transformation parameters but for the most appropriate transformations themselves. For that purpose, we need a generic way to express program transformations and compositions of transformations. In this article, we introduce a framework for the polyhedral representation of a wide range of transformations in a unified way. We also show that it is possible to generate efficient code after the application of polyhedral program transformations. Finally, we demonstrate an implementation of the polyhedral representation and code generation techniques in the Open64/ORC compiler.

Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, Olivier Temam
Index-Association Based Dependence Analysis and its Application in Automatic Parallelization

In this paper, we present a technique to perform dependence analysis on more complex array subscripts than the linear form of the enclosing loop indices. For such complex array subscripts, we decouple the original iteration space and the dependence test iteration space and link them through index-association functions. Dependence analysis is performed in the dependence test iteration space to determine whether the dependence exists in the original iteration space. The dependence distance in the original iteration space is determined by the distance in the dependence test iteration space and the property of index-association functions. For certain non-linear expressions, we show how to equivalently transform them to a set of linear expressions. The latter can be used in traditional dependence analysis techniques targeting subscripts which are linear forms of enclosing loop indices. We also show how our advanced dependence analysis technique can help parallelize some otherwise hard-to-parallelize loops.

Yonghong Song, Xiangyun Kong
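Once subscripts have been brought into linear form, the "traditional dependence analysis techniques" the abstract refers back to apply; a minimal sketch of one classical example, the GCD test (illustrative, not the authors' index-association analysis):

```python
from math import gcd
from functools import reduce

def gcd_test(coeffs, const):
    """Classical GCD dependence test for a1*i1 + ... + an*in = c:
    an integer solution (hence a dependence) is impossible unless
    gcd(a1, ..., an) divides c."""
    g = reduce(gcd, coeffs)
    if g == 0:                 # all-zero coefficients: degenerate case
        return const == 0
    return const % g == 0      # True: a dependence cannot be ruled out
```

For accesses a(2i) and a(2j+1), the equation 2i - 2j = 1 has gcd 2, which does not divide 1, so the two references are independent.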
Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling
Reducing the Price of Naivety

Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been proposed as a compromise between row-major and column-major for two-dimensional arrays. Morton layout offers some spatial locality whether traversed row-wise or column-wise. The goal of this paper is to make this an attractive compromise, offering close to the performance of row-major traversal of row-major layout, while avoiding the pathological behaviour of column-major traversal. We explore how spatial locality of Morton layout depends on the alignment of the array’s base address, and how unrolling has to be aligned to reduce address calculation overhead. We conclude with extensive experimental results using five common processors and a small suite of benchmark kernels.

Jeyarajan Thiyagalingam, Olav Beckmann, Paul H. J. Kelly
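The Morton (Z-order) index underlying the layout is defined by interleaving the bits of the row and column indices; a minimal sketch (illustrative, not the authors' implementation):

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of row and col to form a Morton (Z-order)
    index: element (row, col) is stored at base + morton_index * size."""
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)   # row bits to odd positions
        idx |= ((col >> b) & 1) << (2 * b)       # col bits to even positions
    return idx
```

Because consecutive Morton indices stay within small aligned blocks, where those blocks fall relative to cache-line and page boundaries depends on the array's base address, which is why the paper's alignment of the base address (and of the unrolled loop body) matters.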
Spatial Views: Space-Aware Programming for Networks of Embedded Systems

Networks of embedded systems, in the form of cell phones, PDAs, wearable computers, and sensors connected through wireless networking technology, are emerging as an important computing platform. The ubiquitous nature of such a platform promises exciting applications. This paper presents a new programming model for a network of embedded systems, called Spatial Views, targeting its dynamic, space-sensitive and resource-restrained characteristics. The core of the proposed model is iterative programming over a dynamic collection of nodes identified by the physical spaces they are in and the services they provide. Hidden in the iteration is execution migration as the main collaboration paradigm, constrained by user specified limits on resource usage such as response time and energy consumption. A Spatial Views prototype has been implemented, and first results are reported.

Yang Ni, Ulrich Kremer, Liviu Iftode
Operation Reuse on Handheld Devices

Compilers have long used redundancy removal to improve program execution speed. For handheld devices, redundancy removal is particularly attractive because it improves execution speed and energy efficiency at the same time. In a broad view, redundancy exists in many different forms, e.g., redundant computations and redundant branches. We briefly describe our recent efforts to expand the scope of redundancy removal. We attain computation reuse by replacing a code segment by a table look-up. We use IF-merging to merge conditional statements into a single conditional statement. We present part of our preliminary experimental results from an HP/Compaq iPAQ PDA.

Yonghua Ding, Zhiyuan Li
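The table-lookup form of computation reuse can be sketched with a small direct-mapped reuse table (a hypothetical illustration; the authors' compiler generates the table and probe inline rather than using a wrapper):

```python
def reusable(func, table_size=64):
    """Replace a pure code segment with a table lookup: cache
    (arguments -> result) in a direct-mapped table and fall back to
    recomputation on a table conflict."""
    table = [None] * table_size
    def wrapper(*args):
        slot = hash(args) % table_size
        entry = table[slot]
        if entry is not None and entry[0] == args:
            return entry[1]               # reuse: skip the computation
        result = func(*args)
        table[slot] = (args, result)      # install for future reuse
        return result
    return wrapper
```

Skipping the recomputation saves both cycles and energy, which is why the technique is doubly attractive on handheld devices; it is only safe for code segments with no side effects.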
Memory Redundancy Elimination to Improve Application Energy Efficiency

Application energy consumption has become an increasingly important issue for both high-end microprocessors and mobile and embedded devices. A multitude of circuit- and architecture-level techniques have been developed to improve application energy efficiency. However, relatively little work has studied the effects of compiler transformations on application energy efficiency. In this paper, we use energy-estimation tools to profile the execution of benchmark applications. The results show that energy consumption due to memory instructions accounts for a large share of total energy. An effective compiler technique that can improve energy efficiency is memory redundancy elimination. It reduces both application execution cycles and the number of cache accesses. We evaluate the energy improvement over 12 benchmark applications from SPEC2000 and MediaBench. The results show that memory redundancy elimination can significantly reduce energy in the processor clocking network and the instruction and data caches. The overall application energy consumption can be reduced by up to 15%, and the reduction in terms of energy-delay product is up to 24%.

Keith D. Cooper, Li Xu
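The core idea of memory redundancy elimination can be sketched as a forward pass over straight-line code that replaces a load from an already-loaded address with a register copy (a simplified illustration: it assumes distinct symbolic addresses never alias, whereas a real compiler must kill available values conservatively at stores and calls):

```python
def eliminate_redundant_loads(instrs):
    """One pass over straight-line three-address code given as
    (op, reg, addr) tuples with op in {'load', 'store'}."""
    available = {}   # address -> register already holding its value
    out = []
    for op, reg, addr in instrs:
        if op == 'load' and addr in available:
            out.append(('copy', reg, available[addr]))  # no cache access
        elif op == 'load':
            out.append(('load', reg, addr))
            available[addr] = reg
        else:  # store: the stored register now holds the memory value
            out.append((op, reg, addr))
            available[addr] = reg
    return out
```

Each eliminated load removes a cache access as well as an instruction, which is the source of the energy savings the abstract reports.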
Adaptive MPI

Processor virtualization is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. Charm++ is an early language/system that supports processor virtualization. This paper describes Adaptive MPI or AMPI, an MPI implementation and extension, that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degree of freedom afforded by allowing it to assign VPs onto processors. With this runtime system, AMPI supports such features as automatic adaptive overlap of communication and computation and automatic load balancing. It can also support other features such as checkpointing without additional user code, and the ability to shrink and expand the set of processors used by a job at runtime. This paper describes AMPI, its features, benchmarks that illustrate performance advantages and tradeoffs offered by AMPI, and application experiences.

Chao Huang, Orion Lawlor, L. V. Kalé
MPJava: High-Performance Message Passing in Java Using Java.nio

We explore advances in Java Virtual Machine (JVM) technology along with new high-performance I/O libraries in Java 1.4, and find that Java is increasingly an attractive platform for scientific cluster-based message-passing codes. We report that these new technologies allow a pure Java implementation of a cluster communication library that performs competitively with standard C-based MPI implementations.

William Pugh, Jaime Spacco
Polynomial-Time Algorithms for Enforcing Sequential Consistency in SPMD Programs with Arrays

The simplest semantics for parallel shared memory programs is sequential consistency in which memory operations appear to take place in the order specified by the program. But many compiler optimizations and hardware features explicitly reorder memory operations or make use of overlapping memory operations which may violate this constraint. To ensure sequential consistency while allowing for these optimizations, traditional data dependence analysis is augmented with a parallel analysis called cycle detection. In this paper, we present new algorithms to enforce sequential consistency for the special case of the Single Program Multiple Data (SPMD) model of parallelism. First, we present an algorithm for the basic cycle detection problem, which lowers the running time from O(n³) to O(n²). Next, we present three polynomial-time methods that more accurately support programs with array accesses. These results are a step toward making sequentially consistent shared memory programming a practical model across a wide range of languages and hardware platforms.

Wei-Yu Chen, Arvind Krishnamurthy, Katherine Yelick
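The operation at the heart of this line of work is detecting cycles in a graph of program and conflict edges; shown below is the generic O(V+E) depth-first formulation of cycle detection (an illustration only: the paper's contribution is a specialized algorithm for the SPMD conflict-graph structure, not this textbook DFS):

```python
def has_cycle(adj):
    """Detect a cycle in a directed graph given as an adjacency dict
    {node: [successors]} using the classic three-color DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    def dfs(v):
        color[v] = GRAY                       # on the current DFS path
        for w in adj.get(v, ()):
            if color.get(w, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(w, WHITE) == WHITE and dfs(w):
                return True
        color[v] = BLACK                      # fully explored
        return False
    return any(color[v] == WHITE and dfs(v) for v in adj)
```

A memory access that lies on such a cycle cannot be reordered without risking a sequential-consistency violation, so the compiler must fence or serialize it.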
C3: A System for Automating Application-Level Checkpointing of MPI Programs

Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In previous work ([1], [2]), we presented a distributed checkpoint coordination protocol that handles MPI's point-to-point and collective constructs while dealing with the unique challenges of application-level checkpointing. We implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. This thin layer is used by C3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
The Power of Belady’s Algorithm in Register Allocation for Long Basic Blocks

Optimization techniques such as loop unrolling and trace scheduling can result in long straight-line codes. It is, however, unclear how well the register allocation algorithms of current compilers perform on these codes. Compilers may well have been optimized for human-written codes, which are likely to have short basic blocks. To evaluate how state-of-the-art compilers behave on long straight-line codes, we wrote a compiler that implements the simple Belady's MIN algorithm. The main contribution of this paper is the evaluation of Belady's MIN algorithm when used for register allocation for long straight-line codes. These codes were executed on a MIPS R12000 processor. Our results show that applications compiled using Belady's MIN algorithm run faster than when compiled with the MIPSPro or GCC compiler. In particular, Fast Fourier Transforms (FFTs) of size 32 and 64 run 12% and 33% faster, respectively, than when compiled using the MIPSPro compiler.

Jia Guo, María Jesús Garzarán, David Padua
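Belady's MIN policy itself is easy to state: when every register is occupied and a new value is needed, evict the value whose next use lies furthest in the future. A minimal simulation of the policy on a value-access trace (an illustration of the eviction rule, not the authors' allocator):

```python
def min_evictions(trace, num_regs):
    """Count evictions (spills) when keeping the values of `trace`
    in num_regs registers under Belady's MIN replacement policy."""
    # Backward pass: next-use position of the value at each access.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float('inf'))
        last_seen[trace[i]] = i
    in_regs, future, spills = set(), {}, 0
    for i, v in enumerate(trace):
        future[v] = next_use[i]           # refresh v's next-use position
        if v in in_regs:
            continue                      # already resident
        if len(in_regs) == num_regs:      # pressure: evict by MIN
            victim = max(in_regs, key=lambda x: future[x])
            in_regs.remove(victim)
            spills += 1
        in_regs.add(v)
    return spills
```

MIN requires knowing future uses, which is exactly what a compiler has for a single long basic block; that is why the offline-optimal policy, impractical for hardware caches, is practical here.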
Load Elimination in the Presence of Side Effects, Concurrency and Precise Exceptions

Partial redundancy elimination can reduce the number of loads corresponding to field and array accesses in Java programs. The reuse of values loaded from memory at subsequent occurrences of load expressions must be done with care: precise exceptions and the potential of side effects through method invocations and concurrent modifications in multi-threaded programs must be considered. This work focuses on the effect of concurrency on the load optimization. Unlike previous approaches, our system determines accurate information about side effects and concurrency through a whole-program analysis. Partial redundancy elimination is extended to exploit this information and to broaden the optimization scope. There are three main results: (1) Load elimination is effective even in the most conservative variant without side effect and concurrency analysis (avg. dynamic reduction of loads 23.4%, max. 55.6%). (2) Accurate side effect information can significantly increase the number of optimized expressions (avg. dynamic reduction of loads 28.6%, max. 66.1%). (3) Information about concurrency can make the optimization independent of the memory model, enables aggressive optimization across synchronization statements, and can improve the number of optimization opportunities compared to an uninformed optimizer that is guided by a (weak) memory model (avg. dynamic reduction of loads 28.5%, max. 70.3%).

Christoph von Praun, Florian Schneider, Thomas R. Gross
To Inline or Not to Inline? Enhanced Inlining Decisions

The decision to inline a procedure in the Open Research Compiler (ORC) was based on a temperature heuristic that takes into consideration the time spent in a procedure and the size of the procedure. In this paper we describe the trade-offs that must be worked out to make correct inlining decisions. We introduce two new heuristics to enhance the ORC inlining heuristics: adaptation and cycle_density. Adaptation allows us to vary the temperature threshold and avoids penalizing small benchmarks. Cycle_density prevents the inlining of procedures that have a high temperature in spite of being called infrequently. Experiments show that while adaptation improves the speedup obtained with inlining across the SPEC2000 suite, cycle_density significantly reduces both the code growth and the compilation-time increase caused by inlining. We then characterize the SPEC INT2000 benchmarks according to the inlining potential of their function calls. Our enhancements are released in ORC 2.0.

Peng Zhao, José Nelson Amaral
A Preliminary Study on the Vectorization of Multimedia Applications for Multimedia Extensions

In 1994, the first multimedia extension, MAX-1, was introduced into general-purpose processors by HP. Almost ten years later, the principal means of accessing the computing power of multimedia extensions are still limited mostly to assembly programming, intrinsic functions, and the use of system libraries. Because of the similarities between multimedia extensions and vector processors, it is believed that traditional vectorization can be used to compile for multimedia extensions. Can traditional vectorization effectively vectorize multimedia applications for multimedia extensions? If not, what additional techniques are needed? To answer these two questions, we conducted a code study of the Berkeley Multimedia Workload. Through this study, we identified several new challenges that arise in vectorizing for multimedia extensions and propose solutions to these challenges.

Gang Ren, Peng Wu, David Padua
A Data Cache with Dynamic Mapping

Dynamic Mapping is an approach to coping with the loss of performance due to cache interference and to improving the performance predictability of blocked algorithms on modern architectures. An example is matrix multiply: tiling matrix multiply for a 16KB data cache using the optimal tile size achieves an average data-cache miss rate of 3%, but with peaks of 16% due to interference. Dynamic Mapping is a software-hardware approach in which the mapping into the cache is determined at compile time by manipulating the address used by the data cache. The reduction of cache misses translates into a 2-fold speedup for matrix multiply and FFT by eliminating data-cache miss spikes. Dynamic Mapping has the same goal as other proposed approaches, but it determines the cache mapping before issuing a load. It uses the computational power of the processor, rather than the memory controller or the data-cache mapping, and it has no effect on the access time of memory and cache. It combines several concepts, such as non-standard cache-mapping functions and data-layout reorganization, potentially without any overhead.

Paolo D’Alberto, Alexandru Nicolau, Alexander Veidenbaum
Compiler-Based Code Partitioning for Intelligent Embedded Disk Processing

Recent trends indicate that system intelligence is moving from main computational units to peripherals. In particular, several studies show the feasibility of building an intelligent disk architecture by executing some parts of the application code on an embedded processor attached to the disk system. This paper focuses on such an architecture and addresses the problem of deciding which parts of the application code should be executed on the embedded processor attached to the disk system. Our focus is on image and video processing applications where large data sets (mostly arrays) need to be processed. To decide the work division between the disk system and the host system, we use an optimizing compiler to identify computations that exhibit a filtering characteristic; i.e., their output data sets are much smaller than their input data sets. By performing such computations on the disk, we substantially reduce the data volume that needs to be communicated from the disk to the host system. Our experimental results show significant improvements in the execution cycles of six applications.

Guilin Chen, Guangyu Chen, M. Kandemir, A. Nadgir
Much Ado about Almost Nothing: Compilation for Nanocontrollers

Advances in nanotechnology have made it possible to assemble nanostructures into a wide range of micrometer-scale sensors, actuators, and other novel devices, and to place thousands of such devices on a single chip. Most of these devices can benefit from intelligent control, but the control often requires full programmability for each device’s controller. This paper presents a combination of programming language, compiler technology, and target architecture that together provide full MIMD-style programmability with per-processor circuit complexity low enough to allow each nanotechnology-based device to be accompanied by its own nanocontroller.

Henry G. Dietz, Shashi D. Arcot, Sujana Gorantla
Increasing the Accuracy of Shape and Safety Analysis of Pointer-Based Codes

Analyses and transformations of programs that manipulate pointer-based data structures rely on understanding the topological relationships between the nodes, i.e., the overall shape of the data structures. Current static shape analyses either assume correctness of the code or trade off accuracy for analysis performance, leading in most cases to shape information that is of little use for practical purposes. This paper introduces four novel analysis techniques, namely structural fields, scan loops, assumed/verified shape properties, and context tracing. Analysis of structural fields allows compilers to uncover node configurations that play key roles in the data structure. Analysis of scan loops allows compilers to establish accurate relationships between pointer variables while traversing the data structures. Assumed/verified property analysis derives sufficient shape properties that guarantee termination of scan loops. These properties must then be verified during shape analysis for consistency. Context tracing allows the analyses to isolate data structure nodes by tracing relationships between pointer variables along control-flow paths in the program. We believe that future static shape and safety analysis algorithms will have to include some, if not all, of these techniques to attain a high level of accuracy. In this paper we illustrate the application of the proposed techniques to codes that build (correctly as well as incorrectly) data structures that are beyond the reach of current approaches.
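A scan loop, in the abstract's sense, is a loop that traverses a pointer-based structure without modifying it. The list type and search routine below are our illustration, not code from the paper; they show the pattern such an analysis must reason about.

```cpp
#include <cstddef>

// A minimal singly linked list node.
struct Node {
    int value;
    Node* next;
};

// A scan loop: 'cur' traverses the structure reachable from 'head'.
// An accurate shape analysis must establish that 'cur' always points
// into the list rooted at 'head', and that the list is acyclic -- the
// shape property that guarantees this loop terminates.
Node* find(Node* head, int key) {
    for (Node* cur = head; cur != nullptr; cur = cur->next)
        if (cur->value == key)
            return cur;
    return nullptr;
}
```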

Pedro C. Diniz
Slice-Hoisting for Array-Size Inference in MATLAB

Inferring variable types precisely is essential for compiling MATLAB libraries effectively in the context of the telescoping languages framework being developed at Rice. Past studies have demonstrated the value of type information in optimizing MATLAB [4]. The variable types are inferred through a static approach based on writing propositional constraints on program statements [11]. The static approach has certain limitations with respect to inferring array sizes. Imprecise inference of array sizes can have a drastic effect on the performance of the generated code, especially in those cases where arrays are resized dynamically. The impact of appropriate array allocation is also borne out by earlier studies [3]. This paper presents a new approach to inferring array sizes, called slice-hoisting. The approach is based on simple code transformations and is easy to implement in a practical compiler. Experimental evaluation shows that slice-hoisting, along with the constraints-based static algorithm, can result in a very high level of precision in inferring MATLAB array sizes.
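The core transformation can be sketched in a few lines. The paper applies slice-hoisting to MATLAB; the C++ pair below is our own analogy: the slice of the computation that determines the array's size is hoisted ahead of the loop, so the array is allocated once at its final size instead of being resized dynamically.

```cpp
#include <cstddef>
#include <vector>

// Before: the array grows inside the loop, so its final size is only
// known when the loop finishes -- the dynamic-resizing pattern that
// imprecise array-size inference leaves in generated code.
std::vector<int> squares_growing(std::size_t n) {
    std::vector<int> a;
    for (std::size_t i = 0; i < n; ++i)
        a.push_back(static_cast<int>(i * i));  // may reallocate repeatedly
    return a;
}

// After slice-hoisting: the size-determining slice (here just 'n') is
// hoisted before the loop, enabling a single full-size allocation.
std::vector<int> squares_hoisted(std::size_t n) {
    std::vector<int> a(n);  // size known up front
    for (std::size_t i = 0; i < n; ++i)
        a[i] = static_cast<int>(i * i);
    return a;
}
```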

Arun Chauhan, Ken Kennedy
Efficient Execution of Multi-query Data Analysis Batches Using Compiler Optimization Strategies

This work investigates the leverage that can be obtained from compiler optimization techniques for efficient execution of multi-query workloads in data analysis applications. Our approach is to address multi-query optimization at the algorithmic level, by transforming a declarative specification of scientific data analysis queries into a high-level imperative program that can be made more efficient by applying compiler optimization techniques. These techniques – including loop fusion, common subexpression elimination and dead code elimination – are employed to allow data and computation reuse across queries. We describe a preliminary experimental analysis on a real remote sensing application that analyzes very large quantities of satellite data. The results show that our techniques achieve sizable reductions in the amount of computation and I/O necessary for executing query batches and in average execution times for the individual queries in a given batch.
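The flavor of the reuse the abstract describes can be shown with a small sketch. The two "queries" and the correction formula below are hypothetical, not from the paper: one query wants the sum of a per-pixel corrected value over a satellite band, the other its maximum. Fusing their loops and sharing the common subexpression reads the data once and computes the correction once.

```cpp
#include <limits>
#include <vector>

struct Result {
    double sum;
    double max;
};

// Fused evaluation of two queries over the same band. Loop fusion means
// one pass over 'band' (one round of I/O); common subexpression
// elimination means 'corrected' is computed once and shared.
Result fused_queries(const std::vector<double>& band, double gain, double bias) {
    Result r{0.0, -std::numeric_limits<double>::infinity()};
    for (double raw : band) {
        double corrected = gain * raw + bias;  // shared subexpression
        r.sum += corrected;                    // query 1: sum
        if (corrected > r.max)                 // query 2: max
            r.max = corrected;
    }
    return r;
}
```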

Henrique Andrade, Suresh Aryangat, Tahsin Kurc, Joel Saltz, Alan Sussman
Semantic-Driven Parallelization of Loops Operating on User-Defined Containers

We describe ROSE, a C++ infrastructure for source-to-source translation, which provides an interface for programmers to easily write their own translators for optimizing the use of high-level abstractions. Utilizing the semantics of these high-level abstractions, we demonstrate the automatic parallelization of loops that iterate over user-defined containers that have interfaces similar to the lists, vectors, and sets in the Standard Template Library (STL). The parallelization is realized in two phases. First, we insert OpenMP directives into a serial program, driven by the recognition of the high-level abstractions, containers, that are thread-safe. Then, we translate the OpenMP directives into library routines that explicitly create and manage parallelism. By providing an interface for the programmer to classify the semantics of their abstractions, we are able to automatically parallelize operations on containers, such as linked lists, without resorting to complex loop dependence analysis techniques. Our approach is consistent with general goals within telescoping languages.
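The first phase the abstract describes can be illustrated as follows. The container and loop are stand-ins of our own, not ROSE output: once a container's element-wise operations are classified as safe for concurrent execution, the serial loop can be annotated with an OpenMP worksharing directive instead of being proven independent by loop dependence analysis.

```cpp
#include <vector>

// Serial element-wise loop over a container whose semantics were
// declared thread-safe for independent element updates. The OpenMP
// directive is the kind ROSE's first phase would insert; without an
// OpenMP compiler flag the pragma is simply ignored and the loop
// runs serially, producing the same result.
void scale_all(std::vector<double>& v, double factor) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        v[i] *= factor;
}
```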

Dan Quinlan, Markus Schordan, Qing Yi, Bronis R. de Supinski
Cetus – An Extensible Compiler Infrastructure for Source-to-Source Transformation

Cetus is a compiler infrastructure for the source-to-source transformation of programs. We created Cetus out of the need for a compiler research environment that facilitates the development of interprocedural analysis and parallelization techniques for C, C++, and Java programs. We describe our rationale for creating a new compiler infrastructure and give an overview of the Cetus architecture. The design is intended to be extensible to multiple languages and will become more flexible as we incorporate feedback from any difficulties we encounter in introducing other languages. We characterize the runtime behavior of Cetus’ parsing and IR generation in terms of execution time, memory usage, and parallel speedup of parsing, and motivate its usefulness through examples of projects that use Cetus. We then compare these results with those of the Polaris Fortran translator.

Sang-Ik Lee, Troy A. Johnson, Rudolf Eigenmann
Backmatter
Metadata
Title
Languages and Compilers for Parallel Computing
Edited by
Lawrence Rauchwerger
Copyright Year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-24644-2
Print ISBN
978-3-540-21199-0
DOI
https://doi.org/10.1007/b95707