
2011 | Book

Compiler Construction

20th International Conference, CC 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011, Saarbrücken, Germany, March 26–April 3, 2011. Proceedings


About this Book

This book constitutes the refereed proceedings of the 20th International Conference on Compiler Construction, CC 2011, held in Saarbrücken, Germany, March 26–April 3, 2011, as part of ETAPS 2011, the European Joint Conferences on Theory and Practice of Software. The 15 revised full papers presented together with the abstract of one invited talk were carefully reviewed and selected from 52 submissions. The papers are organized in topical sections on JIT compilation and code generation, program analysis, reversible computing and interpreters, parallelism and high-performance computing, and task and data distribution.

Table of Contents

Frontmatter

Invited Talk

Future-Proofing Collections: From Mutable to Persistent to Parallel
Abstract
Multicore processors are on every desk now. How are we going to make use of the extra power they provide? Some think that actors, or transactional memory, or some other new concurrency construct will save the day by making concurrent programming easier and safer. Even though these are welcome, I am skeptical about their ultimate success. Concurrency is fundamentally hard and no dressing up will be able to hide that fact completely.
A safer and, for the programmer, much simpler alternative is to treat parallel execution as essentially an optimization. A promising application area is collections. Programming by transforming and aggregating collections is simple, powerful, and can be optimized by executing bulk operations in parallel. To be able to do this in practice, any side effects of parallel operations need to be carefully controlled. This means that immutable, persistent collections are more suitable than mutable ones.
In this talk I will describe the new Scala collections framework, and show how it allows a seamless migration from traditional mutable collections to persistent collections, and from there to parallel collections. I will show how the same vocabulary of methods can be used for either type of collection, and how one can have parallel as well as sequential views on the same underlying collection.
The design of this framework is guided by the “uniform return type principle”: every collection transformer should return the same kind of collection it applies to. Simple as this sounds, it is surprisingly hard to achieve in a statically typed language with a rich type hierarchy (in fact, I know of no other collections framework that achieved it). In the talk I will explain the type-systematic constructions required by the principle. I will also present some thoughts on how we might develop type-explanation capabilities of compilers to effectively support these techniques in a user-friendly way.
Martin Odersky
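To make the uniform return type principle concrete, here is a minimal Scala sketch (an illustration, not code from the talk), assuming a pre-2.13 Scala where .par is part of the standard library (in 2.13+ parallel collections live in the separate scala-parallel-collections module):

// Uniform return type principle: map on a Vector yields a Vector,
// map on a mutable ListBuffer yields a ListBuffer, and .par gives a
// parallel view that answers the same vocabulary of methods.
import scala.collection.mutable.ListBuffer

object UniformReturn {
  def main(args: Array[String]): Unit = {
    val v  = Vector(1, 2, 3, 4)
    val v2 = v.map(_ * 2)         // v2: Vector[Int], not a generic Seq
    val b  = ListBuffer(1, 2, 3)
    val b2 = b.map(_ * 2)         // b2: ListBuffer[Int]
    val p2 = v.par.map(_ * 2)     // same bulk operation, run in parallel
    println(s"$v2 $b2 ${p2.seq}")
  }
}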

JIT Compilation and Code Generation

Dynamic Elimination of Overflow Tests in a Trace Compiler
Abstract
Trace compilation is a technique used by just-in-time (JIT) compilers such as TraceMonkey, the JavaScript engine in the Mozilla Firefox browser. Unlike traditional JIT compilers, a trace compiler works on only part of the source program, normally a linear path inside a heavily executed loop. Because the trace is compiled during the interpretation of the source program, the JIT compiler has access to runtime values, which gives it the opportunity to produce binary code specialized to those values. In this paper we exploit this opportunity to provide an analysis that removes unnecessary overflow tests from JavaScript programs. Our optimization uses range analysis to show that some operations cannot produce overflows. The analysis is linear in time and space in the number of instructions in the input trace, and it is more effective than traditional range analyses because we have access to values known only at execution time. We have implemented our analysis on top of Firefox's TraceMonkey and tested it on over 1,000 scripts from several industrial-strength benchmarks, including the scripts present in the top 100 most visited webpages in the Alexa index. We generate binaries for either x86 or the embedded microprocessor ST40-300. On average, we eliminate 91.82% of the overflows in the programs present in the TraceMonkey test suite. This optimization provides an average code size reduction of 8.83% on ST40 and 6.63% on x86. Our optimization increases TraceMonkey's runtime by 2.53%.
Rodrigo Sol, Christophe Guillon, Fernando Magno Quintão Pereira, Mariza A. S. Bigonha
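As a rough illustration of the range-analysis idea (a sketch, not the TraceMonkey implementation): track each value's interval using 64-bit arithmetic, so the analysis itself cannot overflow, and drop the overflow guard exactly when the result of a 32-bit addition provably fits.

// Interval abstraction for 32-bit ints, computed in 64-bit arithmetic.
final case class Interval(lo: Long, hi: Long)

object OverflowGuards {
  private val IntLo = Int.MinValue.toLong
  private val IntHi = Int.MaxValue.toLong

  def add(a: Interval, b: Interval): Interval =
    Interval(a.lo + b.lo, a.hi + b.hi)

  // The trace compiler may omit the overflow test precisely when the
  // abstract result stays within 32-bit range.
  def needsGuard(a: Interval, b: Interval): Boolean = {
    val r = add(a, b)
    r.lo < IntLo || r.hi > IntHi
  }
}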
Staged Static Techniques to Efficiently Implement Array Copy Semantics in a MATLAB JIT Compiler
Abstract
MATLAB has gained widespread acceptance among scientists. Several dynamic aspects of the language contribute to its appeal, but also provide many challenges. One such problem is caused by the copy semantics of MATLAB. Existing MATLAB systems rely on reference-counting schemes to create copies only when a shared array representation is updated. This reduces array copies, but requires runtime checks.
We present a staged static analysis approach to determine when copies are not required. The first stage uses two simple, intraprocedural analyses, while the second stage combines a forward necessary copy analysis with a backward copy placement analysis. Our approach eliminates unneeded array copies without requiring reference counting or frequent runtime checks.
We have implemented our approach in the McVM JIT. Our results demonstrate that, for our benchmark set, there are significant overheads for both existing reference-counted and naive copy-insertion approaches, and that our staged approach is effective in avoiding unnecessary copies.
Nurudeen Lameed, Laurie Hendren
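For context, a minimal sketch (not McVM's code) of the reference-counted copy-on-write scheme whose per-update check the staged analysis aims to remove:

// Reference-counted copy-on-write array: assignment shares the buffer,
// and every write first checks whether the buffer is shared.
final class CowArray private (private var data: Array[Double],
                              private var refs: Array[Int]) {
  def this(n: Int) = this(new Array[Double](n), Array(1))

  def assign(): CowArray = { refs(0) += 1; new CowArray(data, refs) } // b = a

  def apply(i: Int): Double = data(i)

  def update(i: Int, v: Double): Unit = {
    if (refs(0) > 1) {      // the runtime check the static analysis elides
      refs(0) -= 1          // detach from the shared buffer
      data = data.clone()
      refs = Array(1)
    }
    data(i) = v
  }
}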
SSA-Based Register Allocation with PBQP
Abstract
Recent research shows that maintaining SSA form allows register allocation to be split into separate phases: spilling, register assignment, and copy coalescing. After spilling, register assignment can be done in polynomial time, but copy coalescing is NP-complete. In this paper we present an assignment approach with integrated copy coalescing that maps the problem to the Partitioned Boolean Quadratic Problem (PBQP). Compared to the state-of-the-art recoloring approach, this reduces the relative number of swap and copy instructions for the SPEC CINT2000 benchmark to 99.6% and 95.2%, respectively, while taking 19% less time for assignment and coalescing.
Sebastian Buchwald, Andreas Zwinkau, Thomas Bersch
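A toy rendering of the PBQP encoding (purely illustrative; the paper uses a dedicated solver): each node picks one register alternative, edge matrices charge interference with an effectively infinite cost, and a copy's zero diagonal rewards coalescing.

object PbqpToy {
  val regs = Vector("r0", "r1")
  val Inf  = 1000000                          // stand-in for "forbidden"

  // a and c interfere: choosing equal registers is forbidden.
  val interfere = Array(Array(Inf, 0), Array(0, Inf))
  // a = copy b: zero cost if both ends choose the same register.
  val copy      = Array(Array(0, 1), Array(1, 0))

  def main(args: Array[String]): Unit = {
    // Brute-force the 2x2x2 solution space for nodes a, b, c.
    val solutions = for {
      a <- regs.indices; b <- regs.indices; c <- regs.indices
    } yield (copy(a)(b) + interfere(a)(c), (a, b, c))
    val (cost, (a, b, c)) = solutions.minBy(_._1)
    println(s"cost=$cost a=${regs(a)} b=${regs(b)} c=${regs(c)}")
  }
}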

Program Analysis

Probabilistic Points-to Analysis for Java
Abstract
Probabilistic points-to analysis is an analysis technique for defining the probabilities of the points-to relations in programs. It provides the compiler with optimization opportunities such as speculative dead store elimination, speculative redundancy elimination, and speculative code scheduling. Although several static probabilistic points-to analysis techniques have been developed for the C language, they cannot be applied directly to Java because they do not handle classes, objects, inheritance, or virtual method invocation. In this paper, we propose a context-insensitive and flow-sensitive probabilistic points-to analysis for Java (JPPA) for statically predicting the probability of points-to relations at all program points (i.e., points before or after statements) of a Java program. JPPA first constructs an interprocedural control flow graph (ICFG) for a Java program, whose edges are labeled with probabilities calculated by an algorithm based on a static branch prediction approach, and then calculates the probabilistic points-to relations of the program based upon the ICFG. We have also developed a tool called Lukewarm to support JPPA and conducted an experiment comparing JPPA with a traditional context-insensitive and flow-sensitive points-to analysis approach. The experimental results show that JPPA is a precise and effective probabilistic points-to analysis technique for Java.
Qiang Sun, Jianjun Zhao, Yuting Chen
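To illustrate the core operation (a sketch under assumed semantics, not the JPPA algorithm): at a control-flow join, points-to probabilities combine as a weighted sum of the incoming edges' probabilities. For example, joining a branch taken with probability 0.9 where x points to o1 against one where x points to o2 yields x -> {o1: 0.9, o2: 0.1}.

object ProbPointsTo {
  // A state maps each variable to a distribution over abstract objects.
  type State = Map[String, Map[String, Double]]

  // Merge two branch states whose edges carry probabilities p and 1 - p.
  def join(s1: State, s2: State, p: Double): State = {
    val vars = s1.keySet ++ s2.keySet
    vars.map { v =>
      val d1 = s1.getOrElse(v, Map.empty[String, Double])
      val d2 = s2.getOrElse(v, Map.empty[String, Double])
      val objs = d1.keySet ++ d2.keySet
      v -> objs.map(o => o -> (p * d1.getOrElse(o, 0.0) +
                               (1 - p) * d2.getOrElse(o, 0.0))).toMap
    }.toMap
  }
}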
Faster Alias Set Analysis Using Summaries
Abstract
Alias sets are an increasingly used abstraction in situations that require flow-sensitive tracking of objects through different points in time and the ability to perform strong updates on individual objects. The interprocedural and flow-sensitive nature of these analyses often makes them difficult to scale. In this paper, we use two types of method summaries (callee and caller) to improve the performance of an interprocedural flow- and context-sensitive alias set analysis. We present callee method summaries and algorithms to compute them. The computed summaries contain sufficient escape and return-value information to selectively replace flow-sensitive analysis of methods without affecting analysis precision. When efficiency is a bigger concern, we also use caller method summaries, which provide conservative initial assumptions for pointer and aliasing relations at the start of a method. Using caller summaries in conjunction with callee summaries enables the alias set analysis to flow-sensitively analyze only methods containing points of interest, thereby reducing running time. We present results from empirically evaluating the use of these summaries for the alias set analysis. Additionally, we discuss precision results from a realistic client analysis for verifying temporal safety properties. The results show that although caller summaries theoretically reduce precision, empirically they do not. Furthermore, on average, using callee and caller summaries reduces the running time of the alias set analysis by 27% and 96%, respectively.
Nomair A. Naeem, Ondřej Lhoták
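One way to picture a callee summary (a hypothetical shape, not the paper's exact representation): record which parameters escape and whether the return value aliases a parameter, so a call site can be handled without flow-sensitively re-analysing the callee's body.

// Alias sets are identified by integer ids; each variable maps to its set.
final case class CalleeSummary(returnAliasesParam: Option[Int],
                               escapingParams: Set[Int])

object ApplySummary {
  // Apply `ret = m(args...)` to a caller state without entering m's body.
  // A fuller version would also weaken the sets of escapingParams.
  def call(state: Map[String, Int], ret: String, args: Vector[String],
           sum: CalleeSummary, freshId: Int): Map[String, Int] =
    sum.returnAliasesParam match {
      case Some(i) => state + (ret -> state(args(i))) // ret joins arg i's set
      case None    => state + (ret -> freshId)        // ret gets a fresh set
    }
}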
JPure: A Modular Purity System for Java
Abstract
Purity analysis is the problem of determining whether a method may have side effects. This has applications in automatic parallelisation, extended static checking, and more. We present a novel purity system for Java that employs purity annotations that can be checked modularly. This is done using a flow-sensitive, intraprocedural analysis. The system exploits two properties, called freshness and locality, to increase the range of methods that can be considered pure. JPure also includes an inference engine for annotating legacy code. We evaluate our system against several packages from the Java Standard Library. Our results indicate it is possible to uncover significant amounts of purity efficiently.
David J. Pearce
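A small illustration of the freshness/locality idea, in Scala terms (the paper's system is annotation-based Java): a method may mutate freely as long as the mutated object is freshly allocated and never escapes.

object PurityByLocality {
  // Observably pure despite internal mutation: the StringBuilder is
  // fresh (allocated here) and local (never escapes the method).
  def join(xs: List[String], sep: String): String = {
    val sb = new StringBuilder
    for (x <- xs) { if (sb.nonEmpty) sb ++= sep; sb ++= x }
    sb.toString
  }
}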
Tainted Flow Analysis on e-SSA-Form Programs
Abstract
Tainted flow attacks originate from program inputs maliciously crafted to exploit software vulnerabilities. These attacks are common in server-side scripting languages, such as PHP. In 1997, Ørbæk and Palsberg formalized the problem of detecting these exploits as an instance of type-checking and gave an O(V³) algorithm to solve it, where V is the number of program variables. A similar algorithm was implemented, ten years later, in the Pixy tool. In this paper we give an O(V²) solution to the same problem. Our solution uses Bodik et al.'s extended Static Single Assignment (e-SSA) program representation. The e-SSA form can be computed efficiently, and it enables us to solve the problem via a sparse data-flow analysis. Using the same infrastructure, we compared a state-of-the-art data-flow solution with our technique. Both approaches detected 36 vulnerabilities in well-known PHP programs. Our results show that our approach tends to outperform the data-flow algorithm for larger inputs. We have reported the bugs that we found, and an implementation of our algorithm is now publicly available.
Andrei Rimsa, Marcelo d’Amorim, Fernando Magno Quintão Pereira
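The sparse formulation can be pictured as a worklist over def-use edges (a sketch of the general idea, not the paper's algorithm): taint flows from sources along uses, while a sanitized definition, whose role e-SSA's live-range splits make explicit, is simply excluded from the graph and so stops propagation.

object TaintProp {
  // defUse(v) = variables defined directly from v; sanitized definitions
  // are excluded from the graph, so taint never crosses them.
  def tainted(sources: Set[String],
              defUse: Map[String, List[String]]): Set[String] = {
    val seen = scala.collection.mutable.Set.empty[String] ++= sources
    val work = scala.collection.mutable.Queue.empty[String] ++= sources
    while (work.nonEmpty) {
      val v = work.dequeue()
      for (w <- defUse.getOrElse(v, Nil) if seen.add(w)) work.enqueue(w)
    }
    seen.toSet
  }
}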

Reversible Computing and Interpreters

Clean Translation of an Imperative Reversible Programming Language
Abstract
We describe the translation techniques used for the code generation in a compiler from the high-level reversible imperative programming language Janus to the low-level reversible assembly language PISA. Our translation is both semantics preserving (correct), in that target programs compute exactly the same functions as their source programs (cleanly, with no extraneous garbage output), and efficient, in that target programs conserve the complexities of source programs. In particular, target programs only require a constant amount of temporary garbage space.
The given translation methods are generic, and should be applicable to any (imperative) reversible source language described with reversible flowcharts and reversible updates. To our knowledge, this is the first compiler between reversible languages where the source and target languages were independently developed; the first exhibiting both correctness and efficiency; and just the second compiler for reversible languages overall.
Holger Bock Axelsen
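The key property can be sketched in a few lines (a toy, far from the Janus-to-PISA translator): reversible updates have syntactic inverses, so inverse code is obtained by reversing the instruction sequence and inverting each instruction, with no garbage state left behind.

sealed trait Instr
final case class AddTo(x: String, k: Long) extends Instr
final case class SubFrom(x: String, k: Long) extends Instr

object Reverse {
  def invert(i: Instr): Instr = i match {
    case AddTo(x, k)   => SubFrom(x, k)   // inverse of x += k is x -= k
    case SubFrom(x, k) => AddTo(x, k)
  }
  // Running invertProgram(p) after p restores the original store.
  def invertProgram(p: List[Instr]): List[Instr] = p.reverse.map(invert)
}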
Interpreter Instruction Scheduling
Abstract
Whenever we extend the instruction set of an interpreter, we risk increased instruction cache miss penalties. We can alleviate this problem by selecting instructions from the instruction set and rearranging them such that frequent instruction sequences are co-located in memory. We take these frequent instruction sequences from hot program traces of external programs, and we report a maximum speedup of 1.142×. Thus, interpreter instruction scheduling complements the improved efficiency of an extended instruction set by optimizing its instruction arrangement.
Stefan Brunthaler
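A simplified view of the scheduling step (assumed details; the paper works on real interpreter binaries): given traces ordered by hotness, lay out instruction handlers so that opcodes that follow each other dynamically end up adjacent in memory.

object HandlerLayout {
  // Greedy layout: walk traces in order of hotness and emit each opcode's
  // handler the first time it is needed, co-locating hot successors.
  def layout(tracesByHotness: Seq[Seq[String]]): Seq[String] =
    tracesByHotness.flatten.distinct
}

// layout(Seq(Seq("LOAD","ADD","STORE"), Seq("LOAD","CMP","JMP")))
// == Seq("LOAD", "ADD", "STORE", "CMP", "JMP")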

Parallelism and High-Performance Computing

Actor-Based Parallel Dataflow Analysis
Abstract
Defining algorithms in a way which allows parallel execution is becoming increasingly important as multicore computers become ubiquitous. We present IFDS-A, a parallel algorithm for solving context-sensitive interprocedural finite distributive subset (IFDS) dataflow problems. IFDS-A defines these problems in terms of Actors, and dataflow dependencies as messages passed between these Actors. We implement the algorithm in Scala, and evaluate its performance against a comparable sequential algorithm. With eight cores, IFDS-A is 6.12 times as fast as with one core, and 3.35 times as fast as a baseline sequential algorithm. We also found that Scala’s default Actors implementation is not optimal for this algorithm, and that a custom-built implementation outperforms it by a significant margin. We conclude that Actors are an effective way to parallelize this type of algorithm.
Jonathan Rodriguez, Ondřej Lhoták
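The shape of the Actor formulation, reduced to a sketch (single-threaded here for clarity; the paper's point is that each node can process its mailbox concurrently): nodes remember the facts they have seen and forward each new fact to their successors as messages.

final case class Msg(node: Int, fact: String)

final class NodeActor(succs: List[Int]) {
  private val seen = scala.collection.mutable.Set.empty[String]
  // React to one message; returns the messages to send (which a real
  // Actor implementation would dispatch asynchronously).
  def receive(m: Msg): List[Msg] =
    if (seen.add(m.fact)) succs.map(Msg(_, m.fact)) else Nil
}

object IfdsA {
  def run(nodes: Map[Int, NodeActor], init: List[Msg]): Unit = {
    val mailbox = scala.collection.mutable.Queue(init: _*)
    while (mailbox.nonEmpty) {
      val m = mailbox.dequeue()
      nodes.get(m.node).foreach(n => mailbox ++= n.receive(m))
    }
  }
}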
Using Disjoint Reachability for Parallelization
Abstract
We present a disjoint reachability analysis for Java. Our analysis computes extended points-to graphs annotated with reachability states. Each heap node is annotated with a set of reachability states that abstract the reachability of objects represented by the node. The analysis also includes a global pruning step which analyzes a reachability graph to prune imprecise reachability states that cannot be removed with local reasoning alone. We have implemented the analysis and used it to parallelize 8 benchmarks. Our evaluation shows the analysis results are sufficiently precise to parallelize our benchmarks and achieve an average speedup of 16.9×.
James Jenista, Yong hun Eom, Brian Demsky
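A toy rendering of the abstraction (an assumed encoding, not the paper's): each abstract heap node carries a set of reachability states counting how many objects from each allocation site can reach the node's objects; disjointness with respect to a site holds when no state exceeds a count of one.

final case class ReachState(counts: Map[String, Int]) // site -> #reaching objs

final case class HeapNode(states: Set[ReachState]) {
  // Objects of this node are reachable from at most one object allocated
  // at `site` in every abstract state -- the kind of disjointness used to
  // show that two traversals touch disjoint heap regions.
  def disjointFor(site: String): Boolean =
    states.forall(_.counts.getOrElse(site, 0) <= 1)
}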
Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures
Abstract
Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits the performance achieved with stencil computations on modern short-vector SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, P. Sadayappan
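The alignment conflict is visible even in scalar form (a plain sketch; the paper's layout transformation and SIMD code generation are not shown): a 3-point stencil reads three overlapping streams a(i-1), a(i), a(i+1), which cannot all be vector-aligned at once.

object Jacobi1D {
  // b(i) = average of a's neighbourhood; the three loads per iteration
  // come from mutually shifted streams -- the stream alignment conflict.
  def step(a: Array[Double]): Array[Double] = {
    val b = a.clone()
    var i = 1
    while (i < a.length - 1) {
      b(i) = (a(i - 1) + a(i) + a(i + 1)) / 3.0
      i += 1
    }
    b
  }
}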
Subregion Analysis and Bounds Check Elimination for High Level Arrays
Abstract
For decades, the design and implementation of arrays in programming languages has reflected a natural tension between productivity and performance. Recently introduced HPCS languages (Chapel, Fortress and X10) advocate the use of high-level arrays for improved productivity. For example, high-level arrays in the X10 language support rank-independent specification of multidimensional loop and array computations using regions and points. Three aspects of X10 high-level arrays are important for productivity but pose significant performance challenges: high-level accesses are performed through point objects rather than integer indices, variables containing references to arrays are rank-independent, and all subscripts in a high-level array access must be checked for bounds violations.
The first two challenges have been addressed in past work. In this paper, we address the third challenge of optimizing the overhead of array bounds checks by developing a novel region-based interprocedural array bounds analysis to automatically identify redundant checks. Elimination of redundant checks reduces the runtime overhead of bounds checks, and also enables further optimization by removing constraints that arise from precise exception semantics. We have implemented an array bounds check elimination algorithm that inserts special annotations that are recognized by a modified JVM.
We also introduce array views, a high-level construct that improves productivity by allowing the programmer to access the underlying array through multiple views. We describe a technique for optimizing away the overhead of many common cases of array views in X10. Our experiments show that eliminating bounds checks using the results of the analysis described in this paper improves the performance of our benchmarks by up to 22% over JIT compilation.
Mackale Joyner, Zoran Budimlić, Vivek Sarkar
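The analysis' core judgement can be sketched with one-dimensional regions (a hypothetical simplification; X10 regions are multidimensional): if a loop's iteration region is provably contained in the array's region, every access the loop guards is in bounds, so its checks are redundant.

final case class Region(lo: Int, hi: Int) { // inclusive 1-D region
  def contains(r: Region): Boolean = lo <= r.lo && r.hi <= hi
}

object BoundsElim {
  // Bounds checks are needed only when containment cannot be proved.
  def needsChecks(array: Region, loop: Region): Boolean =
    !array.contains(loop)
}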

Task and Data Distribution

Practical Loop Transformations for Tensor Contraction Expressions on Multi-level Memory Hierarchies
Abstract
Modern architectures are characterized by deeper levels of memory hierarchy, often explicitly addressable. Optimizing applications for such architectures requires careful management of the data movement across all these levels. In this paper, we focus on the problem of mapping tensor contractions to memory hierarchies with more than two levels, specifically addressing placement of memory allocation and data movement statements, choice of loop fusions, and tile size selection. Existing algorithms that find an integrated solution to this problem, even for two-level memory hierarchies, have been shown to be expensive. We improve upon this work by focusing on the first-order cost components, simplifying the analysis required and reducing the number of candidates to be evaluated. We have evaluated our framework on a cluster of GPUs. Using five candidate tensor contraction expressions, we show that fusion at multiple levels improves performance, and our framework is effective in determining profitable transformations.
Wenjing Ma, Sriram Krishnamoorthy, Gagan Agrawal
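The simplest tensor contraction, C(i,j) += A(i,k) * B(k,j), tiled for one extra memory level (a sketch; the paper additionally handles loop fusion and tile-size selection across deeper hierarchies):

object TiledContraction {
  def contract(a: Array[Array[Double]], b: Array[Array[Double]],
               n: Int, t: Int): Array[Array[Double]] = {
    val c = Array.ofDim[Double](n, n)
    // The tile loops decide which t x t blocks of each tensor must be
    // resident at the faster memory level at any given time.
    for (ii <- 0 until n by t; jj <- 0 until n by t; kk <- 0 until n by t)
      for (i <- ii until math.min(ii + t, n);
           j <- jj until math.min(jj + t, n);
           k <- kk until math.min(kk + t, n))
        c(i)(j) += a(i)(k) * b(k)(j)
    c
  }
}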
A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL
Abstract
Heterogeneous multi-core platforms are increasingly prevalent due to their perceived superior performance over homogeneous systems. The best performance, however, can only be achieved if tasks are accurately mapped to the right processors. OpenCL programs can be partitioned to take advantage of all the available processors in a system. However, finding the best partitioning for any heterogeneous system is difficult and depends on the hardware and software implementation.
We propose a portable partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems. We develop a purely static approach based on predictive modelling and program features. When evaluated over a suite of 47 benchmarks, our model achieves a speedup of 1.57 over a state-of-the-art dynamic run-time approach, a speedup of 3.02 over a purely multi-core approach, and a speedup of 1.55 over using just the GPU.
Dominik Grewe, Michael F. P. O’Boyle
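The flavour of the approach (with entirely invented features and coefficients; the paper trains its model offline on real benchmark data): static code features feed a predictive model that outputs a CPU/GPU work split before the program runs.

final case class Features(computePerByte: Double, branchiness: Double)

object StaticPartition {
  // Stand-in linear model: more compute per byte favours the GPU,
  // more branching favours the CPU. The coefficients are hypothetical.
  def gpuFraction(f: Features): Double = {
    val raw = 0.5 + 0.3 * f.computePerByte - 0.4 * f.branchiness
    math.max(0.0, math.min(1.0, raw))
  }
}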
Backmatter
Metadata
Title
Compiler Construction
Edited by
Jens Knoop
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-19861-8
Print ISBN
978-3-642-19860-1
DOI
https://doi.org/10.1007/978-3-642-19861-8
