
About this book

This book is a revision of my Ph.D. dissertation, submitted to Carnegie Mellon University in 1987. It documents the research and results of the compiler technology developed for the Warp machine. Warp is a systolic array built out of custom, high-performance processors, each of which can execute up to 10 million floating-point operations per second (10 MFLOPS). Under the direction of H. T. Kung, the Warp machine matured from an academic, experimental prototype into a commercial product of General Electric. The Warp machine demonstrated that the scalable architecture of high-performance, programmable systolic arrays represents a practical, cost-effective solution to present and future computation-intensive applications. The success of Warp led to the follow-on iWarp project, a joint project with Intel to develop a single-chip 20 MFLOPS processor. The availability of the highly integrated iWarp processor will have a significant impact on parallel computing.

One of the major challenges in the development of Warp was to build an optimizing compiler for the machine. First, the processors in the array cooperate at a fine granularity of parallelism; interaction between processors must be considered in the generation of code for individual processors. Second, the individual processors themselves derive their performance from a VLIW (Very Long Instruction Word) instruction set and a high degree of internal pipelining and parallelism. The compiler contains optimizations pertaining to the array level of parallelism, as well as optimizations for the individual VLIW processors.

Table of contents

Frontmatter

1. Introduction

Abstract
Since Kung and Leiserson introduced “systolic arrays” in 1978, a tremendous amount of research effort has gone into systolic algorithms. Over the last decade, researchers have developed many systolic algorithms in important numerical processing areas, including signal processing and scientific computing, demonstrating that the systolic architecture is well suited to delivering the bandwidth required by those applications. The original concept of systolic computing is to map systolic algorithms directly onto custom hardware implementations, but the cost of implementation has limited the practice of systolic computing to a few instances of dedicated systolic hardware. Most systolic algorithms have remained paper designs.
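As a concrete illustration of the systolic idea (a minimal sketch, not an example taken from the book): in the classic one-dimensional convolution array, each cell holds one weight and, on every beat, combines an input sample with the partial sum received from its neighbor and passes the result on. The C fragment below simulates the cells sequentially; in actual systolic hardware, each step of the inner loop is a separate processing element, and all cells act in lock step.

    /* Sketch of a 1-D systolic convolution array with k cells,
     * simulated in software. Cell c holds weight w[c] and computes
     * y_out = y_in + w[c] * x for each sample passing through it.
     * In hardware, after the pipeline fills, one output emerges
     * per beat. */
    void systolic_convolution(const float *x, const float *w,
                              float *y, int n, int k) {
        for (int j = 0; j + k <= n; j++) {
            float acc = 0.0f;            /* partial sum entering cell 0 */
            for (int c = 0; c < k; c++)  /* pass through the k cells */
                acc += w[c] * x[j + c];
            y[j] = acc;                  /* result leaving the last cell */
        }
    }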
Monica S. Lam

2. Architecture of Warp

Abstract
The Warp project is a study in high-performance, programmable systolic architectures. Although the feasibility of the concept of systolic computing has long been established, the results have mainly been theoretical in nature, and many lower-level architectural and implementation issues have not been studied and addressed. We believed that these practical issues would not even be exposed, let alone resolved, without implementation and experimentation. We did not just set out to implement a systolic array to perform a specific function; we extended the concept of systolic processing by implementing each processing unit as a programmable, high-performance processor. The tenet of the project is that the theories and results of systolic computing research also apply to arrays of processors that are programmable and powerful.
Monica S. Lam

3. A Machine Abstraction

Abstract
This chapter studies the design of a machine abstraction for systolic arrays with high-performance cells. While the study is based on the Warp architecture, the design is applicable to any programmable array for which a high-level language is used to specify fine-grained parallel computation on the cells. In this chapter, we also discuss the hardware and software necessary to support the proposed machine abstraction.
Monica S. Lam

4. The W2 Language and Compiler

Abstract
The machine abstraction proposed in the last chapter is supported by the W2 language. W2 is a language in which each cell in the Warp array is individually programmed in a Pascal-like notation, and cells communicate with their left and right neighbors via asynchronous communication primitives. The user controls the array-level concurrency, but not the system- and cell-level concurrency.
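The following is a minimal sketch of this programming model in C-style pseudocode, not actual W2 (which is Pascal-like); receive_left and send_right are hypothetical stand-ins for the asynchronous communication primitives that W2 builds into the language. Each cell of the array runs its own copy of such a program.

    #include <stddef.h>

    /* Hypothetical stand-ins for W2's asynchronous communication
     * primitives; in W2 these are part of the language itself. */
    extern float receive_left(void);       /* blocks until data arrives  */
    extern void  send_right(float value);  /* queues data; does not wait */

    /* One cell's program: the cell is programmed individually and
     * talks only to its immediate left and right neighbors. */
    void cell_program(float weight, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float x = receive_left();   /* consume from the left neighbor */
            send_right(weight * x);     /* pass a result to the right one */
        }
    }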
Monica S. Lam

5. Software Pipelining

Abstract
Pipelining and parallel functional units are common optimization techniques used in high-performance processors. Traditionally, this parallelism internal to the data path of a processor is only available to the microcode programmer, and the problems of minimizing the execution time of the microcode within and across basic blocks are known as local and global compaction, respectively. The development of the global compaction technique, trace scheduling, has led to the introduction of VLIW (very long instruction word) architectures [9,19,20,21]. A VLIW machine is like a horizontally microcoded machine: it consists of parallel functional units, each of which can be independently controlled through dedicated fields in a “very long” instruction. A distinctive characteristic of VLIW architectures is that these long instructions are the machine instructions. There is no additional layer of interpretation where machine instructions are expanded into micro-instructions. A compiler directly generates these long machine instructions from programs written in a high-level language. A VLIW machine generally has an orthogonal instruction set, whereas in a typical horizontally microcoded engine, complex resource or field conflicts exist between functionally independent operations.
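To make the technique concrete, here is a minimal sketch in C (not the compiler's intermediate form) of what software pipelining does to a loop whose body has three stages: load, multiply, store. In the transformed steady-state body, the three statements belong to three different iterations and are mutually independent, so a VLIW machine could issue them together in a single long instruction; the prologue fills the pipeline and the epilogue drains it.

    /* Original loop: each iteration performs load, multiply, and
     * store strictly in sequence. */
    void scale(float *a, float *b, int n) {
        for (int i = 0; i < n; i++)
            b[i] = a[i] * 2.0f;
    }

    /* Software-pipelined version: stages of different iterations
     * overlap in the steady state. */
    void scale_pipelined(float *a, float *b, int n) {
        if (n < 2) { scale(a, b, n); return; }
        float x = a[0];          /* prologue: load for iteration 0   */
        float y = x * 2.0f;      /*   multiply for iteration 0       */
        x = a[1];                /*   load for iteration 1           */
        for (int i = 2; i < n; i++) {
            b[i - 2] = y;        /* steady state: store (iter i-2),  */
            y = x * 2.0f;        /*   multiply (iter i-1),           */
            x = a[i];            /*   and load (iter i) overlap      */
        }
        b[n - 2] = y;            /* epilogue: drain the pipeline */
        b[n - 1] = x * 2.0f;
    }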
Monica S. Lam

6. Hierarchical Reduction

Abstract
Hierarchical reduction is a unified approach to scheduling both within and across basic blocks. The motivation for the technique is to make software pipelining applicable to all innermost loops, including those containing conditional statements. Software pipelining had previously been defined only for loops with straight-line bodies; a simple conditional statement in the loop would have rendered the technique inapplicable. A loop such as the one sketched below would have been extremely inefficient on a heavily pipelined and parallel data path.
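The book's original example is not reproduced on this page; the following hypothetical C loop is representative of the problem. The conditional in the body defeats straight-line software pipelining; hierarchical reduction schedules the two branches separately, collapses the if-then-else into a single pseudo-operation whose resource usage and length cover both branches, and then pipelines the loop as if its body were straight-line code.

    /* Hypothetical innermost loop with a conditional in the body
     * (not the book's own example). */
    float signed_dot(const float *a, const float *w, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (a[i] > 0.0f)
                s += a[i] * w[i];   /* then-branch */
            else
                s -= a[i] * w[i];   /* else-branch */
        }
        return s;
    }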
Monica S. Lam

7. Evaluation

Abstract
The ideas and techniques in this book have been validated and implemented in the W2 compilers for the Warp machines. The original W2 compiler, built for the first prototype of the Warp machine, was modified and retargeted as the Warp architecture evolved. The compiler was first released in late 1985 for the prototype Warp machine. By early 1987, a large number of applications, including robot navigation, low-level vision, signal processing, and scientific computing, had been developed using this compiler [3,4]. The compiler has since been retargeted to the PC production machine and to iWarp, the integrated Warp architecture. While the PC machine is similar to the first prototypes, the iWarp machine is significantly different. The different compilers contain different machine-dependent optimizations, but they all use the same scheduler module.
Monica S. Lam

8. Conclusions

Abstract
The Warp architecture represents a significant breakthrough in systolic computing. Systolic arrays had previously been known as highly efficient but special-purpose hardware architectures. The cost of dedicated hardware is so prohibitive that although many different systolic algorithms have been designed, few have been implemented. The Warp project demonstrates that programmability can be achieved without sacrificing efficiency. Even in its prototype form, the Warp machine can already deliver execution speeds for many numerically oriented applications that rival those of existing supercomputers. The availability of the single-chip iWarp processor will have a significant impact on the practice of parallel computing. Arrays of thousands of cells are feasible, programmable, and much cheaper than many other supercomputers of comparable power.
Monica S. Lam

Backmatter
