
About this Book

The papers in this book were presented at the CMU Conference on VLSI Systems and Computations, held October 19-21, 1981 in Pittsburgh, Pennsylvania. The conference was organized by the Computer Science Department, Carnegie-Mellon University, and was partially supported by the National Science Foundation and the Office of Naval Research. These proceedings focus on the theory and design of computational systems using VLSI. Until very recently, integrated-circuit research and development were concentrated in the device-physics and fabrication-design disciplines and in the integrated-circuit industry itself. Within the last few years, a community of researchers has grown up to address issues closer to computer science: the relationship between computing structures and the physical structures that implement them; the specification and verification of computational processes implemented in VLSI; the use of massively parallel computing made possible by VLSI; the design of special-purpose computing architectures; and the changes in general-purpose computer architecture that VLSI makes possible. It is likely that the future exploitation of VLSI technology depends as much on structural and design innovations as on advances in fabrication technology.

The book is divided into nine sections:

- Invited Papers. Six distinguished researchers from industry and academia presented invited papers.
- Models of Computation. The papers in this section deal with abstracting the properties of VLSI circuits into models that can be used to analyze the chip area, time, or energy required for a particular computation.

Table of Contents

Frontmatter

Invited Papers

The Optical Mouse, and an Architectural Methodology for Smart Digital Sensors

A mouse is a pointing device used with interactive display-oriented computer systems, which tracks the movement of a user’s hand as the user pushes the mouse about on a pad (usually on the work surface next to the user’s keyboard). Mice have recently become available in the office products market as a part of the Xerox “Star,” the 8010 Professional Workstation [Business 1981, Seybold 1981-1, and Seybold 1981-2].

Richard F. Lyon

Designing VLSI Processor-Aids and Architectures

Forest Baskett

Keys to Successful VLSI System Design

The designability of successful VLSI systems has been a major topic of discussion at conferences over the last three to four years. I will present a method of approaching the overall problem which has been successfully applied at TRW for several years.

James G. Peterson

Programmable LSI Digital Signal Processor Development

The development of the μPD 7720, a single-chip LSI signal processor, is described. General requirements, the architecture employed, device features, and processor performance are presented. Single-chip signal processors announced to date are briefly compared. Future trends are also discussed.

Akira Sawai

Functional Parallelism in VLSI Systems and Computations

The effectiveness of very large scale integration (VLSI) in reducing the incremental cost per unit of performance of a variety of flexible system functions can be significantly enhanced by employing a high degree of functional parallelism with serialized data-flow and control. Both Functional Parallelism (the parallel use of an array of high density, low cost, lower performance devices to obtain a high performance function) and Bit-Serialized Arithmetic (the use of single bit-stream operations to perform elementary arithmetic functions) have been factored into VLSI systems and computations to permit advantageous use of MOS solid-state technologies as well as graceful transitions of processor implementation from one scale of large scale integration to the next. Some of the major considerations linking form to function are noted here with examples illustrating the impact of functional parallelism and serialized arithmetic.

Noble R. Powell

Functional Extensibility: Making the World Safe for VLSI

The greatly improved access to VLSI technology now available to non-specialists has sparked considerable interest in the design of special-function architectures (SFAs) that exploit its unique characteristics and advantages. For example, many SFAs take the form of pipelines or arrays in order to capitalize on the economies of replication that come with VLSI. Applications of these VLSI SFAs have reflected current research interests in areas such as pattern recognition, image analysis, database retrieval, interactive graphics, and speech processing.

Justin R. Rattner

Models of Computation

Replication of Inputs May Save Computational Resources in VLSI

Several authors have investigated the complexity of a VLSI chip that solves a certain problem or class of problems. The results are stated in terms of a function R(A,T) of the chip area (A) and execution time (T) and are generally based on an abstract model of VLSI computation first defined by [T79] and further developed in [BK81] and [V80].

Zvi M. Kedem, Alessandro Zorat

Planar Circuit Complexity and the Performance of VLSI Algorithms

In 1979 Thompson [1] reported that, under a suitable model for VLSI chips, the product AT² of chip area A and the time T to compute the Fast Fourier Transform on n inputs must satisfy AT² = Ω(n²). His model accounts for the chip area used by wires as well as for computational elements. He extended these results in [2] and in addition examined the sorting problem. Brent and Kung [3] introduced a somewhat different model for VLSI chips in which the area occupied by wires and circuit elements is convex. They demonstrate that AT² = Ω(n²) to multiply two n-bit integers, a result obtained with the original model of Thompson by Abelson and Andreae [4]. Brent and Kung show that A = Ω(n), and they present algorithms that come close to meeting their lower bounds. Savage [5] obtained bounds of AT² = Ω(p⁴) with both models for p×p matrix multiplication, inversion, and transitive closure. Algorithms previously given by Kung and Leiserson [6] and Guibas et al. [7] were shown to be optimal. Preparata and Vuillemin [8] subsequently introduced a family of optimal matrix multiplication algorithms for Ω(1) ≤ T ≤ O(n).

John E. Savage

Three-Dimensional Integrated Circuitry

This paper is devoted to the question: What benefits would accrue if we could implement microcircuits in a three-dimensional medium? The patient reader can view the reported results as demonstrating the efficiencies to be gained once the current technological obstacles to three-dimensional VLSI chips are overcome. The less patient reader can view the results as indicating the currently attainable advantages of three-dimensional printed circuit boards. The results obtained indicate (roughly) that bounds on area (both upper and lower) in the neighborhood of order n² in the two-dimensional case translate to bounds on volume in the neighborhood of order n^{3/2} in the three-dimensional case; moreover, most of the upper bounds are attainable via (idealized) realizations that have active devices on only one level and that use the third dimension only for wire-routing; such realizations place somewhat fewer demands on the fabrication technology. However, it is also shown that unrestricted use of the third dimension can yield realizations that are more conservative of volume (by the factor log^{1/2} n) than any “one-active-level” realization can be. Finally, it is shown that three-dimensional circuit realizations can afford a significant savings in device-to-device wire length: examples are presented wherein two-dimensional realizations require runs of wire as long as n/log n, while three-dimensional realizations can always get by with wire lengths not exceeding n^{1/2}. Thus, at least in the worst case, there are substantive benefits to be gained from three-dimensional circuit realization, in terms of savings of both material (area vs. volume) and time (wire length).

Arnold L. Rosenberg
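
The headline translation can be collected in one display (a restatement of the bounds quoted in the abstract above, not of their proofs):

```latex
% 2-D area bounds vs. 3-D volume bounds, and worst-case wire lengths
A_{2D} = \Theta\!\left(n^{2}\right)
  \;\longleftrightarrow\;
V_{3D} = \Theta\!\left(n^{3/2}\right),
\qquad
L_{2D} = \Omega\!\left(\tfrac{n}{\log n}\right)
  \;\longleftrightarrow\;
L_{3D} = O\!\left(n^{1/2}\right)
```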

A Critique and an Appraisal of VLSI Models of Computation

Considerable attention has been paid to the definition of a suitable model of computation for Very-Large-Scale Integrated (VLSI) circuits [1], [2], [3], [4]. The basic parameters of any VLSI computation model are chip area A and computation time T. VLSI systems display a trade-off between these two parameters, each of which represents a well-defined cost aspect: chip area is a measure of fabrication cost and computation time is a measure of operating cost.

G. Bilardi, M. Pracchi, F. P. Preparata

Complexity Theory

On the Complexity of VLSI Computations

We present four results on the complexity of VLSI computations:
a) We further justify the Boolean circuit model [Vu, Sa, LS] by showing that it is able to model multi-directional VLSI devices (e.g., pass transistors, pre-charged bus drivers).
b) We prove a general cutting theorem for compact regions in R^d (d ≥ 2) that allows us to drop the convexity assumption in lower bound proofs based on the crossing sequence argument.
c) We exhibit an Ω(n^{1/3}) asymptotically tight lower bound on the area of strongly where-oblivious chips for transitive functions.
d) We prove a lower bound on the switching energy needed for computing transitive functions.

Thomas Lengauer, Kurt Mehlhorn

On the Area Required by VLSI Circuits

A technique is developed and used to derive lower bounds on the area required by a VLSI circuit by taking into account the amount of information that has to be memorized in the course of the computation. Simple arguments show, in particular, that any circuit performing operations such as cyclic shift and binary multiplication requires an area at least proportional to its output size. By extending the technique, it is also possible to obtain general tradeoffs between the area, the time, and the period (a measure of the pipeline rate) of a circuit performing operations like binary addition. The existence of VLSI designs for these operations shows that all the lower bounds are optimal up to some constant factor.

Gérard M. Baudet

The VLSI Complexity of Sorting

The area-time complexity of sorting is analyzed under an updated model of VLSI computation. The new model has fewer restrictions on chip I/O than previous models. Also, the definitions of area and time performance have been adjusted to permit fair comparisons between pipelined and non-pipelined designs. Using the new model, this paper briefly describes eleven different designs for VLSI sorters. These circuits demonstrate the existence of an area × time² tradeoff for the sorting problem. The smallest circuit is only large enough to store a few elements at a time; it is, of course, rather slow at sorting N elements. The largest design solves a sorting problem in only O(lg N) clock cycles. The area × time² performance figure for all but three of the designs is close to the limiting value, Ω(N²).

C. D. Thompson
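
To make one endpoint of that tradeoff concrete, the Python sketch below simulates odd-even transposition sort, a classic linearly connected sorter in which N cells sort N keys in N compare-exchange phases. It is an illustrative model under our own assumptions, not one of the paper's eleven designs.

```python
def odd_even_transposition_sort(a):
    """N linearly connected comparators sort N keys in N phases;
    within a phase the compare-exchanges touch disjoint pairs, so
    in hardware they all happen in parallel. Small area, O(N) time."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = phase % 2                  # alternate even/odd pairings
        for i in range(start, n - 1, 2):   # disjoint pairs: parallel step
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```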

Minimum Edge Length Planar Embeddings of Trees

Valiant [1] showed how to embed any binary tree into the plane in linear area without crossovers. The edges in this embedding have a maximum length of O(√n). With Paterson, we [2] showed that a complete binary tree can be embedded in the plane with maximum edge length O(√n / log n), and we argued the importance of short edge length for VLSI design and layout. Here we show that every binary tree can be embedded in the plane with all three properties: linear area, no crossovers, and O(√n / log n) maximum edge length. This improves a result of Bhatt and Leiserson [3] — a graph with an n^{1/2−ε} separator theorem can be embedded (perhaps with crossovers) in linear area and with a maximum edge length of O(√n) — for the case of binary trees. In the paper we also observe that Valiant’s result can be extended to the case of oriented trees [7].

Walter L. Ruzzo, Lawrence Snyder

The VLSI Approach to Computational Complexity

The rapid advance of VLSI and the trend toward decreasing geometrical feature size, through the submicron and the subnano to the subpico and beyond, have dramatically reduced the cost of VLSI circuitry. As a result, many traditionally unsolvable problems can now (or will in the near future) be easily implemented using VLSI technology.

J. Finnegan

Layout Theory and Algorithms

Optimal Placement for River Routing

River routing is the problem of connecting a set of terminals a1,…,an on a line to another set b1,…,bn in order across a rectangular channel. When the terminals are located on modules, the modules must be placed relative to one another before routing. This placement problem arises frequently in design systems like bristle-blocks where stretch lines through a module can effectively break it into several chunks, each of which must be placed separately. This paper gives concise necessary and sufficient conditions for wirability which are applied to reduce the optimal placement problem to the graph-theoretic single-source-longest-paths problem. By exploiting the special structure of graphs that arise from the placement problem for rectilinear wiring, an optimal solution may be determined in linear time.

Charles E. Leiserson, Ron Y. Pinter
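
To make the reduction target concrete, here is a generic single-source longest-paths routine for a DAG. It is a hedged Python sketch (the function name and edge format are our own); the paper's contribution is exploiting the special structure of placement graphs to get linear time, which this generic version does not attempt.

```python
from collections import deque

def longest_paths(n, edges, src=0):
    """Single-source longest paths in a DAG with n vertices.
    edges: iterable of (u, v, w). Topological order via Kahn's
    algorithm, then one relaxation pass in that order."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
    order, q = [], deque(i for i in range(n) if indeg[i] == 0)
    while q:
        u = q.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    NEG = float("-inf")
    dist = [NEG] * n
    dist[src] = 0
    for u in order:                       # relax in topological order
        if dist[u] == NEG:
            continue                      # unreachable from src
        for v, w in adj[u]:
            dist[v] = max(dist[v], dist[u] + w)
    return dist

print(longest_paths(4, [(0, 1, 2), (0, 2, 1), (1, 3, 3), (2, 3, 5)]))
# [0, 2, 1, 6]
```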

The Separation for General Single-Layer Wiring Barriers

The problems of placement and routing in integrated circuit design have been gaining increasing attention as fabrication technology advances. Although a variety of these problems have been proven to be NP-hard, progress is being made on restricted versions. Tompa, for example, gives a quadratic solution to a particular single-layer routing problem [T]. This paper gives efficient algorithms for finding the separation and the offset in contexts which include his model.

Alan Siegel, Danny Dolev

Provably Good Channel Routing Algorithms

In this paper we present three new two-layer channel routing algorithms that are provably good in that they never require more than 2d-1 horizontal tracks, where d is the channel density, when each net connects just two terminals. To achieve this result, we use a slightly relaxed (but still realistic) wiring model in which wires may run on top of each other for short distances as long as they are on different layers. Two of our algorithms will never use such a “parallel run” of length greater than 2d-1, and our third algorithm will require overlap only at jog points or cross points. Since in this wiring model at least d/2 horizontal tracks are required, these algorithms produce a routing requiring no more than four times the best possible number of horizontal tracks. The second algorithm also has the property that it uses at most 4n contacts, where n is the number of nets being connected.

Ronald L. Rivest, Alan E. Baratz, Gary Miller
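
As an aside, the channel density d that the 2d-1 bound refers to is easy to compute for two-terminal nets. The Python sketch below (our own illustration, not from the paper) sweeps the columns and counts the maximum number of nets whose horizontal spans overlap.

```python
def channel_density(nets):
    """Channel density d: the maximum number of nets whose horizontal
    spans cover a common column. nets: list of (left, right) column
    pairs for two-terminal nets."""
    events = []
    for l, r in nets:
        lo, hi = min(l, r), max(l, r)
        events.append((lo, 1))        # span begins at column lo
        events.append((hi + 1, -1))   # span covers columns lo..hi
    events.sort()
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best

print(channel_density([(0, 3), (2, 5), (4, 6)]))  # 2
```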

Optimal Routing in Rectilinear Channels

Programs for integrated circuit layout typically have two phases: placement and routing. The router should produce as efficient a layout as possible, but of course the quality of the routing depends heavily on the quality of the placement. On the other hand, the placement procedure would like to know how good a routing it can expect without actually routing the wires. This paper presents fast algorithms for optimal routing and for accurately estimating the area cost of such routings without actually laying them out. The most common types of junctions occurring in layouts are T-shaped or X-shaped; this paper presents efficient algorithms to measure and produce the optimal rectilinear, two-layer routing in channels formed around such junctions. The ability to do this is based on the new notion of pairwise ordering, which is used to propagate routing constraints from one part of a channel to the rest and alleviates a fundamental problem plaguing traditional channel routers. In addition, we present a greedy algorithm for optimal routing in rectangles with a new type of terminal ordering which comes up frequently in practice but has not been studied before.

Ron Y. Pinter

New Lower Bounds for Channel Width

We present here a simple yet effective technique for calculating a lower bound on the number of tracks required to solve a given channel-routing problem. The bound applies to the wiring model where horizontal wires run on one layer and vertical wires run on another layer. One of the major results is that at least √(2n) tracks are necessary for any dense channel routing problem with n two-terminal nets that begin and end in different columns. For example, if each net i begins in column i and ends in column i+1, at least √(2n) tracks are required, even though the channel “density” is only 2. This is the first technique which can give results significantly better than the naive channel density arguments. A modification results in the calculation of an improved bound, which we conjecture to be optimal to within a constant factor.

Donna J. Brown, Ronald L. Rivest

Compact Layouts of Banyan/FFT Networks

A two-layer layout is presented for the crossover pattern that appears in the FFT signal flow graph and in many switching networks like the banyan, delta, or omega nets. It is characterized by constant path length and a regular pattern, providing uniform propagation delay and capacitance and ameliorating design problems for VLSI implementation. These are important issues since path length grows linearly with the number of inputs to such networks, even though switching delay seems to grow only logarithmically. Independent of that pattern, an arrangement for stacking planes of such planar crossover patterns in three dimensions is described. Whereas a planar crossover pattern of O(m) inputs and outputs has at best O(m) path length, the stacked pattern allows O(√m) path length. The scheme provides for stacking 2k planar nets (perhaps VLSI chips), each with k inputs/outputs, into a network of k² inputs/outputs. Using this pattern, all such paths would have length (propagation delay) O(k).

David S. Wise
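
The crossover pattern in question is the butterfly exchange. As a quick illustration (a Python sketch under our own naming, not the paper's layout), the wiring of a 2^k-input FFT/banyan-style network can be enumerated stage by stage:

```python
def butterfly_edges(k):
    """Crossover wiring of a 2**k-input FFT/banyan-style network:
    in stage s, node i exchanges with partner i XOR 2**s. Returns
    one list of crossover pairs per stage."""
    n = 1 << k
    return [[(i, i ^ (1 << s)) for i in range(n) if i < (i ^ (1 << s))]
            for s in range(k)]

print(butterfly_edges(2))
# [[(0, 1), (2, 3)], [(0, 2), (1, 3)]]
```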

Languages and Verification

Syntax-Directed Verification of Circuit Function

This paper introduces a new technique, called syntax-directed verification, for proving properties of circuits composed of standard cells. The lengths of proofs using this technique are independent of the size of the circuits, but depend only on the number of standard cell types and the complexity of the rules for interconnecting them. Syntax-directed verification is thus well-suited to VLSI, in which large circuits are built using relatively few types of cells. The paper describes the syntax-directed verification method, and presents an example of its use.

Michael J. Foster

Temporal Specifications of Self-Timed Systems

Self-timed logic provides a method for managing the complexity of asynchronous module connections; the correctness of a properly constructed self-timed system is independent of the speed of its components. In this paper we present a means of formally specifying self-timed systems and modules using temporal logic, an extension of ordinary logic to include an abstract notion of time. We show by example that temporal logic can describe Seitz’s self-timed modules, giving detailed specifications for combinational logic, and sketching the treatment of wires, align elements, feedback registers, pipelines, and finite state machines. Temporal logic has an expressive power that makes it well suited to this task; it also provides a framework for proofs of the properties of self-timed systems.

Yonatan Malachi, Susan S. Owicki
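
As a flavor of such specifications, and as an invented toy example rather than one taken from the paper, a request/acknowledge handshake might be specified in temporal logic as:

```latex
% hypothetical handshake properties: \Box is "always", \Diamond "eventually"
\Box\,(\mathit{req} \rightarrow \Diamond\,\mathit{ack})
\qquad\text{and}\qquad
\Box\,(\mathit{ack} \rightarrow \Diamond\,\lnot\mathit{req})
```

The first formula says every request is eventually acknowledged; the second, that after an acknowledgment the requester eventually withdraws its request. Neither mentions component speeds, which is exactly the point of the self-timed discipline.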

A Mathematical Approach to Modelling the Flow of Data and Control in Computational Networks

This paper proposes a mathematical formalism for the synthesis and qualitative analysis of computational networks that treats data and control in the same manner. Expressions in this notation are given a direct interpretation in the implementation domain. Topology, broadcasting, pipelining, and similar properties of implementations can be determined directly from the expressions. This treatment of computational networks emphasizes the space/time tradeoff of implementations. A full instantiation in space of most computational problems is unrealistic, even in VLSI (Finnegan [4]). Therefore, computations also have to be at least partially instantiated in the time domain, requiring the use of explicit control mechanisms, which typically cause the data flow to be nonstationary and sometimes turbulent.

Lennart Johnsson, Danny Cohen

A Wavefront Notation Tool for VLSI Array Design

This paper presents an overview of an extension to a mathematically based methodology for mapping an algorithmic description into a concurrent implementation on silicon. This methodology yields a systolic array [4]. The basic mathematical method was initially described by Cohen [1]. Extensions were made by Weiser and Davis [5]; Johnsson, Weiser, Cohen, and Davis [2]; Cohen and Johnsson [3]; and Weiser [6]. This approach focuses on the correspondence between equations defining a certain computation and networks which perform the computation. As the complexity of problems increases, a hierarchical approach can hide detail and thus reduce the design complexity at each level. The purpose of this paper is to introduce a method for treating sets of data as wavefront entities, both in the equations of the mathematical methodology and in the graphical representation.

Uri Weiser, Al Davis

A Matrix Data Flow Language/Architecture for Parallel Matrix Operations Based on Computational Wavefront Concept

This paper focuses on revolutionary parallel architecture and language concepts for a class of matrix operations which are fundamental to signal processing computations. Based on the notion of a computational wavefront, a data flow language for matrix array processors is developed, and a new processor architecture tailored to this language is proposed. Simulations were performed at both the global and local levels, and both report encouraging success.

S. Y. Kung, K. S. Arun, D. V. Bhaskar Rao, Y. H. Hu
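
The wavefront idea can be made concrete with a toy Python model (ours, not the paper's language): in an n×n processor array computing C = A·B, cell (i, j) may fire only after its west and north neighbors, so activity sweeps the array along anti-diagonals.

```python
def wavefront_matmul(A, B):
    """Schedule C = A*B as a computational wavefront: cells on
    anti-diagonal i + j == wave all fire at step `wave`. Each cell's
    whole dot product is collapsed into one firing here for brevity;
    a real wavefront array accumulates one term per sweep."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for wave in range(2 * n - 1):                    # one diagonal per step
        for i in range(max(0, wave - n + 1), min(n, wave + 1)):
            j = wave - i
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C

print(wavefront_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```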

Special-Purpose Architectures

Digital Signal Processing Applications of Systolic Algorithms

VLSI structures and algorithms are given for bit-serial FIR filtering, IIR filtering, and convolution. We also present a bit-parallel FIR filter design. The structures are highly regular, programmable, and area-efficient. In fact, we will show that most are within log factors of asymptotic optimality. These structures are completely pipelined; that is, the throughput rate (bits/second) is independent of both word size and filter length. This is to be contrasted with algorithms designed and implemented in terms of, say, multipliers and adders whose throughput rates may depend on word length.

Peter R. Cappello, Kenneth Steiglitz
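
As a minimal illustration of the flavor of such structures (word-level and sequentially simulated, unlike the paper's bit-serial designs), the Python sketch below models a k-cell FIR array in which each input sample is broadcast to all cells and partial sums shift one cell per tick:

```python
def systolic_fir(h, xs):
    """Simulate a k-cell FIR array: the input sample is broadcast to
    all cells each tick, partial sums shift one cell per tick, and the
    finished sum y[t] = sum_j h[j]*x[t-j] leaves the last cell."""
    k = len(h)
    acc = [0] * k                       # partial sums held between cells
    out = []
    for x in xs:
        acc = [0] + acc[:-1]            # sums advance one cell per tick
        for i in range(k):
            acc[i] += h[k - 1 - i] * x  # cell i holds tap h[k-1-i]
        out.append(acc[-1])             # completed sum exits the array
    return out

print(systolic_fir([1, 2, 3], [1, 0, 0, 1]))   # [1, 2, 3, 1]
```

One new output emerges per tick regardless of the number of taps, which is the throughput property the abstract emphasizes.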

A Two-Level Pipelined Systolic Array for Convolutions

Pipelining computations over a large array of cells has been an important feature of systolic arrays. To achieve even higher degrees of concurrency, it is desirable to have the cells of a systolic array themselves be pipelined as well. The resulting two-level pipelined systolic array would enjoy in principle a k-fold increase in its throughput, where k is the ratio of the time to perform the entire cell computation to that to perform just one of its pipeline stages. This paper describes such a two-level pipelined systolic array that is capable of performing convolutions of any dimension. The designs take full advantage of the pipelining assumed to be available at each cell. Multi-stage pipelined arithmetic units built from discrete components have been used in most high-performance computers. With the advent of VLSI, these pipelined units will surely be implemented in one or a few chips. This paper shows for the first time how a large number of these pipelined chips can be efficiently combined to form a systolic array.

H. T. Kung, Lawrence M. Ruane, David W. L. Yen

Systolic Algorithms for Running Order Statistics in Signal and Image Processing

Median smoothing, a filtering technique with wide application in digital signal and image processing, involves replacing each sample in a grid with the median of the samples within some local neighborhood. As implemented on conventional computers, this operation is extremely expensive in both computation and communication resources. This paper defines the running order statistics (ROS) problem, a generalization of median smoothing. It then summarizes some of the issues involved in the design of special purpose devices implemented with very large scale integration (VLSI) technology. Finally, it presents algorithms designed for VLSI implementation which solve the ROS problem and are efficient with respect to hardware resources, computation time, and communication bandwidth.

Allan L. Fisher
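
For reference, the operation being accelerated is easy to state sequentially. The Python sketch below is a plain software median smoother (our own baseline, not the paper's systolic ROS algorithm):

```python
from statistics import median

def median_smooth(xs, w=3):
    """Replace each sample by the median of its w-sample neighborhood
    (edges keep a truncated window). A sequential reference point for
    the running-order-statistics problem."""
    r = w // 2
    return [median(xs[max(0, i - r): i + r + 1]) for i in range(len(xs))]

print(median_smooth([1, 9, 2, 3, 8, 3]))   # [1, 2, 3, 3, 3, 3] -- spikes removed
```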

Systolic Array Processor Developments

The combination of systolic array processing techniques and VLSI fabrication promises to provide modularity in the implementation of matrix operations for signal processing, with throughput increasing linearly with the number of cells utilized. In order to achieve this, however, many design tradeoffs must be made. Several fundamental questions need to be addressed: What level of complexity (control) should the processor incorporate in order to perform complicated algorithms? Should the control for the processing element be combinatorial logic or a microprocessor? The broad application of a systolic processing element will require flexibility in its architecture if it is to be produced in large enough quantities to lower the unit cost so that large arrays can be constructed.

In order to have a timely marriage of algorithms and hardware we must develop both concurrently so that each will affect the other. A brief description is given of the hardware for a programmable, reconfigurable systolic array test-bed, implemented with presently available integrated circuits and capable of 32-bit floating point arithmetic. While this hardware requires a small printed circuit board for each processor, in a few years one or two custom VLSI chips could be used instead, yielding a smaller, faster systolic array. The test-bed is flexible enough to allow experimentation with architecture and algorithms so that knowledgeable decisions can be made when it comes time to specify the architecture of a VLSI circuit for a particular set of applications.

The systolic array test-bed system is composed of a minicomputer system interfaced to the array of systolic processor elements (SPEs). The minicomputer system is an HP-1000, with the usual complement of printer, disk storage, keyboard-CRT, etc. The systolic array is housed in a cabinet approximately 28″×19″×21″. The interface circuitry uses a single 16-bit data path from the host HP-1000 to communicate data and commands to the array. Commands and data are generated in the HP-1000 by the operator using interface programs written in FORTRAN. Algorithms can be conceived, put into a series of commands for the systolic array processor, and tested for validity. Data computed in the array can be read by the host HP-1000 and displayed for the operator.

The use of a general purpose minicomputer as the driver for the systolic array gives unlimited flexibility in developing algorithms. Through the use of interface routines, algorithms can be tried, evaluated, changed, and tried again in a few minutes. Also, in cases where the output must be manipulated and fed back into the array, the manipulation of the data can be done either in the host using its high-order-language capability (for optimum flexibility) or in a dedicated microprocessor interfacing the systolic array to the host (for optimum speed).

Keith Bromley, J. J. Symanski, J. M. Speiser, H. J. Whitehouse

A Systolic (VLSI) Array for Processing Simple Relational Queries

This paper discusses the use of systolic arrays (a conceptual and design tool for VLSI systems [11]) to produce VLSI capable of processing simple relational database queries, which are by far the most frequently executed queries in practical large database systems. We will be concerned with the exploitation of VLSI technology to process “simple” relational queries very rapidly; the design of an array for this task is described below. The systolic properties of the array design are considered and are shown to have analogs in the domain of databases, by using the systolic properties to prove certain consistency and scheduling-complexity properties of all transactions executed by the array (hereinafter called the simple query array, or SQA). The SQA is intended for use as an integral part of a systolic database machine [13], which would handle very large databases and is expected to have a high performance gain over conventional database systems. The machine should also compare quite favorably with other database machine designs (for example [1,4,16,17,19]), especially when used for databases with frequent simple queries, i.e., those databases used by most commercial applications.

Philip L. Lehman

A Systolic Data Structure Chip for Connectivity Problems

In this paper we present an example of a design for a “data structure chip” and suggest how it can be used for problem solving in a digital system. In particular, we describe a systolic structure which can be used, for a graph, to find the connected components, a spanning tree, or, when used in conjunction with a systolic priority queue, a minimum spanning tree.

Carla Savage
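
A conventional software baseline for the same problem helps fix ideas. The union-find sketch below (our own sequential reference, not the systolic design) computes connected components, the task the chip performs on streamed edges:

```python
def connected_components(n, edges):
    """Union-find baseline for connected components on n vertices.
    edges: iterable of (u, v) pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components
    return [find(x) for x in range(n)]      # component label per vertex

print(connected_components(5, [(0, 1), (3, 4)]))  # e.g. [1, 1, 2, 4, 4]
```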

Multiplier Designs

Fixed-Point High-Speed Parallel Multipliers in VLSI

The paper presents techniques to increase the speed of fixed-point parallel multipliers and reduce multiplier chip size for VLSI realizations. It is shown that a higher-order (octal) version of Booth’s algorithm leads to significant improvements in speed, coupled with a decrease in chip area and power consumption, as compared to the modified (quaternary) version of Booth’s algorithm presently used in or proposed for monolithic multipliers. In addition, further speed improvements can be obtained by using Wallace trees or optimal Dadda types of realizations. The optimal Dadda realizations with a minimal number of adders can be laid out in a regular, rectangular array interleaved with partial product generation. The resulting regular structure is suitable for VLSI implementation. The more complex interconnection wiring which is needed is shown to be feasible in technologies with at least three layers of interconnections. Layout, interconnection, and speed considerations for the proposed high-speed VLSI parallel multiplier configurations have been studied.

Peter Reusens, Walter H. Ku, Yu-Hai Mao
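
To see what "octal Booth" buys, the sketch below (our own illustration, not the paper's circuit) recodes an n-bit two's-complement multiplier into radix-8 digits in {-4,...,+4}. Each digit selects one partial product (0, ±M, ±2M, ±3M, ±4M), so an n-bit operand needs only about n/3 partial products, at the cost of generating the "hard" ±3M multiple.

```python
def booth_radix8_digits(x, n):
    """Recode an n-bit two's-complement integer x into radix-8 Booth
    digits d_i in {-4,...,4} with x == sum(d_i * 8**i). Overlapping
    4-bit groups (3 new bits + 1 overlap bit) yield one digit each."""
    bits = [(x >> j) & 1 for j in range(n)]
    while len(bits) % 3:
        bits.append(bits[-1])              # sign-extend to a multiple of 3
    prev = [0] + bits                      # b[-1] = 0 convention
    return [-4 * bits[3*i + 2] + 2 * bits[3*i + 1] + bits[3*i] + prev[3*i]
            for i in range(len(bits) // 3)]

# sanity check: the digits reconstruct the operand
assert sum(d * 8**i for i, d in enumerate(booth_radix8_digits(5, 4))) == 5
assert sum(d * 8**i for i, d in enumerate(booth_radix8_digits(-7, 4))) == -7
```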

A Mesh-Connected Area-Time Optimal VLSI Integer Multiplier

VLSI circuits for integer multiplication based on the Fourier Transform have been discussed in the literature [1-3]. The circuits presented in [2] meet the area × time² lower bound for multiplication established by various workers [1,4,5] in the well-known VLSI synchronous model of computation [6]. This bound states that, if the numbers to be multiplied are represented with N bits and A and T are respectively chip area and computation time, then any VLSI multiplication network must satisfy the condition AT² = Ω(N²).

Franco P. Preparata

A Regular Layout for Parallel Multiplier of 0(Log2N) Time

An O(log²n)-time n-bit binary multiplier based on the Mead and Conway VLSI design rules is presented. The layout has a regular, recursive structure and is directly suitable for practical VLSI implementation. This multiplier is much faster than the traditional “serial pipeline multiplier,” which takes O(n) time, and the “Brent-Kung multiplier,” which takes O(n^{1/2} log n) time. Its layout is of more practical interest than the multiplier proposed by Preparata and Vuillemin based on the CCC network, even though that design is optimal and also runs in O(log²n) time. The AT² measure of this multiplier layout is nearly optimal, being O(n² log⁴n); it thus answers the question posed by Brent and Kung concerning the existence of a practical multiplier with AT² = o(n³). The detailed VLSI layout and the theoretical and actual exact complexity of the time and area measures are presented. The actual implementation of a 16-bit example through the MPC is also discussed.

W. K. Luk

Processors

VLSI Implementations of a Reduced Instruction Set Computer

A general trend in computers today is to increase the complexity of architectures commensurate with the increasing potential of implementation technologies. Consequences of this complexity are increased design time, more design errors, inconsistent implementations, and the delay of single chip implementation [7]. The Reduced Instruction Set Computer (RISC) Project investigates a VLSI alternative to this trend. Our initial design is called RISC I.

Daniel T. Fitzpatrick, Manolis G. H. Katevenis, David A. Patterson, Zvi Peshkess, Robert W. Sherburne, John K. Foderaro, Howard A. Landman, James B. Peek, Carlo H. Séquin, Korbin S. Van Dyke

MIPS: A VLSI Processor Architecture

MIPS (Microprocessor without Interlocked Pipe Stages) is a general purpose processor architecture designed to be implemented on a single VLSI chip. The main goal of the design is high performance in the execution of compiled code. The architecture is experimental since it is a radical break with the trend of modern computer architectures. The basic philosophy of MIPS is to present an instruction set that is a compiler-driven encoding of the microengine. Thus, little or no decoding is needed and the instructions correspond closely to microcode instructions. The processor is pipelined but provides no pipeline interlock hardware; this function must be provided by software.

John Hennessy, Norman Jouppi, Forest Baskett, John Gill
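
The "interlocks in software" philosophy can be illustrated with a toy scheduler (entirely our own model; real MIPS hazard rules and encodings differ): given a list of (op, dest, srcs) tuples, insert a NOP wherever an instruction would read the destination of a load in the immediately preceding slot.

```python
def insert_noops(prog):
    """Toy software interlock: assume a loaded value is not available
    to the very next instruction slot, so pad with a NOP when the next
    instruction reads a just-loaded register."""
    out = []
    for op, dest, srcs in prog:
        if out:
            p_op, p_dest, _ = out[-1]
            if p_op == "load" and p_dest in srcs:
                out.append(("nop", None, ()))   # fill the load delay slot
        out.append((op, dest, srcs))
    return out

prog = [("load", "r1", ("r0",)), ("add", "r2", ("r1", "r3"))]
print(insert_noops(prog))   # a NOP appears between the load and the add
```

A real compiler would try to hoist an independent instruction into the delay slot instead of a NOP; the hardware itself never stalls.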

Comparative Survey of Different Design Methodologies for Control Part of Microprocessors

We present several methodologies used in the design of the control parts of microprocessors, and we discuss their classification with respect to the qualities of the design. All these different methodologies were brought out by decoding existing integrated circuits. Afterwards, each one of these methodologies was used to redesign a new control part for the MC 6800 microprocessor, its operation part remaining unchanged. By so doing, we obtained a set of normalized solutions so that the real efficiency of each method could be estimated in terms of the cost of hardware and design time. The performance expressed by the cycle time of each control part was also calculated, leading to a complete, valid classification of the different design styles. Finally, the evolution of design efficiency with circuit complexity was studied.

Monika Obrebska

C.fast: A Fault Tolerant and Self Testing Microprocessor

During the spring of 1981, the authors were involved in a project to design a single-chip fault-tolerant microprocessor. The microprocessor chip is now being fabricated by the Multi Project Chip (MPC) facilities. This report presents a brief overview of the chip, examples of the reliability and testability techniques implemented, and some of the trade-off issues resolved during the design process: the partitioning of control code into several PLAs, and the increase of PLA size and overall chip size due to testability and reliability constraints.

Michael M. Tsao, Andrew W. Wilson, Ralph C. McGarity, Chia-Jeng Tseng, Daniel P. Siewiorek

Systems and Processors

VLSI Processor Arrays for Matrix Manipulation

It is generally recognized that computing system throughput can be significantly increased by concurrent processing techniques. The component densities available from VLSI have made this potential economically feasible and at the same time have considerably enhanced circuit performance. Implementation of concurrent architectures can be achieved by mapping desired algorithms into data flow forms suitable for pipelining. However, the utility of this approach suffers because a different hardware organization is required for every algorithm. A preferable approach is based on the observation that most of the signal processing algorithms being investigated today can be formatted in terms of matrix manipulations [1]. A matrix operation is also well suited to concurrent implementations in which a number of small, identical processors operate simultaneously on the matrix elements. Thus, a set of “matrix operator” chips made using VLSI would, when coupled to a general purpose host computer, provide both high computational throughput and the flexibility to perform a wide range of signal processing algorithms.

J. G. Nash, S. Hansen, G. R. Nudd

A General-Purpose Cam-Based System

VLSI makes feasible the massive use of content-addressable memory (CAM) in a general purpose computer. We present a design for memory which is addressable both conventionally and by content, and which supports low-level bit-serial, word-parallel algorithms. CAM provides one of the most easily understood and programmed frameworks for massively parallel computation. We present a programming methodology for the use of our design. This includes a programming language, CAML; a number of algorithms from various fields which demonstrate the generality of the design and the language; and techniques for transforming algorithms from conventional to CAM-based structures and methods. We do not attempt to better the performance of specialized hardware for particular tasks. Our contention is that CAM is a practical technique for applying the massive parallelism available with VLSI to general-purpose use.

J. Storrs Hall
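
The bit-serial, word-parallel discipline is easy to model in software. The Python sketch below (our own toy, not the paper's design or CAML) searches all words at once while examining one bit position per step:

```python
class CAM:
    """Toy content-addressable memory: word-parallel, bit-serial search.
    One integer per stored word; a bitmask over word indices plays the
    role of the per-word match tags."""
    def __init__(self, words, width):
        self.words, self.width = list(words), width

    def match(self, key, care_mask):
        tags = (1 << len(self.words)) - 1          # all words match initially
        for b in range(self.width):                # bit-serial over positions
            if not (care_mask >> b) & 1:
                continue                           # "don't care" position
            kb = (key >> b) & 1
            bit_slice = 0
            for i, w in enumerate(self.words):     # word-parallel in hardware
                if (w >> b) & 1 == kb:
                    bit_slice |= 1 << i
            tags &= bit_slice                      # keep words agreeing so far
        return [i for i in range(len(self.words)) if (tags >> i) & 1]

cam = CAM([0b1010, 0b1110, 0b1011], width=4)
print(cam.match(0b1010, care_mask=0b1110))         # -> [0, 2]
```

In hardware each bit slice is one wired operation over all words, so a masked search costs time proportional to the word width, not to the number of words.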

A Statically Scheduled VLSI Interconnect for Parallel Processors

The interconnect mechanism of a computer system plays a central role in determining the performance of the system. Consequently, much effort has been expended in designing interconnection structures and in attempting to bring the full power of VLSI technology to bear upon this issue. One particularly attractive candidate for VLSI implementation is a crossbar switch of relatively small dimensions (e.g., 2x2 or 4x4). A crossbar switch of this type can be used to build larger crossbar switches (which may be either sparse or complete), multi-bus interconnects with the crossbar chips serving to connect the various resources to the multiple buses, and multi-stage networks such as the delta network [l].

B. R. Rau, P. J. Kuekes, C. D. Glaeser

The CMOS SLA Implementation and SLA Program Structures

The storage/logic array (SLA) is a form of structured logic which is well suited to VLSI design. The SLA concept, which was derived from the PLA, was originally conceived by Patil [3] and later elaborated upon by Patil and Welch [4]. The SLA differs from the PLA in several major respects. The SLA has both the AND and the OR planes from the PLA, but these planes are superimposed or folded on top of each other. This folding of the AND and OR planes generates a structure in which AND terms are generated on the rows of the SLA and the OR terms are generated on the columns. The single AND/OR plane of the SLA contains column wires which can serve as inputs to the SLA rows (AND plane) or as outputs from the OR plane. This functional duality of SLA columns means not only that the SLA can be arbitrarily segmented, but that inputs to and outputs from segments of the SLA can be arbitrarily interleaved.

K. F. Smith, T. M. Carter, C. E. Hunt
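
Since the SLA is described by contrast with the PLA, a minimal PLA evaluator may help as a reference point (our own sketch; the SLA's folded planes and interleaved segments are not modeled):

```python
def pla_eval(and_plane, or_plane, inputs):
    """Two-level AND-OR evaluation of a PLA 'personality'.
    and_plane: one dict per product term, {input_index: required_bit}.
    or_plane:  one list per output, the product-term indices ORed."""
    terms = [all(inputs[i] == bit for i, bit in term.items())
             for term in and_plane]
    return [any(terms[t] for t in out_terms) for out_terms in or_plane]

# XOR of two inputs as a PLA: product terms x0*~x1 and ~x0*x1, one output
print(pla_eval([{0: 1, 1: 0}, {0: 0, 1: 1}], [[0, 1]], [1, 0]))  # [True]
```

The SLA folds these two planes on top of each other, so rows carry AND terms and columns carry OR terms of the same physical array.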

A New CCD Parallel Processing Architecture

Charge-coupled signal processing devices are attractive for applications in such systems as communication, radar, and sonar because of the ability of a single, rather simple device to perform the equivalent of a large number of arithmetic operations per second. For example, a 32-point transversal filter [1] operating at 20 MHz performs 1.2 × 10⁹ operations per second, the equivalent of a fair-sized digital processor. However, conventional CCDs have not gained widespread acceptance in commercial or military systems because of the lack of availability of CCDs with sufficient cost, power, weight, and throughput advantages over competing digital techniques to make them attractive for integration into otherwise digital hardware.

A. M. Chiang