
About this book

Application Specific Processors is written for use by engineers who are developing specialized systems (application specific systems).
Traditionally, most high performance signal processors have been realized with application specific processors, because such processors can be tailored to exactly match the (usually very demanding) application requirements. The result is that no "processing power" is wasted on unnecessary capabilities and maximum performance is achieved. A disadvantage is that such processors have been expensive to design, since each is a unique design customized to its specific application.
In the last decade, computer-aided design systems have been developed to facilitate the development of application specific integrated circuits. The success of such ASIC CAD systems suggests that it should be possible to streamline the process of application specific processor design.
Application Specific Processors consists of eight chapters that provide a mixture of techniques and examples relating to application specific processing. The techniques are expected to suggest avenues for additional research and to assist those who must implement efficient application specific processors. The examples illustrate the application of the concepts and demonstrate the efficiency that can be achieved with application specific processors. The chapters were written by members and former members of the application specific processing group at the University of Texas at Austin. The first five chapters relate to arithmetic, which is often the key to achieving high performance in application specific processors. The next two chapters focus on signal processing systems, and the final chapter examines the interconnection of possibly disparate elements to create systems.



1. Variable-Precision, Interval Arithmetic Processors

This chapter presents the design and analysis of variable-precision, interval arithmetic processors. The processors give the user the ability to specify the precision of the computation, determine the accuracy of the results, and recompute inaccurate results with higher precision. The processors support a wide variety of arithmetic operations on variable-precision floating point numbers and intervals. Efficient hardware algorithms and specially designed functional units increase the speed, accuracy, and reliability of numerical computations. Area and delay estimates indicate that the processors can be implemented with areas and cycle times that are comparable to conventional IEEE double-precision floating point coprocessors. Execution time estimates indicate that the processors are two to three orders of magnitude faster than a conventional software package for variable-precision, interval arithmetic.
Michael J. Schulte
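
The enclosure semantics the chapter's processors support in hardware can be illustrated with a short software sketch (the class and names below are illustrative, not the chapter's design): each operation rounds the lower bound down and the upper bound up, so the true result is always contained in the computed interval, and the interval width measures the accuracy of the result.

```python
import math

class Interval:
    """Closed interval [lo, hi] with outward rounding, a minimal sketch of
    interval arithmetic (illustrative class, not the chapter's hardware)."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        # Round the lower bound toward -inf and the upper bound toward +inf
        # so the true mathematical result is always enclosed.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(p), -math.inf),
                        math.nextafter(max(p), math.inf))

    def width(self):
        # A wide result signals that the computation should be redone
        # with higher precision.
        return self.hi - self.lo

sq = Interval(1.0, 2.0) * Interval(1.0, 2.0)  # encloses [1, 4]
```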

2. Modeling the Power Consumption of CMOS Arithmetic Elements

Designers are faced with the task of designing circuits and systems that must minimize power dissipation. By providing the designer with accurate estimates of the power dissipation of CMOS adders and multipliers, this research aids in the initial circuit choice, thus reducing the number and length of design iterations. Simulation and direct measurement of test chips are used to evaluate their characteristics, and the results are used to rank the circuits by dynamic power dissipation.
Thomas K. Callaway
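
The quantity being modeled is the standard CMOS dynamic (switching) power, which a short sketch makes concrete (the numeric values below are illustrative assumptions, not figures from the chapter):

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Average dynamic power of a CMOS node: P = alpha * C * Vdd**2 * f,
    where alpha is the switching activity factor, C the load capacitance,
    Vdd the supply voltage, and f the clock frequency."""
    return alpha * c_load * vdd ** 2 * freq

# Illustrative values (assumptions, not from the chapter): 50 fF load,
# 3.3 V supply, 100 MHz clock, switching on 20% of cycles.
p = dynamic_power(0.2, 50e-15, 3.3, 100e6)  # about 10.9 microwatts
```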

3. Fault Tolerant Arithmetic

This chapter summarizes various techniques used to achieve fault tolerance in computer arithmetic. There are three basic approaches: hardware redundancy, information redundancy, and time redundancy. Hardware redundancy has the highest hardware overhead, but its delay is minimal. With information redundancy, both the hardware complexity and the delay are relatively high. Time redundancy uses the smallest amount of hardware at the expense of extra computation time. In this research a concurrent error correcting technique based on time redundancy, called time shared TMR, is developed. It has been successfully applied to ripple carry adders and array multipliers. VLSI ripple carry adders and array multipliers are designed and compared in area, delay, and cycle time. The time shared TMR technique can also be applied to more complex arithmetic processors such as sorting networks, FFT arrays, convolvers, and inner product units. This research is significant because the time shared TMR technique provides high reliability and low hardware complexity with a reasonable delay penalty.
Yuang-Ming Hsu
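
The principle behind time redundancy with majority voting can be sketched in a few lines (hypothetical helper functions; the chapter's time shared TMR interleaves the redundant passes through shared adder hardware rather than naively repeating the computation as done here):

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit takes the value that at
    least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

def time_redundant_add(x, y, adder):
    """Time redundancy: run the same addition three times on the same
    hardware and vote, so a transient fault in any single pass is
    corrected at the cost of extra computation time."""
    r1, r2, r3 = adder(x, y), adder(x, y), adder(x, y)
    return majority_vote(r1, r2, r3)
```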

4. Low Power Digital Multipliers

CMOS digital multipliers have high power dissipation in comparison with other circuits due to carry propagation and spurious transitions. Techniques to reduce switching activity and improve performance at the algorithm and circuit levels are presented. A new approach that reduces switching activity by using combinational self-timed elements and bypassing logic blocks to eliminate redundant operations is proposed.
Edwin de Angel

5. A Unified View of CORDIC Processor Design

The COordinate Rotation Digital Computer (CORDIC) algorithm is a well-known and widely studied method for plane vector manipulation. It uses a sequence of partial vector rotations to approximate the desired one. Under different operating modes, this algorithm can be used either to perform Givens transformations for vector rotation and vectoring, or to evaluate more than a dozen elementary, trigonometric, and hyperbolic functions. CORDIC processors are therefore powerful computing systems for applications involving large amounts of rotation operations and the mathematical functions mentioned above.
CORDIC computation uses only primitive arithmetic operations (additions, subtractions, and shifts) instead of multiplications. This has a great impact on the hardware characteristics, especially circuit complexity. As a consequence, the CORDIC algorithm has become a widely used approach for elementary function evaluation whenever silicon area is a primary constraint. The main drawback is the intrinsic low performance due to the iterative computational approach. In particular, parallelism cannot be easily introduced, since each CORDIC iteration has to select the rotation direction by analyzing the result of the previous one.
In this chapter, a unified view of the CORDIC architecture design is presented. Our goal is to provide a wide spectrum of architectures, a coordinated and comprehensive design methodology, and the main figures of merit characterizing performance and complexity. This methodology contains the basic guidelines for designers to choose an approach with respect to specific requirements and constraints of the application.
Shaoyun Wang, Vincenzo Piuri
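
The basic CORDIC iteration described above can be sketched in floating point (a hardware realization would use fixed point, with the multiplications by 2**-i realized as shifts; the function below is an illustrative sketch, not an architecture from the chapter):

```python
import math

def cordic_rotate(angle, iterations=32):
    """Compute (cos(angle), sin(angle)) with CORDIC in rotation mode,
    using only additions, subtractions, and scalings by powers of two."""
    # Partial rotation angles atan(2**-i) and the accumulated gain K.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0, 0.0, angle  # start on the x-axis; z is the residual angle
    for i in range(iterations):
        # The rotation direction depends on the previous iteration's
        # residual -- the serial dependence noted above.
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * K, y * K  # undo the accumulated rotation gain
```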

6. Multidimensional Systolic Arrays for Computing Discrete Fourier Transforms and Discrete Cosine Transforms

This chapter presents a new approach for computing the multidimensional discrete Fourier transform (DFT) and the multidimensional discrete cosine transform (DCT) in a multidimensional systolic array. There are extensive applications of fast Fourier transform (FFT) and fast cosine transform (FDCT) algorithms. From the basic principle of fast transform algorithms (breaking the computation into successively smaller computations), we find that the multidimensional systolic architecture can be used efficiently for implementing FFT and FDCT algorithms. The essence of the multidimensional systolic array is to combine different types of semi-systolic arrays into one array so that the resulting array becomes truly systolic. This systolic array does not require any preloading of input data, and it generates output data only from boundary PEs. No networks for transposition between intermediate constituent 1-D transforms are required; therefore the entire processing is fully pipelined. This approach is well suited for VLSI implementation, providing simple and regular structures. An area x time^2 complexity estimate shows that the multidimensional systolic array is within a factor of log k of the lower bound for an M-dimensional k-point DFT (k = N^M).
Hyesook Lim
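
The decomposition into constituent 1-D transforms can be illustrated with a short sketch (illustrative code, not the chapter's array): a 2-D DFT computed as 1-D DFTs along the rows, a transpose, and 1-D DFTs along the columns. The explicit transpose here is exactly the intermediate network that the multidimensional systolic array avoids.

```python
import cmath

def dft(x):
    """Direct 1-D DFT (O(N^2); stands in for one 1-D transform stage)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def dft2(matrix):
    """2-D DFT by row-column decomposition: 1-D DFTs along rows, then
    along columns, with an explicit transpose (zip) in between."""
    rows = [dft(list(r)) for r in matrix]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```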

7. Parallel Implementation of a Fast Third-Order Volterra Filtering Algorithm

A parallel implementation of a fast third-order Volterra filtering algorithm is presented. Our initial implementation is on an AT&T DSP-3 parallel processor. This advanced system allows us to focus on the parallelization of the Volterra filter algorithm without the expense of VLSI fabrication. Once the parallel version of the algorithm has been thoroughly tested, our long-range goal is a VLSI implementation. An application to nonlinear digital satellite channels is described.
Hercule Kwan
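
What a third-order Volterra filter computes can be shown with a brute-force direct-form sketch (illustrative only; the chapter's fast algorithm exploits kernel symmetry to avoid this O(M^3)-per-sample cost):

```python
def volterra3(x, h1, h2, h3):
    """Direct-form third-order Volterra filter with memory length M.
    h1, h2, h3 are the linear, quadratic, and cubic kernels; the output
    sums products of up to three delayed input samples."""
    M = len(h1)
    pad = [0.0] * (M - 1) + list(x)  # zero initial conditions
    y = []
    for n in range(len(x)):
        w = pad[n:n + M][::-1]  # w[i] = x[n - i]
        acc = sum(h1[i] * w[i] for i in range(M))
        acc += sum(h2[i][j] * w[i] * w[j]
                   for i in range(M) for j in range(M))
        acc += sum(h3[i][j][k] * w[i] * w[j] * w[k]
                   for i in range(M) for j in range(M) for k in range(M))
        y.append(acc)
    return y
```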

8. Design and Implementation of an Interface Control Unit for Rapid Prototyping

A major difficulty in the rapid prototyping of digital signal processing systems is the interconnection of processors with tailored networks. This difficulty can be alleviated by utilizing a standardized processor-to-processor interface. This approach permits the configuration of application specific hardware, with arbitrary hardware redundancy, to match the signal flow graph of specific applications. The hardware is mapped to the application, as opposed to the traditional approach of mapping the application to the hardware. An inventory of heterogeneous processors, specialized to perform a predefined set of functions, enables rapid prototyping of systems with arbitrary topologies and functionalities. Application specific systems that match the signal flow graph of applications outperform general purpose systems in both speed and throughput.
This research focuses on solving the problems associated with the interconnection of the heterogeneous building blocks into systems with arbitrary topologies. A communication architecture is proposed that allows the interconnection of processors with varying speed and functionalities. Standardization of the Interface Control Unit (ICU) greatly reduces the development cost and time by removing the need to design and develop custom interfaces.
A robust event transaction protocol has been developed which eliminates centralized control and synchronization. The communication protocol is designed to be self-organizing and self-synchronizing by distributing control functions among the individual system resources through the ICU. The protocol is optimized and verified by simulation using Rainbow Nets. Using this approach, it is possible to investigate variables in system configuration, application algorithms, and VLSI technology parameters separately. A gate-level synchronous design of the ICU is developed using LSI Logic Inc.'s LCA300K technology, a CMOS technology with a minimum feature size of 0.7 micron.
Mohammad S. Khan

