
2019 | Book

Approximate Circuits

Methodologies and CAD

Edited by: Sherief Reda, Prof. Dr. Muhammad Shafique

Publisher: Springer International Publishing


About this book

This book provides readers with a comprehensive, state-of-the-art overview of approximate computing, a design paradigm that trades accuracy for better power and performance efficiency by simplifying the underlying computing resources. The authors describe in detail various efforts to generate approximate hardware systems, while also providing an overview of supporting techniques at other computing layers. The book is organized by techniques for various hardware components, from basic building blocks to general circuits and systems.

Table of Contents

Frontmatter

Approximate Arithmetic Circuits

Frontmatter
Chapter 1. Configurable Models and Design Space Exploration for Low-Latency Approximate Adders
Abstract
Adders, being among the most fundamental operators in many processing applications, have received significant attention in approximate computing. Approximate low-power and low-latency adders can markedly improve the energy efficiency and performance of a system at the cost of a reasonable accuracy loss. This chapter focuses on generic configurable models for high-performance (low-latency) approximate adders. These models provide design-time support for choosing an appropriate low-latency approximate adder design that offers the desired level of quality while providing optimal performance. The chapter also covers a mathematical analysis showing that, given a latency constraint, the adder configuration with the highest quality and lowest area/power requirement can be selected directly from the complete design space of low-latency adders. The chapter concludes with extensive results supporting the mathematical analysis, showing that the selected configurations are indeed optimal in real-world applications.
Muhammad Abdullah Hanif, Rehan Hafiz, Muhammad Shafique
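To give a flavor of the configurable low-latency adder models this chapter analyzes, here is a minimal Python sketch of a generic block-based approximate adder: each sum bit sees a carry computed from only the previous `k` bit positions, so the critical path shrinks from O(n) to O(k). The parameter `k` plays the role of the configuration knob; this is an illustrative model of the general idea, not the chapter's actual design.

```python
def approx_add(a, b, n=16, k=4):
    """Block-based low-latency approximate adder (sketch).

    Each sum bit i uses a carry generated only from bits [i-k, i),
    shortening the carry chain at the cost of occasional carry errors.
    With k == n the adder is exact (modulo 2**n).
    """
    result = 0
    for i in range(n):
        lo = max(0, i - k)                       # start of the carry window
        mask = (1 << i) - (1 << lo)              # bits [lo, i) feed the carry
        carry = ((a & mask) + (b & mask)) >> i   # carry into bit i from window
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result
```

With `k=n` the full carry chain is reconstructed and the result matches exact addition, which is a handy sanity check when exploring the configuration space.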
Chapter 2. Approximate Multipliers and Dividers Using Dynamic Bit Selection
Abstract
Within the approximate computing paradigm, approximate arithmetic logic design has received the most attention. The reason is simply the flexibility of such logic: basic building blocks, e.g., adders, multipliers, and dividers, can be used within a wide range of approximate datapaths, delivering the benefits of approximate computing. Ideally, effective implementations of approximate arithmetic can readily be used in different applications without requiring application-level knowledge. In this chapter, a generic methodology for designing approximate arithmetic logic, namely approximate multipliers and approximate dividers, is proposed. The methodology maintains an upper bound on the maximum attainable error and yields a zero-balanced error distribution, averting error accumulation in consecutive processing. Furthermore, it is highly scalable to larger input sizes and offers a wide range of accuracy/hardware-cost trade-offs. We evaluate the methodology with an approximate multiplier and an approximate divider and highlight the significant benefits achieved, using both stand-alone and in-application experiments.
Soheil Hashemi, Sherief Reda
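The dynamic bit selection idea can be sketched in a few lines: each operand is truncated to a window of `k` bits starting at its leading one, the small operands are multiplied exactly, and the result is shifted back. Setting the dropped least significant bit of each truncated operand to 1 is a simple debiasing step that centers the error around zero. This is an illustrative sketch of the general approach, not the chapter's exact hardware design.

```python
def approx_mult(a, b, k=6):
    """Dynamic-bit-selection approximate multiplier (sketch).

    Only the k most significant bits starting at the leading one of each
    operand are multiplied; small operands remain exact. Forcing the LSB
    of each truncated operand to 1 keeps the error roughly zero-centered.
    """
    def truncate(x):
        if x < (1 << k):
            return x, 0                     # small values stay exact
        shift = x.bit_length() - k          # bits dropped below the window
        return (x >> shift) | 1, shift      # set LSB to 1 to debias
    ta, sa = truncate(a)
    tb, sb = truncate(b)
    return (ta * tb) << (sa + sb)
```

Because only `k`-bit operands ever reach the core multiplier, the hardware cost grows with `k` rather than with the full input width, which is where the accuracy/cost knob comes from.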
Chapter 3. Heterogeneous Approximate Multipliers: Architectures and Design Methodologies
Abstract
Multipliers are an integral block of a wide range of error-resilient applications such as audio, image, and video processing, and machine learning. However, multiplier architectures are computationally complex: at high bit-widths, their long carry-adder trees consume considerable power and occupy considerable area. Approximate computing is an emerging design paradigm currently exploited to alleviate such area and power overheads, with slight, affordable degradation in the output quality of error-resilient applications. A multiplier architecture can be approximated at the partial-product generation, accumulation, or summation stage. In this chapter, we focus on the different design aspects of energy-efficient approximate multipliers for both ASIC- and FPGA-based systems.
Semeen Rehman, Bharath Srinivas Prabakaran, Walaa El-Harouni, Muhammad Shafique, Jörg Henkel
Chapter 4. Approximate Arithmetic Circuits: Design and Evaluation
Abstract
Arithmetic circuits are important computing modules in a processor. They play a key role in the performance and energy consumption of many image processing applications. In this chapter, a classification is presented for current designs of approximate arithmetic circuits, including adders, multipliers, and dividers. To understand the features of the various designs, a comparative evaluation of their error and circuit characteristics is performed. The accuracy of the approximate circuits is evaluated through Monte Carlo simulation, and their circuit characteristics are assessed by synthesizing the designs in an STM 28 nm CMOS process. The simulation and synthesis results show the trade-offs of approximate arithmetic circuits between accuracy and hardware efficiency.
Honglan Jiang, Leibo Liu, Fabrizio Lombardi, Jie Han
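Monte Carlo accuracy evaluation of the kind used in this chapter can be sketched compactly: sample random inputs, compare an approximate unit against the exact one, and accumulate standard metrics such as mean error distance (MED) and mean relative error distance (MRED). The lower-part OR adder (LOA) below is a simple stand-in design chosen for illustration; the chapter evaluates a much broader set of circuits.

```python
import random

def loa_add(a, b, n=16, k=4):
    """Lower-part OR adder (LOA): the upper n-k bits are added exactly,
    while the low k bits are approximated with a bitwise OR (the carry
    from the low part is dropped)."""
    mask = (1 << k) - 1
    hi = ((a >> k) + (b >> k)) << k
    return hi | ((a | b) & mask)

def monte_carlo_metrics(adder, n=16, samples=100_000, seed=0):
    """Estimate mean error distance (MED) and mean relative error
    distance (MRED) of an approximate adder over uniform random inputs."""
    rng = random.Random(seed)
    med = mred = 0.0
    for _ in range(samples):
        a, b = rng.randrange(1 << n), rng.randrange(1 << n)
        exact, approx = a + b, adder(a, b, n)
        med += abs(exact - approx)
        mred += abs(exact - approx) / max(exact, 1)
    return med / samples, mred / samples
```

Fixing the random seed makes the comparison between competing designs reproducible, which matters when ranking circuits whose error metrics are close.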
Chapter 5. Probabilistic Error Analysis of Approximate Adders and Multipliers
Abstract
Approximate adders and multipliers are widely advocated for use in error-resilient applications. A very important metric in this regard is the probability of error in these arithmetic circuits, as it allows us to choose the most efficient configuration of an adder or multiplier for a given application. In this chapter, we present an analytical error analysis approach for approximate adders composed of subadder units, and for recursive approximate multipliers with approximate partial products. We also derive the probability mass function (PMF) of the error for both of the considered adder and multiplier models. The results show that the proposed analysis serves as an effective tool for predicting, evaluating, and comparing the accuracy of various approximate adders and multipliers. For illustration, we also show that the comparative performance of different approximate adders and multipliers can be correctly predicted in practical image processing applications.
Sana Mazahir, Muhammad Kamran Ayub, Osman Hasan, Muhammad Shafique
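The flavor of such probabilistic analysis can be shown on a simple case: for a lower-part OR adder with uniform random inputs, the error equals the bitwise AND of the low operand bits, so each error bit is 1 with probability 1/4 independently and the PMF has a closed form. Exhaustive enumeration confirms the derivation. This toy case is ours for illustration and is not the adder or multiplier model analyzed in the chapter.

```python
from collections import Counter
from itertools import product

def loa_error_pmf(k):
    """Analytical PMF of the LOA error for uniform inputs.

    Error = a_lo AND b_lo over the k low bits, so each error bit is set
    with probability 1/4 independently:
    P(error = e) = (1/4)^popcount(e) * (3/4)^(k - popcount(e)).
    """
    return {e: (0.25 ** bin(e).count("1")) * (0.75 ** (k - bin(e).count("1")))
            for e in range(1 << k)}

def loa_error_pmf_enum(k):
    """Exhaustive enumeration of the same PMF, for validation."""
    counts = Counter(a & b for a, b in product(range(1 << k), repeat=2))
    total = (1 << k) ** 2
    return {e: c / total for e, c in counts.items()}
```

Matching an analytical PMF against brute-force enumeration on small widths is a useful sanity check before trusting the formula at widths where enumeration is infeasible.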

Approximate Circuit Synthesis

Frontmatter
Chapter 6. Automatic Synthesis Techniques for Approximate Circuits
Abstract
Approximate circuits are the basic hardware building blocks of approximate computing platforms. Manual approximate designs of simple arithmetic circuits (adders and multipliers) have demonstrated the potential for efficiency improvements using approximate circuits. However, their broader adoption requires systematic methodologies that can create approximate implementations for any arbitrary circuit. An approximate circuit can be viewed as realizing a logic function that slightly deviates from the original specification, while falling within a specified quality constraint. A design automation methodology should therefore enable the generation of “correct-by-construction” approximate circuits that are guaranteed to satisfy designer-specified quality constraints.
This chapter presents an overview of automatic techniques to design approximate circuits and describes the key principles behind them. The proposed techniques apply to both combinational and sequential circuits and to circuits that contain a mix of control and datapath logic. In many applications, the degree of resilience often varies depending on the application context or the dataset being processed. In such cases, it is desirable to construct quality configurable circuits, or circuits that are capable of reconfiguring to adapt their accuracy at runtime. We describe how approximate circuit design frameworks can be extended to enable automatic synthesis of quality configurable circuits. The synthesized approximate circuits achieve considerable benefits in area and energy with minimal design effort.
Ashish Ranjan, Swagath Venkataramani, Shubham Jain, Younghoon Kim, Shankar Ganesh Ramasubramanian, Arnab Raha, Kaushik Roy, Anand Raghunathan
Chapter 7. Approximate Logic Synthesis Using Boolean Matrix Factorization
Abstract
In this chapter, a new approximate circuit synthesis paradigm is presented, where approximations are introduced to the input circuit using Boolean matrix factorization (BMF). For a given multi-input, multi-output circuit, we first build its truth table and then approximate the truth table using BMF in a controllable fashion. The results of the BMF factorization are then used to synthesize the final approximate circuit. To scale our technique to large circuits, we devise a circuit decomposition method that breaks the circuit into manageable subcircuits. Furthermore, to effectively explore the design space of subcircuit approximations, a design space exploration technique is presented. Our approach offers a wide range of fine-grain trade-offs between accuracy and design complexity, i.e., design area and total power. We demonstrate that the proposed methodology can achieve large savings in power and area with small reductions in accuracy.
Soheil Hashemi, Hokchhay Tann, Sherief Reda
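The core BMF idea can be sketched as follows: approximate a truth table M by a Boolean product B ∘ C of low rank and measure the Hamming error, which corresponds to the approximate circuit's error rate. The brute-force rank-1 search below is purely illustrative; the chapter relies on scalable BMF solvers plus circuit decomposition to handle real circuits.

```python
import numpy as np
from itertools import product

def boolean_product(B, C):
    """Boolean matrix product: (B ∘ C)[i, j] = OR_k (B[i, k] AND C[k, j])."""
    return (B[:, :, None] & C[None, :, :]).any(axis=1).astype(int)

def bmf_error(M, B, C):
    """Hamming distance between truth table M and its factorization B ∘ C."""
    return int(np.sum(M != boolean_product(B, C)))

def best_rank1(M):
    """Exhaustive best rank-1 Boolean factorization (tiny matrices only)."""
    rows, cols = M.shape
    best_err = int(M.sum())                      # all-zero factorization
    best_B = np.zeros((rows, 1), dtype=int)
    best_C = np.zeros((1, cols), dtype=int)
    for u in product([0, 1], repeat=rows):
        for v in product([0, 1], repeat=cols):
            B = np.array(u, dtype=int).reshape(-1, 1)
            C = np.array(v, dtype=int).reshape(1, -1)
            err = bmf_error(M, B, C)
            if err < best_err:
                best_err, best_B, best_C = err, B, C
    return best_err, best_B, best_C
```

The factorization rank is the accuracy/complexity knob: lower rank means fewer product terms to synthesize but more truth-table bits flipped.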
Chapter 8. Approximate Hardware Generation Using Formal Techniques
Abstract
In the design of hardware for approximate computing, the exactness requirement between a circuit's specification and its implementation is relaxed. In this chapter we present two methods to generate approximate hardware for a given specification and its non-approximated implementation. We use formal techniques to guarantee that bounds on application-specific error metrics hold.
The first method for approximate hardware generation is an exact BDD-based technique focused on single-output functions. Due to the complexity of the problem, scalability is an issue. For this reason, we further present a heuristic approach that uses symbolic computer algebra to determine the error metric. This approach is tailored to arithmetic circuits. We apply it to ripple-carry adders and compare the results to state-of-the-art handcrafted approximate hardware.
Saman Froehlich, Daniel Große, Rolf Drechsler
Chapter 9. Automated Search-Based Functional Approximation for Digital Circuits
Abstract
The problem of developing an approximate implementation of a given combinational circuit can be formulated as a multi-objective design problem and solved by means of a search algorithm. This approach usually provides many solutions with high-quality trade-offs between key design objectives; however, it is very computationally expensive. This chapter presents a general-purpose method based on genetic programming for automated functional approximation of combinational circuits at the gate and register-transfer levels. It surveys relevant error metrics and circuit parameters that are typically optimized by genetic programming. Special attention is given to techniques capable of providing formal guarantees on error bounds and of accelerating the search process. Case studies on approximate implementations of arithmetic circuits and image operators highlight the quality of results obtained by search-based functional approximation in completely different application domains.
Lukas Sekanina, Zdenek Vasicek, Vojtech Mrazek
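A minimal search loop in the spirit of this chapter can be sketched as a (1+1) evolutionary strategy: represent the circuit as a gate list, mutate one gate's function at a time, evaluate the mutant exhaustively against the golden outputs, and accept it only if the Hamming error stays within a bound. Real genetic programming flows also optimize area and delay and use far more sophisticated evaluation; this sketch and its netlist encoding are ours.

```python
import random

OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
       "XOR": lambda a, b: a ^ b, "NAND": lambda a, b: 1 - (a & b)}

def evaluate(netlist, out_idx, n_inputs):
    """Simulate a gate list over all input vectors; return the output column.
    Signal indices 0..n_inputs-1 are primary inputs; each gate appends one."""
    outputs = []
    for x in range(1 << n_inputs):
        vals = [(x >> i) & 1 for i in range(n_inputs)]
        for op, i, j in netlist:
            vals.append(OPS[op](vals[i], vals[j]))
        outputs.append(vals[out_idx])
    return outputs

def approximate(netlist, out_idx, n_inputs, max_errors, iters=500, seed=1):
    """(1+1) search: mutate one gate's function, keep the mutant if its
    Hamming error versus the exact circuit stays within max_errors."""
    rng = random.Random(seed)
    golden = evaluate(netlist, out_idx, n_inputs)
    best = list(netlist)
    for _ in range(iters):
        cand = list(best)
        g = rng.randrange(len(cand))
        op, i, j = cand[g]
        cand[g] = (rng.choice(list(OPS)), i, j)   # mutate the gate function
        err = sum(a != b for a, b in
                  zip(golden, evaluate(cand, out_idx, n_inputs)))
        if err <= max_errors:
            best = cand
    return best
```

Exhaustive simulation only scales to small circuits; this is exactly why the chapter's emphasis on formal error guarantees and search acceleration matters.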
Chapter 10. Approximate High-Level Synthesis of Custom Hardware
Abstract
Approximate computing exploits trade-offs between the quality and the energy/performance of computed results for inherently error-tolerant applications. At the hardware level, various components, such as arithmetic and logic units (ALUs), have been proposed for building approximate hardware processors. However, existing work has been mostly ad hoc or has relied on expensive iterative simulation and resynthesis for design space exploration. In this chapter, we present an approximate high-level synthesis (AHLS) approach that uses approximate operators to synthesize an energy- or performance-optimized register-transfer level (RTL) design from its high-level C description under overall quality constraints at the design outputs. Effective AHLS requires fast and accurate quality and energy models, together with an optimization technique that efficiently finds a Pareto-optimal design. The quality effects of hardware approximations depend strongly on input data; we therefore employ a statistical formulation that captures input dependency and analytically estimates quality using one-time profiling only. Energy and performance savings due to approximations depend strongly on operation scheduling and binding, and we present an approach that estimates the performance, voltage scaling, and energy impact of approximations while taking into account tight interactions with existing synthesis tasks. The quality, performance, and energy estimation methods are further combined with novel AHLS-specific loop optimizations and heuristic solvers that find Pareto-optimized solutions efficiently. Results show that our tool achieves near-optimal results with low runtimes, demonstrating energy savings of more than 77.6% on average, with up to 24.5% higher savings compared to approaches that consider only switching activity.
Seogoo Lee, Andreas Gerstlauer
Chapter 11. Error Analysis and Optimization in Approximate Arithmetic Circuits
Abstract
This chapter presents a comprehensive study of various error analysis methodologies for evaluating the accuracy of approximate circuits, and the importance of such methodologies in their design. Although approximate circuits leverage the inherent perceptual limitations of human senses, they should be deployed in a manner that does not compromise user experience. In other words, the errors introduced due to using approximate circuits should be within acceptable margins. These margins depend on the target applications, and a systematic approach is required to ensure that the designed approximate circuit indeed meets the specifications in terms of the margins. The first step in achieving this goal is to obtain the error introduced in the output of the circuit due to approximation, and the first part of this chapter discusses various metrics to quantify that error. Since the error not only depends on the circuit structure, but also on the input vectors, these metrics are derived statistically. The error is then modeled as a function of various design parameters of the circuit, as well as the statistics of the input vector. The second part of the chapter discusses these modeling techniques in detail for various types of approximate circuits. Finally, the error model is utilized during the design phase to limit the maximum inaccuracy in approximate circuits. In other words, similar to the timing, power, and area constraints in regular circuit design, error is treated as an additional constraint for approximate circuit design. In this connection, the last part of this chapter discusses a set of optimization algorithms for circuit design using this additional error constraint.
Deepashree Sengupta, Jiang Hu, Sachin S. Sapatnekar

Application-Specific Approximate Accelerators and Systems

Frontmatter
Chapter 12. Approximate Multi-Accelerator Tiled Architecture for Energy-Efficient Motion Estimation
Abstract
Video processing applications are inherently error resilient. This resilience stems from three facts: (1) the inputs are noisy and highly correlated in the spatial and temporal domains; (2) the probabilistic computational algorithms in HEVC are inherently noise tolerant, with error-masking capabilities; and (3) the visual perception of the end user is limited by various psychological and environmental factors. Considering these features, we analyze the complex multimode motion-estimation module in the latest High Efficiency Video Coding (HEVC) standard for employing heterogeneous approximations. This chapter presents a short overview of the HEVC motion estimator with an in-depth analysis of its computational complexity and energy consumption, followed by a full-system approximate architecture for an energy-efficient motion-estimation coprocessor.
Bharath Srinivas Prabakaran, Walaa El-Harouni, Semeen Rehman, Muhammad Shafique
Chapter 13. Hardware–Software Approximations for Deep Neural Networks
Abstract
Neural networks (NNs) are the state of the art for many artificial intelligence (AI) applications. However, to facilitate the training process, most neural networks are over-parameterized, resulting in significant computational and memory overheads. Numerous optimization techniques have therefore been proposed to alleviate the computational and memory requirements of these NNs. In this chapter, we highlight one prominent paradigm, approximate computing, that can significantly reduce the resource requirements of these networks. We describe a sensitivity analysis methodology for estimating the significance of sub-parts of state-of-the-art NNs. Based on this significance analysis, we then present a methodology for introducing tolerable amounts of approximation at various stages of the network: removal of ineffectual filters/neurons at the software layer, and precision reduction and memory approximations at the hardware layer. The chapter closes by highlighting a few of the prominent challenges in adopting different types of approximation and their effects on the overall efficiency and accuracy of the baseline networks.
Muhammad Abdullah Hanif, Muhammad Usama Javed, Rehan Hafiz, Semeen Rehman, Muhammad Shafique
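One of the software-layer approximations mentioned, removal of ineffectual filters, can be illustrated with a simple magnitude-based significance proxy: rank convolution filters by L1 norm and zero out the least significant ones. The chapter's actual sensitivity analysis is more involved; this sketch only conveys the mechanism.

```python
import numpy as np

def prune_filters(weights, keep_ratio=0.75):
    """Zero out the lowest-L1-norm filters of a conv layer.

    `weights` has shape (filters, ...); the L1 norm of each filter is a
    cheap proxy for its significance.  Returns the pruned weights and a
    boolean mask of the filters that were kept.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    k = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.argsort(norms)[-k:]          # indices of the k largest norms
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[keep] = True
    pruned = weights.copy()
    pruned[~mask] = 0.0
    return pruned, mask
```

In practice the kept/pruned split is chosen per layer from the sensitivity analysis, and the network is fine-tuned afterwards to recover accuracy.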
Chapter 14. Lightweight Deep Neural Network Accelerators Using Approximate SW/HW Techniques
Abstract
Deep neural networks (DNNs) provide state-of-the-art accuracy in many application domains, such as computer vision and speech recognition. At the same time, DNNs require millions of expensive floating-point operations to process each input, which limits their applicability to resource-constrained systems with tight hardware area or power budgets. Our goal is to devise lightweight, approximate accelerators for DNNs that use fewer hardware resources with negligible reduction in accuracy. To simplify the hardware requirements, we analyze a spectrum of data precision methods ranging from fixed-point and dynamic fixed-point to powers-of-two and binary precision. In conjunction, we provide new training methods to compensate for the simpler hardware. To boost the accuracy of the proposed lightweight accelerators, we describe ensemble processing techniques that use an ensemble of lightweight DNN accelerators to achieve the same or better accuracy than the original floating-point accelerator, while still using far fewer hardware resources. Using 65 nm technology libraries and an industrial-strength design flow, we demonstrate a custom hardware accelerator design and training procedure that achieve low power and low latency while incurring insignificant accuracy degradation. We evaluate our design and techniques on the CIFAR-10 and ImageNet datasets and show that significant reductions in power and inference latency are realized.
Hokchhay Tann, Soheil Hashemi, Sherief Reda
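Two points on the data precision spectrum described above, fixed-point and powers-of-two quantization, can be sketched as follows. These are generic illustrations of the numeric formats, not the chapter's trained accelerator designs.

```python
import numpy as np

def quantize_fixed(w, frac_bits=6):
    """Fixed-point quantization: round weights to a grid of 2**-frac_bits."""
    scale = 1 << frac_bits
    return np.round(w * scale) / scale

def quantize_pow2(w):
    """Powers-of-two quantization: round each nonzero weight's magnitude to
    the nearest power of two, so multiplications reduce to bit shifts."""
    mag = np.abs(w)
    pow2 = 2.0 ** np.round(np.log2(np.maximum(mag, 1e-30)))
    return np.where(mag > 0, np.sign(w) * pow2, 0.0)
```

Powers-of-two weights are the cheaper format in hardware (a multiplier becomes a shifter), which is why the accompanying retraining step is needed to recover the accuracy lost to the coarser grid.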
Chapter 15. Approximate Computing Techniques for Deep Neural Networks
Abstract
Deep neural networks (DNNs) have emerged as a powerful and versatile set of techniques enabling unprecedented success on challenging artificial intelligence (AI) problems. However, the recent success of DNNs comes at the cost of high computational complexity using very large models, which often require 100s of MBs of data storage, ExaOps of computation, and immense bandwidth for data movement. Despite advances in computing systems, it still takes days to weeks to train state-of-the-art DNNs—which directly limits the pace of innovation. Approximate computing is gaining traction as a promising method to alleviate demanding computational complexity in DNNs. Exploiting their inherent resiliency, approximate computing aims to relax exactness constraints with the goal of obtaining significant gains in computational throughput while maintaining an acceptable quality of results. In this chapter, we review the wide spectrum of approximate computing techniques that have been successfully applied to DNNs.
Jungwook Choi, Swagath Venkataramani
Chapter 16. Approximate Computing for Iris Recognition Systems
Abstract
Leveraging the error tolerance of many emerging applications, approximate computing techniques trade off small inaccuracies in computation to significantly reduce computational resources such as runtime, power, and design area. Approximate computing has been successfully applied to a wide range of areas, including computer vision and machine learning. In this chapter, we demonstrate a novel application of approximate computing in the field of biometric security through a comprehensive iris recognition system case study. Our system is an end-to-end flow that captures input eye images from a near-infrared (NIR) camera and produces the iris encoding. The goal is to produce a sufficiently accurate final encoding despite relying on intermediate approximate computation steps. Unlike previous efforts in approximate computing, which typically target individual algorithms, this chapter explores a complete software/hardware pipeline that computes iris codes from a live camera feed on an FPGA-based SoC. The flow consists of four major algorithms, across which eight approximation knobs are identified for accuracy-versus-runtime trade-offs at both the algorithmic and hardware levels. To explore this large design space for optimal parameter configurations, we employ a reinforcement learning technique with a recurrent neural network as the learning agent. Using the proposed techniques, we demonstrate a significant runtime saving of 48× while conforming to industry-standard accuracy requirements for iris biometric systems.
Hokchhay Tann, Soheil Hashemi, Francesco Buttafuoco, Sherief Reda
Chapter 17. Approximate Systems: Synergistically Approximating Sensing, Computing, Memory, and Communication Subsystems for Energy Efficiency
Abstract
Emerging application domains exhibit the property of intrinsic error resilience that enables new avenues for energy optimization of computing systems, namely the introduction of a small amount of approximations during system operation in exchange for substantial energy savings. Almost all prior work in the area of approximate computing has focused on individual subsystems of a computing system, e.g., the computational subsystem or the memory subsystem. Since they focus only on individual subsystems, these techniques are unable to exploit the large energy-saving opportunities that stem from adopting a full-system perspective and approximating multiple subsystems of a computing platform simultaneously in a coordinated manner. Towards this end, this chapter introduces the concept of an Approximate System that performs joint approximations across different subsystems, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use the example of a smart camera system that executes various computer vision and image processing applications to illustrate how the sensing, memory, processing, and communication subsystems can all be approximated synergistically. The approximate smart camera system was implemented using an Altera Stratix IV GX FPGA development board, a Terasic TRDB-D5M 5 Megapixel camera module, a Terasic RFS WiFi module, and a 1 GB DDR3 DRAM SODIMM module. Experimental results obtained using six application benchmarks demonstrate that the proposed full-system approximation methodology achieves significant energy savings of 1.8× to 5.5× on average over individual subsystem-level approximations for minimal (<1%) application-level quality loss.
Arnab Raha, Vijay Raghunathan

Approximate Methods for CPUs and GPUs

Frontmatter
Chapter 18. Approximate Ultra-Low Voltage Many-Core Processor Design
Abstract
Computing at ultra-low voltages can significantly increase energy efficiency; however, operating frequency and resilience to errors degrade as the operating voltage approaches the transistor threshold voltage. More parallelism can help prevent the throughput degradation arising from the lower frequency, but it also exposes more components to errors, exacerbating the already intensified vulnerability. This chapter shows how to exploit the intrinsic noise tolerance of emerging recognition, mining, and synthesis (RMS) applications to address degraded resilience at ultra-low voltages by embracing errors.
Nam Sung Kim, Ulya R. Karpuzcu
Chapter 19. Approximate CPU and GPU Design Using Emerging Memory Technologies
Abstract
Approximate computing trades off energy and accuracy in computing systems in order to accelerate computation. In this chapter, we outline novel designs that enable approximation in CPUs and GPUs. Cores are approximated by placing a small associative memory next to each core to store precomputed results. The associative memory returns precomputed results not only for operands that match exactly but also for inexact matches, providing significant energy savings; if the operands do not match closely enough, exact computation is performed. The CPUs use associative memory based on nonvolatile memory, in particular memristor technology, to approximate complex functions that tolerate approximation well, while the GPUs place an associative memory next to each floating-point unit to accelerate floating-point operations. Each core can dynamically adapt its approximation while controlling for accuracy. The proposed designs achieve more than 12.6× energy improvement and 6.6× speedup for CPU-based workloads, and a 2× improvement in energy efficiency for GPUs, while providing the desired accuracy.
Mohsen Imani, Tajana S. Rosing
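The computation-reuse mechanism described here can be illustrated with a toy associative memory: it caches (operand, operand) → result entries and returns a stored result whenever new operands fall within a tolerance of a previously seen pair. This is a software stand-in for the memristive hardware; the class, its parameters, and the FIFO eviction policy are our illustrative choices.

```python
class ApproxAssocMem:
    """Toy associative memory for approximate computation reuse.

    `compute(a, b)` returns a stored result when (a, b) inexactly matches
    a cached entry (within `tolerance`); otherwise it computes exactly
    with `fn` and remembers the result for future reuse.
    """
    def __init__(self, fn, tolerance=0.01, capacity=32):
        self.fn, self.tolerance, self.capacity = fn, tolerance, capacity
        self.entries = []                 # list of ((a, b), result)
        self.hits = 0

    def compute(self, a, b):
        for (xa, xb), r in self.entries:
            if abs(a - xa) <= self.tolerance and abs(b - xb) <= self.tolerance:
                self.hits += 1
                return r                  # inexact match: reuse stored result
        r = self.fn(a, b)                 # no close match: exact computation
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)           # simple FIFO eviction
        self.entries.append(((a, b), r))
        return r
```

The `tolerance` parameter plays the role of the runtime accuracy knob: widening it raises the hit rate (energy savings) at the cost of larger output error.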
Chapter 20. Approximate Cache Architectures
Abstract
In this chapter, we explore the application of approximate computing techniques to caches and the memory access portion of the processor pipeline. As memory accesses contribute significantly to the latency and energy consumption of applications, they have long been the target of various optimizations. Large cache hierarchies are a mainstay in modern designs in order to avoid the long latency and high energy associated with accessing DRAM on every load or store request. With growing data set sizes, building ever larger caches is not necessarily an effective use of silicon real estate. We present recent work that improves the effectiveness of cache storage and reduces the cost of memory accesses by exploiting the inherently noisy or imprecise data that these applications operate on. First, we consider work that selectively forgoes loading data from the caches and memory when the processor can make a reasonable estimate of the value that is needed. Next, we explore work that selectively determines which values to store in the cache through approximate deduplication of data; by reducing how much data needs to be stored in the cache, we see an increase in the effective cache capacity.
Natalie Enright Jerger, Joshua San Miguel
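The approximate-deduplication idea can be sketched by keying cache blocks on their high-order bits, so that blocks differing only in low-order (noise-level) bits share one stored entry and effective capacity grows. This toy model and its `tolerance_bits` parameter are our illustration, not the surveyed hardware design.

```python
class ApproxDedupCache:
    """Toy cache with approximate deduplication of data blocks.

    Blocks whose word values agree in all but the low `tolerance_bits`
    bits map to the same key and share a single stored representative.
    """
    def __init__(self, tolerance_bits=4):
        self.tolerance_bits = tolerance_bits
        self.store = {}                    # key -> representative block

    def _key(self, block):
        return tuple(v >> self.tolerance_bits for v in block)

    def insert(self, block):
        # Near-duplicate blocks collapse onto the first-seen representative.
        self.store.setdefault(self._key(block), block)

    def lookup(self, block):
        return self.store.get(self._key(block))
```

Reads of a deduplicated block return the representative rather than the exact original, which is precisely the accuracy-for-capacity trade this line of work exploits.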
Chapter 21. Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction
Abstract
In this chapter, we introduce a novel solution to tackle two fundamental memory bottlenecks in accelerator-rich architectures: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). Exploiting the inherent error resilience of a wide range of applications, we introduce an approximation technique, called rollback-free value prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP never checks for or recovers from load value mispredictions. As such, RFVP avoids the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops some fraction of load requests which miss in the cache after predicting their values, hence reducing memory bandwidth contention by removing them from the system. This drop rate then becomes a knob to control the trade-off between performance/energy efficiency and output quality. Employing our technique in a modern GPU across a diverse set of applications delivers, on average, 40% speedup and 31% energy reduction, with average 8.8% quality loss. With 10% loss in quality, the benefits reach a maximum of 2.4× speedup and 2.0× energy reduction.
Amir Yazdanbakhsh, Gennady Pekhimenko, Hadi Esmaeilzadeh, Onur Mutlu, Todd C. Mowry
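The RFVP mechanism can be sketched with a toy cache model: on a miss to a safe-to-approximate load, a simple last-value predictor supplies the value and, at a configurable drop rate, the memory request is discarded entirely, with no misprediction checking or rollback. The class and its FIFO eviction are our illustrative simplifications of the design described in the chapter.

```python
import random

class RFVPCache:
    """Toy cache with rollback-free value prediction (RFVP sketch)."""
    def __init__(self, memory, size=16, drop_rate=0.5, seed=0):
        self.memory = memory
        self.lines = {}                  # addr -> cached value
        self.size = size
        self.last_value = 0              # last-value predictor state
        self.drop_rate = drop_rate       # knob: quality vs. bandwidth
        self.rng = random.Random(seed)
        self.fetches = 0                 # off-chip requests actually issued

    def load(self, addr, approximable=True):
        if addr in self.lines:
            return self.lines[addr]      # hit: exact value
        if approximable and self.rng.random() < self.drop_rate:
            return self.last_value       # predict and DROP the request
        value = self.memory[addr]        # normal miss: fetch and fill
        self.fetches += 1
        self.last_value = value
        if len(self.lines) >= self.size:
            self.lines.pop(next(iter(self.lines)))  # simple FIFO eviction
        self.lines[addr] = value
        return value
```

Because dropped requests are never verified, there are no pipeline flushes; `drop_rate` directly trades output quality for reduced bandwidth contention, mirroring the knob described above.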
Chapter 22. Compilation and Other Software Techniques Enabling Approximate Computing
Abstract
Successful application of approximate computing requires cooperation across the entire hardware–software stack. Generally, the hardware must provide some means of approximation, which then has to be reflected in the instruction set architecture and exposed to assembly programmers or the compiler. At the other end of the stack, users are often required to indicate how much accuracy or quality-of-service loss is tolerable. The challenge is to bring the two together in a synergistic, automated workflow that maximizes developer productivity. This is where the compiler, middleware, runtime, and automated software analysis come into the picture. In this chapter, we examine techniques for loop perforation and precision analysis, as well as how to utilize half precision, inexact hardware, and approximate memory, with the aim of achieving more energy-efficient or better-performing code at the expense of a user-acceptable loss of accuracy in the results. We also discuss the limitations of what current techniques can achieve.
Weng-Fai Wong, Pooja Roy, Rajashi Ray, Nhut-Minh Ho
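Loop perforation, the first technique listed, can be shown in two lines: execute only every stride-th iteration and rescale the result, trading accuracy for a roughly stride-fold reduction in work. The function name and the mean-of-a-list example are ours, chosen as the simplest perforable computation.

```python
def perforated_mean(xs, stride=2):
    """Loop perforation sketch: sample every `stride`-th element.

    The loop body runs len(xs)/stride times instead of len(xs) times,
    so work drops by ~stride× at the cost of a sampling error.
    """
    sampled = xs[::stride]
    return sum(sampled) / len(sampled)
```

With `stride=1` the computation is exact, which is how a compiler-driven flow would fall back when the user's quality bound is tight.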
Backmatter
Metadata
Title
Approximate Circuits
Edited by
Sherief Reda
Prof. Dr. Muhammad Shafique
Copyright year
2019
Electronic ISBN
978-3-319-99322-5
Print ISBN
978-3-319-99321-8
DOI
https://doi.org/10.1007/978-3-319-99322-5
