About this book

This book covers algorithmic and hardware implementation techniques to enable embedded deep learning. The authors describe synergetic design approaches at the application, algorithm, computer-architecture, and circuit levels that help reduce the computational cost of deep learning algorithms. The impact of these techniques is demonstrated in four silicon prototypes for embedded deep learning.

Gives a broad overview of effective solutions for energy-efficient neural networks on battery-constrained wearable devices;

Discusses the optimization of neural networks for embedded deployment on all levels of the design hierarchy – applications, algorithms, hardware architectures, and circuits – supported by real silicon prototypes;

Elaborates on how to design efficient Convolutional Neural Network processors, exploiting parallelism and data-reuse, sparse operations, and low-precision computations;

Supports the introduced theory and design concepts with four real silicon prototypes. Their physical implementation and achieved performance are discussed in detail to illustrate and highlight the introduced cross-layer design concepts.

Table of Contents

Frontmatter

Chapter 1. Embedded Deep Neural Networks

Abstract
Deep learning networks have recently emerged as the state-of-the-art classification algorithms in artificial intelligence, achieving super-human performance in a number of perceptive tasks in computer vision and automated speech recognition. Although these networks are extremely powerful, bringing their functionality to always-on embedded devices, and hence to wearable applications, is currently impossible because of their compute and memory requirements. First, this chapter introduces the basic concepts in machine learning and deep learning: network architectures and how to train them. Second, it lists the challenges associated with the large compute requirements in deep learning and outlines a vision to overcome them. Finally, it gives an overview of the authors' contributions to the field and the general structure of the book.
Bert Moons, Daniel Bankman, Marian Verhelst

Chapter 2. Optimized Hierarchical Cascaded Processing

Abstract
As discussed in Chap. 1, neural network-based applications are still too costly to be embedded on mobile and always-on devices. This chapter discusses a first application-level solution for this problem. The wake-up-based detection scenario is generalized to hierarchical cascaded processing, where a hierarchy of increasingly complex classifiers, each designed and trained for a specific sub-task, is used to minimize the overall system’s energy cost. An optimal hierarchical cascaded system takes input data statistics and the cost of building blocks into account to minimize the energy consumption of the full application. The approach is hence a hardware/cost-algorithm aware co-optimization and a first major contribution of this text. The chapter introduces all relevant terminology and theory, derives a roofline to analyze the performance of a cascaded system, introduces an optimization framework, and finally illustrates the usage and benefits of cascades in a 100-face recognition example. The chips designed in Chap. 5 are specifically tuned for usage in a hierarchical setup: networks at reduced precision can be used for simple tasks at a high energy efficiency. The chips designed in Chap. 6 are good candidates for wake-up stages.
Bert Moons, Daniel Bankman, Marian Verhelst
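
The energy argument behind a cascade can be illustrated with a small sketch. It is not the book's optimization framework: the function name `expected_cascade_energy`, the per-stage energies, and the pass rate below are hypothetical values chosen only for illustration. The idea shown is that a later, more expensive stage only costs energy for the fraction of inputs that earlier stages forward to it.

```python
def expected_cascade_energy(stage_energies, pass_rates):
    """Expected energy per input for a cascade of classifiers.

    stage_energies[i]: energy of running stage i on one input (arbitrary units).
    pass_rates[i]: fraction of inputs reaching stage i that stage i forwards
                   to stage i + 1 (so there is one fewer rate than stages).
    All numbers are illustrative assumptions, not measurements.
    """
    if len(pass_rates) != len(stage_energies) - 1:
        raise ValueError("need exactly one pass rate between consecutive stages")
    total = 0.0
    reach = 1.0  # probability that an input reaches the current stage
    for i, energy in enumerate(stage_energies):
        total += reach * energy
        if i < len(pass_rates):
            reach *= pass_rates[i]
    return total


# Hypothetical two-stage system: a cheap wake-up detector in front of a
# large network that is 100x more expensive per input.
cascade = expected_cascade_energy([1.0, 100.0], [0.05])
always_on_large = expected_cascade_energy([100.0], [])
print(f"cascade: {cascade:.2f}, large network only: {always_on_large:.2f}")
```

With these assumed numbers the cascade is cheaper because the large network runs on only 5% of the inputs; with a high pass rate the extra wake-up stage would add cost instead, which is why the input statistics matter.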

Chapter 3. Hardware-Algorithm Co-optimizations

Abstract
As discussed in Chap. 1, neural network-based applications are still too costly to be embedded on mobile and always-on devices. This chapter discusses hardware-aware algorithm-level solutions for this problem. As an introduction to this topic, it gives an overview of existing work in hardware and neural network co-optimization. Two of the authors' own contributions in hardware-algorithm optimization are discussed and compared: network quantization at test time and at train time. The chapter ends with a methodology for designing minimum-energy quantized neural networks (networks trained for low-precision fixed-point operation), a second major contribution of this text.
Bert Moons, Daniel Bankman, Marian Verhelst
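
As a rough illustration of test-time quantization, the sketch below rounds already-trained values to a signed fixed-point grid. The uniform format, the function name `quantize_fixed_point`, and the example values are assumptions made for this sketch; the chapter's actual methodology is not reproduced here.

```python
def quantize_fixed_point(values, bits, frac_bits):
    """Round values to signed fixed-point numbers.

    bits: total word length, including the sign bit.
    frac_bits: number of bits after the binary point.
    Values outside the representable range are clamped (saturated).
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    scale = 2 ** frac_bits
    q_min = -(2 ** (bits - 1))
    q_max = 2 ** (bits - 1) - 1
    quantized = []
    for v in values:
        q = round(v * scale)            # nearest integer code
        q = max(q_min, min(q_max, q))   # saturate to the word length
        quantized.append(q / scale)     # back to a real value
    return quantized


# Hypothetical weights, quantized to 4 bits with 2 fractional bits.
weights = [0.3, -1.7, 5.0]
print(quantize_fixed_point(weights, bits=4, frac_bits=2))
```

Fewer bits make each value cheaper to store and multiply but increase the rounding and saturation error, which is the accuracy-versus-energy trade-off the abstract refers to.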

Chapter 4. Circuit Techniques for Approximate Computing

Abstract
This chapter focuses on approximate computing (AC), a set of software- and primarily hardware-level techniques in which algorithm accuracy is traded for energy consumption by deliberately introducing acceptable errors into the computing process. It is hence a means of efficiently exploiting a neural network’s fault-tolerance to reduce its energy consumption, as was first discussed on the system level in Chap. 3. Approximate computing techniques have become crucial to reduce energy in modern neural network acceleration, as computational and storage demands are still high and traditional methods in device engineering and architectural design fail to significantly reduce those costs. The first part of this chapter is a general introduction to common approximate computing techniques on several levels of the design hierarchy. The second part focuses on dynamic-voltage-accuracy-frequency-scaling (DVAFS), a third major contribution of this text. It is a dynamic arithmetic precision scaling method on the circuit-level that enables minimum energy test-time FPNNs and QNNs, as discussed in Chap. 3. Chapter 5 discusses two physically implemented CNN chips that apply this DVAFS technique in real silicon. BinarEye, discussed in Chap. 6, can be used in DVAFS modes as well.
Bert Moons, Daniel Bankman, Marian Verhelst

Chapter 5. ENVISION: Energy-Scalable Sparse Convolutional Neural Network Processing

Abstract
This chapter focuses on Envision: two generations of energy-scalable sparse convolutional neural network processors. They achieve state-of-the-art (SotA) energy efficiency by leveraging the three key CNN characteristics discussed in Chap. 3. (a) Inherent CNN parallelism is exploited through a highly parallelized processor architecture that minimizes internal memory bandwidth. (b) Inherent network sparsity in pruned networks and ReLU-activated feature maps is leveraged by compressing sparse IO datastreams and skipping unnecessary computations. (c) The inherent fault tolerance of CNNs is exploited by making this architecture DVAS or DVAFS compatible in Envision V1 and V2, respectively, according to the theory discussed in Chap. 4. This capability allows minimizing energy consumption for any CNN, with any computational precision requirement up to 16b fixed-point. Through its energy scalability and high energy efficiency, Envision lends itself well to the hierarchical applications discussed in Chap. 2. It thereby enables CNN processing in always-on embedded applications.
Bert Moons, Daniel Bankman, Marian Verhelst
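
The benefit of skipping computations on sparse data can be shown with a software analogy. This is only a sketch under simple assumptions (a single dot product, hypothetical function names and values); it does not model the Envision hardware, its dataflow, or its IO compression scheme.

```python
def relu(values):
    """ReLU activation: negative inputs become zero, creating sparsity."""
    return [max(0, v) for v in values]


def sparse_dot(activations, weights):
    """Dot product that skips multiply-accumulates with a zero operand.

    Returns the result and the number of multiply-accumulates performed.
    """
    acc = 0
    macs = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:
            continue  # product would be zero, so no work is needed
        acc += a * w
        macs += 1
    return acc, macs


# Hypothetical feature-map values and pruned weights.
activations = relu([3, -2, 0, 5, -1])   # becomes [3, 0, 0, 5, 0]
weights = [2, 4, 1, 0, 7]               # one weight pruned to zero
result, macs = sparse_dot(activations, weights)
print(f"result={result}, MACs performed={macs} of {len(weights)}")
```

The result equals the dense dot product, but only the pairs with two nonzero operands cost a multiply-accumulate.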

Chapter 6. BINAREYE: Digital and Mixed-Signal Always-On Binary Neural Network Processing

Abstract
The Envision CNN processors discussed in Chap. 5 are efficient but not sufficient for always-on embedded inference. Both neural networks and ASICs can be further optimized for such specific applications. To this end, this chapter focuses on two prototypes of BinaryNets, neural networks with all weights and activations constrained to ±1. Both chips target always-on visual applications and can be used as visual wake-up sensors in a hierarchical vision application, as discussed in Chap. 2. These two chips are orthogonally optimized. The Mixed-Signal Binary Neural Net (MSBNN) accelerator is an implementation of the 256X architecture. It is fully optimized for energy efficiency by leveraging analog computations. The all-digital BinarEye, an implementation of the SX architecture, focuses on the system level. It is designed for flexibility, allowing it to trade off energy for network accuracy at run-time. This chapter discusses and compares both designs.
Bert Moons, Daniel Bankman, Marian Verhelst
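
When weights and activations are restricted to +1 and -1, a dot product can be computed with bitwise operations instead of multiplications. The sketch below shows this widely used XNOR-and-popcount formulation; the bit encoding (bit 1 for +1, bit 0 for -1) and the function names are assumptions for illustration, and nothing here is taken from the MSBNN or BinarEye designs.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an integer; bit i set means +1."""
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("binary networks only use +1 and -1")
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks positions where the signs agree (product +1); every other
    position contributes -1, so the sum is matches - (n - matches).
    """
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask
    matches = bin(agree).count("1")  # population count
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(packed, reference)
```

Both values printed are the same dot product, computed once with bit operations and once with ordinary multiplications.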

Chapter 7. Conclusions, Contributions, and Future Work

Abstract
This dissertation has focused on techniques to minimize the energy consumption of deep learning algorithms for embedded applications on battery-constrained wearable edge devices. Although state of the art (SotA) in many typical machine-learning tasks, deep learning algorithms are also very costly in terms of energy consumption, due to the large number of computations they require and their huge model sizes. Because of this, deep learning applications on battery-constrained wearables have only been possible through wireless connections with a resourceful cloud. This setup has several drawbacks. First, there are privacy concerns: users must share their raw data (images, video, locations, and speech) with a remote system. As most users are not willing to share all of this, large-scale applications cannot yet be developed. Second, the cloud setup requires users to be connected all the time, which is infeasible given current cellular coverage. Furthermore, real-time applications require low-latency connections, which cannot be guaranteed using the current communication infrastructure. Finally, the wireless connection is very inefficient, requiring too much energy per transferred bit for real-time data transfer on energy-constrained platforms. All these issues (privacy, latency/connectivity, and costly wireless connections) can be resolved by moving towards computing on the edge.
Bert Moons, Daniel Bankman, Marian Verhelst

Backmatter

Further Information