DOI: 10.1145/3020078.3021736
Research article, FPGA Conference Proceedings

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Published: 22 February 2017

ABSTRACT

As convolution layers contribute most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply-and-accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or tune only some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, a >3.2× improvement over state-of-the-art FPGA implementations of the VGG model.
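To make the "four levels of loops" concrete, the convolution the abstract describes can be written out as a naive loop nest. This is an illustrative sketch only, not the paper's accelerator dataflow; the function and variable names (conv_layer, nif, nof, etc.) are hypothetical, and it assumes stride 1 with no padding.

```python
def conv_layer(inp, weights):
    """Naive CNN convolution sketch (illustrative, not the paper's design).

    inp:     input feature maps,  shape [Nif][H][W]
    weights: kernel weights,      shape [Nof][Nif][K][K]
    Returns output feature maps,  shape [Nof][H-K+1][W-K+1]
    """
    nof = len(weights)          # number of output feature maps
    nif = len(inp)              # number of input feature maps
    k = len(weights[0][0])      # kernel window size (K x K)
    oh = len(inp[0]) - k + 1    # output height (stride 1, no padding)
    ow = len(inp[0][0]) - k + 1 # output width

    out = [[[0.0] * ow for _ in range(oh)] for _ in range(nof)]
    for no in range(nof):                  # Loop level 4: output feature maps
        for y in range(oh):                # Loop level 3: scan within one
            for x in range(ow):            #   output feature map
                acc = 0.0
                for ni in range(nif):      # Loop level 2: input feature maps
                    for ky in range(k):    # Loop level 1: MACs within one
                        for kx in range(k):#   K x K kernel window
                            acc += inp[ni][y + ky][x + kx] * weights[no][ni][ky][kx]
                out[no][y][x] = acc
    return out
```

Loop unrolling, tiling, and interchange, the techniques the abstract names, are all transformations of this nest: unrolling a loop level maps its iterations onto parallel MAC units, tiling bounds the working set so it fits on-chip, and interchange reorders the levels to change which data (pixels, weights, or partial sums) is reused innermost.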


Published in:
FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2017, 312 pages
ISBN: 9781450343541
DOI: 10.1145/3020078
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance rates: FPGA '17 paper acceptance rate: 25 of 101 submissions (25%). Overall acceptance rate: 125 of 627 submissions (20%).
