DOI: 10.1145/2966986.2967011
Research Article

Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks

Published: 07 November 2016

ABSTRACT

With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved remarkable success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of computation-demanding CNNs, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both the computation-intensive convolutional layers and the communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the utilization of the underlying FPGA computing and bandwidth resources, with a key focus on bandwidth optimization through memory access reorganization, which has not been studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configuration. Finally, we integrate Caffeine into the industry-standard deep learning framework Caffe. We evaluate Caffeine and its Caffe integration by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on the Xilinx KU060 FPGA and 636 GOPS on the Virtex7 690t FPGA, which are, to the best of our knowledge, the best published results. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3× performance and 43.5× energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy efficiency than a GPU implementation on a medium-sized FPGA (KU060). Performance projections for a system with a high-end FPGA (Virtex7 690t) show even higher gains.
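The "uniformed representation" in the abstract rests on a standard identity: a fully connected layer y = Wx is equivalent to a convolution whose kernel spans the entire input feature map, so a single convolution engine can serve both layer types. The following is a minimal numerical sketch of that equivalence (not the authors' code; the shapes and variable names are illustrative assumptions):

```python
import numpy as np

# Hedged sketch: an FC layer y = W @ x expressed as a "valid" convolution
# whose kernel covers the whole input feature map, producing a 1x1 output
# per filter. This is the equivalence behind a uniformed conv/FCN
# matrix-multiplication representation; all sizes here are made up.

rng = np.random.default_rng(0)

C, H, Wd = 8, 4, 4          # input channels, height, width
M = 16                      # number of output neurons (= filters)

x = rng.standard_normal((C, H, Wd))
W = rng.standard_normal((M, C, H, Wd))   # one full-size kernel per output

# FC view: flatten the feature map and matrix-multiply
y_fc = W.reshape(M, -1) @ x.reshape(-1)

# Conv view: each output is the sum of an elementwise product of a
# full-size kernel with the input (a convolution with no spatial sliding)
y_conv = np.array([(W[m] * x).sum() for m in range(M)])

assert np.allclose(y_fc, y_conv)
print("FC and conv views agree:", y_fc.shape)
```

In this framing the FCN layer becomes one more convolutional matrix multiplication; the paper's remaining contribution (batching and memory-access reorganization so the bandwidth-bound FCN layers use the FPGA's DRAM bandwidth efficiently) is orthogonal to this identity.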


Published in: 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2016, 946 pages.
Copyright © 2016. Publisher: IEEE Press.
