Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks

Authors:
Chen Zhang, Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China
Zhenman Fang, Computer Science Department, University of California, Los Angeles, USA
Peipei Zhou, Computer Science Department, University of California, Los Angeles, USA
Peichen Pan, Falcon-computing Inc., USA
Jason Cong, Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China

2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2016, Pages 1–8
https://doi.org/10.1145/2966986.2967011

Published: 07 November 2016

ABSTRACT

With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved remarkable success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of computation-demanding CNNs, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine with the goal of maximizing the utilization of the underlying FPGA computing and bandwidth resources, with a key focus on bandwidth optimization through memory access reorganization, which was not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configuration. Finally, we integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on a Xilinx KU060 FPGA and 636 GOPS on a Virtex7 690t FPGA. To the best of our knowledge, this is the best published result. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3× and 43.5× performance and energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy efficiency over the GPU implementation on a medium-sized FPGA (KU060). Performance projections to a system with a high-end FPGA (Virtex7 690t) show even higher gains.
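
The idea of one representation for both layer types can be illustrated with a well-known equivalence: a fully connected layer computes the same result as a convolution whose filters span the entire input. The Python sketch below is only an illustration under that assumption. It is not the mapping or the code used by Caffeine, which the abstract does not detail, and the names `conv2d_valid`, `fully_connected`, and `fc_as_conv` are hypothetical.

```python
# Illustrative sketch only: shows the generic equivalence between a fully
# connected layer and a convolution whose kernel covers the whole input.
# It is not the mapping or code used by Caffeine; all names are hypothetical.

def conv2d_valid(inp, weights):
    """Direct convolution (cross-correlation), stride 1, no padding.

    inp:     [C][H][W] input feature maps
    weights: [M][C][K][K] filters
    returns: [M][H-K+1][W-K+1] output feature maps
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, K = len(weights), len(weights[0][0])
    out_h, out_w = H - K + 1, W - K + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(M)]
    for m in range(M):
        for r in range(out_h):
            for c in range(out_w):
                acc = 0.0
                for ch in range(C):
                    for i in range(K):
                        for j in range(K):
                            acc += weights[m][ch][i][j] * inp[ch][r + i][c + j]
                out[m][r][c] = acc
    return out


def fully_connected(vec, matrix):
    """Plain matrix-vector product: matrix is [M][N], vec is [N]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fc_as_conv(inp, matrix):
    """Run an FC layer through conv2d_valid by reshaping each weight row
    into a [C][H][W] filter that spans the entire input (requires H == W)."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    assert H == W, "this sketch assumes square feature maps"
    filters = [
        [[[row[(ch * H + i) * W + j] for j in range(W)] for i in range(H)]
         for ch in range(C)]
        for row in matrix
    ]
    out = conv2d_valid(inp, filters)  # each output map is 1x1
    return [fmap[0][0] for fmap in out]


# Tiny example: 2 channels of 2x2 inputs, flattened to 8 values, 2 outputs.
inp = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
flat = [x for ch in inp for row in ch for x in row]
matrix = [[1.0] * 8, [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]]

fc_out = fully_connected(flat, matrix)
conv_out = fc_as_conv(inp, matrix)
print(fc_out, conv_out)
```

Both paths produce the same two values here. The practical appeal of a shared representation is that one accelerator kernel can serve both layer types; how Caffeine reorganizes memory accesses to use bandwidth efficiently is not modeled in this sketch.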

Published in

2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
Nov 2016
946 pages

Copyright © 2016
Publisher: IEEE Press