ABSTRACT
We present a novel mechanism to accelerate state-of-the-art Convolutional Neural Networks (CNNs) on a CPU-FPGA platform with coherent shared memory. First, we exploit the Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layers, mapping the frequency-domain algorithms onto a highly parallel OaA-based 2D convolver design on the FPGA. Second, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To hide memory access latency and sustain the peak performance of the FPGA, our design employs double buffering; to reduce inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach applies to any kernel size smaller than the chosen FFT size through appropriate zero-padding, enabling acceleration of a wide range of CNN models. We exploit the data parallelism of the OaA-based 2D convolver and task parallelism to scale overall system performance. By using OaA, the number of floating-point operations is reduced by 39.14% to 54.10% for state-of-the-art CNNs. We implement VGG16, AlexNet and GoogLeNet on the Intel QuickAssist QPI FPGA Platform; these designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves a 1.35x improvement in GFLOPs/sec using 3.33x fewer multipliers and 1.1x less memory. Compared with the state-of-the-art VGG16 implementation, our design sustains 0.66x the GFLOPs/sec using 3.48x fewer multipliers, without impacting classification accuracy. For GoogLeNet, our design achieves a 5.56x performance improvement over 16 threads running on a 10-core Intel Xeon processor at 2.8 GHz.
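The Overlap-and-Add decomposition underlying the convolver can be sketched in software as follows. This is an illustrative numpy sketch of the general OaA technique, not the paper's FPGA design: the input is tiled, each tile is zero-padded to the FFT size, multiplied elementwise with the kernel's FFT, and the inverse-transformed segments are accumulated with overlap. The FFT size used here is an arbitrary assumption, not a parameter from the paper.

```python
import numpy as np

def conv2d_oaa(image, kernel, fft_size=8):
    """Full 2D convolution via Overlap-and-Add with FFTs of size fft_size."""
    k = kernel.shape[0]                   # square kernel; requires k <= fft_size
    tile = fft_size - k + 1               # input tile size so each padded tile fits
    H, W = image.shape
    out = np.zeros((H + k - 1, W + k - 1))
    # Kernel FFT is computed once and reused for every tile.
    Kf = np.fft.rfft2(kernel, (fft_size, fft_size))
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            blk = image[i:i + tile, j:j + tile]
            Bf = np.fft.rfft2(blk, (fft_size, fft_size))      # zero-pads the tile
            seg = np.fft.irfft2(Bf * Kf, (fft_size, fft_size))
            h, w = blk.shape
            # Each tile yields an (h+k-1, w+k-1) partial result; overlapping
            # regions of adjacent tiles are summed ("overlap-and-add").
            out[i:i + h + k - 1, j:j + w + k - 1] += seg[:h + k - 1, :w + k - 1]
    return out
```

Because the kernel transform is shared across all tiles and pointwise complex multiplication replaces sliding-window inner products, the operation count drops relative to direct convolution, which is the source of the FLOP reduction reported in the abstract.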