ABSTRACT
Convolutional neural networks (CNNs) find use in a variety of computer vision applications, ranging from object recognition and detection to scene understanding, owing to their exceptional accuracy. Different algorithms exist for computing CNNs. In this paper, we explore both the conventional convolution algorithm and a faster algorithm based on Winograd's minimal filtering theory for efficient FPGA implementation. Unlike the conventional convolution algorithm, the Winograd algorithm uses fewer computing resources but puts more pressure on memory bandwidth. We first propose a fusion architecture that naturally fuses multiple CNN layers, reusing the intermediate data between them. Based on this fusion architecture, we explore heterogeneous algorithms to maximize the throughput of a CNN. We design an optimal algorithm to determine the fusion and algorithm strategy for each layer. We also develop an automated toolchain that eases the mapping from a Caffe model to an FPGA bitstream using Vivado HLS. Experiments on the widely used VGG and AlexNet networks demonstrate that our design achieves up to 1.99X speedup over the prior fusion-based FPGA accelerator for CNNs.
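For concreteness, the sketch below illustrates the arithmetic behind the Winograd minimal-filtering approach the abstract refers to, for the common F(2x2, 3x3) case: a 2x2 output tile is computed from a 4x4 input tile with 16 element-wise multiplications instead of the 36 a direct 3x3 convolution would need. This is a minimal NumPy illustration using the standard F(2x2, 3x3) transform matrices, not the paper's HLS design.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g,
    using 16 multiplies instead of the 36 needed by direct convolution."""
    U = G @ g @ G.T            # 4x4 transformed filter (precomputable per layer)
    V = BT @ d @ BT.T          # 4x4 transformed input tile
    return AT @ (U * V) @ AT.T # inverse transform yields the 2x2 output tile

# Sanity check against direct (sliding-window) convolution on one tile.
d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

In a full layer, the filter transform `G @ g @ G.T` can be precomputed once per filter, so the savings in multiplications come at the cost of moving larger transformed tiles, which is the compute-versus-bandwidth trade-off the abstract highlights.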