ABSTRACT
Convolutional neural networks (CNNs) are used in a variety of fields such as computer vision, speech recognition, and natural language processing. Because their computational cost has grown tremendously, CNNs are now commonly run on accelerators such as Graphics Processing Units (GPUs). However, resource-constrained embedded platforms such as Internet of Things (IoT) devices cannot afford such accelerators, so it is important to accelerate CNNs efficiently using only the CPU. In this paper, we propose a method that accelerates CNNs using the Single Instruction Multiple Data (SIMD) unit, which many modern CPUs integrate and which is commonly used for vector operations. The proposed method, implemented on ARM's NEON, can maximize the utilization of the vector registers in the SIMD unit. Our implementation achieves a speed-up of up to 2.66x in execution time and an energy reduction of up to 3.55x compared with the conventional implementation.
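The abstract does not describe the kernel itself, so the following is only a minimal sketch of the general idea of keeping a NEON vector register full during a convolution; it is not the authors' implementation. It assumes single-precision data and a 1-D "valid" convolution in the cross-correlation form typical of CNN layers. The function names `conv1d_scalar` and `conv1d_simd` are hypothetical. On targets without NEON the SIMD path is compiled out and the scalar loop handles every output.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference (hypothetical helper): out[i] = sum_k in[i + k] * w[k].
   Produces n - klen + 1 outputs. */
static void conv1d_scalar(const float *in, size_t n,
                          const float *w, size_t klen, float *out)
{
    assert(klen > 0 && n >= klen);
    size_t n_out = n - klen + 1;
    for (size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + k] * w[k];
        out[i] = acc;
    }
}

/* SIMD sketch (hypothetical helper): computes four adjacent outputs per
   iteration so that all four float lanes of one 128-bit register are used. */
static void conv1d_simd(const float *in, size_t n,
                        const float *w, size_t klen, float *out)
{
    assert(klen > 0 && n >= klen);
    size_t n_out = n - klen + 1;
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n_out; i += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < klen; ++k) {
            /* Load in[i+k .. i+k+3]; the highest index read is n - 1. */
            float32x4_t x = vld1q_f32(in + i + k);
            /* acc += x * w[k], with the weight applied to every lane. */
            acc = vmlaq_n_f32(acc, x, w[k]);
        }
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar tail; also the full fallback when NEON is unavailable. */
    for (; i < n_out; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + k] * w[k];
        out[i] = acc;
    }
}
```

Vectorizing across adjacent outputs rather than across kernel taps is one possible design choice: it avoids a horizontal reduction at the end of each output, at the cost of unaligned, overlapping loads from the input.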
Index Terms
- Efficient SIMD implementation for accelerating convolutional neural network