research-article

Efficient SIMD implementation for accelerating convolutional neural network

Authors:
Sung-Jin Lee, Hangdang-dong, Seongdong-gu, Seoul, Korea
Sang-Soo Park, Hangdang-dong, Seongdong-gu, Seoul, Korea
Ki-Seok Chung, Hangdang-dong, Seongdong-gu, Seoul, Korea

ICCIP '18: Proceedings of the 4th International Conference on Communication and Information Processing, November 2018, Pages 174–179. https://doi.org/10.1145/3290420.3290444

Published: 02 November 2018

ABSTRACT

Convolutional neural networks (CNNs) have been used in a variety of fields such as computer vision, speech recognition, and natural language processing. Because the amount of computation they require has increased tremendously, CNNs are now commonly run on accelerators such as graphics processing units (GPUs). However, resource-constrained embedded platforms such as Internet of Things (IoT) devices cannot afford such accelerators, so it is important to accelerate CNNs efficiently using only the CPU. In this paper, we propose a method that accelerates CNNs using the Single Instruction Multiple Data (SIMD) unit, which is integrated in many modern CPUs and commonly used for vector operations. The proposed method, implemented on ARM's NEON, can maximize the utilization of the vector registers in the SIMD unit. Our implementation achieves a speed-up of up to 2.66 times in execution time and an energy reduction of up to 3.55 times compared with the conventional implementation.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
    2. Parallel architectures
      1. Single instruction, multiple data
  2. Embedded and cyber-physical systems
    1. Embedded systems

Published in
ICCIP '18: Proceedings of the 4th International Conference on Communication and Information Processing
November 2018
326 pages
ISBN: 9781450365345
DOI: 10.1145/3290420
Conference Chairs:
Jalel Ben-Othman, University of Paris 13, France
Hui Yu, University of Portsmouth, UK
Program Chairs:
Herwig Unger, University of Hagen, Germany
Masayuki Arai, Graduate School of Science and Engineering, Teikyo University, Japan
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery, New York, NY, United States
Author Tags
CNN
CPU acceleration
LeNet-5
NEON
OpenMP
SIMD
parallel processing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 61 of 301 submissions, 20%
Article Metrics
- Total Citations: 10
- Total Downloads: 499
- Downloads (Last 12 months): 50
- Downloads (Last 6 weeks): 2