Abstract
Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems (from embedded systems to data centers). As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.
While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can void the throughput, energy, or cost advantages of accelerators (an Amdahl's law effect). Memory should therefore be a first-order concern in accelerator design, just as it is in processors, rather than an element factored in as a second step. In this article, we introduce a series of hardware accelerators (the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) achieves a speedup of 450.65x over a GPU and reduces energy by 150.31x on average.
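The Amdahl's law effect mentioned above can be made concrete with a short sketch. The numbers below are hypothetical (they are not measurements from the DianNao family): they only illustrate how unaccelerated memory transfers cap the overall speedup, no matter how fast the computational primitives become.

```python
def effective_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the
    original runtime benefits from the accelerator.

    accel_fraction -- share of runtime that is accelerated (0..1)
    accel_speedup  -- speedup factor applied to that share
    """
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical example: computation is 70% of runtime and is sped up
# 100x, but memory transfers (the remaining 30%) are left untouched.
print(round(effective_speedup(0.70, 100.0), 2))  # ~3.26x overall

# Only if everything, including memory traffic, is accelerated does the
# full factor materialize.
print(effective_speedup(1.0, 100.0))  # 100.0
```

Even a 100x computational speedup yields barely 3x end-to-end in this scenario, which is why the article treats memory transfers as a first-order design concern.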
Index Terms
- DianNao family: energy-efficient hardware accelerators for machine learning