
DianNao family: energy-efficient hardware accelerators for machine learning

Published: 28 October 2016

Abstract

Machine Learning (ML) tasks are becoming pervasive in a broad range of applications and systems, from embedded devices to data centers. As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.

While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators (an Amdahl's law effect); memory should therefore be a first-order concern, just as it is in processors, rather than an element factored into accelerator design only as a second step. In this article, we introduce a series of hardware accelerators (i.e., the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) can achieve an average speedup of 450.65x over a GPU and reduce energy by 150.31x.
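To make the Amdahl's law effect concrete, here is a minimal worked example (the fraction f and the acceleration factor s below are illustrative assumptions, not measurements from the article). If a fraction f of an application's runtime is spent on memory transfers that the accelerator does not speed up, while the remaining 1 - f (the computation) is accelerated by a factor s, the overall speedup is bounded no matter how large s grows:

\[ \mathrm{Speedup}(f, s) = \frac{1}{f + \frac{1-f}{s}} \le \frac{1}{f} \]

For instance, with f = 0.2, even an infinitely fast computational datapath (s \to \infty) yields at most a 5x overall speedup, which is why memory transfers must be treated as a first-order design concern rather than addressed as a second step.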


Published in

Communications of the ACM, Volume 59, Issue 11 (November 2016), 118 pages
ISSN: 0001-0782; EISSN: 1557-7317
DOI: 10.1145/3013530
Editor: Moshe Y. Vardi

        Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
