Abstract
Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems (from embedded systems to data centers). As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.
While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can void the throughput, energy, or cost advantages of accelerators (an Amdahl's law effect). Memory should therefore be a first-order concern in accelerator design, just as it is in processors, rather than an element factored in as a second step. In this article, we introduce a series of hardware accelerators (the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) achieves a speedup of 450.65x over a GPU and reduces energy by 150.31x on average.
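The Amdahl's law effect mentioned above can be made concrete with a short sketch. The numbers below are hypothetical (they are not measurements from the DianNao family): they only illustrate how unaccelerated memory transfers cap the overall speedup, no matter how fast the computational primitives become.

```python
def effective_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the
    original runtime benefits from the accelerator.

    accel_fraction -- share of runtime that is accelerated (0..1)
    accel_speedup  -- speedup factor applied to that share
    """
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical example: computation is 70% of runtime and is sped up
# 100x, but memory transfers (the remaining 30%) are left untouched.
print(round(effective_speedup(0.70, 100.0), 2))  # ~3.26x overall

# Only if everything, including memory traffic, is accelerated does the
# full factor materialize.
print(effective_speedup(1.0, 100.0))  # 100.0
```

Even a 100x computational speedup yields barely 3x end-to-end in this scenario, which is why the article treats memory transfers as a first-order design concern.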
Index Terms
- DianNao family: energy-efficient hardware accelerators for machine learning