Published in: Cluster Computing 4/2021

10.07.2021

Data scheduling and placement in deep learning accelerator

Authors: Seyedeh Yasaman Hosseini Mirmahaleh, Midia Reshadi, Nader Bagherzadeh, Ahmad Khademzadeh

Abstract

Deep neural networks (DNNs), as a popular class of machine learning (ML) algorithms, have been deployed on a wide range of devices owing to the spread of the Internet of Things (IoT), data mining in cloud computing, and web search engines, and ML has had an impressive effect on IoT edge-level nodes. Deploying DNN-based applications leads to memory access problems, including communication delay, energy efficiency, and bandwidth requirements. We propose a bus-scheduling scheme for data placement on distributed local buffers in a deep learning accelerator (DLA). The contributions of this paper are: (1) a method for data-flow mapping between off-chip DRAM and distributed local buffers, together with a flow-mapping approach for data transfer between the distributed local buffers and processing elements (PEs); (2) the use of distributed local buffers in four directions for traffic distribution on a mesh, based on the memory access mechanism; and (3) bus scheduling for data placement on the distributed local buffers. Simulated experiments with typical DNN workloads (i.e., AlexNet, VGG-16, and GoogLeNet) demonstrate the effectiveness of the design: (1) the scheduling and mapping methods improve total runtime and bandwidth requirement by approximately 42.29% and 88.95%, respectively, compared with the TPU; and (2) our methods reduce the total runtime of the row-column stationary plus data-flow by approximately 99% compared with the weight-stationary data-flow in CONV1 and CONV11 of VGG-16. This work reports simulation results for distributing the traffic of AlexNet, VGG-16, and GoogLeNet as popular CNN and DNN models, and it also investigates the method's efficiency for other trained models.
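To make the third contribution concrete, the following Python sketch shows one plausible way a shared DRAM bus could be scheduled round-robin over four directional local buffers (north, east, south, west) of a PE mesh. This is a minimal sketch under our own assumptions, not the paper's implementation; all names (Tile, LocalBuffer, schedule_bus) and the round-robin placement policy are illustrative.

```python
# Illustrative sketch of bus scheduling for data placement on distributed
# local buffers -- NOT the paper's implementation. All names and the
# round-robin policy are hypothetical.
from collections import deque
from dataclasses import dataclass

@dataclass
class Tile:
    layer: str   # e.g., "CONV1"
    kind: str    # "weight" or "ifmap"
    size: int    # bytes

class LocalBuffer:
    """A local buffer on one side (N/E/S/W) of the PE mesh."""
    def __init__(self, side: str, capacity: int):
        self.side = side
        self.capacity = capacity
        self.used = 0
        self.tiles = []

    def can_hold(self, tile: Tile) -> bool:
        return self.used + tile.size <= self.capacity

    def place(self, tile: Tile) -> None:
        self.tiles.append(tile)
        self.used += tile.size

def schedule_bus(tiles, buffers):
    """Round-robin the shared DRAM bus over the directional buffers,
    placing each tile into the next buffer with room. Returns the
    bus-cycle order of (buffer side, layer, kind) placements."""
    order = deque(buffers)
    placements = []
    for tile in tiles:
        for _ in range(len(order)):
            buf = order[0]
            order.rotate(-1)  # advance the round-robin pointer
            if buf.can_hold(tile):
                buf.place(tile)
                placements.append((buf.side, tile.layer, tile.kind))
                break
        else:
            raise MemoryError(f"no buffer can hold {tile}")
    return placements

if __name__ == "__main__":
    bufs = [LocalBuffer(s, capacity=4096) for s in ("N", "E", "S", "W")]
    traffic = [Tile("CONV1", "weight", 1024), Tile("CONV1", "ifmap", 512),
               Tile("CONV2", "weight", 1024)]
    for step in schedule_bus(traffic, bufs):
        print(step)
```

Running the example prints the bus-cycle sequence in which tiles land on the four buffers, i.e., the data-placement order that a PE mesh would then consume.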

Metadata
Title
Data scheduling and placement in deep learning accelerator
Authors
Seyedeh Yasaman Hosseini Mirmahaleh
Midia Reshadi
Nader Bagherzadeh
Ahmad Khademzadeh
Publication date
10.07.2021
Publisher
Springer US
Published in
Cluster Computing / Issue 4/2021
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-021-03355-8
