Published in: International Journal of Parallel Programming, Issue 5-6/2019

11 February 2019

Tracing and Profiling Machine Learning Dataflow Applications on GPU

Authors: Pierre Zins, Michel Dagenais

Abstract

In this paper, we propose a profiling and tracing method for dataflow applications with GPU acceleration. Dataflow models can be represented as graphs and are widely used in domains such as signal processing and machine learning. Within the graph, data flows along the edges, and the nodes correspond to the computing units that process the data. To accelerate execution, co-processing units such as GPUs are often used for compute-intensive nodes. The work in this paper aims to provide useful information about the execution of the dataflow graph on the available hardware, in order to understand and possibly improve performance. The collected traces include low-level information about the CPU from the Linux kernel (system calls), mid-level information about intermediate libraries such as CUDA, HIP, or HSA, and high-level information about the dataflow model. Post-mortem analysis and visualization steps then enhance the trace and present useful information to the user. To demonstrate the effectiveness of the method, we evaluated it on TensorFlow, a well-known machine learning library that uses a dataflow computational graph to represent algorithms. We present a few examples of machine learning applications that can be optimized with the help of the information provided by our proposed method. For example, we reduce the execution time of a face recognition application by a factor of 5. We suggest a better placement of the computation nodes on the available hardware components for a distributed application. Finally, we also improve the memory management of an application to speed up its execution.
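As a rough illustration of the dataflow model described in the abstract, the following minimal Python sketch runs a small graph in dependency order and records one timing event per node. It is not the authors' tracing method and uses neither TensorFlow nor LTTng; the function `run_graph`, the node names, and the event tuple layout are hypothetical and chosen only for this example.

```python
import time
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(nodes, edges, inputs):
    """Execute a dataflow graph and record one timing event per node.

    nodes:  {name: function applied to the values of its predecessors}
    edges:  {name: [predecessor names]} (data flows predecessor -> name)
    inputs: {name: value} for source nodes that need no computation
    """
    values = dict(inputs)
    trace = []  # hypothetical event layout: (node name, start, end)
    for name in TopologicalSorter(edges).static_order():
        if name in values:  # source node, value already available
            continue
        args = [values[pred] for pred in edges.get(name, [])]
        start = time.perf_counter()
        values[name] = nodes[name](*args)
        end = time.perf_counter()
        trace.append((name, start, end))
    return values, trace


# Illustrative graph: x feeds "scale" and "shift", whose outputs feed "add".
nodes = {
    "scale": lambda x: [2 * v for v in x],
    "shift": lambda x: [v + 1 for v in x],
    "add": lambda a, b: [p + q for p, q in zip(a, b)],
}
edges = {"scale": ["x"], "shift": ["x"], "add": ["scale", "shift"]}

values, trace = run_graph(nodes, edges, {"x": [1, 2, 3]})
for name, start, end in trace:
    print(f"{name}: {(end - start) * 1e6:.1f} us")
```

Per-node events like these correspond only to the high-level layer. The method described in the abstract additionally correlates them with library-level and kernel-level events.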

Metadata
Title
Tracing and Profiling Machine Learning Dataflow Applications on GPU
Authors
Pierre Zins
Michel Dagenais
Publication date
11 February 2019
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5-6/2019
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-019-00630-5
