Published in: International Journal of Parallel Programming, Issue 2/2018

Published online: 28.12.2016

Accelerating Data Analytics on Integrated GPU Platforms via Runtime Specialization

Authors: Naila Farooqui, Indrajit Roy, Yuan Chen, Vanish Talwar, Rajkishore Barik, Brian Lewis, Tatiana Shpeisman, Karsten Schwan


Abstract

Integrated GPU systems are a cost-effective and energy-efficient option for accelerating data-intensive applications. While these platforms reduce the overhead of offloading computation to the GPU and offer the potential for fine-grained resource scheduling, several challenges remain open: (1) the distinct execution models of the heterogeneous devices on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (2) the complex architecture and programming models of such systems require substantial application knowledge to achieve high performance, and (3) as such systems become prevalent, their utility must be extended from running known, regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on integrated GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimization on these platforms, providing high application performance and system throughput. Toward this end, this work proposes two novel schedulers with distinct goals: (a) a device-affinity, contention-aware scheduler that incorporates instrumentation-driven optimizations to improve the throughput of running diverse applications on integrated CPU–GPU servers, and (b) a specialized, affinity-aware work-stealing scheduler that efficiently distributes work across all CPU and GPU cores for the same application, taking into account both application characteristics and the architectural differences of the underlying devices.
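To make the second contribution concrete, the following is a minimal, illustrative Python sketch of the affinity-aware work-stealing idea, not the paper's actual runtime: each worker owns a deque, the owner pops from the tail (LIFO), thieves steal from the head (FIFO), and victim selection prefers queues on the same device so that chunks sized for a device tend to stay on it. All names, the chunk sizes, and the stand-in "kernel" are hypothetical; a real integrated-GPU runtime would dispatch GPU work through OpenCL or CUDA rather than a Python thread.

```python
# Illustrative affinity-aware work-stealing sketch (hypothetical, simplified).
import collections
import random
import threading

class Worker:
    def __init__(self, wid, device):
        self.wid = wid
        self.device = device                  # "cpu" or "gpu"
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)

    def pop(self):                            # owner takes from the tail (LIFO)
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):                          # thieves take from the head (FIFO)
        with self.lock:
            return self.deque.popleft() if self.deque else None

def run_worker(me, workers, results):
    while True:
        task = me.pop()
        if task is None:
            # Affinity-aware victim selection: try same-device victims first,
            # fall back to the other device only when our device has run dry.
            same  = [w for w in workers if w is not me and w.device == me.device]
            other = [w for w in workers if w is not me and w.device != me.device]
            for victim in random.sample(same, len(same)) + random.sample(other, len(other)):
                task = victim.steal()
                if task is not None:
                    break
        if task is None:
            return                            # no work anywhere (all tasks pre-pushed)
        lo, hi = task
        # Stand-in "kernel": sum of squares over the assigned index range.
        results.append(sum(i * i for i in range(lo, hi)))   # list.append is atomic under the GIL

if __name__ == "__main__":
    # Two "CPU" workers with small chunks and one "GPU" worker with large chunks,
    # mimicking the idea that each device prefers differently sized work.
    workers = [Worker(0, "cpu"), Worker(1, "cpu"), Worker(2, "gpu")]
    n, cpu_chunk, gpu_chunk = 1_000_000, 10_000, 100_000
    for lo in range(0, n // 2, cpu_chunk):
        workers[(lo // cpu_chunk) % 2].push((lo, lo + cpu_chunk))
    for lo in range(n // 2, n, gpu_chunk):
        workers[2].push((lo, lo + gpu_chunk))

    results = []
    threads = [threading.Thread(target=run_worker, args=(w, workers, results)) for w in workers]
    for t in threads: t.start()
    for t in threads: t.join()
    print("total =", sum(results))
```

In a production runtime the per-worker queues would be lock-free deques and the GPU share would be launched as kernels over index ranges, but the stealing policy shown here, same-device victims first and cross-device stealing only as a last resort, is the structural point the abstract's scheduler (b) refers to.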


Metadata
Title
Accelerating Data Analytics on Integrated GPU Platforms via Runtime Specialization
Authors
Naila Farooqui
Indrajit Roy
Yuan Chen
Vanish Talwar
Rajkishore Barik
Brian Lewis
Tatiana Shpeisman
Karsten Schwan
Publication date
28.12.2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 2/2018
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0482-x
