
22.12.2017

PRODA: improving parallel programs on GPUs through dependency analysis

Authors: Xiong Wei, Ming Hu, Tao Peng, Minghua Jiang, Zhiying Wang, Xiao Qin

Published in: Cluster Computing | Special Issue 1/2019


Abstract

The GPU's powerful parallel processing capability is widely recognized throughout industry; however, GPU computing environments are not yet broadly used in the field of parallel computing. In this study, we develop a method for parallelizing serial programs for GPU computing. In particular, we propose an approach called PRODA to speed up parallel programs on GPUs through dependency analysis. PRODA provides the theoretical underpinnings of task partitioning for parallel programs running in GPU computing environments. At the heart of PRODA is an analyzer of program workflows as well as of data and function dependencies in a GPU program. With this dependency analysis in place, PRODA assigns computing tasks to multiple GPU cores so as to speed up parallel programs on GPUs. An overarching goal of PRODA is to minimize the cost of data communication between GPUs and the main memory of the host CPU. PRODA achieves this goal by deploying two strategies. First, PRODA assigns functions that process the same data to the same GPU core. Second, PRODA runs multiple independent functions on separate GPU cores. In doing so, PRODA improves the parallelism of parallel programs. We evaluate the performance of PRODA by running two popular benchmarks (i.e., AES and T26) on a 256-core system with the key length set to 256 bits. The experimental results show that the speedup ratio of AES governed by PRODA is 5.2; in particular, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we test PRODA and the alternative solutions by running AES with a 256-bit key on 128 cores. The cost of parallel computing in PRODA is 524.8 ms, which is 61.2% lower than that of the existing SA solution. The parallel efficiency of PRODA is 2.08, an improvement over the PDM algorithm by a factor of 2.08.
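To make the two scheduling strategies concrete, the CUDA sketch below illustrates them in miniature. It is an independent reconstruction under stated assumptions, not PRODA's implementation: the kernel names (scaleKernel, squareKernel, offsetKernel) are hypothetical, and CUDA streams are used only as a stand-in for assigning work to separate groups of GPU cores. Kernels that touch the same buffer are issued back to back in one stream so the intermediate data never returns to host memory (strategy 1), while an independent kernel runs concurrently in a second stream (strategy 2).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernels standing in for functions classified by a dependency
// analysis: scaleKernel and squareKernel both operate on the same buffer d_a
// (a data-dependent chain); offsetKernel touches only d_b (independent work).
__global__ void scaleKernel(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] *= 2.0f;
}

__global__ void squareKernel(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] = a[i] * a[i];
}

__global__ void offsetKernel(float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] += 1.0f;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  float *d_a = nullptr, *d_b = nullptr;
  cudaMalloc((void **)&d_a, bytes);
  cudaMalloc((void **)&d_b, bytes);
  cudaMemset(d_a, 0, bytes);
  cudaMemset(d_b, 0, bytes);

  // Two streams approximate "separate groups of GPU cores": independent
  // function chains are placed in different streams so they may overlap.
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Strategy 1: functions that process the same data (d_a) are issued back
  // to back in one stream, so the intermediate result stays in device memory
  // and is never copied back to the host between kernels.
  scaleKernel<<<blocks, threads, 0, s1>>>(d_a, n);
  squareKernel<<<blocks, threads, 0, s1>>>(d_a, n);

  // Strategy 2: an independent function (on d_b) runs in its own stream and
  // can execute concurrently with the chain above.
  offsetKernel<<<blocks, threads, 0, s2>>>(d_b, n);

  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  cudaFree(d_a);
  cudaFree(d_b);
  printf("done\n");
  return 0;
}
```

Keeping dependent kernels in order within a single stream is what keeps the shared buffer resident on the device, while the second stream gives the hardware scheduler the freedom to overlap the two chains; how PRODA itself maps functions to cores is detailed in the full text.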

Metadata
Title
PRODA: improving parallel programs on GPUs through dependency analysis
Authors
Xiong Wei
Ming Hu
Tao Peng
Minghua Jiang
Zhiying Wang
Xiao Qin
Publication date
22.12.2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 1/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1295-4
