
22.12.2017

PRODA: improving parallel programs on GPUs through dependency analysis

Authors: Xiong Wei, Ming Hu, Tao Peng, Minghua Jiang, Zhiying Wang, Xiao Qin

Published in: Cluster Computing | Special Issue 1/2019


Abstract

The GPU's powerful parallel processing capability is widely recognized throughout industry; however, GPU computing environments are not yet broadly used in the field of parallel computing. In this study, we develop a method for parallelizing serial programs for GPU computing. In particular, we propose an approach called PRODA to speed up parallel programs on GPUs through dependency analysis. PRODA provides the theoretical underpinnings of task partitioning for parallel programs running in GPU computing environments. At the heart of PRODA is an analyzer of program workflows as well as of data and function dependencies in a GPU program. With this dependency analysis in place, PRODA assigns computing tasks to multiple GPU cores so as to speed up parallel programs on GPUs. An overarching goal of PRODA is to minimize the cost of data communication between GPUs and the main memory of the host CPU. PRODA achieves this goal by deploying two strategies. First, PRODA assigns functions that process the same data to the same GPU core. Second, PRODA runs multiple independent functions on separate GPU cores. In doing so, PRODA improves the parallelism of parallel programs. We evaluate the performance of PRODA by running two popular benchmarks (i.e., AES and T26) on a 256-core system with the key length set to 256 bits. The experimental results show that the speedup ratio of AES governed by PRODA is 5.2; in particular, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we test PRODA and the alternative solutions by running AES with a 256-bit key on 128 cores. The cost of parallel computing in PRODA is 524.8 ms, which is 61.2% lower than that of the existing SA solution. The parallel efficiency of PRODA is 2.08, an improvement over the PDM algorithm by a factor of 2.08.
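To make the two scheduling strategies concrete, the CUDA sketch below illustrates them in miniature. It is an independent reconstruction under stated assumptions, not PRODA's implementation: the kernel names (scaleKernel, squareKernel, offsetKernel) are hypothetical, and CUDA streams are used only as a stand-in for assigning work to separate groups of GPU cores. Kernels that touch the same buffer are issued back to back in one stream so the intermediate data never returns to host memory (strategy 1), while an independent kernel runs concurrently in a second stream (strategy 2).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernels standing in for functions classified by a dependency
// analysis: scaleKernel and squareKernel both operate on the same buffer d_a
// (a data-dependent chain); offsetKernel touches only d_b (independent work).
__global__ void scaleKernel(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] *= 2.0f;
}

__global__ void squareKernel(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] = a[i] * a[i];
}

__global__ void offsetKernel(float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] += 1.0f;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  float *d_a = nullptr, *d_b = nullptr;
  cudaMalloc((void **)&d_a, bytes);
  cudaMalloc((void **)&d_b, bytes);
  cudaMemset(d_a, 0, bytes);
  cudaMemset(d_b, 0, bytes);

  // Two streams approximate "separate groups of GPU cores": independent
  // function chains are placed in different streams so they may overlap.
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Strategy 1: functions that process the same data (d_a) are issued back
  // to back in one stream, so the intermediate result stays in device memory
  // and is never copied back to the host between kernels.
  scaleKernel<<<blocks, threads, 0, s1>>>(d_a, n);
  squareKernel<<<blocks, threads, 0, s1>>>(d_a, n);

  // Strategy 2: an independent function (on d_b) runs in its own stream and
  // can execute concurrently with the chain above.
  offsetKernel<<<blocks, threads, 0, s2>>>(d_b, n);

  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  cudaFree(d_a);
  cudaFree(d_b);
  printf("done\n");
  return 0;
}
```

Keeping dependent kernels in order within a single stream is what keeps the shared buffer resident on the device, while the second stream gives the hardware scheduler the freedom to overlap the two chains; how PRODA itself maps functions to cores is detailed in the full text.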

Metadata
Title
PRODA: improving parallel programs on GPUs through dependency analysis
Authors
Xiong Wei
Ming Hu
Tao Peng
Minghua Jiang
Zhiying Wang
Xiao Qin
Publication date
22.12.2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 1/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1295-4
