nach oben

International Journal of Parallel Programming

Erschienen in:

01.04.2014

A Case Study of Implementing Supernode Transformations

verfasst von: Johann Steinbrecher, Cesar J. Philippidis, Weijia Shang

Erschienen in: International Journal of Parallel Programming | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Supernode transformation is a technique to decrease the communication overhead by partitioning and scheduling a loop nest to a multi-processor system. This is achieved by grouping a number of iterations in a perfectly nested loop with regular dependences as a \(supernode\). Previous work has been focusing on finding the optimal supernode size and shape as well as an optimal execution schedule for multi-processor systems with unbounded resources. This paper emphasizes on the actual implementation strategies of supernode transformations on multi-core systems with limited resources. Using an example, the longest common subsequence (LCS) problem, we present and compare three different multithreading implementations. A formula for the total execution time of each method is presented. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine our first technique, which yields increased data locality, speeds up the unaltered sequential loop nest 16.7 times. Combining this technique with skewing the loop by changing the linear schedule scores a 42.6 speedup. A more sophisticated method that executes entire rows of the loop nest in one thread scores a 59.5 speedup. Concepts presented and discussed in this paper on the LCS problem serve as basic foundation for implementations at regular dependence algorithms.

Vorheriger Artikel A Survey of Parallel and Distributed Algorithms for the Steiner Tree Problem

Nächster Artikel Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

1003.1 standard for information technology portable operating system interface (posix) rationale (informative). IEEE Std 1003.1-2001. Rationale (Informative) pp. i–310 (2001)

Baskaran, M.M., Hartono, A., Tavarageri, S., Henretty, T., Ramanujam, J., Sadayappan, P.: Parameterized tiling revisited. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. CGO ’10, pp. 200–209. ACM, New York, NY (2010)

Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pp. 7–16. IEEE Computer Society, Washington, DC (2004)

Ben Mabrouk, B., Hasni, H., Mahjoub, Z.: Parallelization of the dynamic programming algorithm for solving the longest common subsequence problem. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, pp. 1–8 (2010)

Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh international Symposium on, pp. 39–48 (2000)

Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pp. 101–113. ACM, New York, NY (2008)

Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, Cambridge (2007)

Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)MATH

Dimitriou, G., Polychronopoulos, C.: Loop scheduling for multithreaded processors. In: Parallel Computing in Electrical Engineering, 2004. PARELEC 2004. International Conference on, pp. 361–366 (2004)

10.

Du, Z.-H., Lim, C.-C., Li, X.-F., Yang, C., Zhao, Q., Ngai, T.-F.: A Cost-driven Compilation Framework for Speculative Parallelization of Sequential Programs vol. 39, pp. 71–81. ACM, New York, NY (2004)

11.

Feautrier, P.: Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 23–53 (1991)CrossRefMATH

12.

Garcia, T., Myoupo, J.-F., Seme, D.: A coarse-grained multicomputer algorithm for the longest common subsequence problem. In: Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pp. 349–356 (2003)

13.

Goumas, G., Drosinos, N., Koziris, N.: Communication-aware supernode shape. Parallel Distrib. Syst. IEEE Trans. 20(4), 498–511 (2009)CrossRef

14.

Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pp. 147–157. ACM, New York, NY (2009)

15.

Hirschberg, D.S.: Algorithms for the longest common subsequence problem. J. ACM 24, 664–675 (1977)CrossRefMATHMathSciNet

16.

Hodzic, E., Shang, W.: On time optimal supernode shape. Parallel Distrib. Syst. IEEE Trans. 13(12), 1220–1233 (2002)CrossRef

17.

Hodzic, E., Shang, W.: On supernode transformation with minimized total running time. Parallel Distrib. Syst. IEEE Trans. 9(5), 417–428 (1998)CrossRef

18.

Hogstedt, K., Carter, L., Ferrante, J.: On the parallel execution time of tiled loops. Parallel Distrib. Syst. IEEE Trans. 14(3), 307–321 (2003)CrossRef

19.

Hu, S.-H., Wang, C.-W., Chen, H.-L.: An efficient and hardware-implementable systolic algorithm for the longest common subsequence problem. In: Machine Learning and Cybernetics, 2008 International Conference on, vol. 6, pp. 3150–3155 (2008)

20.

Inoue, H., Nakatani, T.: Performance of multi-process and multi-thread processing on multi-core smt processors. In: Workload Characterization (IISWC), 2010 IEEE International Symposium on, pp. 1–10 (2010)

21.

Irigoin, F., Triolet, R.: Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’88, pp. 319–329. ACM, New York, NY (1988)

22.

Korkin, D., Wang, Q., Shang, Y.: An efficient parallel algorithm for the multiple longest common subsequence (mlcs) problem. In: Parallel Processing, 2008. ICPP ’08. 37th International Conference on, pp. 354–363 (2008)

23.

Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, Saint-Malo, France, pp. 451–460. ACM, New York, NY (2010)

24.

Lim, A.W., Lam, M.S.: Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Comput. 24(3–4), 445–475 (1998)CrossRefMATHMathSciNet

25.

Liu, J., Wu, S.: Research on longest common subsequence fast algorithm. In: 2011 International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 4338–4341 (2011)

26.

Mitchell, N., Carter, L., Ferrante, J., Tullsen, D.: Ilp versus tlp on smt. In: Supercomputing, ACM/IEEE 1999 Conference, p. 37 (1999)

27.

Nuzman, D., Zaks, A.: Outer-loop vectorization: revisited for short simd architectures. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pp. 2–11. ACM, New York, NY (2008)

28.

Ramanujam, J., Sadayappan, P.: Tiling multidimensional iteration spaces for multicomputers. J. Parallel Distrib. Comput. 16(2), 108–120 (1992)CrossRef

29.

Riakiotakis, I., Tsanakas, P.: Dynamic scheduling of nested loops with uniform dependencies in heterogeneous networks of workstations. In: Parallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8th International Symposium on, p. 6 (2005)

30.

Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.-Z., Stratton, J.A., Hwu, W.-M.W.: Program optimization space pruning for a multithreaded gpu. In: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’08, pp. 195–204. ACM, New York, NY (2008)

31.

Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08, pp. 18:1–18:15. ACM, New York, NY (2008)

32.

Shang, W., Fortes, J.A.B.: Time optimal linear schedules for algorithms with uniform dependencies. IEEE Trans. Comput. 40(6), 723–742 (1991)CrossRefMathSciNet

33.

Xue, J.: Communication-minimal tiling of uniform dependence loops. J. Parallel Distrib. Comput. 42(1), 42–59 (1997)

Titel: A Case Study of Implementing Supernode Transformations
verfasst von: Johann Steinbrecher
Cesar J. Philippidis
Weijia Shang
Publikationsdatum: 01.04.2014
Verlag: Springer US
Erschienen in: International Journal of Parallel Programming / Ausgabe 2/2014
Print ISSN: 0885-7458
Elektronische ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-013-0248-7

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Weitere Artikel der Ausgabe 2/2014

Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

A Survey of Parallel and Distributed Algorithms for the Steiner Tree Problem

Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations

Erratum to: Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations

The Experience in Designing and Evaluating the High Performance Cluster Netuno

Parallelization of Full Search Motion Estimation Algorithm for Parallel and Distributed Platforms