Skip to main content
Erschienen in: International Journal of Parallel Programming 2/2014

01.04.2014

A Case Study of Implementing Supernode Transformations

verfasst von: Johann Steinbrecher, Cesar J. Philippidis, Weijia Shang

Erschienen in: International Journal of Parallel Programming | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Supernode transformation is a technique to decrease the communication overhead by partitioning and scheduling a loop nest to a multi-processor system. This is achieved by grouping a number of iterations in a perfectly nested loop with regular dependences as a \(supernode\). Previous work has been focusing on finding the optimal supernode size and shape as well as an optimal execution schedule for multi-processor systems with unbounded resources. This paper emphasizes on the actual implementation strategies of supernode transformations on multi-core systems with limited resources. Using an example, the longest common subsequence (LCS) problem, we present and compare three different multithreading implementations. A formula for the total execution time of each method is presented. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine our first technique, which yields increased data locality, speeds up the unaltered sequential loop nest 16.7 times. Combining this technique with skewing the loop by changing the linear schedule scores a 42.6 speedup. A more sophisticated method that executes entire rows of the loop nest in one thread scores a 59.5 speedup. Concepts presented and discussed in this paper on the LCS problem serve as basic foundation for implementations at regular dependence algorithms.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat 1003.1 standard for information technology portable operating system interface (posix) rationale (informative). IEEE Std 1003.1-2001. Rationale (Informative) pp. i–310 (2001) 1003.1 standard for information technology portable operating system interface (posix) rationale (informative). IEEE Std 1003.1-2001. Rationale (Informative) pp. i–310 (2001)
2.
Zurück zum Zitat Baskaran, M.M., Hartono, A., Tavarageri, S., Henretty, T., Ramanujam, J., Sadayappan, P.: Parameterized tiling revisited. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. CGO ’10, pp. 200–209. ACM, New York, NY (2010) Baskaran, M.M., Hartono, A., Tavarageri, S., Henretty, T., Ramanujam, J., Sadayappan, P.: Parameterized tiling revisited. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. CGO ’10, pp. 200–209. ACM, New York, NY (2010)
3.
Zurück zum Zitat Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pp. 7–16. IEEE Computer Society, Washington, DC (2004) Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pp. 7–16. IEEE Computer Society, Washington, DC (2004)
4.
Zurück zum Zitat Ben Mabrouk, B., Hasni, H., Mahjoub, Z.: Parallelization of the dynamic programming algorithm for solving the longest common subsequence problem. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, pp. 1–8 (2010) Ben Mabrouk, B., Hasni, H., Mahjoub, Z.: Parallelization of the dynamic programming algorithm for solving the longest common subsequence problem. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, pp. 1–8 (2010)
5.
Zurück zum Zitat Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh international Symposium on, pp. 39–48 (2000) Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh international Symposium on, pp. 39–48 (2000)
6.
Zurück zum Zitat Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pp. 101–113. ACM, New York, NY (2008) Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pp. 101–113. ACM, New York, NY (2008)
7.
Zurück zum Zitat Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, Cambridge (2007) Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, Cambridge (2007)
8.
Zurück zum Zitat Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)MATH Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)MATH
9.
Zurück zum Zitat Dimitriou, G., Polychronopoulos, C.: Loop scheduling for multithreaded processors. In: Parallel Computing in Electrical Engineering, 2004. PARELEC 2004. International Conference on, pp. 361–366 (2004) Dimitriou, G., Polychronopoulos, C.: Loop scheduling for multithreaded processors. In: Parallel Computing in Electrical Engineering, 2004. PARELEC 2004. International Conference on, pp. 361–366 (2004)
10.
Zurück zum Zitat Du, Z.-H., Lim, C.-C., Li, X.-F., Yang, C., Zhao, Q., Ngai, T.-F.: A Cost-driven Compilation Framework for Speculative Parallelization of Sequential Programs vol. 39, pp. 71–81. ACM, New York, NY (2004) Du, Z.-H., Lim, C.-C., Li, X.-F., Yang, C., Zhao, Q., Ngai, T.-F.: A Cost-driven Compilation Framework for Speculative Parallelization of Sequential Programs vol. 39, pp. 71–81. ACM, New York, NY (2004)
11.
Zurück zum Zitat Feautrier, P.: Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 23–53 (1991)CrossRefMATH Feautrier, P.: Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 23–53 (1991)CrossRefMATH
12.
Zurück zum Zitat Garcia, T., Myoupo, J.-F., Seme, D.: A coarse-grained multicomputer algorithm for the longest common subsequence problem. In: Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pp. 349–356 (2003) Garcia, T., Myoupo, J.-F., Seme, D.: A coarse-grained multicomputer algorithm for the longest common subsequence problem. In: Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pp. 349–356 (2003)
13.
Zurück zum Zitat Goumas, G., Drosinos, N., Koziris, N.: Communication-aware supernode shape. Parallel Distrib. Syst. IEEE Trans. 20(4), 498–511 (2009)CrossRef Goumas, G., Drosinos, N., Koziris, N.: Communication-aware supernode shape. Parallel Distrib. Syst. IEEE Trans. 20(4), 498–511 (2009)CrossRef
14.
Zurück zum Zitat Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pp. 147–157. ACM, New York, NY (2009) Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pp. 147–157. ACM, New York, NY (2009)
16.
Zurück zum Zitat Hodzic, E., Shang, W.: On time optimal supernode shape. Parallel Distrib. Syst. IEEE Trans. 13(12), 1220–1233 (2002)CrossRef Hodzic, E., Shang, W.: On time optimal supernode shape. Parallel Distrib. Syst. IEEE Trans. 13(12), 1220–1233 (2002)CrossRef
17.
Zurück zum Zitat Hodzic, E., Shang, W.: On supernode transformation with minimized total running time. Parallel Distrib. Syst. IEEE Trans. 9(5), 417–428 (1998)CrossRef Hodzic, E., Shang, W.: On supernode transformation with minimized total running time. Parallel Distrib. Syst. IEEE Trans. 9(5), 417–428 (1998)CrossRef
18.
Zurück zum Zitat Hogstedt, K., Carter, L., Ferrante, J.: On the parallel execution time of tiled loops. Parallel Distrib. Syst. IEEE Trans. 14(3), 307–321 (2003)CrossRef Hogstedt, K., Carter, L., Ferrante, J.: On the parallel execution time of tiled loops. Parallel Distrib. Syst. IEEE Trans. 14(3), 307–321 (2003)CrossRef
19.
Zurück zum Zitat Hu, S.-H., Wang, C.-W., Chen, H.-L.: An efficient and hardware-implementable systolic algorithm for the longest common subsequence problem. In: Machine Learning and Cybernetics, 2008 International Conference on, vol. 6, pp. 3150–3155 (2008) Hu, S.-H., Wang, C.-W., Chen, H.-L.: An efficient and hardware-implementable systolic algorithm for the longest common subsequence problem. In: Machine Learning and Cybernetics, 2008 International Conference on, vol. 6, pp. 3150–3155 (2008)
20.
Zurück zum Zitat Inoue, H., Nakatani, T.: Performance of multi-process and multi-thread processing on multi-core smt processors. In: Workload Characterization (IISWC), 2010 IEEE International Symposium on, pp. 1–10 (2010) Inoue, H., Nakatani, T.: Performance of multi-process and multi-thread processing on multi-core smt processors. In: Workload Characterization (IISWC), 2010 IEEE International Symposium on, pp. 1–10 (2010)
21.
Zurück zum Zitat Irigoin, F., Triolet, R.: Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’88, pp. 319–329. ACM, New York, NY (1988) Irigoin, F., Triolet, R.: Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’88, pp. 319–329. ACM, New York, NY (1988)
22.
Zurück zum Zitat Korkin, D., Wang, Q., Shang, Y.: An efficient parallel algorithm for the multiple longest common subsequence (mlcs) problem. In: Parallel Processing, 2008. ICPP ’08. 37th International Conference on, pp. 354–363 (2008) Korkin, D., Wang, Q., Shang, Y.: An efficient parallel algorithm for the multiple longest common subsequence (mlcs) problem. In: Parallel Processing, 2008. ICPP ’08. 37th International Conference on, pp. 354–363 (2008)
23.
Zurück zum Zitat Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, Saint-Malo, France, pp. 451–460. ACM, New York, NY (2010) Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, Saint-Malo, France, pp. 451–460. ACM, New York, NY (2010)
24.
Zurück zum Zitat Lim, A.W., Lam, M.S.: Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Comput. 24(3–4), 445–475 (1998)CrossRefMATHMathSciNet Lim, A.W., Lam, M.S.: Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Comput. 24(3–4), 445–475 (1998)CrossRefMATHMathSciNet
25.
Zurück zum Zitat Liu, J., Wu, S.: Research on longest common subsequence fast algorithm. In: 2011 International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 4338–4341 (2011) Liu, J., Wu, S.: Research on longest common subsequence fast algorithm. In: 2011 International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 4338–4341 (2011)
26.
Zurück zum Zitat Mitchell, N., Carter, L., Ferrante, J., Tullsen, D.: Ilp versus tlp on smt. In: Supercomputing, ACM/IEEE 1999 Conference, p. 37 (1999) Mitchell, N., Carter, L., Ferrante, J., Tullsen, D.: Ilp versus tlp on smt. In: Supercomputing, ACM/IEEE 1999 Conference, p. 37 (1999)
27.
Zurück zum Zitat Nuzman, D., Zaks, A.: Outer-loop vectorization: revisited for short simd architectures. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pp. 2–11. ACM, New York, NY (2008) Nuzman, D., Zaks, A.: Outer-loop vectorization: revisited for short simd architectures. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pp. 2–11. ACM, New York, NY (2008)
28.
Zurück zum Zitat Ramanujam, J., Sadayappan, P.: Tiling multidimensional iteration spaces for multicomputers. J. Parallel Distrib. Comput. 16(2), 108–120 (1992)CrossRef Ramanujam, J., Sadayappan, P.: Tiling multidimensional iteration spaces for multicomputers. J. Parallel Distrib. Comput. 16(2), 108–120 (1992)CrossRef
29.
Zurück zum Zitat Riakiotakis, I., Tsanakas, P.: Dynamic scheduling of nested loops with uniform dependencies in heterogeneous networks of workstations. In: Parallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8th International Symposium on, p. 6 (2005) Riakiotakis, I., Tsanakas, P.: Dynamic scheduling of nested loops with uniform dependencies in heterogeneous networks of workstations. In: Parallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8th International Symposium on, p. 6 (2005)
30.
Zurück zum Zitat Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.-Z., Stratton, J.A., Hwu, W.-M.W.: Program optimization space pruning for a multithreaded gpu. In: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’08, pp. 195–204. ACM, New York, NY (2008) Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.-Z., Stratton, J.A., Hwu, W.-M.W.: Program optimization space pruning for a multithreaded gpu. In: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’08, pp. 195–204. ACM, New York, NY (2008)
31.
Zurück zum Zitat Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08, pp. 18:1–18:15. ACM, New York, NY (2008) Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08, pp. 18:1–18:15. ACM, New York, NY (2008)
32.
Zurück zum Zitat Shang, W., Fortes, J.A.B.: Time optimal linear schedules for algorithms with uniform dependencies. IEEE Trans. Comput. 40(6), 723–742 (1991)CrossRefMathSciNet Shang, W., Fortes, J.A.B.: Time optimal linear schedules for algorithms with uniform dependencies. IEEE Trans. Comput. 40(6), 723–742 (1991)CrossRefMathSciNet
33.
Zurück zum Zitat Xue, J.: Communication-minimal tiling of uniform dependence loops. J. Parallel Distrib. Comput. 42(1), 42–59 (1997) Xue, J.: Communication-minimal tiling of uniform dependence loops. J. Parallel Distrib. Comput. 42(1), 42–59 (1997)
Metadaten
Titel
A Case Study of Implementing Supernode Transformations
verfasst von
Johann Steinbrecher
Cesar J. Philippidis
Weijia Shang
Publikationsdatum
01.04.2014
Verlag
Springer US
Erschienen in
International Journal of Parallel Programming / Ausgabe 2/2014
Print ISSN: 0885-7458
Elektronische ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-013-0248-7

Weitere Artikel der Ausgabe 2/2014

International Journal of Parallel Programming 2/2014 Zur Ausgabe