
2019 | Original Paper | Book Chapter

Toward Efficient Architecture-Independent Algorithms for Dynamic Programs

Authors: Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, Rezaul Chowdhury

Published in: High Performance Computing

Publisher: Springer International Publishing


Abstract

We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently on manycore GPUs and distributed-memory machines as well.
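To make the structure concrete, here is a minimal sketch, in Python, of one such two-way recursive divide-and-conquer DP: an R-Kleene-style in-place recursion for all-pairs shortest paths in the spirit of [23]. The function names, the base-case size, and the naive (min,+) kernel are our illustrative choices, not the paper's implementation.

```python
import numpy as np

INF = float("inf")

def minplus(C, A, B):
    """C = min(C, A (min,+) B): the tropical product behind APSP-style DPs.

    A plain triple loop for readability; a tuned (blocked/vectorized/GPU)
    kernel would be dropped in here. Aliasing among A, B, C is intentional:
    a partially updated entry only exposes a shorter, still-valid path, the
    same argument that makes in-place Floyd-Warshall correct.
    """
    n, p, m = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for k in range(p):
            a = A[i, k]
            if a == INF:
                continue
            for j in range(m):
                v = a + B[k, j]
                if v < C[i, j]:
                    C[i, j] = v

def fw_recursive(D, base=32):
    """In-place recursive Floyd-Warshall on an n x n matrix, n a power of 2.

    Expects D[i, i] == 0 and D[i, j] = w(i, j), or INF for missing edges.
    """
    n = D.shape[0]
    if n <= base:                        # base case: iterative Floyd-Warshall
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    v = D[i, k] + D[k, j]
                    if v < D[i, j]:
                        D[i, j] = v
        return
    h = n // 2                           # two-way split into quadrant views
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    fw_recursive(D11, base)              # close paths inside the first half
    minplus(D12, D11, D12)               # route boundary paths through it
    minplus(D21, D21, D11)
    minplus(D22, D21, D12)
    fw_recursive(D22, base)              # close paths inside the second half
    minplus(D12, D12, D22)               # propagate the second half back
    minplus(D21, D22, D21)
    minplus(D11, D12, D21)
```

Because the quadrants are views into a single array, the whole computation runs in place; temporal locality comes from the fact that, at some recursion depth, each quadrant fits entirely in a level of cache, and this depth-first structure is what carries over to GPUs and distributed memory.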
Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (\(\ge 2\)) varies with the current depth of the recursion. Our distributed-memory algorithms are likewise based on multi-way recursive divide and conquer, which extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.
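The abstract does not spell out the rule for picking r, but the idea admits a small sketch: at each recursion level, grow the fan-out until a single subproblem tile fits in the fast memory available at that level. Everything below (the name choose_r, the capacity parameter, the power-of-two policy) is an illustrative assumption, not the authors' formula.

```python
def choose_r(n, capacity_words, r_max=64):
    """Pick the fan-out r (a power of two, >= 2) for the current level so
    that one (n/r) x (n/r) tile of the DP table fits in fast memory.

    capacity_words: tile budget in matrix elements, e.g., a fraction of the
    GPU's device RAM when the full table lives in host RAM or on disk.
    """
    r = 2
    while r < r_max and (n // r) ** 2 > capacity_words:
        r *= 2
    return r

# With a 2^15 x 2^15 table and room for a 4096 x 4096 tile,
# choose_r(1 << 15, 4096 * 4096) returns 8; a few levels down, the
# subproblems already fit and the same call degenerates to a 2-way split.
```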
We also report empirical results for our GPU and distributed-memory algorithms.


Footnotes
1
As of November 2018, the supercomputers ranked 1 (Summit), 2 (Sierra), 6 (ABCI), 7 (Piz Daint), and 8 (Titan) in order of Rpeak (TFlop/s) are networks of hybrid CPU+GPU nodes [4].
 
2
Temporal locality: whenever a block of data is brought into a faster level of cache/memory from a slower level, as much useful work as possible is performed on that data before the block is evicted from the faster level.
 
3
I.e., faster and closer to the processing core(s).
 
References
5. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
7. Aho, A.V., Hopcroft, J.E.: The Design and Analysis of Computer Algorithms. Pearson Education India, Noida (1974)
8. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)
9. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel algorithm for Strassen's matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204. ACM (2012)
10. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32(3), 866–901 (2011)
11. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32 (2012)
12. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)
13. Bender, M., Ebrahimi, R., Fineman, J., Ghasemiesfeh, G., Johnson, R., McCauley, S.: Cache-adaptive algorithms. In: SODA (2014)
14. Buluç, A., Gilbert, J.R., Budak, C.: Solving path problems on the GPU. Parallel Comput. 36(5), 241–253 (2010)
15. Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Technical report, Montana State University, Bozeman Engineering Research Labs (1969)
16. Carson, E., Knight, N., Demmel, J.: Avoiding communication in two-sided Krylov subspace methods. Technical report, EECS, UC Berkeley (2011)
17. Cherng, C., Ladner, R.: Cache efficient simple dynamic programming. In: AofA, pp. 49–58 (2005)
18. Chowdhury, R., Ganapathi, P., Tang, Y., Tithi, J.J.: Provably efficient scheduling of cache-oblivious wavefront algorithms. In: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 339–350. ACM, July 2017
20. Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: SPAA, pp. 207–216 (2008)
21. Chowdhury, R.A., Ramachandran, V.: The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. Theory Comput. Syst. 47(4), 878–919 (2010)
22. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)
23. D'Alberto, P., Nicolau, A.: R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica 47(2), 203–213 (2007)
25. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), A206–A239 (2012)
26. Diament, B., Ferencz, A.: Comparison of parallel APSP algorithms (1999)
27. Djidjev, H., Thulasidasan, S., Chapuis, G., Andonov, R., Lavenier, D.: Efficient multi-GPU computation of all-pairs shortest paths. In: IPDPS, pp. 360–369 (2014)
28. Driscoll, M., Georganas, E., Koanantakool, P., Solomonik, E., Yelick, K.: A communication-optimal n-body algorithm for direct interactions. In: IPDPS, pp. 1075–1084. IEEE (2013)
29. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: FOCS, pp. 285–297 (1999)
30. Galil, Z., Giancarlo, R.: Speeding up dynamic programming with applications to molecular biology. TCS 64(1), 107–118 (1989)
31. Galil, Z., Park, K.: Parallel algorithms for dynamic programming recurrences with more than \(O(1)\) dependency. JPDC 21(2), 213–222 (1994)
32. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York (1997)
33. Habbal, M.B., Koutsopoulos, H.N., Lerman, S.R.: A decomposition algorithm for the all-pairs shortest path problem on massively parallel computer architectures. Transp. Sci. 28(4), 292–308 (1994)
34. Harish, P., Narayanan, P.: Accelerating large graph algorithms on the GPU using CUDA. In: HiPC, pp. 197–208 (2007)
35. Holzer, S., Wattenhofer, R.: Optimal distributed all pairs shortest paths and applications. In: PODC, pp. 355–364. ACM (2012)
36. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
37. Itzhaky, S., et al.: Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. In: OOPSLA, pp. 145–164. ACM (2016)
38. Jenq, J.F., Sahni, S.: All pairs shortest paths on a hypercube multiprocessor (1987)
39. Johnsson, S.L.: Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19(11), 1235–1257 (1993)
40. Katz, G.J., Kider Jr., J.T.: All-pairs shortest-paths for large graphs on the GPU. In: ACM SIGGRAPH/EUROGRAPHICS, pp. 47–55 (2008)
41. Kogge, P., Shalf, J.: Exascale computing trends: adjusting to the "new normal" for computer architecture. Comput. Sci. Eng. 15(6), 16–26 (2013)
43. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms, vol. 400. Benjamin/Cummings, Redwood City (1994)
44. Kumar, V., Singh, V.: Scalability of parallel algorithms for the all-pairs shortest-path problem. J. Parallel Distrib. Comput. 13(2), 124–138 (1991)
45. Liu, W., Schmidt, B., Voss, G., Muller-Wittig, W.: Streaming algorithms for biological sequence alignment on GPUs. TPDS 18(9), 1270–1281 (2007)
46. Liu, W., Schmidt, B., Voss, G., Schroder, A., Muller-Wittig, W.: Bio-sequence database scanning on a GPU. In: IPDPS, 8 pp. (2006)
48. Manavski, S.A., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinform. 9(2), 1 (2008)
49. Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: HPCC, pp. 145–152 (2011)
50. Meyerhenke, H., Sanders, P., Schulz, C.: Parallel graph partitioning for complex networks. IEEE Trans. Parallel Distrib. Syst. 28(9), 2625–2638 (2017)
51. Nishida, K., Ito, Y., Nakano, K.: Accelerating the dynamic programming for the matrix chain product on the GPU. In: ICNC, pp. 320–326 (2011)
52. Nishida, K., Nakano, K., Ito, Y.: Accelerating the dynamic programming for the optimal polygon triangulation on the GPU. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds.) ICA3PP 2012. LNCS, vol. 7439, pp. 1–15. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33078-0_1
54. Schulte, M.J., et al.: Achieving exascale capabilities through heterogeneous computing. IEEE Micro 35(4), 26–36 (2015)
56. Solomon, S., Thulasiraman, P.: Performance study of mapping irregular computations on GPUs. In: IPDPS Workshops and PhD Forum, pp. 1–8 (2010)
57. Solomonik, E., Ballard, G., Demmel, J., Hoefler, T.: A communication-avoiding parallel algorithm for the symmetric eigenvalue problem. In: SPAA, pp. 111–121. ACM (2017)
58. Solomonik, E., Buluc, A., Demmel, J.: Minimizing communication in all-pairs shortest paths. In: IPDPS, pp. 548–559 (2013)
59. Solomonik, E., Carson, E., Knight, N., Demmel, J.: Trade-offs between synchronization, communication, and computation in parallel linear algebra computations. TOPC 3(1), 3 (2016)
62. Striemer, G.M., Akoglu, A.: Sequence alignment with GPU: performance and design challenges. In: IPDPS, pp. 1–10 (2009)
63. Tan, G., Sun, N., Gao, G.R.: A parallel dynamic programming algorithm on a multi-core architecture. In: SPAA, pp. 135–144. ACM (2007)
64. Tang, Y., You, R., Kan, H., Tithi, J., Ganapathi, P., Chowdhury, R.: Improving parallelism of recursive stencil computations without sacrificing cache performance. In: WOSC, pp. 1–7 (2014)
67. Tiskin, A.: Communication-efficient parallel generic pairwise elimination. Future Gener. Comput. Syst. 23(2), 179–188 (2007)
69. Tithi, J.J., Ganapathi, P., Talati, A., Aggarwal, S., Chowdhury, R.: High-performance energy-efficient recursive dynamic programming with matrix-multiplication-like flexible kernels. In: IPDPS, pp. 303–312 (2015)
70. Towns, J., et al.: XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16(5), 62–74 (2014)
72. Volkov, V., Demmel, J.: LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report UCB/EECS-2008-49, EECS, UC Berkeley, May 2008
73. Waterman, M.S.: Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall Ltd., New York (1995)
74. Wu, C.C., Wei, K.C., Lin, T.H.: Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. In: ICPADS, pp. 45–52 (2012)
75. Xiao, S., Aji, A.M., Feng, W.-c.: On the robust mapping of dynamic programming onto a graphics processing unit. In: ICPADS, pp. 26–33 (2009)
Metadata
Title
Toward Efficient Architecture-Independent Algorithms for Dynamic Programs
Authors
Mohammad Mahdi Javanmard
Pramod Ganapathi
Rathish Das
Zafar Ahmad
Stephen Tschudi
Rezaul Chowdhury
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-20656-7_8
