skip to main content
10.1145/2688500.2688514acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Published:24 January 2015Publication History

ABSTRACT

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all its real dependency constraints are satisfied, while preserving the cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of 2-way DAC-based Floyd Warshall's algorithm on an $n$ node graph from $Thn^2n$ to $Thn$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $Th(d^2 h) w^ (d+2) - 1$ to $Thh$, and LCS on two sequences of length $n$ each from $Thn^_2 3$ to $Thn$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$ - $5$ times improvement in absolute running time, $10$ - $20$ times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.

References

  1. U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1–12, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work stealing. In IPDPS, pages 1–12. IEEE, April 2010.Google ScholarGoogle Scholar
  3. A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Bellman. Dynamic Programming. Princeton University Press, 1957. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 48–57, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486–1505, 1992. ISSN 0022-3670.Google ScholarGoogle ScholarCross RefCross Ref
  7. G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pages 235–244. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 355–366, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, Santa Barbara, California, July 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. SC Conference, 0:42, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Chowdhury. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Chowdhury and V. Ramachandran. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207–216, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Computing Systems, 47(4):878–919, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. Journal of Parallel and Distributed Computing (Special issue on best papers from IPDPS 2010, 2011 and 2012), 73(7):911–925, 2013. A preliminary version appeared as {17}.Google ScholarGoogle Scholar
  15. R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 06, pages 591–600, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. TCBB, 7(3):495–510, July-Sept. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, pages 1––12, April 2010..Google ScholarGoogle ScholarCross RefCross Ref
  18. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1–4:12, Austin, TX, Nov. 15–18 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  21. H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533–538, Las Vegas, NV, July13–16 2009.Google ScholarGoogle Scholar
  22. R. Floyd. Algorithm 97 (SHORTEST PATH). Commun. ACM, 5(6): 345, 1962. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361–366, Cambridge, MA, June 20–22 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, pages 271–280, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. Theory of Computing Systems, 45(2):203– 233, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI ’98, pages 212–223, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cacheoblivious algorithms. In FOCS, pages 285–297, New York, NY, Oct. 17–19 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Z. Galil and R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64: 107–118, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Z. Galil and K. Park. Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing, 21:213–222, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. R. Giegerich, C. Meyer, and P. Steffen. A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3):215–263, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.Google ScholarGoogle ScholarCross RefCross Ref
  32. J. Guo, G. Biksh, B. B. Fraguela, M. J. Garzarn, and D. Padua. Programming with tiles. In In PPoPP 08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145–156, Santorini, Greece, June 13–15 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Intel Corporation. The Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/ architecture-and-technology/many-integrated-core/ intel-many-integrated-core-architecture.html, 2011.Google ScholarGoogle Scholar
  36. S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36–43, Chicago, IL, June 12 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51–60, San Jose, CA, 2006. ISBN 1-59593-578-9.. URL http://doi.acm.org/10.1145/1178597.1178605. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. J. O. S. Kennedy. Applications of dynamic programming to agriculture, forestry and fisheries: Review and prognosis. Review of Marketing and Agricultural Economics, 49(03), 1981.Google ScholarGoogle Scholar
  39. S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10–13 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. S. Maleki, M. Musuvathi, and T. Mytkowicz. Parallelizing dynamic programming through rank convergence. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’14, pages 219–232, New York, NY, USA, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. ACM. ISBN 978-1-4503-2656-8.Google ScholarGoogle Scholar
  42. J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010, pages 1–12, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  43. A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83(2-3):197–214, 1994. ISSN 0010-4655.Google ScholarGoogle ScholarCross RefCross Ref
  44. A. Nitsure. Implementation and optimization of a cache oblivious lattice Boltzmann algorithm. Master’s thesis, Institut für Informatic, Friedrich-Alexander-Universität Erlangen-Nürnberg, July 2006.Google ScholarGoogle Scholar
  45. L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1–11, Rome, Italy, May 23–29 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. A. A. Robichek, E. J. Elton, and M. J. Gruber. Dynamic programming applications in finance. The Journal of Finance, 26(2):473–506, 1971.Google ScholarGoogle ScholarCross RefCross Ref
  47. D. Romer. It’s fourth down and what does the bellman equation say? a dynamic programming analysis of football strategy. Technical report, National Bureau of Economic Research, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  48. J. Rust. Numerical dynamic programming in economics. Handbook of computational economics, 1:619–729, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  49. J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In SPAA, pages 152–163, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. H. V. Simhadri, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and A. Kyrola. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’14, pages 30–41, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. D. K. Smith. Dynamic programming and board games: A survey. European Journal of Operational Research, 176(3):1299–1318, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  52. A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Norwood, MA, 2000. ISBN 1580530761.Google ScholarGoogle Scholar
  53. Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, San Jose, CA, USA, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Y. Tang, R. A. Chowdhury, C.-K. Luk, and C. E. Leiserson. Coding stencil computation using the Pochoir stencil-specification language. In HotPar’11, Berkeley, CA, USA, May 2011.Google ScholarGoogle Scholar
  55. G. Tzenakis, A. Papatriantafyllou, H. Vandierendonck, P. Pratikakis, and D. S. Nikolopoulos. BDDT: block-level dynamic dependence analysis for task-based parallelism. In Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013, Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers, pages 17–31, 2013.Google ScholarGoogle Scholar
  56. S. Warshall. A theorem on boolean matrices. J. ACM, 9(1):11–12, 1962. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. M. Waterman. Introduction to Computational Biology. Chapman & Hall, London, UK, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  58. S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1–14, Miami, FL, Apr. 2008.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
            January 2015
            290 pages
            ISBN:9781450332057
            DOI:10.1145/2688500
            • cover image ACM SIGPLAN Notices
              ACM SIGPLAN Notices  Volume 50, Issue 8
              PPoPP '15
              August 2015
              290 pages
              ISSN:0362-1340
              EISSN:1558-1160
              DOI:10.1145/2858788
              • Editor:
              • Andy Gill
              Issue’s Table of Contents

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 24 January 2015

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate230of1,014submissions,23%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader