research-article

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Authors:
Yuan Tang

Fudan University, China

Fudan University, China
View Profile

,
Ronghui You

Fudan University, China

Fudan University, China
View Profile

,
Haibin Kan

Fudan University, China

Fudan University, China
View Profile

,
Jesmin Jahan Tithi

Stony Brook University, USA

Stony Brook University, USA
View Profile

,
Pramod Ganapathi

Stony Brook University, USA

Stony Brook University, USA
View Profile

,
Rezaul A. Chowdhury

Stony Brook University, USA

Stony Brook University, USA
View Profile

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingJanuary 2015Pages 205–214https://doi.org/10.1145/2688500.2688514

Published:24 January 2015Publication History

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 205–214

ABSTRACT

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all its real dependency constraints are satisfied, while preserving the cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of 2-way DAC-based Floyd Warshall's algorithm on an $n$ node graph from $Thn^2n$ to $Thn$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $Th(d^2 h) w^ (d+2) - 1$ to $Thh$, and LCS on two sequences of length $n$ each from $Thn^_2 3$ to $Thn$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$ - $5$ times improvement in absolute running time, $10$ - $20$ times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.

References

U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1–12, 2000. Google ScholarDigital Library
K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work stealing. In IPDPS, pages 1–12. IEEE, April 2010.Google Scholar
A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974. Google ScholarDigital Library
R. Bellman. Dynamic Programming. Princeton University Press, 1957. Google ScholarDigital Library
G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 48–57, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486–1505, 1992. ISSN 0022-3670.Google ScholarCross Ref
G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pages 235–244. ACM, 2004. Google ScholarDigital Library
G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 355–366, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, Santa Barbara, California, July 1995. Google ScholarDigital Library
S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. SC Conference, 0:42, 2000. Google ScholarDigital Library
R. Chowdhury. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 2007. Google ScholarDigital Library
R. Chowdhury and V. Ramachandran. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207–216, 2008. Google ScholarDigital Library
R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Computing Systems, 47(4):878–919, 2010. Google ScholarDigital Library
R. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. Journal of Parallel and Distributed Computing (Special issue on best papers from IPDPS 2010, 2011 and 2012), 73(7):911–925, 2013. A preliminary version appeared as {17}.Google Scholar
R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 06, pages 591–600, 2006. Google ScholarDigital Library
R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. TCBB, 7(3):495–510, July-Sept. 2010. Google ScholarDigital Library
R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, pages 1––12, April 2010..Google ScholarCross Ref
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009. Google ScholarDigital Library
K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1–4:12, Austin, TX, Nov. 15–18 2008. Google ScholarDigital Library
R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.Google ScholarCross Ref
H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533–538, Las Vegas, NV, July13–16 2009.Google Scholar
R. Floyd. Algorithm 97 (SHORTEST PATH). Commun. ACM, 5(6): 345, 1962. Google ScholarDigital Library
M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361–366, Cambridge, MA, June 20–22 2005. Google ScholarDigital Library
M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, pages 271–280, 2006. Google ScholarDigital Library
M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. Theory of Computing Systems, 45(2):203– 233, 2009. Google ScholarDigital Library
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI ’98, pages 212–223, 1998. Google ScholarDigital Library
M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cacheoblivious algorithms. In FOCS, pages 285–297, New York, NY, Oct. 17–19 1999. Google ScholarDigital Library
Z. Galil and R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64: 107–118, 1989. Google ScholarDigital Library
Z. Galil and K. Park. Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing, 21:213–222, 1994. Google ScholarDigital Library
R. Giegerich, C. Meyer, and P. Steffen. A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3):215–263, 2004. Google ScholarDigital Library
O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.Google ScholarCross Ref
J. Guo, G. Biksh, B. B. Fraguela, M. J. Garzarn, and D. Padua. Programming with tiles. In In PPoPP 08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Google ScholarDigital Library
D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google ScholarDigital Library
Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145–156, Santorini, Greece, June 13–15 2010. Google ScholarDigital Library
Intel Corporation. The Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/ architecture-and-technology/many-integrated-core/ intel-many-integrated-core-architecture.html, 2011.Google Scholar
S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36–43, Chicago, IL, June 12 2005. Google ScholarDigital Library
S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51–60, San Jose, CA, 2006. ISBN 1-59593-578-9.. URL http://doi.acm.org/10.1145/1178597.1178605. Google ScholarDigital Library
J. O. S. Kennedy. Applications of dynamic programming to agriculture, forestry and fisheries: Review and prognosis. Review of Marketing and Agricultural Economics, 49(03), 1981.Google Scholar
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10–13 2007. Google ScholarDigital Library
S. Maleki, M. Musuvathi, and T. Mytkowicz. Parallelizing dynamic programming through rank convergence. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’14, pages 219–232, New York, NY, USA, 2014. Google ScholarDigital Library
ACM. ISBN 978-1-4503-2656-8.Google Scholar
J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010, pages 1–12, 2010.Google ScholarCross Ref
A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83(2-3):197–214, 1994. ISSN 0010-4655.Google ScholarCross Ref
A. Nitsure. Implementation and optimization of a cache oblivious lattice Boltzmann algorithm. Master’s thesis, Institut für Informatic, Friedrich-Alexander-Universität Erlangen-Nürnberg, July 2006.Google Scholar
L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1–11, Rome, Italy, May 23–29 2009. Google ScholarDigital Library
A. A. Robichek, E. J. Elton, and M. J. Gruber. Dynamic programming applications in finance. The Journal of Finance, 26(2):473–506, 1971.Google ScholarCross Ref
D. Romer. It’s fourth down and what does the bellman equation say? a dynamic programming analysis of football strategy. Technical report, National Bureau of Economic Research, 2002.Google ScholarCross Ref
J. Rust. Numerical dynamic programming in economics. Handbook of computational economics, 1:619–729, 1996.Google ScholarCross Ref
J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In SPAA, pages 152–163, 2013. Google ScholarDigital Library
H. V. Simhadri, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and A. Kyrola. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’14, pages 30–41, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
D. K. Smith. Dynamic programming and board games: A survey. European Journal of Operational Research, 176(3):1299–1318, 2007.Google ScholarCross Ref
A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Norwood, MA, 2000. ISBN 1580530761.Google Scholar
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, San Jose, CA, USA, 2011. Google ScholarDigital Library
Y. Tang, R. A. Chowdhury, C.-K. Luk, and C. E. Leiserson. Coding stencil computation using the Pochoir stencil-specification language. In HotPar’11, Berkeley, CA, USA, May 2011.Google Scholar
G. Tzenakis, A. Papatriantafyllou, H. Vandierendonck, P. Pratikakis, and D. S. Nikolopoulos. BDDT: block-level dynamic dependence analysis for task-based parallelism. In Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013, Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers, pages 17–31, 2013.Google Scholar
S. Warshall. A theorem on boolean matrices. J. ACM, 9(1):11–12, 1962. Google ScholarDigital Library
M. Waterman. Introduction to Computational Biology. Chapman & Hall, London, UK, 1995.Google ScholarCross Ref
S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1–14, Miami, FL, Apr. 2008.Google ScholarCross Ref

Index Terms

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Recommendations

Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms
SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures

Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and ...
Read More
Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency
PPoPP '15

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best ...
Read More
Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance
WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations

The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
ISBN:9781450332057
DOI:10.1145/2688500
General Chair:
Albert Cohen
INRIA, France
,
Program Chair:
David Grove
IBM Research, USA
ACM SIGPLAN Notices Volume 50, Issue 8
PPoPP '15
August 2015
290 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2858788
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 January 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Cilk
cache-oblivious parallel algorithms
cache-oblivious wavefront
dynamic programming
multi-core
nested parallel computation
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate230of1,014submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 22
  Total Citations
  View Citations
- 692
  Total Downloads
- Downloads (Last 12 months)24
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance