skip to main content
10.1145/1810479.1810519acmconferencesArticle/Chapter ViewAbstractPublication PagesspaaConference Proceedingsconference-collections
research-article

Low depth cache-oblivious algorithms

Published:13 June 2010Publication History

ABSTRACT

In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.

References

  1. U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.Google ScholarGoogle Scholar
  2. A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierarchical memory. In ACM STOC'87, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12, 1994.Google ScholarGoogle Scholar
  4. B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Proc. 1993 Conf. on Programming Models for Massively Parallel Computers, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  5. L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In ACM STOC'02, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. L. Arge, G. S. Brodal, and R. Fagerberg. Cache-oblivous data structures. In D. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications. CRC Press, 2005.Google ScholarGoogle Scholar
  7. L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA'08, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. Manuscript, 2009.Google ScholarGoogle Scholar
  9. N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM SPAA'98, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. A. Bender, G. S. Brodal, R. Fagerberg, R. Jacob, and E. Vicari. Optimal sparse matrix dense vector multiplication in the I/O-model. In ACM SPAA'07, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. A. Bender, B. C. Kuszmaul, S.-H. Teng, and K. Wang. Optimal cache-oblivious mesh layout. Computing Research Repository (CoRR) abs/0705.1033, 2007.Google ScholarGoogle Scholar
  12. G. Bilardi. Models for parallel and hierarchical computation. In Proc. 4th ACM International Conf. on Computing Frontiers, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3), 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA'08, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA'04, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the ACM, 46(2), 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Brief announcement: Low-depth cache oblivious sorting. In ACM SPAA'09, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. Technical Report CMU-CS-TR-134, Computer Science Department, Carnegie Mellon University, 2009 http://reports-archive.adm.cs.cmu.edu/anon/2009/CMU-CS-09-134.pdf.Google ScholarGoogle Scholar
  19. R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. Dag-consistent distributed shared memory. In IPPS'96, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. J. Parallel Distrib. Comput., 37(1), 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. S. Brodal. Cache-oblivious algorithms and data structures. In Proc. 9th Scandinavian Workshop on Algorithm Theory, 2004. LNCS, vol. 3111. Springer.Google ScholarGoogle ScholarCross RefCross Ref
  23. G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In ICALP '02, 2002. LNCS, vol. 2380. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. S. Brodal, R. Fagerberg, and G. Moruz. Cache-aware and cache-oblivious adaptive sorting. In ICALP'05, 2005. LNCS, vol. 3580. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. ACM Journal of Experimental Algorithmics, 12, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, and V. Sarkar. X10: An object-oriented approach to non-uniform clustered computing. In Proc. ACM SIGPLAN Conf. on Object-Oriented Programming Languages and Applications, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In ACM-SIAM SODA'95, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. A. Chowdhury and V. Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In ACM SPAA'07, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. R. A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In ACM SPAA'08, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. R. A. Chowdhury, V. Ramachandran, and F. Silvestri. Oblivious algorithms for multicore, network, and petascale computing. Manuscript, 2009.Google ScholarGoogle Scholar
  31. R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS'10, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  32. R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In ACM STOC'86, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken. Logp: Towards a realistic model of parallel computation. In ACM PPOPP'93, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. E. D. Demaine. Cache-oblivious algorithms and data structures. In Lecture Notes from the EEF Summer School on Massive Data Sets, LNCS. Springer-Verlag, 2002.Google ScholarGoogle Scholar
  35. K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: Programming the memory hierarchy. In Supercomputing'06, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. G. Franceschini. Proximity mergesort: Optimal in-place sorting in the cache-oblivious model. In ACM-SIAM SODA'04, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. W. D. Frazer and A. C. McKellar. Samplesort: A sampling approach to minimal storage tree sorting. Journal of the ACM, 17(3), 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In IEEE FOCS'99, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In ACM SPAA'06, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Jaja. An Introduction to Parallel Algorithms. Addison-Wesley, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. P. Kumar. Cache oblivious algorithms. In U. Meyer, P. Sanders, and J. Sibeyn, editors, Algorithms for Memory Hierarchies. Springer, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  42. E. Ladan-Mozes and C. E. Leiserson. A consistency architecture for hierarchical shared caches. In ACM SPAA'08, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36, 1979.Google ScholarGoogle Scholar
  44. OpenMP Architecture Review Board. OpenMP application program interface. Technical Report Version 3.0, 2008.Google ScholarGoogle Scholar
  45. S. Rajasekaran and J. H. Reif. Optimal and sublogarithmic time randomized parallel sorting algorithms. SIAM J. Comput., 18(3), 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. J. E. Savage and M. Zubair. A unified model for multicore architectures. In Proc. 1st International Forum on Next-Generation Multicore/Manycore Technologies, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2), 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8), 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. L. G. Valiant. A bridging model for multicore computing. In ESA, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Low depth cache-oblivious algorithms

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                SPAA '10: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
                June 2010
                378 pages
                ISBN:9781450300797
                DOI:10.1145/1810479

                Copyright © 2010 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 13 June 2010

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • research-article

                Acceptance Rates

                Overall Acceptance Rate447of1,461submissions,31%

                Upcoming Conference

                SPAA '24

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader