ABSTRACT
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
- U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.Google Scholar
- A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierarchical memory. In ACM STOC'87, 1987. Google ScholarDigital Library
- B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12, 1994.Google Scholar
- B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Proc. 1993 Conf. on Programming Models for Massively Parallel Computers, 1993.Google ScholarCross Ref
- L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In ACM STOC'02, 2002. Google ScholarDigital Library
- L. Arge, G. S. Brodal, and R. Fagerberg. Cache-oblivous data structures. In D. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications. CRC Press, 2005.Google Scholar
- L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA'08, 2008. Google ScholarDigital Library
- L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. Manuscript, 2009.Google Scholar
- N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM SPAA'98, 1998. Google ScholarDigital Library
- M. A. Bender, G. S. Brodal, R. Fagerberg, R. Jacob, and E. Vicari. Optimal sparse matrix dense vector multiplication in the I/O-model. In ACM SPAA'07, 2007. Google ScholarDigital Library
- M. A. Bender, B. C. Kuszmaul, S.-H. Teng, and K. Wang. Optimal cache-oblivious mesh layout. Computing Research Repository (CoRR) abs/0705.1033, 2007.Google Scholar
- G. Bilardi. Models for parallel and hierarchical computation. In Proc. 4th ACM International Conf. on Computing Frontiers, 2007. Google ScholarDigital Library
- G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3), 1996. Google ScholarDigital Library
- G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA'08, 2008. Google ScholarDigital Library
- G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA'04, 2004. Google ScholarDigital Library
- G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the ACM, 46(2), 1999. Google ScholarDigital Library
- G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Brief announcement: Low-depth cache oblivious sorting. In ACM SPAA'09, 2009. Google ScholarDigital Library
- G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. Technical Report CMU-CS-TR-134, Computer Science Department, Carnegie Mellon University, 2009 http://reports-archive.adm.cs.cmu.edu/anon/2009/CMU-CS-09-134.pdf.Google Scholar
- R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. Dag-consistent distributed shared memory. In IPPS'96, 1996. Google ScholarDigital Library
- R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. J. Parallel Distrib. Comput., 37(1), 1996. Google ScholarDigital Library
- R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999. Google ScholarDigital Library
- G. S. Brodal. Cache-oblivious algorithms and data structures. In Proc. 9th Scandinavian Workshop on Algorithm Theory, 2004. LNCS, vol. 3111. Springer.Google ScholarCross Ref
- G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In ICALP '02, 2002. LNCS, vol. 2380. Springer. Google ScholarDigital Library
- G. S. Brodal, R. Fagerberg, and G. Moruz. Cache-aware and cache-oblivious adaptive sorting. In ICALP'05, 2005. LNCS, vol. 3580. Springer. Google ScholarDigital Library
- G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. ACM Journal of Experimental Algorithmics, 12, 2008. Google ScholarDigital Library
- P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, and V. Sarkar. X10: An object-oriented approach to non-uniform clustered computing. In Proc. ACM SIGPLAN Conf. on Object-Oriented Programming Languages and Applications, 2005. Google ScholarDigital Library
- Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In ACM-SIAM SODA'95, 1995. Google ScholarDigital Library
- R. A. Chowdhury and V. Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In ACM SPAA'07, 2007. Google ScholarDigital Library
- R. A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In ACM SPAA'08, 2008. Google ScholarDigital Library
- R. A. Chowdhury, V. Ramachandran, and F. Silvestri. Oblivious algorithms for multicore, network, and petascale computing. Manuscript, 2009.Google Scholar
- R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS'10, 2010.Google ScholarCross Ref
- R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In ACM STOC'86, 1986. Google ScholarDigital Library
- D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken. Logp: Towards a realistic model of parallel computation. In ACM PPOPP'93, 1993. Google ScholarDigital Library
- E. D. Demaine. Cache-oblivious algorithms and data structures. In Lecture Notes from the EEF Summer School on Massive Data Sets, LNCS. Springer-Verlag, 2002.Google Scholar
- K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: Programming the memory hierarchy. In Supercomputing'06, 2006. Google ScholarDigital Library
- G. Franceschini. Proximity mergesort: Optimal in-place sorting in the cache-oblivious model. In ACM-SIAM SODA'04, 2004. Google ScholarDigital Library
- W. D. Frazer and A. C. McKellar. Samplesort: A sampling approach to minimal storage tree sorting. Journal of the ACM, 17(3), 1970. Google ScholarDigital Library
- M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In IEEE FOCS'99, 1999. Google ScholarDigital Library
- M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In ACM SPAA'06, 2006. Google ScholarDigital Library
- J. Jaja. An Introduction to Parallel Algorithms. Addison-Wesley, 1992. Google ScholarDigital Library
- P. Kumar. Cache oblivious algorithms. In U. Meyer, P. Sanders, and J. Sibeyn, editors, Algorithms for Memory Hierarchies. Springer, 2003.Google ScholarCross Ref
- E. Ladan-Mozes and C. E. Leiserson. A consistency architecture for hierarchical shared caches. In ACM SPAA'08, 2008. Google ScholarDigital Library
- R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36, 1979.Google Scholar
- OpenMP Architecture Review Board. OpenMP application program interface. Technical Report Version 3.0, 2008.Google Scholar
- S. Rajasekaran and J. H. Reif. Optimal and sublogarithmic time randomized parallel sorting algorithms. SIAM J. Comput., 18(3), 1989. Google ScholarDigital Library
- J. E. Savage and M. Zubair. A unified model for multicore architectures. In Proc. 1st International Forum on Next-Generation Multicore/Manycore Technologies, 2008. Google ScholarDigital Library
- D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2), 1985. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8), 1990. Google ScholarDigital Library
- L. G. Valiant. A bridging model for multicore computing. In ESA, 2008. Google ScholarDigital Library
Index Terms
- Low depth cache-oblivious algorithms
Recommendations
Cache-Oblivious Algorithms
This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: ...
Brief announcement: low depth cache-oblivious sorting
SPAA '09: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architecturesCache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifics (cache size and cache line size) of each level. In this paper, we describe cache-...
Effects of Cache Coherency in Multiprocessors
In many commercial multiprocessor systems, each processor accesses the memory through a private cache. One problem that could limit the extensibility of the system and its performance is the enforcement of cache coherence. A mechanism must exist which ...
Comments