ABSTRACT
To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without compromising locality, while keeping overheads small. Software-only schedulers can implement various scheduling algorithms that match the characteristics of different applications and programming models, but they suffer significant overheads as they synchronize and communicate task information over the deep cache hierarchy of a large-scale CMP. To reduce these costs, hardware-only schedulers such as Carbon, which implement task queuing and scheduling in hardware, have been proposed. However, a hardware-only solution fixes the scheduling algorithm and leaves no room for other uses of the custom hardware.
This paper presents a combined hardware-software approach to build fine-grain schedulers that retain the flexibility of software schedulers while being as fast and scalable as hardware ones. We propose asynchronous direct messages (ADM), a simple architectural extension that provides direct exchange of asynchronous, short messages between threads in the CMP without going through the memory hierarchy. ADM is sufficient to implement a family of novel, software-mostly schedulers that rely on low-overhead messaging to efficiently coordinate scheduling and transfer task information. These schedulers match and often exceed the performance and scalability of Carbon when using the same scheduling algorithm. When the ADM runtime tailors its scheduling algorithm to application characteristics, it outperforms Carbon by up to 70%.
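To illustrate the idea behind message-based scheduling, the sketch below models, in plain Python, how threads could coordinate work stealing by exchanging short asynchronous messages rather than synchronizing through shared queues in the memory hierarchy. This is a hypothetical illustration, not the paper's actual ADM ISA or runtime: the `send`/`mailbox` primitives and message names are invented stand-ins for the hardware messaging that ADM provides.

```python
from queue import Queue
from collections import deque

# Hypothetical sketch (not the actual ADM interface): each thread owns a
# mailbox for short asynchronous messages and a private task deque that
# only its owner touches. An idle thread sends a STEAL_REQUEST message
# instead of locking a victim's shared queue; the victim replies with a
# TASK message. All names here are illustrative.

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.mailbox = Queue()   # models a per-thread receive buffer
        self.tasks = deque()     # private: no locks needed by the owner

def send(dest, msg):
    # Models a direct thread-to-thread message send.
    dest.mailbox.put(msg)

def handle_messages(t, threads):
    """Drain the mailbox, answering steal requests from our own deque."""
    while not t.mailbox.empty():
        kind, payload = t.mailbox.get()
        if kind == "STEAL_REQUEST" and t.tasks:
            # Hand the oldest task to the requesting thief.
            send(threads[payload], ("TASK", t.tasks.popleft()))
        elif kind == "TASK":
            t.tasks.append(payload)

# Two-thread demo: thread 1 is idle and steals from thread 0.
threads = [Thread(0), Thread(1)]
threads[0].tasks.extend(["t0", "t1", "t2"])
send(threads[0], ("STEAL_REQUEST", 1))   # thief announces itself
handle_messages(threads[0], threads)     # victim answers with a task
handle_messages(threads[1], threads)     # thief receives the stolen task
```

Because task transfer is a pair of short messages rather than a sequence of coherence misses on shared queue state, this style of coordination is what lets a software scheduler approach hardware-scheduler overheads while keeping the policy programmable.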
- A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife machine: architecture and performance. In Proc. of the 22nd annual International Symposium on Computer Architecture, 1995.
- S. Agarwal, R. Barik, D. Bonachea, V. Sarkar, R. K. Shyamasundar, and K. Yelick. Deadlock-free scheduling of X10 computations with bounded resources. In Proc. of the 19th annual ACM Symposium on Parallel Algorithms and Architectures, 2007.
- G. Al-Kadi and A. S. Terechko. A hardware task scheduler for embedded video processing. In Proc. of the 4th International Conference on High Performance and Embedded Architectures and Compilers, 2009.
- A. R. Alameldeen and D. A. Wood. Variability in architectural simulations of multi-threaded workloads. In Proc. of the 9th International Symposium on High-Performance Computer Architecture, 2003.
- B. Ang, D. Chiou, L. Rudolph, and Arvind. Message passing support on StarT-Voyager. In Proc. of the 5th International Conference on High Performance Computing, 1998.
- N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. of the 10th annual ACM Symposium on Parallel Algorithms and Architectures, 1998.
- D. A. Bader and V. Sachdeva. A cache-aware parallel implementation of the push-relabel network flow algorithm and experimental evaluation of the gap relabeling heuristic. In Proc. of the 18th International Conference on Parallel and Distributed Computing Systems, 2005.
- J. Balfour and W. J. Dally. Design tradeoffs for tiled CMP on-chip networks. In Proc. of the 20th annual International Conference on Supercomputing, 2006.
- S. Bell et al. TILE64 processor: A 64-core SoC with mesh interconnect. In International Solid-State Circuits Conference, 2008.
- C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. Technical Report TR-811-08, Princeton University, 2008.
- N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4), 2006.
- G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM, 46(2), 1999.
- R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proc. of the 35th Annual Symposium on Foundations of Computer Science, 1994.
- A. Bracy, K. Doshi, and Q. Jacobson. Disintermediated active communication. IEEE Comput. Archit. Lett., 5(2), 2006.
- J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6), 1986.
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. of the 20th annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2005.
- D. Chase and Y. Lev. Dynamic circular work-stealing deque. In Proc. of the 17th annual ACM Symposium on Parallel Algorithms and Architectures, 2005.
- S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In Proc. of the 19th annual ACM Symposium on Parallel Algorithms and Architectures, 2007.
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proc. of the 37th International Conference on Parallel Processing, 2008.
- W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., 2003.
- A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In 4th International Workshop in OpenMP, 2008.
- Z. Fang, L. Zhang, J. B. Carter, A. Ibrahim, and M. A. Parker. Active memory operations. In Proc. of the 21st annual International Conference on Supercomputing, 2007.
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proc. of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, 1998.
- G. Grohoski. Niagara2: A highly-threaded server-on-a-chip. In 18th Hot Chips Symposium, 2006.
- Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for terminally strict parallel programs. In Proc. of the 23rd IEEE International Parallel and Distributed Processing Symposium, 2009.
- M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. Computer, 41(7), 2008.
- R. Hoffmann, M. Korch, and T. Rauber. Performance evaluation of task pools based on hardware synchronization. In Proc. of the 2004 ACM/IEEE Conference on Supercomputing, 2004.
- HPF Language Specification. Version 2.0, 1997.
- Intel TBB. http://www.threadingbuildingblocks.org.
- Intel Tera-scale Computing Research Program. http://www.intel.com/research/platform/terascale.
- A. Kägi, D. Burger, and J. R. Goodman. Efficient synchronization: let them eat QOLB. In Proc. of the 24th annual International Symposium on Computer Architecture, 1997.
- J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, B. Tuohy, A. Mahesri, S. Lumetta, M. Frank, and S. J. Patel. Rigel: An architecture and scalable programming interface for a 1000-core accelerator. In Proc. of the 36th International Symposium on Computer Architecture, 2009.
- M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In Proc. of the 20th annual Symposium on Parallelism in Algorithms and Architectures, 2008.
- S. Kumar, C. J. Hughes, and A. Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In Proc. of the 34th annual International Symposium on Computer Architecture, 2007.
- W. S. Lee, W. Dally, S. Keckler, N. Carter, and A. Chang. An efficient, protected message interface. IEEE Computer, 31(11), 1998.
- K. Mackenzie, J. Kubiatowicz, M. Frank, W. Lee, V. Lee, A. Agarwal, and M. Kaashoek. Exploiting two-case delivery for fast protected messaging. In Proc. of the 4th International Symposium on High-Performance Computer Architecture, 1998.
- M. M. Martin et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. Computer Architecture News, 2005.
- A. Mathuriya, D. A. Bader, C. E. Heitsch, and S. C. Harvey. GTfold: a scalable multicore code for RNA secondary structure prediction. In Proc. of the 2009 ACM Symposium on Applied Computing, 2009.
- V. Nagarajan and R. Gupta. ECMon: exposing cache events for monitoring. In Proc. of the 36th annual International Symposium on Computer Architecture, 2009.
- M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-machine multicomputer: an architectural evaluation. In Proc. of the 20th annual International Symposium on Computer Architecture, 1993.
- OpenMP Application Program Interface. Version 2.5, 2005.
- R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proc. of the 13th International Conference on Parallel Architectures and Compilation Techniques, 2004.
- L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3), 2008.
- M. Själander, A. Terechko, and M. Duranton. A look-ahead task management unit for embedded multi-core architectures. In Proc. of the 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, 2008.
- M. F. Spear, A. Shriraman, H. Hossain, S. Dwarkadas, and M. L. Scott. Alert-on-update: a communication aid for shared memory multiprocessors. In Proc. of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007.
- J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1), 2009.
- S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Labs, 2008.
- T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proc. of the 19th annual International Symposium on Computer Architecture, 1992.
- H. Wong, A. Bracy, E. Schuchman, T. M. Aamodt, J. D. Collins, P. H. Wang, G. Chinya, A. K. Groen, H. Jiang, and H. Wang. Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor. In Proc. of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008.
Flexible architectural support for fine-grain scheduling (ASPLOS '10)