Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

Published: 21 March 2007

Abstract

The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.

In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
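To make the scheduling idea concrete, the sketch below (in C, not taken from the paper) clusters threads by the similarity of per-thread sharing signatures and then maps each cluster to a chip. In the paper's scheme such signatures are built online from PMU samples of remote-cache accesses; here they are hard-coded example vectors, and the region count, similarity threshold, and greedy clustering pass are illustrative assumptions rather than the authors' exact algorithm.

/*
 * Illustrative sketch (not the authors' code): cluster threads by the
 * similarity of their data-sharing signatures, then assign each cluster
 * to one chip.  A "signature" here is a small vector of counters, one per
 * hashed memory region, that would in practice be filled from PMU samples
 * of remote-cache (cross-chip) accesses.  All names and thresholds are
 * illustrative assumptions.
 */
#include <math.h>
#include <stdio.h>

#define NTHREADS 6
#define NREGIONS 8            /* hashed memory regions per signature */
#define NCHIPS   2
#define SIM_THRESHOLD 0.5

/* Example signatures: threads 0-2 mostly touch regions 0-3,
 * threads 3-5 mostly touch regions 4-7. */
static const double sig[NTHREADS][NREGIONS] = {
    {9, 7, 8, 6, 0, 0, 1, 0},
    {8, 9, 7, 7, 1, 0, 0, 0},
    {7, 8, 9, 6, 0, 1, 0, 0},
    {0, 1, 0, 0, 9, 8, 7, 9},
    {0, 0, 1, 0, 8, 9, 8, 7},
    {1, 0, 0, 0, 7, 8, 9, 8},
};

/* Cosine similarity between two sharing signatures. */
static double similarity(const double *a, const double *b)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < NREGIONS; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void)
{
    int cluster[NTHREADS];
    int nclusters = 0;

    /* Greedy clustering: join the first existing cluster whose
     * representative (first member) is similar enough, else start
     * a new cluster. */
    for (int t = 0; t < NTHREADS; t++) {
        cluster[t] = -1;
        for (int c = 0; c < nclusters && cluster[t] < 0; c++) {
            int rep = -1;
            for (int u = 0; u < t; u++)
                if (cluster[u] == c) { rep = u; break; }
            if (rep >= 0 && similarity(sig[t], sig[rep]) > SIM_THRESHOLD)
                cluster[t] = c;
        }
        if (cluster[t] < 0)
            cluster[t] = nclusters++;
    }

    /* Map clusters to chips round-robin; a real scheduler would also
     * balance load before migrating threads to their target chips. */
    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d -> cluster %d -> chip %d\n",
               t, cluster[t], cluster[t] % NCHIPS);
    return 0;
}

Compiled with a plain C compiler (linking the math library), the example places threads 0-2 on one chip and threads 3-5 on the other; an online implementation would recompute signatures periodically and re-cluster as sharing patterns change.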
