Abstract
The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.
In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way IBM POWER5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
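The core idea described above — grouping threads whose sampled memory accesses overlap, so that each group can be placed on one chip — can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the region granularity, the cosine-similarity measure, the greedy clustering, and the 0.5 threshold are all assumptions made for the sketch.

```python
# Hypothetical sketch of sharing-aware thread clustering (not the paper's code).
# Each thread has a "sharing signature": counts of sampled accesses to shared
# memory regions, as a PMU might report them. Threads with overlapping
# signatures are clustered and would be scheduled onto the same chip.

from math import sqrt

def similarity(sig_a, sig_b):
    """Cosine similarity between two sparse signatures (dict: region -> count)."""
    dot = sum(sig_a[r] * sig_b[r] for r in sig_a if r in sig_b)
    na = sqrt(sum(v * v for v in sig_a.values()))
    nb = sqrt(sum(v * v for v in sig_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_threads(signatures, threshold=0.5):
    """Greedy clustering: add a thread to the first cluster whose
    representative signature is similar enough, else start a new cluster."""
    clusters = []  # list of lists of thread ids
    for tid, sig in signatures.items():
        for cluster in clusters:
            rep = signatures[cluster[0]]  # cluster representative
            if similarity(sig, rep) >= threshold:
                cluster.append(tid)
                break
        else:
            clusters.append([tid])
    return clusters

# Example: threads 1 and 2 touch the same regions, as do threads 3 and 4.
signatures = {
    1: {0x100: 9, 0x140: 4},
    2: {0x100: 7, 0x140: 6},
    3: {0x900: 8},
    4: {0x900: 5, 0x940: 2},
}
print(cluster_threads(signatures))  # -> [[1, 2], [3, 4]]
```

A real scheduler would then bind each cluster's threads to the cores of one chip, converting cross-chip cache accesses into on-chip ones.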
Index Terms
- Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors