Enhancing operating system support for multicore processors by using hardware performance monitoring

Abstract
Multicore processors exhibit hardware characteristics that differ from those of previous-generation single-core systems and traditional SMP (symmetric multiprocessing) multiprocessor systems. These characteristics present new performance opportunities and challenges. In this paper, we show how hardware performance monitors can provide a fine-grained, closely coupled feedback loop for dynamic optimizations performed by a multicore-aware operating system. These optimizations are made possible by the advanced capabilities of the hardware performance monitoring units found in commodity processors, such as execution pipeline stall breakdowns and data address sampling. We present three case studies showing how a multicore-aware operating system can use these online capabilities to (1) determine cache partition sizes, which reduces contention in the shared cache among applications; (2) detect memory regions with poor cache usage, which allows these regions to be isolated to reduce cache pollution; and (3) detect sharing among threads, which allows threads to be clustered to improve locality. Using realistic applications from standard benchmark suites, we achieved the following performance improvements: (1) up to 27% improvement in IPC (instructions per cycle) from cache partition sizing; (2) up to 10% reduction in cache miss rates from reduced cache pollution, yielding up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses from thread clustering, yielding up to 7% application-level improvement.
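To give a flavor of the first case study, the sketch below shows how an operating system might size shared-cache partitions from per-application miss-rate curves (MRCs) derived from hardware performance counters. This is an illustrative greedy marginal-utility allocation under assumed inputs; the function name `partition_sizes`, the unit granularity (e.g. page colors), and the example curves are hypothetical, not the paper's actual mechanism.

```python
# Hypothetical sketch: sizing shared-cache partitions from per-application
# miss-rate curves (MRCs). mrcs[i][a] = misses of application i when it
# holds a+1 cache units (e.g. page colors). Example data is illustrative.

def partition_sizes(mrcs, total_units):
    """Greedily hand out cache units, one at a time, to the application
    whose next unit yields the largest reduction in misses."""
    alloc = [1] * len(mrcs)              # start each application with one unit
    for _ in range(total_units - len(mrcs)):
        # marginal benefit of one more unit for each application
        gains = [mrc[a - 1] - mrc[a] if a < len(mrc) else 0.0
                 for mrc, a in zip(mrcs, alloc)]
        best = max(range(len(mrcs)), key=lambda i: gains[i])
        alloc[best] += 1
    return alloc

# Example: app 0 is cache-sensitive, app 1 is a streaming workload that
# gains almost nothing from extra cache space.
mrc_sensitive = [100, 60, 35, 20, 12, 8, 6, 5]   # misses vs. units held
mrc_streaming = [90, 88, 87, 86, 86, 86, 86, 86]
print(partition_sizes([mrc_sensitive, mrc_streaming], 8))  # → [7, 1]
```

The greedy policy naturally starves the streaming workload of cache space it cannot use, which is the intuition behind partition sizing in case study (1); the paper's own approach builds the MRCs online from PMU data address samples rather than assuming them as inputs.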